Web Scraping · 24 min read

Robots.txt in Practice

Learn how to read and use robots.txt files so you know what you are allowed to scrape and can build ethical scrapers.

David Miller
December 21, 2025

Every serious scraper must respect robots.txt.

It tells bots which parts of a site they are allowed to crawl and which are off-limits.

What is robots.txt?

It is a plain-text file that lives at the root of the site: https://site.com/robots.txt
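
A quick aside (a minimal sketch, not part of the original post): because the file always sits at the site root, you can build its URL from any page on the site with the standard library. The page URL below is just a placeholder.

```python
from urllib.parse import urljoin

page = "https://site.com/some/deep/page"   # hypothetical page URL
robots_url = urljoin(page, "/robots.txt")  # resolve against the site root
print(robots_url)                          # -> https://site.com/robots.txt
```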

Example:

```
User-agent: *
Disallow: /admin
Allow: /products
```

This means:
- all bots are allowed on /products
- no bots are allowed on /admin
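
To see those rules in action, here is a small sketch (not from the original post) that feeds the example above into Python's built-in parser and checks both paths; the site.com URLs are placeholders.

```python
import urllib.robotparser as robotparser

# The example rules from above, exactly as they appear in robots.txt
rules = [
    "User-agent: *",
    "Disallow: /admin",
    "Allow: /products",
]

rp = robotparser.RobotFileParser()
rp.parse(rules)  # parse the rules directly instead of downloading them

print(rp.can_fetch("*", "https://site.com/products"))  # True  (allowed)
print(rp.can_fetch("*", "https://site.com/admin"))     # False (disallowed)
```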

Why you should follow it

- legal safety
- professional practice
- avoid being banned
- show respect

Check robots.txt using Python

```python
import urllib.robotparser as robotparser

# Point the parser at the site's robots.txt and download it
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# Ask whether any bot ("*") may fetch this URL
print(rp.can_fetch("*", "https://example.com/products"))
```
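
Beyond allow/disallow, the same parser also reports Crawl-delay and Request-rate directives if the site declares them. A short sketch, assuming the same example.com robots.txt as above; the user-agent string is made up.

```python
import urllib.robotparser as robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

user_agent = "MyScraperBot"  # hypothetical user-agent string for your bot

print(rp.can_fetch(user_agent, "https://example.com/products"))
print(rp.crawl_delay(user_agent))   # seconds to wait between requests, or None if not declared
print(rp.request_rate(user_agent))  # RequestRate(requests, seconds), or None
```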

Graph: robots decision

```mermaid
flowchart TD
    A[URL to scrape] --> B[Check robots.txt]
    B -->|Allowed| C[Scrape]
    B -->|Disallowed| D[Stop]
```
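
In code, that decision comes down to one guard before each request. Below is a minimal, self-contained sketch using only the standard library; fetch_if_allowed and the example.com URL are illustrative, not defined anywhere in this post.

```python
import urllib.request
import urllib.robotparser as robotparser
from urllib.parse import urljoin, urlparse

def fetch_if_allowed(url, user_agent="*"):
    """Check robots.txt first; download the page only if the URL is allowed."""
    root = "{0.scheme}://{0.netloc}".format(urlparse(url))
    rp = robotparser.RobotFileParser()
    rp.set_url(urljoin(root, "/robots.txt"))
    rp.read()

    if not rp.can_fetch(user_agent, url):
        print(f"Disallowed by robots.txt, skipping: {url}")
        return None

    with urllib.request.urlopen(url) as response:
        return response.read()

html = fetch_if_allowed("https://example.com/products")
```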

Remember

- Always check robots.txt first
- Never scrape disallowed paths
- Ethics matter in scraping

#Python #Advanced #Ethics