Robots.txt in Practice
Learn how to read and apply robots.txt files so you know what you are allowed to scrape and can build ethical scrapers.
David Miller
December 17, 2025
Every serious scraper must respect robots.txt.
It tells bots which parts of a site they may crawl and which are off-limits.
What is robots.txt
It is a plain text file that lives at the root of a site:
https://site.com/robots.txt
Example:
User-agent: *
Disallow: /admin
Allow: /products
This means:
- all bots (User-agent: *) may crawl /products
- no bot should crawl /admin or any path under it
Anything not matched by a Disallow rule is allowed by default, so the Allow line here mainly makes the intent explicit.
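If you want to read a site's rules yourself before writing any scraper logic, you can simply download the raw file and print it. A minimal sketch, assuming the target site (shown here as example.com) serves a robots.txt file and using only the standard library:

import urllib.request

# Download the raw robots.txt and print it for manual inspection
with urllib.request.urlopen("https://example.com/robots.txt") as response:
    print(response.read().decode("utf-8"))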
Why you should follow it
- it reduces legal risk
- it is standard professional practice
- it lowers the chance of being blocked or banned
- it shows respect for the site owner's wishes
Check robots.txt using Python
import urllib.robotparser as robotparser

# Load and parse the site's robots.txt
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# True if the rules allow any user agent ("*") to fetch this URL
print(rp.can_fetch("*", "https://example.com/products"))
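In a real scraper it is better to check the rules for your own user-agent string and to honour any crawl delay the site declares. A short sketch of that, assuming a hypothetical agent name MyScraperBot and the same example.com file:

import time
import urllib.robotparser as robotparser

AGENT = "MyScraperBot"  # hypothetical user-agent string for this example

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

if rp.can_fetch(AGENT, "https://example.com/products"):
    # Respect a Crawl-delay directive if the site sets one
    delay = rp.crawl_delay(AGENT)
    if delay:
        time.sleep(delay)
    # ... fetch and parse the page here ...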
Graph: robots decision
flowchart TD
A[URL to scrape] --> B[Check robots.txt]
B -->|Allowed| C[Scrape]
B -->|Disallowed| D[Stop]
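The same flow translates directly into a small guard function: check robots.txt first, scrape only if the URL is allowed, and stop otherwise. A minimal sketch, assuming the standard library and the generic "*" user agent against example.com:

import urllib.request
import urllib.robotparser as robotparser

def fetch_if_allowed(url, robots_url, agent="*"):
    # Return the page body if robots.txt allows it, otherwise None
    rp = robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()
    if not rp.can_fetch(agent, url):
        return None  # Disallowed: stop
    with urllib.request.urlopen(url) as response:  # Allowed: scrape
        return response.read()

html = fetch_if_allowed("https://example.com/products",
                        "https://example.com/robots.txt")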
Remember
- Always check robots.txt first
- Never scrape disallowed paths
- Ethics matter in scraping
#Python #Advanced #Ethics