Robots.txt in Practice
Learn how to read and use robots.txt files so you know what you are allowed to scrape and can build ethical scrapers.
David Miller
December 21, 2025
Every serious scraper must respect robots.txt.
It tells bots which parts of a site they are allowed to crawl and which they must stay out of.
## What is robots.txt?

It lives at the root of the site: https://site.com/robots.txt
Example:

```
User-agent: *
Disallow: /admin
Allow: /products
```
This means:

- all bots are allowed on /products
- no bots are allowed on /admin
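You can watch these rules being evaluated with Python's built-in parser, which the next section covers in more detail. This is a minimal sketch: the rules are the ones from the example above, and site.com is just a placeholder domain.

```python
import urllib.robotparser

# The example rules from above, loaded into the parser in memory
# instead of being fetched from a live site.
rules = [
    "User-agent: *",
    "Disallow: /admin",
    "Allow: /products",
]

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("*", "https://site.com/products"))  # True: /products is allowed
print(rp.can_fetch("*", "https://site.com/admin"))     # False: /admin is disallowed
```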
## Why you should follow it

- Legal safety: staying within the site's published rules reduces your exposure.
- Professional practice: respecting robots.txt is the accepted norm.
- Avoiding bans: sites are quick to block bots that ignore their rules.
- Respect: it shows consideration for the site owner's wishes.
## Check robots.txt using Python

```python
import urllib.robotparser as robotparser

# Point the parser at the site's robots.txt and download it.
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# True if the rules allow any bot ("*") to fetch this URL.
print(rp.can_fetch("*", "https://example.com/products"))
```
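A small extension of the snippet above: the first argument to can_fetch is the user-agent the rules are matched against. If your scraper sends its own User-Agent header, pass that name instead of "*", since a robots.txt file can define different rules for different bots. The bot name below is hypothetical, and rp is the parser set up in the previous block.

```python
# "MyScraperBot" is a made-up name; use the User-Agent your scraper actually sends.
print(rp.can_fetch("MyScraperBot", "https://example.com/products"))
```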
## Graph: robots decision

```mermaid
flowchart TD
    A[URL to scrape] --> B[Check robots.txt]
    B -->|Allowed| C[Scrape]
    B -->|Disallowed| D[Stop]
```
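The same decision flow can be written as a small helper: locate the site's robots.txt, check the URL against it, fetch only if allowed, and stop otherwise. This is a minimal standard-library sketch; the function name fetch_if_allowed and the example.com URL are illustrative, not part of any particular library.

```python
import urllib.request
import urllib.robotparser
from urllib.parse import urlsplit, urlunsplit


def fetch_if_allowed(url, user_agent="*"):
    """Fetch a URL only if the site's robots.txt allows it; return None otherwise."""
    # robots.txt lives at the site root, so build its URL from the page URL.
    parts = urlsplit(url)
    robots_url = urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()

    if not rp.can_fetch(user_agent, url):
        print(f"Disallowed by robots.txt: {url}")
        return None  # Stop: this path is off limits for our user-agent.

    # Allowed: go ahead and scrape.
    with urllib.request.urlopen(url) as response:
        return response.read()


# Illustrative call; swap in the page you actually want to scrape.
html = fetch_if_allowed("https://example.com/products")
```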
## Remember

- Always check robots.txt first.
- Never scrape disallowed paths.
- Ethics matter in scraping.
#Python #Advanced #Ethics