Web Scraping · 24 min read

Robots.txt in Practice

Learn how to read and use robots.txt files so you know what you are allowed to scrape and can build ethical scrapers.

David Miller
December 21, 2025

Every serious scraper must respect robots.txt.

It tells bots which parts of a site they are allowed to crawl and which are off-limits.

What is robots.txt?

It is a plain-text file that lives at the root of the site: https://site.com/robots.txt
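
A quick aside (a minimal sketch, not part of the original post): because the file always sits at the site root, you can build its URL from any page on the site with the standard library. The page URL below is just a placeholder.

```python
from urllib.parse import urljoin

page = "https://site.com/some/deep/page"   # hypothetical page URL
robots_url = urljoin(page, "/robots.txt")  # resolve against the site root
print(robots_url)                          # -> https://site.com/robots.txt
```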

Example:

```
User-agent: *
Disallow: /admin
Allow: /products
```

This means:
- all bots are allowed on /products
- no bots are allowed on /admin
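
To see those rules in action, here is a small sketch (not from the original post) that feeds the example above into Python's built-in parser and checks both paths; the site.com URLs are placeholders.

```python
import urllib.robotparser as robotparser

# The example rules from above, exactly as they appear in robots.txt
rules = [
    "User-agent: *",
    "Disallow: /admin",
    "Allow: /products",
]

rp = robotparser.RobotFileParser()
rp.parse(rules)  # parse the rules directly instead of downloading them

print(rp.can_fetch("*", "https://site.com/products"))  # True  (allowed)
print(rp.can_fetch("*", "https://site.com/admin"))     # False (disallowed)
```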

Why you should follow it

- legal safety
- professional practice
- avoid being banned
- show respect

Check robots.txt using Python

```python
import urllib.robotparser as robotparser

# Point the parser at the site's robots.txt and download it
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# Ask whether any bot ("*") may fetch this URL
print(rp.can_fetch("*", "https://example.com/products"))
```
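
Beyond allow/disallow, the same parser also reports Crawl-delay and Request-rate directives if the site declares them. A short sketch, assuming the same example.com robots.txt as above; the user-agent string is made up.

```python
import urllib.robotparser as robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

user_agent = "MyScraperBot"  # hypothetical user-agent string for your bot

print(rp.can_fetch(user_agent, "https://example.com/products"))
print(rp.crawl_delay(user_agent))   # seconds to wait between requests, or None if not declared
print(rp.request_rate(user_agent))  # RequestRate(requests, seconds), or None
```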

Graph: robots decision

```mermaid
flowchart TD
    A[URL to scrape] --> B[Check robots.txt]
    B -->|Allowed| C[Scrape]
    B -->|Disallowed| D[Stop]
```
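
In code, that decision comes down to one guard before each request. Below is a minimal, self-contained sketch using only the standard library; fetch_if_allowed and the example.com URL are illustrative, not defined anywhere in this post.

```python
import urllib.request
import urllib.robotparser as robotparser
from urllib.parse import urljoin, urlparse

def fetch_if_allowed(url, user_agent="*"):
    """Check robots.txt first; download the page only if the URL is allowed."""
    root = "{0.scheme}://{0.netloc}".format(urlparse(url))
    rp = robotparser.RobotFileParser()
    rp.set_url(urljoin(root, "/robots.txt"))
    rp.read()

    if not rp.can_fetch(user_agent, url):
        print(f"Disallowed by robots.txt, skipping: {url}")
        return None

    with urllib.request.urlopen(url) as response:
        return response.read()

html = fetch_if_allowed("https://example.com/products")
```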

Remember

- Always check robots.txt first
- Never scrape disallowed paths
- Ethics matter in scraping

#Python #Advanced #Ethics