Web Scraping · 24 min read

Robots.txt in Practice

Learn how to read and use robots.txt files so you know what you are allowed to scrape and can build ethical scrapers.

David Miller
December 17, 2025

Every serious scraper must respect robots.txt.

It tells bots which parts of a site they are allowed to crawl and which are off limits.

What is robots.txt?

It lives at the root of the domain:
https://site.com/robots.txt

Example:
User-agent: *
Disallow: /admin
Allow: /products

This means:

  • all bots may crawl /products
  • no bots may crawl /admin
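You can confirm this with Python's built-in parser by feeding it the example rules directly. A minimal sketch; the paths are the ones from the example above.

import urllib.robotparser as robotparser

# Parse the example rules without fetching anything over the network
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /admin",
    "Allow: /products",
])

print(rp.can_fetch("*", "https://site.com/products"))  # True
print(rp.can_fetch("*", "https://site.com/admin"))     # False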

Why you should follow it

  • legal safety
  • professional practice
  • fewer blocks and bans
  • respect for the site owner's wishes

Check robots.txt using Python

import urllib.robotparser as robotparser

# Point the parser at the site's robots.txt and download it
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# True if the wildcard user agent ("*") may fetch this URL
print(rp.can_fetch("*", "https://example.com/products"))
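
If your bot identifies itself with its own user-agent string, the same parser can check permissions for that agent and report any Crawl-delay or Request-rate directives the site declares. The user-agent name below is just a placeholder.

import urllib.robotparser as robotparser

USER_AGENT = "my-scraper"  # placeholder; use your bot's real identifier

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# Permission for this specific bot (falls back to the "*" rules
# when no matching section exists)
print(rp.can_fetch(USER_AGENT, "https://example.com/products"))

# Politeness hints, if declared (both return None when absent)
print(rp.crawl_delay(USER_AGENT))
print(rp.request_rate(USER_AGENT))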

Flowchart: the robots.txt decision

flowchart TD
  A[URL to scrape] --> B[Check robots.txt]
  B -->|Allowed| C[Scrape]
  B -->|Disallowed| D[Stop]
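
The flowchart translates almost directly into a small guard function. A minimal sketch, assuming the requests library is installed and using a placeholder user-agent name.

import urllib.robotparser as robotparser
from urllib.parse import urlparse

import requests  # assumed dependency: pip install requests

def polite_get(url, user_agent="my-scraper"):
    # Locate and read robots.txt for the target site
    parts = urlparse(url)
    rp = robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()

    # Allowed -> scrape; disallowed -> stop
    if rp.can_fetch(user_agent, url):
        return requests.get(url, headers={"User-Agent": user_agent})
    return None

response = polite_get("https://example.com/products")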

Remember

  • Always check robots.txt first
  • Never scrape disallowed paths
  • Ethics matter in scraping
#Python #Advanced #Ethics