
Python Web Scraping

Learn ethical web scraping using requests and BeautifulSoup: extract titles, links, tables, handle pagination safely, and store data for analysis.

David Miller
August 1, 2025

Web scraping means collecting data from web pages automatically.

Examples:
- product prices
- job listings
- news headlines
- research datasets

## Before you scrape (very important)

1) Check **website terms** and **robots.txt**
2) Add delays to avoid overloading servers
3) Do not scrape private or login-protected data without permission

This keeps your scraping ethical and safe.
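
You can check robots.txt programmatically with `urllib.robotparser` from the standard library. A minimal sketch, assuming a placeholder site (swap in the site you actually plan to scrape):

```python
from urllib.robotparser import RobotFileParser

# Placeholder site used for illustration
robots_url = "https://example.com/robots.txt"

parser = RobotFileParser()
parser.set_url(robots_url)
parser.read()  # downloads and parses robots.txt

# can_fetch(user_agent, url) tells you whether a path is allowed
if parser.can_fetch("*", "https://example.com/page/1"):
    print("Allowed to scrape this page")
else:
    print("Disallowed by robots.txt")
```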
        
## Install libraries

```bash
pip install beautifulsoup4 requests
```
        
## Step 1: Download a page (requests)

```python
import requests

url = "https://example.com"
response = requests.get(url, timeout=10)  # timeout stops the request from hanging forever

print(response.status_code)  # 200 means success
print(response.text[:200])   # first 200 characters of the raw HTML
```
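
Real pages fail: servers time out, URLs return 404. A sketch of more defensive fetching with `raise_for_status()` and an explicit User-Agent (the header value here is just an example):

```python
import requests

url = "https://example.com"
headers = {"User-Agent": "my-scraper/0.1 (contact@example.com)"}  # example value

try:
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
except requests.RequestException as exc:  # base class for all requests errors
    print(f"Request failed: {exc}")
else:
    print(response.status_code)
```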
        
## Step 2: Parse HTML (BeautifulSoup)

```python
from bs4 import BeautifulSoup

html = "<html><title>Hello</title></html>"
soup = BeautifulSoup(html, "html.parser")  # "html.parser" is Python's built-in parser

print(soup.title.text)  # Hello
```
        
## Real scraping example: title + all links

```python
import requests
from bs4 import BeautifulSoup

url = "https://example.com"
response = requests.get(url, timeout=10)
soup = BeautifulSoup(response.text, "html.parser")

print("Title:", soup.title.text)  # assumes the page has a <title> tag

for link in soup.find_all("a"):
    print(link.get("href"))  # get() returns None if the <a> has no href
```
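
Many `href` values are relative (`/about`, `page2.html`). A small variation of the loop above using `urllib.parse.urljoin` from the standard library to turn them into absolute URLs:

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

url = "https://example.com"
soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")

for link in soup.find_all("a"):
    href = link.get("href")
    if href:  # skip anchors without an href
        print(urljoin(url, href))  # resolves relative paths against the page URL
```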
        
## Finding specific elements (id, class, tag)

```python
from bs4 import BeautifulSoup

html = """
<div class="container">
  <h1 id="main-title">Welcome</h1>
  <p class="text">Hello World</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

print(soup.find("h1").text)
print(soup.find(id="main-title").text)
print(soup.find(class_="text").text)
```
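
BeautifulSoup also supports CSS selectors through `select()` and `select_one()`, which many people find more compact. The same lookups on the HTML above:

```python
from bs4 import BeautifulSoup

html = """
<div class="container">
  <h1 id="main-title">Welcome</h1>
  <p class="text">Hello World</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# CSS selector syntax: #id, .class, tag
print(soup.select_one("#main-title").text)            # Welcome
print(soup.select_one("div.container p.text").text)   # Hello World
print([el.text for el in soup.select(".text")])       # all matches as a list
```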
        
## Extracting table data (common use case)

```python
import requests
from bs4 import BeautifulSoup

url = "https://example.com/table"
response = requests.get(url, timeout=10)
soup = BeautifulSoup(response.text, "html.parser")

table = soup.find("table")
if table is None:  # find() returns None when no <table> exists
    raise SystemExit("No <table> found on the page")

rows = table.find_all("tr")

for row in rows:
    cols = [c.get_text(strip=True) for c in row.find_all(["td", "th"])]
    if cols:  # skip rows with no cells
        print(cols)
```
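
To store the rows for analysis, the standard `csv` module is enough. A minimal sketch (the rows are hardcoded here for illustration; in practice you would collect them in the loop above, and the filename is arbitrary):

```python
import csv

# Example data standing in for rows collected during scraping
rows_of_cols = [["Name", "Price"], ["Widget", "9.99"]]

with open("scraped_data.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerows(rows_of_cols)  # one CSV line per scraped row
```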
        
## Pagination (scraping multiple pages)

```python
import requests
import time
from bs4 import BeautifulSoup

base_url = "https://example.com/page/"

for page in range(1, 6):
    url = f"{base_url}{page}"
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")

    items = soup.find_all(class_="item")
    for item in items:
        print(item.get_text(strip=True))

    time.sleep(2)  # respectful delay
```
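
Hardcoding the page count is fragile. A variant that stops when a page returns a non-200 status or contains no items (the `item` class name is carried over from the example above):

```python
import time

import requests
from bs4 import BeautifulSoup

base_url = "https://example.com/page/"
page = 1

while True:
    response = requests.get(f"{base_url}{page}", timeout=10)
    if response.status_code != 200:
        break  # e.g. 404 past the last page

    soup = BeautifulSoup(response.text, "html.parser")
    items = soup.find_all(class_="item")
    if not items:
        break  # an empty page also signals the end

    for item in items:
        print(item.get_text(strip=True))

    page += 1
    time.sleep(2)  # respectful delay
```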
        
## Graph: scraping pipeline

```mermaid
flowchart LR
  A[URL list] --> B["requests.get()"]
  B --> C[HTML response]
  C --> D[BeautifulSoup parse]
  D --> E[Extract data]
  E --> F[Save to CSV/DB]
```
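
Putting the stages together, here is a minimal end-to-end sketch of that pipeline. The URLs, the `item` class, and the filename are placeholders:

```python
import csv
import time

import requests
from bs4 import BeautifulSoup

urls = ["https://example.com/page/1", "https://example.com/page/2"]  # URL list
rows = []

for url in urls:
    response = requests.get(url, timeout=10)            # requests.get()
    if response.status_code != 200:
        continue                                        # skip failed pages
    soup = BeautifulSoup(response.text, "html.parser")  # parse the HTML response
    for item in soup.find_all(class_="item"):           # extract data
        rows.append([url, item.get_text(strip=True)])
    time.sleep(2)                                       # respectful delay

with open("results.csv", "w", newline="", encoding="utf-8") as f:  # save to CSV
    csv.writer(f).writerows([["url", "text"], *rows])
```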
        
## Remember

- Always add timeouts and delays
- Handle errors gracefully
- robots.txt and website rules matter
- For JavaScript-heavy sites, you may need browser automation (Playwright/Selenium); a minimal Playwright sketch follows
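
For completeness, a minimal Playwright sketch, assuming you have run `pip install playwright` and then `playwright install` (which downloads the browser binaries):

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()  # headless Chromium by default
    page = browser.new_page()
    page.goto("https://example.com")
    html = page.content()          # HTML after JavaScript has run
    browser.close()

# the rendered HTML can then go through BeautifulSoup as before
print(html[:200])
```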
        
#Python #Advanced #WebScraping