Python Web Scraping
Learn ethical web scraping using requests and BeautifulSoup: extract titles, links, tables, handle pagination safely, and store data for analysis.
David Miller
August 1, 2025
Web scraping is the automated collection of data from web pages. Common examples include:
- product prices
- job listings
- news headlines
- research datasets
## Before you scrape (very important)
1) Check the **website's terms** and its **robots.txt** file (a programmatic check is sketched below)
2) Add delays to avoid overloading servers
3) Do not scrape private or login-protected data without permission
This keeps your scraping ethical and safe.
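Python's standard library can handle the robots.txt check for you. Here is a minimal sketch using `urllib.robotparser`; the URLs are illustrative placeholders:
```python
import urllib.robotparser

# Minimal robots.txt check; the URLs below are placeholders.
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # download and parse robots.txt

# can_fetch(user_agent, url) -> True if that agent may crawl that URL
print(rp.can_fetch("*", "https://example.com/some-page"))
```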
## Install libraries
```bash
pip install beautifulsoup4 requests
```
## Step 1: Download a page (requests)
```python
import requests
url = "https://example.com"
response = requests.get(url, timeout=10)
print(response.status_code)
print(response.text[:200])
```
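Real-world requests fail: timeouts, DNS errors, 404s. A minimal sketch of the same download with graceful error handling (the URL is a placeholder):
```python
import requests

url = "https://example.com"
try:
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # raise HTTPError on 4xx/5xx responses
except requests.exceptions.RequestException as e:
    print("Request failed:", e)
else:
    print(response.status_code)
    print(response.text[:200])
```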
## Step 2: Parse HTML (BeautifulSoup)
```python
from bs4 import BeautifulSoup
html = "<html><title>Hello</title></html>"
soup = BeautifulSoup(html, "html.parser")
print(soup.title.text)
```
## Real scraping example: title + all links
```python
import requests
from bs4 import BeautifulSoup
url = "https://example.com"
response = requests.get(url, timeout=10)
soup = BeautifulSoup(response.text, "html.parser")
print("Title:", soup.title.text)
for link in soup.find_all("a"):
    print(link.get("href"))
```
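One thing to note: `href` values are often relative (for example `/about`). If you need absolute URLs, the standard library's `urljoin` resolves them against the page URL. A small sketch:
```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

url = "https://example.com"
soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")

# href=True skips <a> tags that have no href attribute
for link in soup.find_all("a", href=True):
    print(urljoin(url, link["href"]))  # relative links become absolute
```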
## Finding specific elements (id, class, tag)
```python
from bs4 import BeautifulSoup
html = """
<div class="container">
<h1 id="main-title">Welcome</h1>
<p class="text">Hello World</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
print(soup.find("h1").text)
print(soup.find(id="main-title").text)
print(soup.find(class_="text").text)
```
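BeautifulSoup also understands CSS selectors through `select()` and `select_one()`, which can be more concise than chained `find()` calls. The same lookups with selectors:
```python
from bs4 import BeautifulSoup

html = """
<div class="container">
    <h1 id="main-title">Welcome</h1>
    <p class="text">Hello World</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

print(soup.select_one("#main-title").text)           # by id
print(soup.select_one("div.container p.text").text)  # nested class selector
print([p.text for p in soup.select("p")])            # select() returns a list
```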
## Extracting table data (common use case)
```python
import requests
from bs4 import BeautifulSoup
url = "https://example.com/table"
response = requests.get(url, timeout=10)
soup = BeautifulSoup(response.text, "html.parser")
table = soup.find("table")
rows = table.find_all("tr")
for row in rows:
    cols = [c.get_text(strip=True) for c in row.find_all(["td", "th"])]
    if cols:
        print(cols)
```
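Printing rows is fine for a quick check; to keep them, write a CSV with the standard library. A minimal sketch (the URL and output filename are placeholders):
```python
import csv

import requests
from bs4 import BeautifulSoup

url = "https://example.com/table"
response = requests.get(url, timeout=10)
soup = BeautifulSoup(response.text, "html.parser")

table = soup.find("table")
if table is None:
    raise SystemExit("No <table> found on the page")

with open("table.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    for row in table.find_all("tr"):
        cols = [c.get_text(strip=True) for c in row.find_all(["td", "th"])]
        if cols:
            writer.writerow(cols)
```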
## Pagination (scraping multiple pages)
```python
import requests
import time
from bs4 import BeautifulSoup
base_url = "https://example.com/page/"
for page in range(1, 6):
    url = f"{base_url}{page}"
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")
    items = soup.find_all(class_="item")
    for item in items:
        print(item.get_text(strip=True))
    time.sleep(2)  # respectful delay
```
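Two refinements worth making: identify your bot with an honest User-Agent header, and stop when pages run out instead of hard-coding the count. A sketch (the header string and the `item` class are assumptions):
```python
import time

import requests
from bs4 import BeautifulSoup

base_url = "https://example.com/page/"
headers = {"User-Agent": "my-scraper/1.0 (contact@example.com)"}  # illustrative

page = 1
while True:
    response = requests.get(f"{base_url}{page}", headers=headers, timeout=10)
    if response.status_code == 404:  # past the last page
        break
    soup = BeautifulSoup(response.text, "html.parser")
    items = soup.find_all(class_="item")
    if not items:  # an empty page also means we are done
        break
    for item in items:
        print(item.get_text(strip=True))
    page += 1
    time.sleep(2)  # respectful delay
```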
## Graph: scraping pipeline
```mermaid
flowchart LR
    A[URL list] --> B["requests.get()"]
B --> C[HTML response]
C --> D[BeautifulSoup parse]
D --> E[Extract data]
E --> F[Save to CSV/DB]
```
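The same pipeline as one compact sketch; the URL list, the `item` class, and the output filename are all placeholders:
```python
import csv
import time

import requests
from bs4 import BeautifulSoup

urls = ["https://example.com/page/1", "https://example.com/page/2"]  # URL list

rows = []
for url in urls:
    response = requests.get(url, timeout=10)            # requests.get()
    soup = BeautifulSoup(response.text, "html.parser")  # parse
    for item in soup.find_all(class_="item"):           # extract data
        rows.append([url, item.get_text(strip=True)])
    time.sleep(2)  # respectful delay

with open("data.csv", "w", newline="", encoding="utf-8") as f:  # save to CSV
    csv.writer(f).writerows([["url", "text"], *rows])
```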
## Remember
- Always add timeouts and delays
- Handle errors gracefully
- robots.txt and website rules matter
- For JavaScript-heavy sites, you may need browser automation (Playwright/Selenium); see the sketch below
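For completeness, a minimal Playwright sketch that renders a JavaScript-heavy page and hands the final HTML to BeautifulSoup (assumes `pip install playwright` followed by `playwright install`):
```python
from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")
    html = page.content()  # HTML after JavaScript has run
    browser.close()

soup = BeautifulSoup(html, "html.parser")
print(soup.title.text)
```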
#Python #Advanced #WebScraping