Performance and Scaling
Learn how to speed up scraping safely using concurrency, batching, and efficient storage without getting blocked.
As the number of target URLs grows, a scraper that fetches and saves one item at a time becomes the bottleneck.

Goals:

- faster scraping
- no bans
- stable memory usage
---
Key ideas

- reuse connections
- limit concurrency
- batch DB writes
- respect delays
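Connection reuse comes almost for free with `requests.Session`, which pools TCP connections and keeps them alive across requests to the same host. A minimal sketch; the URLs and user agent are placeholders:

```python
import requests

# a Session reuses TCP connections (HTTP keep-alive),
# avoiding a fresh handshake for every request to the same host
session = requests.Session()
session.headers.update({"User-Agent": "my-scraper/1.0"})  # placeholder UA

for path in ("/1", "/2", "/3"):
    resp = session.get(f"https://example.com{path}", timeout=10)
    resp.raise_for_status()
    print(len(resp.text))
```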
---
Example: concurrent requests with threads
```python
from concurrent.futures import ThreadPoolExecutor

import requests

def fetch(url):
    # always set a timeout so one slow host cannot stall a worker forever
    return requests.get(url, timeout=10).text

urls = ["https://example.com/1", "https://example.com/2"]

# max_workers caps concurrency: at most 5 requests in flight
with ThreadPoolExecutor(max_workers=5) as ex:
    pages = list(ex.map(fetch, urls))

print(len(pages))
```
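To combine threads with connection reuse, give each worker its own `requests.Session` via `threading.local()`; `requests` does not guarantee a `Session` is safe to share across threads, so one per thread is the conservative pattern. A sketch with placeholder URLs:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

import requests

# one Session per worker thread: connection reuse without
# sharing a Session object across threads
local = threading.local()

def get_session():
    if not hasattr(local, "session"):
        local.session = requests.Session()
    return local.session

def fetch(url):
    return get_session().get(url, timeout=10).text

urls = [f"https://example.com/{i}" for i in range(1, 11)]  # placeholders

with ThreadPoolExecutor(max_workers=5) as ex:
    pages = list(ex.map(fetch, urls))

print(len(pages))
```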
---
Batch DB inserts

Instead of committing every row:

- collect 50–100 items
- insert them in one transaction

One commit per batch is much faster than one commit per row, because each commit forces its own disk sync. A sketch follows below.
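A minimal sketch with the standard-library `sqlite3`; the database file, table, columns, and batch size of 100 are illustrative assumptions:

```python
import sqlite3

conn = sqlite3.connect("scrape.db")  # hypothetical DB file
conn.execute("CREATE TABLE IF NOT EXISTS items (url TEXT, title TEXT)")

batch = []

def save(row, batch_size=100):
    batch.append(row)
    if len(batch) >= batch_size:
        flush()

def flush():
    if not batch:
        return
    # one transaction for the whole batch: a single disk sync
    # instead of one per row
    with conn:
        conn.executemany("INSERT INTO items (url, title) VALUES (?, ?)", batch)
    batch.clear()

for i in range(250):
    save((f"https://example.com/{i}", f"title {i}"))
flush()  # write any remaining partial batch
```

Remember the final `flush()`: without it, the last partial batch never reaches the database.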
---
Graph: scaling idea

```mermaid
flowchart LR
    A[URLs] --> B[Thread Pool]
    B --> C[Parsers]
    C --> D[Batch Insert]
    D --> E[Database]
```
---
Important warning

More speed ≠ better scraping.

Always:

- keep delays
- rotate headers
- respect robots.txt
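A sketch of the politeness side using only the standard library plus `requests`; the delay range, user agent, and URLs are assumptions to tune per site:

```python
import random
import time
import urllib.robotparser

import requests

USER_AGENT = "my-scraper/1.0"  # placeholder: identify your bot honestly

# fetch and parse robots.txt once per host before crawling
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

def polite_get(session, url):
    if not rp.can_fetch(USER_AGENT, url):
        return None  # disallowed by robots.txt: skip this URL
    # jittered delay so requests don't arrive in a fixed rhythm
    time.sleep(random.uniform(1.0, 3.0))
    return session.get(url, headers={"User-Agent": USER_AGENT}, timeout=10)
```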
---