
Data Cleaning and Validation

Learn how to clean messy scraped data, remove noise, validate fields, and prepare it for analysis or storage.

David Miller
December 21, 2025

Scraped data is almost never clean.

You will see:

- extra spaces
- missing values
- broken text
- wrong formats

Cleaning is mandatory.

---

Example raw data

```python
raw = {
    "name": " Book A \n",
    "price": " $10.00 ",
    "rating": "N/A",
}
```

---

Clean text

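```python
# Strip leading/trailing whitespace, including the trailing newline.
name = raw["name"].strip()
# Drop the currency symbol, then trim the remaining spaces.
price = raw["price"].replace("$", "").strip()
rating = raw["rating"]

# Treat the "N/A" placeholder as a missing value.
if rating == "N/A":
    rating = None
```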
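
The same steps can be wrapped in a reusable helper. The following is a minimal sketch, not part of the original example: the name `clean_text` and the `MISSING_MARKERS` set are assumptions for illustration, and the list of placeholders should be adjusted to what the target site actually emits.

```python
import re
import unicodedata

# Hypothetical set of placeholder strings treated as "missing"; adjust per site.
MISSING_MARKERS = {"", "n/a", "na", "none", "null", "-"}

def clean_text(value):
    """Return normalized text, or None when the value is a missing-value marker."""
    if value is None:
        return None
    # NFKC folds look-alike characters, e.g. a non-breaking space becomes a plain space.
    text = unicodedata.normalize("NFKC", str(value))
    # Collapse any run of whitespace (spaces, tabs, newlines) into one space.
    text = re.sub(r"\s+", " ", text).strip()
    if text.lower() in MISSING_MARKERS:
        return None
    return text
```

Calling `clean_text(raw["name"])` on the raw record above would give `"Book A"`, and `clean_text(raw["rating"])` would give `None`.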

---

Convert types safely

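```python
def to_float(val):
    """Return val as a float, or None if it cannot be converted."""
    try:
        return float(val)
    except (TypeError, ValueError):
        # A bare `except:` would also swallow KeyboardInterrupt and SystemExit.
        return None

price = to_float(price)
```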
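
Prices are a case where `float` may not be the best target type, since binary floats cannot represent most decimal fractions exactly. As a hedged alternative, here is a sketch using the standard library's `Decimal`. The function name `parse_price` is hypothetical, and the code assumes the input is a string (or `None`) that uses `,` as a thousands separator and `.` as the decimal point.

```python
from decimal import Decimal, InvalidOperation

def parse_price(text):
    """Parse strings such as ' $1,299.50 ' into a Decimal; return None on failure."""
    if text is None:
        return None
    # Assumption: "," is a thousands separator and "." is the decimal point.
    cleaned = text.strip().lstrip("$€£").replace(",", "").strip()
    try:
        value = Decimal(cleaned)
    except InvalidOperation:
        return None
    # Decimal accepts "NaN" and "Infinity"; neither is a usable price.
    return value if value.is_finite() else None
```

Sites that format prices as `1.299,50` would need the two separators swapped before conversion.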

---

Validate fields

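```python
# A record without a name is rejected outright.
if not name:
    raise ValueError("Name missing")

# A bad price is only reported; the record is kept.
if price is None:
    print("Price invalid")
```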
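
When a record has several fields, it can help to collect every problem instead of stopping at the first one. The sketch below shows one way to do that. `validate_record` is a hypothetical helper, and the non-negative price rule and the 0 to 5 rating scale are assumptions for illustration rather than rules from the example above.

```python
def validate_record(record):
    """Return a list of problems found in a cleaned record (empty list means valid)."""
    errors = []
    if not record.get("name"):
        errors.append("name is missing")
    price = record.get("price")
    if price is None:
        errors.append("price is missing or not numeric")
    elif price < 0:
        # Assumption: a negative price is always a scraping error.
        errors.append("price is negative")
    rating = record.get("rating")
    # Assumption: ratings, when present, are on a 0-5 scale.
    if rating is not None and not 0 <= rating <= 5:
        errors.append("rating is outside 0-5")
    return errors
```

The caller can then decide whether to drop the record, log the errors, or store it with a flag.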

---

Example clean record

```python
clean = {
    "name": name,
    "price": price,
    "rating": rating,
}
print(clean)
```
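
In practice a scraper returns many records, so the steps are usually applied in a loop. The following self-contained sketch shows one possible arrangement. The names `clean_record` and `clean_all` are hypothetical, and the rule that a record needs both a name and a numeric price to be kept is an assumption.

```python
def to_float(val):
    """Return val as a float, or None if it cannot be converted."""
    try:
        return float(val)
    except (TypeError, ValueError):
        return None

def clean_record(raw):
    """Strip text fields and convert numeric fields for one raw record."""
    name = (raw.get("name") or "").strip()
    price = to_float((raw.get("price") or "").replace("$", "").strip())
    rating = to_float(raw.get("rating"))  # "N/A" and None both become None
    return {"name": name, "price": price, "rating": rating}

def clean_all(raw_records):
    """Split raw records into cleaned records and rejected originals."""
    kept, rejected = [], []
    for raw in raw_records:
        record = clean_record(raw)
        # Assumption: a usable record needs a name and a numeric price.
        if record["name"] and record["price"] is not None:
            kept.append(record)
        else:
            rejected.append(raw)
    return kept, rejected
```

Keeping the rejected originals makes it possible to inspect them later and see which page layouts the scraper handles badly.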

---

Graph: cleaning flow

```mermaid
flowchart LR
    A[Raw HTML Data] --> B[Extract]
    B --> C[Clean]
    C --> D[Validate]
    D --> E[Store]
```

Remember

- Always strip and normalize text
- Convert strings to numbers
- Handle missing values
- Bad data breaks analysis later
