Data Cleaning and Validation
Learn how to clean messy scraped data, remove noise, validate fields, and prepare it for analysis or storage.
Scraped data is almost never clean.
You will see:

- extra spaces
- missing values
- broken text
- wrong formats
Cleaning is mandatory.
---
Example raw data
```python
raw = {
    "name": " Book A \n",
    "price": " $10.00 ",
    "rating": "N/A"
}
```
---
Clean text
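```python
# Strip surrounding whitespace and the currency symbol.
name = raw["name"].strip()
price = raw["price"].replace("$", "").strip()
rating = raw["rating"]

# Treat the "N/A" placeholder as a missing value.
if rating == "N/A":
    rating = None
```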
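Scraped text often has more than leading and trailing whitespace: newlines and repeated spaces can sit inside a value, and sites use several different placeholders for missing data. A minimal sketch of a reusable helper follows. The name `clean_text` and the `MISSING_MARKERS` set are assumptions for illustration, not something the example above defines.

```python
# Hypothetical helper; the placeholder set is an assumption and should be
# adjusted to whatever markers the target site actually uses.
MISSING_MARKERS = {"", "n/a", "na", "none", "-"}


def clean_text(value):
    """Collapse whitespace and map placeholder strings to None."""
    if value is None:
        return None
    # str.split() with no arguments splits on any run of whitespace,
    # so joining with one space removes newlines, tabs and double spaces.
    text = " ".join(str(value).split())
    if text.lower() in MISSING_MARKERS:
        return None
    return text


print(clean_text(" Book A \n"))   # Book A
print(clean_text("N/A"))          # None
```

Keeping the placeholder list in one place means a new marker only has to be added once.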
---
Convert types safely
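```python
def to_float(val):
    # Return None instead of raising when the value cannot be parsed.
    try:
        return float(val)
    except (TypeError, ValueError):
        # Catch only conversion errors; a bare except would also hide
        # unrelated problems such as KeyboardInterrupt.
        return None

price = to_float(price)
```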
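Prices often arrive with currency symbols and thousands separators that `float` rejects. As a sketch, assuming prices use a dot as the decimal separator and a comma as the thousands separator (an assumption that does not hold for every locale), a hypothetical `parse_price` helper could look like this:

```python
import re


def parse_price(text):
    """Return the price as a float, or None if no number can be parsed."""
    if text is None:
        return None
    # Drop thousands separators (assumed to be commas), then keep the
    # first run of digits with an optional decimal part.
    match = re.search(r"\d+(?:\.\d+)?", str(text).replace(",", ""))
    if match is None:
        return None
    return float(match.group())


print(parse_price(" $10.00 "))    # 10.0
print(parse_price("$1,299.50"))   # 1299.5
print(parse_price("N/A"))         # None
```

If the prices will be summed or compared exactly, `decimal.Decimal` may be a better target type than `float`.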
---
Validate fields
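```python
# A record without a name is unusable, so fail loudly.
if not name:
    raise ValueError("Name missing")

# A missing price is tolerated here, but reported.
if price is None:
    print("Price invalid")
```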
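Raising on the first problem hides any others in the same record. One alternative, sketched here with a hypothetical `validate_record` function and assumed rules (a non-empty name, a non-negative price), is to collect every problem and let the caller decide what to do:

```python
def validate_record(record):
    """Return a list of problems; an empty list means the record is valid."""
    errors = []
    if not record.get("name"):
        errors.append("name missing")
    price = record.get("price")
    if price is None:
        errors.append("price invalid")
    elif price < 0:
        # Assumed rule for illustration: prices cannot be negative.
        errors.append("price negative")
    return errors


print(validate_record({"name": "Book A", "price": 10.0, "rating": None}))  # []
print(validate_record({"name": "", "price": None}))
# ['name missing', 'price invalid']
```

The caller can then skip, log, or repair a record depending on which errors came back.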
---
Example clean record
```python
clean = {
    "name": name,
    "price": price,
    "rating": rating
}
print(clean)
```
---
Graph: cleaning flow
```mermaid
flowchart LR
    A[Raw HTML Data] --> B[Extract]
    B --> C[Clean]
    C --> D[Validate]
    D --> E[Store]
```
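The Clean, Validate, and Store stages of this flow can be sketched as one small pipeline. This is an illustration under assumptions, not a definitive implementation: `clean_record` and `run_pipeline` are hypothetical names, every raw value is assumed to be a string, and a plain list stands in for real storage.

```python
def clean_record(raw):
    """Clean one raw record; all raw values are assumed to be strings."""
    name = raw.get("name", "").strip()
    price_text = raw.get("price", "").replace("$", "").strip()
    try:
        price = float(price_text)
    except ValueError:
        price = None
    rating = raw.get("rating")
    if rating == "N/A":
        rating = None
    return {"name": name, "price": price, "rating": rating}


def run_pipeline(raw_records):
    """Clean each record, keep the valid ones, and collect the rejects."""
    stored, rejected = [], []
    for raw in raw_records:
        record = clean_record(raw)
        if record["name"] and record["price"] is not None:
            stored.append(record)      # stand-in for a database write
        else:
            rejected.append(raw)       # keep the raw input for debugging
    return stored, rejected


rows = [
    {"name": " Book A \n", "price": " $10.00 ", "rating": "N/A"},
    {"name": "Book B", "price": "free", "rating": "4"},
]
stored, rejected = run_pipeline(rows)
print(stored)         # [{'name': 'Book A', 'price': 10.0, 'rating': None}]
print(len(rejected))  # 1
```

Keeping the rejected raw records, rather than dropping them, makes it easier to see which cleaning rule needs to change.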