Web Scraping · 30 min read

Data Cleaning and Validation

Learn how to clean messy scraped data, remove noise, validate fields, and prepare it for analysis or storage.

David Miller
December 3, 2025

Scraped data is almost never clean.

You will see:

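  • extra spaces and stray newlines
  • missing values, often written as placeholders such as "N/A"
  • broken text (leftover markup, encoding errors)
  • wrong formats (numbers and prices stored as strings)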

Cleaning is mandatory.


Example raw data

raw = {
  "name": "  Book A \n",
  "price": " $10.00 ",
  "rating": "N/A"
}

Clean text

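# Remove leading and trailing whitespace, including the trailing newline
name = raw["name"].strip()
# Drop the currency symbol, then trim the surrounding spaces
price = raw["price"].replace("$", "").strip()
rating = raw["rating"]

# Treat the "N/A" placeholder as a missing value
if rating == "N/A":
    rating = None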
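
The same steps can be folded into one reusable helper. The following is a minimal sketch, not a definitive implementation: `clean_text` and the `MISSING` placeholder set are hypothetical names introduced here for illustration, and the set should be adjusted to whatever placeholders your source site actually uses.

```python
import re
import unicodedata

# Hypothetical set of placeholder strings treated as "no value".
MISSING = {"", "n/a", "na", "none", "null", "-"}


def clean_text(value):
    """Return normalized text, or None if the value is missing."""
    if value is None:
        return None
    # NFKC folds compatibility characters, e.g. non-breaking space -> space.
    text = unicodedata.normalize("NFKC", str(value))
    # Collapse runs of whitespace (including newlines and tabs) to one space.
    text = re.sub(r"\s+", " ", text).strip()
    if text.lower() in MISSING:
        return None
    return text


print(clean_text("  Book\u00a0A \n"))  # Book A
print(clean_text("N/A"))               # None
```

Returning None for every placeholder means later steps only have to check for one kind of "missing".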

Convert types safely

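def to_float(val):
    """Convert val to float, returning None if it cannot be parsed."""
    try:
        return float(val)
    except (TypeError, ValueError):
        # TypeError: val is None or another non-numeric type
        # ValueError: val is a string that is not a number
        return None

price = to_float(price)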
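
Real price strings often carry more than a single "$", for example thousands separators. Here is a hedged sketch of a more tolerant parser. `parse_price` is a hypothetical helper, and it assumes "." is the decimal separator and "," only groups thousands, which does not hold for every locale.

```python
import re


def parse_price(text):
    """Parse strings like '$1,299.50' into a float; return None on failure."""
    if text is None:
        return None
    # Drop everything except digits, the decimal point, and a minus sign.
    # Assumption: "." is the decimal separator and "," groups thousands.
    digits = re.sub(r"[^0-9.\-]", "", str(text))
    try:
        return float(digits)
    except ValueError:
        return None


print(parse_price(" $10.00 "))   # 10.0
print(parse_price("$1,299.50"))  # 1299.5
print(parse_price("N/A"))        # None
```

If exact cents matter, `decimal.Decimal` may be a better target type than `float`.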

Validate fields

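# A record without a name is unusable, so fail loudly
if not name:
    raise ValueError("Name missing")

# A bad price is only reported; the record is kept with price = None
if price is None:
    print("Price invalid")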
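
When a scrape yields many records, it can help to collect every problem instead of stopping at the first one. This is a sketch under assumptions: `validate` is a hypothetical function, the rating is assumed to be already converted to a number (or None), and the 0 to 5 range is an illustrative rule rather than something the example data defines.

```python
def validate(record):
    """Return a list of problems found in a cleaned record (empty if valid)."""
    errors = []
    if not record.get("name"):
        errors.append("name is missing")
    price = record.get("price")
    if price is None:
        errors.append("price is missing or not a number")
    elif price < 0:
        errors.append("price is negative")
    rating = record.get("rating")
    # Assumed rule: rating is optional, but if present must be within 0-5.
    if rating is not None and not 0 <= rating <= 5:
        errors.append("rating is out of range")
    return errors


print(validate({"name": "Book A", "price": 10.0, "rating": None}))  # []
print(validate({"name": "", "price": None, "rating": 7}))
```

An empty list means the record passed. A non-empty list can be logged alongside the record for later review.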

Example clean record

clean = {
  "name": name,
  "price": price,
  "rating": rating
}
print(clean)

Graph: cleaning flow

flowchart LR
  A[Raw HTML Data] --> B[Extract]
  B --> C[Clean]
  C --> D[Validate]
  D --> E[Store]
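
The Clean, Validate, and Store stages of this flow can be sketched as one small pipeline. The Extract stage is left out, and an in-memory list stands in for real storage. The names `clean`, `is_valid`, and `run_pipeline` are hypothetical and exist only for this illustration.

```python
def clean(raw):
    """Clean one raw record (same fields as the example above)."""
    name = raw.get("name", "").strip()
    price_text = raw.get("price", "").replace("$", "").strip()
    try:
        price = float(price_text)
    except ValueError:
        price = None
    rating = raw.get("rating")
    if rating == "N/A":
        rating = None
    return {"name": name, "price": price, "rating": rating}


def is_valid(record):
    """A record is storable only if it has a name and a numeric price."""
    return bool(record["name"]) and record["price"] is not None


def run_pipeline(raw_records):
    """Clean -> Validate -> Store; rejected records are kept for inspection."""
    stored, rejected = [], []
    for raw in raw_records:
        record = clean(raw)
        (stored if is_valid(record) else rejected).append(record)
    return stored, rejected


raw_records = [
    {"name": "  Book A \n", "price": " $10.00 ", "rating": "N/A"},
    {"name": "Book B", "price": "free", "rating": "4"},
]
stored, rejected = run_pipeline(raw_records)
print(stored)         # [{'name': 'Book A', 'price': 10.0, 'rating': None}]
print(len(rejected))  # 1
```

Keeping rejected records instead of dropping them makes it easier to spot a selector that has silently broken.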

Remember

  • Always strip and normalize text
  • Convert strings to numbers
  • Handle missing values
  • Bad data breaks analysis later
#Python #Advanced #DataCleaning