Web Scraping
30 min read
Data Cleaning and Validation
Learn how to clean messy scraped data, remove noise, validate fields, and prepare it for analysis or storage.
David Miller
December 3, 2025
Scraped data is almost never clean.
You will see:
- extra spaces
- missing values
- broken text
- wrong formats
Clean every field before you analyze or store it. Problems like these break parsing and comparisons later.
Example raw data
raw = {
    "name": " Book A \n",
    "price": " $10.00 ",
    "rating": "N/A"
}
Clean text
name = raw["name"].strip()
price = raw["price"].replace("$", "").strip()
rating = raw["rating"]
if rating == "N/A":
    rating = None
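Stripping only removes whitespace at the ends of a string; it does nothing about tabs, newlines, or non-breaking spaces in the middle. A minimal sketch of a more general normalizer follows. The helper name `normalize_text` and the `MISSING_MARKERS` set are hypothetical, so adjust the placeholders to whatever the site you scrape actually uses.

```python
import unicodedata

# Hypothetical placeholders that a site might use for "no value".
MISSING_MARKERS = {"", "n/a", "na", "none", "-"}

def normalize_text(value):
    """Return cleaned text, or None if the value is missing."""
    if value is None:
        return None
    # NFKC folds compatibility characters, e.g. a non-breaking space
    # becomes a regular space.
    text = unicodedata.normalize("NFKC", str(value))
    # split() with no arguments splits on any run of whitespace, so joining
    # with one space collapses newlines, tabs, and repeated spaces.
    text = " ".join(text.split())
    if text.lower() in MISSING_MARKERS:
        return None
    return text

print(normalize_text(" Book\u00a0A \n"))  # Book A
print(normalize_text("N/A"))              # None
```

Returning None for every missing marker means later steps only have one "no value" case to check.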
Convert types safely
def to_float(val):
    # Return None instead of raising when the value is not numeric.
    try:
        return float(val)
    except (TypeError, ValueError):
        return None

price = to_float(price)
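Real price strings often carry more than a dollar sign, for example thousands separators or trailing text. One possible approach is to pull the first number out with a regular expression. This is a sketch: `parse_price` is a hypothetical helper, and it assumes "," is a thousands separator and "." is the decimal point, which is not true for every locale.

```python
import re

def parse_price(raw):
    """Extract a float from a price string, or return None."""
    if raw is None:
        return None
    # Assumption: "," is a thousands separator and "." the decimal point.
    cleaned = str(raw).replace(",", "")
    # Match the first run of digits, with an optional decimal part.
    match = re.search(r"\d+(?:\.\d+)?", cleaned)
    if match is None:
        return None
    return float(match.group())

print(parse_price(" $1,299.50 "))  # 1299.5
print(parse_price("Free"))         # None
```

This version ignores negative signs and currency codes; if your data needs them, extend the pattern rather than stacking more `replace` calls.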
Validate fields
if not name:
    raise ValueError("Name missing")
if price is None:
    print("Price invalid")
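When you process many records, it can help to collect every problem instead of stopping at the first one. The sketch below shows one way to do that. `validate_record` is a hypothetical helper, and the price and rating rules (no negative prices, ratings on a 0 to 5 scale, rating already converted to a number) are assumptions for illustration, not rules from any particular site.

```python
def validate_record(record):
    """Return a list of problems; an empty list means the record is valid."""
    errors = []
    if not record.get("name"):
        errors.append("name is missing")
    price = record.get("price")
    if price is None:
        errors.append("price is missing or not a number")
    elif price < 0:
        errors.append("price is negative")
    rating = record.get("rating")
    # Assumption: rating is already a number (or None) on a 0-5 scale.
    if rating is not None and not 0 <= rating <= 5:
        errors.append("rating is out of range")
    return errors

print(validate_record({"name": "Book A", "price": 10.0, "rating": None}))  # []
print(validate_record({"name": "", "price": None, "rating": 7}))
```

The caller decides what to do with the list: log it, skip the record, or raise.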
Example clean record
clean = {
    "name": name,
    "price": price,
    "rating": rating
}
print(clean)
Graph: cleaning flow
flowchart LR
A[Raw HTML Data] --> B[Extract]
B --> C[Clean]
C --> D[Validate]
D --> E[Store]
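The Clean, Validate, and Store steps of this flow can be sketched as a small pipeline. The function names (`clean`, `is_valid`, `run_pipeline`) are hypothetical, the records are assumed to be already-extracted dicts of strings, and "store" is just an in-memory list here; a real project would write to a file or database instead.

```python
def clean(raw):
    """Strip text fields and convert the price to a float (None on failure)."""
    price_text = raw.get("price", "").replace("$", "").strip()
    try:
        price = float(price_text)
    except ValueError:
        price = None
    rating = raw.get("rating")
    return {
        "name": raw.get("name", "").strip(),
        "price": price,
        "rating": None if rating == "N/A" else rating,
    }

def is_valid(record):
    """A record needs a non-empty name and a numeric price."""
    return bool(record["name"]) and record["price"] is not None

def run_pipeline(raw_records):
    """Clean every record, then split them into stored and rejected."""
    stored, rejected = [], []
    for raw in raw_records:
        record = clean(raw)
        (stored if is_valid(record) else rejected).append(record)
    return stored, rejected

raw_records = [
    {"name": " Book A \n", "price": " $10.00 ", "rating": "N/A"},
    {"name": " Book B ", "price": "unknown", "rating": "4"},
]
stored, rejected = run_pipeline(raw_records)
print(stored)    # [{'name': 'Book A', 'price': 10.0, 'rating': None}]
print(rejected)  # [{'name': 'Book B', 'price': None, 'rating': '4'}]
```

Keeping rejected records instead of dropping them makes it easier to see which cleaning rule needs work.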
Remember
- Always strip and normalize text
- Convert strings to numbers
- Handle missing values
- Bad data breaks analysis later
#Python #Advanced #DataCleaning