# Features and Labels: The Building Blocks of ML

*Understand features and labels, the fundamental concepts you need before building any ML model.*
Every ML problem boils down to: given these features, predict this label.
## What Are Features?
Features are the input variables—the information you give the model to make predictions.
Example: Predicting house prices
Features might be:
- Square footage
- Number of bedrooms
- Location
- Year built
- Has garage?
Each feature is a piece of information that might help predict the price.
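Laid out as a table, a few samples with these features might look like this (the values are invented for illustration):

```python
import pandas as pd

# Hypothetical house data; every column here is a feature
houses = pd.DataFrame({
    "sqft":       [1500, 2200, 1800],
    "bedrooms":   [3, 4, 3],
    "location":   ["suburb", "city", "rural"],
    "year_built": [1995, 2010, 1978],
    "has_garage": [True, True, False],
})
print(houses)
```

Each row is one sample (one house); each column is one feature.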
## What Are Labels?
Labels are what you're trying to predict—the output.
| Problem | Label |
|---|---|
| House price prediction | Price ($) |
| Email spam detection | Spam or Not Spam |
| Disease diagnosis | Disease type |
| Customer churn | Will leave? Yes/No |
## Features vs. Labels

```
Features (X)               Label (y)
─────────────              ─────────
[sqft, beds, location]  →  [price]
[email_text, sender]    →  [spam/not_spam]
[age, symptoms, tests]  →  [diagnosis]
```
In code:

```python
X = data[['sqft', 'bedrooms', 'location']]  # Features
y = data['price']                           # Label

model.fit(X, y)  # Learn: features → label
```
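Putting it together, here is a minimal end-to-end sketch, assuming scikit-learn is available; the houses and prices are invented toy data:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Tiny invented dataset: two features and one label
data = pd.DataFrame({
    "sqft":     [1400, 2100, 1750, 2400],
    "bedrooms": [3, 4, 3, 5],
    "price":    [250_000, 390_000, 310_000, 450_000],
})

X = data[["sqft", "bedrooms"]]  # feature matrix
y = data["price"]               # target vector

model = LinearRegression()
model.fit(X, y)  # learn the mapping: features → label

# Predict the price of a new, unseen house
pred = model.predict(pd.DataFrame({"sqft": [2000], "bedrooms": [4]}))
```

The same `fit(X, y)` / `predict(X)` pattern applies whatever the model is; only the columns of `X` and the meaning of `y` change.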
## Good Features Matter More Than Fancy Algorithms
A simple model with great features beats a complex model with poor features.
Feature Engineering = Creating good features from raw data
Example: predicting flight delays
- Raw data: departure_time = "2025-03-15 14:30:00"
- Better features:
- hour_of_day = 14
- day_of_week = Saturday
- is_holiday = False
- month = March
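The raw timestamp above can be split into those features with the standard library (the holiday flag would come from a separate lookup, sketched here as a plain set):

```python
from datetime import datetime

raw = "2025-03-15 14:30:00"
dt = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S")

hour_of_day = dt.hour            # 14
day_of_week = dt.strftime("%A")  # "Saturday"
month = dt.strftime("%B")        # "March"

# Hypothetical holiday lookup; a real one would cover the full calendar
HOLIDAYS = {"2025-12-25", "2025-07-04"}
is_holiday = dt.strftime("%Y-%m-%d") in HOLIDAYS  # False
```

Each derived column gives the model a pattern it can actually use, such as "Saturday afternoons are busy", which is invisible in the raw string.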
## Types of Features
### Numerical

Numbers that have mathematical meaning.

```python
age = 25
temperature = 72.5
income = 50000
```
### Categorical

Categories or groups.

```python
color = "red"
country = "USA"
size = "medium"
```
### Binary

Yes/No, True/False.

```python
is_member = True
has_insurance = False
```
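Most models only consume numbers, so categorical and binary features are usually encoded first. A minimal sketch with pandas, on invented data:

```python
import pandas as pd

df = pd.DataFrame({
    "color":     ["red", "blue", "red"],
    "is_member": [True, False, True],
})

# One-hot encode the categorical column: one 0/1 column per category
encoded = pd.get_dummies(df, columns=["color"])

# Binary features can simply be cast to 0/1
encoded["is_member"] = encoded["is_member"].astype(int)
```

One-hot encoding avoids implying a false ordering; mapping `red=1, blue=2` would tell the model that blue is somehow "more" than red.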
## What Makes a Good Feature?
### 1. Predictive Power

Does it actually help predict the label?

- Height probably helps predict basketball skill
- Shoe size probably adds nothing once height is known
### 2. Available at Prediction Time
You need the feature when making predictions!
- Predicting "will customer buy?"
- Can't use "did customer buy" as a feature 😅
### 3. Not Too Many Missing Values

A feature that is missing in half your samples causes problems: you have to impute the gaps or drop rows, and either can distort the model.
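A quick way to spot sparse features is to measure the missing fraction per column. A sketch with pandas on invented data:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age":    [25, 31, np.nan, 47],       # 25% missing: usable
    "income": [np.nan, np.nan, np.nan, 52_000],  # 75% missing: suspect
})

missing_frac = df.isna().mean()  # fraction of NaNs per column
too_sparse = missing_frac[missing_frac > 0.5].index.tolist()
```

Columns flagged in `too_sparse` are candidates for dropping, or at least for a careful look at *why* the values are missing.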
### 4. Not Redundant
Don't include both "age" and "birth_year"—same information.
## Common Mistakes
### Mistake 1: Data Leakage

```python
# Predicting if a patient has diabetes
# BAD:  insulin_dosage as a feature (reveals the answer!)
# GOOD: age, weight, family_history
```
### Mistake 2: Using Future Information

```python
# Predicting tomorrow's stock price
# BAD:  tomorrow's trading volume (you don't have it yet!)
# GOOD: historical prices, today's volume
```
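One common guard is to build features only from past rows and the label only from the future row, for example with pandas `shift` (the prices here are invented):

```python
import pandas as pd

prices = pd.DataFrame({"close": [100.0, 102.0, 101.0, 105.0]})

# Label: the NEXT day's close (what we want to predict)
prices["target_next_close"] = prices["close"].shift(-1)

# Feature: the PREVIOUS day's close -- known at prediction time
prices["prev_close"] = prices["close"].shift(1)

# Using shift(-1) as a FEATURE would leak tomorrow's data into today
```

The direction of the shift is the whole story: negative shifts pull the future in (labels only), positive shifts pull the past in (safe features).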
## Quick Vocab
- Feature Matrix (X): All features for all samples
- Target Vector (y): All labels
- Feature Engineering: Creating new features
- Feature Selection: Choosing which features to use
## Summary
| Term | What It Is | Example |
|---|---|---|
| Feature | Input variable | Age, income, location |
| Label | Output to predict | Price, category |
| Sample | One data point | One house, one customer |
Remember: Garbage features in = Garbage predictions out
Spend time on your features. They're often more important than which algorithm you choose.