# Features and Labels: The Building Blocks of ML
Understand features and labels - the fundamental concepts you need before building any ML model.
Every ML problem boils down to: given these **features**, predict this **label**.
## What Are Features?
Features are the input variables—the information you give the model to make predictions.
**Example: Predicting house prices**
Features might be:

- Square footage
- Number of bedrooms
- Location
- Year built
- Has garage?
Each feature is a piece of information that might help predict the price.
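To make this concrete, here's one house represented as a set of features in Python. The variable names and all values are invented for illustration:

```python
# One sample (a single house), with each feature as a named value.
# All numbers here are made up for illustration.
house_features = {
    "sqft": 1800,          # square footage
    "bedrooms": 3,         # number of bedrooms
    "location": "Austin",  # location (a categorical feature)
    "year_built": 1995,    # year built
    "has_garage": True,    # a binary feature
}

house_label = 325_000  # the price: the value the model should learn to predict
```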
## What Are Labels?
Labels are what you're trying to predict—the output.
| Problem | Label |
|---------|-------|
| House price prediction | Price ($) |
| Email spam detection | Spam or Not Spam |
| Disease diagnosis | Disease type |
| Customer churn | Will leave? Yes/No |
## Features vs Labels
```
Features (X)               Label (y)
─────────────              ─────────
[sqft, beds, location]  →  [price]
[email_text, sender]    →  [spam/not_spam]
[age, symptoms, tests]  →  [diagnosis]
```
In code (assuming `data` is a pandas DataFrame with these columns and `model` is any estimator with a `fit` method, such as a scikit-learn model):

```python
X = data[['sqft', 'bedrooms', 'location']]  # Features
y = data['price']                           # Label

model.fit(X, y)  # Learn: features → label
```
## Good Features Matter More Than Fancy Algorithms
A simple model with great features beats a complex model with poor features.
**Feature Engineering** = Creating good features from raw data
Example: Predicting flight delays.

- Raw data: departure_time = "2025-03-15 14:30:00"
- Better features:
  - hour_of_day = 14
  - day_of_week = Saturday
  - is_holiday = False
  - month = March
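The steps above can be sketched with Python's standard `datetime` module. The `HOLIDAYS` set is a hypothetical stand-in; a real pipeline would use a proper holiday calendar:

```python
from datetime import datetime

raw = "2025-03-15 14:30:00"  # the raw departure_time string
dt = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S")

# Derive several informative features from one raw timestamp
hour_of_day = dt.hour            # 14
day_of_week = dt.strftime("%A")  # "Saturday"
month = dt.strftime("%B")        # "March"

# Hypothetical holiday lookup, assuming you maintain a set of holiday dates
HOLIDAYS = {"2025-12-25", "2025-01-01"}
is_holiday = dt.strftime("%Y-%m-%d") in HOLIDAYS  # False
```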
## Types of Features
### Numerical

Numbers that have mathematical meaning.

```python
age = 25
temperature = 72.5
income = 50000
```

### Categorical

Categories or groups.

```python
color = "red"
country = "USA"
size = "medium"
```

### Binary

Yes/No, True/False.

```python
is_member = True
has_insurance = False
```
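Most models need numeric inputs, so categorical and binary features are usually converted to numbers first. Here's a minimal one-hot encoding sketch; the `one_hot` helper and the category list are illustrative, not a library API:

```python
def one_hot(value, categories):
    """Return a 0/1 vector with a 1 at the position of `value`."""
    return [1 if c == value else 0 for c in categories]

colors = ["red", "green", "blue"]  # the known category set
print(one_hot("red", colors))      # -> [1, 0, 0]

# Binary features map directly to 0/1
print(int(True))                   # -> 1
```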
## What Makes a Good Feature?
### 1. Predictive Power

Does it actually help predict the label?

- Height probably helps predict basketball skill
- Shoe size probably doesn't

### 2. Available at Prediction Time

You need the feature when making predictions!

- Predicting "will customer buy?"
- Can't use "did customer buy" as a feature 😅
### 3. Not Too Many Missing Values

Features with 50% missing data cause problems.
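One way to check is to compute the missing fraction per feature. A sketch in plain Python with `None` marking a missing value (the data is invented):

```python
# Toy feature columns; None marks a missing value.
features = {
    "age":    [25, 32, None, 41, 29, None],            # 2/6 missing
    "income": [None, None, None, 50000, None, 61000],  # 4/6 missing
}

missing_frac = {
    name: sum(v is None for v in column) / len(column)
    for name, column in features.items()
}

# Keep only features with at most 50% of their values missing.
keep = [name for name, frac in missing_frac.items() if frac <= 0.5]
# keep == ["age"]; "income" is two-thirds missing and gets dropped
```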
### 4. Not Redundant

Don't include both "age" and "birth_year"—same information.
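The age/birth_year pair shows why: either one can be computed exactly from the other, so the second feature adds nothing. The reference year below is an assumption for the example:

```python
reference_year = 2025  # assumed "current" year for this example
ages = [25, 32, 41, 29]

# birth_year is fully determined by age (and vice versa), so including
# both features gives the model no new information.
birth_years = [reference_year - age for age in ages]
print(birth_years)  # -> [2000, 1993, 1984, 1996]
```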
## Common Mistakes
### Mistake 1: Data Leakage

```python
# Predicting if a patient has diabetes
# BAD:  insulin_dosage as a feature (reveals the answer!)
# GOOD: age, weight, family_history
```
### Mistake 2: Using Future Information

```python
# Predicting tomorrow's stock price
# BAD:  tomorrow's trading volume (you don't have it yet!)
# GOOD: historical prices, today's volume
```
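A simple guard against future information is to split data by time: when building features for day `t`, only use observations from strictly before `t`. A minimal sketch with made-up daily prices:

```python
# Daily prices, ordered oldest -> newest (values are invented).
prices = [101, 103, 102, 105, 107, 110, 108, 112]

t = 5                 # we want to predict the price on day t
history = prices[:t]  # allowed: everything strictly before day t
target = prices[t]    # the label for day t

# prices[t + 1:] is the future; it can never be a feature at prediction time.
print(history, "→", target)
```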
## Quick Vocab
- **Feature Matrix (X)**: All features for all samples
- **Target Vector (y)**: All labels
- **Feature Engineering**: Creating new features
- **Feature Selection**: Choosing which features to use
## Summary
| Term | What It Is | Example |
|------|-----------|---------|
| Feature | Input variable | Age, income, location |
| Label | Output to predict | Price, category |
| Sample | One data point | One house, one customer |
Remember: **Garbage features in = Garbage predictions out**
Spend time on your features. They're often more important than which algorithm you choose.