Encoding Categorical Variables for ML
Learn how to convert categorical data into numbers for machine learning models.
Encoding Categorical Variables
ML models need numbers. But real data has categories like "red", "blue", "green" or "small", "medium", "large". Let's convert them properly.
Types of Categorical Data
Nominal (No Order)
- Colors: red, blue, green
- Countries: USA, UK, India
- Email domains: gmail, yahoo, outlook
Ordinal (Has Order)
- Size: small < medium < large
- Education: high school < bachelor's < master's
- Rating: poor < average < good < excellent
The encoding method depends on the type!
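As an aside, pandas can make this distinction explicit with its Categorical dtype. A minimal sketch, assuming only pandas:

import pandas as pd

# ordered=True declares a meaningful order for the categories
sizes = pd.Categorical(['small', 'large', 'medium'],
                       categories=['small', 'medium', 'large'],
                       ordered=True)
print(sizes.codes)       # [0 2 1]: integer codes follow the declared order
print(sizes < 'medium')  # [ True False False]: ordered comparisons work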
Method 1: Label Encoding
Convert categories to numbers: 0, 1, 2, ...
from sklearn.preprocessing import LabelEncoder
colors = ['red', 'blue', 'green', 'blue', 'red']
encoder = LabelEncoder()
encoded = encoder.fit_transform(colors)
print(encoded)  # [2 0 1 0 2] (alphabetical: blue=0, green=1, red=2)
# Decode back
original = encoder.inverse_transform(encoded)
print(original)  # ['red' 'blue' 'green' 'blue' 'red']
Use for:
- Target variable (y)
- Ordinal data (but the alphabetical order it assigns may not match the true order!)
- Tree-based models (they can handle it)
Problem for nominal data:
The model might think red (2) > green (1) > blue (0), but colors have no order!
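You can see the issue by inspecting the fitted mapping; LabelEncoder stores it in its classes_ attribute, in alphabetical order:

print(encoder.classes_)  # ['blue' 'green' 'red'], so blue=0, green=1, red=2
# A linear model would treat red (2) as twice green (1), which is meaningless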
Method 2: One-Hot Encoding
Create a binary column for each category:
Original `color` column and its one-hot encoding:

| color | red | blue | green |
|---|---|---|---|
| red | 1 | 0 | 0 |
| blue | 0 | 1 | 0 |
| green | 0 | 0 | 1 |
| blue | 0 | 1 | 0 |
import pandas as pd
df = pd.DataFrame({'color': ['red', 'blue', 'green', 'blue']})
# Pandas get_dummies
encoded = pd.get_dummies(df, columns=['color'], dtype=int)  # dtype=int gives 0/1; pandas >= 2.0 defaults to True/False
print(encoded)
# Or sklearn
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(sparse_output=False)  # 'sparse' was renamed 'sparse_output' in scikit-learn 1.2
encoded = encoder.fit_transform(df[['color']])
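The sklearn encoder can also report the generated column names (get_feature_names_out exists in scikit-learn >= 1.0):

print(encoder.get_feature_names_out())
# ['color_blue' 'color_green' 'color_red']: one column per category, alphabetical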
Use for:
- Nominal data (no order)
- Most algorithms
Problem:
- Many categories = Many columns (high dimensionality)
- Rare categories might have very few 1s
Drop First Column?
# For linear models, drop one column to avoid multicollinearity
pd.get_dummies(df, columns=['color'], drop_first=True)
# 'blue' (first alphabetically) is dropped: if green=0 and red=0, it must be blue!
Method 3: Ordinal Encoding
Like label encoding, but YOU specify the order:
from sklearn.preprocessing import OrdinalEncoder
sizes = [['small'], ['large'], ['medium'], ['large'], ['small']]
# Specify correct order
encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
encoded = encoder.fit_transform(sizes)
print(encoded)  # [[0.] [2.] [1.] [2.] [0.]] (floats, one column)
Use for: Data with meaningful order (size, education, rating)
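If the test set might contain a size never seen in training, scikit-learn (>= 0.24) can map unknowns to a sentinel value instead of raising an error:

encoder = OrdinalEncoder(
    categories=[['small', 'medium', 'large']],
    handle_unknown='use_encoded_value',
    unknown_value=-1  # unseen categories become -1 instead of crashing
)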
Method 4: Binary Encoding
Converts categories to binary code. Good for many categories.
| Label | Binary |
|---|---|
| 0 | 0 0 |
| 1 | 0 1 |
| 2 | 1 0 |
| 3 | 1 1 |
Fewer columns than one-hot encoding!
# Install: pip install category_encoders
import category_encoders as ce
import pandas as pd

df = pd.DataFrame({'city': ['NYC', 'LA', 'NYC', 'Chicago']})
encoder = ce.BinaryEncoder(cols=['city'])
encoded = encoder.fit_transform(df)
Use when: Many categories (>10), want fewer columns
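To see the savings, a rough comparison for 100 categories (the exact binary column count depends on the encoder version, but it grows roughly like log2 of the number of categories):

import pandas as pd
import category_encoders as ce

cities = pd.DataFrame({'city': [f'city_{i}' for i in range(100)]})
n_onehot = pd.get_dummies(cities, columns=['city']).shape[1]  # 100 columns
n_binary = ce.BinaryEncoder(cols=['city']).fit_transform(cities).shape[1]  # ~7
print(n_onehot, n_binary)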
Method 5: Target Encoding
Replace each category with the mean of the target variable for that category:
import category_encoders as ce
encoder = ce.TargetEncoder(cols=['city'])
encoder.fit(X_train, y_train)
X_encoded = encoder.transform(X_train)
| city | target | encoded |
|---|---|---|
| NYC | 100k | 120k (mean of NYC targets) |
| LA | 150k | 150k (mean of LA targets) |
| NYC | 140k | 120k (mean of NYC targets) |
Use for: High cardinality (many categories)
Warning: high risk of data leakage! Encode with out-of-fold (cross-validated) statistics, as sketched below.
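Here is a minimal sketch of out-of-fold target encoding with plain pandas and scikit-learn (the helper name oof_target_encode is made up for illustration). Each row is encoded using target means computed on the other folds only:

import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oof_target_encode(X, y, col, n_splits=5):
    """Encode `col` with fold-wise target means to limit leakage."""
    encoded = pd.Series(np.nan, index=X.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, val_idx in kf.split(X):
        # Category means computed on the training folds only
        fold_means = y.iloc[train_idx].groupby(X[col].iloc[train_idx]).mean()
        encoded.iloc[val_idx] = X[col].iloc[val_idx].map(fold_means).to_numpy()
    return encoded.fillna(y.mean())  # categories unseen in a fold get the global mean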
Method 6: Frequency/Count Encoding
Replace with how often each category appears:
counts = df['city'].value_counts()
df['city_count'] = df['city'].map(counts)
Use for: When frequency is meaningful
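A normalized variant keeps the values between 0 and 1 (share of rows rather than raw counts):

freq = df['city'].value_counts(normalize=True)
df['city_freq'] = df['city'].map(freq)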
Quick Reference
| Method | Use Case | Categories | Creates Many Columns? |
|---|---|---|---|
| Label | Target, tree models | Any | No |
| One-Hot | Nominal, most models | <10 | Yes |
| Ordinal | Has meaningful order | Any | No |
| Binary | Many categories | >10 | Less than one-hot |
| Target | High cardinality | Many | No |
Code Example: Full Pipeline
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
# Define column types
numerical = ['age', 'income']
nominal = ['city', 'gender']
ordinal = ['education']
# Preprocessor
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numerical),
    ('nom', OneHotEncoder(drop='first'), nominal),
    ('ord', OrdinalEncoder(categories=[['high school', 'bachelor', 'master']]), ordinal)
])
# Full pipeline
pipeline = Pipeline([
    ('prep', preprocessor),
    ('model', LogisticRegression())
])
pipeline.fit(X_train, y_train)
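Because the encoders live inside the pipeline, fit touches only the training data; the same fitted transforms are then applied automatically at prediction time (X_test is assumed to have the same columns as X_train):

predictions = pipeline.predict(X_test)  # X_test is encoded with the training-set fit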
Important: Handle Unknown Categories
# What if the test set has a category not seen in training?
encoder = OneHotEncoder(handle_unknown='ignore')  # unseen categories become all-zero rows
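A quick sanity check of that behaviour (sparse_output assumes scikit-learn >= 1.2):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
encoder.fit(pd.DataFrame({'color': ['red', 'blue', 'green']}))
print(encoder.transform(pd.DataFrame({'color': ['purple']})))  # [[0. 0. 0.]]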
Key Takeaways
- Nominal data → One-Hot Encoding (or Binary for many categories)
- Ordinal data → Ordinal Encoding with correct order
- High cardinality → Target/Binary/Frequency Encoding
- Tree models → Can use Label/Ordinal Encoding (threshold splits can isolate categories even when the numeric order is arbitrary)
- Always fit on train, transform on test!