# Encoding Categorical Variables for ML
Learn how to convert categorical data into numbers for machine learning models.
ML models need numbers. But real data has categories like "red", "blue", "green" or "small", "medium", "large". Let's convert them properly.
## Types of Categorical Data
### Nominal (No Order)

- Colors: red, blue, green
- Countries: USA, UK, India
- Email domains: gmail, yahoo, outlook

### Ordinal (Has Order)

- Size: small < medium < large
- Education: high school < bachelor's < master's
- Rating: poor < average < good < excellent
The encoding method depends on the type!
## Method 1: Label Encoding
Convert categories to numbers: 0, 1, 2, ...
```python
from sklearn.preprocessing import LabelEncoder

colors = ['red', 'blue', 'green', 'blue', 'red']

encoder = LabelEncoder()
encoded = encoder.fit_transform(colors)
print(encoded)  # [2 0 1 0 2] (categories are numbered alphabetically)

# Decode back
original = encoder.inverse_transform(encoded)
print(original)  # ['red' 'blue' 'green' 'blue' 'red']
```
**Use for:**
- Target variable (y)
- Ordinal data (but the alphabetical order might be wrong!)
- Tree-based models (they can handle it)
**Problem for nominal data:** The model might think red (2) > green (1) > blue (0), but colors have no order!
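A quick sketch of the pitfall: the numeric codes create distances that don't exist in the data.

```python
from sklearn.preprocessing import LabelEncoder

colors = ['red', 'blue', 'green']
codes = LabelEncoder().fit_transform(colors)
print(codes)  # [2 0 1] -- blue=0, green=1, red=2 (alphabetical)

# A linear or distance-based model now "sees" red as twice as far
# from blue as green is, even though the colors are equally unrelated.
```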
## Method 2: One-Hot Encoding
Create a binary column for each category:
```
Original:        One-Hot Encoded:
color            red  blue  green
─────            ───  ────  ─────
red               1    0     0
blue              0    1     0
green             0    0     1
blue              0    1     0
```
```python
import pandas as pd

df = pd.DataFrame({'color': ['red', 'blue', 'green', 'blue']})

# Pandas get_dummies
encoded = pd.get_dummies(df, columns=['color'])
print(encoded)

# Or sklearn
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(sparse_output=False)  # sparse=False on scikit-learn < 1.2
encoded = encoder.fit_transform(df[['color']])
```
**Use for:**
- Nominal data (no order)
- Most algorithms

**Problems:**
- Many categories = many columns (high dimensionality)
- Rare categories might have very few 1s
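One common mitigation (a sketch, with made-up data): collapse rare categories into an `'other'` bucket before one-hot encoding, which caps the column count.

```python
import pandas as pd

df = pd.DataFrame({'city': ['NYC', 'NYC', 'LA', 'LA', 'Boise', 'Reno']})

# Keep only categories seen at least min_count times; lump the rest as 'other'
min_count = 2
counts = df['city'].value_counts()
keep = counts[counts >= min_count].index
df['city'] = df['city'].where(df['city'].isin(keep), 'other')

encoded = pd.get_dummies(df, columns=['city'])
print(list(encoded.columns))  # ['city_LA', 'city_NYC', 'city_other']
```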
### Drop First Column?
```python
# For linear models, drop one column to avoid multicollinearity
pd.get_dummies(df, columns=['color'], drop_first=True)

# drop_first removes the alphabetically first category (blue).
# If green=0 and red=0, the row must be blue!
```
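For the color DataFrame above, a quick check of what `drop_first` actually removes: the alphabetically first category becomes the implicit baseline.

```python
import pandas as pd

df = pd.DataFrame({'color': ['red', 'blue', 'green', 'blue']})

full = pd.get_dummies(df, columns=['color'])
dropped = pd.get_dummies(df, columns=['color'], drop_first=True)

print(list(full.columns))     # ['color_blue', 'color_green', 'color_red']
print(list(dropped.columns))  # ['color_green', 'color_red'] -- blue is the baseline
```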
## Method 3: Ordinal Encoding
Like label encoding, but YOU specify the order:
```python
from sklearn.preprocessing import OrdinalEncoder

sizes = [['small'], ['large'], ['medium'], ['large'], ['small']]

# Specify the correct order
encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
encoded = encoder.fit_transform(sizes)
print(encoded)  # [[0.], [2.], [1.], [2.], [0.]] as a float column
```
**Use for:** Data with meaningful order (size, education, rating)
## Method 4: Binary Encoding
Converts categories to binary code. Good for many categories.
```
Label    Binary
0     →  0 0
1     →  0 1
2     →  1 0
3     →  1 1
```
Fewer columns than one-hot encoding!
```python
# Install: pip install category_encoders
import category_encoders as ce

encoder = ce.BinaryEncoder(cols=['city'])
encoded = encoder.fit_transform(df)
```
**Use when:** You have many categories (>10) and want fewer columns.
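To make the idea concrete without any extra dependency, here is a minimal standard-library sketch of binary encoding (the `binary_encode` helper and the city names are made up for illustration):

```python
import math

def binary_encode(categories):
    """Map each distinct category to its integer label's fixed-width binary digits."""
    labels = {c: i for i, c in enumerate(sorted(set(categories)))}
    width = max(1, math.ceil(math.log2(len(labels))))
    return [[int(bit) for bit in format(labels[c], f'0{width}b')] for c in categories]

cities = ['NYC', 'LA', 'SF', 'NYC']   # LA=00, NYC=01, SF=10
print(binary_encode(cities))          # [[0, 1], [0, 0], [1, 0], [0, 1]]
```

Three categories fit in 2 columns instead of the 3 one-hot would need; 1000 categories would fit in 10.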
## Method 5: Target Encoding
Replace category with mean of target variable:
```python
import category_encoders as ce

encoder = ce.TargetEncoder(cols=['city'])
encoder.fit(X_train, y_train)
X_encoded = encoder.transform(X_train)
```
```
city   target   encoded
────   ──────   ───────
NYC    100k     120k  (mean of NYC targets)
LA     150k     150k  (mean of LA targets)
NYC    140k     120k  (mean of NYC targets)
```
**Use for:** High cardinality (many categories)

**Warning:** High risk of data leakage! Use cross-validation encoding.
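One leakage-safe approach is out-of-fold encoding: each row gets the target mean computed from the *other* folds only. A sketch with pandas and scikit-learn (the data and the `city_te` column name are made up):

```python
import pandas as pd
from sklearn.model_selection import KFold

df = pd.DataFrame({
    'city':   ['NYC', 'LA', 'NYC', 'LA', 'SF', 'NYC', 'SF', 'LA'],
    'target': [100,   150,  140,   110,  90,   120,   95,   130],
})

global_mean = df['target'].mean()
df['city_te'] = global_mean  # fallback for categories unseen in a fold

for train_idx, val_idx in KFold(n_splits=4, shuffle=True, random_state=0).split(df):
    # Means computed only from the other folds -- never from a row's own target
    fold_means = df.iloc[train_idx].groupby('city')['target'].mean()
    df.loc[df.index[val_idx], 'city_te'] = (
        df.iloc[val_idx]['city'].map(fold_means).fillna(global_mean).values
    )
```

Each row's encoding now carries no information from its own target value.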
## Method 6: Frequency/Count Encoding
Replace with how often each category appears:
```python
counts = df['city'].value_counts()
df['city_count'] = df['city'].map(counts)
```
**Use for:** When frequency is meaningful
## Quick Reference
| Method  | Use Case             | Categories | Creates Many Columns? |
|---------|----------------------|------------|-----------------------|
| Label   | Target, tree models  | Any        | No                    |
| One-Hot | Nominal, most models | <10        | Yes                   |
| Ordinal | Has meaningful order | Any        | No                    |
| Binary  | Many categories      | >10        | Fewer than one-hot    |
| Target  | High cardinality     | Many       | No                    |
## Code Example: Full Pipeline
```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# Define column types
numerical = ['age', 'income']
nominal = ['city', 'gender']
ordinal = ['education']

# Preprocessor
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numerical),
    ('nom', OneHotEncoder(drop='first'), nominal),
    ('ord', OrdinalEncoder(categories=[['high school', 'bachelor', 'master']]), ordinal),
])

# Full pipeline
pipeline = Pipeline([
    ('prep', preprocessor),
    ('model', LogisticRegression()),
])

pipeline.fit(X_train, y_train)
```
## Important: Handle Unknown Categories
```python
# What if the test set has a category not seen in training?
encoder = OneHotEncoder(handle_unknown='ignore')  # unknown values become all zeros
```
## Key Takeaways
1. **Nominal data** → One-Hot Encoding (or Binary for many categories)
2. **Ordinal data** → Ordinal Encoding with the correct order
3. **High cardinality** → Target/Binary/Frequency Encoding
4. **Tree models** → Can use Label Encoding (they don't assume order)
5. **Always fit on train, transform on test!**