
Encoding Categorical Variables for ML

Learn how to convert categorical data into numbers for machine learning models.

Sarah Chen
December 19, 2025

Encoding Categorical Variables

ML models need numbers. But real data has categories like "red", "blue", "green" or "small", "medium", "large". Let's convert them properly.

Types of Categorical Data

Nominal (No Order)

  • Colors: red, blue, green
  • Countries: USA, UK, India
  • Email domains: gmail, yahoo, outlook

Ordinal (Has Order)

  • Size: small < medium < large
  • Education: high school < bachelor's < master's
  • Rating: poor < average < good < excellent

The encoding method depends on the type!

Method 1: Label Encoding

Convert categories to numbers: 0, 1, 2, ...

from sklearn.preprocessing import LabelEncoder

colors = ['red', 'blue', 'green', 'blue', 'red']

encoder = LabelEncoder()
encoded = encoder.fit_transform(colors)
print(encoded)  # [2 0 1 0 2]

# Decode back
original = encoder.inverse_transform(encoded)
print(original)  # ['red' 'blue' 'green' 'blue' 'red']

Use for:

  • Target variable (y)
  • Ordinal data (but the order might be wrong! See the sketch below)
  • Tree-based models (they can handle it)

Problem for nominal data:
The model might think red (2) > green (1) > blue (0), but colors have no order!
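And even for ordinal data, the order LabelEncoder picks is alphabetical, not meaningful. A quick sketch with the size example shows the problem:

from sklearn.preprocessing import LabelEncoder

sizes = ['small', 'medium', 'large']
print(LabelEncoder().fit_transform(sizes))  # [2 1 0]
# Alphabetical codes: large=0, medium=1, small=2
# Numerically, small > medium > large -- the opposite of the real order!

This is exactly why Method 3 (ordinal encoding with an explicit category order) exists.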

Method 2: One-Hot Encoding

Create a binary column for each category:

Original:    One-Hot Encoded:
color        red  blue  green
─────        ───  ────  ─────
red          1    0     0
blue         0    1     0
green        0    0     1
blue         0    1     0

import pandas as pd

df = pd.DataFrame({'color': ['red', 'blue', 'green', 'blue']})

# Pandas get_dummies
encoded = pd.get_dummies(df, columns=['color'])
print(encoded)

# Or sklearn
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(sparse_output=False)  # sparse_output needs scikit-learn 1.2+
encoded = encoder.fit_transform(df[['color']])

Use for:

  • Nominal data (no order)
  • Most algorithms

Problem:

  • Many categories = Many columns (high dimensionality)
  • Rare categories might have very few 1s

Drop First Column?

# For linear models, drop one to avoid multicollinearity
pd.get_dummies(df, columns=['color'], drop_first=True)

# get_dummies drops the alphabetically first category (blue):
# if color_green=0 and color_red=0, it must be blue!
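Concretely, with the color frame from above (dtype=int just prints 0/1 instead of the booleans newer pandas returns by default):

import pandas as pd

df = pd.DataFrame({'color': ['red', 'blue', 'green', 'blue']})
print(pd.get_dummies(df, columns=['color'], drop_first=True, dtype=int))
#    color_green  color_red
# 0            0          1
# 1            0          0   <- both zero: this row is blue
# 2            1          0
# 3            0          0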

Method 3: Ordinal Encoding

Like label encoding, but YOU specify the order:

from sklearn.preprocessing import OrdinalEncoder

sizes = [['small'], ['large'], ['medium'], ['large'], ['small']]

# Specify correct order
encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
encoded = encoder.fit_transform(sizes)
print(encoded)  # [[0.], [2.], [1.], [2.], [0.]]

Use for: Data with meaningful order (size, education, rating)
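
One caveat: by default, OrdinalEncoder raises an error on categories it never saw during fit. Recent scikit-learn versions can map unknowns to a sentinel value instead, as in this sketch:

from sklearn.preprocessing import OrdinalEncoder

encoder = OrdinalEncoder(
    categories=[['small', 'medium', 'large']],
    handle_unknown='use_encoded_value',  # map unseen categories...
    unknown_value=-1,                    # ...to a sentinel instead of erroring
)
encoder.fit([['small'], ['large']])
print(encoder.transform([['xl']]))  # [[-1.]]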

Method 4: Binary Encoding

Converts categories to binary code. Good for many categories.

Label    Binary
  0    →  0 0
  1    →  0 1
  2    →  1 0
  3    →  1 1

Fewer columns than one-hot encoding: roughly log₂(n) columns for n categories instead of n!

# Install: pip install category_encoders
import pandas as pd
import category_encoders as ce

df = pd.DataFrame({'city': ['NYC', 'LA', 'Chicago', 'LA', 'NYC']})
encoder = ce.BinaryEncoder(cols=['city'])
encoded = encoder.fit_transform(df)

Use when: Many categories (>10), want fewer columns
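
To see the savings, here is a rough sketch with eight made-up city names: one-hot needs eight columns, binary only four:

import pandas as pd
import category_encoders as ce

df = pd.DataFrame({'city': [f'city_{i}' for i in range(8)]})
print(pd.get_dummies(df, columns=['city']).shape[1])                # 8 columns
print(ce.BinaryEncoder(cols=['city']).fit_transform(df).shape[1])   # 4 columns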

Method 5: Target Encoding

Replace category with mean of target variable:

import category_encoders as ce

encoder = ce.TargetEncoder(cols=['city'])
encoder.fit(X_train, y_train)
X_encoded = encoder.transform(X_train)

city      target    encoded
────      ──────    ───────
NYC       100k      120k (mean of NYC targets: 100k, 140k)
LA        150k      150k (mean of LA targets: 150k)
NYC       140k      120k (mean of NYC targets: 100k, 140k)

Use for: High cardinality (many categories)
Warning: High risk of data leakage! Use cross-validation encoding.
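
One common defense is out-of-fold encoding: each row gets a mean computed from the other folds, so no row ever sees its own label. A minimal sketch with made-up data:

import pandas as pd
from sklearn.model_selection import KFold

df = pd.DataFrame({
    'city':   ['NYC', 'LA', 'NYC', 'LA', 'NYC', 'LA'],
    'target': [100, 150, 140, 110, 120, 130],
})

df['city_encoded'] = 0.0
kf = KFold(n_splits=3, shuffle=True, random_state=42)
for train_idx, val_idx in kf.split(df):
    # Category means computed on the training folds only
    fold_means = df.iloc[train_idx].groupby('city')['target'].mean()
    encoded = df.iloc[val_idx]['city'].map(fold_means)
    # Categories missing from these folds fall back to the training-fold mean
    df.loc[df.index[val_idx], 'city_encoded'] = encoded.fillna(
        df.iloc[train_idx]['target'].mean())

category_encoders' TargetEncoder also supports a smoothing parameter, which pulls rare categories toward the global mean.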

Method 6: Frequency/Count Encoding

Replace with how often each category appears:

import pandas as pd

df = pd.DataFrame({'city': ['NYC', 'LA', 'NYC', 'Chicago', 'NYC']})
counts = df['city'].value_counts()
df['city_count'] = df['city'].map(counts)  # NYC -> 3, LA -> 1, Chicago -> 1

Use for: When frequency is meaningful

Quick Reference

Method    Use Case               Categories  Creates Many Columns?
──────    ────────               ──────────  ─────────────────────
Label     Target, tree models    Any         No
One-Hot   Nominal, most models   <10         Yes
Ordinal   Has meaningful order   Any         No
Binary    Many categories        >10         Fewer than one-hot
Target    High cardinality       Many        No

Code Example: Full Pipeline

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# Define column types
numerical = ['age', 'income']
nominal = ['city', 'gender']
ordinal = ['education']

# Preprocessor
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numerical),
    ('nom', OneHotEncoder(drop='first'), nominal),
    ('ord', OrdinalEncoder(categories=[['high school', 'bachelor', 'master']]), ordinal)
])

# Full pipeline
pipeline = Pipeline([
    ('prep', preprocessor),
    ('model', LogisticRegression())
])

pipeline.fit(X_train, y_train)

Important: Handle Unknown Categories

# What if test set has a category not seen in training?
encoder = OneHotEncoder(handle_unknown='ignore')  # Will create all zeros
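
A quick sketch of the behavior (sparse_output needs scikit-learn 1.2+; older versions use sparse=False):

from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
encoder.fit([['red'], ['blue'], ['green']])
print(encoder.transform([['purple']]))  # [[0. 0. 0.]] -- unseen category, all zeros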

Key Takeaways

  1. Nominal data → One-Hot Encoding (or Binary for many categories)
  2. Ordinal data → Ordinal Encoding with correct order
  3. High cardinality → Target/Binary/Frequency Encoding
  4. Tree models → Can use Label Encoding (they don't assume order)
  5. Always fit on train, transform on test!
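
To make point 5 concrete, here is a minimal sketch (assuming X_train and X_test are DataFrames with a 'city' column):

from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
X_train_enc = encoder.fit_transform(X_train[['city']])  # learn categories from train
X_test_enc = encoder.transform(X_test[['city']])        # reuse them; never fit on test
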
#Machine Learning  #Categorical Data  #Feature Engineering  #Beginner