
# Encoding Categorical Variables for ML

Learn how to convert categorical data into numbers for machine learning models.

Sarah Chen
December 19, 2025

## Encoding Categorical Variables

ML models need numbers. But real data has categories like "red", "blue", "green" or "small", "medium", "large". Let's convert them properly.

## Types of Categorical Data

### Nominal (No Order)

- Colors: red, blue, green
- Countries: USA, UK, India
- Email domains: gmail, yahoo, outlook

### Ordinal (Has Order)

- Size: small < medium < large
- Education: high school < bachelor's < master's
- Rating: poor < average < good < excellent

The encoding method depends on the type!
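In pandas you can make this distinction explicit with `pd.Categorical`, whose `ordered` flag marks whether comparisons between categories are meaningful — a quick sketch:

```python
import pandas as pd

# Nominal: no meaningful order between the categories
colors = pd.Categorical(['red', 'blue', 'green'])
print(colors.ordered)  # False

# Ordinal: declaring an explicit order makes comparisons meaningful
sizes = pd.Categorical(['small', 'large', 'medium'],
                       categories=['small', 'medium', 'large'],
                       ordered=True)
print(sizes.min(), sizes.max())  # small large
```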

## Method 1: Label Encoding

Convert categories to numbers: 0, 1, 2, ...

```python
from sklearn.preprocessing import LabelEncoder

colors = ['red', 'blue', 'green', 'blue', 'red']

encoder = LabelEncoder()
encoded = encoder.fit_transform(colors)
print(encoded)  # [2 0 1 0 2]

# Decode back
original = encoder.inverse_transform(encoded)
print(original)  # ['red' 'blue' 'green' 'blue' 'red']
```

**Use for:**
- Target variable (y)
- Ordinal data (but order might be wrong!)
- Tree-based models (they can handle it)

**Problem for nominal data:** The model might think red (2) > green (1) > blue (0), but colors have no order!
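You can see the arbitrary ordering directly: `LabelEncoder` assigns codes alphabetically, so any distance computed on the codes is an artifact of the encoding, not the data.

```python
from sklearn.preprocessing import LabelEncoder

colors = ['red', 'blue', 'green']
codes = LabelEncoder().fit_transform(colors).tolist()

# LabelEncoder assigns codes alphabetically: blue=0, green=1, red=2
print(dict(zip(colors, codes)))  # {'red': 2, 'blue': 0, 'green': 1}

# A distance-based model now sees red as "twice as far" from blue
# as from green -- purely an artifact of the arbitrary codes
print(abs(codes[0] - codes[1]), abs(codes[0] - codes[2]))  # 2 1
```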

## Method 2: One-Hot Encoding

Create a binary column for each category:

```
Original:    One-Hot Encoded:
color        red  blue  green
─────        ───  ────  ─────
red           1    0     0
blue          0    1     0
green         0    0     1
blue          0    1     0
```

```python
import pandas as pd

df = pd.DataFrame({'color': ['red', 'blue', 'green', 'blue']})

# Pandas get_dummies
encoded = pd.get_dummies(df, columns=['color'])
print(encoded)

# Or sklearn
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(sparse_output=False)  # use sparse=False on sklearn < 1.2
encoded = encoder.fit_transform(df[['color']])
```

**Use for:**
- Nominal data (no order)
- Most algorithms

**Problems:**
- Many categories = many columns (high dimensionality)
- Rare categories might have very few 1s
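To see the blow-up concretely, here is a sketch with a made-up high-cardinality `city` column:

```python
import pandas as pd

# Hypothetical column with 1,000 distinct categories
df = pd.DataFrame({'city': [f'city_{i}' for i in range(1000)]})

encoded = pd.get_dummies(df, columns=['city'])
print(encoded.shape)  # (1000, 1000) -- one new column per category
```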

### Drop First Column?

```python
# For linear models, drop one column to avoid multicollinearity
pd.get_dummies(df, columns=['color'], drop_first=True)

# If red=0 and blue=0, it must be green!
```

## Method 3: Ordinal Encoding

Like label encoding, but YOU specify the order:

```python
from sklearn.preprocessing import OrdinalEncoder

sizes = [['small'], ['large'], ['medium'], ['large'], ['small']]

# Specify the correct order
encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
encoded = encoder.fit_transform(sizes)
print(encoded)  # [[0.] [2.] [1.] [2.] [0.]]
```

**Use for:** Data with meaningful order (size, education, rating)

## Method 4: Binary Encoding

Converts categories to binary code. Good for many categories.

```
Label    Binary
0    →   0 0
1    →   0 1
2    →   1 0
3    →   1 1
```

Fewer columns than one-hot encoding!

```python
# Install: pip install category_encoders
import category_encoders as ce

encoder = ce.BinaryEncoder(cols=['city'])
encoded = encoder.fit_transform(df)
```

**Use when:** Many categories (>10), want fewer columns
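If you want to avoid the extra dependency, the core idea fits in a few lines of plain Python (a sketch of the technique, not `category_encoders`' exact column layout):

```python
import math

def binary_encode(values):
    """Map each distinct value to a label, then to its binary digits."""
    labels = {v: i for i, v in enumerate(dict.fromkeys(values))}
    width = max(1, math.ceil(math.log2(len(labels))))  # columns needed
    return [[(labels[v] >> bit) & 1 for bit in range(width - 1, -1, -1)]
            for v in values]

print(binary_encode(['NYC', 'LA', 'SF', 'LA']))
# [[0, 0], [0, 1], [1, 0], [0, 1]]
```

Three distinct cities need only ceil(log2(3)) = 2 columns instead of 3 one-hot columns; 1,000 categories would need just 10.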

## Method 5: Target Encoding

Replace category with mean of target variable:

```python
import category_encoders as ce

encoder = ce.TargetEncoder(cols=['city'])
encoder.fit(X_train, y_train)
X_encoded = encoder.transform(X_train)
```

```
city   target   encoded
────   ──────   ───────
NYC    100k     120k  (mean of NYC)
LA     150k     150k  (mean of LA)
NYC    140k     120k  (mean of NYC)
```

**Use for:** High cardinality (many categories)

**Warning:** High risk of data leakage! Use cross-validation encoding.
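One standard leakage guard is out-of-fold encoding: each row gets the target mean computed on the *other* folds only, so a row never contributes to its own encoding. A minimal sketch (the `city`/`price` columns are made up for illustration):

```python
import pandas as pd
from sklearn.model_selection import KFold

def oof_target_encode(df, col, target, n_splits=5):
    """Encode `col` with out-of-fold target means to limit leakage."""
    encoded = pd.Series(index=df.index, dtype=float)
    global_mean = df[target].mean()  # fallback for categories unseen in a fold
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    for train_idx, val_idx in kf.split(df):
        # Means computed only on the training folds
        fold_means = df.iloc[train_idx].groupby(col)[target].mean()
        encoded.iloc[val_idx] = (df.iloc[val_idx][col]
                                 .map(fold_means)
                                 .fillna(global_mean)
                                 .to_numpy())
    return encoded

df = pd.DataFrame({'city': ['NYC', 'LA', 'NYC', 'LA'] * 5,
                   'price': [100, 150, 140, 160] * 5})
print(oof_target_encode(df, 'city', 'price', n_splits=2).head())
```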

## Method 6: Frequency/Count Encoding

Replace with how often each category appears:

```python
counts = df['city'].value_counts()
df['city_count'] = df['city'].map(counts)
```

**Use for:** When frequency is meaningful

## Quick Reference

| Method  | Use Case             | Categories | Creates Many Columns? |
|---------|----------------------|------------|-----------------------|
| Label   | Target, tree models  | Any        | No                    |
| One-Hot | Nominal, most models | <10        | Yes                   |
| Ordinal | Has meaningful order | Any        | No                    |
| Binary  | Many categories      | >10        | Less than one-hot     |
| Target  | High cardinality     | Many       | No                    |

## Code Example: Full Pipeline

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# Define column types
numerical = ['age', 'income']
nominal = ['city', 'gender']
ordinal = ['education']

# Preprocessor
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numerical),
    ('nom', OneHotEncoder(drop='first'), nominal),
    ('ord', OrdinalEncoder(categories=[['high school', 'bachelor', 'master']]), ordinal)
])

# Full pipeline
pipeline = Pipeline([
    ('prep', preprocessor),
    ('model', LogisticRegression())
])

pipeline.fit(X_train, y_train)
```

## Important: Handle Unknown Categories

```python
# What if the test set has a category not seen in training?
encoder = OneHotEncoder(handle_unknown='ignore')  # unknown values become all zeros
```

## Key Takeaways

1. **Nominal data** → One-Hot Encoding (or Binary for many categories)
2. **Ordinal data** → Ordinal Encoding with correct order
3. **High cardinality** → Target/Binary/Frequency Encoding
4. **Tree models** → Can use Label Encoding (they don't assume order)
5. **Always fit on train, transform on test!**

#Machine Learning#Categorical Data#Feature Engineering#Beginner