# Encoding Categorical Variables for ML
Learn how to convert categorical data into numbers for machine learning models.
ML models need numbers. But real data has categories like "red", "blue", "green" or "small", "medium", "large". Let's convert them properly.
## Types of Categorical Data
### Nominal (No Order)

- Colors: red, blue, green
- Countries: USA, UK, India
- Email domains: gmail, yahoo, outlook

### Ordinal (Has Order)

- Size: small < medium < large
- Education: high school < bachelor's < master's
- Rating: poor < average < good < excellent
The encoding method depends on the type!
## Method 1: Label Encoding
Convert categories to numbers: 0, 1, 2, ...
```python
from sklearn.preprocessing import LabelEncoder

colors = ['red', 'blue', 'green', 'blue', 'red']

encoder = LabelEncoder()
encoded = encoder.fit_transform(colors)
print(encoded)  # [2 0 1 0 2] (categories are numbered alphabetically)

# Decode back
original = encoder.inverse_transform(encoded)
print(original)  # ['red' 'blue' 'green' 'blue' 'red']
```
**Use for:**
- Target variable (y)
- Ordinal data (but the alphabetical order might be wrong!)
- Tree-based models (they can handle it)
**Problem for nominal data:** The model might think red (2) > green (1) > blue (0), but colors have no order!
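A quick sketch of the pitfall: the numeric codes create distances that don't exist in the data.

```python
from sklearn.preprocessing import LabelEncoder

colors = ['red', 'blue', 'green']
codes = LabelEncoder().fit_transform(colors)
print(codes)  # [2 0 1] -- blue=0, green=1, red=2 (alphabetical)

# A linear or distance-based model now "sees" red as twice as far
# from blue as green is, even though the colors are equally unrelated.
```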
## Method 2: One-Hot Encoding
Create a binary column for each category:
```
Original:        One-Hot Encoded:
color            red  blue  green
─────            ───  ────  ─────
red               1    0     0
blue              0    1     0
green             0    0     1
blue              0    1     0
```
```python
import pandas as pd

df = pd.DataFrame({'color': ['red', 'blue', 'green', 'blue']})

# Pandas get_dummies
encoded = pd.get_dummies(df, columns=['color'])
print(encoded)

# Or sklearn
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(sparse_output=False)  # sparse=False on scikit-learn < 1.2
encoded = encoder.fit_transform(df[['color']])
```
**Use for:**
- Nominal data (no order)
- Most algorithms

**Problems:**
- Many categories = many columns (high dimensionality)
- Rare categories might have very few 1s
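One common mitigation (a sketch, with made-up data): collapse rare categories into an `'other'` bucket before one-hot encoding, which caps the column count.

```python
import pandas as pd

df = pd.DataFrame({'city': ['NYC', 'NYC', 'LA', 'LA', 'Boise', 'Reno']})

# Keep only categories seen at least min_count times; lump the rest as 'other'
min_count = 2
counts = df['city'].value_counts()
keep = counts[counts >= min_count].index
df['city'] = df['city'].where(df['city'].isin(keep), 'other')

encoded = pd.get_dummies(df, columns=['city'])
print(list(encoded.columns))  # ['city_LA', 'city_NYC', 'city_other']
```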
### Drop First Column?
```python
# For linear models, drop one column to avoid multicollinearity
pd.get_dummies(df, columns=['color'], drop_first=True)

# drop_first removes the alphabetically first category (blue).
# If green=0 and red=0, the row must be blue!
```
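For the color DataFrame above, a quick check of what `drop_first` actually removes: the alphabetically first category becomes the implicit baseline.

```python
import pandas as pd

df = pd.DataFrame({'color': ['red', 'blue', 'green', 'blue']})

full = pd.get_dummies(df, columns=['color'])
dropped = pd.get_dummies(df, columns=['color'], drop_first=True)

print(list(full.columns))     # ['color_blue', 'color_green', 'color_red']
print(list(dropped.columns))  # ['color_green', 'color_red'] -- blue is the baseline
```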
## Method 3: Ordinal Encoding
Like label encoding, but YOU specify the order:
```python
from sklearn.preprocessing import OrdinalEncoder

sizes = [['small'], ['large'], ['medium'], ['large'], ['small']]

# Specify the correct order
encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
encoded = encoder.fit_transform(sizes)
print(encoded)  # [[0.], [2.], [1.], [2.], [0.]] as a float column
```
**Use for:** Data with meaningful order (size, education, rating)
## Method 4: Binary Encoding
Converts categories to binary code. Good for many categories.
```
Label    Binary
0     →  0 0
1     →  0 1
2     →  1 0
3     →  1 1
```
Fewer columns than one-hot encoding!
```python
# Install: pip install category_encoders
import category_encoders as ce

encoder = ce.BinaryEncoder(cols=['city'])
encoded = encoder.fit_transform(df)
```
**Use when:** You have many categories (>10) and want fewer columns.
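To make the idea concrete without any extra dependency, here is a minimal standard-library sketch of binary encoding (the `binary_encode` helper and the city names are made up for illustration):

```python
import math

def binary_encode(categories):
    """Map each distinct category to its integer label's fixed-width binary digits."""
    labels = {c: i for i, c in enumerate(sorted(set(categories)))}
    width = max(1, math.ceil(math.log2(len(labels))))
    return [[int(bit) for bit in format(labels[c], f'0{width}b')] for c in categories]

cities = ['NYC', 'LA', 'SF', 'NYC']   # LA=00, NYC=01, SF=10
print(binary_encode(cities))          # [[0, 1], [0, 0], [1, 0], [0, 1]]
```

Three categories fit in 2 columns instead of the 3 one-hot would need; 1000 categories would fit in 10.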
## Method 5: Target Encoding
Replace category with mean of target variable:
```python
import category_encoders as ce

encoder = ce.TargetEncoder(cols=['city'])
encoder.fit(X_train, y_train)
X_encoded = encoder.transform(X_train)
```
```
city   target   encoded
────   ──────   ───────
NYC    100k     120k  (mean of NYC targets)
LA     150k     150k  (mean of LA targets)
NYC    140k     120k  (mean of NYC targets)
```
**Use for:** High cardinality (many categories)

**Warning:** High risk of data leakage! Use cross-validation encoding.
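One leakage-safe approach is out-of-fold encoding: each row gets the target mean computed from the *other* folds only. A sketch with pandas and scikit-learn (the data and the `city_te` column name are made up):

```python
import pandas as pd
from sklearn.model_selection import KFold

df = pd.DataFrame({
    'city':   ['NYC', 'LA', 'NYC', 'LA', 'SF', 'NYC', 'SF', 'LA'],
    'target': [100,   150,  140,   110,  90,   120,   95,   130],
})

global_mean = df['target'].mean()
df['city_te'] = global_mean  # fallback for categories unseen in a fold

for train_idx, val_idx in KFold(n_splits=4, shuffle=True, random_state=0).split(df):
    # Means computed only from the other folds -- never from a row's own target
    fold_means = df.iloc[train_idx].groupby('city')['target'].mean()
    df.loc[df.index[val_idx], 'city_te'] = (
        df.iloc[val_idx]['city'].map(fold_means).fillna(global_mean).values
    )
```

Each row's encoding now carries no information from its own target value.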
## Method 6: Frequency/Count Encoding
Replace with how often each category appears:
```python
counts = df['city'].value_counts()
df['city_count'] = df['city'].map(counts)
```
**Use for:** When frequency is meaningful
## Quick Reference
| Method  | Use Case             | Categories | Creates Many Columns? |
|---------|----------------------|------------|-----------------------|
| Label   | Target, tree models  | Any        | No                    |
| One-Hot | Nominal, most models | <10        | Yes                   |
| Ordinal | Has meaningful order | Any        | No                    |
| Binary  | Many categories      | >10        | Fewer than one-hot    |
| Target  | High cardinality     | Many       | No                    |
## Code Example: Full Pipeline
```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# Define column types
numerical = ['age', 'income']
nominal = ['city', 'gender']
ordinal = ['education']

# Preprocessor
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numerical),
    ('nom', OneHotEncoder(drop='first'), nominal),
    ('ord', OrdinalEncoder(categories=[['high school', 'bachelor', 'master']]), ordinal),
])

# Full pipeline
pipeline = Pipeline([
    ('prep', preprocessor),
    ('model', LogisticRegression()),
])

pipeline.fit(X_train, y_train)
```
## Important: Handle Unknown Categories
```python
# What if the test set has a category not seen in training?
encoder = OneHotEncoder(handle_unknown='ignore')  # unknown values become all zeros
```
## Key Takeaways
1. **Nominal data** → One-Hot Encoding (or Binary for many categories)
2. **Ordinal data** → Ordinal Encoding with the correct order
3. **High cardinality** → Target/Binary/Frequency Encoding
4. **Tree models** → Can use Label Encoding (they don't assume order)
5. **Always fit on train, transform on test!**