# K-Nearest Neighbors (KNN) Algorithm

Learn KNN, one of the simplest and most intuitive classification algorithms: classify a point by looking at the points most similar to it.
KNN is beautifully simple: to classify something new, look at its closest neighbors and go with the majority.
## The Idea
"Tell me who your friends are, and I'll tell you who you are."
```
New point: ?
Nearest neighbors: 🔵 🔵 🔴

If k=3, vote:
  🔵 = 2 votes
  🔴 = 1 vote

Winner: 🔵
```
That's it! No training, no learned parameters—just find neighbors and vote.
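To make the voting step concrete, here is a minimal sketch in plain Python. The points, labels, and query are made-up toy values chosen to mirror the diagram above:

```python
from collections import Counter
from math import dist  # Euclidean distance, Python 3.8+

# Toy training data: (point, label) pairs -- made-up values for illustration
training_data = [((1.0, 1.0), "blue"), ((1.2, 0.8), "blue"), ((1.4, 1.3), "red"),
                 ((5.0, 5.0), "red"), ((5.2, 4.8), "blue")]
query = (1.1, 1.0)
k = 3

# Sort by distance to the query, take the k closest, and vote
neighbors = sorted(training_data, key=lambda item: dist(item[0], query))[:k]
votes = Counter(label for _, label in neighbors)
print(votes.most_common(1)[0][0])  # -> "blue" (2 votes vs 1)
```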
## How It Works
### Step 1: Choose k (number of neighbors)

```python
k = 5  # Look at the 5 nearest neighbors
```
### Step 2: Calculate distances to all training points

```python
# Euclidean distance (most common)
distance = sqrt((x1 - x2)**2 + (y1 - y2)**2)
```
### Step 3: Find the k nearest neighbors

Sort by distance and take the top k.
### Step 4: Vote (classification) or Average (regression)

```python
# Classification: majority vote
prediction = most_common_class(neighbors)

# Regression: average value
prediction = mean(neighbor_values)
```
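Putting the four steps together, here is a from-scratch sketch with NumPy (not the scikit-learn implementation); the function name `knn_predict` and the toy arrays are illustrative:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    """Classify x_new by majority vote among its k nearest training points."""
    # Step 2: Euclidean distances from x_new to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step 3: indices of the k closest points
    nearest = np.argsort(distances)[:k]
    # Step 4: majority vote among their labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy example (values are made up)
X_train = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.2, 1.9]), k=3))  # -> 0
```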
## Code Example
```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Split first, then scale (fit the scaler on training data only to avoid leakage)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# IMPORTANT: Scale features!
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Create KNN classifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Evaluate
accuracy = knn.score(X_test, y_test)
print(f"Accuracy: {accuracy:.2%}")

# Predict a new sample (scale it with the same scaler)
new_flower = scaler.transform([[5.0, 3.4, 1.5, 0.2]])
prediction = knn.predict(new_flower)
print(f"Predicted class: {iris.target_names[prediction[0]]}")
```
## Choosing k
### k too small (like k=1)

- Sensitive to noise
- Overfitting
- Unstable predictions
### k too large (like k=100)

- Over-smoothing
- Underfitting
- Slow predictions
### Rule of Thumb

```python
k = int(sqrt(n_samples))  # Square root of the sample size
```
For classification, use **odd k** to avoid ties.
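As a quick, runnable sketch of that rule of thumb (the sample count is just an example value):

```python
from math import sqrt

n_samples = 150              # e.g., size of the training set
k = int(sqrt(n_samples))     # rule-of-thumb starting point -> 12
if k % 2 == 0:
    k += 1                   # make k odd to avoid ties in binary voting
print(k)                     # -> 13
```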
### Find Best k with Cross-Validation
```python
from sklearn.model_selection import cross_val_score

k_values = range(1, 31)
cv_scores = []

for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_train, y_train, cv=5)
    cv_scores.append(scores.mean())

best_k = k_values[cv_scores.index(max(cv_scores))]
print(f"Best k: {best_k}")
```
## Why Feature Scaling Is Critical
KNN uses distances. If features have different scales, larger ones dominate!
```
Without scaling:
- Income: 50,000 vs 80,000  (diff: 30,000)
- Age:    25 vs 45          (diff: 20)

Income dominates the distance completely!

With scaling (both mapped to 0-1):
- Income: 0.5 vs 0.8    (diff: 0.3)
- Age:    0.25 vs 0.45  (diff: 0.2)

Now both contribute fairly.
```
**Always scale features before KNN!**
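A minimal sketch of that effect, using made-up income/age values and scikit-learn's MinMaxScaler (0-1 scaling, as in the illustration above):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up [income, age] rows
X = np.array([[50_000, 25],
              [80_000, 45],
              [20_000, 60],
              [100_000, 30]], dtype=float)

a, b = X[0], X[1]
print(np.linalg.norm(a - b))                # ~30000: income swamps age

X_scaled = MinMaxScaler().fit_transform(X)  # map each feature to [0, 1]
a_s, b_s = X_scaled[0], X_scaled[1]
print(np.linalg.norm(a_s - b_s))            # now both features contribute
```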
## Distance Metrics
### Euclidean (default)

```python
# Straight-line distance
d = sqrt(sum((xi - yi)**2 for xi, yi in zip(x, y)))
```
### Manhattan

```python
# Grid distance (like walking city blocks)
d = sum(abs(xi - yi) for xi, yi in zip(x, y))
```
### Minkowski

```python
# Generalization of both
knn = KNeighborsClassifier(metric='minkowski', p=2)  # p=2 is Euclidean, p=1 is Manhattan
```
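To see these metrics side by side on a single pair of points, here is a small sketch using `scipy.spatial.distance` (the point values are arbitrary):

```python
from scipy.spatial.distance import euclidean, cityblock, minkowski

a, b = [1.0, 2.0, 3.0], [4.0, 6.0, 3.0]

print(euclidean(a, b))       # 5.0  (straight-line)
print(cityblock(a, b))       # 7.0  (Manhattan / city block)
print(minkowski(a, b, p=2))  # 5.0  (same as Euclidean when p=2)
```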
## KNN for Regression
```python
from sklearn.neighbors import KNeighborsRegressor

# Predict house price based on the nearest neighbors
knn_reg = KNeighborsRegressor(n_neighbors=5)
knn_reg.fit(X_train, y_train)

# Prediction = average of the 5 nearest neighbors' prices
predicted_price = knn_reg.predict(new_house)
```
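The snippet above assumes `X_train`, `y_train`, and `new_house` already exist. Here is a self-contained sketch on synthetic data from `make_regression` (standing in for house features and prices, not real data):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for house features/prices
X, y = make_regression(n_samples=300, n_features=4, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

knn_reg = KNeighborsRegressor(n_neighbors=5)
knn_reg.fit(X_train, y_train)
print(f"R^2 on test set: {knn_reg.score(X_test, y_test):.2f}")

# Each prediction is the mean target of the 5 nearest training points
print(knn_reg.predict(X_test[:1]))
```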
## Pros and Cons
### Pros ✅

- Dead simple to understand
- No training phase (lazy learner)
- Works for classification AND regression
- Naturally handles multi-class problems
- Non-parametric (no assumptions about the data distribution)
### Cons ❌

- Slow predictions (must compare against all stored points)
- Memory hungry (stores the entire training set)
- Sensitive to irrelevant features
- Requires feature scaling
- Struggles in high dimensions ("curse of dimensionality")
## When to Use KNN
**Good for:**

- Small to medium datasets
- Quick baselines
- Recommendation systems (find similar users/items)
- Anomaly detection (a point far from all its neighbors is suspicious)
**Avoid when:**

- The dataset is very large (predictions get slow)
- The data is high-dimensional (more than ~20 features)
- Features differ greatly in importance
## Practical Tips
```python
# 1. Always scale
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 2. Use cross-validation to find k

# 3. Consider weighted voting (closer neighbors get more weight)
knn = KNeighborsClassifier(n_neighbors=5, weights='distance')

# 4. For large datasets, speed up the neighbor search with tree structures
#    (KDTree / BallTree -- exact, but much faster than brute force)
knn = KNeighborsClassifier(n_neighbors=5, algorithm='kd_tree')
```
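One way to combine tips 1-3 is a scikit-learn Pipeline tuned with GridSearchCV; here is a sketch (the parameter grid values are just a reasonable starting range, not the only sensible choice):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# Scaling happens inside the pipeline, so cross-validation never leaks test data
pipe = Pipeline([("scale", StandardScaler()),
                 ("knn", KNeighborsClassifier())])

param_grid = {
    "knn__n_neighbors": list(range(1, 31, 2)),  # odd values of k
    "knn__weights": ["uniform", "distance"],    # plain vs. distance-weighted voting
}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, f"{search.best_score_:.2%}")
```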
## Key Takeaway
KNN is the ultimate "common sense" algorithm—just look at what's nearby. It's a great starting point and often performs surprisingly well. Just remember: scale your features!