# K-Nearest Neighbors (KNN) Algorithm

Learn KNN, one of the simplest and most intuitive classification algorithms: classify a point by looking at the points most similar to it.
KNN is beautifully simple: to classify something new, look at its closest neighbors and go with the majority.
## The Idea
"Tell me who your friends are, and I'll tell you who you are."
```
New point: ?
Nearest neighbors: 🔵 🔵 🔴

If k=3, vote:
  🔵 = 2 votes
  🔴 = 1 vote

Winner: 🔵
```
That's it! No training, no learned parameters—just find neighbors and vote.
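To make the voting step concrete, here is a minimal sketch in plain Python. The points, labels, and query are made-up toy values chosen to mirror the diagram above:

```python
from collections import Counter
from math import dist  # Euclidean distance, Python 3.8+

# Toy training data: (point, label) pairs -- made-up values for illustration
training_data = [((1.0, 1.0), "blue"), ((1.2, 0.8), "blue"), ((1.4, 1.3), "red"),
                 ((5.0, 5.0), "red"), ((5.2, 4.8), "blue")]
query = (1.1, 1.0)
k = 3

# Sort by distance to the query, take the k closest, and vote
neighbors = sorted(training_data, key=lambda item: dist(item[0], query))[:k]
votes = Counter(label for _, label in neighbors)
print(votes.most_common(1)[0][0])  # -> "blue" (2 votes vs 1)
```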
## How It Works
### Step 1: Choose k (number of neighbors)

```python
k = 5  # Look at the 5 nearest neighbors
```
### Step 2: Calculate distances to all training points

```python
# Euclidean distance (most common)
distance = sqrt((x1 - x2)**2 + (y1 - y2)**2)
```
### Step 3: Find the k nearest neighbors

Sort by distance and take the top k.
### Step 4: Vote (classification) or Average (regression)

```python
# Classification: majority vote
prediction = most_common_class(neighbors)

# Regression: average value
prediction = mean(neighbor_values)
```
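Putting the four steps together, here is a from-scratch sketch with NumPy (not the scikit-learn implementation); the function name `knn_predict` and the toy arrays are illustrative:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    """Classify x_new by majority vote among its k nearest training points."""
    # Step 2: Euclidean distances from x_new to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step 3: indices of the k closest points
    nearest = np.argsort(distances)[:k]
    # Step 4: majority vote among their labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy example (values are made up)
X_train = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.2, 1.9]), k=3))  # -> 0
```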
## Code Example
```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Split first, then scale (fit the scaler on training data only to avoid leakage)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# IMPORTANT: Scale features!
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Create KNN classifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Evaluate
accuracy = knn.score(X_test, y_test)
print(f"Accuracy: {accuracy:.2%}")

# Predict a new sample (scale it with the same scaler)
new_flower = scaler.transform([[5.0, 3.4, 1.5, 0.2]])
prediction = knn.predict(new_flower)
print(f"Predicted class: {iris.target_names[prediction[0]]}")
```
## Choosing k
### k too small (like k=1)

- Sensitive to noise
- Overfitting
- Unstable predictions
### k too large (like k=100)

- Over-smoothing
- Underfitting
- Slow predictions
### Rule of Thumb

```python
k = int(sqrt(n_samples))  # Square root of the sample size
```
For classification, use **odd k** to avoid ties.
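As a quick, runnable sketch of that rule of thumb (the sample count is just an example value):

```python
from math import sqrt

n_samples = 150              # e.g., size of the training set
k = int(sqrt(n_samples))     # rule-of-thumb starting point -> 12
if k % 2 == 0:
    k += 1                   # make k odd to avoid ties in binary voting
print(k)                     # -> 13
```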
### Find Best k with Cross-Validation
```python
from sklearn.model_selection import cross_val_score

k_values = range(1, 31)
cv_scores = []

for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_train, y_train, cv=5)
    cv_scores.append(scores.mean())

best_k = k_values[cv_scores.index(max(cv_scores))]
print(f"Best k: {best_k}")
```
## Why Feature Scaling Is Critical
KNN uses distances. If features have different scales, larger ones dominate!
```
Without scaling:
- Income: 50,000 vs 80,000  (diff: 30,000)
- Age:    25 vs 45          (diff: 20)

Income dominates the distance completely!

With scaling (both mapped to 0-1):
- Income: 0.5 vs 0.8    (diff: 0.3)
- Age:    0.25 vs 0.45  (diff: 0.2)

Now both contribute fairly.
```
**Always scale features before KNN!**
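A minimal sketch of that effect, using made-up income/age values and scikit-learn's MinMaxScaler (0-1 scaling, as in the illustration above):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up [income, age] rows
X = np.array([[50_000, 25],
              [80_000, 45],
              [20_000, 60],
              [100_000, 30]], dtype=float)

a, b = X[0], X[1]
print(np.linalg.norm(a - b))                # ~30000: income swamps age

X_scaled = MinMaxScaler().fit_transform(X)  # map each feature to [0, 1]
a_s, b_s = X_scaled[0], X_scaled[1]
print(np.linalg.norm(a_s - b_s))            # now both features contribute
```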
## Distance Metrics
### Euclidean (default)

```python
# Straight-line distance
d = sqrt(sum((xi - yi)**2 for xi, yi in zip(x, y)))
```
### Manhattan

```python
# Grid distance (like walking city blocks)
d = sum(abs(xi - yi) for xi, yi in zip(x, y))
```
### Minkowski

```python
# Generalization of both
knn = KNeighborsClassifier(metric='minkowski', p=2)  # p=2 is Euclidean, p=1 is Manhattan
```
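To see these metrics side by side on a single pair of points, here is a small sketch using `scipy.spatial.distance` (the point values are arbitrary):

```python
from scipy.spatial.distance import euclidean, cityblock, minkowski

a, b = [1.0, 2.0, 3.0], [4.0, 6.0, 3.0]

print(euclidean(a, b))       # 5.0  (straight-line)
print(cityblock(a, b))       # 7.0  (Manhattan / city block)
print(minkowski(a, b, p=2))  # 5.0  (same as Euclidean when p=2)
```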
## KNN for Regression
```python
from sklearn.neighbors import KNeighborsRegressor

# Predict house price based on the nearest neighbors
knn_reg = KNeighborsRegressor(n_neighbors=5)
knn_reg.fit(X_train, y_train)

# Prediction = average of the 5 nearest neighbors' prices
predicted_price = knn_reg.predict(new_house)
```
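The snippet above assumes `X_train`, `y_train`, and `new_house` already exist. Here is a self-contained sketch on synthetic data from `make_regression` (standing in for house features and prices, not real data):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for house features/prices
X, y = make_regression(n_samples=300, n_features=4, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

knn_reg = KNeighborsRegressor(n_neighbors=5)
knn_reg.fit(X_train, y_train)
print(f"R^2 on test set: {knn_reg.score(X_test, y_test):.2f}")

# Each prediction is the mean target of the 5 nearest training points
print(knn_reg.predict(X_test[:1]))
```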
## Pros and Cons
### Pros ✅

- Dead simple to understand
- No training phase (lazy learner)
- Works for classification AND regression
- Naturally handles multi-class problems
- Non-parametric (no assumptions about the data distribution)
### Cons ❌

- Slow predictions (must compare against all stored points)
- Memory hungry (stores the entire training set)
- Sensitive to irrelevant features
- Requires feature scaling
- Struggles in high dimensions ("curse of dimensionality")
## When to Use KNN
**Good for:**

- Small to medium datasets
- Quick baselines
- Recommendation systems (find similar users/items)
- Anomaly detection (a point far from all its neighbors is suspicious)
**Avoid when:**

- The dataset is very large (predictions get slow)
- The data is high-dimensional (more than ~20 features)
- Features differ greatly in importance
## Practical Tips
```python
# 1. Always scale
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 2. Use cross-validation to find k

# 3. Consider weighted voting (closer neighbors get more weight)
knn = KNeighborsClassifier(n_neighbors=5, weights='distance')

# 4. For large datasets, speed up the neighbor search with tree structures
#    (KDTree / BallTree -- exact, but much faster than brute force)
knn = KNeighborsClassifier(n_neighbors=5, algorithm='kd_tree')
```
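One way to combine tips 1-3 is a scikit-learn Pipeline tuned with GridSearchCV; here is a sketch (the parameter grid values are just a reasonable starting range, not the only sensible choice):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# Scaling happens inside the pipeline, so cross-validation never leaks test data
pipe = Pipeline([("scale", StandardScaler()),
                 ("knn", KNeighborsClassifier())])

param_grid = {
    "knn__n_neighbors": list(range(1, 31, 2)),  # odd values of k
    "knn__weights": ["uniform", "distance"],    # plain vs. distance-weighted voting
}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, f"{search.best_score_:.2%}")
```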
## Key Takeaway
KNN is the ultimate "common sense" algorithm—just look at what's nearby. It's a great starting point and often performs surprisingly well. Just remember: scale your features!