
K-Nearest Neighbors (KNN) Algorithm

Learn KNN - one of the simplest and most intuitive classification algorithms, built entirely on similarity.

Sarah Chen
December 19, 2025

K-Nearest Neighbors (KNN)

KNN is beautifully simple: to classify something new, look at its closest neighbors and go with the majority.

The Idea

"Tell me who your friends are, and I'll tell you who you are."

New point: ?
Nearest neighbors: 🔵 🔵 🔴

If k=3, vote:
🔵 = 2 votes
🔴 = 1 vote

Winner: 🔵

That's it! No training, no learned parameters—just find neighbors and vote.

How It Works

Step 1: Choose k (number of neighbors)

k = 5  # Look at 5 nearest neighbors

Step 2: Calculate distances to all training points

# Euclidean distance (most common)
distance = sqrt((x1-x2)² + (y1-y2)²)

Step 3: Find k nearest neighbors

Sort by distance, take top k.
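
In NumPy terms, this step might look like the following fragment (variable names are illustrative; distances is the array from Step 2 and np is NumPy):

nearest_idx = np.argsort(distances)[:k]   # indices of the k smallest distances
neighbors = y_train[nearest_idx]          # their labels (or values, for regression)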

Step 4: Vote (classification) or Average (regression)

# Classification: majority vote
prediction = most_common_class(neighbors)

# Regression: average value
prediction = mean(neighbor_values)
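
Putting the four steps together, here's a minimal from-scratch sketch of a KNN classifier (pure NumPy; knn_predict is an illustrative name, not a library function):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    # Step 2: Euclidean distance from x_new to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step 3: indices of the k nearest neighbors
    nearest_idx = np.argsort(distances)[:k]
    # Step 4: majority vote among their labels
    return Counter(y_train[nearest_idx]).most_common(1)[0][0]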

Code Example

from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Split first, then scale (fit the scaler on the training data only)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# IMPORTANT: Scale features!
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Create KNN classifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Evaluate
accuracy = knn.score(X_test, y_test)
print(f"Accuracy: {accuracy:.2%}")

# Predict new sample
new_flower = scaler.transform([[5.0, 3.4, 1.5, 0.2]])
prediction = knn.predict(new_flower)
print(f"Predicted class: {iris.target_names[prediction[0]]}")

Choosing k

k too small (like k=1)

  • Sensitive to noise
  • Overfitting
  • Unstable predictions

k too large (like k=100)

  • Over-smoothing
  • Underfitting
  • Slow predictions

Rule of Thumb

k = int(sqrt(n_samples))  # Square root of sample size

For binary classification, use an odd k to avoid ties (with more than two classes, ties can still occur).
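
A quick sketch of that heuristic (the sample size here is just an example):

import math

n_samples = 150                 # e.g. the size of a training set
k = int(math.sqrt(n_samples))   # 12
if k % 2 == 0:
    k += 1                      # make it odd -> 13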

Find Best k with Cross-Validation

from sklearn.model_selection import cross_val_score

k_values = range(1, 31)
cv_scores = []

for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_train, y_train, cv=5)
    cv_scores.append(scores.mean())

best_k = k_values[cv_scores.index(max(cv_scores))]
print(f"Best k: {best_k}")

Why Feature Scaling Is Critical

KNN uses distances. If features have different scales, larger ones dominate!

Without scaling:
- Income: 50,000 vs 80,000 (diff: 30,000)
- Age: 25 vs 45 (diff: 20)

Income dominates completely!

With scaling (both 0-1):
- Income: 0.5 vs 0.8 (diff: 0.3)
- Age: 0.25 vs 0.45 (diff: 0.2)

Now both contribute fairly.

Always scale features before KNN!
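
A small NumPy demonstration of that effect (the income and age ranges are made up for illustration):

import numpy as np

a = np.array([50_000, 25])   # person A: (income, age)
b = np.array([80_000, 45])   # person B

# Unscaled: the 30,000 income gap swamps the 20-year age gap
print(np.linalg.norm(a - b))                     # ~30000.0

# Min-max scaled to 0-1 (assuming income spans 0-100k, age 0-100)
a_s, b_s = a / [100_000, 100], b / [100_000, 100]
print(np.linalg.norm(a_s - b_s))                 # ~0.36, both features now count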

Distance Metrics

Euclidean (default)

# Straight-line distance
d = sqrt(Σ(xi - yi)²)

Manhattan

# Grid distance (like walking city blocks)
d = Σ|xi - yi|

Minkowski

# Generalization of both: p=1 gives Manhattan, p=2 gives Euclidean
knn = KNeighborsClassifier(metric='minkowski', p=2)
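
To make the difference concrete, here's a tiny comparison of the two metrics on one pair of points (values chosen for illustration):

import numpy as np

x, y = np.array([1.0, 2.0]), np.array([4.0, 6.0])

euclidean = np.sqrt(((x - y) ** 2).sum())   # 5.0 (straight line)
manhattan = np.abs(x - y).sum()             # 7.0 (city blocks)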

KNN for Regression

from sklearn.neighbors import KNeighborsRegressor

# Predict house price based on neighbors
# (X_train, y_train: scaled house features and their known prices)
knn_reg = KNeighborsRegressor(n_neighbors=5)
knn_reg.fit(X_train, y_train)

# Prediction = average of the 5 nearest neighbors' prices
# (new_house must be scaled with the same scaler as the training data)
predicted_price = knn_reg.predict(new_house)
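
For a self-contained version, here is a sketch on a tiny made-up dataset (the numbers are invented; with more than one feature you would scale first):

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Toy data: house size in m² -> price in $1000s
X_train = np.array([[50], [60], [70], [80], [90], [100], [120]])
y_train = np.array([150, 180, 200, 230, 260, 290, 340])

knn_reg = KNeighborsRegressor(n_neighbors=3)
knn_reg.fit(X_train, y_train)

# Neighbors of 76 m² are the 70, 80 and 90 m² houses -> mean of 200, 230, 260
print(knn_reg.predict([[76]]))   # [230.]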

Pros and Cons

Pros ✅

  • Dead simple to understand
  • No training phase (lazy learner)
  • Works for classification AND regression
  • Naturally handles multi-class
  • Non-parametric (no assumptions)

Cons ❌

  • Slow predictions (must check all points)
  • Memory hungry (stores all data)
  • Sensitive to irrelevant features
  • Requires feature scaling
  • Struggles with high dimensions ("curse of dimensionality")

When to Use KNN

Good for:

  • Small to medium datasets
  • When you need a quick baseline
  • Recommendation systems (find similar users/items)
  • Anomaly detection (a point far from all its neighbors; see the sketch after these lists)

Avoid when:

  • Very large datasets (slow)
  • High-dimensional data (>20 features)
  • Features differ widely in importance (KNN weights them all equally)
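
As an illustration of the anomaly-detection use above, this sketch scores each point by its distance to its 3rd nearest neighbor using scikit-learn's NearestNeighbors (the data and threshold are made up):

import numpy as np
from sklearn.neighbors import NearestNeighbors

# Four clustered points plus one obvious outlier
X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.2], [1.2, 1.1], [8.0, 8.0]])

nn = NearestNeighbors(n_neighbors=4).fit(X)   # 4 = the point itself + 3 neighbors
distances, _ = nn.kneighbors(X)
scores = distances[:, -1]                     # distance to the 3rd real neighbor

print(scores)          # the outlier's score (~9.9) dwarfs the others (<0.4)
print(scores > 3.0)    # a made-up threshold flags only the outlier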

Practical Tips

# 1. Always scale
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 2. Use cross-validation to find k
# 3. Consider weighted voting (closer = more weight)
knn = KNeighborsClassifier(n_neighbors=5, weights='distance')

# 4. For large datasets, speed up the neighbor search with tree structures
#    (exact, but much faster than brute force in low/medium dimensions)
knn = KNeighborsClassifier(n_neighbors=5, algorithm='kd_tree')  # or 'ball_tree'

Key Takeaway

KNN is the ultimate "common sense" algorithm—just look at what's nearby. It's a great starting point and often performs surprisingly well. Just remember: scale your features!

#MachineLearning #KNN #Classification #Beginner