ML · 7 min read
K-Nearest Neighbors (KNN) Algorithm
Learn KNN - one of the simplest and most intuitive classification algorithms, built on similarity.
Sarah Chen
December 19, 2025
K-Nearest Neighbors (KNN)
KNN is beautifully simple: to classify something new, look at its closest neighbors and go with the majority.
The Idea
"Tell me who your friends are, and I'll tell you who you are."
New point: ?
Nearest neighbors: 🔵 🔵 🔴
If k=3, vote:
🔵 = 2 votes
🔴 = 1 vote
Winner: 🔵
That's it! No training, no learned parameters—just find neighbors and vote.
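In code, the vote above is just a majority count. Here's a toy sketch using Python's collections.Counter (the label names stand in for the picture above):
from collections import Counter
# Labels of the k=3 nearest neighbors from the picture
neighbor_labels = ["blue", "blue", "red"]
# Majority vote: the most common label wins
winner = Counter(neighbor_labels).most_common(1)[0][0]
print(winner)  # blue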
How It Works
Step 1: Choose k (number of neighbors)
k = 5 # Look at 5 nearest neighbors
Step 2: Calculate distances to all training points
# Euclidean distance (most common)
distance = sqrt((x1-x2)² + (y1-y2)²)
Step 3: Find k nearest neighbors
Sort by distance, take top k.
Step 4: Vote (classification) or Average (regression)
# Classification: majority vote
prediction = most_common_class(neighbors)
# Regression: average value
prediction = mean(neighbor_values)
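Put together, the four steps fit in a few lines of NumPy. This is a from-scratch sketch to show the mechanics (the function name and toy data are made up for illustration); in practice you'd use scikit-learn, as in the next section:
import numpy as np
from collections import Counter
def knn_predict(X_train, y_train, x_new, k=5):
    # Step 2: Euclidean distance from x_new to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step 3: indices of the k nearest neighbors
    nearest = np.argsort(distances)[:k]
    # Step 4: majority vote among their labels
    return Counter(y_train[nearest]).most_common(1)[0][0]
X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([1.2, 1.9]), k=3))  # 0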
Code Example
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Load data
iris = load_iris()
X, y = iris.data, iris.target
# Split first, then scale (fitting the scaler on all data would leak test information)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# IMPORTANT: Scale features! Fit the scaler on training data only
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Create KNN classifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
# Evaluate
accuracy = knn.score(X_test, y_test)
print(f"Accuracy: {accuracy:.2%}")
# Predict new sample
new_flower = scaler.transform([[5.0, 3.4, 1.5, 0.2]])
prediction = knn.predict(new_flower)
print(f"Predicted class: {iris.target_names[prediction[0]]}")
Choosing k
k too small (like k=1)
- Sensitive to noise
- Overfitting
- Unstable predictions
k too large (like k=100)
- Over-smoothing
- Underfitting
- Slow predictions
Rule of Thumb
k = int(sqrt(n_samples)) # Square root of sample size
For binary classification, use an odd k to avoid ties.
Find Best k with Cross-Validation
from sklearn.model_selection import cross_val_score
k_values = range(1, 31)
cv_scores = []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_train, y_train, cv=5)
    cv_scores.append(scores.mean())
best_k = k_values[cv_scores.index(max(cv_scores))]
print(f"Best k: {best_k}")
Why Feature Scaling Is Critical
KNN uses distances. If features have different scales, larger ones dominate!
Without scaling:
- Income: 50,000 vs 80,000 (diff: 30,000)
- Age: 25 vs 45 (diff: 20)
Income dominates completely!
With scaling (both 0-1):
- Income: 0.5 vs 0.8 (diff: 0.3)
- Age: 0.25 vs 0.45 (diff: 0.2)
Now both contribute fairly.
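Here's the income/age example as a quick sketch (the 0-1 divisors are illustrative assumptions):
import numpy as np
# Two people: [income, age]
a = np.array([50_000, 25])
b = np.array([80_000, 45])
# Without scaling, the income gap swamps the age gap
print(np.linalg.norm(a - b))  # ~30000: effectively all income
# Rough 0-1 scaling (divide income by 100k, age by 100, as in the text)
a_s, b_s = a / [100_000, 100], b / [100_000, 100]
print(np.linalg.norm(a_s - b_s))  # ~0.36: both features now matter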
Always scale features before KNN!
Distance Metrics
Euclidean (default)
# Straight-line distance
d = sqrt(Σ(xi - yi)²)
Manhattan
# Grid distance (like walking city blocks)
d = Σ|xi - yi|
Minkowski
# Generalization of both
knn = KNeighborsClassifier(metric='minkowski', p=2) # p=2 is Euclidean
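To check these metrics concretely, here's a quick sketch with SciPy's distance functions:
from scipy.spatial import distance
a, b = [0, 0], [3, 4]
print(distance.euclidean(a, b))       # 5.0 (straight line)
print(distance.cityblock(a, b))       # 7   (Manhattan: 3 + 4)
print(distance.minkowski(a, b, p=2))  # 5.0 (p=2 recovers Euclidean)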
KNN for Regression
from sklearn.neighbors import KNeighborsRegressor
# Predict house price based on neighbors
knn_reg = KNeighborsRegressor(n_neighbors=5)
knn_reg.fit(X_train, y_train)  # y_train holds prices here, not class labels
# Prediction = average of the 5 nearest neighbors' prices
predicted_price = knn_reg.predict(new_house)  # new_house: scaled feature row(s)
Pros and Cons
Pros ✅
- Dead simple to understand
- No training phase (lazy learner)
- Works for classification AND regression
- Naturally handles multi-class
- Non-parametric (no assumptions about the underlying data distribution)
Cons ❌
- Slow predictions (must check all points)
- Memory hungry (stores all data)
- Sensitive to irrelevant features
- Requires feature scaling
- Struggles with high dimensions ("curse of dimensionality"; see the demo after this list)
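That last point is easy to see empirically: in high dimensions, the nearest and farthest neighbors end up almost equally far away, so "nearest" loses its meaning. A quick sketch with random points:
import numpy as np
rng = np.random.default_rng(0)
for dim in [2, 10, 100, 1000]:
    points = rng.random((1000, dim))
    # Distances from the first point to all the others
    dists = np.linalg.norm(points[1:] - points[0], axis=1)
    # As dim grows, this ratio approaches 1: everything looks equally far
    print(f"dim={dim:4d}  nearest/farthest = {dists.min() / dists.max():.2f}")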
When to Use KNN
Good for:
- Small to medium datasets
- When you need quick baseline
- Recommendation systems (find similar users/items)
- Anomaly detection (a point far from all its neighbors; see the sketch after this list)
Avoid when:
- Very large datasets (slow)
- High-dimensional data (>20 features)
- Features have different importance
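As an example of the anomaly-detection use case above, here's a sketch with scikit-learn's NearestNeighbors (the synthetic data and the 3-sigma threshold are arbitrary assumptions):
import numpy as np
from sklearn.neighbors import NearestNeighbors
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))        # normal cluster
X = np.vstack([X, [[8.0, 8.0]]])     # plant one obvious outlier
nn = NearestNeighbors(n_neighbors=5).fit(X)
dists, _ = nn.kneighbors(X)          # distances to the 5 nearest points
avg_dist = dists[:, 1:].mean(axis=1) # column 0 is the point itself
# Flag points whose neighbors are unusually far away
outliers = np.where(avg_dist > avg_dist.mean() + 3 * avg_dist.std())[0]
print(outliers)  # includes 200, the planted outlier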
Practical Tips
# 1. Always scale
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# 2. Use cross-validation to find k
# 3. Consider weighted voting (closer = more weight)
knn = KNeighborsClassifier(n_neighbors=5, weights='distance')
# 4. For large datasets, use tree-based neighbor search (exact, but much faster than brute force)
knn = KNeighborsClassifier(n_neighbors=5, algorithm='ball_tree')  # or 'kd_tree'
Key Takeaway
KNN is the ultimate "common sense" algorithm—just look at what's nearby. It's a great starting point and often performs surprisingly well. Just remember: scale your features!
#Machine Learning #KNN #Classification #Beginner