ML · 7 min read
DBSCAN: Density-Based Clustering
Learn DBSCAN, a density-based clustering algorithm that finds clusters of any shape and automatically detects outliers.
Sarah Chen
December 19, 2025
K-Means assumes spherical clusters. DBSCAN finds clusters of any shape based on density and, as a bonus, automatically identifies outliers.
K-Means vs DBSCAN
[Figure: a ring-shaped cluster with a second cluster inside it. K-Means splits the ring apart; DBSCAN finds both the ring and the inner cluster.]
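To see the difference concretely, here is a minimal sketch on synthetic ring data (the eps=0.2 setting is just an illustrative choice for this dataset, not a recommendation):
from sklearn.cluster import KMeans, DBSCAN
from sklearn.datasets import make_circles
# Two concentric rings -- no spherical structure for K-Means to exploit
X_rings, _ = make_circles(n_samples=500, factor=0.4, noise=0.05, random_state=42)
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_rings)
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X_rings)
# K-Means cuts straight through the rings; DBSCAN recovers each one
print("K-Means labels:", set(kmeans_labels))
print("DBSCAN labels:", set(dbscan_labels))  # -1, if present, marks noise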
How DBSCAN Works
Two parameters:
- eps (ε): maximum distance for two points to count as neighbors
- min_samples: minimum number of points (including the point itself) needed to form a dense region
Algorithm (the core-point test is sketched in code below):
- For each point, count its neighbors within eps distance
- If the count >= min_samples, it's a core point
- Core points within eps of each other link up into clusters
- Points within eps of a core point, but not core themselves, are border points
- Points that are neither core nor border are noise/outliers
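To make the core-point test concrete, here is a minimal NumPy sketch of just the neighbor-counting step (illustrative only; it skips the cluster-expansion part that links core points together):
import numpy as np
def find_core_points(X, eps=0.5, min_samples=5):
    # Pairwise Euclidean distances between all points (fine for small X)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Count neighbors within eps; each point counts itself, as scikit-learn does
    neighbor_counts = (dists <= eps).sum(axis=1)
    # True where the point is a core point
    return neighbor_counts >= min_samples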
Implementation
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
# Scale the data first -- eps is a distance, so feature scales matter
X_scaled = StandardScaler().fit_transform(X)
# Apply DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=5)
clusters = dbscan.fit_predict(X_scaled)
# Label -1 marks outliers (noise), so don't count it as a cluster
n_clusters = len(set(clusters)) - (1 if -1 in clusters else 0)
print(f"Clusters found: {n_clusters}")
print(f"Outliers: {(clusters == -1).sum()}")
Finding Good Parameters
Method 1: K-Distance Plot
from sklearn.neighbors import NearestNeighbors
import numpy as np
import matplotlib.pyplot as plt
# Find the distance to each point's k-th nearest neighbor
k = 5  # same as min_samples
neighbors = NearestNeighbors(n_neighbors=k)
neighbors.fit(X_scaled)
# Each point is its own nearest neighbor (distance 0), which matches
# scikit-learn counting the point itself toward min_samples
distances, _ = neighbors.kneighbors(X_scaled)
# Sort the distances to the k-th neighbor
k_distances = np.sort(distances[:, k-1])
# Plot -- the y-value at the "elbow" is a good eps candidate
plt.plot(k_distances)
plt.xlabel('Points (sorted)')
plt.ylabel(f'Distance to {k}th neighbor')
plt.title('K-Distance Plot (elbow = good eps)')
plt.show()
Method 2: Silhouette Score
from sklearn.metrics import silhouette_score
best_score = -1
best_params = None
for eps in [0.3, 0.5, 0.7, 1.0]:
    for min_samples in [3, 5, 10]:
        dbscan = DBSCAN(eps=eps, min_samples=min_samples)
        labels = dbscan.fit_predict(X_scaled)
        # Need at least 2 real clusters; the noise label -1 doesn't count
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        if n_clusters >= 2:
            # Note: noise points get scored as their own "cluster" here
            score = silhouette_score(X_scaled, labels)
            if score > best_score:
                best_score = score
                best_params = (eps, min_samples)
if best_params is not None:
    print(f"Best params: eps={best_params[0]}, min_samples={best_params[1]}")
DBSCAN vs K-Means
| Aspect | K-Means | DBSCAN |
|---|---|---|
| Clusters needed upfront | Yes | No |
| Cluster shapes | Spherical | Any shape |
| Outlier detection | No | Built-in |
| Cluster sizes | Similar | Can vary |
| Parameters | k | eps, min_samples |
| Speed | Faster | Slower |
When to Use DBSCAN
Good for:
- Unknown number of clusters
- Non-spherical clusters
- Need outlier detection
- Clusters of varying size and shape
Not ideal for:
- Very high-dimensional data
- Clusters with very different densities
- When you need soft assignments
Handling Outliers
# Get cluster assignments
clusters = dbscan.fit_predict(X_scaled)
# Separate normal points from outliers (label -1)
normal_mask = clusters != -1
outlier_mask = clusters == -1
X_normal = X[normal_mask]
X_outliers = X[outlier_mask]
print(f"Normal points: {normal_mask.sum()}")
print(f"Outliers: {outlier_mask.sum()}")
Key Takeaway
DBSCAN shines when clusters aren't spherical or when you don't know how many clusters exist. It automatically handles outliers, which K-Means can't do. Use the K-distance plot to find good eps values, and remember that min_samples should scale with your data size. Perfect for anomaly detection!
#Machine Learning · #DBSCAN · #Clustering · #Unsupervised Learning · #Intermediate