ML · 7 min read

DBSCAN: Density-Based Clustering

Learn DBSCAN, a density-based clustering algorithm that finds clusters of any shape and automatically detects outliers.

Sarah Chen
December 19, 2025

K-Means assumes roughly spherical clusters. DBSCAN groups points by density instead, so it finds clusters of any shape, and as a bonus it automatically identifies outliers.

K-Means vs DBSCAN

K-Means struggles:          DBSCAN handles:
    ●●●●                       ●●●●
   ●    ●                     ●    ●
  ●      ●   ○○               ●      ●   ○○
   ●    ●   ○  ○              ●    ●   ○  ○
    ●●●●   ○○                  ●●●●   ○○
    
  (K-Means splits ring)    (DBSCAN finds both)
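
To make the comparison concrete, here is a minimal sketch on synthetic ring data (the eps value is illustrative; scikit-learn's make_circles generates the rings):

from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_circles

# Two concentric rings: the classic shape K-Means cannot separate
X, _ = make_circles(n_samples=500, factor=0.4, noise=0.05, random_state=42)

# K-Means with k=2 cuts straight across both rings
km_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)

# DBSCAN follows the density along each ring instead
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

print("K-Means labels:", set(km_labels))  # two halves, each mixing both rings
print("DBSCAN labels:", set(db_labels))   # one cluster per ring (-1 would mark noise)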

How DBSCAN Works

Two parameters:

  • eps (ε): Neighborhood radius; two points within eps of each other count as neighbors
  • min_samples: Minimum number of points (including the point itself) needed to form a dense region

Algorithm:

  1. For each point, count neighbors within eps distance
  2. If neighbors >= min_samples, it's a core point
  3. Core points close together form clusters
  4. Points near core points but not core themselves are border points
  5. Points that are neither core nor border are noise/outliers (see the sketch after this list)
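
Here is a from-scratch sketch of the core/border/noise classification in pure NumPy (brute-force distances; the classify_points helper is hypothetical and purely illustrative, scikit-learn's implementation is far more efficient):

import numpy as np

def classify_points(X, eps, min_samples):
    """Label each point as 'core', 'border', or 'noise' (steps 1-5 above)."""
    # Brute-force pairwise Euclidean distances, shape (n, n)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

    # Steps 1-2: core points have >= min_samples neighbors within eps
    # (the point itself counts, matching scikit-learn's convention)
    is_core = (dists <= eps).sum(axis=1) >= min_samples

    labels = []
    for i in range(len(X)):
        if is_core[i]:
            labels.append("core")                    # step 2
        elif is_core[dists[i] <= eps].any():
            labels.append("border")                  # step 4: near a core point
        else:
            labels.append("noise")                   # step 5
    return labels

# Example: three tight points and one isolated point
pts = np.array([[0.0, 0.0], [0.0, 0.1], [0.1, 0.0], [5.0, 5.0]])
print(classify_points(pts, eps=0.5, min_samples=3))
# -> ['core', 'core', 'core', 'noise']

Step 3, chaining nearby core points into clusters, is the part this sketch leaves out; it amounts to a connected-components pass over the core points.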

Implementation

from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Scale the data first (DBSCAN is distance-based, so feature scale matters)
X_scaled = StandardScaler().fit_transform(X)  # X: your feature matrix

# Apply DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=5)
clusters = dbscan.fit_predict(X_scaled)

# -1 means outlier
print(f"Clusters found: {len(set(clusters)) - (1 if -1 in clusters else 0)}")
print(f"Outliers: {sum(clusters == -1)}")

Finding Good Parameters

Method 1: K-Distance Plot

import matplotlib.pyplot as plt
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Distance to the k-th nearest neighbor (k = min_samples). Because the
# query set is also the fitted set, each point's nearest "neighbor" is
# itself at distance 0 -- which matches sklearn counting the point
# itself toward min_samples.
k = 5
neighbors = NearestNeighbors(n_neighbors=k)
neighbors.fit(X_scaled)
distances, _ = neighbors.kneighbors(X_scaled)

# Sort distances to k-th neighbor
k_distances = np.sort(distances[:, k-1])

# Plot - look for "elbow"
plt.plot(k_distances)
plt.xlabel('Points')
plt.ylabel(f'Distance to {k}th neighbor')
plt.title('K-Distance Plot (elbow = good eps)')
plt.show()

Method 2: Silhouette Score

from sklearn.metrics import silhouette_score

best_score = -1
best_params = None

for eps in [0.3, 0.5, 0.7, 1.0]:
    for min_samples in [3, 5, 10]:
        dbscan = DBSCAN(eps=eps, min_samples=min_samples)
        labels = dbscan.fit_predict(X_scaled)
        
        # Need at least 2 real clusters (don't count the -1 noise label)
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        if n_clusters >= 2:
            # Noise points are scored as their own group here, which
            # penalizes parameter choices that produce heavy noise
            score = silhouette_score(X_scaled, labels)
            if score > best_score:
                best_score = score
                best_params = (eps, min_samples)

print(f"Best params: eps={best_params[0]}, min_samples={best_params[1]}")

DBSCAN vs K-Means

Aspect                    K-Means      DBSCAN
------------------------  -----------  ----------------
Clusters needed upfront   Yes          No
Cluster shapes            Spherical    Any shape
Outlier detection         No           Built-in
Cluster sizes             Similar      Can vary
Parameters                k            eps, min_samples
Speed                     Faster       Slower

When to Use DBSCAN

Good for:

  • Unknown number of clusters
  • Non-spherical clusters
  • Need outlier detection
  • Clusters of varying sizes

Not ideal for:

  • Very high-dimensional data
  • Clusters with very different densities (demonstrated below)
  • When you need soft assignments
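
The varying-density caveat is easy to reproduce. In this sketch (the blob parameters are illustrative), an eps tuned for a tight blob turns a diffuse blob into noise:

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# One tight blob and one very diffuse blob
X_dense, _ = make_blobs(n_samples=200, centers=[[0, 0]], cluster_std=0.2, random_state=0)
X_sparse, _ = make_blobs(n_samples=200, centers=[[10, 10]], cluster_std=3.0, random_state=0)
X = np.vstack([X_dense, X_sparse])

# An eps that suits the dense blob shreds the sparse one into noise
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)
print("Noise points:", (labels == -1).sum())  # mostly from the sparse blob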

Handling Outliers

# Get cluster assignments
clusters = dbscan.fit_predict(X_scaled)

# Separate normal points and outliers
normal_mask = clusters != -1
outlier_mask = clusters == -1

X_normal = X[normal_mask]      # rows kept, in original (unscaled) units
X_outliers = X[outlier_mask]

print(f"Normal points: {sum(normal_mask)}")
print(f"Outliers: {sum(outlier_mask)}")

Key Takeaway

DBSCAN shines when clusters aren't spherical or when you don't know how many clusters exist. It automatically flags outliers, which K-Means can't do. Use the K-distance plot to find good eps values, and for min_samples a common rule of thumb is at least the number of features plus one (many practitioners use twice the number of features). Perfect for anomaly detection!

#Machine Learning#DBSCAN#Clustering#Unsupervised Learning#Intermediate