ML · 7 min read
DBSCAN: Density-Based Clustering
Learn DBSCAN, a density-based clustering algorithm that finds clusters of any shape and automatically detects outliers.
Sarah Chen
December 19, 2025
K-Means assumes spherical clusters. DBSCAN finds clusters of any shape based on density and, as a bonus, automatically identifies outliers.
K-Means vs DBSCAN
[Figure: a ring-shaped cluster with a second cluster inside it. K-Means splits the ring apart; DBSCAN finds both the ring and the inner cluster.]
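To see the difference concretely, here is a minimal sketch on synthetic ring data (the eps=0.2 setting is just an illustrative choice for this dataset, not a recommendation):
from sklearn.cluster import KMeans, DBSCAN
from sklearn.datasets import make_circles
# Two concentric rings -- no spherical structure for K-Means to exploit
X_rings, _ = make_circles(n_samples=500, factor=0.4, noise=0.05, random_state=42)
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_rings)
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X_rings)
# K-Means cuts straight through the rings; DBSCAN recovers each one
print("K-Means labels:", set(kmeans_labels))
print("DBSCAN labels:", set(dbscan_labels))  # -1, if present, marks noise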
How DBSCAN Works
Two parameters:
- eps (ε): maximum distance for two points to count as neighbors
- min_samples: minimum number of points (including the point itself) needed to form a dense region
Algorithm (the core-point test is sketched in code below):
- For each point, count its neighbors within eps distance
- If the count >= min_samples, it's a core point
- Core points within eps of each other link up into clusters
- Points within eps of a core point, but not core themselves, are border points
- Points that are neither core nor border are noise/outliers
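To make the core-point test concrete, here is a minimal NumPy sketch of just the neighbor-counting step (illustrative only; it skips the cluster-expansion part that links core points together):
import numpy as np
def find_core_points(X, eps=0.5, min_samples=5):
    # Pairwise Euclidean distances between all points (fine for small X)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Count neighbors within eps; each point counts itself, as scikit-learn does
    neighbor_counts = (dists <= eps).sum(axis=1)
    # True where the point is a core point
    return neighbor_counts >= min_samples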
Implementation
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
# Scale the data first -- eps is a distance, so feature scales matter
X_scaled = StandardScaler().fit_transform(X)
# Apply DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=5)
clusters = dbscan.fit_predict(X_scaled)
# Label -1 marks outliers (noise), so don't count it as a cluster
n_clusters = len(set(clusters)) - (1 if -1 in clusters else 0)
print(f"Clusters found: {n_clusters}")
print(f"Outliers: {(clusters == -1).sum()}")
Finding Good Parameters
Method 1: K-Distance Plot
from sklearn.neighbors import NearestNeighbors
import numpy as np
import matplotlib.pyplot as plt
# Find the distance to each point's k-th nearest neighbor
k = 5  # same as min_samples
neighbors = NearestNeighbors(n_neighbors=k)
neighbors.fit(X_scaled)
# Each point is its own nearest neighbor (distance 0), which matches
# scikit-learn counting the point itself toward min_samples
distances, _ = neighbors.kneighbors(X_scaled)
# Sort the distances to the k-th neighbor
k_distances = np.sort(distances[:, k-1])
# Plot -- the y-value at the "elbow" is a good eps candidate
plt.plot(k_distances)
plt.xlabel('Points (sorted)')
plt.ylabel(f'Distance to {k}th neighbor')
plt.title('K-Distance Plot (elbow = good eps)')
plt.show()
Method 2: Silhouette Score
from sklearn.metrics import silhouette_score
best_score = -1
best_params = None
for eps in [0.3, 0.5, 0.7, 1.0]:
    for min_samples in [3, 5, 10]:
        dbscan = DBSCAN(eps=eps, min_samples=min_samples)
        labels = dbscan.fit_predict(X_scaled)
        # Need at least 2 real clusters; the noise label -1 doesn't count
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        if n_clusters >= 2:
            # Note: noise points get scored as their own "cluster" here
            score = silhouette_score(X_scaled, labels)
            if score > best_score:
                best_score = score
                best_params = (eps, min_samples)
if best_params is not None:
    print(f"Best params: eps={best_params[0]}, min_samples={best_params[1]}")
DBSCAN vs K-Means
| Aspect | K-Means | DBSCAN |
|---|---|---|
| Clusters needed upfront | Yes | No |
| Cluster shapes | Spherical | Any shape |
| Outlier detection | No | Built-in |
| Cluster sizes | Similar | Can vary |
| Parameters | k | eps, min_samples |
| Speed | Faster | Slower |
When to Use DBSCAN
Good for:
- Unknown number of clusters
- Non-spherical clusters
- Need outlier detection
- Clusters of varying size and shape
Not ideal for:
- Very high-dimensional data
- Clusters with very different densities
- When you need soft assignments
Handling Outliers
# Get cluster assignments
clusters = dbscan.fit_predict(X_scaled)
# Separate normal points from outliers (label -1)
normal_mask = clusters != -1
outlier_mask = clusters == -1
X_normal = X[normal_mask]
X_outliers = X[outlier_mask]
print(f"Normal points: {normal_mask.sum()}")
print(f"Outliers: {outlier_mask.sum()}")
Key Takeaway
DBSCAN shines when clusters aren't spherical or when you don't know how many clusters exist. It automatically handles outliers, which K-Means can't do. Use the K-distance plot to find good eps values, and remember that min_samples should scale with your data size. Perfect for anomaly detection!
#Machine Learning · #DBSCAN · #Clustering · #Unsupervised Learning · #Intermediate