K-Means Clustering: Finding Groups in Data
Learn how K-Means clustering finds natural groupings in your data without labels.
No labels? No problem. K-Means finds natural groupings in your data by itself. This is unsupervised learning.
How K-Means Works
1. Pick K (the number of clusters)
2. Randomly place K centroids
3. Assign each point to its nearest centroid
4. Move each centroid to the center of its points
5. Repeat steps 3-4 until stable
```
Iteration 1:       Iteration 2:       Final:
 ● ●                ● ●                ● ●
 ● ●      ★         ● ●   ★            ●★●
 ○ ○                ○ ○                ○ ○
  ○     ☆            ○   ☆              ○☆

★ ☆ = centroids (moving toward cluster centers)
```
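To make the loop above concrete, here's a minimal from-scratch sketch in NumPy. The function name `kmeans_simple` is just for illustration, and it omits the smart initialization and multiple restarts that scikit-learn adds:

```python
import numpy as np

def kmeans_simple(X, k, n_iters=100, seed=42):
    """Bare-bones K-Means on an (n_samples, n_features) array."""
    rng = np.random.default_rng(seed)
    # Step 2: pick k random points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: move each centroid to the mean of its assigned points
        # (keep the old centroid if a cluster ends up empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

In practice you'd reach for scikit-learn's `KMeans` instead, which adds k-means++ initialization and multiple restarts - as in the implementation below.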
Implementation
```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Scale your data! K-Means uses distances, so features must be comparable.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply K-Means
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
clusters = kmeans.fit_predict(X_scaled)

# Add cluster labels to the data
df['cluster'] = clusters
```
Choosing K: The Elbow Method
How many clusters? Let the data tell you:
```python
import matplotlib.pyplot as plt

inertias = []
K_range = range(1, 11)

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X_scaled)
    inertias.append(kmeans.inertia_)

plt.plot(K_range, inertias, 'bo-')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()
```
Look for the "elbow" - the point where adding more clusters stops reducing inertia by much.
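Reading the elbow off a plot is subjective. If you want a rough programmatic cue, one heuristic (an assumption here, not a scikit-learn feature) is to pick the K where the inertia curve bends most sharply, i.e. the largest second difference:

```python
import numpy as np

# How much inertia drops at each step, and how quickly those drops shrink.
drops = np.diff(inertias)
bends = np.diff(drops)

# The largest bend is a rough stand-in for the visual elbow.
elbow_k = K_range[int(np.argmax(bends)) + 1]
print(f"Heuristic elbow at K = {elbow_k}")
```

Treat this as a tiebreaker, not a verdict - always look at the plot too.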
Choosing K: Silhouette Score
The silhouette score measures how well each point fits its own cluster compared to the nearest other cluster:
```python
from sklearn.metrics import silhouette_score

scores = []
for k in range(2, 11):
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = kmeans.fit_predict(X_scaled)
    scores.append(silhouette_score(X_scaled, labels))

best_k = range(2, 11)[scores.index(max(scores))]
print(f"Best K by silhouette: {best_k}")
Higher silhouette score = better clustering. Scores range from -1 to 1, and values near 1 mean tight, well-separated clusters.
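The average score can hide one badly formed cluster. scikit-learn's `silhouette_samples` gives a per-point score, so you can check each cluster individually (this sketch assumes `X_scaled` and `best_k` from above):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples

kmeans = KMeans(n_clusters=best_k, random_state=42, n_init=10)
labels = kmeans.fit_predict(X_scaled)
sample_scores = silhouette_samples(X_scaled, labels)

# A cluster whose mean (or minimum) is far below the others is suspect.
for c in np.unique(labels):
    cluster_scores = sample_scores[labels == c]
    print(f"Cluster {c}: mean={cluster_scores.mean():.2f}, "
          f"min={cluster_scores.min():.2f}")
```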
Analyzing Clusters
```python
# Cluster statistics
cluster_stats = df.groupby('cluster').agg({
    'age': 'mean',
    'income': 'mean',
    'spending': 'mean'
})
print(cluster_stats)

# Cluster sizes
print(df['cluster'].value_counts())
```
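Statistics tell you what the clusters are; a quick plot tells you whether they look separated. One common trick is projecting the scaled features to 2D with PCA (assuming `X_scaled` and `clusters` from earlier):

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Project to two dimensions purely for visualization.
coords = PCA(n_components=2).fit_transform(X_scaled)

plt.scatter(coords[:, 0], coords[:, 1], c=clusters, cmap='viridis', s=15)
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.title('Clusters in PCA space')
plt.show()
```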
Limitations of K-Means
1. **Must specify K beforehand**
2. **Assumes spherical clusters** (bad for elongated shapes)
3. **Sensitive to outliers**
4. **Sensitive to initialization** (use n_init > 1 - see the sketch below)
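Here's a quick way to see limitation 4 in action on synthetic data - a single random initialization can land in a worse local optimum than ten restarts (`make_blobs` is just here to generate toy data):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X_demo, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# One random init per run: final inertia varies from seed to seed.
for seed in range(3):
    single = KMeans(n_clusters=4, init='random', n_init=1,
                    random_state=seed).fit(X_demo)
    print(f"seed={seed}: inertia = {single.inertia_:.1f}")

# Ten restarts keep the best result.
multi = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_demo)
print(f"n_init=10: inertia = {multi.inertia_:.1f}")
```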
When to Use K-Means
**Good for:**

- Customer segmentation
- Image compression
- Data exploration
- When clusters are roughly spherical

**Not ideal for:**

- Unknown number of clusters (try DBSCAN)
- Non-spherical clusters (see the comparison below)
- Varying cluster sizes
- Data with many outliers
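To see why non-spherical shapes call for something like DBSCAN, try both algorithms on scikit-learn's two-moons toy dataset (the `eps` value here is hand-tuned for this data):

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons

# Two interleaved crescents: clearly two groups, but not spherical ones.
X_moons, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

km_labels = KMeans(n_clusters=2, random_state=42, n_init=10).fit_predict(X_moons)
db_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X_moons)

# K-Means cuts the moons with a straight boundary; density-based DBSCAN
# follows their shape instead. A label of -1 from DBSCAN marks noise.
print("K-Means labels:", set(km_labels))
print("DBSCAN labels: ", set(db_labels))
```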
Real-World Example: Customer Segmentation
```python
# Features: recency, frequency, monetary value (RFM)
customer_features = df[['recency', 'frequency', 'monetary']]

# Scale
scaled = StandardScaler().fit_transform(customer_features)

# Cluster
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
df['segment'] = kmeans.fit_predict(scaled)

# Analyze segments
df.groupby('segment')[['recency', 'frequency', 'monetary']].mean()
```
Key Takeaway
K-Means is simple and fast for finding groups. Scale your data, use the elbow method or silhouette score to choose K, and always run with multiple initializations (n_init). Remember: K-Means finds clusters even if none exist - always validate that your clusters make sense!
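One cheap sanity check for that last point: run the same pipeline on pure noise and compare scores. A sketch (the threshold in the comment is a rule of thumb, not a hard rule):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# K-Means will happily return 3 "clusters" for uniform random points.
rng = np.random.default_rng(42)
X_noise = rng.uniform(size=(500, 2))

labels = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(X_noise)
print(f"Silhouette on noise: {silhouette_score(X_noise, labels):.2f}")
# Expect a mediocre score here; if your real data scores similarly,
# the "clusters" may not mean much.
```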