K-Means Clustering: Finding Groups in Data
Learn how K-Means clustering finds natural groupings in your data without labels.
No labels? No problem. K-Means finds natural groupings in your data by itself. This is unsupervised learning.
How K-Means Works
1. Pick K (the number of clusters)
2. Randomly place K centroids
3. Assign each point to its nearest centroid
4. Move each centroid to the center of its points
5. Repeat steps 3-4 until stable
```
Iteration 1:       Iteration 2:       Final:
 ● ●                ● ●                ● ●
 ● ●      ★         ● ●   ★            ●★●
 ○ ○                ○ ○                ○ ○
  ○     ☆            ○   ☆              ○☆

★ ☆ = centroids (moving toward cluster centers)
```
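To make the loop above concrete, here's a minimal from-scratch sketch in NumPy. The function name `kmeans_simple` is just for illustration, and it omits the smart initialization and multiple restarts that scikit-learn adds:

```python
import numpy as np

def kmeans_simple(X, k, n_iters=100, seed=42):
    """Bare-bones K-Means on an (n_samples, n_features) array."""
    rng = np.random.default_rng(seed)
    # Step 2: pick k random points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: move each centroid to the mean of its assigned points
        # (keep the old centroid if a cluster ends up empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

In practice you'd reach for scikit-learn's `KMeans` instead, which adds k-means++ initialization and multiple restarts - as in the implementation below.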
Implementation
```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Scale your data! K-Means uses distances, so features must be comparable.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply K-Means
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
clusters = kmeans.fit_predict(X_scaled)

# Add cluster labels to the data
df['cluster'] = clusters
```
Choosing K: The Elbow Method
How many clusters? Let the data tell you:
```python
import matplotlib.pyplot as plt

inertias = []
K_range = range(1, 11)

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X_scaled)
    inertias.append(kmeans.inertia_)

plt.plot(K_range, inertias, 'bo-')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()
```
Look for the "elbow" - the point where adding more clusters stops reducing inertia by much.
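Reading the elbow off a plot is subjective. If you want a rough programmatic cue, one heuristic (an assumption here, not a scikit-learn feature) is to pick the K where the inertia curve bends most sharply, i.e. the largest second difference:

```python
import numpy as np

# How much inertia drops at each step, and how quickly those drops shrink.
drops = np.diff(inertias)
bends = np.diff(drops)

# The largest bend is a rough stand-in for the visual elbow.
elbow_k = K_range[int(np.argmax(bends)) + 1]
print(f"Heuristic elbow at K = {elbow_k}")
```

Treat this as a tiebreaker, not a verdict - always look at the plot too.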
Choosing K: Silhouette Score
The silhouette score measures how well each point fits its own cluster compared to the nearest other cluster:
```python
from sklearn.metrics import silhouette_score

scores = []
for k in range(2, 11):
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = kmeans.fit_predict(X_scaled)
    scores.append(silhouette_score(X_scaled, labels))

best_k = range(2, 11)[scores.index(max(scores))]
print(f"Best K by silhouette: {best_k}")
Higher silhouette score = better clustering. Scores range from -1 to 1, and values near 1 mean tight, well-separated clusters.
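The average score can hide one badly formed cluster. scikit-learn's `silhouette_samples` gives a per-point score, so you can check each cluster individually (this sketch assumes `X_scaled` and `best_k` from above):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples

kmeans = KMeans(n_clusters=best_k, random_state=42, n_init=10)
labels = kmeans.fit_predict(X_scaled)
sample_scores = silhouette_samples(X_scaled, labels)

# A cluster whose mean (or minimum) is far below the others is suspect.
for c in np.unique(labels):
    cluster_scores = sample_scores[labels == c]
    print(f"Cluster {c}: mean={cluster_scores.mean():.2f}, "
          f"min={cluster_scores.min():.2f}")
```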
Analyzing Clusters
```python
# Cluster statistics
cluster_stats = df.groupby('cluster').agg({
    'age': 'mean',
    'income': 'mean',
    'spending': 'mean'
})
print(cluster_stats)

# Cluster sizes
print(df['cluster'].value_counts())
```
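Statistics tell you what the clusters are; a quick plot tells you whether they look separated. One common trick is projecting the scaled features to 2D with PCA (assuming `X_scaled` and `clusters` from earlier):

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Project to two dimensions purely for visualization.
coords = PCA(n_components=2).fit_transform(X_scaled)

plt.scatter(coords[:, 0], coords[:, 1], c=clusters, cmap='viridis', s=15)
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.title('Clusters in PCA space')
plt.show()
```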
Limitations of K-Means
1. **Must specify K beforehand**
2. **Assumes spherical clusters** (bad for elongated shapes)
3. **Sensitive to outliers**
4. **Sensitive to initialization** (use n_init > 1 - see the sketch below)
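Here's a quick way to see limitation 4 in action on synthetic data - a single random initialization can land in a worse local optimum than ten restarts (`make_blobs` is just here to generate toy data):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X_demo, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# One random init per run: final inertia varies from seed to seed.
for seed in range(3):
    single = KMeans(n_clusters=4, init='random', n_init=1,
                    random_state=seed).fit(X_demo)
    print(f"seed={seed}: inertia = {single.inertia_:.1f}")

# Ten restarts keep the best result.
multi = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_demo)
print(f"n_init=10: inertia = {multi.inertia_:.1f}")
```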
When to Use K-Means
**Good for:**

- Customer segmentation
- Image compression
- Data exploration
- When clusters are roughly spherical

**Not ideal for:**

- Unknown number of clusters (try DBSCAN)
- Non-spherical clusters (see the comparison below)
- Varying cluster sizes
- Data with many outliers
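To see why non-spherical shapes call for something like DBSCAN, try both algorithms on scikit-learn's two-moons toy dataset (the `eps` value here is hand-tuned for this data):

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons

# Two interleaved crescents: clearly two groups, but not spherical ones.
X_moons, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

km_labels = KMeans(n_clusters=2, random_state=42, n_init=10).fit_predict(X_moons)
db_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X_moons)

# K-Means cuts the moons with a straight boundary; density-based DBSCAN
# follows their shape instead. A label of -1 from DBSCAN marks noise.
print("K-Means labels:", set(km_labels))
print("DBSCAN labels: ", set(db_labels))
```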
Real-World Example: Customer Segmentation
```python
# Features: recency, frequency, monetary value (RFM)
customer_features = df[['recency', 'frequency', 'monetary']]

# Scale
scaled = StandardScaler().fit_transform(customer_features)

# Cluster
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
df['segment'] = kmeans.fit_predict(scaled)

# Analyze segments
df.groupby('segment')[['recency', 'frequency', 'monetary']].mean()
```
Key Takeaway
K-Means is simple and fast for finding groups. Scale your data, use the elbow method or silhouette score to choose K, and always run with multiple initializations (n_init). Remember: K-Means finds clusters even if none exist - always validate that your clusters make sense!
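One cheap sanity check for that last point: run the same pipeline on pure noise and compare scores. A sketch (the threshold in the comment is a rule of thumb, not a hard rule):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# K-Means will happily return 3 "clusters" for uniform random points.
rng = np.random.default_rng(42)
X_noise = rng.uniform(size=(500, 2))

labels = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(X_noise)
print(f"Silhouette on noise: {silhouette_score(X_noise, labels):.2f}")
# Expect a mediocre score here; if your real data scores similarly,
# the "clusters" may not mean much.
```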