K-Means Clustering: Finding Groups in Data
Learn how K-Means clustering finds natural groupings in your data without labels.
Sarah Chen
December 19, 2025
No labels? No problem. K-Means finds natural groupings in your data by itself. This is unsupervised learning.
How K-Means Works
1. Pick K (the number of clusters)
2. Randomly place K centroids
3. Assign each point to its nearest centroid
4. Move each centroid to the center of its assigned points
5. Repeat steps 3-4 until stable (see the from-scratch sketch below)
Iteration 1:      Iteration 2:      Final:

    ●                 ●                 ●
  ● ★ ●             ● ★ ●             ● ★ ●
    ●                 ●                 ●
  ○    ○            ○ ☆ ○             ○ ☆ ○
    ○                 ○                 ○

★ ☆ = centroids (moving toward cluster centers)
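To make these steps concrete, here's a minimal from-scratch sketch of the loop (known as Lloyd's algorithm) in plain NumPy. It's a sketch under simplifying assumptions: it expects a numeric (n_samples, n_features) array and skips real-world details like empty-cluster handling and k-means++ initialization, which scikit-learn's KMeans handles for you.

import numpy as np

def kmeans_lloyd(X, k, n_iters=100, seed=42):
    """Bare-bones K-Means (Lloyd's algorithm); no empty-cluster handling."""
    rng = np.random.default_rng(seed)
    # Step 2: pick k random points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: move each centroid to the mean of its points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

In practice you should reach for the library version below, which adds smarter initialization and multiple restarts.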
Implementation
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Scale your data! K-Means uses Euclidean distance, so features
# on larger scales would otherwise dominate the clustering.
# (X is your feature matrix, df the DataFrame it came from.)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply K-Means
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
clusters = kmeans.fit_predict(X_scaled)

# Add cluster labels back to the original data
df['cluster'] = clusters
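One follow-up worth knowing: the fitted centroids live in scaled space. A small sketch, assuming the scaler and kmeans objects from above, maps them back to original units so you can interpret each cluster:

# Map the centroids from scaled space back to original units
# (uses the scaler and kmeans objects fitted above)
centers_original = scaler.inverse_transform(kmeans.cluster_centers_)
print(centers_original)  # one row per cluster, columns in original units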
Choosing K: The Elbow Method
How many clusters? Let the data tell you:
import matplotlib.pyplot as plt

inertias = []
K_range = range(1, 11)

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X_scaled)
    inertias.append(kmeans.inertia_)  # sum of squared distances to centroids

plt.plot(K_range, inertias, 'bo-')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()
Look for the "elbow" - the point where adding more clusters stops reducing inertia much.
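If you'd rather not eyeball the plot, here's a rough programmatic stand-in. It is not a standard method, and the 10% cutoff is an arbitrary assumption; it reuses the inertias and K_range from above:

# Rough stand-in for eyeballing the elbow: take the first K where
# the next cluster cuts inertia by less than 10% (arbitrary threshold)
drops = [1 - nxt / cur for cur, nxt in zip(inertias, inertias[1:])]
elbow_k = next((k for k, d in zip(K_range, drops) if d < 0.10), K_range[-1])
print(f"Heuristic elbow at K = {elbow_k}")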
Choosing K: Silhouette Score
Measures how well each point fits its own cluster compared to the next-closest one:
from sklearn.metrics import silhouette_score

scores = []
for k in range(2, 11):
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = kmeans.fit_predict(X_scaled)
    scores.append(silhouette_score(X_scaled, labels))

best_k = range(2, 11)[scores.index(max(scores))]
print(f"Best K by silhouette: {best_k}")
Higher silhouette score = better clustering (the score ranges from -1 to 1).
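Plotting the scores often beats just taking the max, since a near-tie may favor a simpler K. A quick sketch reusing the scores list from the loop above:

# Plot silhouette scores across K (reuses `scores` from the loop above)
plt.plot(range(2, 11), scores, 'go-')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Analysis')
plt.show()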
Analyzing Clusters
# Cluster statistics: per-cluster averages for each feature
cluster_stats = df.groupby('cluster').agg({
    'age': 'mean',
    'income': 'mean',
    'spending': 'mean'
})
print(cluster_stats)

# Cluster sizes
print(df['cluster'].value_counts())
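Summary tables only go so far. As a visual sanity check, one option (a sketch, assuming the X_scaled and df objects from earlier) is to project the scaled features to 2D with PCA and color the points by cluster:

from sklearn.decomposition import PCA

# Project the scaled features to 2D and color by cluster assignment
coords = PCA(n_components=2).fit_transform(X_scaled)
plt.scatter(coords[:, 0], coords[:, 1], c=df['cluster'], cmap='viridis', s=10)
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.title('Clusters in PCA Space')
plt.show()

Well-separated blobs in this plot are a good sign; heavily overlapping colors suggest the clusters may not be meaningful.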
Limitations of K-Means
- Must specify K beforehand
- Assumes spherical clusters (bad for elongated shapes)
- Sensitive to outliers
- Sensitive to initialization (use n_init > 1)
When to Use K-Means
Good for:
- Customer segmentation
- Image compression
- Data exploration
- When clusters are roughly spherical
Not ideal for:
- Unknown number of clusters (try DBSCAN)
- Non-spherical clusters (see the comparison sketch after this list)
- Varying cluster sizes
- Data with many outliers
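To see the non-spherical failure mode concretely, here's a small comparison sketch on scikit-learn's two-moons toy data. The eps=0.3 value is tuned to this particular toy set and would need adjusting on real data:

from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaving half-moons: clearly non-spherical clusters
X_moons, y_true = make_moons(n_samples=300, noise=0.05, random_state=42)

km_labels = KMeans(n_clusters=2, random_state=42, n_init=10).fit_predict(X_moons)
db_labels = DBSCAN(eps=0.3).fit_predict(X_moons)  # eps tuned to this toy data

# Agreement with the true moon labels (1.0 = perfect recovery)
print("K-Means ARI:", adjusted_rand_score(y_true, km_labels))
print("DBSCAN  ARI:", adjusted_rand_score(y_true, db_labels))

K-Means slices the moons with a straight boundary, while density-based DBSCAN follows their shape.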
Real-World Example: Customer Segmentation
# Features: recency, frequency, monetary value (RFM)
customer_features = df[['recency', 'frequency', 'monetary']]

# Scale
scaled = StandardScaler().fit_transform(customer_features)

# Cluster
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
df['segment'] = kmeans.fit_predict(scaled)

# Analyze segments
print(df.groupby('segment')[['recency', 'frequency', 'monetary']].mean())
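Cluster numbers are arbitrary, so the usual last step is naming segments after inspecting the means above. The labels below are hypothetical examples; yours should come from your own segment table:

# Hypothetical names - pick yours after reading the segment means above;
# cluster numbering is arbitrary and changes between runs
segment_names = {
    0: 'champions',      # bought recently, often, and spend a lot
    1: 'at_risk',        # used to buy often, haven't lately
    2: 'new_customers',  # recent first purchases, low frequency
    3: 'hibernating',    # long gone, low everything
}
df['segment_name'] = df['segment'].map(segment_names)
print(df['segment_name'].value_counts())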
Key Takeaway
K-Means is simple and fast for finding groups. Scale your data, use the elbow method or silhouette score to choose K, and always run with multiple initializations (n_init). Remember: K-Means finds clusters even if none exist - always validate that your clusters make sense!
#Machine Learning #K-Means #Clustering #Unsupervised Learning #Intermediate