
K-Means Clustering: Finding Groups in Data

Learn how K-Means clustering finds natural groupings in your data without labels.

Sarah Chen
December 19, 2025

No labels? No problem. K-Means finds natural groupings in your data by itself. This is unsupervised learning.

How K-Means Works

  1. Pick K (number of clusters)
  2. Randomly place K centroids
  3. Assign each point to nearest centroid
  4. Move centroids to center of their points
  5. Repeat steps 3-4 until stable
Iteration 1:        Iteration 2:        Final:
    ●                  ●                  ●
  ● ★ ●             ●  ★  ●            ●  ★  ●
    ●                  ●                  ●
        
  ○   ○             ○  ☆  ○            ○  ☆  ○
    ○                  ○                  ○

★ ☆ = centroids (moving toward cluster centers)
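
The loop above is short enough to sketch from scratch. This is a teaching sketch in plain NumPy (it assumes no cluster ever becomes empty; use scikit-learn's KMeans in practice):

```python
import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=42):
    """Bare-bones Lloyd's algorithm; assumes no cluster ever becomes empty."""
    rng = np.random.default_rng(seed)
    # Step 2: use k random data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: move each centroid to the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated blobs should come back as two clean clusters
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels, centroids = kmeans_sketch(X, k=2)
```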

Implementation

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# X is your feature matrix; df is the DataFrame it came from

# Scale your data! K-Means uses Euclidean distance, so unscaled
# features with large ranges would dominate the clustering.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply K-Means
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
clusters = kmeans.fit_predict(X_scaled)

# Add cluster labels to data
df['cluster'] = clusters

Choosing K: The Elbow Method

How many clusters? Let the data tell you:

import matplotlib.pyplot as plt

inertias = []
K_range = range(1, 11)

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X_scaled)
    inertias.append(kmeans.inertia_)

plt.plot(K_range, inertias, 'bo-')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()

Look for the "elbow" - where adding more clusters stops helping much.
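
If you'd rather pick the elbow programmatically, one simple heuristic (an assumption on my part, not the only definition of "elbow") is to take the K where the inertia curve bends most sharply, i.e. the largest second difference. A self-contained sketch on synthetic data with three obvious blobs:

```python
import numpy as np
from sklearn.cluster import KMeans

# Three synthetic blobs at the corners of a triangle (illustration only)
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [4.0, 0.0], [2.0, 3.46]])
X = np.vstack([rng.normal(c, 0.3, (40, 2)) for c in centers])

inertias = []
K_range = range(1, 8)
for k in K_range:
    inertias.append(KMeans(n_clusters=k, random_state=42, n_init=10).fit(X).inertia_)

# Heuristic: the elbow is the K with the sharpest bend in the curve,
# i.e. the largest second difference of the inertia values
second_diffs = np.diff(inertias, n=2)
elbow_k = K_range[int(second_diffs.argmax()) + 1]
print(f"Elbow at K={elbow_k}")
```

Eyeball the plot too: automated heuristics can misfire when the curve has no clear bend.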

Choosing K: Silhouette Score

Measures how well each point fits its own cluster compared with the nearest other cluster (values range from -1 to 1):

from sklearn.metrics import silhouette_score

scores = []
for k in range(2, 11):
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = kmeans.fit_predict(X_scaled)
    scores.append(silhouette_score(X_scaled, labels))

best_k = range(2, 11)[scores.index(max(scores))]
print(f"Best K by silhouette: {best_k}")

Higher silhouette score = better clustering (max is 1).
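
To demystify the score, here it is computed by hand from its definition: for each point, a is the mean distance to its own cluster and b is the lowest mean distance to any other cluster, giving s = (b - a) / max(a, b). This NumPy sketch assumes every cluster has at least two points (scikit-learn defines s = 0 for singletons):

```python
import numpy as np

def silhouette_sketch(X, labels):
    """Mean silhouette computed from the definition; assumes each
    cluster has at least 2 points."""
    X, labels = np.asarray(X, float), np.asarray(labels)
    scores = []
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)
        own = labels == labels[i]
        # a: mean distance to the point's own cluster, excluding itself
        a = d[own].sum() / (own.sum() - 1)
        # b: lowest mean distance to any other cluster
        b = min(d[labels == c].mean()
                for c in set(labels.tolist()) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Two tight, well-separated pairs -> score close to 1
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
s = silhouette_sketch(X, np.array([0, 0, 1, 1]))
print(s)
```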

Analyzing Clusters

# Cluster statistics
cluster_stats = df.groupby('cluster').agg({
    'age': 'mean',
    'income': 'mean',
    'spending': 'mean'
})
print(cluster_stats)

# Cluster sizes
print(df['cluster'].value_counts())
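
Cluster centroids are often easier to read in original units. Since the model was fit on scaled data, `cluster_centers_` live in scaled space; `inverse_transform` maps them back. A sketch with made-up numbers standing in for the columns above:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the customer data (values are made up)
df = pd.DataFrame({
    "age":      [22, 25, 24, 45, 48, 50],
    "income":   [30, 32, 31, 90, 95, 88],
    "spending": [80, 75, 78, 20, 25, 22],
})

scaler = StandardScaler()
X_scaled = scaler.fit_transform(df[["age", "income", "spending"]].to_numpy())

kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
df["cluster"] = kmeans.fit_predict(X_scaled)

# cluster_centers_ are in *scaled* space; map them back to original
# units so each centroid reads as an interpretable "average member"
centers = scaler.inverse_transform(kmeans.cluster_centers_)
print(pd.DataFrame(centers, columns=["age", "income", "spending"]))
```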

Limitations of K-Means

  1. Must specify K beforehand
  2. Assumes spherical clusters (bad for elongated shapes)
  3. Sensitive to outliers
  4. Sensitive to initialization (use n_init > 1)

When to Use K-Means

Good for:

  • Customer segmentation
  • Image compression
  • Data exploration
  • When clusters are roughly spherical

Not ideal for:

  • Unknown number of clusters (try DBSCAN)
  • Non-spherical clusters
  • Varying cluster sizes
  • Data with many outliers
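
As a quick contrast, DBSCAN (mentioned above) needs no K, handles non-spherical shapes, and flags outliers as noise (label -1). A minimal sketch on synthetic data; the `eps` and `min_samples` values here are tuned to this toy data, not general-purpose defaults:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense blobs plus one far-away outlier (synthetic, for illustration)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.2, (30, 2)),
               rng.normal(5, 0.2, (30, 2)),
               [[20.0, 20.0]]])

# eps: neighborhood radius; min_samples: density threshold for a core point
labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)

n_clusters = len(set(labels.tolist())) - (1 if -1 in labels else 0)
print("clusters found:", n_clusters)
print("outliers (label -1):", int((labels == -1).sum()))
```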

Real-World Example: Customer Segmentation

# Features: recency, frequency, monetary value
customer_features = df[['recency', 'frequency', 'monetary']]

# Scale
scaled = StandardScaler().fit_transform(customer_features)

# Cluster
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
df['segment'] = kmeans.fit_predict(scaled)

# Analyze segments
df.groupby('segment')[['recency', 'frequency', 'monetary']].mean()

Key Takeaway

K-Means is simple and fast for finding groups. Scale your data, use the elbow method or silhouette score to choose K, and always run with multiple initializations (n_init). Remember: K-Means finds clusters even if none exist - always validate that your clusters make sense!

#MachineLearning #K-Means #Clustering #UnsupervisedLearning #Intermediate