K-Means Clustering Fundamentals

Duration: 5 min

This module introduces K-Means clustering, a fundamental unsupervised learning algorithm that partitions data into distinct clusters. K-Means groups similar data points and exposes patterns in unlabeled data, which makes it useful for applications such as customer segmentation, image compression, and anomaly detection.

Understanding K-Means Clustering

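K-Means clustering is an iterative algorithm that divides a dataset into K distinct, non-overlapping subsets (clusters) based on feature similarity. The algorithm assigns each data point to the cluster with the nearest mean (centroid), then recalculates each centroid as the mean of the points assigned to it. These two steps repeat until the centroids stop moving, at which point the algorithm has converged. The result is a local optimum of the within-cluster sum of squared distances, not necessarily the best possible clustering, and it can depend on where the centroids started. For that reason, implementations typically run several random initializations and keep the best result.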

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate synthetic data
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Apply K-Means clustering
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# Get cluster labels and centroids
labels = kmeans.labels_
centroids = kmeans.cluster_centers_

print('Cluster labels:', labels)
print('Centroids:', centroids)

Running the code prints an array of 300 cluster labels (one integer from 0 to 3 per sample) and a 4 x 2 array of centroid coordinates, one row per cluster. The label numbers are arbitrary, and the exact values can vary with the scikit-learn version.
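The assign-and-update loop described in this section can also be written out by hand. The following is a minimal NumPy sketch for illustration, not scikit-learn's implementation. The function name kmeans_sketch, its parameters, and the tiny demo dataset are assumptions made for this example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=0):
    """Illustrative K-Means loop (assign, then update); not scikit-learn's code."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k distinct data points at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: find the nearest centroid for every point
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical demo data: two well-separated groups of 2-D points
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels_demo, centroids_demo = kmeans_sketch(X_demo, k=2)
print('Labels:', labels_demo)
print('Centroids:', centroids_demo)
```

The sketch uses a single random initialization, so on harder data it may settle in a poor local optimum. In practice, prefer sklearn.cluster.KMeans, which handles initialization more carefully.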

Choosing the Number of Clusters

One of the critical decisions in K-Means clustering is choosing the number of clusters (K). The Elbow Method is a common heuristic for picking K. It plots the sum of squared distances from each point to its assigned centroid (available in scikit-learn as the inertia_ attribute) for a range of K values. This sum always decreases as K grows, so the goal is not to minimize it but to find the K beyond which additional clusters yield only small improvements (the 'elbow' point).

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate synthetic data
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Calculate the sum of squared distances for different K values
sum_of_squared_distances = []
K = range(1, 10)
for k in K:
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0)  # fixed seed so the curve is reproducible
    kmeans.fit(X)
    sum_of_squared_distances.append(kmeans.inertia_)

# Plot the Elbow Method graph
plt.plot(K, sum_of_squared_distances, 'bx-')
plt.xlabel('Number of clusters')
plt.ylabel('Sum of squared distances')
plt.title('Elbow Method For Optimal k')
plt.show()
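To make the y-axis of the elbow plot concrete, the sketch below recomputes the sum of squared distances by hand and compares it with the inertia_ value reported by scikit-learn. It reuses the same synthetic data as above. The variable names assigned_centroids and manual_inertia are introduced only for this illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Same synthetic data as in the examples above
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# Look up, for every point, the centroid of the cluster it was assigned to
assigned_centroids = kmeans.cluster_centers_[kmeans.labels_]

# Sum of squared Euclidean distances from each point to its own centroid
manual_inertia = np.sum((X - assigned_centroids) ** 2)

print('Manual sum of squared distances:', manual_inertia)
print('kmeans.inertia_:', kmeans.inertia_)
```

The two printed values should agree up to floating-point rounding.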

💡 Tip: When applying K-Means clustering, ensure that your data is scaled properly, as the algorithm is sensitive to the scale of the features. Using techniques like StandardScaler from sklearn.preprocessing can help achieve better results.
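One way to apply this tip is to chain StandardScaler and KMeans in a scikit-learn pipeline, so that scaling is always applied before clustering. The sketch below uses made-up data with two features on very different scales. The feature names age and income and all of the numbers are hypothetical and chosen only for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 'age' spans tens of units, 'income' spans tens of thousands
rng = np.random.default_rng(0)
age = np.concatenate([rng.normal(25, 2, 50), rng.normal(60, 2, 50)])
income = rng.normal(50_000, 10_000, 100)
X_raw = np.column_stack([age, income])

# Standardize each feature to zero mean and unit variance, then cluster
pipeline = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=2, n_init=10, random_state=0),
)
labels_scaled = pipeline.fit_predict(X_raw)

# Inspect the scaled features that K-Means actually saw
X_scaled = pipeline.named_steps['standardscaler'].transform(X_raw)
print('Feature means after scaling:', X_scaled.mean(axis=0))
print('Feature std devs after scaling:', X_scaled.std(axis=0))
```

Without the scaling step, the income column would dominate the Euclidean distances simply because its numeric range is far larger, regardless of how informative it is.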

❓ What is the primary goal of K-Means clustering?

To reduce dimensionality
To partition data into distinct clusters
To predict continuous values
To classify data into predefined categories

❓ Which method is commonly used to determine the optimal number of clusters in K-Means?

Silhouette Method
DBSCAN
Elbow Method
Hierarchical Clustering
