Module 21 of 26 · Scikit-Learn Machine Learning · Beginner

Unsupervised Learning: Clustering

Duration: 5 min

This module covers clustering, a family of unsupervised learning techniques that group similar data points without using labels. Clustering is useful for tasks such as customer segmentation, anomaly detection, and exploratory data analysis.

K-Means Clustering

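K-Means is one of the most popular clustering algorithms. It partitions the data into K clusters by minimizing the within-cluster sum of squared distances between each point and its cluster centroid (scikit-learn calls this quantity inertia). The algorithm alternates between two steps, assigning each point to its nearest centroid and recomputing each centroid as the mean of its assigned points, until the centroids stop moving (convergence) or an iteration limit is reached. K must be chosen in advance, and because the result depends on the initial centroids, setting random_state makes runs reproducible.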

from sklearn.cluster import KMeans
import numpy as np

# Generate sample data
X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])

# Apply K-Means clustering
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)

# Get cluster labels and centroids
labels = kmeans.labels_
centroids = kmeans.cluster_centers_

print('Cluster labels:', labels)
print('Cluster centroids:', centroids)

Cluster labels: [0 0 0 1 1 1]
Cluster centroids: [[1. 2.]
 [4. 2.]]
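
To make the assign-and-update loop described above concrete, here is a minimal NumPy sketch of it. This is an illustration under stated assumptions, not scikit-learn's implementation: the function name lloyd_kmeans and its init_centroids parameter are hypothetical, and the starting centroids are passed in explicitly so the run is deterministic. A different starting pair can converge to a different, worse partition, which is why KMeans uses smarter initialization.

```python
import numpy as np

def lloyd_kmeans(X, init_centroids, max_iter=100):
    """Illustrative K-Means loop (hypothetical helper, not scikit-learn's code)."""
    centroids = np.asarray(init_centroids, dtype=float).copy()
    for _ in range(max_iter):
        # Assignment step: distance from every point to every centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        new_centroids = centroids.copy()
        for k in range(len(centroids)):
            members = X[labels == k]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                new_centroids[k] = members.mean(axis=0)
        # Convergence check: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]], dtype=float)

# Start from one point in each column (an assumption made for this example)
sketch_labels, sketch_centroids = lloyd_kmeans(X, init_centroids=X[[0, 3]])
print('Sketch labels:', sketch_labels)
print('Sketch centroids:', sketch_centroids)
```

Cluster numbers are arbitrary identifiers: a run that labels the left column 1 and the right column 0 describes the same partition.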

Hierarchical Clustering

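Hierarchical clustering builds a tree of nested clusters. It can be agglomerative (bottom-up) or divisive (top-down). Agglomerative clustering starts with each data point as its own cluster and repeatedly merges the two closest clusters, where "closest" is defined by a linkage criterion such as single, complete, average, or Ward, until the requested number of clusters remains. The full merge history can be drawn as a dendrogram, which helps visualize the structure of the data.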

from sklearn.cluster import AgglomerativeClustering
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt

# Generate sample data
X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])

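# Apply Agglomerative Clustering with single linkage so the model matches
# the dendrogram plotted further down (the scikit-learn default is 'ward')
clustering = AgglomerativeClustering(n_clusters=2, linkage='single').fit(X)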

# Get cluster labels
labels = clustering.labels_

# Build the single-linkage merge matrix and plot it as a dendrogram
linked = linkage(X, 'single')
plt.figure(figsize=(10, 7))
dendrogram(linked)
plt.title('Dendrogram')
plt.show()

print('Cluster labels:', labels)
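
A dendrogram is a picture of SciPy's linkage matrix, and a flat clustering can be read off by cutting that tree. The sketch below assumes the same single-linkage setting used for the dendrogram and uses SciPy's fcluster to cut the tree into at most two clusters. The variable names are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])

# Each row of Z records one merge: [cluster_a, cluster_b, distance, new_size]
Z = linkage(X, 'single')
print('Merge distances:', Z[:, 2])

# Cut the tree so that at most two flat clusters remain
flat_labels = fcluster(Z, t=2, criterion='maxclust')
print('Flat labels:', flat_labels)
```

fcluster numbers clusters starting from 1, while scikit-learn starts from 0, so compare the groupings rather than the raw label values.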

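💡 Tip: When choosing the number of clusters for K-Means, try the Elbow Method: fit the model for a range of K values, plot the sum of squared distances from each point to its assigned center (the inertia_ attribute) against K, and look for the "elbow" where adding more clusters stops reducing it substantially. The elbow is a heuristic and is not always clear-cut.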
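
As a rough sketch of the Elbow Method on the sample data from this module, the loop below fits K-Means for K = 1 to 5 and records inertia_ for each fit. The range of K values and n_init=10 are choices made for this example, not recommendations.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])

k_values = list(range(1, 6))
inertias = []
for k in k_values:
    # n_init=10 reruns K-Means from 10 starting points and keeps the best fit
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)

for k, inertia in zip(k_values, inertias):
    print(f'K={k}: inertia={inertia:.2f}')
```

To see the elbow visually, pass k_values and inertias to matplotlib's plt.plot and look for the bend in the curve. On a dataset this small the bend may be faint.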

❓ What is the primary goal of K-Means clustering?

❓ Which type of clustering builds a tree of clusters?

Key Concepts

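Concept | Description
Centroid | The mean of all points assigned to a cluster; K-Means represents each cluster by its centroid.
Distance Metric | The function used to measure how far apart two points are; K-Means uses Euclidean distance.
Convergence | The point at which centroids (and therefore assignments) stop changing between iterations, so the algorithm stops.
Silhouette Score | A value between -1 and 1 that compares how close each point is to its own cluster versus the nearest other cluster; higher is better.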
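
The silhouette score listed under Key Concepts can be computed with scikit-learn's silhouette_score. The sketch below compares a few values of K on the module's sample data. The choice of K values and n_init=10 are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])

silhouette_by_k = {}
for k in (2, 3, 4):
    # The score needs at least 2 clusters and fewer clusters than samples
    cluster_labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    silhouette_by_k[k] = silhouette_score(X, cluster_labels)

for k, score in silhouette_by_k.items():
    print(f'K={k}: silhouette={score:.3f}')
```

Unlike inertia, which always shrinks as K grows, the silhouette score can go down when clusters are split too finely, so it offers a second opinion when the elbow is unclear.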

Check Your Understanding

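❓ How does K-Means handle edge cases such as outliers or clusters of very different sizes?

❓ How does the computational cost of K-Means compare with that of agglomerative clustering as the dataset grows?

❓ Which hyperparameter is most critical for K-Means, and how would you choose it?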
