Unsupervised Learning: Clustering

Duration: 5 min

This module covers clustering, a family of unsupervised learning techniques that discover groupings in data without using labels. Clustering underpins tasks such as customer segmentation, anomaly detection, and exploratory data analysis.

K-Means Clustering

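K-Means is one of the most widely used clustering algorithms. It partitions the data into K clusters by minimizing the within-cluster sum of squared distances between each point and its cluster centroid (the mean of the points in that cluster). The algorithm alternates two steps, assigning each point to its nearest centroid and then recomputing each centroid as the mean of its assigned points, until the assignments stop changing. The result depends on the initial centroids, so K-Means finds a local optimum rather than a guaranteed global one.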

from sklearn.cluster import KMeans
import numpy as np

# Generate sample data
X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])

# Apply K-Means clustering
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)

# Get cluster labels and centroids
labels = kmeans.labels_
centroids = kmeans.cluster_centers_

print('Cluster labels:', labels)
print('Cluster centroids:', centroids)

Output (the cluster numbering is arbitrary, so the labels 0 and 1 may be swapped):

Cluster labels: [0 0 0 1 1 1]
Cluster centroids: [[1. 2.]
 [4. 2.]]
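
To make the assign-and-update loop concrete, here is a minimal NumPy sketch of the same idea. It is an illustration under simplifying assumptions (random initialization from the data points, Euclidean distance), not how scikit-learn implements KMeans, and kmeans_sketch is a hypothetical helper name.

```python
import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=0):
    """Plain assign/update loop. Hypothetical helper, not a scikit-learn API."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k distinct data points at random
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iters):
        # Assignment step: squared Euclidean distance from every point to every centroid
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Convergence: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])
labels, centroids = kmeans_sketch(X, k=2)
print('Cluster labels:', labels)
print('Cluster centroids:', centroids)
```

Because the starting centroids are random, a different seed can lead to a different partition. This is one reason library implementations typically run several initializations and keep the best result.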

Hierarchical Clustering

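Hierarchical clustering builds a tree of nested clusters. It can be agglomerative (bottom-up) or divisive (top-down). Agglomerative clustering starts with each data point as its own cluster and repeatedly merges the two closest clusters, as defined by a linkage criterion such as single, complete, average, or Ward, until a stopping criterion such as a target number of clusters is met. The merge history can be drawn as a dendrogram, which shows the structure of the data at every level of granularity.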

from sklearn.cluster import AgglomerativeClustering
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt

# Generate sample data
X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])

# Apply Agglomerative Clustering
clustering = AgglomerativeClustering(n_clusters=2).fit(X)

# Get cluster labels
labels = clustering.labels_

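# Plot dendrogram (Ward linkage, matching AgglomerativeClustering's default,
# so the tree reflects the same merges that produced the labels)
linked = linkage(X, 'ward')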
plt.figure(figsize=(10, 7))
dendrogram(linked)
plt.title('Dendrogram')
plt.show()

print('Cluster labels:', labels)
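
A dendrogram can also be cut into flat clusters without plotting it. The sketch below uses SciPy's linkage and fcluster for this. The choice of Ward linkage and of two clusters are assumptions made for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])

# Build the merge tree with Ward linkage
Z = linkage(X, method='ward')

# Each row of Z records one merge:
# [cluster_a, cluster_b, merge_distance, size_of_new_cluster]
print('Linkage matrix shape:', Z.shape)

# Cut the tree so that at most 2 flat clusters remain
flat_labels = fcluster(Z, t=2, criterion='maxclust')
print('Flat cluster labels:', flat_labels)
```

Note that fcluster numbers clusters starting at 1, whereas scikit-learn's labels_ start at 0.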

💡 Tip: When choosing the number of clusters for K-Means, use the Elbow Method: plot the sum of squared distances from each point to its assigned centroid (the inertia) against K, and pick the K at the "elbow", where adding more clusters stops reducing that sum substantially.
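
As a rough sketch of the Elbow Method, the loop below fits K-Means for several values of K and records the inertia_ attribute of each fitted model. The range of K values and the n_init=10 setting are assumptions chosen for this small dataset.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])

# Fit K-Means for several values of K and record the inertia
# (sum of squared distances from each point to its assigned centroid)
k_values = range(1, 6)
inertias = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)

for k, inertia in zip(k_values, inertias):
    print(f'K={k}: inertia={inertia:.2f}')
```

Plotting inertias against k_values (for example with matplotlib's plt.plot) gives the elbow curve. On a dataset this small the elbow is only illustrative.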

❓ What is the primary goal of K-Means clustering?

To maximize variance within clusters
To minimize variance within clusters
To maximize the distance between clusters
To minimize the distance between clusters

❓ Which type of clustering builds a tree of clusters?

K-Means
DBSCAN
Hierarchical
Spectral

Key Concepts

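Concept	Description
Centroid	The mean of all points assigned to a cluster; K-Means represents each cluster by its centroid
Distance Metric	The function used to measure how far apart two points are, such as Euclidean distance
Convergence	The point at which cluster assignments and centroids stop changing between iterations
Silhouette Score	A value between -1 and 1 measuring how well points fit their own cluster compared with the nearest other cluster; higher is better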
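
The silhouette score can be computed with scikit-learn's silhouette_score. The following is a minimal sketch that reuses the sample data from this module and assumes K=2.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])

# Cluster the data, then score how well separated the clusters are
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Mean silhouette coefficient over all samples (Euclidean distance by default)
score = silhouette_score(X, labels)
print(f'Silhouette score: {score:.3f}')
```

Comparing this score across several values of K is a common complement to the Elbow Method.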

Check Your Understanding

❓ How does unsupervised learning handle edge cases?

Ignores them
Applies regularization
Removes them
Duplicates them

❓ What is the computational complexity of a clustering algorithm?

O(n)
O(n²)
O(log n)
Depends on implementation

❓ Which hyperparameter is most critical for unsupervised learning?

Learning rate
Batch size
Epochs
All equally important
