Hierarchical Clustering Fundamentals
Duration: 5 min
This module covers the fundamentals of hierarchical clustering, an unsupervised learning technique that groups similar data points into a nested hierarchy of clusters. It is useful for tasks such as data segmentation, anomaly detection, and exploratory data analysis. The module covers the principles, the two types (agglomerative and divisive), and a practical implementation in Python.
Agglomerative Hierarchical Clustering
Agglomerative hierarchical clustering is a bottom-up approach: each data point starts in its own cluster, and at each step the two closest clusters, as measured by a linkage criterion such as Ward, single, complete, or average linkage, are merged. The process continues until all data points belong to a single cluster, and the resulting sequence of merges can be drawn as a dendrogram. The method is widely used because it is simple and its output is easy to interpret.
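To make the merge procedure concrete, here is a minimal from-scratch sketch. It assumes single linkage (the distance between two clusters is the distance between their closest members) because that is the simplest rule to write down; the module's own examples use Ward linkage instead. The function name naive_agglomerative and the toy data are hypothetical, and the cubic-time loop is meant for illustration only, not for real workloads.

```python
import numpy as np

def naive_agglomerative(points, n_clusters=1):
    """Merge the two closest clusters until n_clusters remain (single linkage)."""
    points = np.asarray(points, dtype=float)
    clusters = [[i] for i in range(len(points))]  # every point starts alone
    merges = []  # (members_a, members_b, distance) for each merge, in order
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members
                diffs = points[clusters[a]][:, None, :] - points[clusters[b]][None, :, :]
                d = np.sqrt((diffs ** 2).sum(axis=-1)).min()
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((list(clusters[a]), list(clusters[b]), float(d)))
        clusters[a] = clusters[a] + clusters[b]  # merge b into a
        del clusters[b]
    return clusters, merges

# Hypothetical toy data: two well-separated pairs of points
toy = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.5]])
final_clusters, merge_history = naive_agglomerative(toy, n_clusters=2)
print(final_clusters)
print(merge_history)
```

Stopping at n_clusters=1 instead would record the full merge history, which is the information a dendrogram draws.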
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.datasets import make_blobs
# Generate sample data
X, _ = make_blobs(n_samples=50, n_features=2, centers=4, cluster_std=0.60, random_state=0)
# Perform agglomerative hierarchical clustering
linked = linkage(X, 'ward')
# Plot dendrogram
plt.figure(figsize=(10, 7))
# color_threshold=0 draws every link in a single color
dendrogram(linked, orientation='top', color_threshold=0, show_leaf_counts=True)
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Sample index')
plt.ylabel('Distance')
plt.show()
A dendrogram plot showing the hierarchical clustering of the data points.
Divisive Hierarchical Clustering
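Divisive hierarchical clustering is a top-down approach: all data points start in one cluster, which is recursively split into smaller clusters. It is less common than agglomerative clustering, partly because choosing the best split of a cluster is more expensive than choosing the best pair to merge, but it can be useful when the coarse, top-level structure matters most. Note that SciPy's linkage function builds its tree agglomeratively, so the SciPy example in this section builds a Ward tree bottom-up and then cuts it into four flat clusters rather than running a true divisive algorithm.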
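A genuinely top-down procedure can be sketched by repeatedly bisecting clusters. The version below is an illustration under stated assumptions, not a standard library routine: it always splits the cluster with the largest within-cluster sum of squared errors, and it uses 2-means (scikit-learn's KMeans with n_clusters=2) as the splitting rule. The function name bisecting_divisive is hypothetical. Recent scikit-learn versions also ship a BisectingKMeans estimator built on a similar idea.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def bisecting_divisive(X, n_clusters, random_state=0):
    """Top-down sketch: repeatedly split the cluster with the largest SSE in two."""
    labels = np.zeros(len(X), dtype=int)  # all points start in cluster 0
    for new_label in range(1, n_clusters):
        # Within-cluster sum of squared distances to the centroid, per cluster
        sse = [((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum()
               for k in range(new_label)]
        target = int(np.argmax(sse))
        idx = np.flatnonzero(labels == target)
        # Assumed splitting rule: 2-means on the members of the chosen cluster
        halves = KMeans(n_clusters=2, n_init=10,
                        random_state=random_state).fit_predict(X[idx])
        labels[idx[halves == 1]] = new_label  # one half keeps the old label
    return labels

X_demo, _ = make_blobs(n_samples=50, n_features=2, centers=4,
                       cluster_std=0.60, random_state=0)
divisive_labels = bisecting_divisive(X_demo, n_clusters=4)
print(np.bincount(divisive_labels))
```

Each pass through the loop adds exactly one cluster, so the order of the splits is itself a top-down hierarchy.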
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage, cut_tree
from sklearn.datasets import make_blobs
# Generate sample data
X, _ = make_blobs(n_samples=50, n_features=2, centers=4, cluster_std=0.60, random_state=0)
# Build the hierarchy. Note: linkage() is agglomerative (bottom-up);
# SciPy does not provide a divisive algorithm.
linked = linkage(X, 'ward')
# Cut the tree into 4 flat clusters
clusters = cut_tree(linked, n_clusters=4).reshape(-1,)
# Plot the clusters
plt.figure(figsize=(10, 7))
plt.scatter(X[:, 0], X[:, 1], c=clusters, cmap='viridis')
plt.title('Hierarchical Clustering: Ward Tree Cut into 4 Clusters')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()
💡 Tip: To choose the number of clusters, use domain knowledge, look for a large jump in merge distance on the dendrogram, or apply a technique such as the elbow method.
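One way to act on the tip above is to read the merge distances straight from the linkage matrix and look for the largest jump between consecutive merges. The sketch below is a heuristic, not a definitive rule: the variable names are illustrative, and because the last few merges are usually the most expensive, the largest gap often suggests a small number of clusters. Treat the result as a starting point to check against the dendrogram and domain knowledge.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, n_features=2, centers=4,
                  cluster_std=0.60, random_state=0)
linked = linkage(X, 'ward')

# Column 2 of the linkage matrix holds the distance at which each merge
# happened; for Ward linkage these values are non-decreasing.
merge_distances = linked[:, 2]
gaps = np.diff(merge_distances)

# The merge just before the largest jump is the last "cheap" merge.
# After i + 1 merges of n points, n - (i + 1) clusters remain.
i = int(np.argmax(gaps))
suggested_k = len(X) - (i + 1)

# Flat cluster labels (numbered from 1) for the suggested cluster count
labels = fcluster(linked, t=suggested_k, criterion='maxclust')
print(suggested_k, np.bincount(labels)[1:])
```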
❓ What is the primary difference between agglomerative and divisive hierarchical clustering?
❓ Which linkage method is used in the provided code examples for hierarchical clustering?