Hierarchical Clustering Fundamentals

Duration: 5 min

This module covers the fundamentals of hierarchical clustering, an unsupervised learning technique that groups similar data points into a nested hierarchy of clusters. It is useful for tasks such as data segmentation, anomaly detection, and exploratory data analysis. The module covers the underlying principles, the two main types (agglomerative and divisive), and a practical implementation in Python.

Agglomerative Hierarchical Clustering

Agglomerative hierarchical clustering is a bottom-up approach: each data point starts in its own cluster and, at every step, the two closest clusters (as measured by a linkage criterion such as Ward's method) are merged. The process continues until all data points belong to a single cluster. The method is widely used because it is simple and its result, the dendrogram, is easy to interpret.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.datasets import make_blobs

# Generate sample data
X, _ = make_blobs(n_samples=50, n_features=2, centers=4, cluster_std=0.60, random_state=0)

# Perform agglomerative hierarchical clustering
linked = linkage(X, 'ward')

# Plot dendrogram
plt.figure(figsize=(10, 7))
dendrogram(linked, orientation='top', distance_sort='descending', show_leaf_counts=True)
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Sample index')
plt.ylabel('Distance')
plt.show()

A dendrogram plot showing the hierarchical clustering of the data points.
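
To make the merge process concrete, the sketch below prints the linkage matrix that SciPy returns, one row per merge. The five points are made-up illustrative data, and single linkage is used here (rather than Ward's method from the example above) only because its merge distances are easy to check by hand.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Five 2-D points (illustrative data): two tight pairs and one distant outlier
points = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.5], [20.0, 0.0]])

# Each row records one merge: [cluster_a, cluster_b, distance, size of new cluster]
merges = linkage(points, method='single')

n = len(points)
for step, (a, b, dist, size) in enumerate(merges):
    # Original points have ids 0..n-1; the cluster created at step i gets id n + i
    print(f"step {step}: merge {int(a)} and {int(b)} at distance {dist:.2f} "
          f"-> cluster {n + step} with {int(size)} points")
```

With n points there are always n - 1 merges, and the last row describes the single cluster that contains every point. The heights in a dendrogram are the values in the third column.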

Divisive Hierarchical Clustering

Divisive hierarchical clustering is a top-down approach: all data points start in one cluster, and clusters are recursively split into smaller ones. It is less common than agglomerative clustering, but it can be useful when only the first few splits are needed, because the procedure can stop early instead of building the full hierarchy.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage, cut_tree
from sklearn.datasets import make_blobs

# Generate sample data
X, _ = make_blobs(n_samples=50, n_features=2, centers=4, cluster_std=0.60, random_state=0)

# SciPy has no divisive routine, so build the hierarchy bottom-up with Ward linkage
linked = linkage(X, 'ward')

# Cut the tree from the top to form 4 clusters (a top-down view of the same hierarchy)
clusters = cut_tree(linked, n_clusters=4).reshape(-1,)

# Plot the clusters
plt.figure(figsize=(10, 7))
plt.scatter(X[:, 0], X[:, 1], c=clusters, cmap='viridis')
plt.title('Clusters from Cutting the Hierarchy (k=4)')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()
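
SciPy's `linkage` always builds the tree bottom-up, so a genuinely top-down procedure needs a different routine. The sketch below is one possible approach, not a standard library function: it repeatedly bisects the cluster with the largest within-cluster sum of squares using 2-means. The helper name `bisecting_split` and the choice of which cluster to split next are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Same sample data as in the examples above
X, _ = make_blobs(n_samples=50, n_features=2, centers=4, cluster_std=0.60, random_state=0)

def bisecting_split(X, n_clusters):
    """Hypothetical helper: top-down clustering by repeated 2-means splits."""
    labels = np.zeros(len(X), dtype=int)  # every point starts in cluster 0
    for new_label in range(1, n_clusters):
        # Pick the cluster with the largest within-cluster sum of squares
        sse = [((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum()
               for k in range(new_label)]
        target = int(np.argmax(sse))
        idx = np.flatnonzero(labels == target)
        # Split that cluster in two; one half keeps the old label, the other gets a new one
        halves = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[idx])
        labels[idx[halves == 1]] = new_label
    return labels

labels = bisecting_split(X, n_clusters=4)
print(np.bincount(labels))  # number of points in each cluster
```

Each pass through the loop corresponds to one split in the top-down hierarchy, so stopping after fewer iterations yields a coarser clustering without any wasted work.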

💡 Tip: To choose the number of clusters, use domain knowledge or a heuristic such as the elbow method. On a dendrogram, a large jump in merge distance often marks a natural place to cut.
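
One way to apply an elbow-style heuristic to a hierarchy is to look for the largest jump between consecutive merge distances. The sketch below assumes Ward linkage and considers only the last ten merges; both choices are illustrative, and the suggested value is a starting point that may well be coarser than the number of groups you expect.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

# Same sample data as in the examples above
X, _ = make_blobs(n_samples=50, n_features=2, centers=4, cluster_std=0.60, random_state=0)
linked = linkage(X, 'ward')

# Distances of the last 10 merges, largest (final merge) first
last = linked[::-1, 2][:10]

# gaps[i] is the jump in distance caused by the merge that goes
# from i + 2 clusters down to i + 1 clusters
gaps = last[:-1] - last[1:]

# Stop just before the merge with the largest jump
k = int(np.argmax(gaps)) + 2
print("Suggested number of clusters:", k)

# Flat cluster labels (1-based) for the suggested cut
labels = fcluster(linked, t=k, criterion='maxclust')
```

It is worth comparing the suggestion against the dendrogram itself, since a single dominant gap near the top of the tree can hide finer structure below it.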

❓ What is the primary difference between agglomerative and divisive hierarchical clustering?

Agglomerative starts with all points in one cluster, divisive starts with each point in its own cluster
Agglomerative starts with each point in its own cluster, divisive starts with all points in one cluster
Agglomerative uses a top-down approach, divisive uses a bottom-up approach
Agglomerative is more complex than divisive

❓ Which linkage method is used in the provided code examples for hierarchical clustering?

Single linkage
Complete linkage
Average linkage
Ward's method
