Comparing Clustering Algorithms

Duration: 7 min

This module compares several clustering algorithms, including K-Means, DBSCAN, and Hierarchical Clustering. Knowing the strengths and weaknesses of each algorithm helps you select the right method for your data and problem domain.

K-Means Clustering

K-Means is a centroid-based algorithm that partitions the dataset into K distinct, non-overlapping clusters, where K is chosen in advance. Starting from an initial set of centroids, it assigns each data point to the nearest centroid and then recomputes each centroid as the mean of the points assigned to it. These two steps repeat until the centroids stop moving (or move less than a small tolerance).

from sklearn.cluster import KMeans
import numpy as np

# Generate sample data
X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])

# Apply K-Means clustering
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)

# Print cluster labels
print(kmeans.labels_)

Output (which cluster receives label 0 is arbitrary, so the labels may appear swapped):

[0 0 0 1 1 1]
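To make the assign-then-update loop described above concrete, here is a minimal NumPy sketch of it. This is an illustration only, not scikit-learn's implementation: the function name kmeans_sketch, the random choice of starting centroids, and the stopping check are assumptions made for this example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=0):
    """Illustrative assign/update loop; hypothetical helper, not sklearn's code."""
    rng = np.random.default_rng(seed)
    # Assumption: start from k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iters):
        # Assignment step: Euclidean distance from every point to every centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids have stabilized.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])
labels, centroids = kmeans_sketch(X, k=2)
print(labels)
print(centroids)
```

Because the result depends on the starting centroids, a single run like this can settle on a poorer grouping than the left/right split shown above. Library implementations typically reduce this risk with smarter initialization and multiple restarts.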

DBSCAN Clustering

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) groups together points that are closely packed and marks points that lie alone in low-density regions as outliers (noise). It requires two parameters: eps (the maximum distance between two samples for them to be considered part of the same neighborhood) and min_samples (the number of samples, counting the point itself, that must fall within a point's eps-neighborhood for it to be considered a core point).

from sklearn.cluster import DBSCAN
import numpy as np

# Generate sample data
X = np.array([[1, 2], [2, 2], [2, 3],
              [8, 7], [8, 8], [25, 80]])

# Apply DBSCAN clustering
dbsc = DBSCAN(eps=3, min_samples=2).fit(X)

# Print cluster labels
print(dbsc.labels_)
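The labels from the example above can be read as follows. This short sketch repeats the same data and parameters so that it runs on its own; the variable names noise_mask and n_clusters are introduced here only for illustration. In scikit-learn, noise points receive the label -1 and the indices of core points are exposed as core_sample_indices_.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Same data and parameters as the example above
X = np.array([[1, 2], [2, 2], [2, 3],
              [8, 7], [8, 8], [25, 80]])
dbsc = DBSCAN(eps=3, min_samples=2).fit(X)

labels = dbsc.labels_
noise_mask = labels == -1                      # noise points are labelled -1
n_clusters = len(set(labels.tolist()) - {-1})  # count clusters, ignoring noise

print(labels.tolist())                     # [0, 0, 0, 1, 1, -1]
print(n_clusters)                          # 2
print(X[noise_mask].tolist())              # [[25, 80]]
print(dbsc.core_sample_indices_.tolist())  # [0, 1, 2, 3, 4]
```

The isolated point [25, 80] has no other sample within eps=3, so it is neither a core point nor reachable from one, and DBSCAN reports it as noise instead of forcing it into a cluster.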

💡 Tip: When using DBSCAN, tune eps and min_samples carefully. If eps is too small, most points end up labeled as noise; if it is too large, distinct clusters merge into one.
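As a rough illustration of how sensitive the result is to eps, the sketch below reruns DBSCAN on the same six points with several eps values and counts clusters and noise points. The helper summarize and the particular eps values are assumptions chosen for this example, not recommended settings.

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 2], [2, 2], [2, 3],
              [8, 7], [8, 8], [25, 80]])

def summarize(eps, min_samples=2):
    """Hypothetical helper: return (number of clusters, number of noise points)."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit(X).labels_
    n_clusters = len(set(labels.tolist()) - {-1})
    n_noise = int((labels == -1).sum())
    return n_clusters, n_noise

for eps in (0.5, 3, 10, 100):
    print(eps, summarize(eps))
# 0.5 (0, 6)   eps too small: every point is noise
# 3   (2, 1)   two clusters plus one outlier
# 10  (1, 1)   the two groups merge into one cluster
# 100 (1, 0)   even the outlier is absorbed
```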

❓ What is the primary criterion K-Means uses to assign data points to clusters?

Distance to the nearest data point
Density of data points
Distance to the nearest centroid
Random assignment

❓ Which parameter in DBSCAN controls the maximum distance between two samples for them to be considered as in the same neighborhood?

min_samples
max_dist
eps
density_param
