Comparing Clustering Algorithms
Duration: 7 min
This module compares three clustering algorithms: K-Means, DBSCAN, and Hierarchical Clustering. Knowing the strengths and weaknesses of each algorithm helps you select the appropriate method for your specific data and problem domain.
K-Means Clustering
K-Means is a centroid-based algorithm that partitions the dataset into K distinct, non-overlapping clusters, where K is chosen in advance. It works by assigning each data point to the nearest cluster centroid (by Euclidean distance) and then recalculating each centroid as the mean of the points currently assigned to it. These two steps are repeated until the centroids stabilize.
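The assign-and-update loop described above can be sketched in plain NumPy. This is a minimal illustration, not scikit-learn's implementation: the function name `kmeans_sketch`, the hand-picked starting centroids, and the rule of leaving an empty cluster's centroid unchanged are all assumptions made for the example.

```python
import numpy as np

# Same toy data as the scikit-learn example in this section
X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]], dtype=float)

def kmeans_sketch(X, init_centroids, max_iter=100):
    """Hypothetical helper: repeat assignment and update until centroids stop moving."""
    centroids = np.asarray(init_centroids, dtype=float).copy()
    for _ in range(max_iter):
        # Assignment step: Euclidean distance from every point to every centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        new_centroids = centroids.copy()
        for k in range(len(centroids)):
            members = X[labels == k]
            if len(members) > 0:  # assumption: an empty cluster keeps its old centroid
                new_centroids[k] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break  # centroids have stabilized
        centroids = new_centroids
    return labels, centroids

# Start from two of the data points (an arbitrary choice for this sketch)
labels, centroids = kmeans_sketch(X, init_centroids=X[[0, 3]])
print(labels)
print(centroids)
```

Because the result depends on the starting centroids, library implementations typically use a smarter initialization and may run several restarts; the sketch skips both for brevity.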
from sklearn.cluster import KMeans
import numpy as np
# Generate sample data
X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])
# Apply K-Means clustering
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
# Print cluster labels
print(kmeans.labels_)
[0 0 0 1 1 1]
DBSCAN Clustering
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) groups together points that are closely packed and marks points that lie alone in low-density regions as outliers (noise). It requires two parameters: eps (the maximum distance between two samples for them to be considered as in the same neighborhood) and min_samples (the minimum number of samples in a neighborhood, counting the point itself, for a point to be considered a core point).
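To make the roles of eps and min_samples concrete, the sketch below performs only the first step of DBSCAN: counting each point's neighbors within eps and flagging core points. It is an illustration in plain NumPy under the definitions above, not the full algorithm (it does not expand clusters or assign labels), and the variable names are chosen for this example.

```python
import numpy as np

# Same toy data as the scikit-learn example in this section
X = np.array([[1, 2], [2, 2], [2, 3],
              [8, 7], [8, 8], [25, 80]], dtype=float)
eps, min_samples = 3, 2

# Pairwise Euclidean distances between all points
dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

# Neighborhood size of each point; the point itself is counted (distance 0)
neighbor_counts = (dists <= eps).sum(axis=1)

# A point is a core point if its neighborhood holds at least min_samples points
is_core = neighbor_counts >= min_samples

print(neighbor_counts)  # [3 3 3 2 2 1]
print(is_core)          # [ True  True  True  True  True False]
```

The last point, [25, 80], has no neighbor within eps other than itself, so it is not a core point and is not reachable from one, which is why DBSCAN treats it as noise.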
from sklearn.cluster import DBSCAN
import numpy as np
# Generate sample data
X = np.array([[1, 2], [2, 2], [2, 3],
              [8, 7], [8, 8], [25, 80]])
# Apply DBSCAN clustering
dbsc = DBSCAN(eps=3, min_samples=2).fit(X)
# Print cluster labels
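print(dbsc.labels_)
💡 Tip: When using DBSCAN, carefully tune the eps and min_samples parameters to avoid over-clustering or under-clustering your data. An eps that is too small marks most points as noise, while an eps that is too large merges distinct clusters into one.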
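One way to see how sensitive DBSCAN is to eps is to rerun it with a few values and count clusters and noise points. The sketch below does this on the sample data from this section; the helper name `summarize` and the three eps values are arbitrary choices for illustration, not recommended settings.

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 2], [2, 2], [2, 3],
              [8, 7], [8, 8], [25, 80]])

def summarize(eps, min_samples=2):
    """Hypothetical helper: return (number of clusters, number of noise points)."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit(X).labels_
    # DBSCAN labels noise points as -1, so exclude that label when counting clusters
    n_clusters = len(set(labels.tolist()) - {-1})
    n_noise = int(np.sum(labels == -1))
    return n_clusters, n_noise

results = {eps: summarize(eps) for eps in (0.5, 3, 10)}
for eps, (n_clusters, n_noise) in results.items():
    print(f"eps={eps}: {n_clusters} cluster(s), {n_noise} noise point(s)")
```

On this data, the smallest eps leaves every point as noise, the middle value separates the two dense groups, and the largest value merges them into a single cluster.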
❓ What is the primary criterion K-Means uses to assign data points to clusters?
❓ Which parameter in DBSCAN controls the maximum distance between two samples for them to be considered as in the same neighborhood?