Unsupervised Learning in Practice: Case Studies
Duration: 8 min
This module covers practical applications of unsupervised learning techniques such as K-Means, DBSCAN, Hierarchical Clustering, PCA, t-SNE, and Autoencoders. These methods identify patterns and structure in data without predefined labels, which makes them valuable for exploratory data analysis and feature extraction.
K-Means Clustering
K-Means is a popular unsupervised learning algorithm that partitions a dataset into K distinct, non-overlapping subsets. Each subset is a cluster defined by its centroid, the mean of the points assigned to it. The algorithm alternates between two steps: it assigns every data point to its nearest centroid, then recalculates each centroid from the current assignments. It repeats these steps until the assignments stop changing or an iteration limit is reached.
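To make the assign-then-update loop concrete, here is a minimal NumPy sketch of it. The function name `kmeans_sketch`, its parameters, the random initialization from data points, and the toy data are all assumptions made for illustration; this is not how scikit-learn implements K-Means internally.

```python
import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=0):
    """Illustrative assign/update loop (hypothetical helper, not a library API)."""
    rng = np.random.default_rng(seed)
    # Assumption: initialize centroids as k distinct data points picked at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: index of the nearest centroid for every point
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving
        centroids = new_centroids
    # Final assignment so labels match the returned centroids
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
    return dists.argmin(axis=1), centroids

# Toy data: two well-separated groups of 50 points each
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 0.5, size=(50, 2)),
    rng.normal(5.0, 0.5, size=(50, 2)),
])
labels, centroids = kmeans_sketch(X, k=2)
print('Centroids:', centroids)
```

scikit-learn's `KMeans` runs the same kind of loop but adds k-means++ initialization and optional restarts, so it is the better choice outside of teaching examples.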
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
# Generate sample data
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
# Apply K-Means clustering
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
# Get cluster labels and centroids
labels = kmeans.labels_
centroids = kmeans.cluster_centers_
print('Cluster labels:', labels)
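print('Centroids:', centroids)
Sample output (illustrative; the exact labels and centroid coordinates depend on the generated data):
Cluster labels: [3 1 2... 0 3 2]
Centroids: [[ 9.99131907 -0.01737375]
[ 0.03106249 9.98395739]
[-9.98469361 0.02302341]
[-0.0136726 -9.9923783 ]]
DBSCAN Clustering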
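DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm. Unlike K-Means, DBSCAN does not require specifying the number of clusters beforehand. It forms clusters from dense regions of the data by sorting points into three types. A core point has at least `min_samples` points (itself included, in scikit-learn) within distance `eps`. A border point lies within `eps` of a core point but has too few neighbors to be a core point itself. Every other point is treated as noise, which scikit-learn labels -1.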
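The three point types can be worked out directly from pairwise distances. The sketch below does only that classification step, not the full clustering, and it assumes scikit-learn's convention that a point counts as its own neighbor. The variable names and the parameter values `eps = 0.3` and `min_samples = 5` are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
eps, min_samples = 0.3, 5  # illustrative values

# Pairwise Euclidean distances between all points (300 x 300 matrix)
dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
within_eps = dists <= eps  # each point is within eps of itself

# Core: at least min_samples points (including itself) within eps
is_core = within_eps.sum(axis=1) >= min_samples
# Border: not core, but within eps of at least one core point
is_border = ~is_core & (within_eps & is_core[None, :]).any(axis=1)
# Noise: neither core nor border
is_noise = ~is_core & ~is_border

print('core:', is_core.sum(), 'border:', is_border.sum(), 'noise:', is_noise.sum())
```

DBSCAN then builds clusters by linking core points that lie within `eps` of each other and attaching each border point to a neighboring core point's cluster.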
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
# Generate sample data
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
# Apply DBSCAN clustering
dbsc = DBSCAN(eps=0.3, min_samples=5).fit(X)
# Get cluster labels
labels = dbsc.labels_
print('Cluster labels:', labels)
💡 Tip: When using DBSCAN, choose the `eps` (epsilon) and `min_samples` parameters carefully to ensure meaningful clusters. Too large an `eps` can merge distinct clusters, while too small a value can split the data into many small clusters or mark most points as noise.
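One rough way to see this sensitivity is to refit DBSCAN over a few `eps` values and count the resulting clusters and noise points. The helper name `summarize` and the particular `eps` values below are assumptions chosen for illustration, not recommended settings.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

def summarize(eps, min_samples=5):
    """Return (number of clusters, number of noise points) for one eps value."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit(X).labels_
    n_clusters = len(set(labels) - {-1})  # label -1 marks noise, not a cluster
    n_noise = int((labels == -1).sum())
    return n_clusters, n_noise

# Illustrative sweep from a very small to a very large neighborhood radius
results = {eps: summarize(eps) for eps in (0.05, 0.1, 0.3, 1.0)}
for eps, (n_clusters, n_noise) in results.items():
    print(f'eps={eps}: {n_clusters} clusters, {n_noise} noise points')
```

Comparing the rows shows where the clustering stabilizes: the smallest radius leaves many points as noise, while the largest connects both moons into a single cluster.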
❓ What is the primary difference between K-Means and DBSCAN clustering?
❓ In DBSCAN, what does the parameter `eps` control?