Unsupervised Learning in Practice: Case Studies

Duration: 8 min

This module covers practical applications of unsupervised learning techniques such as K-Means, DBSCAN, Hierarchical Clustering, PCA, t-SNE, and Autoencoders. These methods find patterns and structure in data without predefined labels, which makes them valuable for exploratory data analysis and feature extraction.

K-Means Clustering

K-Means is a popular unsupervised learning algorithm that partitions a dataset into K distinct, non-overlapping clusters, each defined by its centroid (the mean of the points in that cluster). The algorithm alternates two steps until the assignments stop changing: it assigns every data point to its nearest centroid, then recalculates each centroid from the points currently assigned to it.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate sample data
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Apply K-Means clustering (n_init is set explicitly so behavior is the same across scikit-learn versions)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# Get cluster labels and centroids
labels = kmeans.labels_
centroids = kmeans.cluster_centers_

print('Cluster labels:', labels)
print('Centroids:', centroids)

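Sample output (illustrative; the labels and centroid values you see depend on the generated data and your scikit-learn version):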

Cluster labels: [3 1 2... 0 3 2]
Centroids: [[ 9.99131907 -0.01737375]
 [ 0.03106249  9.98395739]
 [-9.98469361  0.02302341]
 [-0.0136726  -9.9923783 ]]
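To make the assign-then-update loop described above concrete, here is a minimal NumPy sketch of it. The function name `kmeans_sketch` and its parameters are hypothetical, and the random initialization from data points is an assumption chosen for simplicity; scikit-learn's `KMeans` uses k-means++ initialization and several restarts by default, so this is not a reproduction of its implementation.

```python
import numpy as np
from sklearn.datasets import make_blobs


def kmeans_sketch(X, k, n_iters=100, seed=0):
    """Plain K-Means loop: assign points, then recompute centroids."""
    rng = np.random.default_rng(seed)
    # Assumption: start from k distinct data points picked at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: distances from every point to every centroid, shape (n, k).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points.
        # A cluster that lost all its points keeps its previous centroid.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids


X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
labels, centroids = kmeans_sketch(X, k=4)
print('Centroid array shape:', centroids.shape)
```

Because the result depends on the starting centroids, a single run like this can settle in a poor local optimum, which is one reason library implementations typically run several initializations and keep the best one.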

DBSCAN Clustering

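DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm. Unlike K-Means, DBSCAN does not require specifying the number of clusters beforehand. It forms clusters from dense regions of data and sorts every point into one of three kinds: core points (at least min_samples points, counting the point itself, within distance eps), border points (within eps of a core point but not core themselves), and noise points (neither, labeled -1).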

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Generate sample data
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# Apply DBSCAN clustering
dbsc = DBSCAN(eps=0.3, min_samples=5).fit(X)

# Get cluster labels
labels = dbsc.labels_

print('Cluster labels:', labels)
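The core, border, and noise categories described above can be recovered from a fitted model. The following is a small sketch using the `labels_` and `core_sample_indices_` attributes of scikit-learn's `DBSCAN`; the mask variable names are illustrative only.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
dbsc = DBSCAN(eps=0.3, min_samples=5).fit(X)
labels = dbsc.labels_

# Core points: indices reported by the fitted model.
core_mask = np.zeros(len(X), dtype=bool)
core_mask[dbsc.core_sample_indices_] = True

# Noise points: DBSCAN labels them -1.
noise_mask = labels == -1

# Border points: assigned to a cluster but not core themselves.
border_mask = ~core_mask & ~noise_mask

# Number of clusters, not counting the noise label.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)

print('Clusters found:', n_clusters)
print('Core points:', core_mask.sum())
print('Border points:', border_mask.sum())
print('Noise points:', noise_mask.sum())
```

Counting the three groups is a quick sanity check on a parameter choice: a large noise count suggests eps may be too small or min_samples too large for the data.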

💡 Tip: When using DBSCAN, choose the eps (epsilon) and min_samples parameters carefully to get meaningful clusters. Too large an eps can merge distinct clusters, while too small an eps can split clusters into fragments or label most points as noise (-1).
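One common heuristic for picking a starting eps (not the only approach, and not something DBSCAN does for you) is to look at each point's distance to its k-th nearest neighbor, with k tied to min_samples, and pick a value near where the sorted distances start to rise sharply. The sketch below computes those sorted distances with scikit-learn's `NearestNeighbors`; the variable names and the choice to print percentiles instead of plotting the curve are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
min_samples = 5

# Querying the fitted data returns each point as its own nearest neighbor
# (distance 0), which matches DBSCAN counting the point itself in min_samples.
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = nn.kneighbors(X)

# Distance to the farthest of the min_samples neighbors, sorted ascending.
k_distances = np.sort(distances[:, -1])

print('Median k-distance:', np.median(k_distances))
print('95th percentile k-distance:', np.percentile(k_distances, 95))
```

Plotting `k_distances` and looking for the "knee" of the curve gives a candidate eps; it is still worth checking the resulting cluster and noise counts before settling on a value.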

❓ What is the primary difference between K-Means and DBSCAN clustering?

- Both require the number of clusters to be specified
- K-Means requires the number of clusters, DBSCAN does not
- DBSCAN requires the number of clusters, K-Means does not
- Both are density-based clustering algorithms

❓ In DBSCAN, what does the parameter `eps` control?

- The maximum distance between two samples for them to be considered as in the same neighborhood
- The number of samples in a neighborhood for a point to be considered as a core point
- The learning rate of the algorithm
- The random state for reproducibility
