Evaluating Unsupervised Learning Models

Duration: 7 min

This module covers how to evaluate unsupervised learning models, that is, how to judge the effectiveness of clustering and dimensionality reduction techniques when no ground-truth labels are available. We will explore methods for evaluating models such as K-Means, DBSCAN, Hierarchical Clustering, PCA, t-SNE, and Autoencoders, and see why this evaluation is essential for making informed decisions in data analysis.

Evaluating K-Means Clustering

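K-Means clustering is an iterative algorithm that partitions a dataset into K distinct, non-overlapping clusters. Evaluating K-Means involves assessing the compactness and separation of the clusters. Common metrics include the Within-Cluster Sum of Squares (WCSS), which is the sum of squared distances from each sample to its cluster centroid (lower means more compact), and the Silhouette Score, which compares each sample's mean distance to the other members of its own cluster with its mean distance to the samples in the nearest neighboring cluster. The Silhouette Score ranges from -1 to 1, and values near 1 indicate compact, well-separated clusters.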

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Generate sample data
X, _ = make_blobs(n_samples=100, n_features=2, centers=3, random_state=42)

# Apply K-Means clustering
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X)

# Evaluate the model
wcss = kmeans.inertia_
silhouette = silhouette_score(X, kmeans.labels_)

print(f'Within-Cluster Sum of Squares: {wcss}')
print(f'Silhouette Score: {silhouette}')

Example output (illustrative; exact values depend on the data and your scikit-learn version):

Within-Cluster Sum of Squares: 85.32555555555556
Silhouette Score: 0.53456789
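
To make the two metrics more concrete, the sketch below recomputes WCSS by hand from the fitted centroids and then compares WCSS and the Silhouette Score across several values of K. The helper name `manual_wcss`, the `n_init=10` setting, and the range of K values are illustrative assumptions, not part of the original example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Same synthetic data as in the example above
X, _ = make_blobs(n_samples=100, n_features=2, centers=3, random_state=42)


def manual_wcss(X, labels, centers):
    """Sum of squared distances from each sample to its assigned centroid."""
    return float(np.sum((X - centers[labels]) ** 2))


results = {}
for k in range(2, 7):  # illustrative range of K values (assumption)
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    results[k] = {
        "wcss": km.inertia_,
        "manual_wcss": manual_wcss(X, km.labels_, km.cluster_centers_),
        "silhouette": silhouette_score(X, km.labels_),
    }
    print(f"K={k}: WCSS={results[k]['wcss']:.2f}, "
          f"Silhouette={results[k]['silhouette']:.3f}")

# Pick the K whose clustering has the highest Silhouette Score
best_k = max(results, key=lambda k: results[k]["silhouette"])
print(f"K with the highest Silhouette Score: {best_k}")
```

WCSS tends to keep shrinking as K grows, so it is usually read by looking for the point where the improvement levels off, whereas the Silhouette Score has a fixed range and can be compared directly across values of K.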

Evaluating DBSCAN Clustering

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) groups together points that are packed closely together, marking as outliers points that lie alone in low-density regions. Evaluating DBSCAN involves assessing the number of clusters formed and the number of noise points, as well as using metrics like the Silhouette Score for the formed clusters.

from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

# Apply DBSCAN clustering
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan.fit(X)

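# Evaluate the model (X is the sample data from the K-Means example; label -1 marks noise)
labels = dbscan.labels_
mask = labels != -1  # keep only samples assigned to a cluster
clusters = len(set(labels[mask]))
noise = int((~mask).sum())
# Score only the non-noise samples; the Silhouette Score needs at least two clusters
silhouette = silhouette_score(X[mask], labels[mask]) if clusters >= 2 else float('nan')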

print(f'Number of Clusters: {clusters}')
print(f'Number of Noise Points: {noise}')
print(f'Silhouette Score: {silhouette}')

💡 Tip: When evaluating DBSCAN, ensure that the parameters eps and min_samples are tuned appropriately for your dataset to avoid misclassification of noise points as clusters.
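
One way to act on this tip is to sweep a few candidate eps values and watch how the cluster and noise counts change. The sketch below does this under stated assumptions: the helper name `summarize_dbscan` and the candidate eps values are hypothetical choices for illustration, and the data is regenerated so the block runs on its own.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Same synthetic data as in the K-Means example
X, _ = make_blobs(n_samples=100, n_features=2, centers=3, random_state=42)


def summarize_dbscan(X, eps, min_samples=5):
    """Fit DBSCAN and return (n_clusters, n_noise, silhouette or None)."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    mask = labels != -1  # True for samples assigned to a cluster
    n_clusters = len(set(labels[mask]))
    n_noise = int(np.sum(~mask))
    # The Silhouette Score needs at least two clusters among the non-noise samples
    score = silhouette_score(X[mask], labels[mask]) if n_clusters >= 2 else None
    return n_clusters, n_noise, score


summaries = {}
for eps in (0.3, 0.5, 1.0, 1.5):  # candidate values chosen for illustration
    summaries[eps] = summarize_dbscan(X, eps)
    n_clusters, n_noise, score = summaries[eps]
    shown = "n/a" if score is None else f"{score:.3f}"
    print(f"eps={eps}: clusters={n_clusters}, noise={n_noise}, silhouette={shown}")
```

With min_samples held fixed, a larger eps can only turn noise points into cluster members, never the reverse, so the noise count is a quick signal of whether eps is too small for the data.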

❓ What metric is commonly used to evaluate the compactness of clusters in K-Means?

- Confusion Matrix
- Within-Cluster Sum of Squares
- ROC Curve
- Precision Score

❓ Which parameter in DBSCAN controls the maximum distance between two samples for them to be considered as in the same neighborhood?

- min_samples
- eps
- metric
- algorithm
