Project: Applying Unsupervised Learning to a Real-World Dataset

Duration: 10 min

In this module, you will learn how to apply unsupervised learning techniques to a real-world dataset. Unsupervised learning discovers hidden patterns and structure in data that has no labeled outcomes. The module covers K-Means, DBSCAN, Hierarchical Clustering, PCA, t-SNE, and Autoencoders, giving you practical skills for analyzing complex datasets.

K-Means Clustering

K-Means is a popular clustering algorithm that partitions data into K distinct clusters. Each data point is assigned to the cluster whose mean (centroid) is nearest, and that centroid serves as the prototype of the cluster. The method works best when clusters are roughly spherical and similar in size, and it is widely used in market segmentation, image compression, and anomaly detection.

import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

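# Small hand-written 2-D dataset (six points)
X = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]])

# Apply K-Means clustering with K=2.
# n_init is set explicitly because its default differs across scikit-learn versions.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)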

# Plot the results
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, cmap='viridis')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', marker='X')
plt.show()

A scatter plot showing two clusters with data points colored according to their cluster labels and cluster centers marked with red 'X'.
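
To make the "nearest mean" idea concrete, here is a minimal NumPy sketch of the two steps K-Means alternates between: assigning each point to its nearest center, then moving each center to the mean of its points. This is an illustration only, not how scikit-learn implements it. The helper names `assign_to_nearest` and `update_centers` are hypothetical, and starting from two hand-picked data points is an assumption made for simplicity.

```python
import numpy as np

def assign_to_nearest(X, centers):
    """Return, for each point, the index of the closest center."""
    # distances has shape (n_samples, n_clusters)
    distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return distances.argmin(axis=1)

def update_centers(X, labels, n_clusters):
    """Move each center to the mean of the points assigned to it."""
    # Assumes no cluster ends up empty, which holds for this small dataset.
    return np.array([X[labels == k].mean(axis=0) for k in range(n_clusters)])

X = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]])

# Assumption: start from two of the data points rather than a smarter initialization.
centers = X[[0, 3]].astype(float)

for _ in range(10):
    labels = assign_to_nearest(X, centers)
    new_centers = update_centers(X, labels, n_clusters=2)
    if np.allclose(new_centers, centers):
        break  # centers stopped moving, so the assignment is stable
    centers = new_centers

print(labels)
print(centers)
```

On this dataset the loop stabilizes after two passes. Cluster numbering depends on the starting centers, so the labels may be swapped relative to what `KMeans` reports.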

DBSCAN Clustering

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm. It groups points that are closely packed and marks points that lie alone in low-density regions as outliers (noise). Unlike K-Means, it does not require the number of clusters in advance, which makes it useful for identifying clusters of various shapes and sizes and for detecting noise in the data.

import numpy as np
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt

# Small hand-written 2-D dataset; the last point lies far from the others
X = np.array([[1, 2], [2, 2], [2, 3],
              [8, 7], [8, 8], [25, 80]])

# Apply DBSCAN clustering
dbscan = DBSCAN(eps=3, min_samples=2).fit(X)

# Plot the results (points labeled -1 are noise)
plt.scatter(X[:, 0], X[:, 1], c=dbscan.labels_, cmap='viridis')
plt.show()
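
The plot alone does not say which points DBSCAN treated as noise. As a small sketch using the same data and parameters, the labels can be inspected directly: scikit-learn assigns the label -1 to noise points. The variable names `noise_mask`, `n_clusters`, and `noise_points` are chosen here for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 2], [2, 2], [2, 3],
              [8, 7], [8, 8], [25, 80]])

labels = DBSCAN(eps=3, min_samples=2).fit(X).labels_

# Noise points carry the label -1; every other label is a cluster index.
noise_mask = labels == -1
n_clusters = len(set(labels.tolist()) - {-1})
noise_points = X[noise_mask]

print(labels)        # [ 0  0  0  1  1 -1]
print(n_clusters)    # 2
print(noise_points)  # [[25 80]]
```

The isolated point at (25, 80) has no neighbor within eps=3, so it is left out of both clusters.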

💡 Tip: When using DBSCAN, carefully choose the eps (maximum distance between two samples) and min_samples (minimum number of samples in a neighborhood for a point to be considered as a core point) parameters to achieve the desired clustering results.
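
One common heuristic for picking eps, offered here as a sketch rather than a rule, is to look at each point's distance to its nearest neighbors and choose a value just above where most of those distances sit. The example below uses scikit-learn's `NearestNeighbors` on the same small dataset. Setting `n_neighbors` equal to `min_samples` is an assumption that relies on each query point being returned as its own first neighbor at distance 0, which matches how DBSCAN counts the point itself toward `min_samples`.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.array([[1, 2], [2, 2], [2, 3],
              [8, 7], [8, 8], [25, 80]])
min_samples = 2

# Querying the training data returns each point as its own first neighbor
# (distance 0), so the last column is the distance to the
# (min_samples - 1)-th other point.
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = nn.kneighbors(X)

# Sorted "k-distances": a sharp jump separates dense points from outliers.
k_distances = np.sort(distances[:, -1])
print(k_distances)
```

Here five of the six distances equal 1 and the last is about 74, so any eps between those values, such as the eps=3 used above, separates the dense points from the outlier. On real data the jump is usually less clear-cut and is often judged from a plot of the sorted distances.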

❓ What is the primary purpose of K-Means clustering?

- To reduce dimensionality
- To partition data into distinct clusters
- To perform regression analysis
- To detect anomalies

❓ Which parameter in DBSCAN determines the maximum distance between two samples for them to be considered as in the same neighborhood?

- min_samples
- eps
- n_clusters
- random_state
