Unsupervised Learning: Clustering
Duration: 5 min
This module covers unsupervised learning with a focus on clustering. You will learn the fundamental concepts of clustering, algorithms such as K-Means and hierarchical clustering, and how to apply them in Python. Clustering is useful for tasks such as customer segmentation, anomaly detection, and data exploration.
K-Means Clustering
K-Means is a popular clustering algorithm that partitions data into K distinct, non-overlapping clusters. It aims to minimize the within-cluster variance, that is, the sum of squared distances between each point and the centroid of its cluster. It works by initializing K centroids (for example, at random), assigning each data point to the nearest centroid, and then recalculating each centroid as the mean of the points assigned to it. The assignment and update steps repeat until the centroids stop moving.
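The loop described above can be sketched in plain NumPy. This is a minimal illustration, not scikit-learn's implementation: the function name `kmeans_sketch`, its parameters, and the choice to start from randomly picked data points are assumptions made for this example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=0):
    """Illustrative K-Means loop (hypothetical helper, not a library API)."""
    rng = np.random.default_rng(seed)
    # Random initialization: use k distinct data points as starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: index of the nearest centroid for every point.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        converged = np.allclose(new_centroids, centroids)
        centroids = new_centroids
        if converged:
            break
    return labels, centroids

X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]], dtype=float)
labels, centroids = kmeans_sketch(X, k=2)
print('Cluster labels:', labels)
print('Centroids:', centroids)
```

Because the starting centroids are random, different seeds can converge to different partitions, which is why library implementations typically run several initializations and keep the best one.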
from sklearn.cluster import KMeans
import numpy as np
# Generate sample data
X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])
# Apply K-Means clustering (n_init=10 tries 10 initializations and keeps the best result)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
# Print cluster labels and centroids
print('Cluster labels:', kmeans.labels_)
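print('Centroids:', kmeans.cluster_centers_)
Example output (the cluster numbering, and therefore the order of the centroids, may be swapped depending on the scikit-learn version):
Cluster labels: [0 0 0 1 1 1]
Centroids: [[1. 2.]
 [4. 2.]]
Hierarchical Clustering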
Hierarchical clustering builds a tree of nested clusters, which can be visualized as a dendrogram. The tree can be built without fixing the number of clusters beforehand; a specific number of clusters is obtained afterwards by cutting the tree at a chosen level. There are two main types: agglomerative (bottom-up) and divisive (top-down). Agglomerative clustering starts with each data point as its own cluster and iteratively merges the closest pair of clusters into larger ones.
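The bottom-up merging can be illustrated with a deliberately naive sketch. It assumes single linkage (the distance between two clusters is the distance between their closest members), which is simpler than the Ward linkage used elsewhere in this module; the function name `agglomerative_sketch` and the sample points `points` are hypothetical and chosen only for this example.

```python
import numpy as np

def agglomerative_sketch(X, n_clusters):
    """Naive single-linkage agglomerative clustering (illustrative only)."""
    # Start with every point in its own cluster, stored as lists of row indices.
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Merge the two closest clusters and keep the others unchanged.
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return clusters

# Hypothetical sample: three small, well-separated groups of two points each.
points = np.array([[0.0, 0.0], [0.0, 1.1],
                   [5.0, 5.0], [5.2, 6.0],
                   [10.0, 0.0], [10.3, 0.9]])
three = agglomerative_sketch(points, n_clusters=3)
two = agglomerative_sketch(points, n_clusters=2)
print(sorted(sorted(c) for c in three))
print(sorted(sorted(c) for c in two))
```

Stopping the loop earlier or later corresponds to cutting the tree at a different level. This brute-force search is only practical for tiny datasets; library implementations use far more efficient algorithms.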
from sklearn.cluster import AgglomerativeClustering
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt
# Generate sample data
X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])
# Apply Agglomerative Clustering (scikit-learn uses Ward linkage by default)
clustering = AgglomerativeClustering(n_clusters=2).fit(X)
print('Cluster labels:', clustering.labels_)
# Plot dendrogram
linked = linkage(X, 'ward')
plt.figure(figsize=(10, 7))
dendrogram(linked,
           orientation='top',
           distance_sort='descending',
           show_leaf_counts=True)
plt.show()
💡 Tip: When using K-Means, be mindful of the initial placement of centroids as it can affect the final clusters. Using techniques like K-Means++ for initialization can lead to more robust results.
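To make the tip concrete, here is a minimal sketch of the k-means++ seeding idea: the first centroid is drawn uniformly at random, and each further centroid is drawn with probability proportional to its squared distance from the nearest centroid chosen so far. The function name `kmeans_pp_init` and its parameters are assumptions for illustration; scikit-learn's own implementation differs in detail.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Sketch of k-means++ seeding (hypothetical helper, not a library API)."""
    rng = np.random.default_rng(seed)
    # First centroid: a data point chosen uniformly at random.
    chosen = [int(rng.integers(len(X)))]
    for _ in range(k - 1):
        # Squared distance from every point to its nearest chosen centroid.
        diffs = X[:, None, :] - X[chosen][None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        # Far-away points are more likely to be picked; already chosen
        # points have distance 0 and therefore probability 0.
        chosen.append(int(rng.choice(len(X), p=d2 / d2.sum())))
    return X[chosen]

X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]], dtype=float)
init_centroids = kmeans_pp_init(X, k=2)
print('Initial centroids:', init_centroids)
```

Spreading the starting centroids apart in this way tends to reduce the chance that two of them begin inside the same natural cluster.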
❓ What is the primary goal of K-Means clustering?
❓ Which type of hierarchical clustering starts with each data point as a separate cluster and merges them iteratively?