Unsupervised Learning for Image and Text Data

Duration: 7 min

This module delves into unsupervised learning techniques specifically tailored for image and text data. You'll learn about clustering algorithms like K-Means, DBSCAN, and Hierarchical Clustering, as well as dimensionality reduction techniques like PCA and t-SNE. Additionally, we'll explore Autoencoders for feature learning. Understanding these techniques is crucial for tasks like image segmentation, text clustering, and feature extraction.

K-Means Clustering

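K-Means is a popular clustering algorithm that partitions data into K clusters by minimizing the within-cluster sum of squares, that is, the total squared distance from each point to the centroid of its cluster. It is often used for image segmentation and text clustering. The algorithm alternates between two steps until the assignments stop changing: it assigns each data point to the nearest centroid, then recomputes each centroid as the mean of the points assigned to it. The number of clusters K must be chosen in advance.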
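For reference, the within-cluster sum of squares mentioned above can be written out explicitly. The notation here is an assumption introduced for illustration: C_k is the set of points assigned to cluster k, and mu_k is that cluster's centroid.

```latex
J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2,
\qquad
\mu_k = \frac{1}{\lvert C_k \rvert} \sum_{x_i \in C_k} x_i
```

The assignment step lowers J by moving each point to its nearest centroid, and the update step lowers J by moving each centroid to the mean of its points, so J never increases from one iteration to the next.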

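from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt

# Load the Iris dataset (150 samples, 4 features)
iris = load_iris()
X = iris.data

# Fit K-Means with 3 clusters.
# n_init is set explicitly because its default differs across scikit-learn versions.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans.fit(X)

# Plot the first two features (sepal length and width), colored by cluster.
# Clustering used all four features, so clusters may overlap in this 2-D view.
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, cmap='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='red', label='Centroids')
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1])
plt.title('K-Means Clustering')
plt.legend()
plt.show()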


A scatter plot showing three clusters of the Iris dataset with red dots representing the cluster centroids.
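To make the assign-and-update loop described above concrete, here is a minimal NumPy sketch of it. This is an illustration under stated assumptions, not how scikit-learn implements K-Means: the function name kmeans_sketch, the random initialization from data points, and the synthetic three-blob dataset are all hypothetical choices made for this example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=0):
    """Plain assign/update loop; returns labels, centroids, and the
    within-cluster sum of squares (WCSS) recorded at each iteration."""
    rng = np.random.default_rng(seed)
    # Assumption: initialize centroids as k distinct points drawn from the data
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    history = []
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Record the WCSS for the current assignment
        history.append(float((dists[np.arange(len(X)), labels] ** 2).sum()))
        # Update step: each centroid moves to the mean of its assigned points
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                new_centroids[j] = members.mean(axis=0)
        # Stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids, history

# Synthetic data (hypothetical): three 2-D Gaussian blobs of 50 points each
rng = np.random.default_rng(42)
blob_centers = np.array([[0.0, 0.0], [6.0, 6.0], [0.0, 6.0]])
X_demo = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in blob_centers])

labels, centroids, history = kmeans_sketch(X_demo, k=3)
print("iterations:", len(history))
print("WCSS per iteration:", [round(h, 2) for h in history])
```

The printed WCSS values should never increase from one iteration to the next, which is the property that guarantees the loop terminates. Because the result depends on the initial centroids, library implementations typically run several initializations and keep the best one.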

t-SNE for Dimensionality Reduction

t-Distributed Stochastic Neighbor Embedding (t-SNE) is a powerful technique for dimensionality reduction that is particularly well suited for the visualization of high-dimensional datasets. It works by converting similarities between data points to joint probabilities and trying to minimize the Kullback-Leibler divergence between the joint probabilities of the low-dimensional embedding and the high-dimensional data.
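The objective described above can be sketched in equations. The notation is an assumption introduced here: x_i are the n high-dimensional points, y_i their low-dimensional counterparts, and sigma_i is a per-point bandwidth that the algorithm chooses to match the requested perplexity.

```latex
p_{j \mid i} = \frac{\exp\!\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}
                    {\sum_{k \neq i} \exp\!\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)},
\qquad
p_{ij} = \frac{p_{j \mid i} + p_{i \mid j}}{2n}

q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}

C = \mathrm{KL}(P \,\Vert\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
```

The heavy-tailed Student-t kernel in q_ij is where the "t" in t-SNE comes from. The cost C penalizes placing similar points (large p_ij) far apart much more than placing dissimilar points close together, which is why the method mainly preserves local neighborhoods.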

from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Load the Digits dataset (1,797 8x8 images, each flattened to 64 features)
digits = load_digits()
X = digits.data
y = digits.target

# Embed into 2 dimensions; the labels y are used only for coloring the plot
tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X)

# Plot the embedding, colored by digit class
# ('tab10' provides 10 distinct colors, one per digit)
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='tab10')
plt.colorbar(label='Digit')
plt.title('t-SNE Visualization of Digits Dataset')
plt.show()

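💡 Tip: When using t-SNE, be mindful of the perplexity parameter. It roughly sets how many neighbors each point takes into account, and so controls the balance between local and global aspects of the data. Values between 5 and 50 are typical (scikit-learn's default is 30). If perplexity is too small, the map tends to break into many small, spurious clumps; if it is too large, distinct clusters tend to merge into a single blob.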
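One way to get a feel for perplexity is to fit t-SNE several times and compare the results. The sketch below is a hedged illustration: the 300-sample subset and the three perplexity values are arbitrary choices made here to keep the run short, not recommendations from the module.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# Assumption: a 300-sample subset is enough to illustrate the effect quickly
X_small = load_digits().data[:300]

embeddings = {}
for perplexity in (5, 30, 100):
    # perplexity must be smaller than the number of samples
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=42)
    embeddings[perplexity] = tsne.fit_transform(X_small)
    # kl_divergence_ holds the final value of the objective after optimization
    print(f"perplexity={perplexity:>3}  KL divergence={tsne.kl_divergence_:.3f}")
```

The KL values are not directly comparable across perplexities, because changing perplexity changes the high-dimensional similarities being matched. Plotting each entry of embeddings side by side is usually more informative than the numbers alone.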

❓ What is the primary goal of K-Means clustering?

A. To maximize variance within clusters
B. To minimize the within-cluster sum of squares
C. To maximize the between-cluster sum of squares
D. To minimize the between-cluster sum of squares

❓ What does t-SNE aim to preserve in its low-dimensional embedding?

A. Global structure
B. Local structure
C. Both global and local structure
D. Neither global nor local structure
