Feature Selection and Dimensionality Reduction
Duration: 7 min
This module introduces two unsupervised learning techniques that support feature selection and dimensionality reduction: K-Means clustering and Principal Component Analysis (PCA). Used well, they can improve model performance, reduce overfitting, and make data more manageable and interpretable.
K-Means Clustering
K-Means is a popular clustering algorithm that partitions data into K distinct clusters. It works by assigning each data point to the nearest cluster centroid and then recalculating the centroids. This process iterates until the centroids stabilize. K-Means is useful for identifying patterns and grouping similar data points together.
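To make the assign-and-update loop concrete, here is a minimal NumPy sketch of the iteration described above. The function name kmeans_sketch and the hand-picked starting centroids are assumptions for illustration only; library implementations choose starting points more carefully and usually try several initializations.

```python
import numpy as np

def kmeans_sketch(X, init_centroids, n_iters=100):
    """Illustrative assign/update loop; not a replacement for a library."""
    centroids = np.asarray(init_centroids, dtype=float)
    k = len(centroids)
    for _ in range(n_iters):
        # Assignment step: find the nearest centroid for every point
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        # (a centroid with no points keeps its previous position)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

# Assumed starting centroids: one point from each side of the data
labels, centroids = kmeans_sketch(X, init_centroids=X[[0, 5]])
print(labels)
print(centroids)
```

Cluster numbers are arbitrary labels, so a library may number the same two groups in the opposite order. The result also depends on the starting centroids: a poor choice can make the loop settle on a worse grouping, which is why libraries typically run several initializations.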
from sklearn.cluster import KMeans
import numpy as np
# Generate sample data
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])
# Apply K-Means clustering
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
# Print cluster labels
print(kmeans.labels_)
Output:
[1 1 1 0 0 0]
Principal Component Analysis (PCA)
PCA is a dimensionality reduction technique that transforms data into a set of orthogonal components that explain the maximum variance. It helps in reducing the number of features while preserving as much information as possible. PCA is widely used for visualization, noise reduction, and feature extraction.
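As a rough illustration of what "orthogonal components that explain the maximum variance" means, the sketch below computes the components by hand with NumPy's singular value decomposition. It is one common way to derive PCA, shown here under the assumption that the data only needs centering; the variable names are illustrative.

```python
import numpy as np

X = np.array([[-1, -1], [-2, -1], [-3, -2],
              [1, 1], [2, 1], [3, 2]], dtype=float)

# Center each feature so it has zero mean
X_centered = X - X.mean(axis=0)

# SVD of the centered data: the rows of Vt are the orthogonal directions,
# ordered from most to least variance
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Fraction of the total variance captured along each direction
explained_variance_ratio = S**2 / np.sum(S**2)

# Keep only the first direction and project the data onto it
n_components = 1
X_reduced = X_centered @ Vt[:n_components].T

print(explained_variance_ratio)
print(X_reduced)
```

For this data set almost all of the variance lies along the first direction, so dropping the second one loses very little information. The sign of each component is arbitrary, so the projected values may come out negated compared with a library implementation.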
from sklearn.decomposition import PCA
import numpy as np
# Generate sample data
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
# Apply PCA
pca = PCA(n_components=1)
X_pca = pca.fit_transform(X)
# Print transformed data
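print(X_pca)
💡 Tip: scikit-learn's PCA centers the data automatically but does not scale it. If your features are measured on different scales, standardize them first (for example with StandardScaler) so that no single feature dominates the components.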
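The effect of scaling can be seen with a small sketch. The data below is synthetic and hypothetical (a height in metres and a weight in grams), chosen only so that the two features differ greatly in scale; the pipeline uses scikit-learn's StandardScaler and make_pipeline.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two correlated features on very different scales
rng = np.random.default_rng(0)
height_m = rng.normal(1.7, 0.1, size=200)
weight_g = 70_000 + 40_000 * (height_m - 1.7) + rng.normal(0, 5_000, size=200)
X = np.column_stack([height_m, weight_g])

# Without scaling, the large-valued feature dominates the first component
pca_raw = PCA(n_components=1).fit(X)

# With standardization, every feature has unit variance before PCA runs
scaled_pca = make_pipeline(StandardScaler(), PCA(n_components=1))
X_reduced = scaled_pca.fit_transform(X)

print(pca_raw.components_)
print(scaled_pca.named_steps["pca"].components_)
```

In the unscaled fit the first component points almost entirely along the weight axis, while after standardization both features contribute equally. Whether scaling is appropriate depends on the data: when all features already share the same unit, leaving them unscaled can be a reasonable choice.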
❓ What is the primary goal of K-Means clustering?
❓ What does PCA primarily aim to achieve?