Principal Component Analysis (PCA) Fundamentals
Duration: 5 min
This module covers the fundamentals of Principal Component Analysis (PCA), a widely used technique for dimensionality reduction in data science. PCA simplifies complex datasets while preserving as much of their variance as possible, which makes it useful for tasks such as visualization, data compression, and noise reduction.
Understanding PCA
Principal Component Analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The first principal component has the largest possible variance, and each succeeding component has the highest variance possible under the constraint that it is orthogonal to the preceding components.
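The properties in this definition can be checked numerically. The following is a minimal sketch, not scikit-learn's internal implementation: it reuses the ten sample points from this module's examples, centers them, and takes the eigendecomposition of the sample covariance matrix with NumPy. The helper name `pca_via_eigh` is hypothetical and exists only for this illustration.

```python
import numpy as np


def pca_via_eigh(X):
    """Illustrative PCA via the covariance matrix (hypothetical helper).

    Returns (components, variances, scores), where each row of
    `components` is one principal direction.
    """
    X_centered = X - X.mean(axis=0)              # PCA works on centered data
    cov = np.cov(X_centered, rowvar=False)       # sample covariance (n - 1 denominator)
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]            # reorder so the largest variance comes first
    variances = eigvals[order]
    components = eigvecs[:, order].T
    scores = X_centered @ components.T           # coordinates in the new basis
    return components, variances, scores


X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
              [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])

components, variances, scores = pca_via_eigh(X)

# The principal directions are orthonormal.
print(np.allclose(components @ components.T, np.eye(2)))

# The transformed variables are uncorrelated: their covariance matrix is diagonal.
print(np.allclose(np.cov(scores, rowvar=False), np.diag(variances)))

# Variances are ordered from largest to smallest.
print(variances[0] >= variances[1])
```

All three checks should print `True`. The sign of each principal direction is arbitrary, so scores computed this way can differ from a library's output by a factor of -1 per component.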
import numpy as np
from sklearn.decomposition import PCA
# Sample data
data = np.array([[2.5, 2.4],
                 [0.5, 0.7],
                 [2.2, 2.9],
                 [1.9, 2.2],
                 [3.1, 3.0],
                 [2.3, 2.7],
                 [2.0, 1.6],
                 [1.0, 1.1],
                 [1.5, 1.6],
                 [1.1, 0.9]])
# Apply PCA
pca = PCA(n_components=1)
principalComponents = pca.fit_transform(data)
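print(principalComponents)
Output (rounded to three decimals; PCA centers the data first, so the scores sum to zero, and the sign of the whole column may be flipped depending on the scikit-learn version):
[[ 0.828]
 [-1.778]
 [ 0.992]
 [ 0.274]
 [ 1.676]
 [ 0.913]
 [-0.099]
 [-1.145]
 [-0.438]
 [-1.224]]
Eigenvalues and Explained Variance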
In PCA, the eigenvalues of the data’s covariance matrix give the amount of variance captured by each principal component. The explained variance ratio of a component is its eigenvalue divided by the sum of all eigenvalues, that is, the proportion of the dataset’s total variance captured by that component. These ratios show how significant each principal component is and help decide how many components to retain.
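One common way to use these ratios when deciding how many components to retain is to keep the smallest number whose cumulative explained variance reaches a chosen threshold. The sketch below illustrates that idea under stated assumptions: the dataset is synthetic (five features generated from two hidden factors plus a little noise), and the 95% threshold is an arbitrary choice for the example, not a recommendation from this module.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data (assumption for illustration): 200 samples, 5 features
# driven by 2 latent factors plus a small amount of noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 5))

pca = PCA().fit(X)

# Running total of the explained variance ratios, component by component.
cumulative = np.cumsum(pca.explained_variance_ratio_)

threshold = 0.95  # assumed target share of total variance
# Index of the first component at which the running total reaches the threshold.
n_keep = int(np.searchsorted(cumulative, threshold)) + 1

print('Cumulative explained variance:', cumulative)
print('Components to keep:', n_keep)
```

scikit-learn can also make this selection itself: passing a float between 0 and 1, for example `PCA(n_components=0.95)`, keeps as many components as are needed to explain at least that fraction of the variance.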
import numpy as np
from sklearn.decomposition import PCA
# Sample data
data = np.array([[2.5, 2.4],
                 [0.5, 0.7],
                 [2.2, 2.9],
                 [1.9, 2.2],
                 [3.1, 3.0],
                 [2.3, 2.7],
                 [2.0, 1.6],
                 [1.0, 1.1],
                 [1.5, 1.6],
                 [1.1, 0.9]])
# Apply PCA
pca = PCA()
pca.fit(data)
# Eigenvalues
eigenvalues = pca.explained_variance_
# Explained variance ratio
explained_variance_ratio = pca.explained_variance_ratio_
print('Eigenvalues:', eigenvalues)
print('Explained Variance Ratio:', explained_variance_ratio)
💡 Tip: Standardize your data before applying PCA when features are measured on different scales; otherwise the features with the largest variance dominate the principal components.
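The effect of feature scale on PCA can be seen in a small experiment. This is a hedged sketch: the two features (`height_m` and `income`) are made-up, independently generated variables chosen only because their scales differ by several orders of magnitude, and `StandardScaler` is one of several ways to standardize.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two independent features on very different scales.
rng = np.random.default_rng(0)
height_m = rng.normal(1.7, 0.1, size=500)        # metres, variance around 0.01
income = rng.normal(50_000, 10_000, size=500)    # currency units, variance around 1e8
X = np.column_stack([height_m, income])

# PCA on the raw data: the large-scale feature dominates the first component.
raw_ratio = PCA().fit(X).explained_variance_ratio_

# PCA after standardizing each feature to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)
scaled_ratio = PCA().fit(X_scaled).explained_variance_ratio_

print('Raw data:         ', raw_ratio)     # first ratio is very close to 1
print('Standardized data:', scaled_ratio)  # the two ratios are roughly balanced
```

Without scaling, the first component is essentially the `income` axis, even though the two features are unrelated. After standardization, both features contribute on comparable terms.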
❓ What does PCA stand for?
❓ What does the explained variance ratio indicate in PCA?