Cross-Validation Techniques
Duration: 5 min
This module covers cross-validation techniques, a core part of machine learning model evaluation. Cross-validation estimates how well your model generalizes to unseen data. It helps you detect overfitting and gives a more reliable performance estimate than a single train/test split.
K-Fold Cross-Validation
K-Fold Cross-Validation is a resampling procedure used to evaluate machine learning models on a limited data sample. The procedure splits the dataset into 'k' subsets, or folds, of roughly equal size. In each iteration, one fold is held out as the validation set and the remaining 'k-1' folds are used as the training set. This process is repeated 'k' times, so each of the 'k' folds is used exactly once as the validation data. The 'k' scores are then averaged to produce a single performance estimate.
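To make the fold mechanics concrete, here is a minimal sketch on a toy array of 10 samples. The name `X_toy` and the choice of 10 samples are assumptions made for illustration only. Shuffling is left off so the folds are easy to read.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical toy data: 10 samples, so each of the 5 folds holds 2 samples
X_toy = np.arange(10).reshape(-1, 1)

# No shuffling, so the validation folds are contiguous blocks of indices
kf = KFold(n_splits=5)

validation_indices = []
for fold, (train_idx, val_idx) in enumerate(kf.split(X_toy), start=1):
    print(f"Fold {fold}: train={train_idx.tolist()} validation={val_idx.tolist()}")
    validation_indices.extend(val_idx.tolist())

# Every sample lands in a validation fold exactly once
print(sorted(validation_indices) == list(range(10)))
```

Each iteration trains on 8 samples and validates on the remaining 2, and the final check confirms that no sample is validated twice or skipped.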
from sklearn.model_selection import KFold
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Load dataset
iris = load_iris()
X, y = iris.data, iris.target
# Initialize KFold
kf = KFold(n_splits=5, shuffle=True, random_state=1)
# Initialize model
model = LogisticRegression(max_iter=200)
# List to store scores
scores = []
# Perform K-Fold Cross-Validation
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    score = accuracy_score(y_test, predictions)
    scores.append(score)
# Print average score
print(f'Average accuracy: {sum(scores)/len(scores):.2f}')
Average accuracy: 0.97
Stratified K-Fold Cross-Validation
Stratified K-Fold Cross-Validation is a variation of K-Fold where the folds are made by preserving the percentage of samples for each class. This is particularly useful for classification problems where the target variable is imbalanced. By maintaining the same class distribution in each fold, this method ensures that each fold is a good representative of the whole.
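The effect of stratification is easiest to see on deliberately imbalanced labels. The sketch below is illustrative only: `y_toy`, `X_toy`, and the helper `minority_counts` are hypothetical names, and the 90/10 class split is an assumption chosen to exaggerate the difference.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical imbalanced labels: 90 samples of class 0, then 10 of class 1
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.zeros((len(y_toy), 1))  # feature values do not affect the splits


def minority_counts(splitter):
    """Return the number of class-1 samples in each validation fold."""
    return [int(y_toy[val_idx].sum()) for _, val_idx in splitter.split(X_toy, y_toy)]


plain_counts = minority_counts(KFold(n_splits=5))
stratified_counts = minority_counts(StratifiedKFold(n_splits=5))

print("KFold:          ", plain_counts)
print("StratifiedKFold:", stratified_counts)
```

With unshuffled KFold on this sorted data, all ten minority samples fall into the last fold, so four folds contain no class-1 examples at all. StratifiedKFold places two minority samples in every fold, matching the overall 10% rate.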
from sklearn.model_selection import StratifiedKFold
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Load dataset
iris = load_iris()
X, y = iris.data, iris.target
# Initialize StratifiedKFold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
# Initialize model
model = LogisticRegression(max_iter=200)
# List to store scores
scores = []
# Perform Stratified K-Fold Cross-Validation
for train_index, test_index in skf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    score = accuracy_score(y_test, predictions)
    scores.append(score)
# Print average score
print(f'Average accuracy: {sum(scores)/len(scores):.2f}')
💡 Tip: Always use Stratified K-Fold for classification problems with imbalanced datasets to ensure each fold is representative of the class distribution.
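The manual loop above can usually be replaced with scikit-learn's `cross_val_score`, which accepts a splitter object through its `cv` parameter. This is a minimal sketch that reuses the same dataset, model, and splitter settings as the example above. It is one convenient option, not the only way to run the evaluation.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Load features and labels directly as arrays
X, y = load_iris(return_X_y=True)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
model = LogisticRegression(max_iter=200)

# cross_val_score fits a fresh copy of the model on each training split
# and scores it on the corresponding held-out fold
scores = cross_val_score(model, X, y, cv=skf, scoring="accuracy")

print(f"Per-fold accuracy: {scores.round(2)}")
print(f"Average accuracy: {scores.mean():.2f}")
```

Reporting the per-fold scores alongside the mean shows how much the estimate varies from fold to fold, which the average alone hides.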
❓ What is the primary purpose of K-Fold Cross-Validation?
❓ Why is Stratified K-Fold Cross-Validation preferred for imbalanced datasets?
Key Concepts
| Concept | Description |
|---|---|
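| Fold | One of the 'k' subsets the dataset is split into; each fold serves once as the validation set |
| Stratified | A splitting strategy that preserves the class proportions of the full dataset in every fold |
| Time Series | Time-ordered data, where validation samples must come after the training samples, so folds cannot be shuffled freely |
| Validation | The held-out fold used to score the model after it is trained on the remaining folds |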
Check Your Understanding
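❓ How does cross-validation behave in edge cases, such as a class with fewer samples than the number of folds?
❓ How does the computational cost of cross-validation grow with the number of folds 'k'?
❓ Which parameter has the most influence on a K-Fold cross-validation estimate, and why?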