Module 12 of 26 · Scikit-Learn Machine Learning · Beginner

Cross-Validation Techniques

Duration: 5 min

This module covers cross-validation, a core technique for evaluating machine learning models. Cross-validation estimates how well a model generalizes to unseen data by testing it on several held-out subsets. This gives a more reliable performance estimate than a single train/test split and makes overfitting easier to detect.

K-Fold Cross-Validation

K-Fold Cross-Validation is a resampling procedure used to evaluate machine learning models on a limited data sample. The dataset is split into k subsets, or folds, of roughly equal size. In each iteration, one fold is held out as the validation set and the remaining k-1 folds form the training set. This is repeated k times, so each fold serves as the validation set exactly once. The k scores are then averaged to produce a single performance estimate.
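
Before running the procedure on a real dataset, it can help to look at the index splits themselves. The following is a minimal sketch on a toy array of 10 samples (the names X_toy and kf_toy are illustrative, not part of the lesson code). Shuffling is left off so the folds are easy to read.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data: 10 samples with one feature each (values are arbitrary)
X_toy = np.arange(10).reshape(-1, 1)

# No shuffling, so each fold is a consecutive block of indices
kf_toy = KFold(n_splits=5)

validation_indices = []
for fold, (train_idx, val_idx) in enumerate(kf_toy.split(X_toy), start=1):
    print(f"Fold {fold}: train={train_idx.tolist()} validation={val_idx.tolist()}")
    validation_indices.extend(val_idx.tolist())

# Every sample lands in exactly one validation fold
print(sorted(validation_indices) == list(range(10)))
```

Each iteration trains on 8 samples and validates on the remaining 2, and across the 5 iterations every sample is used for validation once.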

from sklearn.model_selection import KFold
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Initialize KFold (shuffle=True matters here because the iris samples are ordered by class)
kf = KFold(n_splits=5, shuffle=True, random_state=1)

# Initialize model
model = LogisticRegression(max_iter=200)

# List to store scores
scores = []

# Perform K-Fold Cross-Validation
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    score = accuracy_score(y_test, predictions)
    scores.append(score)

# Print average score
print(f'Average accuracy: {sum(scores)/len(scores):.2f}')

Average accuracy: 0.97
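
The manual loop above can usually be condensed with scikit-learn's cross_val_score helper, which fits and scores a fresh copy of the model on each split. This is a sketch that reuses the same dataset, model, and KFold settings as the loop; the variable name cv_scores is illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=1)
model = LogisticRegression(max_iter=200)

# One accuracy value per fold, returned as a NumPy array
cv_scores = cross_val_score(model, X, y, cv=kf, scoring="accuracy")

print(f"Per-fold accuracy: {cv_scores.round(2)}")
print(f"Mean: {cv_scores.mean():.2f}, std: {cv_scores.std():.2f}")
```

Reporting the standard deviation alongside the mean shows how much the score varies from fold to fold, which a single averaged number hides.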

Stratified K-Fold Cross-Validation

Stratified K-Fold Cross-Validation is a variation of K-Fold in which each fold preserves the percentage of samples for each class. This is particularly useful for classification problems with an imbalanced class distribution, where a plain K-Fold split can produce folds that contain very few samples of a rare class, or none at all. By keeping the same class distribution in every fold, this method makes each fold representative of the whole dataset.
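
To see the effect of stratification, the sketch below counts the classes in each validation fold for a hypothetical imbalanced label array (90 samples of class 0 followed by 10 of class 1). The data and the helper fold_class_counts are made up for illustration; only the splitter classes come from scikit-learn.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical imbalanced labels: 90 samples of class 0, then 10 of class 1
y_imb = np.array([0] * 90 + [1] * 10)
X_imb = np.zeros((100, 1))  # feature values do not affect the split

def fold_class_counts(splitter):
    """Return [count of class 0, count of class 1] for each validation fold."""
    return [
        np.bincount(y_imb[val_idx], minlength=2).tolist()
        for _, val_idx in splitter.split(X_imb, y_imb)
    ]

plain_counts = fold_class_counts(KFold(n_splits=5))
strat_counts = fold_class_counts(StratifiedKFold(n_splits=5))

print("KFold:          ", plain_counts)
print("StratifiedKFold:", strat_counts)
```

Without shuffling, plain K-Fold places all of the minority class in the last fold, while Stratified K-Fold gives every fold the same 18-to-2 ratio as the full dataset's 90-to-10.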

from sklearn.model_selection import StratifiedKFold
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Initialize StratifiedKFold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Initialize model
model = LogisticRegression(max_iter=200)

# List to store scores
scores = []

# Perform Stratified K-Fold Cross-Validation
for train_index, test_index in skf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    score = accuracy_score(y_test, predictions)
    scores.append(score)

# Print average score
print(f'Average accuracy: {sum(scores)/len(scores):.2f}')

💡 Tip: Always use Stratified K-Fold for classification problems with imbalanced datasets to ensure each fold is representative of the class distribution.

❓ What is the primary purpose of K-Fold Cross-Validation?

❓ Why is Stratified K-Fold Cross-Validation preferred for imbalanced datasets?

Key Concepts

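Concept: Description
Fold: One of the k roughly equal subsets of the dataset; each fold serves as the validation set exactly once.
Stratified: A split that preserves the class proportions of the full dataset in every fold.
Time Series: Ordered data in which validation samples must come after the training samples in time, so shuffled K-Fold splits are not appropriate.
Validation: Scoring the model on data held out from training to estimate how well it generalizes.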
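
The Key Concepts list mentions time series, which the examples in this module do not cover. As a brief sketch, scikit-learn provides TimeSeriesSplit for ordered data; the toy array X_ts below is an assumption for illustration, with row order standing in for time.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy ordered data: 8 time steps, one feature each
X_ts = np.arange(8).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)

# Collect the splits as plain lists so they are easy to inspect
ts_splits = [
    (train_idx.tolist(), val_idx.tolist())
    for train_idx, val_idx in tscv.split(X_ts)
]

for fold, (train_idx, val_idx) in enumerate(ts_splits, start=1):
    print(f"Fold {fold}: train={train_idx} validation={val_idx}")
```

Unlike K-Fold, each training set here contains only samples that come before the validation samples, and it grows from one split to the next.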

Check Your Understanding

❓ How does Cross-Validation handle edge cases?

❓ What is the computational complexity of Cross-Validation?

❓ Which hyperparameter is most critical for Cross-Validation?
