Overfitting and Underfitting

Duration: 5 min

This module covers overfitting and underfitting in supervised learning models. Both determine how well a model generalizes to data it has not seen, so recognizing them is essential for building robust models. We will explore the causes and implications of each, along with methods to detect and mitigate them in algorithms such as Linear Regression, Logistic Regression, Decision Trees, Random Forests, SVM, and Gradient Boosting.

Understanding Overfitting

Overfitting occurs when a model fits the training data too closely, capturing noise and outliers along with the underlying pattern. The result is low training error but poor generalization to new, unseen data. It typically arises when the model is more complex than the underlying pattern warrants. Cross-validation helps detect overfitting, while regularization and pruning (for tree-based models) help mitigate it.

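import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Generate synthetic data: a cubic signal plus Gaussian noise
np.random.seed(0)
x = 2 - 3 * np.random.normal(0, 1, 100)[:, np.newaxis]
y = x**3 + 0.5 * np.random.normal(size=x.shape)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Fit a degree-15 polynomial, which is far more flexible than the cubic signal requires.
# The degree is an illustrative choice; a plain straight-line fit would underfit this data.
# Standardizing the polynomial terms keeps the least-squares fit better conditioned.
model = make_pipeline(
    PolynomialFeatures(degree=15, include_bias=False),
    StandardScaler(),
    LinearRegression(),
)
model.fit(X_train, y_train)

# Predict and evaluate on both splits; a large gap between them signals overfitting
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)
train_error = mean_squared_error(y_train, y_train_pred)
test_error = mean_squared_error(y_test, y_test_pred)

print(f'Training Error: {train_error}')
print(f'Testing Error: {test_error}')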

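Sample output (exact values depend on the model configuration and library versions). A training error far below the testing error is the pattern that signals overfitting: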

Training Error: 0.1515363214971502
Testing Error: 1.435564726928211
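
Pruning, mentioned above as a mitigation, can be sketched with a decision tree. The following is a minimal illustration rather than part of the original module: the sine-shaped dataset, the noise level, the helper `train_test_mse`, and `max_depth=3` are all assumptions chosen for demonstration. Capping the depth is a simple form of pre-pruning that stops the tree from memorizing individual noisy points.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Assumed synthetic data: a smooth signal plus noticeable noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)


def train_test_mse(model):
    """Fit the model and return (training MSE, testing MSE)."""
    model.fit(X_train, y_train)
    return (
        mean_squared_error(y_train, model.predict(X_train)),
        mean_squared_error(y_test, model.predict(X_test)),
    )


# With default settings the tree keeps splitting until it reproduces
# every training point, noise included
deep_train, deep_test = train_test_mse(DecisionTreeRegressor(random_state=0))

# max_depth=3 is an illustrative choice; in practice tune it on validation data
shallow_train, shallow_test = train_test_mse(
    DecisionTreeRegressor(max_depth=3, random_state=0)
)

print(f'Unrestricted tree:  train={deep_train:.3f}, test={deep_test:.3f}')
print(f'Depth-limited tree: train={shallow_train:.3f}, test={shallow_test:.3f}')
```

The unrestricted tree reports a training error of essentially zero because it memorizes the training set, and the gap to its testing error is the overfitting signal. Whether the depth-limited tree generalizes better depends on the data, which is why the depth is usually tuned rather than fixed in advance.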

Understanding Underfitting

Underfitting occurs when a model is too simple to capture the underlying pattern in the data, resulting in poor performance on both the training and testing sets. This is often due to insufficient model complexity or lack of features. To address underfitting, one can increase model complexity, add more features, or reduce regularization.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Generate synthetic data
np.random.seed(0)
x = 2 - 3 * np.random.normal(0, 1, 100)[:, np.newaxis]
y = x**3 + 0.5 * np.random.normal(size=x.shape)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Fit a straight line to the cubic data; the model is too simple to follow the curve
model = LinearRegression()
model.fit(X_train, y_train)

# Predict and evaluate
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)
train_error = mean_squared_error(y_train, y_train_pred)
test_error = mean_squared_error(y_test, y_test_pred)

print(f'Training Error: {train_error}')
print(f'Testing Error: {test_error}')
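
One remedy named above, increasing model complexity, can be sketched by adding polynomial terms to the same synthetic data. This is an illustrative sketch, not the module's own solution: `degree=3` is assumed only because the data is known to be generated from a cubic, and the `results` dictionary is a hypothetical helper for collecting the scores.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Same synthetic cubic data as the example above
np.random.seed(0)
x = 2 - 3 * np.random.normal(0, 1, 100)[:, np.newaxis]
y = x**3 + 0.5 * np.random.normal(size=x.shape)

X_train, X_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42
)

# Straight-line model: too simple for cubic data
linear = LinearRegression().fit(X_train, y_train)

# Cubic model: degree=3 matches how the data was generated, which is
# only known here because the data is synthetic
cubic = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
cubic.fit(X_train, y_train)

# Collect (training MSE, testing MSE) for each model
results = {}
for name, model in [('Linear', linear), ('Cubic', cubic)]:
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    results[name] = (train_mse, test_mse)
    print(f'{name}: train MSE={train_mse:.2f}, test MSE={test_mse:.2f}')
```

The straight line shows high error on both splits, the signature of underfitting, while the cubic model lowers both because it can represent the curvature in the data.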

💡 Tip: Use cross-validation to estimate model performance instead of relying on the error from a single train/test split. Averaging over several folds gives a more stable estimate of how the model generalizes.
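
The tip above can be sketched with scikit-learn's `cross_val_score`. This is a minimal example under stated assumptions: it reuses the module's synthetic cubic data, and the 5-fold split and degree-3 polynomial model are illustrative choices, not recommendations from the original module.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Same synthetic cubic data as the earlier examples, with y flattened to 1-D
np.random.seed(0)
x = 2 - 3 * np.random.normal(0, 1, 100)[:, np.newaxis]
y = (x**3 + 0.5 * np.random.normal(size=x.shape)).ravel()

# Five shuffled folds: each sample is used for testing exactly once
cv = KFold(n_splits=5, shuffle=True, random_state=42)
model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())

# scikit-learn scorers follow "higher is better", so MSE is returned negated
scores = cross_val_score(model, x, y, cv=cv, scoring='neg_mean_squared_error')
fold_mse = -scores

print(f'MSE per fold: {np.round(fold_mse, 3)}')
print(f'Mean MSE: {fold_mse.mean():.3f} (std {fold_mse.std():.3f})')
```

A large spread between folds suggests that the score from any single split would have been unreliable.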

❓ What is a common cause of overfitting?

- High bias
- High variance
- Insufficient data
- Low complexity

❓ What is a common cause of underfitting?

- High variance
- High bias
- Excessive data
- Complex model

Key Concepts

Concept	Description
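Regularization	Adds a penalty on model complexity (for example, an L1 or L2 penalty on the weights) to discourage overfitting
Early Stopping	Halts iterative training once performance on held-out validation data stops improving
Dropout	Randomly deactivates units during neural network training so the network cannot rely on any single unit
Validation	Evaluates the model on data held out from training to estimate how well it generalizes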

Check Your Understanding

❓ Which approach limits how closely a model can fit noisy points without altering the training data?

- Ignoring them
- Applying regularization
- Removing them
- Duplicating them

❓ What is the computational cost of the measures used to detect and reduce overfitting, such as cross-validation or regularization?

- O(n)
- O(n²)
- O(log n)
- Depends on implementation

❓ Which training hyperparameter does early stopping most directly limit in order to curb overfitting?

- Learning rate
- Batch size
- Epochs
- All equally important
