Overfitting and Underfitting
Duration: 5 min
This module delves into the concepts of overfitting and underfitting in supervised learning models. Understanding these phenomena is crucial for building robust and generalizable machine learning models. We will explore the causes, implications, and methods to detect and mitigate overfitting and underfitting in various algorithms like Linear Regression, Logistic Regression, Decision Trees, Random Forests, SVM, and Gradient Boosting.
Understanding Overfitting
Overfitting occurs when a model fits the training data too closely, capturing noise and outliers along with the underlying pattern. The result is high training accuracy (low training error) but poor generalization to new, unseen data. It typically arises when the model's complexity exceeds the true complexity of the data. Cross-validation helps detect overfitting, while techniques such as regularization and pruning help mitigate it.
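
As a hedged illustration of pruning, one of the techniques named above, the sketch below compares an unconstrained decision tree with a cost-complexity-pruned one. The sine-shaped data, the `ccp_alpha=0.005` value, and all variable names are assumptions chosen for illustration and are not part of this module; in practice a suitable `ccp_alpha` would usually be chosen by cross-validation.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical noisy data: a sine curve plus Gaussian noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Unconstrained tree: free to grow until every training point is isolated
full_tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

# Pruned tree: cost-complexity pruning removes branches that add little
# (ccp_alpha=0.005 is an illustrative value, not a recommendation)
pruned_tree = DecisionTreeRegressor(ccp_alpha=0.005, random_state=0).fit(
    X_train, y_train
)

for name, tree in [("full", full_tree), ("pruned", pruned_tree)]:
    train_mse = mean_squared_error(y_train, tree.predict(X_train))
    test_mse = mean_squared_error(y_test, tree.predict(X_test))
    print(
        f"{name:>6}: leaves={tree.get_n_leaves()}, "
        f"train MSE={train_mse:.3f}, test MSE={test_mse:.3f}"
    )
```

The unconstrained tree should reach a training error of essentially zero because it isolates each training point in its own leaf, while the pruned tree accepts some training error in exchange for far fewer leaves. Whether the test error improves depends on the data and on the pruning strength.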
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Generate synthetic data: a cubic trend plus Gaussian noise
np.random.seed(0)
x = 2 - 3 * np.random.normal(0, 1, 100)[:, np.newaxis]
y = x**3 + 0.5 * np.random.normal(size=x.shape)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Fit an unconstrained decision tree, which is flexible enough to memorize
# the training points, noise included (a plain linear model would underfit here)
model = DecisionTreeRegressor(random_state=0)
model.fit(X_train, y_train.ravel())

# Predict and evaluate
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)
train_error = mean_squared_error(y_train, y_train_pred)
test_error = mean_squared_error(y_test, y_test_pred)
print(f'Training Error: {train_error}')
print(f'Testing Error: {test_error}')

Sample output (illustrative only; exact values depend on the model, the data, and the library version). The pattern to look for is a training error far below the testing error:

Training Error: 0.1515363214971502
Testing Error: 1.435564726928211

Understanding Underfitting
Underfitting occurs when a model is too simple to capture the underlying pattern in the data, resulting in poor performance on both the training and testing sets. This is often due to insufficient model complexity or lack of features. To address underfitting, one can increase model complexity, add more features, or reduce regularization.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
# Generate synthetic data
np.random.seed(0)
x = 2 - 3 * np.random.normal(0, 1, 100)[:, np.newaxis]
y = x**3 + 0.5 * np.random.normal(size=x.shape)
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
# Fit a simple linear regression model
model = LinearRegression()
model.fit(X_train, y_train)
# Predict and evaluate
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)
train_error = mean_squared_error(y_train, y_train_pred)
test_error = mean_squared_error(y_test, y_test_pred)
print(f'Training Error: {train_error}')
print(f'Testing Error: {test_error}')

💡 Tip: Always use cross-validation to evaluate model performance and avoid relying solely on training or testing error metrics.
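
To make the tip concrete, here is a minimal sketch that uses 5-fold cross-validation to compare the plain linear model with a model given cubic polynomial features, one way to increase model complexity when a model underfits. It reuses the synthetic cubic data from the example above; the pipeline, the choice of `degree=3`, and the dictionary names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Same synthetic cubic data as in the example above
np.random.seed(0)
x = 2 - 3 * np.random.normal(0, 1, 100)[:, np.newaxis]
y = (x**3 + 0.5 * np.random.normal(size=x.shape)).ravel()

# Candidate models: a straight line versus a cubic polynomial fit
models = {
    "linear (underfits)": LinearRegression(),
    "cubic features": make_pipeline(PolynomialFeatures(degree=3), LinearRegression()),
}

cv_mse = {}
for name, model in models.items():
    # scikit-learn reports negated MSE so that higher scores are always better
    scores = cross_val_score(model, x, y, cv=5, scoring="neg_mean_squared_error")
    cv_mse[name] = -scores.mean()
    print(f"{name}: mean CV MSE = {cv_mse[name]:.3f}")
```

Because each fold is held out once, the averaged score is less sensitive to a lucky or unlucky split than a single train/test comparison. On this data the cubic model should score a much lower cross-validated error than the straight line, since the data were generated from a cubic trend.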
❓ What is a common cause of overfitting?
❓ What is a common cause of underfitting?
Key Concepts
| Concept | Description |
|---|---|
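| Regularization | Penalizes model complexity (for example, large coefficients) to reduce overfitting |
| Early Stopping | Halts iterative training once performance on validation data stops improving |
| Dropout | Randomly deactivates units during neural network training so the network does not depend on any single unit |
| Validation | Evaluates the model on held-out data to estimate how well it generalizes |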
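
Early stopping, listed in the table above, can be sketched with scikit-learn's `GradientBoostingRegressor`, which holds out part of the training data when `n_iter_no_change` is set. The sine-shaped data and every hyperparameter value below are assumptions chosen for illustration, not settings recommended by this module.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical noisy data: a sine curve plus Gaussian noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(300, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=300)

model = GradientBoostingRegressor(
    n_estimators=1000,        # upper bound on boosting rounds
    learning_rate=0.1,
    validation_fraction=0.2,  # share of the training data held out internally
    n_iter_no_change=10,      # stop after 10 rounds without validation improvement
    random_state=0,
)
model.fit(X, y)

# n_estimators_ is the number of rounds actually fitted before stopping
print(f"Rounds requested: {model.n_estimators}, rounds used: {model.n_estimators_}")
```

Stopping as soon as the held-out score stalls keeps the ensemble from spending its remaining rounds fitting noise in the training data.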
Check Your Understanding
❓ How does an overfit model behave on outliers and other unusual inputs?
❓ How does model complexity affect the risk of overfitting?
❓ Which hyperparameters most directly control overfitting in a given model?