Module 17 of 28 · Supervised Learning · Beginner

Bias-Variance Tradeoff

Duration: 5 min

This module covers the bias-variance tradeoff, a fundamental concept in supervised learning that relates a model's complexity to how well it generalizes. Understanding this tradeoff is key to building models that perform well on unseen data and avoid both underfitting and overfitting.

Understanding Bias and Variance

Bias is the error introduced by approximating a real-world problem, which may be complex, with a much simpler model. In statistical terms, it is the difference between the model's expected (average) prediction and the true value being predicted. Variance is the amount by which the learned estimate of the target function would change if different training data were used. A high-variance algorithm tends to model the random noise in the training data rather than the intended outputs.
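The two definitions above are often summarized by the standard decomposition of expected squared error. The notation here is an assumption added for illustration and does not come from the module itself: the data are taken to follow y = f(x) + ε with zero-mean noise of variance σ², f̂ is the model fitted on a random training set, and the expectation runs over training sets and noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

Under these assumptions, no model can push the expected error below σ², so the practical goal is to keep the sum of the first two terms small.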

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Generate synthetic data
np.random.seed(0)
x = np.random.rand(100, 1)
y = 2 + 3 * x.squeeze() + np.random.randn(100)  # 1-D noise; randn(100, 1) would broadcast y to shape (100, 100)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Fit a linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Calculate mean squared error
mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error: {mse:.3f}')

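Example output (the exact value depends on the random data and library versions; because the noise has unit variance, a value of roughly 1 is expected):

Mean Squared Error: 1.045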
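A single train/test split reports one error number and does not separate bias from variance. The sketch below is one way to estimate both empirically: it refits a model on many freshly drawn training sets and looks at how the predictions behave at fixed evaluation points. The helper `estimate_bias_variance`, its parameters, and the choice of three models are hypothetical names and choices made for this illustration, and the data-generating process simply reuses y = 2 + 3x with unit-variance noise.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor


def true_fn(x):
    # Assumed ground truth, matching the synthetic data in this module
    return 2 + 3 * x


def estimate_bias_variance(make_model, n_repeats=200, n_train=80, noise_std=1.0, seed=0):
    """Estimate squared bias and variance by refitting on fresh training sets."""
    rng = np.random.default_rng(seed)
    x_eval = np.linspace(0, 1, 50).reshape(-1, 1)  # fixed evaluation points
    preds = np.empty((n_repeats, len(x_eval)))
    for i in range(n_repeats):
        # Draw a new training set from the same distribution each time
        x_train = rng.random((n_train, 1))
        y_train = true_fn(x_train.ravel()) + rng.normal(0, noise_std, n_train)
        preds[i] = make_model().fit(x_train, y_train).predict(x_eval)
    mean_pred = preds.mean(axis=0)
    # Squared bias: gap between the average prediction and the truth
    bias_sq = np.mean((mean_pred - true_fn(x_eval.ravel())) ** 2)
    # Variance: spread of predictions across training sets
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance


models = {
    "linear regression": LinearRegression,
    "depth-1 tree": lambda: DecisionTreeRegressor(max_depth=1),
    "fully grown tree": lambda: DecisionTreeRegressor(),
}

results = {}
for name, make_model in models.items():
    results[name] = estimate_bias_variance(make_model)
    bias_sq, variance = results[name]
    print(f"{name:>18}: bias^2 = {bias_sq:.3f}, variance = {variance:.3f}")
```

On this linear data, the linear model should show both low bias and low variance, the depth-1 tree should show noticeably higher bias, and the fully grown tree should show much higher variance. The exact figures depend on the seed and the number of repeats.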

Balancing Bias and Variance

The key to a good model is finding the right balance between bias and variance. A model with low bias but high variance overfits the training data, capturing noise as if it were part of the underlying pattern. Conversely, a model with high bias but low variance underfits the data and fails to capture the underlying pattern. Because reducing one usually increases the other, the goal is not to drive both to zero but to choose a level of model complexity that minimizes their combined contribution to the error on new, unseen data.

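import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Generate synthetic data: y = 2 + 3x plus unit-variance Gaussian noise
np.random.seed(0)
x = np.random.rand(100, 1)
y = 2 + 3 * x.squeeze() + np.random.randn(100)  # 1-D noise keeps y at shape (100,)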

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Fit a decision tree model
tree_model = DecisionTreeRegressor(max_depth=1)
tree_model.fit(X_train, y_train)
tree_pred = tree_model.predict(X_test)
mse_tree = mean_squared_error(y_test, tree_pred)

# Fit a random forest model
forest_model = RandomForestRegressor(max_depth=1, n_estimators=10, random_state=0)  # fixed seed for repeatable results
forest_model.fit(X_train, y_train)
forest_pred = forest_model.predict(X_test)
mse_forest = mean_squared_error(y_test, forest_pred)

print(f'Decision Tree MSE: {mse_tree}')
print(f'Random Forest MSE: {mse_forest}')
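Both models above are capped at max_depth=1, so they sit at the high-bias end of the spectrum. To see the other end as well, a minimal sketch is to sweep the tree depth and compare training and test error. The depth values, the sample size of 200, and the 50/50 split are arbitrary choices made for this illustration and are not part of the module.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Same kind of synthetic data as above: y = 2 + 3x plus unit-variance noise
rng = np.random.default_rng(0)
x = rng.random((200, 1))
y = 2 + 3 * x.ravel() + rng.normal(0, 1, 200)
X_train, X_test, y_train, y_test = train_test_split(
    x, y, test_size=0.5, random_state=42
)

depth_results = {}
for depth in [1, 2, 3, 5, 8, None]:  # None lets the tree grow until leaves are pure
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, tree.predict(X_train))
    test_mse = mean_squared_error(y_test, tree.predict(X_test))
    depth_results[depth] = (train_mse, test_mse)
    print(f"max_depth={str(depth):>4}: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")
```

Training error keeps falling as depth grows and reaches zero for the unrestricted tree, which memorizes the training points. Test error does not follow it down. A widening gap between the two is the usual sign that variance has started to dominate.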

💡 Tip: When tuning hyperparameters, be cautious of overfitting. Use techniques like cross-validation to ensure your model generalizes well to unseen data.
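As a rough illustration of the tip, the sketch below uses scikit-learn's cross_val_score with 5-fold cross-validation to compare a few tree depths. The candidate depths, the fold count, and the variable names cv_mse and best_depth are assumptions chosen for this example.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data of the same form used in this module
rng = np.random.default_rng(0)
x = rng.random((200, 1))
y = 2 + 3 * x.ravel() + rng.normal(0, 1, 200)

cv_mse = {}
for depth in [1, 2, 3, 5, 8]:
    model = DecisionTreeRegressor(max_depth=depth, random_state=0)
    # scikit-learn reports negated MSE so that higher scores are always better
    scores = cross_val_score(model, x, y, cv=5, scoring="neg_mean_squared_error")
    cv_mse[depth] = -scores.mean()
    print(f"max_depth={depth}: mean CV MSE = {cv_mse[depth]:.3f}")

# Pick the depth with the lowest average validation error
best_depth = min(cv_mse, key=cv_mse.get)
print(f"Depth with lowest cross-validated MSE: {best_depth}")
```

Because every fold serves as held-out data once, the averaged score is less sensitive to one lucky or unlucky split than a single train/test comparison.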

❓ What does high bias in a model indicate?

❓ Which model is more likely to have high variance?

Key Concepts

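Concept          Description
Bias             Error from approximating a complex problem with a simpler model; the gap between the average prediction and the true value
Variance         How much the fitted model would change if it were trained on different data
Tradeoff         Lowering bias tends to raise variance and vice versa, so model complexity must balance the two
Generalization   How well a model performs on new, unseen data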

Check Your Understanding

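❓ What happens to bias and variance at the extremes, when a model is far too simple or far too flexible?

❓ How does increasing model complexity typically affect bias and variance?

❓ In the decision tree example, which hyperparameter most directly controls the balance between bias and variance?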
