Bagging: The Basics
Duration: 7 min
This module introduces bagging, an ensemble learning technique. We will explore how bagging works, what its advantages are, and how to apply it in Python with scikit-learn. Bagging is worth understanding because it is one of the simplest ways to make a high-variance model more accurate and more stable.
Understanding Bagging
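Bagging, short for Bootstrap Aggregating, is an ensemble technique that trains multiple models on different bootstrap samples of the training data and then aggregates their predictions. A bootstrap sample is drawn from the training set at random with replacement, usually at the same size as the original set, so each model sees a slightly different version of the data. This approach reduces variance and makes the ensemble less prone to overfitting than a single model. Each model is trained independently, and the final prediction is typically the average (for regression) or the majority vote (for classification) of the individual models.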
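The two steps described above, bootstrap sampling and aggregation, can be written out by hand. The following is a minimal sketch for illustration, not how scikit-learn implements it: the number of models, the seeds, and the variable names (`rng`, `models`, `all_preds`, `votes`) are assumptions chosen for this example, and aggregation is done by a plain majority vote.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

rng = np.random.default_rng(42)  # seed chosen only for reproducibility
n_models = 10                    # illustrative ensemble size
models = []

for _ in range(n_models):
    # Bootstrap step: draw len(X_train) row indices with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(X_train[idx], y_train[idx])
    models.append(tree)

# Aggregation step: stack predictions into shape (n_models, n_test),
# count the votes per class for each test sample, and pick the most voted class
all_preds = np.stack([m.predict(X_test) for m in models])
n_classes = len(np.unique(y))
votes = np.apply_along_axis(np.bincount, 0, all_preds, minlength=n_classes)
y_pred = votes.argmax(axis=0)

print(f"Manual bagging accuracy: {np.mean(y_pred == y_test):.2f}")
```

Note that scikit-learn's BaggingClassifier averages predicted class probabilities when the base estimator provides them, so its predictions can differ slightly from this hard-vote sketch.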
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
# Load dataset
iris = load_iris()
X, y = iris.data, iris.target
# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Create a base classifier
base_clf = DecisionTreeClassifier()
# Create a Bagging ensemble
# scikit-learn 1.2 renamed `base_estimator` to `estimator`; the old name was removed in 1.4
bagging_clf = BaggingClassifier(estimator=base_clf, n_estimators=10, random_state=42)
# Train the Bagging ensemble
bagging_clf.fit(X_train, y_train)
# Make predictions
y_pred = bagging_clf.predict(X_test)
# Calculate accuracy
accuracy = np.mean(y_pred == y_test)
print(f'Accuracy: {accuracy:.2f}')
Accuracy: 0.97
Advantages of Bagging
Bagging offers several advantages: reduced variance, more stable predictions, and often better accuracy than a single high-variance model. Models trained on different bootstrap samples make partly different errors, and those errors tend to cancel when the predictions are aggregated, which lowers the risk of overfitting. Bagging does not reduce bias, so it helps little when the base model underfits. It can also be applied to high-dimensional data and is relatively simple to implement, making it a popular choice for ensemble learning.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
# Generate a synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
# Create a base classifier
base_clf = DecisionTreeClassifier()
# Create a Bagging ensemble
# scikit-learn 1.2 renamed `base_estimator` to `estimator`; the old name was removed in 1.4
bagging_clf = BaggingClassifier(estimator=base_clf, n_estimators=50, random_state=42)
# Perform cross-validation
scores = cross_val_score(bagging_clf, X, y, cv=5)
# Calculate average cross-validation score
average_score = np.mean(scores)
print(f'Average Cross-Validation Score: {average_score:.2f}')
💡 Tip: When using bagging, ensure that the base estimator is a high-variance model, such as a decision tree, to benefit from variance reduction.
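One way to check the tip empirically is to cross-validate a single decision tree and a bagged ensemble of the same tree on the same data. This is a rough sketch under assumed settings (the synthetic dataset from the example above, 50 estimators, 5 folds); the names `single_tree` and `bagged_trees` are introduced here for illustration, and the size of the gap will vary with the dataset and the seeds.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# A single fully grown tree: low bias, high variance
single_tree = DecisionTreeClassifier(random_state=42)

# The same tree bagged 50 times. The base estimator is passed positionally,
# which works regardless of whether the keyword is `estimator` or `base_estimator`
bagged_trees = BaggingClassifier(
    DecisionTreeClassifier(random_state=42),
    n_estimators=50,
    random_state=42,
)

tree_scores = cross_val_score(single_tree, X, y, cv=5)
bag_scores = cross_val_score(bagged_trees, X, y, cv=5)

# Compare mean accuracy and the spread across folds
print(f"Single tree : mean={tree_scores.mean():.3f}, std={tree_scores.std():.3f}")
print(f"Bagged trees: mean={bag_scores.mean():.3f}, std={bag_scores.std():.3f}")
```

If bagging is reducing variance as expected, the bagged ensemble should usually show a mean score at least as high as the single tree's. Repeating the comparison with a low-variance base estimator, such as a heavily depth-limited tree, typically shows a much smaller difference.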
❓ What is the primary goal of bagging?
❓ Which type of model is typically used as a base estimator in bagging?