Introduction to Ensemble Learning

Duration: 5 min

This module introduces ensemble learning, a machine learning technique that combines the predictions of multiple models to improve predictive performance. It covers the principles behind two common ensemble methods, bagging and boosting, their advantages, and how to implement them in Python with scikit-learn. Ensembles are a standard tool for building robust and accurate predictive models.
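As a rough illustration of why combining models can help, the sketch below takes a majority vote over three classifiers. The labels and predictions are made up for the example (they do not come from real models), and the names `pred_a`, `pred_b` and `pred_c` are hypothetical. Each classifier is wrong on a different pair of samples, so the vote corrects every individual mistake.

```python
import numpy as np

# Toy ground truth and hand-written predictions (illustrative only, not real model output)
y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1])
pred_a = np.array([1, 0, 1, 0, 1, 0, 1, 1])  # wrong on samples 0 and 1
pred_b = np.array([0, 1, 0, 1, 1, 0, 1, 1])  # wrong on samples 2 and 3
pred_c = np.array([0, 1, 1, 0, 0, 1, 1, 1])  # wrong on samples 4 and 5

# Majority vote for binary labels: predict 1 when at least 2 of 3 models say 1
votes = np.stack([pred_a, pred_b, pred_c]).sum(axis=0)
ensemble_pred = (votes >= 2).astype(int)

individual_acc = [float(np.mean(p == y_true)) for p in (pred_a, pred_b, pred_c)]
ensemble_acc = float(np.mean(ensemble_pred == y_true))

print("Individual accuracies:", individual_acc)  # [0.75, 0.75, 0.75]
print("Ensemble accuracy:", ensemble_acc)        # 1.0
```

The gain depends on the models making different errors. If all three were wrong on the same samples, the vote would be wrong there too.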

Bagging: Bootstrap Aggregating

Bagging (bootstrap aggregating) is an ensemble technique that trains multiple models on different bootstrap samples of the training data, that is, random subsets drawn with replacement, and then aggregates their predictions, typically by voting or averaging. Because each model sees a slightly different sample, the models make partly different errors that tend to cancel out when combined. This reduces variance and overfitting and makes the predictions more stable.

from sklearn.ensemble import BaggingClassifier
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create a base classifier
base_clf = DecisionTreeClassifier()

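# Create a Bagging ensemble of 10 trees.
# The base model is passed positionally because the keyword was renamed
# from `base_estimator` to `estimator` in scikit-learn 1.2.
bagging_clf = BaggingClassifier(base_clf, n_estimators=10, random_state=42)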

# Train the ensemble
bagging_clf.fit(X_train, y_train)

# Make predictions
y_pred = bagging_clf.predict(X_test)

# Print the accuracy
print(f'Accuracy: {bagging_clf.score(X_test, y_test):.2f}')

Accuracy: 0.97
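
The mechanism behind bagging can also be written out by hand. The following is a minimal sketch, not scikit-learn's internal implementation: it draws bootstrap samples with NumPy, fits one decision tree per sample, and combines the trees by a plain majority vote. The names `trees`, `all_preds` and `manual_pred` are chosen for this example. Note that `BaggingClassifier` averages predicted class probabilities when the base estimator provides them, so its results can differ slightly from a hard vote.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

rng = np.random.default_rng(42)
n_models = 10
trees = []

for _ in range(n_models):
    # Bootstrap sample: draw len(X_train) row indices with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# One row of predictions per tree, shape (n_models, n_test_samples)
all_preds = np.stack([tree.predict(X_test) for tree in trees])

# Aggregate: for each test sample, pick the class predicted most often
manual_pred = np.array([np.bincount(col, minlength=3).argmax() for col in all_preds.T])

manual_accuracy = float(np.mean(manual_pred == y_test))
print(f"Manual bagging accuracy: {manual_accuracy:.2f}")
```

Each tree is trained on roughly 63% of the distinct training rows, since sampling with replacement repeats some rows and leaves others out. That difference between samples is what makes the trees diverse.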

Boosting: Sequential Ensemble Learning

Boosting is another ensemble technique that builds models sequentially: each new model concentrates on the training examples that the previous models got wrong. Whereas bagging mainly reduces variance, boosting mainly reduces bias by combining many weak learners, such as shallow decision trees, into a single strong model. AdaBoost, used in the example that follows, does this by increasing the weights of misclassified samples after each round.

from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create an AdaBoost ensemble
boost_clf = AdaBoostClassifier(n_estimators=50, random_state=42)

# Train the ensemble
boost_clf.fit(X_train, y_train)

# Make predictions
y_pred = boost_clf.predict(X_test)

# Print the accuracy
print(f'Accuracy: {boost_clf.score(X_test, y_test):.2f}')
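
To make the sequential reweighting concrete, here is a simplified sketch of discrete AdaBoost for two classes. It is an illustration under stated assumptions, not what `AdaBoostClassifier` runs internally: it uses only two of the three iris classes (relabelled to -1 and +1), depth-1 trees as weak learners, and the hypothetical names `stumps`, `alphas` and `boosted_predict`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Two-class subset of iris (versicolor vs. virginica), relabelled to -1 / +1
X, y = load_iris(return_X_y=True)
mask = y != 0
X, y = X[mask], np.where(y[mask] == 1, -1, 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

n_rounds = 20
weights = np.full(len(X_train), 1.0 / len(X_train))  # start with uniform sample weights
stumps, alphas = [], []

for _ in range(n_rounds):
    # Weak learner: a depth-1 tree fitted to the currently weighted data
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X_train, y_train, sample_weight=weights)
    pred = stump.predict(X_train)

    # Weighted error rate (weights sum to 1), clipped to avoid log(0)
    err = np.clip(weights[pred != y_train].sum(), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)  # vote strength of this stump

    # Raise the weights of misclassified samples, lower the rest, then renormalize
    weights = weights * np.exp(-alpha * y_train * pred)
    weights = weights / weights.sum()

    stumps.append(stump)
    alphas.append(alpha)

def boosted_predict(X_new):
    """Weighted vote of all stumps; returns labels in {-1, +1}."""
    scores = sum(a * s.predict(X_new) for a, s in zip(alphas, stumps))
    return np.where(scores >= 0, 1, -1)

manual_boost_accuracy = float(np.mean(boosted_predict(X_test) == y_test))
print(f"Manual AdaBoost accuracy: {manual_boost_accuracy:.2f}")
```

Unlike bagging, the rounds cannot run in parallel: the weights used in round t depend on the mistakes made in round t-1.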

💡 Tip: When using ensemble methods, ensure that the base models are diverse, because models that all make the same errors gain little from being combined. Boosting in particular can overfit noisy data, so tune hyperparameters such as the number of estimators and the learning rate, ideally with cross-validation.
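
One possible way to follow this tip is a cross-validated grid search. The sketch below uses scikit-learn's `GridSearchCV` over `n_estimators` and `learning_rate`. The grid values are arbitrary choices for illustration, not recommended settings.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Candidate values (illustrative only)
param_grid = {
    "n_estimators": [10, 50],
    "learning_rate": [0.1, 1.0],
}

# 5-fold cross-validation on the training set for every parameter combination
search = GridSearchCV(AdaBoostClassifier(random_state=42), param_grid, cv=5)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print(f"Cross-validated accuracy: {search.best_score_:.2f}")
print(f"Test accuracy: {search.score(X_test, y_test):.2f}")
```

The test set is used only once, after the search, so the reported test accuracy is not influenced by the tuning.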

❓ What is the primary goal of bagging?

- To increase model complexity
- To reduce variance and overfitting
- To improve computational efficiency
- To simplify model interpretation

❓ How does boosting differ from bagging?

- Boosting trains models in parallel
- Boosting focuses on reducing bias by sequentially training models
- Boosting uses the same dataset for all models
- Boosting is less effective than bagging
