Module 19 of 28 · Supervised Learning · Beginner

Ensemble Methods

Duration: 5 min

This module covers ensemble methods, which combine multiple models to improve predictive performance over any single model. It explains the principles behind bagging and boosting, their advantages, and how to implement them in Python with scikit-learn.

Bagging: Random Forests

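Bagging, or Bootstrap Aggregating, is an ensemble technique that trains multiple models on bootstrap samples of the training data (rows drawn with replacement) and aggregates their predictions. Random Forests are a popular implementation of bagging: each decision tree is trained on a different bootstrap sample and considers only a random subset of features at each split. The trees' predictions are then combined, by majority vote or averaged class probabilities for classification and by averaging for regression. Because the trees make partly independent errors, aggregating them reduces variance, which lowers overfitting compared with a single deep tree and makes the model more stable.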

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Generate a random binary classification dataset
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the model
rf.fit(X_train, y_train)

# Make predictions
y_pred = rf.predict(X_test)

# Print the accuracy
accuracy = np.mean(y_pred == y_test)
print(f'Accuracy: {accuracy:.2f}')

Accuracy: 0.85
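
To make the bagging idea concrete, the sketch below builds a small bagged ensemble by hand instead of calling RandomForestClassifier. It is an illustration under stated assumptions, not how scikit-learn implements the forest internally: the tree count of 25 and the names trees, all_preds, and bagged_pred are choices made for this example, and the sketch uses a hard majority vote, whereas scikit-learn's Random Forest averages the trees' predicted class probabilities.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Same synthetic dataset and split as the Random Forest example
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

rng = np.random.default_rng(42)
n_trees = 25  # illustrative choice; odd, so a binary vote cannot tie

trees = []
for i in range(n_trees):
    # Bootstrap sample: draw as many row indices as there are training rows, with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # max_features="sqrt" makes each split consider a random subset of features
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Stack each tree's test predictions: shape (n_trees, n_test_samples)
all_preds = np.array([t.predict(X_test) for t in trees])

# Majority vote for 0/1 labels: predict 1 when more than half of the trees do
bagged_pred = (all_preds.mean(axis=0) > 0.5).astype(int)

bagged_accuracy = np.mean(bagged_pred == y_test)
single_accuracy = np.mean(trees[0].predict(X_test) == y_test)
print(f'Single tree accuracy: {single_accuracy:.2f}')
print(f'Bagged ensemble accuracy: {bagged_accuracy:.2f}')
```

Comparing the two printed values shows how much the vote helps on this particular split; the gap depends on the data and the seeds.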

Boosting: Gradient Boosting

Boosting is another ensemble technique in which models are built sequentially, each one trained to correct the errors of the ensemble built so far. Gradient Boosting builds the model stage by stage: each new tree is fit to the negative gradient of a differentiable loss function with respect to the current predictions, and its output is added to the ensemble scaled by the learning rate. It works for both regression and classification tasks.

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Generate a random binary classification dataset
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the GradientBoostingClassifier
gbc = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)

# Train the model
gbc.fit(X_train, y_train)

# Make predictions
y_pred = gbc.predict(X_test)

# Print the accuracy
accuracy = np.mean(y_pred == y_test)
print(f'Accuracy: {accuracy:.2f}')
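
The stage-wise idea behind gradient boosting can be sketched by hand. The example below is a simplified illustration, not the GradientBoostingClassifier implementation: it switches to a regression problem with squared-error loss, because there the negative gradient is simply the residual y - prediction. The dataset, the 50 stages, and the variable names are assumptions made for this sketch.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (illustrative; not the dataset used elsewhere in this module)
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=42)

learning_rate = 0.1
n_stages = 50

# Stage 0: start from a constant prediction; the mean minimizes squared error
prediction = np.full(y.shape, y.mean())
initial_mse = np.mean((y - prediction) ** 2)

stages = []
for _ in range(n_stages):
    # For squared-error loss, the negative gradient is the residual
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=3, random_state=42)
    tree.fit(X, residuals)
    # Add a shrunken version of the new tree's output to the running prediction
    prediction += learning_rate * tree.predict(X)
    stages.append(tree)

final_mse = np.mean((y - prediction) ** 2)
print(f'Training MSE before boosting: {initial_mse:.1f}')
print(f'Training MSE after {n_stages} stages: {final_mse:.1f}')
```

Each stage only nudges the prediction by learning_rate times the tree's output, which is why a smaller learning rate needs more stages to reach the same training error.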

💡 Tip: When using Gradient Boosting, be cautious with the learning rate and the number of estimators. A lower learning rate with more estimators often yields better results but requires more computational resources.
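
One way to explore this trade-off is to track test accuracy after every boosting stage with staged_predict. The sketch below compares two settings chosen for illustration (0.1 with 100 trees and 0.02 with 500 trees); these pairs and the results dictionary are assumptions for this example, and which setting wins depends on the dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Same synthetic dataset and split as the Gradient Boosting example
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# (learning_rate, n_estimators) pairs to compare; illustrative values only
settings = [(0.1, 100), (0.02, 500)]

results = {}
for lr, n_trees in settings:
    gbc = GradientBoostingClassifier(
        n_estimators=n_trees, learning_rate=lr, max_depth=3, random_state=42
    )
    gbc.fit(X_train, y_train)
    # staged_predict yields the ensemble's predictions after each boosting stage
    staged_acc = [np.mean(pred == y_test) for pred in gbc.staged_predict(X_test)]
    results[lr] = staged_acc
    best_stage = int(np.argmax(staged_acc)) + 1
    print(f'learning_rate={lr}: final accuracy {staged_acc[-1]:.2f}, '
          f'best accuracy {max(staged_acc):.2f} at {best_stage} trees')
```

Selecting the stage with the best test accuracy, as done here for display, would leak test information in a real project; a separate validation set or cross-validation is the safer choice for tuning.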

❓ What is the primary advantage of using Random Forests over a single decision tree?

❓ Which parameter in Gradient Boosting controls the contribution of each tree to the final model?

Key Concepts

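Concept | Description
Voting | Combines the predictions of several different models by majority vote (hard voting) or by averaging predicted probabilities (soft voting)
Stacking | Trains a meta-model on the predictions of several base models to learn how best to combine them
Bagging | Trains models on bootstrap samples of the data and aggregates their predictions to reduce variance
Boosting | Trains models sequentially, each one correcting the errors of the ensemble built so far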
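
Voting and stacking are listed above but have no example in this module, so here is a minimal sketch using scikit-learn's VotingClassifier and StackingClassifier. The choice of base models, the soft-voting setting, and the logistic-regression meta-model are assumptions made for illustration rather than recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Same synthetic dataset and split as the earlier examples
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Base models shared by both ensembles (illustrative selection);
# each ensemble fits its own clones, so reusing the list is safe
base_models = [
    ('lr', LogisticRegression(max_iter=1000)),
    ('tree', DecisionTreeClassifier(max_depth=5, random_state=42)),
    ('rf', RandomForestClassifier(n_estimators=50, random_state=42)),
]

# Voting: average the base models' predicted probabilities (soft voting)
voting = VotingClassifier(estimators=base_models, voting='soft')
voting.fit(X_train, y_train)

# Stacking: a meta-model learns from cross-validated base-model predictions
stacking = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stacking.fit(X_train, y_train)

voting_accuracy = voting.score(X_test, y_test)
stacking_accuracy = stacking.score(X_test, y_test)
print(f'Voting accuracy: {voting_accuracy:.2f}')
print(f'Stacking accuracy: {stacking_accuracy:.2f}')
```

The main design difference is that voting combines the base models with a fixed rule, while stacking fits an extra model to decide how much weight each base model's prediction deserves.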

Check Your Understanding

❓ How do ensemble methods handle edge cases?

❓ What is the computational complexity of training an ensemble?

❓ Which hyperparameter is most critical for an ensemble method?
