Introduction to Ensemble Learning
Duration: 5 min
This module introduces ensemble learning, a machine learning technique that combines the predictions of multiple models to improve predictive performance. We cover the principles behind two common ensemble methods, bagging and boosting, their advantages, and how to implement them in Python with scikit-learn. Ensembles are often more accurate and more robust than any single model they contain.
Bagging: Bootstrap Aggregating
Bagging (bootstrap aggregating) trains multiple models on different bootstrap samples of the training data, each drawn at random with replacement, and then aggregates their predictions, typically by majority vote for classification or by averaging for regression. Because the errors of the individual models partly cancel out, bagging reduces variance and overfitting, which makes it most useful with high-variance base models such as decision trees.
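To make the two steps of bagging (bootstrap sampling, then aggregation) concrete, here is a minimal hand-written sketch. It is an illustration under stated assumptions, not how scikit-learn implements bagging internally: the number of models (10), the seed, and the variable names are arbitrary choices made for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

rng = np.random.default_rng(42)  # seed chosen only for reproducibility
n_models = 10                    # illustrative choice
all_preds = []

for _ in range(n_models):
    # Bootstrap sample: draw len(X_train) row indices with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(X_train[idx], y_train[idx])
    all_preds.append(tree.predict(X_test))

# Aggregate: count the votes for each of the 3 classes, per test sample
all_preds = np.stack(all_preds)  # shape (n_models, n_test)
votes = np.apply_along_axis(np.bincount, 0, all_preds, minlength=3)
y_vote = votes.argmax(axis=0)    # majority class for each test sample

accuracy = (y_vote == y_test).mean()
print(f"Manual bagging accuracy: {accuracy:.2f}")
```

Each tree sees a slightly different version of the training set, so the trees disagree on borderline samples and the vote smooths out those disagreements. BaggingClassifier packages the same idea behind a single estimator.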
from sklearn.ensemble import BaggingClassifier
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
# Load dataset
iris = load_iris()
X, y = iris.data, iris.target
# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Create a base classifier
base_clf = DecisionTreeClassifier()
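# Create a Bagging ensemble. The base model is passed positionally because the
# keyword was renamed from base_estimator to estimator in scikit-learn 1.2.
bagging_clf = BaggingClassifier(base_clf, n_estimators=10, random_state=42)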
# Train the ensemble
bagging_clf.fit(X_train, y_train)
# Make predictions
y_pred = bagging_clf.predict(X_test)
# Print the accuracy
print(f'Accuracy: {bagging_clf.score(X_test, y_test):.2f}')
Accuracy: 0.97
Boosting: Sequential Ensemble Learning
Boosting builds models sequentially: each new model concentrates on the examples that the previous models got wrong. AdaBoost, used in the example in this section, does this by increasing the weights of misclassified training samples before fitting the next model, and then combines all models in a weighted vote. Boosting primarily reduces bias, so it can turn a collection of weak learners, such as shallow decision trees, into an accurate model.
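The sequential error-correcting loop can be sketched by hand. The following is a simplified, discrete AdaBoost for a binary problem, written only to illustrate the idea. The binary task (versicolor vs. the rest), the 20 rounds, and all variable names are assumptions made for this example, and scikit-learn's AdaBoostClassifier differs in its details.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# Assumption for simplicity: a binary task with labels in {-1, +1}
y_bin = np.where(y == 1, 1, -1)

n_rounds = 20                              # illustrative choice
weights = np.full(len(X), 1.0 / len(X))    # start with uniform sample weights
stumps, alphas = [], []

for _ in range(n_rounds):
    # Weak learner: a depth-1 tree fitted on the current sample weights
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X, y_bin, sample_weight=weights)
    pred = stump.predict(X)

    # Weighted error, clipped to keep the logarithm finite
    err = np.clip(np.sum(weights[pred != y_bin]), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)  # this model's say in the final vote

    # Increase weights of misclassified samples, decrease the rest
    weights *= np.exp(-alpha * y_bin * pred)
    weights /= weights.sum()

    stumps.append(stump)
    alphas.append(alpha)

# Final prediction: sign of the alpha-weighted sum of stump predictions
scores = np.zeros(len(X))
for alpha, stump in zip(alphas, stumps):
    scores += alpha * stump.predict(X)
ensemble_pred = np.sign(scores)

first_acc = (stumps[0].predict(X) == y_bin).mean()
ensemble_acc = (ensemble_pred == y_bin).mean()
print(f"First stump (training accuracy): {first_acc:.2f}")
print(f"Boosted ensemble (training accuracy): {ensemble_acc:.2f}")
```

No single stump can isolate the middle class on its own, but a weighted combination of stumps can, which is the sense in which boosting reduces bias. The accuracies here are measured on the training data, so they show how well the ensemble fits and say nothing about generalization.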
from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
# Load dataset
iris = load_iris()
X, y = iris.data, iris.target
# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Create an AdaBoost ensemble
boost_clf = AdaBoostClassifier(n_estimators=50, random_state=42)
# Train the ensemble
boost_clf.fit(X_train, y_train)
# Make predictions
y_pred = boost_clf.predict(X_test)
# Print the accuracy
print(f'Accuracy: {boost_clf.score(X_test, y_test):.2f}')
💡 Tip: When using ensemble methods, ensure that the base models are diverse to maximize the benefits of ensemble learning. Additionally, be cautious of overfitting, especially with boosting methods, by tuning hyperparameters appropriately.
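One way to act on the advice about overfitting is to watch how accuracy changes as boosting rounds are added. The sketch below uses AdaBoostClassifier's staged_score method on a held-out validation split. The split sizes, the learning_rate value of 0.5, and the variable names are illustrative assumptions, not recommended settings.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
# Hold out a validation set so the test set is not used for tuning
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42
)

boost_clf = AdaBoostClassifier(n_estimators=50, learning_rate=0.5, random_state=42)
boost_clf.fit(X_train, y_train)

# staged_score yields the accuracy after each boosting round
train_scores = list(boost_clf.staged_score(X_train, y_train))
val_scores = list(boost_clf.staged_score(X_val, y_val))

# Boosting may stop early, so rely on the actual number of stages
for i in range(0, len(val_scores), 10):
    print(f"rounds={i + 1:2d}  train={train_scores[i]:.2f}  val={val_scores[i]:.2f}")

# Smallest number of rounds that reaches the best validation accuracy
best_n = val_scores.index(max(val_scores)) + 1
print(f"Best validation accuracy first reached at {best_n} rounds")
```

If training accuracy keeps rising while validation accuracy stalls or drops, the ensemble is starting to overfit, and a smaller n_estimators or a lower learning_rate is worth trying. On a dataset as small as Iris the curves are noisy, so cross-validation would give a more reliable picture.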
❓ What is the primary goal of bagging?
❓ How does boosting differ from bagging?