Module 8 of 26 · Scikit-Learn Machine Learning · Beginner

Random Forests

Duration: 5 min

This module introduces Random Forests, an ensemble learning method that builds many decision trees during training and, for classification, predicts the class chosen by the most trees (the mode of the individual trees' predictions). Random Forests scale well to large datasets, are usually more accurate and less prone to overfitting than a single decision tree, and report how much each feature contributes to the model.

Understanding Random Forests

Random Forests combine the predictions of multiple decision trees to produce more accurate and stable predictions than any single tree. Each tree is built from a sample drawn with replacement (a bootstrap sample) from the training set. In addition, when a node is split during tree construction, the best split is chosen either from all input features or from a random subset of them, depending on the max_features setting. These two sources of randomness reduce the correlation between trees, which matters because averaging many weakly correlated trees lowers the variance of the ensemble.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# Generate a binary classification dataset.
X, y = make_classification(n_samples=1000, n_features=4,
                            n_informative=2, n_redundant=0,
                            random_state=0, shuffle=False)

# Create a random forest classifier with 100 trees
clf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0)

# Fit the classifier (this example uses the full dataset; there is no hold-out split)
clf.fit(X, y)

# Predict labels for the same data the model was trained on
y_pred = clf.predict(X)

# Print the training accuracy (an optimistic estimate of accuracy on new data)
print(f'Accuracy: {np.mean(y == y_pred)}')


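Output: a single line of the form "Accuracy: <value>", where the value is the fraction of training samples classified correctly. Because the model is scored on the data it was fitted on, this figure is optimistic, and with trees limited to max_depth=2 it is unlikely to reach exactly 1.0.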
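To make the two sources of randomness described above concrete, here is a minimal hand-rolled sketch: each tree is fitted on a bootstrap sample, each split considers a random subset of features, and the trees vote. The names n_trees, trees, votes, and y_vote are illustrative, and the choice of 25 trees is arbitrary. This is not how RandomForestClassifier is implemented internally; it only mimics the idea.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Same synthetic dataset as in the example above.
X, y = make_classification(n_samples=1000, n_features=4,
                           n_informative=2, n_redundant=0,
                           random_state=0, shuffle=False)

rng = np.random.default_rng(0)
n_trees = 25  # illustrative value; odd, so a binary vote cannot tie
trees = []
for _ in range(n_trees):
    # Bootstrap sample: draw as many row indices as there are rows, with replacement.
    idx = rng.integers(0, len(X), size=len(X))
    # max_features="sqrt" makes every split consider a random subset of the features.
    tree = DecisionTreeClassifier(max_depth=2, max_features="sqrt",
                                  random_state=int(rng.integers(1_000_000)))
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Majority vote: labels are 0/1, so the mean vote thresholded at 0.5 is the mode.
votes = np.stack([t.predict(X) for t in trees])  # shape (n_trees, n_samples)
y_vote = (votes.mean(axis=0) > 0.5).astype(int)
print(f"Hand-rolled ensemble training accuracy: {np.mean(y_vote == y):.3f}")
```

One difference worth knowing: scikit-learn's RandomForestClassifier combines trees by averaging their predicted class probabilities rather than counting hard votes, so its predictions can differ slightly from a strict majority vote.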

Feature Importance in Random Forests

One advantage of Random Forests is that they report feature importances. The importance of a feature is computed as the normalized total reduction of the split criterion (for example, Gini impurity) brought by that feature, averaged over the trees; this is also known as the Gini importance. The scores sum to 1 and are exposed as the feature_importances_ attribute after fitting. They are useful for feature selection and for understanding which features drive the model's predictions, but because they are computed on the training data they can overstate features with many distinct values.

import matplotlib.pyplot as plt

# Get the feature importances
importances = clf.feature_importances_

# Plot the feature importances of the forest
plt.figure()
plt.title('Feature Importances')
plt.bar(range(X.shape[1]), importances, color='r', align='center')
plt.xticks(range(X.shape[1]), range(X.shape[1]))
plt.xlim([-1, X.shape[1]])
plt.show()
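As a cross-check on the impurity-based scores plotted above, the sketch below compares them with permutation importance measured on held-out data. The train/test split, the variable names (forest, impurity, perm), and n_repeats=10 are assumptions made for this illustration, not part of the original example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Same synthetic dataset as before; with shuffle=False the two informative
# features are the first two columns.
X, y = make_classification(n_samples=1000, n_features=4,
                           n_informative=2, n_redundant=0,
                           random_state=0, shuffle=False)

# Hold out a test set so permutation importance is measured on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

forest = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0)
forest.fit(X_train, y_train)

# Impurity-based (Gini) importances, computed from the training data.
impurity = forest.feature_importances_

# Permutation importance: the drop in test accuracy when one column is shuffled.
perm = permutation_importance(forest, X_test, y_test,
                              n_repeats=10, random_state=0)

for i in range(X.shape[1]):
    print(f"feature {i}: impurity={impurity[i]:.3f}  "
          f"permutation={perm.importances_mean[i]:.3f} "
          f"(+/- {perm.importances_std[i]:.3f})")
```

When the two measures rank the features similarly, that is some reassurance that the ranking is not an artifact of how the trees were grown.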

💡 Tip: When tuning a Random Forest, start with the number of trees (n_estimators) and the maximum depth of the trees (max_depth). Adding trees generally improves or stabilizes accuracy, with diminishing returns, while training and prediction time grow roughly in proportion. Limiting the maximum depth makes each tree simpler and can help prevent overfitting, at the risk of underfitting if the limit is too low.
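A small sketch of how the tip might be applied: score a few combinations of n_estimators and max_depth with cross-validation and compare the means. The candidate values and the results dictionary are illustrative choices, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=1000, n_features=4,
                           n_informative=2, n_redundant=0,
                           random_state=0, shuffle=False)

# The dataset was generated with shuffle=False, so rows are ordered;
# shuffle inside the splitter so that every fold is representative.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for n_estimators in (10, 100):
    for max_depth in (2, None):  # None lets each tree grow until its leaves are pure
        model = RandomForestClassifier(n_estimators=n_estimators,
                                       max_depth=max_depth, random_state=0)
        scores = cross_val_score(model, X, y, cv=cv)
        results[(n_estimators, max_depth)] = scores.mean()
        print(f"n_estimators={n_estimators:>3}, max_depth={max_depth}: "
              f"mean CV accuracy = {scores.mean():.3f}")
```

For larger searches, GridSearchCV or RandomizedSearchCV from sklearn.model_selection automate the same loop.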

❓ What is the primary advantage of using Random Forests over a single decision tree?

❓ How does Random Forest determine the importance of features?

Key Concepts

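Concept: Description
Bootstrap Aggregating (bagging): Training each tree on a sample drawn with replacement from the training set, then combining the trees' predictions.
Feature Importance: A per-feature score based on how much splits on that feature reduce the split criterion, averaged over the trees.
Out-of-Bag Error: An error estimate obtained by predicting each training sample using only the trees whose bootstrap sample did not contain it.
Ensemble: A model that combines the predictions of several base models, here decision trees.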
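The out-of-bag estimate listed above can be obtained directly from scikit-learn by passing oob_score=True. The sketch below assumes the same synthetic dataset as earlier in the module; the names forest and oob_error are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=4,
                           n_informative=2, n_redundant=0,
                           random_state=0, shuffle=False)

# oob_score=True asks the forest to score each training sample using only
# the trees that did not see it in their bootstrap sample.
forest = RandomForestClassifier(n_estimators=100, oob_score=True,
                                random_state=0)
forest.fit(X, y)

oob_error = 1.0 - forest.oob_score_  # oob_score_ is an accuracy for classifiers
print(f"Training accuracy: {forest.score(X, y):.3f}")
print(f"OOB accuracy:      {forest.oob_score_:.3f}")
print(f"OOB error:         {oob_error:.3f}")
```

Because every out-of-bag prediction comes from trees that never saw the sample, the OOB accuracy acts as a built-in validation estimate without a separate hold-out set.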

Check Your Understanding

❓ How does a Random Forest handle edge cases?

❓ What is the computational complexity of a Random Forest?

❓ Which hyperparameter is most critical for a Random Forest?
