Random Forests

Duration: 5 min

This module covers Random Forests, an ensemble learning method that builds many decision trees during training and, for classification, outputs the class that is the mode of the classes predicted by the individual trees. Random Forests scale well to large datasets, are less prone to overfitting than a single decision tree, and provide a built-in measure of feature importance.

Understanding Random Forests

Random Forests combine the predictions of multiple decision trees to produce more accurate and stable predictions than any single tree. Each tree is built from a bootstrap sample, that is, a sample drawn with replacement from the training set. In addition, when a node is split during tree construction, the best split is searched for either among all input features or among a random subset of them (in scikit-learn, the max_features parameter controls the size of that subset). These two sources of randomness de-correlate the trees, so their individual errors tend to average out when the predictions are combined, which is crucial for the performance of the ensemble.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# Generate a binary classification dataset.
X, y = make_classification(n_samples=1000, n_features=4,
                            n_informative=2, n_redundant=0,
                            random_state=0, shuffle=False)

# Create a random forest classifier with 100 trees, each limited to depth 2
clf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0)

# Train the classifier (here on the full dataset, for brevity)
clf.fit(X, y)

# Predict labels for the same data the model was trained on
y_pred = clf.predict(X)

# Print the training accuracy; this is an optimistic estimate, not a measure
# of performance on unseen data
print(f'Accuracy: {np.mean(y == y_pred)}')

Accuracy: 1.0
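
To make the two sources of randomness concrete, here is a minimal hand-rolled sketch of the same idea using single decision trees. It is an illustration under stated assumptions, not how scikit-learn implements RandomForestClassifier internally: the tree count of 25, the use of max_features="sqrt", and the hard majority vote are choices made for this example (scikit-learn's forest averages predicted class probabilities instead of counting votes).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Same synthetic dataset as in the example above.
X, y = make_classification(n_samples=1000, n_features=4,
                           n_informative=2, n_redundant=0,
                           random_state=0, shuffle=False)

rng = np.random.default_rng(0)
n_trees = 25  # illustrative choice; odd, so a binary vote cannot tie

trees = []
for i in range(n_trees):
    # Bootstrap sample: draw as many row indices as there are rows, with replacement.
    idx = rng.integers(0, len(X), size=len(X))
    # max_features="sqrt" makes each split consider a random subset of features.
    tree = DecisionTreeClassifier(max_depth=2, max_features="sqrt",
                                  random_state=i)
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Majority vote: labels are 0/1, so the mean vote is thresholded at 0.5.
votes = np.stack([t.predict(X) for t in trees])  # shape (n_trees, n_samples)
y_vote = (votes.mean(axis=0) >= 0.5).astype(int)

print(f"Hand-rolled ensemble training accuracy: {np.mean(y_vote == y):.3f}")
```

Each tree sees a different resampling of the rows and a different subset of features at each split, so the trees disagree on some samples; the vote smooths those disagreements out.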

Feature Importance in Random Forests

One advantage of Random Forests is that they provide insight into feature importance. A score is computed for each feature as the (normalized) total reduction of the split criterion brought by that feature; this is also known as the Gini importance. The scores are useful for feature selection and for understanding which features contribute most to the model's predictions.

import matplotlib.pyplot as plt

# Get the feature importances
importances = clf.feature_importances_

# Plot the feature importances of the forest
plt.figure()
plt.title('Feature Importances')
plt.bar(range(X.shape[1]), importances, color='r', align='center')
plt.xticks(range(X.shape[1]), range(X.shape[1]))
plt.xlim([-1, X.shape[1]])
plt.show()
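
The forest-level scores can also be recovered from the individual trees. The sketch below refits the same classifier so that it runs on its own, averages the per-tree importances exposed through estimators_, renormalizes them, and compares the result with feature_importances_. The variable names per_tree and manual are introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Same dataset and model settings as in the earlier example.
X, y = make_classification(n_samples=1000, n_features=4,
                           n_informative=2, n_redundant=0,
                           random_state=0, shuffle=False)
clf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0)
clf.fit(X, y)

# Each fitted tree exposes its own impurity-based importances.
per_tree = np.array([tree.feature_importances_ for tree in clf.estimators_])

# Average across trees, then renormalize so the scores sum to 1.
manual = per_tree.mean(axis=0)
manual = manual / manual.sum()

for i, (m, f) in enumerate(zip(manual, clf.feature_importances_)):
    print(f"feature {i}: manual={m:.4f}  forest={f:.4f}")
```

Because the dataset was generated with shuffle=False, the two informative features are the first two columns, so they should receive most of the importance.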

💡 Tip: When tuning a Random Forest, start with the number of trees (n_estimators) and the maximum depth of the trees (max_depth). Adding trees generally improves accuracy, with diminishing returns, but also increases training time. Limiting the maximum depth makes each tree simpler, which can help prevent overfitting.
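
One way to compare settings is cross-validation. The following is a minimal sketch rather than a recommended tuning procedure: the candidate values (10 and 100 trees, depth 2 and unlimited) and the 5-fold shuffled split are assumptions chosen to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Same synthetic dataset as in the earlier example.
X, y = make_classification(n_samples=1000, n_features=4,
                           n_informative=2, n_redundant=0,
                           random_state=0, shuffle=False)

# Shuffled, stratified folds, because the generated rows are not shuffled.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for n_estimators in (10, 100):
    for max_depth in (2, None):  # None lets the trees grow without a depth limit
        clf = RandomForestClassifier(n_estimators=n_estimators,
                                     max_depth=max_depth, random_state=0)
        scores = cross_val_score(clf, X, y, cv=cv)  # accuracy on each fold
        results[(n_estimators, max_depth)] = scores.mean()
        print(f"n_estimators={n_estimators:>3}, max_depth={max_depth}: "
              f"mean CV accuracy = {scores.mean():.3f}")
```

Unlike the training accuracy printed earlier, these scores are measured on held-out folds, so they give a fairer basis for choosing between settings.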

❓ What is the primary advantage of using Random Forests over a single decision tree?

Lower computational cost
Less prone to overfitting
Requires less data
Simpler to interpret

❓ How does Random Forest determine the importance of features?

By the frequency of their appearance in the trees
By the total reduction of the criterion brought by that feature
By the depth of the tree they are used in
By the number of trees they appear in

Key Concepts

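Concept	Description
Bootstrap Aggregating	Training each tree on a bootstrap sample (drawn with replacement) and combining the trees' predictions; also called bagging
Feature Importance	Per-feature score based on the total reduction of the split criterion that the feature brings across the forest
Out-of-Bag Error	Error estimate computed for each training sample using only the trees whose bootstrap sample did not include it
Ensemble	A model that combines the predictions of several base models, here decision trees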
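
Out-of-bag error can be obtained directly from scikit-learn by passing oob_score=True, which makes the fitted forest expose an oob_score_ attribute (accuracy, for a classifier). The sketch below reuses the dataset and settings from the first example; the variable name oob_accuracy is introduced here for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Same synthetic dataset as in the earlier example.
X, y = make_classification(n_samples=1000, n_features=4,
                           n_informative=2, n_redundant=0,
                           random_state=0, shuffle=False)

# oob_score=True scores each sample using only the trees that did not
# see it in their bootstrap sample (requires bootstrap=True, the default).
clf = RandomForestClassifier(n_estimators=100, max_depth=2,
                             oob_score=True, random_state=0)
clf.fit(X, y)

oob_accuracy = clf.oob_score_
print(f"OOB accuracy:      {oob_accuracy:.3f}")
print(f"OOB error:         {1 - oob_accuracy:.3f}")
print(f"Training accuracy: {clf.score(X, y):.3f}")
```

The out-of-bag estimate comes at no extra data cost, since it reuses the samples each tree happened to leave out, which makes it a convenient stand-in for a separate validation set.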

Check Your Understanding

❓ How does a Random Forest handle edge cases?

Ignores them
Applies regularization
Removes them
Duplicates them

❓ What is the computational complexity of a Random Forest?

O(n)
O(n²)
O(log n)
Depends on implementation

❓ Which hyperparameter is most critical for a Random Forest?

Learning rate
Batch size
Epochs
All equally important
