Module 8 of 28 · Supervised Learning · Beginner

Random Forests Basics

Duration: 5 min

This module introduces Random Forests, an ensemble learning method that combines many decision trees to improve predictive accuracy and reduce overfitting. It covers how the ensemble is built, how to train one with scikit-learn, and which parameters matter most.

Understanding Ensemble Learning

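Ensemble learning combines multiple models to achieve better predictive performance than any single model. Random Forests build on this idea by creating a 'forest' of decision trees. Each tree is trained on a bootstrap sample of the training data (rows drawn at random with replacement), and each split within a tree considers only a random subset of the features. The final prediction aggregates the individual trees' predictions, by voting for classification or by averaging for regression. Because the trees make partly independent errors, aggregating them reduces variance and makes the model more robust than a single tree.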
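To make the idea concrete, here is a minimal hand-rolled sketch of bootstrap sampling plus majority voting, built from plain DecisionTreeClassifier objects. It is a simplification for illustration only, not how scikit-learn implements RandomForestClassifier internally (which averages class probabilities rather than counting hard votes). The tree count of 25 and the variable names are assumptions chosen for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Same synthetic dataset as in the rest of this module
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

rng = np.random.default_rng(42)
n_trees = 25  # illustrative value; odd, so a binary vote cannot tie

trees = []
for i in range(n_trees):
    # Bootstrap sample: draw row indices with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # max_features='sqrt' makes each split consider a random subset of features
    tree = DecisionTreeClassifier(max_features='sqrt', random_state=i)
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Majority vote: labels are 0/1, so the mean is the fraction of trees voting 1
all_preds = np.array([t.predict(X_test) for t in trees])  # shape (n_trees, n_test)
y_pred = (all_preds.mean(axis=0) > 0.5).astype(int)

ensemble_accuracy = np.mean(y_pred == y_test)
print(f'Hand-rolled ensemble accuracy: {ensemble_accuracy:.2f}')
```

In practice you would use RandomForestClassifier directly, as in the example that follows. The loop above only shows where the two sources of randomness (rows and features) enter.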

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Generate a random dataset
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the Random Forest classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the model
rf_classifier.fit(X_train, y_train)

# Make predictions
y_pred = rf_classifier.predict(X_test)

# Print the accuracy
accuracy = np.mean(y_pred == y_test)
print(f'Accuracy: {accuracy:.2f}')

Example output (the exact value may differ depending on your scikit-learn version):

Accuracy: 0.85

Key Parameters in Random Forests

Several key parameters influence the performance of Random Forests, including n_estimators, which defines the number of trees in the forest, and max_features, which specifies the number of features to consider when looking for the best split. Tuning these parameters can significantly impact the model's accuracy and generalization capability.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Generate a random dataset
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the Random Forest classifier with specific parameters
rf_classifier = RandomForestClassifier(n_estimators=200, max_features='sqrt', random_state=42)

# Train the model
rf_classifier.fit(X_train, y_train)

# Make predictions
y_pred = rf_classifier.predict(X_test)

# Print the accuracy
accuracy = np.mean(y_pred == y_test)
print(f'Accuracy: {accuracy:.2f}')
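One way to see what these two parameters actually do is to inspect a fitted forest. The short sketch below assumes the same synthetic dataset with 20 features. It reads the `estimators_` list of the forest and the `max_features_` attribute of one fitted tree, both of which scikit-learn exposes after fitting. The names `n_trees` and `features_per_split` are just illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)

rf = RandomForestClassifier(n_estimators=200, max_features='sqrt', random_state=42)
rf.fit(X, y)

# n_estimators controls how many fitted trees the forest holds
n_trees = len(rf.estimators_)

# max_features='sqrt' is resolved to an integer on each fitted tree
features_per_split = rf.estimators_[0].max_features_

print(f'Trees in the forest: {n_trees}')
print(f'Features considered per split: {features_per_split}')
```

With 20 input features, 'sqrt' resolves to 4 candidate features per split (the integer part of the square root of 20), and the forest holds 200 trees.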

💡 Tip: Tune hyperparameters with a systematic search such as Grid Search or Random Search, scored by cross-validation, rather than relying on the defaults or on test-set accuracy alone.
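As a rough sketch of what such a search might look like, the example below uses scikit-learn's GridSearchCV over a deliberately tiny grid. The candidate values, the 3-fold cross-validation, and the accuracy scoring are assumptions chosen to keep the run short. They are not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Small illustrative grid: 2 x 2 = 4 parameter combinations
param_grid = {
    'n_estimators': [50, 100],
    'max_features': ['sqrt', 0.5],  # 0.5 means half of the features per split
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,                # 3-fold cross-validation on the training set only
    scoring='accuracy',
)
search.fit(X_train, y_train)

print('Best parameters:', search.best_params_)
print(f'Best cross-validated accuracy: {search.best_score_:.2f}')
# The test set is touched once, after the search has finished
print(f'Held-out test accuracy: {search.score(X_test, y_test):.2f}')
```

For larger grids, RandomizedSearchCV samples a fixed number of combinations instead of trying all of them, which usually keeps the cost manageable.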

❓ What is the primary advantage of using Random Forests over a single decision tree?

❓ Which parameter in Random Forests controls the number of features considered for splitting a node?

Key Concepts

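Concept: Description
Bootstrap Aggregating (bagging): Training each tree on a random sample of the training data drawn with replacement, then aggregating the trees' predictions.
Feature Importance: A per-feature score that indicates how much each feature contributes to the forest's splits and predictions.
Out-of-Bag Error: An estimate of generalization error computed for each sample using only the trees whose bootstrap sample did not include it.
Ensemble: A model that combines the predictions of several individual models.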
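Two of these concepts can be read directly off a fitted scikit-learn forest. The sketch below assumes the same synthetic dataset used earlier. It enables `oob_score=True` so the forest records its out-of-bag accuracy in `oob_score_`, and reads the impurity-based scores from `feature_importances_`. Showing the top five features is an arbitrary choice for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)

# oob_score=True asks the forest to score each sample using only the trees
# that did not see it in their bootstrap sample
rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)
rf.fit(X, y)

# oob_score_ is an accuracy, so the out-of-bag error is its complement
oob_error = 1.0 - rf.oob_score_
print(f'Out-of-bag error: {oob_error:.2f}')

# Impurity-based importances, one per feature, normalized to sum to 1
importances = rf.feature_importances_
top = np.argsort(importances)[::-1][:5]  # indices of the 5 largest scores
for i in top:
    print(f'feature {i}: {importances[i]:.3f}')
```

The out-of-bag estimate needs no separate validation split, which makes it a convenient quick check while experimenting.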

Check Your Understanding

❓ What is the main purpose of Random Forests?

❓ Which of these is a key characteristic of Random Forests?
