Random Forests Basics
Duration: 5 min
This module provides an introduction to Random Forests, an ensemble learning method that combines multiple decision trees to improve predictive accuracy and control over-fitting. Understanding Random Forests is crucial for leveraging ensemble techniques in machine learning projects.
Understanding Ensemble Learning
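Ensemble learning combines multiple models to produce better predictive performance than a single model typically achieves. Random Forests build on this idea by creating a 'forest' of decision trees, each trained on a bootstrap sample of the training data (rows drawn with replacement) and limited to a random subset of features at each split. The final prediction aggregates the predictions of the individual trees, by voting or averaging class probabilities for classification and by averaging for regression, which reduces variance and improves overall model robustness.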
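To make the aggregation step concrete, here is a minimal sketch that averages the class probabilities of the individual trees and compares the result with the forest's own output. It assumes scikit-learn's RandomForestClassifier, which exposes its fitted trees through the estimators_ attribute; the dataset sizes and the names forest, per_tree_proba, and mean_proba are illustrative choices, not part of the module.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic dataset (sizes chosen only for illustration)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

forest = RandomForestClassifier(n_estimators=25, random_state=0)
forest.fit(X, y)

# Collect the class probabilities predicted by every individual tree
per_tree_proba = np.stack([tree.predict_proba(X) for tree in forest.estimators_])

# Averaging across trees should reproduce the forest's predicted probabilities
mean_proba = per_tree_proba.mean(axis=0)
matches_forest = np.allclose(mean_proba, forest.predict_proba(X))

print(f'Trees in forest: {len(forest.estimators_)}')
print(f'Mean of tree probabilities matches forest: {matches_forest}')
```

In scikit-learn the classifier averages probabilities rather than counting hard votes, so the predicted class is the one with the highest mean probability across trees.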
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
# Generate a random dataset
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize the Random Forest classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
# Train the model
rf_classifier.fit(X_train, y_train)
# Make predictions
y_pred = rf_classifier.predict(X_test)
# Print the accuracy
accuracy = np.mean(y_pred == y_test)
print(f'Accuracy: {accuracy:.2f}')
Accuracy: 0.85
Key Parameters in Random Forests
Several key parameters influence the performance of Random Forests, including n_estimators, which defines the number of trees in the forest, and max_features, which specifies the number of features to consider when looking for the best split. Tuning these parameters can significantly impact the model's accuracy and generalization capability.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
# Generate a random dataset
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize the Random Forest classifier with specific parameters
rf_classifier = RandomForestClassifier(n_estimators=200, max_features='sqrt', random_state=42)
# Train the model
rf_classifier.fit(X_train, y_train)
# Make predictions
y_pred = rf_classifier.predict(X_test)
# Print the accuracy
accuracy = np.mean(y_pred == y_test)
print(f'Accuracy: {accuracy:.2f}')
💡 Tip: Always perform hyperparameter tuning using techniques like Grid Search or Random Search to find the optimal parameters for your Random Forest model.
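One way to apply the tip above is a small grid search over n_estimators and max_features. This is a sketch rather than a recommended configuration: it uses scikit-learn's GridSearchCV, and the grid values, the 3-fold cross-validation, and the reduced dataset size are assumptions chosen to keep the example quick to run.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Generate a random dataset (smaller than above to keep the search fast)
X, y = make_classification(n_samples=500, n_features=20, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Illustrative grid; real projects usually search a wider range
param_grid = {
    'n_estimators': [50, 100],
    'max_features': ['sqrt', 'log2'],
}

# Evaluate every parameter combination with 3-fold cross-validation
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,
    scoring='accuracy',
)
grid_search.fit(X_train, y_train)

print(f'Best parameters: {grid_search.best_params_}')
print(f'Best cross-validated accuracy: {grid_search.best_score_:.2f}')
# By default the search refits the best model on all training data
print(f'Test accuracy: {grid_search.score(X_test, y_test):.2f}')
```

For larger grids, RandomizedSearchCV from the same module samples a fixed number of combinations instead of trying them all.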
❓ What is the primary advantage of using Random Forests over a single decision tree?
❓ Which parameter in Random Forests controls the number of features considered for splitting a node?
Key Concepts
| Concept | Description |
|---|---|
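| Bootstrap Aggregating | Also called bagging: each tree is trained on a random sample of the training data drawn with replacement, and the trees' predictions are combined |
| Feature Importance | A score for how much each feature contributes to the forest's predictions, commonly based on the impurity reduction from splits on that feature |
| Out-of-Bag Error | An error estimate computed for each sample using only the trees whose bootstrap sample did not include it, so no separate validation set is needed |
| Ensemble | A group of models whose predictions are combined into a single prediction |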
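Two of the concepts above can be read directly from a fitted model. The sketch below assumes scikit-learn's RandomForestClassifier with oob_score=True, which stores the out-of-bag accuracy in oob_score_ and impurity-based importances in feature_importances_; the variable names and the choice to show the top five features are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Generate a random dataset
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)

# oob_score=True relies on bootstrap sampling, which is enabled by default
rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)
rf.fit(X, y)

# oob_score_ is an accuracy, so the out-of-bag error is its complement
oob_error = 1.0 - rf.oob_score_
print(f'Out-of-bag error: {oob_error:.2f}')

# Impurity-based importances are normalized to sum to 1
top_features = np.argsort(rf.feature_importances_)[::-1][:5]
for idx in top_features:
    print(f'Feature {idx}: importance {rf.feature_importances_[idx]:.3f}')
```

Impurity-based importances can favor features with many distinct values, so treat the ranking as a first look rather than a final answer.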
Check Your Understanding
❓ What is the main purpose of Random Forests?
❓ Which of these is a key characteristic of Random Forests?