Decision Trees
Duration: 5 min
This module covers Decision Trees, a fundamental machine learning algorithm used for both classification and regression tasks. Decision Trees are popular because they are easy to interpret and visualize, which makes them suitable for a wide range of applications. Knowing how to train them and control their complexity is a key step toward building predictive models that generalize well.
Understanding Decision Trees
Decision Trees are hierarchical models that split data into subsets based on feature values. Each internal node represents a decision based on a feature, each branch represents the outcome of the decision, and each leaf node represents a class label (for classification) or a continuous value (for regression). The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.
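To make the node, branch, and leaf structure described above concrete, here is a minimal sketch that fits a deliberately shallow tree on the Iris dataset and prints its learned rules as text. The choice of max_depth=2 is an assumption made only to keep the printout short; it is not part of the module's own example.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree (illustrative choice) keeps the printed rules easy to read.
clf = DecisionTreeClassifier(max_depth=2, random_state=42)
clf.fit(iris.data, iris.target)

# export_text renders each internal node as a threshold test on one feature
# and each leaf as the predicted class index.
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

Each indented line in the printout is one branch of a decision, and each line starting with "class:" is a leaf.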
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Create a Decision Tree classifier
clf = DecisionTreeClassifier(random_state=42)
# Train the classifier
clf.fit(X_train, y_train)
# Make predictions
y_pred = clf.predict(X_test)
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')
# Output: Accuracy: 1.00
Pruning Decision Trees
Pruning reduces the complexity of a Decision Tree by removing, or never growing, sections of the tree that contribute little to classifying instances. This helps prevent overfitting and improves the model's ability to generalize. In scikit-learn, growth can be limited during construction with parameters such as max_depth, min_samples_split, and min_samples_leaf (often called pre-pruning). Branches can also be removed after training through cost-complexity pruning, controlled by the ccp_alpha parameter.
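As a rough illustration of how these parameters shape a tree, the sketch below trains one tree per setting on the same Iris split and reports its depth, leaf count, and test accuracy. The specific values (3, 20, 10) are arbitrary assumptions chosen for demonstration, not recommended defaults.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Illustrative settings only; suitable values depend on the dataset.
settings = {
    "unconstrained": {},
    "max_depth=3": {"max_depth": 3},
    "min_samples_split=20": {"min_samples_split": 20},
    "min_samples_leaf=10": {"min_samples_leaf": 10},
}

results = {}
for name, params in settings.items():
    tree = DecisionTreeClassifier(random_state=42, **params)
    tree.fit(X_train, y_train)
    # Record tree size alongside held-out accuracy.
    results[name] = (
        tree.get_depth(),
        tree.get_n_leaves(),
        tree.score(X_test, y_test),
    )
    depth, leaves, acc = results[name]
    print(f"{name}: depth={depth}, leaves={leaves}, test accuracy={acc:.2f}")
```

Comparing the rows shows the trade-off pruning controls: a smaller tree is simpler to read and less prone to memorizing the training set, at the possible cost of some accuracy.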
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Create a Decision Tree classifier with pruning
clf = DecisionTreeClassifier(random_state=42, max_depth=3)
# Train the classifier
clf.fit(X_train, y_train)
# Make predictions
y_pred = clf.predict(X_test)
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')
💡 Tip: When using Decision Trees, be cautious of overfitting, especially with deep trees. Always consider pruning or using ensemble methods like Random Forests to improve model performance and generalization.
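The ensemble suggestion in the tip can be explored with a quick comparison. This is a minimal sketch, assuming 5-fold cross-validation and 100 trees in the forest (both illustrative choices). On a dataset as small and clean as Iris, the two models may score about the same.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

models = {
    "single tree": DecisionTreeClassifier(random_state=42),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

mean_scores = {}
for name, model in models.items():
    # Cross-validation gives a steadier estimate than a single train/test split.
    scores = cross_val_score(model, X, y, cv=5)
    mean_scores[name] = scores.mean()
    print(f"{name}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})")
```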
❓ What is the primary goal of a Decision Tree in machine learning?
❓ Which parameter is used to limit the depth of a Decision Tree to prevent overfitting?
Key Concepts
| Concept | Description |
|---|---|
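| Entropy | Impurity measure for a node: zero when all samples belong to one class, highest when the classes are evenly mixed |
| Information Gain | Reduction in entropy achieved by a split; splits with higher gain are preferred |
| Gini Index | Alternative impurity measure: the chance of mislabeling a randomly drawn sample if it were labeled according to the node's class distribution |
| Pruning | Limiting tree growth or removing low-value branches to reduce overfitting |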
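The impurity measures in the table can be computed by hand for a small example. The following is a minimal sketch using NumPy; the function names and the toy label arrays are hypothetical, introduced only for illustration.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of the class distribution in labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# Toy node with four samples of each class (hypothetical data).
parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# A weak split leaves both children mixed.
weak_gain = information_gain(parent, np.array([0, 0, 0, 1]), np.array([0, 1, 1, 1]))

# A perfect split separates the classes completely.
perfect_gain = information_gain(parent, np.array([0, 0, 0, 0]), np.array([1, 1, 1, 1]))

print(f"Parent entropy: {entropy(parent):.3f}")
print(f"Parent Gini:    {gini(parent):.3f}")
print(f"Weak split gain:    {weak_gain:.3f}")
print(f"Perfect split gain: {perfect_gain:.3f}")
```

A tree-building algorithm evaluates candidate splits in this way and keeps the one that reduces impurity the most.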
Check Your Understanding
❓ How does a Decision Tree handle edge cases?
❓ What is the computational complexity of training a Decision Tree?
❓ Which hyperparameter is most critical for a Decision Tree?