Decision Trees

Duration: 5 min

This module covers Decision Trees, a fundamental machine learning algorithm used for both classification and regression. Decision Trees are easy to interpret and visualize, which makes them a common choice across many applications. Knowing how to implement and tune them is an important step toward building predictive models that generalize well.

Understanding Decision Trees

Decision Trees are hierarchical models that split data into subsets based on feature values. Each internal node represents a decision based on a feature, each branch represents the outcome of the decision, and each leaf node represents a class label (for classification) or a continuous value (for regression). The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create a Decision Tree classifier
clf = DecisionTreeClassifier(random_state=42)

# Train the classifier
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')

Output:

Accuracy: 1.00
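
The "simple decision rules" described above can be inspected directly. The following is a minimal sketch, assuming scikit-learn's export_text helper is available; the max_depth=2 setting is an illustrative choice made here only to keep the printed rule set short, and is not part of the original example.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the printed rules short (max_depth=2 is an assumption
# made for readability, not a recommended setting)
clf = DecisionTreeClassifier(max_depth=2, random_state=42)
clf.fit(iris.data, iris.target)

# Each indented line is an internal-node test; "class:" lines are leaves
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

Reading the output from top to bottom follows one path from the root to a leaf: every "<=" or ">" line is a decision on a feature, and the final line of each path gives the predicted class.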

Pruning Decision Trees

Pruning reduces the complexity of a Decision Tree by removing, or never growing, branches that contribute little to its predictive power. This helps prevent overfitting and improves the model's ability to generalize. In scikit-learn, tree growth can be limited up front (pre-pruning) with parameters such as max_depth, min_samples_split, and min_samples_leaf; a fully grown tree can also be cut back afterwards (post-pruning) with the ccp_alpha parameter.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create a Decision Tree classifier with pruning
clf = DecisionTreeClassifier(random_state=42, max_depth=3)

# Train the classifier
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')
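
Besides limiting depth during construction, scikit-learn also supports cost-complexity pruning through the ccp_alpha parameter. The sketch below is one possible way to explore it, assuming the same Iris split as above: it asks an unpruned tree for its candidate alpha values and refits one tree per value. Scoring on the test set is done here only for illustration; in practice the alpha would normally be chosen with cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Candidate alpha values derived from the training data
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(
    X_train, y_train
)

results = []
for alpha in path.ccp_alphas:
    # Guard against tiny negative values caused by floating-point error
    alpha = max(float(alpha), 0.0)
    tree = DecisionTreeClassifier(random_state=42, ccp_alpha=alpha)
    tree.fit(X_train, y_train)
    results.append((alpha, tree.tree_.node_count, tree.score(X_test, y_test)))

# Larger alpha means stronger pruning and therefore fewer nodes
for alpha, n_nodes, acc in results:
    print(f"alpha={alpha:.4f}  nodes={n_nodes}  test accuracy={acc:.2f}")
```

The node count shrinks as alpha grows, which makes the trade-off between tree size and accuracy visible in a single table.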

💡 Tip: When using Decision Trees, be cautious of overfitting, especially with deep trees. Always consider pruning or using ensemble methods like Random Forests to improve model performance and generalization.
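
To make the tip concrete, here is a rough comparison of an unpruned tree, a depth-limited tree, and a Random Forest using 5-fold cross-validation. The model settings (max_depth=3, n_estimators=100, cv=5) are assumptions chosen for illustration, and on a dataset as small and clean as Iris the differences may be negligible.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Illustrative settings; none of these are tuned
models = {
    "single tree": DecisionTreeClassifier(random_state=42),
    "pruned tree (max_depth=3)": DecisionTreeClassifier(max_depth=3, random_state=42),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

mean_scores = {}
for name, model in models.items():
    # cross_val_score clones and refits the model on each of the 5 folds
    scores = cross_val_score(model, X, y, cv=5)
    mean_scores[name] = scores.mean()
    print(f"{name}: mean CV accuracy = {scores.mean():.3f}")
```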

❓ What is the primary goal of a Decision Tree in machine learning?

To cluster data points
To predict the value of a target variable by learning decision rules
To reduce dimensionality
To perform feature selection

❓ Which parameter is used to limit the depth of a Decision Tree to prevent overfitting?

min_samples_split
max_features
max_depth
criterion

Key Concepts

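Concept	Description
Entropy	Impurity of the class labels at a node, H = -Σ p_k log2(p_k); 0 for a pure node and largest when all classes are equally likely
Information Gain	Reduction in entropy produced by a split: the parent's entropy minus the sample-weighted average entropy of its children
Gini Index	Impurity measure G = 1 - Σ p_k²; the default split criterion of scikit-learn's DecisionTreeClassifier
Pruning	Limiting or cutting back tree growth (for example with max_depth, min_samples_split, or min_samples_leaf) to reduce overfitting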
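
The impurity measures in the table can be computed by hand for a small example. The sketch below uses the standard textbook definitions of entropy, Gini index, and information gain; the function names and the toy label arrays are hypothetical and exist only for this illustration.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def gini(labels):
    """Gini index of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

def information_gain(parent, left, right):
    """Parent entropy minus the sample-weighted entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# Toy node with 4 samples of each class, split into two children of 4 samples
parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])
left = np.array([0, 0, 0, 1])
right = np.array([0, 1, 1, 1])

print(f"Entropy of parent: {entropy(parent):.3f}")   # 1.000
print(f"Gini of parent:    {gini(parent):.3f}")      # 0.500
print(f"Information gain:  {information_gain(parent, left, right):.3f}")  # 0.189
```

A tree-building algorithm evaluates candidate splits in this way and keeps the one with the largest impurity reduction.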

Check Your Understanding

❓ How does a Decision Tree handle edge cases?

Ignores them
Applies regularization
Removes them
Duplicates them

❓ What is the computational complexity of a Decision Tree?

O(n)
O(n²)
O(log n)
Depends on implementation

❓ Which hyperparameter is most critical for a Decision Tree?

Learning rate
Batch size
Epochs
All equally important
