Module 7 of 26 · Scikit-Learn Machine Learning · Beginner

Decision Trees

Duration: 5 min

This module covers Decision Trees, a fundamental machine learning algorithm used for both classification and regression. Decision Trees are popular because they are easy to interpret and visualize, which suits them to a wide range of applications. Knowing how to train them and control their complexity is key to building predictive models that generalize well.

Understanding Decision Trees

Decision Trees are hierarchical models that split data into subsets based on feature values. Each internal node tests a single feature, each branch corresponds to one outcome of that test, and each leaf node holds a class label (for classification) or a continuous value (for regression). The model predicts the target variable by applying simple decision rules learned from the features of the training data.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create a Decision Tree classifier
clf = DecisionTreeClassifier(random_state=42)

# Train the classifier
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')


Accuracy: 1.00
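
To see the kind of decision rules a tree like the one above learns, the fitted model can be printed as text. The following is a minimal sketch, not part of the original listing: it assumes scikit-learn's export_text helper, fits on the full Iris dataset, and uses max_depth=2 purely to keep the printout short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the printed rules short (max_depth=2 is an illustrative choice)
tree = DecisionTreeClassifier(max_depth=2, random_state=42)
tree.fit(iris.data, iris.target)

# Each internal node is printed as a threshold test on one feature,
# and each leaf is printed as the class it predicts
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

Each indented line in the printout corresponds to one branch, so reading from the top to a leaf gives the full rule used for a prediction.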

Pruning Decision Trees

Pruning reduces the complexity of a Decision Tree by removing sections that contribute little to classifying instances. This helps prevent overfitting and improves the model's ability to generalize. In scikit-learn, parameters such as max_depth, min_samples_split, and min_samples_leaf limit growth while the tree is being built (often called pre-pruning), whereas ccp_alpha removes branches after the tree has been grown, using cost-complexity pruning.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create a Decision Tree classifier whose depth is limited to 3 levels
clf = DecisionTreeClassifier(random_state=42, max_depth=3)

# Train the classifier
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')
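
To make the effect of these settings visible, the sketch below compares an unrestricted tree, a depth-limited tree, and a tree pruned after training with ccp_alpha. It is an illustration under assumptions: the alpha value is simply taken from the middle of the pruning path rather than tuned, and the variable names are hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Unrestricted tree: grows until the leaves are pure
full = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Pre-pruning: stop growing once the tree is 3 levels deep
limited = DecisionTreeClassifier(random_state=42, max_depth=3).fit(X_train, y_train)

# Post-pruning: compute candidate alphas, then pick one (middle value, chosen arbitrarily)
path = full.cost_complexity_pruning_path(X_train, y_train)
alpha = max(float(path.ccp_alphas[len(path.ccp_alphas) // 2]), 0.0)
pruned = DecisionTreeClassifier(random_state=42, ccp_alpha=alpha).fit(X_train, y_train)

for name, model in [("full", full), ("max_depth=3", limited), ("ccp_alpha", pruned)]:
    print(
        f"{name:12s} depth={model.get_depth()} "
        f"leaves={model.get_n_leaves()} "
        f"test accuracy={model.score(X_test, y_test):.2f}"
    )
```

In practice the value of ccp_alpha or max_depth would be chosen with cross-validation rather than picked by hand.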

💡 Tip: When using Decision Trees, be cautious of overfitting, especially with deep trees. Always consider pruning or using ensemble methods like Random Forests to improve model performance and generalization.
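
As a rough illustration of the tip, the sketch below scores a single tree and a Random Forest with 5-fold cross-validation on Iris. The choice of 100 trees and 5 folds is an assumption made for this example, and on a dataset this small the two models may score almost the same.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

models = {
    "single tree": DecisionTreeClassifier(random_state=42),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Mean accuracy over 5 folds gives a steadier estimate than one train/test split
scores = {}
for name, model in models.items():
    scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: mean CV accuracy = {scores[name]:.3f}")
```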

❓ What is the primary goal of a Decision Tree in machine learning?

❓ Which parameter is used to limit the depth of a Decision Tree to prevent overfitting?

Key Concepts

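Concept            Description
Entropy            Impurity of a node: 0 when all samples share one class, highest when the classes are evenly mixed
Information Gain   Reduction in entropy from a split: parent entropy minus the weighted average entropy of the child nodes
Gini Index         Impurity measured as 1 minus the sum of squared class proportions; the default criterion of DecisionTreeClassifier
Pruning            Reducing the size of a tree, for example with max_depth, to limit overfitting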
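
The impurity measures in the table can be computed by hand. The following is a small sketch using NumPy with made-up labels for a two-class node; the function names are hypothetical helpers, not scikit-learn APIs.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of the class labels in a node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - weighted

# Toy node with four samples of each class (made-up data)
parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# An imperfect split and a perfect split of the same node
gain_imperfect = information_gain(parent, np.array([0, 0, 0, 1]), np.array([0, 1, 1, 1]))
gain_perfect = information_gain(parent, np.array([0, 0, 0, 0]), np.array([1, 1, 1, 1]))

print(f"entropy(parent) = {entropy(parent):.3f}")
print(f"gini(parent)    = {gini(parent):.3f}")
print(f"gain, imperfect split = {gain_imperfect:.3f}")
print(f"gain, perfect split   = {gain_perfect:.3f}")
```

The evenly mixed parent has an entropy of 1 bit and a Gini index of 0.5. The imperfect split gains about 0.19 bits, while the perfect split gains the full 1 bit, which is why a tree would prefer it.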

Check Your Understanding

❓ How does a Decision Tree handle edge cases?

❓ What is the computational complexity of training a Decision Tree?

❓ Which hyperparameter is most critical for a Decision Tree?
