Decision Trees Basics

Duration: 5 min

This module introduces the fundamentals of Decision Trees, a powerful supervised learning technique used for both classification and regression tasks. Decision Trees are intuitive, easy to interpret, and can handle both numerical and categorical data. Understanding Decision Trees is crucial as they form the basis for more complex ensemble methods like Random Forests and Gradient Boosting.

Understanding Decision Trees

A Decision Tree is a flowchart-like structure in which each internal node tests a feature (or attribute), each branch represents one outcome of that test (a decision rule), and each leaf node represents a prediction: a class label for classification or a numeric value for regression. The topmost node in a tree is called the root node. Decision Trees are built by a recursive algorithm that repeatedly splits the dataset on the condition that best separates the target values, typically measured by Gini impurity or information gain.

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df.drop('target', axis=1), df['target'], test_size=0.3, random_state=42)

# Create Decision Tree classifier object
clf = DecisionTreeClassifier()

# Train Decision Tree Classifier
clf.fit(X_train, y_train)

# Predict the response for test dataset
y_pred = clf.predict(X_test)

# Model Accuracy
print("Accuracy:", accuracy_score(y_test, y_pred))

Output:

Accuracy: 1.0
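
The node, branch, and leaf vocabulary introduced in this section can be made concrete by printing a fitted tree as text. The following is a minimal sketch using scikit-learn's export_text helper; max_depth=2 and random_state=0 are illustrative choices that do not come from the lesson, chosen only to keep the printed rules short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# Assumption: a shallow tree (max_depth=2) so the printed rules stay readable
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Each indented "|---" line is a branch (a threshold test on one feature);
# lines containing "class:" are leaf nodes with the predicted label
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)

print("Depth:", tree.get_depth())
print("Leaves:", tree.get_n_leaves())
```

Reading the output from top to bottom follows the same path a sample takes from the root node to a leaf.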

Decision Tree Parameters and Overfitting

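Decision Trees have several hyperparameters that can be tuned to improve performance and avoid overfitting. Key parameters include max_depth (the maximum number of levels in the tree), min_samples_split (the minimum number of samples a node must contain before it can be split), and min_samples_leaf (the minimum number of samples required in each leaf). Overfitting occurs when the tree becomes too complex and captures noise in the training data rather than the underlying pattern. Constraining tree growth with these parameters (pre-pruning) and removing weak branches after training (post-pruning) both combat overfitting.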

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df.drop('target', axis=1), df['target'], test_size=0.3, random_state=42)

# Create Decision Tree classifier object with parameters to avoid overfitting
clf = DecisionTreeClassifier(max_depth=3, min_samples_split=10, min_samples_leaf=5)

# Train Decision Tree Classifier
clf.fit(X_train, y_train)

# Predict the response for test dataset
y_pred = clf.predict(X_test)

# Model Accuracy
print("Accuracy:", accuracy_score(y_test, y_pred))
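
The example above constrains the tree while it grows. Pruning after training is the other option mentioned in this section. The sketch below assumes scikit-learn's cost-complexity pruning (the ccp_alpha parameter and the cost_complexity_pruning_path method); the variable names and random_state=0 are illustrative and not part of the lesson.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Candidate alpha values derived from a fully grown tree on the training data
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_train, y_train
)

leaf_counts = []
for alpha in path.ccp_alphas:
    # Guard against tiny negative values caused by floating-point rounding
    alpha = max(float(alpha), 0.0)
    # A larger ccp_alpha removes more branches from the tree
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    pruned.fit(X_train, y_train)
    leaf_counts.append(pruned.get_n_leaves())
    print(f"alpha={alpha:.4f}  leaves={pruned.get_n_leaves()}  "
          f"test accuracy={pruned.score(X_test, y_test):.3f}")
```

In practice the alpha value would be chosen with cross-validation on the training data rather than by reading test accuracy, which is printed here only to show the trade-off.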

💡 Tip: Always evaluate your Decision Tree on a separate test set to check that it generalizes to unseen data. A large gap between training accuracy and test accuracy signals overfitting; reduce it by lowering max_depth or raising min_samples_split and min_samples_leaf.
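
One way to act on this tip is to compare training, test, and cross-validated accuracy at several depths. The following is a rough sketch rather than a prescribed workflow; the depth values, cv=5, and random_state=0 are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

results = {}
for depth in (1, 2, 3, None):  # None lets the tree grow until every leaf is pure
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    clf.fit(X_train, y_train)
    train_acc = clf.score(X_train, y_train)
    test_acc = clf.score(X_test, y_test)
    # 5-fold cross-validation on the training data only
    cv_acc = cross_val_score(clf, X_train, y_train, cv=5).mean()
    results[depth] = (train_acc, test_acc, cv_acc)
    print(f"max_depth={depth}: train={train_acc:.3f} "
          f"test={test_acc:.3f} cv={cv_acc:.3f}")
```

Training accuracy that keeps rising while the cross-validated score stalls or drops suggests the extra depth is fitting noise.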

❓ What is the primary purpose of a Decision Tree in machine learning?

To perform unsupervised clustering
To make predictions based on a series of decisions
To reduce dimensionality
To perform time-series forecasting

❓ Which parameter in a Decision Tree helps prevent overfitting by limiting the depth of the tree?

min_samples_split
max_features
max_depth
min_samples_leaf

Key Concepts

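Concept	Description
Entropy	Measure of class impurity at a node: H = -Σ p_i log2(p_i), where p_i is the fraction of samples in class i. It is 0 when the node contains a single class.
Information Gain	Reduction in entropy produced by a split: the parent's entropy minus the size-weighted average entropy of the child nodes. The split with the highest gain is preferred.
Gini Index	Alternative impurity measure: G = 1 - Σ p_i^2. It is 0 for a pure node and is the default splitting criterion of scikit-learn's DecisionTreeClassifier.
Pruning	Removing or preventing branches that add little predictive value, either by limiting growth (for example with max_depth) or by cutting back a fully grown tree, in order to reduce overfitting.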
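
The impurity measures in the table can be computed by hand for a small example. The sketch below uses only the Python standard library; the function names and the toy "yes"/"no" labels are hypothetical and are not taken from the lesson.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# Toy node: 5 "yes" and 5 "no", split into two mostly pure children
parent = ["yes"] * 5 + ["no"] * 5
children = [["yes"] * 4 + ["no"], ["yes"] + ["no"] * 4]

print(f"Entropy of parent: {entropy(parent):.3f}")                       # 1.000
print(f"Gini of parent: {gini(parent):.3f}")                             # 0.500
print(f"Information gain: {information_gain(parent, children):.3f}")     # 0.278
```

A perfectly balanced two-class node has the maximum entropy of 1 bit, and the split shown recovers about 0.278 bits of it.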

Check Your Understanding

❓ What is the main purpose of a Decision Tree?

To classify data
To predict values
To understand patterns
To reduce dimensions

❓ Which of these is a key characteristic of Decision Trees?

Supervised
Unsupervised
Semi-supervised
Reinforcement
