Decision Trees Basics
Duration: 5 min
This module introduces the fundamentals of Decision Trees, a powerful supervised learning technique used for both classification and regression tasks. Decision Trees are intuitive, easy to interpret, and can handle both numerical and categorical data. Understanding Decision Trees is crucial as they form the basis for more complex ensemble methods like Random Forests and Gradient Boosting.
Understanding Decision Trees
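A Decision Tree is a flowchart-like structure in which each internal node tests a feature (or attribute), each branch corresponds to an outcome of that test (a decision rule), and each leaf node holds a prediction: a class label for classification or a numeric value for regression. The topmost node is called the root node. A tree is built by recursively splitting the dataset: at each node, the algorithm chooses the feature and threshold that best separate the target values, then repeats the process on each resulting subset until a stopping condition is met.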
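To make the node, branch, and leaf terminology concrete, the following minimal sketch fits a deliberately shallow tree on the Iris dataset and prints its rules as text. The `max_depth=2` and `random_state=0` settings are illustrative assumptions chosen to keep the output short and repeatable; they are not part of the module's own example.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the printed rules short (max_depth=2 is an illustrative choice)
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Each "|---" line with a comparison is a decision rule on a feature;
# each "class:" line is a leaf holding the predicted class
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)

# The first rule printed belongs to the root node
print("Depth:", tree.get_depth())
print("Leaves:", tree.get_n_leaves())
```

Reading the printed rules from the top down follows the same path a sample takes from the root node to a leaf.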
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
# Load dataset
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target
# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df.drop('target', axis=1), df['target'], test_size=0.3, random_state=42)
# Create Decision Tree classifier object
clf = DecisionTreeClassifier()
# Train Decision Tree Classifier
clf = clf.fit(X_train, y_train)
# Predict the response for test dataset
y_pred = clf.predict(X_test)
# Model Accuracy
print("Accuracy:", accuracy_score(y_test, y_pred))
Accuracy: 1.0
Decision Tree Parameters and Overfitting
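Decision Trees have several parameters that can be tuned to improve performance and avoid overfitting. Key parameters include max_depth (the maximum depth of the tree), min_samples_split (the minimum number of samples a node needs before it can be split), and min_samples_leaf (the minimum number of samples each leaf must contain). Overfitting occurs when the tree becomes too complex and captures noise in the training data rather than the underlying pattern. Constraining these parameters and pruning the tree are the usual ways to combat overfitting.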
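As one possible illustration of pruning, the sketch below uses scikit-learn's cost-complexity pruning (the `ccp_alpha` parameter). Picking the middle value from the pruning path is an arbitrary assumption made for demonstration; in practice the value would normally be selected with cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fully grown tree, used as the reference point
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Candidate alpha values at which successive subtrees get pruned away
path = full_tree.cost_complexity_pruning_path(X_train, y_train)
alphas = path.ccp_alphas

# Arbitrary illustrative choice: the middle alpha (clamped to be non-negative)
alpha = max(float(alphas[len(alphas) // 2]), 0.0)
pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
pruned_tree.fit(X_train, y_train)

print("Nodes before pruning:", full_tree.tree_.node_count)
print("Nodes after pruning:", pruned_tree.tree_.node_count)
print("Test accuracy (full):", full_tree.score(X_test, y_test))
print("Test accuracy (pruned):", pruned_tree.score(X_test, y_test))
```

Larger `ccp_alpha` values remove more of the tree, so the node count shrinks as alpha grows.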
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
# Load dataset
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target
# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df.drop('target', axis=1), df['target'], test_size=0.3, random_state=42)
# Create Decision Tree classifier object with parameters to avoid overfitting
clf = DecisionTreeClassifier(max_depth=3, min_samples_split=10, min_samples_leaf=5)
# Train Decision Tree Classifier
clf = clf.fit(X_train, y_train)
# Predict the response for test dataset
y_pred = clf.predict(X_test)
# Model Accuracy
print("Accuracy:", accuracy_score(y_test, y_pred))
💡 Tip: Always validate your Decision Tree model using a separate test set to ensure it generalizes well to unseen data. Regularly monitor for overfitting by adjusting parameters like max_depth and min_samples_split.
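One way to act on this tip is to compare training and test accuracy for several values of max_depth. The sketch below does this on the Iris dataset; the specific depth values and `random_state=0` are assumptions chosen for illustration. A training score far above the test score is a common sign of overfitting, although on a dataset as small and clean as Iris the gap may be slight.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Map each max_depth setting to (training accuracy, test accuracy)
results = {}
for depth in (1, 2, 3, None):  # None lets the tree grow without a depth limit
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    results[depth] = (model.score(X_train, y_train), model.score(X_test, y_test))

for depth, (train_acc, test_acc) in results.items():
    print(f"max_depth={depth}: train={train_acc:.3f}, test={test_acc:.3f}")
```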
❓ What is the primary purpose of a Decision Tree in machine learning?
❓ Which parameter in a Decision Tree helps prevent overfitting by limiting the depth of the tree?
Key Concepts
| Concept | Description |
|---|---|
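| Entropy | Measure of class impurity in a node; it is 0 when all samples in the node belong to one class |
| Information Gain | Reduction in entropy achieved by a split; a larger gain indicates a better split |
| Gini Index | Alternative impurity measure: the probability of mislabeling a randomly chosen sample if it were labeled according to the node's class distribution |
| Pruning | Removing branches that add little predictive value in order to reduce overfitting |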
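The impurity measures in the table can be worked through by hand on a tiny example. The sketch below uses only the Python standard library; the helper names (`entropy`, `gini`, `information_gain`) and the toy labels are hypothetical, introduced purely for illustration.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# Toy node with 5 "yes" and 5 "no" samples, split into two children
parent = ["yes"] * 5 + ["no"] * 5
left = ["yes"] * 4 + ["no"] * 1
right = ["yes"] * 1 + ["no"] * 4

print(f"Parent entropy:   {entropy(parent):.3f}")   # 1.000
print(f"Parent Gini:      {gini(parent):.3f}")      # 0.500
print(f"Information gain: {information_gain(parent, left, right):.3f}")  # 0.278
```

The split moves most "yes" samples to one side and most "no" samples to the other, so each child is purer than the parent and the information gain is positive.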
Check Your Understanding
❓ What is the main purpose of a Decision Tree?
❓ Which of these is a key characteristic of Decision Trees?