Module 8 of 25 · Ensemble Learning — Bagging, Boosting, XGBoost, LightGBM, CatBoost, Stacking, Voting · Intermediate

XGBoost: Getting Started

Duration: 7 min

This module introduces XGBoost, an optimized implementation of gradient-boosted decision trees. You will learn how to install it and how to use it for regression and classification, through both its native API and its scikit-learn compatible wrapper.

Installation and Basic Usage

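XGBoost can be installed with pip (pip install xgboost). Once installed, import it into your Python environment, conventionally as xgb. XGBoost has two Python interfaces: a native API built around DMatrix and xgb.train, and a scikit-learn compatible API (XGBRegressor, XGBClassifier) that fits into existing scikit-learn workflows. The example below uses the native API for a regression task.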

import xgboost as xgb
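from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load dataset (load_boston was removed in scikit-learn 1.2;
# load_diabetes is a bundled regression dataset used here instead)
data = load_diabetes()
X, y = data.data, data.target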

# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create DMatrix from numpy array
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

# Set parameters
params = {"objective": "reg:squarederror", "max_depth": 3}

# Train the model
num_round = 10
bst = xgb.train(params, dtrain, num_round)

# Make prediction
y_pred = bst.predict(dtest)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error: {mse}')

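The script prints one line of the following form; the exact value depends on the dataset and the installed library versions:

Mean Squared Error: <value>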

Advanced Features and Hyperparameter Tuning

XGBoost also handles missing values natively, provides built-in cross-validation (xgb.cv), and supports a range of objective functions. Hyperparameter tuning often has a large effect on model performance. Commonly tuned hyperparameters include learning_rate, max_depth, and n_estimators (the number of boosting rounds in the scikit-learn API). The example below tunes these three with GridSearchCV on a classification task.

import xgboost as xgb
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

# Load dataset
data = load_iris()
X, y = data.data, data.target

# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Set parameters for GridSearch
params = {'max_depth': [3, 4, 5], 'learning_rate': [0.01, 0.1, 0.2], 'n_estimators': [100, 200, 300]}
xgb_model = xgb.XGBClassifier(objective='multi:softprob', eval_metric='mlogloss')

# Perform GridSearch (3-fold cross-validation over 27 parameter combinations)
grid_search = GridSearchCV(estimator=xgb_model, param_grid=params, scoring='accuracy', cv=3, n_jobs=-1)
grid_search.fit(X_train, y_train)

# Get best parameters; GridSearchCV has already refit the best model
# on the full training set (refit=True is the default)
best_params = grid_search.best_params_
print(f'Best parameters: {best_params}')
bst = grid_search.best_estimator_

# Make prediction (the classifier returns class labels directly)
y_pred = bst.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy * 100:.2f}%')

💡 Tip: When tuning hyperparameters, start with a coarse grid to identify the best range, then perform a finer grid search within that range for optimal performance.

❓ What is the primary objective function used in the first code example?

❓ Which hyperparameter is NOT included in the GridSearch in the second code example?
