Module 17 of 26 · Scikit-Learn Machine Learning · Beginner

Pipelines in Scikit-Learn

Duration: 5 min

This module introduces pipelines in Scikit-Learn. A pipeline chains data processing steps and model training into a single object that you fit and call like any other estimator. This makes your code cleaner and more readable, and it reduces errors such as preprocessing the training and test data differently.

Creating a Simple Pipeline

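A pipeline in Scikit-Learn is a sequence of named steps. Every step except the last must be a transformer (such as a scaler or encoder); the last step is usually an estimator (such as a classifier or regressor). Calling fit fits and applies each transformer to the training data in order, then fits the final estimator on the transformed result. Calling predict passes new data through the already-fitted transformers before the final estimator makes its predictions. Because the transformers are fitted only on the training data, the same transformation is applied consistently to both the training and testing data.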

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a pipeline that scales the data and then applies logistic regression
pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('logistic', LogisticRegression())
])

# Fit the pipeline to the training data
pipeline.fit(X_train, y_train)

# Predict the labels of the test set
y_pred = pipeline.predict(X_test)

# Print the accuracy of the pipeline
print(f'Accuracy: {pipeline.score(X_test, y_test):.2f}')


Accuracy: 0.97
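To make the consistency point concrete, the following sketch performs the same two steps by hand and compares the result with the pipeline. It reuses the dataset and split from the example above; the variable names (manual_scaler, manual_model, same_predictions) are illustrative and not part of the original lesson.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Manual version: the scaler learns its mean and variance from the
# training data only, then the same fitted scaler transforms the test data.
manual_scaler = StandardScaler()
X_train_scaled = manual_scaler.fit_transform(X_train)
X_test_scaled = manual_scaler.transform(X_test)

manual_model = LogisticRegression()
manual_model.fit(X_train_scaled, y_train)
manual_pred = manual_model.predict(X_test_scaled)

# Pipeline version: fit() and predict() perform the same steps internally.
pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('logistic', LogisticRegression())
])
pipeline.fit(X_train, y_train)
pipeline_pred = pipeline.predict(X_test)

# The fitted scaler inside the pipeline is reachable by its step name.
same_scaler_mean = np.allclose(
    manual_scaler.mean_, pipeline.named_steps['scaler'].mean_
)
same_predictions = np.array_equal(manual_pred, pipeline_pred)
print(f'Same scaler mean: {same_scaler_mean}')
print(f'Same predictions: {same_predictions}')
```

The manual version needs four separate calls that must be kept in the right order, and it is easy to call fit_transform on the test set by mistake. The pipeline reduces this to one fit and one predict.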

Grid Search with Pipelines

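Grid search is a technique for hyperparameter tuning that systematically tries every combination of specified parameter values and scores each one with cross-validation. When used with pipelines, grid search can optimize the parameters of both the transformers and the final estimator, so you can find the best combination of preprocessing settings and model parameters. Because the whole pipeline is refitted inside each cross-validation fold, the preprocessing steps learn only from that fold's training portion.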

from sklearn.model_selection import GridSearchCV

# Define the parameter grid
param_grid = {
    'scaler__with_mean': [True, False],
    'logistic__C': [0.1, 1, 10]
}

# Create a grid search object with the pipeline and parameter grid
grid_search = GridSearchCV(pipeline, param_grid, cv=5)

# Fit the grid search to the training data (each candidate is cross-validated)
grid_search.fit(X_train, y_train)

# Print the best parameters and the corresponding score
print(f'Best parameters: {grid_search.best_params_}')
print(f'Best score: {grid_search.best_score_:.2f}')
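The sketch below shows how many candidates this grid produces and how the tuned pipeline might be checked on the held-out test set. It rebuilds the pipeline and grid from the example above so that it runs on its own; ParameterGrid is used here only to enumerate the combinations, and names such as candidates and test_accuracy are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, ParameterGrid, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('logistic', LogisticRegression())
])
param_grid = {
    'scaler__with_mean': [True, False],
    'logistic__C': [0.1, 1, 10]
}

# 2 values x 3 values = 6 candidate settings; with cv=5 each one is
# fitted on 5 different train/validation splits.
candidates = list(ParameterGrid(param_grid))
n_candidates = len(candidates)
print(f'Candidates: {n_candidates}')

grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(X_train, y_train)

# cv_results_ holds one entry per candidate.
n_rows = len(grid_search.cv_results_['params'])

# With the default refit=True, the best pipeline is refitted on all of
# X_train, so the search object can score unseen data directly.
test_accuracy = grid_search.score(X_test, y_test)
print(f'Test accuracy of best pipeline: {test_accuracy:.2f}')
```

Note that best_score_ is the mean cross-validation score on the training data, while the value printed here comes from data the search never saw, so the two numbers can differ.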

💡 Tip: When using pipelines with grid search, ensure that the parameter names in the grid are prefixed with the step name and two underscores (e.g., 'scaler__with_mean'). This helps Scikit-Learn identify which step each parameter belongs to.
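If you are unsure which names are valid, one way to check is to ask the pipeline itself. This minimal sketch lists the prefixed parameter names with get_params and changes one with set_params; the variable names and the deliberately misspelled parameter are illustrative.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('logistic', LogisticRegression())
])

# Names containing '__' are parameters of individual steps, written as
# <step name>__<parameter name>. These are the keys a grid may use.
step_params = sorted(name for name in pipeline.get_params() if '__' in name)
print(step_params)

# The same naming scheme works for setting a value directly.
pipeline.set_params(logistic__C=10)
current_c = pipeline.named_steps['logistic'].C
print(f'C is now: {current_c}')

# A name that does not exist on the step is rejected with a ValueError.
try:
    pipeline.set_params(logistic__not_a_param=1)
    rejected = False
except ValueError:
    rejected = True
print(f'Invalid name rejected: {rejected}')
```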

❓ What is the primary benefit of using pipelines in Scikit-Learn?

❓ Which method is used to find the best parameters in a pipeline using grid search?

Key Concepts

Concept Description
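Estimators: Objects that learn from data through fit, such as StandardScaler (a transformer) and LogisticRegression (a classifier).
Pipelines: Objects that chain transformers and a final estimator so the whole sequence is fitted and applied as one unit.
Cross-validation: Repeated splitting of the training data into folds (cv=5 above) to score each parameter combination.
Metrics: Scores used to judge a model, such as the accuracy returned by score for a classifier.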

Check Your Understanding

❓ How do pipelines handle edge cases?

❓ What is the computational complexity of fitting a pipeline?

❓ Which hyperparameter is most critical when tuning a pipeline?
