Module 14 of 26 · Scikit-Learn Machine Learning · Beginner

Feature Engineering

Duration: 5 min

Feature engineering is a crucial step in the machine learning pipeline, involving the creation and transformation of input features to improve model performance. This module covers essential techniques for feature engineering, including scaling, encoding, and transformation methods, and demonstrates their implementation using Scikit-Learn.

Scaling Features

Scaling puts features on a comparable range so that a feature with large raw values does not dominate models that are sensitive to magnitude, such as distance-based or gradient-based methods. Common scaling techniques include standardization (z-score normalization) and min-max scaling. Standardization transforms each feature to have a mean of 0 and a standard deviation of 1, while min-max scaling rescales each feature to a fixed range, typically [0, 1].

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Sample data
data = np.array([[1, 2], [2, 3], [3, 4], [4, 5]])

# Standardization
standard_scaler = StandardScaler()
data_standardized = standard_scaler.fit_transform(data)
print('Standardized data:', data_standardized)

# Min-Max Scaling
min_max_scaler = MinMaxScaler()
data_minmax = min_max_scaler.fit_transform(data)
print('Min-Max scaled data:', data_minmax)


Standardized data: [[-1.34164079 -1.34164079]
 [-0.4472136  -0.4472136 ]
 [ 0.4472136   0.4472136 ]
 [ 1.34164079  1.34164079]]
Min-Max scaled data: [[0.         0.        ]
 [0.33333333 0.33333333]
 [0.66666667 0.66666667]
 [1.         1.        ]]
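To make the two formulas concrete, the sketch below recomputes both scalings by hand with NumPy and compares them to the Scikit-Learn results. The variable names are illustrative; the only assumption is that the same four-row sample array is used.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

data = np.array([[1, 2], [2, 3], [3, 4], [4, 5]], dtype=float)

# Standardization by hand: z = (x - mean) / std, computed per column.
# np.std defaults to the population standard deviation (ddof=0),
# which is also what StandardScaler uses.
manual_standardized = (data - data.mean(axis=0)) / data.std(axis=0)

# Min-max scaling by hand: (x - min) / (max - min), computed per column.
col_min = data.min(axis=0)
col_max = data.max(axis=0)
manual_minmax = (data - col_min) / (col_max - col_min)

# Compare the hand-computed values with the Scikit-Learn transformers.
matches_standard = np.allclose(manual_standardized, StandardScaler().fit_transform(data))
matches_minmax = np.allclose(manual_minmax, MinMaxScaler().fit_transform(data))
print(matches_standard, matches_minmax)  # True True
```

With four evenly spaced values per column, min-max scaling yields 0, 1/3, 2/3, and 1.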

Encoding Categorical Features

Most machine learning models require numerical input, so categorical features must be encoded as numbers first. Common encoding techniques include one-hot encoding and ordinal encoding. One-hot encoding creates one binary column per category, while ordinal encoding assigns a unique integer to each category. By default, Scikit-Learn orders the categories by their sorted (alphabetical) values, which is why 'bird' maps to 0, 'cat' to 1, and 'dog' to 2 in the example.

import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Sample data
categories = np.array([['cat'], ['dog'], ['cat'], ['bird']])

# One-Hot Encoding
one_hot_encoder = OneHotEncoder()
data_onehot = one_hot_encoder.fit_transform(categories).toarray()
print('One-Hot encoded data:', data_onehot)

# Ordinal Encoding
ordinal_encoder = OrdinalEncoder()
data_ordinal = ordinal_encoder.fit_transform(categories)
print('Ordinal encoded data:', data_ordinal)

One-Hot encoded data: [[0. 1. 0.]
 [0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]]
Ordinal encoded data: [[1.]
 [2.]
 [1.]
 [0.]]
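When the categories have a meaningful order, the default sorted order is usually not the one you want. The following is a minimal sketch, using a hypothetical size feature that is not part of the module's data, of passing an explicit order through the `categories` parameter of `OrdinalEncoder`.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordered feature, used only for illustration.
sizes = np.array([['medium'], ['small'], ['large'], ['small']])

# Default behavior: categories are sorted, so large=0, medium=1, small=2.
default_codes = OrdinalEncoder().fit_transform(sizes)

# Explicit order: small=0, medium=1, large=2.
ordered_encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
ordered_codes = ordered_encoder.fit_transform(sizes)

print(default_codes.ravel())  # [1. 2. 0. 2.]
print(ordered_codes.ravel())  # [1. 0. 2. 0.]
```

The explicit list keeps the integer codes aligned with the real ranking, which matters for models that treat the codes as magnitudes.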

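💡 Tip: With one-hot encoding, the full set of binary columns is redundant because any one column is determined by the others. Dropping one category per feature removes this redundancy and avoids multicollinearity in linear models.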
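One way to apply this tip is the `drop` parameter of `OneHotEncoder`. The sketch below reuses the animal categories from the encoding example and drops the first category in sorted order; which category to drop is a choice made here for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

categories = np.array([['cat'], ['dog'], ['cat'], ['bird']])

# drop='first' removes the first category in sorted order ('bird'),
# leaving one column for 'cat' and one for 'dog'.
encoder = OneHotEncoder(drop='first')
encoded = encoder.fit_transform(categories).toarray()

print(encoder.categories_[0])  # ['bird' 'cat' 'dog']
print(encoded)
# [[1. 0.]
#  [0. 1.]
#  [1. 0.]
#  [0. 0.]]
```

The dropped category is represented by a row of all zeros, so no information is lost. This mainly matters for unregularized linear models; tree-based models are generally not affected by the redundant column.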

❓ Which scaling technique transforms features to have a mean of 0 and a standard deviation of 1?

❓ Which encoding technique creates binary columns for each category?

Key Concepts

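Concept     Description
Scaling     Rescaling numeric features (standardization, min-max scaling) so they share a comparable range
Encoding    Converting categorical features to numbers (one-hot encoding, ordinal encoding)
Selection   Choosing the subset of features that is most useful to the model
Creation    Deriving new input features from existing ones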

Check Your Understanding

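❓ How do the scalers and encoders in this module handle edge cases, such as a category that was not seen during fitting?

❓ How does the cost of fitting and applying these transformers grow with the number of rows and features?

❓ Which parameter matters most for each transformer covered in this module?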
