Feature Engineering
Duration: 5 min
Feature engineering is a crucial step in the machine learning pipeline, involving the creation and transformation of input features to improve model performance. This module covers essential techniques for feature engineering, including scaling, encoding, and transformation methods, and demonstrates their implementation using Scikit-Learn.
Scaling Features
Scaling puts features on comparable ranges so that features with large numeric values do not dominate models that are sensitive to feature magnitude. Common scaling techniques include standardization (z-score normalization) and min-max scaling. Standardization transforms each feature to have a mean of 0 and a standard deviation of 1, while min-max scaling rescales each feature to a fixed range, typically [0, 1].
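As a minimal sketch of what these two transforms compute, the formulas can be written out directly in NumPy. The variable names (`manual_standardized`, `manual_minmax`) are illustrative, and the sketch assumes no column is constant, since a constant column would cause a division by zero.

```python
import numpy as np

# Same small sample as used elsewhere in this module
data = np.array([[1, 2], [2, 3], [3, 4], [4, 5]], dtype=float)

# Standardization: subtract the column mean, divide by the column standard deviation.
# np.std defaults to the population standard deviation (ddof=0).
col_mean = data.mean(axis=0)
col_std = data.std(axis=0)
manual_standardized = (data - col_mean) / col_std

# Min-max scaling: map each column's minimum to 0 and its maximum to 1.
col_min = data.min(axis=0)
col_max = data.max(axis=0)
manual_minmax = (data - col_min) / (col_max - col_min)

print('Manually standardized:', manual_standardized)
print('Manually min-max scaled:', manual_minmax)
```

Both transforms work column by column, so each feature is scaled using only its own statistics.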
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler
# Sample data
data = np.array([[1, 2], [2, 3], [3, 4], [4, 5]])
# Standardization
standard_scaler = StandardScaler()
data_standardized = standard_scaler.fit_transform(data)
print('Standardized data:', data_standardized)
# Min-Max Scaling
min_max_scaler = MinMaxScaler()
data_minmax = min_max_scaler.fit_transform(data)
print('Min-Max scaled data:', data_minmax)

Output:
Standardized data: [[-1.34164079 -1.34164079]
 [-0.4472136  -0.4472136 ]
 [ 0.4472136   0.4472136 ]
 [ 1.34164079  1.34164079]]
Min-Max scaled data: [[0.         0.        ]
 [0.33333333 0.33333333]
 [0.66666667 0.66666667]
 [1.         1.        ]]

Encoding Categorical Features
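Categorical features must be encoded as numerical values before most machine learning models can process them. Common encoding techniques include one-hot encoding and ordinal encoding. One-hot encoding creates one binary column per category, while ordinal encoding assigns a unique integer to each category. By default, Scikit-Learn's OrdinalEncoder assigns integers in sorted order of the category values, so an explicit order should be supplied when the categories have a meaningful ranking.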
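When the categories do have a natural ranking, the order can be passed to the encoder explicitly. The sketch below is illustrative only: the `sizes` data and its small/medium/large ranking are assumptions made for this example, not part of the module's dataset.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical feature with a natural order: small < medium < large
sizes = np.array([['medium'], ['small'], ['large'], ['small']])

# One list of categories per input column, given in the desired order
size_encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
encoded_sizes = size_encoder.fit_transform(sizes)

print('Encoded sizes:', encoded_sizes)
```

Without the `categories` argument, the integers would follow alphabetical order (large, medium, small), which would not reflect the intended ranking.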
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
# Sample data
categories = np.array([['cat'], ['dog'], ['cat'], ['bird']])
# One-Hot Encoding
one_hot_encoder = OneHotEncoder()
data_onehot = one_hot_encoder.fit_transform(categories).toarray()
print('One-Hot encoded data:', data_onehot)
# Ordinal Encoding
ordinal_encoder = OrdinalEncoder()
data_ordinal = ordinal_encoder.fit_transform(categories)
print('Ordinal encoded data:', data_ordinal)

Output:
One-Hot encoded data: [[0. 1. 0.]
 [0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]]
Ordinal encoded data: [[1.]
 [2.]
 [1.]
 [0.]]

💡 Tip: When using one-hot encoding, drop one category to avoid multicollinearity. The dropped column is redundant because its value can be inferred from the remaining columns.
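One way to apply this tip is the `drop` parameter of `OneHotEncoder`. The following is a minimal sketch that reuses the module's sample categories; choosing `drop='first'` is an assumption made here for illustration, and which category to drop is a modelling decision.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

categories = np.array([['cat'], ['dog'], ['cat'], ['bird']])

# drop='first' removes the column for the first category in sorted order ('bird')
dropping_encoder = OneHotEncoder(drop='first')
data_onehot_dropped = dropping_encoder.fit_transform(categories).toarray()

print('Learned categories:', dropping_encoder.categories_)
print('One-Hot encoded data (first category dropped):', data_onehot_dropped)
```

The result has two columns instead of three, and a row of all zeros represents the dropped category.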
❓ Which scaling technique transforms features to have a mean of 0 and a standard deviation of 1?
❓ Which encoding technique creates binary columns for each category?
Key Concepts
| Concept | Description |
|---|---|
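| Scaling | Rescaling numeric features to comparable ranges, for example with standardization or min-max scaling |
| Encoding | Converting categorical features to numerical values, for example with one-hot or ordinal encoding |
| Selection | Choosing the subset of features most useful to the model |
| Creation | Deriving new features from existing ones |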
Check Your Understanding
❓ To what range does min-max scaling typically rescale features?
❓ How does ordinal encoding differ from one-hot encoding?
❓ Why might you drop one category when using one-hot encoding?