Feature Engineering

Duration: 5 min

Feature engineering is a crucial step in the machine learning pipeline, involving the creation and transformation of input features to improve model performance. This module covers essential techniques for feature engineering, including scaling, encoding, and transformation methods, and demonstrates their implementation using Scikit-Learn.

Scaling Features

Scaling puts features on comparable numeric ranges so that a feature with large values does not dominate models that rely on distances or gradients. Common scaling techniques include standardization (z-score normalization) and min-max scaling. Standardization transforms each feature to have a mean of 0 and a standard deviation of 1, while min-max scaling rescales each feature to a fixed range, typically [0, 1].

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Sample data
data = np.array([[1, 2], [2, 3], [3, 4], [4, 5]])

# Standardization
standard_scaler = StandardScaler()
data_standardized = standard_scaler.fit_transform(data)
print('Standardized data:', data_standardized)

# Min-Max Scaling
min_max_scaler = MinMaxScaler()
data_minmax = min_max_scaler.fit_transform(data)
print('Min-Max scaled data:', data_minmax)

Standardized data: [[-1.34164079 -1.34164079]
 [-0.4472136  -0.4472136 ]
 [ 0.4472136   0.4472136 ]
 [ 1.34164079  1.34164079]]
Min-Max scaled data: [[0.         0.        ]
 [0.33333333 0.33333333]
 [0.66666667 0.66666667]
 [1.         1.        ]]
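
To make the two formulas concrete, the sketch below recomputes both scalings by hand with NumPy on the same sample data. The variable names ending in `_manual` are illustrative and not part of Scikit-Learn; the sketch assumes the population standard deviation (`ddof=0`), which is what `StandardScaler` uses.

```python
import numpy as np

# Same sample data as above, as floats so division behaves as expected
data = np.array([[1, 2], [2, 3], [3, 4], [4, 5]], dtype=float)

# Standardization: z = (x - mean) / std, computed per column
col_mean = data.mean(axis=0)
col_std = data.std(axis=0)  # population std (ddof=0)
data_standardized_manual = (data - col_mean) / col_std

# Min-max scaling: x' = (x - min) / (max - min), computed per column
col_min = data.min(axis=0)
col_max = data.max(axis=0)
data_minmax_manual = (data - col_min) / (col_max - col_min)

print('Manual standardization:', data_standardized_manual)
print('Manual min-max scaling:', data_minmax_manual)
```

Both results should match the output of `StandardScaler` and `MinMaxScaler` up to floating-point rounding.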

Encoding Categorical Features

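Categorical features need to be encoded into numerical values before most machine learning models can process them. Common encoding techniques include one-hot encoding and ordinal encoding. One-hot encoding creates one binary column per category, while ordinal encoding assigns a unique integer to each category. By default, Scikit-Learn's OrdinalEncoder and OneHotEncoder order the categories by sorted (alphabetical) value, so the integers reflect a meaningful ranking only if an explicit category order is supplied.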

import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Sample data
categories = np.array([['cat'], ['dog'], ['cat'], ['bird']])

# One-Hot Encoding
one_hot_encoder = OneHotEncoder()
data_onehot = one_hot_encoder.fit_transform(categories).toarray()
print('One-Hot encoded data:', data_onehot)

# Ordinal Encoding
ordinal_encoder = OrdinalEncoder()
data_ordinal = ordinal_encoder.fit_transform(categories)
print('Ordinal encoded data:', data_ordinal)

One-Hot encoded data: [[0. 1. 0.]
 [0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]]
Ordinal encoded data: [[1.]
 [2.]
 [1.]
 [0.]]
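
When the categories have a natural ranking, the order can be passed explicitly through the `categories` parameter of `OrdinalEncoder`. The following is a minimal sketch; the `sizes` data and its small/medium/large ranking are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordered categories: small < medium < large
sizes = np.array([['small'], ['large'], ['medium'], ['small']])

# One list of categories per input column, given in the desired order
size_encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
sizes_encoded = size_encoder.fit_transform(sizes)

print('Category order:', size_encoder.categories_)
print('Ordinal encoded sizes:', sizes_encoded)
```

With this order, 'small' maps to 0, 'medium' to 1, and 'large' to 2, rather than the alphabetical order the encoder would otherwise pick.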

💡 Tip: Avoid multicollinearity when using one-hot encoding by dropping one category to prevent redundant information.

❓ Which scaling technique transforms features to have a mean of 0 and a standard deviation of 1?

Min-Max Scaling
Standardization
Normalization
Robust Scaling

❓ Which encoding technique creates binary columns for each category?

Ordinal Encoding
Label Encoding
One-Hot Encoding
Binary Encoding

Key Concepts

Concept	Description
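Scaling	Rescaling numeric features to a common range or distribution (standardization, min-max scaling)
Encoding	Converting categorical features into numeric form (one-hot encoding, ordinal encoding)
Selection	Keeping only the features that are useful to the model
Creation	Deriving new input features from existing ones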

Check Your Understanding

❓ How does Feature handle edge cases?

Ignores them
Applies regularization
Removes them
Duplicates them

❓ What is the computational complexity of Feature?

O(n)
O(n²)
O(log n)
Depends on implementation

❓ Which hyperparameter is most critical for Feature?

Learning rate
Batch size
Epochs
All equally important
