Feature Engineering
Duration: 5 min
This module covers core techniques for feature engineering, a critical step in the data science pipeline. Feature engineering means creating new features or modifying existing ones so that a model can learn from the data more easily. Done well, it can significantly improve the predictive power of machine learning models.
Creating New Features
Creating new features can help capture information that is not explicitly present in the original dataset. This can be done by combining existing features, applying mathematical transformations, or extracting new information from the data. For example, if you have date-time data, you can extract features like year, month, day, hour, etc., which might be relevant for your model.
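The first two approaches, combining existing features and applying a mathematical transformation, can be sketched as follows. This is a minimal illustration rather than a recipe: the housing data and the column names `price`, `area_sqft`, `price_per_sqft`, and `log_price` are assumptions made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical housing data; the column names are illustrative only
df = pd.DataFrame({
    'price': [200000, 350000, 500000],
    'area_sqft': [1000, 1400, 2500],
})

# Combine two existing features into a ratio feature
df['price_per_sqft'] = df['price'] / df['area_sqft']

# Apply a mathematical transformation: log(1 + x) compresses a wide value range
df['log_price'] = np.log1p(df['price'])

print(df)
```

The date-time case described above is worked through in the example that follows.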
import pandas as pd
# Sample DataFrame
data = {'date': ['2023-01-01', '2023-02-15', '2023-03-10']}
df = pd.DataFrame(data)
# Convert 'date' to datetime
df['date'] = pd.to_datetime(df['date'])
# Extract year, month, day
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
print(df)

        date  year  month  day
0 2023-01-01  2023      1    1
1 2023-02-15  2023      2   15
2 2023-03-10  2023      3   10

Handling Categorical Features
Categorical features often need to be encoded into numerical values for machine learning models to process them. Common techniques include one-hot encoding and label encoding. One-hot encoding creates binary columns for each category, while label encoding assigns a unique integer to each category.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
# Sample DataFrame with categorical data
data = {'color': ['red', 'blue', 'green','red']}
df = pd.DataFrame(data)
# One-hot encoding
enc = OneHotEncoder()
enc_data = enc.fit_transform(df[['color']]).toarray()
enc_df = pd.DataFrame(enc_data, columns=enc.get_feature_names_out(['color']))
# Concatenate original DataFrame with encoded DataFrame
df = pd.concat([df, enc_df], axis=1)
print(df)

💡 Tip: When using one-hot encoding, be mindful of the dimensionality it adds to your dataset. A feature with many categories produces a wide, sparse matrix, which can increase memory use and affect model performance.
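Label encoding, described above alongside one-hot encoding, can be sketched in the same way. The example below reuses the `color` data and assumes scikit-learn's `LabelEncoder`; the output column name `color_label` is chosen only for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Same sample data as the one-hot example
df = pd.DataFrame({'color': ['red', 'blue', 'green', 'red']})

# Label encoding: each distinct category is mapped to one integer.
# LabelEncoder sorts the categories, so blue=0, green=1, red=2.
le = LabelEncoder()
df['color_label'] = le.fit_transform(df['color'])

print(df)
print(le.classes_)  # categories in the order of their integer codes
```

The integers imply an order (blue < green < red) that the colors do not really have, so label encoding tends to suit ordinal categories or tree-based models better than linear ones. scikit-learn documents `LabelEncoder` as intended for target labels; for input feature columns, `OrdinalEncoder` is the equivalent tool.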
❓ What is the purpose of feature engineering in data science?
❓ Which encoding technique creates binary columns for each category?
Key Concepts
| Concept | Description |
|---|---|
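| Scaling | Rescaling numeric features to a comparable range, for example by standardization or min-max normalization |
| Encoding | Converting categorical values to numbers, for example with one-hot or label encoding |
| Selection | Keeping the features that are most useful to the model and dropping the rest |
| Creation | Deriving new features from existing ones, for example year, month, and day from a date |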
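Scaling and selection appear in the table but are not demonstrated elsewhere in this module. The sketch below shows one possible form of each, assuming scikit-learn's `VarianceThreshold` and `StandardScaler`; the data and the column names `age`, `income`, and `constant` are hypothetical, and other scalers or selection criteria may fit a given problem better.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler

# Hypothetical data; 'constant' carries no information
df = pd.DataFrame({
    'age': [20, 30, 40],
    'income': [30000, 60000, 90000],
    'constant': [1, 1, 1],
})

# Selection: drop features whose variance is zero
selector = VarianceThreshold(threshold=0.0)
selector.fit(df)
kept = df.columns[selector.get_support()]
selected = df[kept]

# Scaling: rescale each remaining feature to zero mean and unit variance
scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(selected), columns=kept)

print(list(kept))
print(scaled)
```

After scaling, `age` and `income` are on the same scale even though their raw ranges differ by three orders of magnitude.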
Check Your Understanding
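❓ How should feature engineering handle edge cases, such as missing values or categories not seen before?
❓ How does one-hot encoding affect dimensionality as the number of categories grows?
❓ Which feature engineering decision is most critical for your model: scaling, encoding, selection, or creation?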