Module 11 of 16 · Maths and Statistics in AI · Beginner

Feature Engineering and Selection

Duration: 5 min

This module covers feature engineering and feature selection, two steps that strongly influence how well a machine learning model performs. Transforming raw inputs into informative features, and then keeping only the relevant ones, can improve both predictive performance and training efficiency.

Feature Engineering

Feature engineering means transforming existing features, or deriving new ones from them, so that a model can learn from the data more easily. Common transformations include normalization (rescaling numerical values), encoding categorical variables as numbers, and generating polynomial features. The example below standardizes two numerical columns so that each has zero mean and unit variance.

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Load sample dataset
data = pd.read_csv('sample_data.csv')

# Standardize numerical features
scaler = StandardScaler()
data[['age', 'income']] = scaler.fit_transform(data[['age', 'income']])

# Display the first few rows of the transformed data
print(data.head())


Feature Selection

Feature selection is the process of selecting a subset of relevant features for model construction. This can help reduce overfitting, improve model interpretability, and decrease training time. Techniques such as recursive feature elimination (RFE) and feature importance from tree-based models are commonly used to identify and select the most important features.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Load sample dataset
data = pd.read_csv('sample_data.csv')
X = data.drop('target', axis=1)
y = data['target']

# Train a Random Forest model
model = RandomForestClassifier(n_estimators=100)
model.fit(X, y)

# Select features with importance greater than the mean importance.
# SelectFromModel.fit refits a clone of the estimator before applying the threshold.
sfm = SelectFromModel(model, threshold=model.feature_importances_.mean())
sfm.fit(X, y)

# Display the selected features
selected_features = X.columns[sfm.get_support()]
print(selected_features)
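Recursive feature elimination (RFE), the other technique named above, can be sketched as follows. This is a rough example rather than a recommended configuration: the synthetic dataset, the choice of logistic regression as the base estimator, and the target of three features are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data (an assumption for this sketch): 8 features, 3 of them informative
X, y = make_classification(
    n_samples=200, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

# RFE fits the estimator, drops the weakest feature, and repeats
# until only n_features_to_select features remain
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe.fit(X, y)

selected_mask = rfe.support_   # True for each feature that was kept
ranking = rfe.ranking_         # 1 for kept features, larger values were dropped earlier
print(selected_mask)
print(ranking)
```

Unlike the importance threshold used with SelectFromModel, RFE requires the number of features to keep to be chosen in advance.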

💡 Tip: After feature selection, check that the features you kept still carry the signal needed to predict the target variable, for example by comparing validation scores before and after selection, so that you do not discard important information.
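One way to check that a selection step has not discarded useful information is to compare cross-validated scores with and without it. The sketch below does this on synthetic data; the dataset, the number of trees, and the five folds are assumptions chosen to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic data (an assumption for this sketch): 10 features, 4 of them informative
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=4, n_redundant=0, random_state=0
)

# Baseline: the classifier trained on all features
baseline = RandomForestClassifier(n_estimators=50, random_state=0)

# Same classifier preceded by a selection step; placing the selector inside
# the pipeline means it is refit on the training part of every fold
with_selection = make_pipeline(
    SelectFromModel(
        RandomForestClassifier(n_estimators=50, random_state=0), threshold="mean"
    ),
    RandomForestClassifier(n_estimators=50, random_state=0),
)

baseline_scores = cross_val_score(baseline, X, y, cv=5)
selection_scores = cross_val_score(with_selection, X, y, cv=5)
print("All features:     ", baseline_scores.mean())
print("Selected features:", selection_scores.mean())
```

If the score with selection is clearly lower than the baseline, the threshold is probably removing features that the model needs.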

❓ What is the primary goal of feature engineering?

❓ Which technique is commonly used for feature selection in tree-based models?
