Module 3 of 9 · Real Datasets & Pre-trained Models · Beginner

Cleaning and Preparing Real Data

Duration: 5 min

Real datasets are messy. Missing values, wrong data types, outliers, and skewed distributions are the norm, not the exception. This module covers the essential cleaning steps before any model training.

Handling missing values

import pandas as pd
from sklearn.datasets import fetch_california_housing

df = fetch_california_housing(as_frame=True).frame

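# Count missing values per column
# (the scikit-learn copy of this dataset has none, so every count is 0)
print(df.isnull().sum())

# Option 1: drop rows with any missing value (returns a new DataFrame)
df_clean = df.dropna()

# Option 2: fill with the median (more robust than the mean for skewed data)
df['AveBedrms'] = df['AveBedrms'].fillna(df['AveBedrms'].median())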

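# Option 3: use sklearn's SimpleImputer (best for pipelines)
from sklearn.impute import SimpleImputer

# strategy='median' requires every column to be numeric
imp = SimpleImputer(strategy='median')
# fit_transform returns a NumPy array, so rebuild the DataFrame with its column names
df_imputed = pd.DataFrame(imp.fit_transform(df), columns=df.columns)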
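
The copy of California housing that scikit-learn ships has no missing values, so the three options above leave it unchanged. A minimal sketch on a small made-up frame shows what median imputation does when gaps actually exist. The `toy` DataFrame and its values are invented for illustration and are not part of the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data, not taken from California housing
toy = pd.DataFrame({
    "rooms": [4.0, 5.0, np.nan, 6.0, 50.0],   # 50.0 is a deliberate outlier
    "income": [2.5, np.nan, 3.0, 3.5, 4.0],
})

print(toy.isnull().sum())  # one missing value in each column

imp = SimpleImputer(strategy="median")
toy_imputed = pd.DataFrame(imp.fit_transform(toy), columns=toy.columns)

# The imputer stores one median per column, computed while ignoring NaN
print(imp.statistics_)
print(toy_imputed)
```

The outlier in `rooms` pulls the column mean far above the typical value, while the median stays near the middle of the ordinary values. That is the reason the median is usually the safer fill for skewed columns.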


Removing outliers

# MedHouseVal is in units of $100,000 and is capped at about 5.0 (roughly $500k)
# Capped values hide the true price and can confuse models, so remove them
df = df[df['MedHouseVal'] < 5.0]
print(f'Rows after removing capped values: {len(df)}')

# Remove extreme AveRooms outliers (some blocks average 50+ rooms per household)
df = df[df['AveRooms'] < 20]
print(f'Rows after removing room outliers: {len(df)}')
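
The cutoff of 20 above is chosen by hand. One common alternative, which this lesson does not use, is to derive the bounds from the data with the interquartile range (IQR) rule. The sketch below is an illustration under assumptions: `iqr_mask` is a hypothetical helper, the multiplier `k=1.5` is a conventional default, and the values are made up rather than taken from the dataset.

```python
import pandas as pd

def iqr_mask(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.between(q1 - k * iqr, q3 + k * iqr)

# Hypothetical values standing in for a column such as AveRooms
rooms = pd.Series([4.2, 5.1, 5.6, 6.0, 6.3, 7.1, 52.0], name="AveRooms")

mask = iqr_mask(rooms)
rooms_clean = rooms[mask]
print(f"Kept {mask.sum()} of {len(rooms)} values")
```

On a DataFrame, the same mask can filter rows with `df[iqr_mask(df['AveRooms'])]`. Whichever rule is used, it is worth printing the row count before and after, as the lesson does, to see how much data the filter discards.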

Feature scaling

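Many ML algorithms, especially those based on distances or gradient descent, perform better when features are on the same scale. In this dataset, MedInc ranges from about 0.5 to 15, while Population ranges from 3 to over 35,000. Without scaling, Population dominates any calculation that combines the two.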

from sklearn.preprocessing import StandardScaler
import pandas as pd
from sklearn.datasets import fetch_california_housing

df = fetch_california_housing(as_frame=True).frame
X = df.drop('MedHouseVal', axis=1)
y = df['MedHouseVal']

# Fitting on all rows here is for demonstration only; see the tip below
scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)
print(X_scaled.describe().round(2))
# All columns now have mean≈0 and std≈1
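
The claim that an unscaled Population column dominates can be checked with a few lines of arithmetic. The sketch below compares Euclidean distances between made-up blocks before and after scaling. The four rows and the `dist` helper are assumptions for illustration, not values from the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical blocks; columns are [MedInc, Population]
X_demo = np.array([
    [2.0, 1000.0],
    [9.0, 1050.0],   # very different income, similar population
    [2.1, 3000.0],   # similar income, very different population
    [5.0, 1500.0],
])

def dist(a, b):
    """Euclidean distance between two rows."""
    return float(np.linalg.norm(a - b))

# Raw units: the large income gap barely registers next to the population gap
raw_income_gap = dist(X_demo[0], X_demo[1])
raw_population_gap = dist(X_demo[0], X_demo[2])
print(raw_income_gap, raw_population_gap)

# After scaling, both kinds of difference contribute on a comparable footing
X_demo_scaled = StandardScaler().fit_transform(X_demo)
scaled_income_gap = dist(X_demo_scaled[0], X_demo_scaled[1])
scaled_population_gap = dist(X_demo_scaled[0], X_demo_scaled[2])
print(scaled_income_gap, scaled_population_gap)
```

In raw units, a distance-based model would treat the first two blocks as near neighbours even though their incomes sit at opposite ends of the range. After scaling, the two gaps are of similar size.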

💡 Tip: Always fit the scaler on training data only, then transform both train and test. Fitting on the full dataset leaks information about the test set.
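
A minimal sketch of the tip above, using synthetic data so it runs without a download. The feature matrix, the 80/20 split, and the random seeds are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a feature matrix (values are made up)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[3.0, 1400.0], scale=[2.0, 1100.0], size=(200, 2))
y_demo = rng.normal(size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std from train only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

print(X_train_scaled.mean(axis=0).round(2))  # approximately 0 by construction
print(X_test_scaled.mean(axis=0).round(2))   # near 0, but generally not exactly 0
```

The test set is transformed with the mean and standard deviation learned from the training rows, so nothing about the test rows influences preprocessing. Its scaled means will usually differ slightly from 0, and that is expected.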

❓ Why should you fit a StandardScaler only on training data?
