Cleaning and Preparing Real Data
Duration: 5 min
Real datasets are messy. Missing values, wrong data types, outliers, and skewed distributions are the norm, not the exception. This module covers the essential cleaning steps before any model training.
Handling missing values
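The three options in this section are easier to compare on data that actually contains gaps. The following is a minimal sketch on a hypothetical toy frame; the `toy` frame and its `rooms` and `income` columns are made up for illustration and are not part of the California housing data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame with one gap per column (illustration only)
toy = pd.DataFrame({
    "rooms": [4.0, np.nan, 6.0, 5.0],
    "income": [2.5, 3.0, np.nan, 8.0],
})

# Count missing values per column
missing_counts = toy.isnull().sum()
print(missing_counts)

# Option 1: drop every row that contains a missing value
dropped = toy.dropna()
print(f"Rows kept after dropna: {len(dropped)} of {len(toy)}")

# Option 2: fill each column with its own median
filled = toy.fillna(toy.median())
print(filled)

# Option 3: SimpleImputer learns the medians in fit() and applies them in transform()
imputer = SimpleImputer(strategy="median")
imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)
print(imputed)
```

Dropping rows loses half of this small frame, while both median-based options keep every row and should agree with each other.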
import pandas as pd
from sklearn.datasets import fetch_california_housing
df = fetch_california_housing(as_frame=True).frame
# Check missing values (this dataset has none; the options below show the general pattern)
print(df.isnull().sum())
# Option 1: drop rows with any missing value
df_clean = df.dropna()
# Option 2: fill with median (better for skewed data)
df['AveBedrms'] = df['AveBedrms'].fillna(df['AveBedrms'].median())
# Option 3: use sklearn's SimpleImputer (best for pipelines)
from sklearn.impute import SimpleImputer
imp = SimpleImputer(strategy='median')
df_imputed = pd.DataFrame(imp.fit_transform(df), columns=df.columns)
Removing outliers
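The cutoffs used in this section are hand-picked for this dataset. When no obvious threshold exists, one common heuristic is the 1.5 × IQR rule, which flags values far outside the middle half of the data. This is a minimal sketch on synthetic data; the `rooms` series and the injected extreme values are assumptions for illustration, not the real `AveRooms` column.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a column such as AveRooms (illustration only)
rng = np.random.default_rng(0)
typical = rng.normal(loc=5.0, scale=1.0, size=1000)
extreme = np.array([50.0, 80.0, 140.0])  # injected outliers
rooms = pd.Series(np.concatenate([typical, extreme]), name="rooms")

# Interquartile range: the spread of the middle 50% of the values
q1, q3 = rooms.quantile([0.25, 0.75])
iqr = q3 - q1

# Anything beyond 1.5 * IQR from the quartiles is treated as an outlier
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
mask = rooms.between(lower, upper)
kept = rooms[mask]

print(f"Bounds: [{lower:.2f}, {upper:.2f}]")
print(f"Rows removed: {len(rooms) - len(kept)} of {len(rooms)}")
```

The rule is only a heuristic: on heavily skewed columns it can flag many legitimate values, so it is worth checking how many rows it removes before applying it.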
# MedHouseVal is in units of $100,000 and is capped at about 5.0 ($500k)
# Capped rows hide the true price and can confuse models, so remove them
df = df[df['MedHouseVal'] < 5.0]
print(f'Rows after removing capped values: {len(df)}')
# Remove extreme AveRooms outliers (some blocks show 50+ rooms)
df = df[df['AveRooms'] < 20]
print(f'Rows after removing room outliers: {len(df)}')
Feature scaling
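Many ML algorithms, especially distance-based and gradient-based ones, perform better when features are on the same scale. In this dataset MedInc ranges from about 0.5 to 15, while Population ranges from 3 to about 35,000; without scaling, Population dominates any distance or gradient computation.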
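To make the dominance effect concrete, the sketch below measures how much each feature contributes to the average squared Euclidean distance between rows, before and after standardization. It uses synthetic values drawn uniformly over the ranges quoted above rather than the real dataset, and `distance_share` is a hypothetical helper written for this illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic features over the quoted ranges (illustration only, not the real data)
rng = np.random.default_rng(42)
n = 500
income = rng.uniform(0.5, 15.0, n)
population = rng.uniform(3.0, 35_000.0, n)
X = np.column_stack([income, population])


def distance_share(X):
    """Fraction of the average squared Euclidean distance contributed by each column.

    The average squared difference between two rows is proportional to the
    column variance, so each column's share is its variance over the total.
    """
    var = X.var(axis=0)
    return var / var.sum()


raw_share = distance_share(X)
scaled_share = distance_share(StandardScaler().fit_transform(X))

print("Share of distance (income, population) before scaling:", raw_share)
print("Share of distance (income, population) after scaling: ", scaled_share)
```

Before scaling, nearly all of the distance comes from the population column; after standardization both columns have unit variance and contribute equally.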
from sklearn.preprocessing import StandardScaler
import pandas as pd
from sklearn.datasets import fetch_california_housing
df = fetch_california_housing(as_frame=True).frame
X = df.drop('MedHouseVal', axis=1)
y = df['MedHouseVal']
scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)
print(X_scaled.describe().round(2))
# All columns now have mean≈0 and std≈1
💡 Tip: Always fit the scaler on training data only, then transform both train and test. Fitting on the full dataset leaks information about the test set.
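The tip above can be sketched as follows. This is one possible way to do it, assuming a simple random 80/20 split; the data is synthetic so the example runs offline, and the choice of `LinearRegression` is only a placeholder model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (illustration only)
rng = np.random.default_rng(0)
X = rng.normal(loc=[5.0, 1500.0], scale=[2.0, 1000.0], size=(200, 2))
y = 0.5 * X[:, 0] + 0.001 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Split first, so the test rows never influence the scaler
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Manual version: learn mean and std from the training rows only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # reuses the training statistics

# Pipeline version: fit() scales and trains using X_train only,
# and score() applies the same training statistics to X_test
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_train, y_train)
print(f"Test R^2: {pipe.score(X_test, y_test):.3f}")
```

After this, the training columns have mean 0 and std 1 exactly, while the test columns are only approximately standardized. That small mismatch is expected, because the test set was never seen during fitting.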
❓ Why should you fit a StandardScaler only on training data?