Data Cleaning & Preprocessing

Why Clean Data Matters

> "Dirty data = Dirty insights = Bad decisions"

80% of data science time is spent cleaning data, but it's the most critical step.

The Cost of Bad Data

Wrong decisions lead to wasted resources

Models trained on dirty data make poor predictions

Time spent fixing problems downstream instead of building

Common Data Quality Issues

1. Missing Values

Detection

import pandas as pddf.isnull().sum()  # Count missing per column
df.isnull().sum() / len(df) * 100  # Percentage missing

Handling Strategies

Remove rows with any missing value
df_clean = df.dropna()
Remove if missing > 50%
df_clean = df.dropna(thresh=len(df)*0.5, axis=1)
Fill with mean (numeric)
df['age'].fillna(df['age'].mean(), inplace=True)
Fill with mode (categorical)
df['country'].fillna(df['country'].mode()[0], inplace=True)
Forward fill (time series)
df['value'].fillna(method='ffill', inplace=True)

2. Duplicates

Find duplicates
duplicates = df[df.duplicated(subset=['customer_id'], keep=False)]
Remove duplicates (keep first occurrence)
df_clean = df.drop_duplicates(subset=['customer_id'])
Remove completely identical rows
df_clean = df.drop_duplicates()

3. Outliers

IQR method
Q1 = df['salary'].quantile(0.25)
Q3 = df['salary'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['salary'] < Q1 - 1.5IQR) | (df['salary'] > Q3 + 1.5IQR)]
Z-score method (values > 3 std deviations are suspicious)
from scipy import stats
z_scores = stats.zscore(df['salary'])
outliers = df[abs(z_scores) > 3]

4. Inconsistent Values

Standardize text
df['country'] = df['country'].str.lower()
df['country'] = df['country'].str.strip()  # Remove whitespace
Map inconsistent values
mapping = {'USA': 'United States', 'U.S.': 'United States', 'US': 'United States'}
df['country'] = df['country'].map(mapping)
Date consistency
df['date'] = pd.to_datetime(df['date'])

Data Validation

Create a validation checklist:

def validate_data(df):
    checks = {
        'no_nulls': df.isnull().sum().sum() == 0,
        'no_duplicates': df.duplicated().sum() == 0,
        'age_positive': (df['age'] > 0).all(),
        'salary_numeric': pd.api.types.is_numeric_dtype(df['salary']),
        'date_sorted': df['date'].is_monotonic_increasing,
    }
    
    for check, passed in checks.items():
        status = '✓' if passed else '✗'
        print(f"{status} {check}")
    
    return all(checks.values())validate_data(df)

Normalization & Scaling

Some algorithms need features on similar scales (e.g., neural networks):

from sklearn.preprocessing import StandardScaler, MinMaxScaler
Standardization: mean=0, std=1 (good for normal distributions)
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df[['age', 'salary']])
Min-Max: scale to [0,1] (good for bounded data)
scaler = MinMaxScaler()
df_scaled = scaler.fit_transform(df[['age', 'salary']])

Handling Categorical Variables

One-hot encoding (for categories with few unique values)
df_encoded = pd.get_dummies(df, columns=['country', 'product_type'])
Label encoding (for ordinal categories like Low/Medium/High)
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['priority_encoded'] = le.fit_transform(df['priority'])
Binary encoding for many categories
df = pd.get_dummies(df, columns=['city'], drop_first=True)

Data Type Optimization

Reduce memory usage
df['age'] = df['age'].astype('int8')  # Instead of int64
df['salary'] = df['salary'].astype('float32')  # Instead of float64
df['country'] = df['country'].astype('category')  # For low-cardinality strings
Check memory before/after
print(df.memory_usage(deep=True).sum() / 10242)  # MB

Complete Cleaning Pipeline

def clean_dataset(df):
    """Complete data cleaning workflow"""
    
    # 1. Remove duplicates
    df = df.drop_duplicates()
    
    # 2. Handle missing values
    df['age'].fillna(df['age'].median(), inplace=True)
    df['country'].fillna(df['country'].mode()[0], inplace=True)
    
    # 3. Fix inconsistencies
    df['country'] = df['country'].str.lower().str.strip()
    
    # 4. Remove outliers (IQR method)
    for col in ['age', 'salary']:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        df = df[(df[col] >= Q1 - 1.5IQR) & (df[col] <= Q3 + 1.5IQR)]
    
    # 5. Encode categorical
    df = pd.get_dummies(df, columns=['country'])
    
    # 6. Scale numeric features
    scaler = StandardScaler()
    numeric_cols = df.select_dtypes(include=['int64', 'float64']).columns
    df[numeric_cols] = scaler.fit_transform(df[numeric_cols])
    
    return dfdf_clean = clean_dataset(df)

Key Takeaways

✓ Clean data is foundation of good data science ✓ Use appropriate strategies for different quality issues ✓ Validate data quality with automated checks ✓ Scale and encode features for model readiness

Practice

1. Load a real dataset from Kaggle 2. Check data quality: df.info(), df.describe(), df.isnull().sum() 3. Clean using techniques above 4. Validate with a checklist 5. Compare before/after statistics

---

Next: Exploratory Data Analysis—understanding patterns in clean data.