Data Cleaning & Preprocessing

Duration: 15 min

Data Cleaning & Preprocessing

Why Clean Data Matters

> "Dirty data = Dirty insights = Bad decisions"

80% of data science time is spent cleaning data, but it's the most critical step.

The Cost of Bad Data

  • Wrong decisions lead to wasted resources
  • Models trained on dirty data make poor predictions
  • Time spent fixing problems downstream instead of building

Common Data Quality Issues

1. Missing Values

Detection

import pandas as pd

df.isnull().sum() # Count missing per column df.isnull().sum() / len(df) * 100 # Percentage missing

Handling Strategies

Remove rows with any missing value

df_clean = df.dropna()

Remove if missing > 50%

df_clean = df.dropna(thresh=len(df)*0.5, axis=1)

Fill with mean (numeric)

df['age'].fillna(df['age'].mean(), inplace=True)

Fill with mode (categorical)

df['country'].fillna(df['country'].mode()[0], inplace=True)

Forward fill (time series)

df['value'].fillna(method='ffill', inplace=True)

2. Duplicates

Find duplicates

duplicates = df[df.duplicated(subset=['customer_id'], keep=False)]

Remove duplicates (keep first occurrence)

df_clean = df.drop_duplicates(subset=['customer_id'])

Remove completely identical rows

df_clean = df.drop_duplicates()

3. Outliers

IQR method

Q1 = df['salary'].quantile(0.25) Q3 = df['salary'].quantile(0.75) IQR = Q3 - Q1

outliers = df[(df['salary'] < Q1 - 1.5IQR) | (df['salary'] > Q3 + 1.5IQR)]

Z-score method (values > 3 std deviations are suspicious)

from scipy import stats z_scores = stats.zscore(df['salary']) outliers = df[abs(z_scores) > 3]

4. Inconsistent Values

Standardize text

df['country'] = df['country'].str.lower() df['country'] = df['country'].str.strip() # Remove whitespace

Map inconsistent values

mapping = {'USA': 'United States', 'U.S.': 'United States', 'US': 'United States'} df['country'] = df['country'].map(mapping)

Date consistency

df['date'] = pd.to_datetime(df['date'])

Data Validation

Create a validation checklist:

def validate_data(df):
    checks = {
        'no_nulls': df.isnull().sum().sum() == 0,
        'no_duplicates': df.duplicated().sum() == 0,
        'age_positive': (df['age'] > 0).all(),
        'salary_numeric': pd.api.types.is_numeric_dtype(df['salary']),
        'date_sorted': df['date'].is_monotonic_increasing,
    }
    
    for check, passed in checks.items():
        status = '✓' if passed else '✗'
        print(f"{status} {check}")
    
    return all(checks.values())

validate_data(df)

Normalization & Scaling

Some algorithms need features on similar scales (e.g., neural networks):

from sklearn.preprocessing import StandardScaler, MinMaxScaler

Standardization: mean=0, std=1 (good for normal distributions)

scaler = StandardScaler() df_scaled = scaler.fit_transform(df[['age', 'salary']])

Min-Max: scale to [0,1] (good for bounded data)

scaler = MinMaxScaler() df_scaled = scaler.fit_transform(df[['age', 'salary']])

Handling Categorical Variables

One-hot encoding (for categories with few unique values)

df_encoded = pd.get_dummies(df, columns=['country', 'product_type'])

Label encoding (for ordinal categories like Low/Medium/High)

from sklearn.preprocessing import LabelEncoder le = LabelEncoder() df['priority_encoded'] = le.fit_transform(df['priority'])

Binary encoding for many categories

df = pd.get_dummies(df, columns=['city'], drop_first=True)

Data Type Optimization

Reduce memory usage

df['age'] = df['age'].astype('int8') # Instead of int64 df['salary'] = df['salary'].astype('float32') # Instead of float64 df['country'] = df['country'].astype('category') # For low-cardinality strings

Check memory before/after

print(df.memory_usage(deep=True).sum() / 10242) # MB

Complete Cleaning Pipeline

def clean_dataset(df):
    """Complete data cleaning workflow"""
    
    # 1. Remove duplicates
    df = df.drop_duplicates()
    
    # 2. Handle missing values
    df['age'].fillna(df['age'].median(), inplace=True)
    df['country'].fillna(df['country'].mode()[0], inplace=True)
    
    # 3. Fix inconsistencies
    df['country'] = df['country'].str.lower().str.strip()
    
    # 4. Remove outliers (IQR method)
    for col in ['age', 'salary']:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        df = df[(df[col] >= Q1 - 1.5IQR) & (df[col] <= Q3 + 1.5IQR)]
    
    # 5. Encode categorical
    df = pd.get_dummies(df, columns=['country'])
    
    # 6. Scale numeric features
    scaler = StandardScaler()
    numeric_cols = df.select_dtypes(include=['int64', 'float64']).columns
    df[numeric_cols] = scaler.fit_transform(df[numeric_cols])
    
    return df

df_clean = clean_dataset(df)

Key Takeaways

✓ Clean data is foundation of good data science ✓ Use appropriate strategies for different quality issues ✓ Validate data quality with automated checks ✓ Scale and encode features for model readiness

Practice

1. Load a real dataset from Kaggle 2. Check data quality: df.info(), df.describe(), df.isnull().sum() 3. Clean using techniques above 4. Validate with a checklist 5. Compare before/after statistics

---

Next: Exploratory Data Analysis—understanding patterns in clean data.