Data Cleaning & Preprocessing
Duration: 15 min
Data Cleaning & Preprocessing
Why Clean Data Matters
> "Dirty data = Dirty insights = Bad decisions"
80% of data science time is spent cleaning data, but it's the most critical step.
The Cost of Bad Data
- Wrong decisions lead to wasted resources
- Models trained on dirty data make poor predictions
- Time spent fixing problems downstream instead of building
Common Data Quality Issues
1. Missing Values
Detection
import pandas as pddf.isnull().sum() # Count missing per column
df.isnull().sum() / len(df) * 100 # Percentage missing
Handling Strategies
Remove rows with any missing value
df_clean = df.dropna()Remove if missing > 50%
df_clean = df.dropna(thresh=len(df)*0.5, axis=1)Fill with mean (numeric)
df['age'].fillna(df['age'].mean(), inplace=True)Fill with mode (categorical)
df['country'].fillna(df['country'].mode()[0], inplace=True)Forward fill (time series)
df['value'].fillna(method='ffill', inplace=True)
2. Duplicates
Find duplicates
duplicates = df[df.duplicated(subset=['customer_id'], keep=False)]Remove duplicates (keep first occurrence)
df_clean = df.drop_duplicates(subset=['customer_id'])Remove completely identical rows
df_clean = df.drop_duplicates()
3. Outliers
IQR method
Q1 = df['salary'].quantile(0.25)
Q3 = df['salary'].quantile(0.75)
IQR = Q3 - Q1outliers = df[(df['salary'] < Q1 - 1.5IQR) | (df['salary'] > Q3 + 1.5IQR)]
Z-score method (values > 3 std deviations are suspicious)
from scipy import stats
z_scores = stats.zscore(df['salary'])
outliers = df[abs(z_scores) > 3]
4. Inconsistent Values
Standardize text
df['country'] = df['country'].str.lower()
df['country'] = df['country'].str.strip() # Remove whitespaceMap inconsistent values
mapping = {'USA': 'United States', 'U.S.': 'United States', 'US': 'United States'}
df['country'] = df['country'].map(mapping)Date consistency
df['date'] = pd.to_datetime(df['date'])
Data Validation
Create a validation checklist:
def validate_data(df):
checks = {
'no_nulls': df.isnull().sum().sum() == 0,
'no_duplicates': df.duplicated().sum() == 0,
'age_positive': (df['age'] > 0).all(),
'salary_numeric': pd.api.types.is_numeric_dtype(df['salary']),
'date_sorted': df['date'].is_monotonic_increasing,
}
for check, passed in checks.items():
status = '✓' if passed else '✗'
print(f"{status} {check}")
return all(checks.values())validate_data(df)
Normalization & Scaling
Some algorithms need features on similar scales (e.g., neural networks):
from sklearn.preprocessing import StandardScaler, MinMaxScalerStandardization: mean=0, std=1 (good for normal distributions)
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df[['age', 'salary']])Min-Max: scale to [0,1] (good for bounded data)
scaler = MinMaxScaler()
df_scaled = scaler.fit_transform(df[['age', 'salary']])
Handling Categorical Variables
One-hot encoding (for categories with few unique values)
df_encoded = pd.get_dummies(df, columns=['country', 'product_type'])Label encoding (for ordinal categories like Low/Medium/High)
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['priority_encoded'] = le.fit_transform(df['priority'])Binary encoding for many categories
df = pd.get_dummies(df, columns=['city'], drop_first=True)
Data Type Optimization
Reduce memory usage
df['age'] = df['age'].astype('int8') # Instead of int64
df['salary'] = df['salary'].astype('float32') # Instead of float64
df['country'] = df['country'].astype('category') # For low-cardinality stringsCheck memory before/after
print(df.memory_usage(deep=True).sum() / 10242) # MB
Complete Cleaning Pipeline
def clean_dataset(df):
"""Complete data cleaning workflow"""
# 1. Remove duplicates
df = df.drop_duplicates()
# 2. Handle missing values
df['age'].fillna(df['age'].median(), inplace=True)
df['country'].fillna(df['country'].mode()[0], inplace=True)
# 3. Fix inconsistencies
df['country'] = df['country'].str.lower().str.strip()
# 4. Remove outliers (IQR method)
for col in ['age', 'salary']:
Q1 = df[col].quantile(0.25)
Q3 = df[col].quantile(0.75)
IQR = Q3 - Q1
df = df[(df[col] >= Q1 - 1.5IQR) & (df[col] <= Q3 + 1.5IQR)]
# 5. Encode categorical
df = pd.get_dummies(df, columns=['country'])
# 6. Scale numeric features
scaler = StandardScaler()
numeric_cols = df.select_dtypes(include=['int64', 'float64']).columns
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])
return dfdf_clean = clean_dataset(df)
Key Takeaways
✓ Clean data is foundation of good data science ✓ Use appropriate strategies for different quality issues ✓ Validate data quality with automated checks ✓ Scale and encode features for model readiness
Practice
1. Load a real dataset from Kaggle
2. Check data quality: df.info(), df.describe(), df.isnull().sum()
3. Clean using techniques above
4. Validate with a checklist
5. Compare before/after statistics
---
Next: Exploratory Data Analysis—understanding patterns in clean data.