Module 26 of 26 · Scikit-Learn Machine Learning · Beginner

Exploratory Data Analysis (EDA)

Duration: 5 min

Exploratory Data Analysis (EDA) is the first step in any ML project. It reveals patterns, distributions, missing values, outliers, and relationships between variables, all of which inform preprocessing and model selection. Skipping EDA means data problems go unnoticed until training, which tends to hurt model performance and waste effort.

Understanding Data Distribution

Visualizing feature distributions reveals skewness, multimodality, and outliers. Histograms, box plots, KDE plots, and Q-Q plots show whether a feature is roughly normally distributed or may need a transformation.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load dataset
df = pd.read_csv('data.csv')

# Summary statistics
print(df.describe())
print(f"Shape: {df.shape}")
print(f"Missing values:\n{df.isnull().sum()}")

# Visualize distributions
fig, axes = plt.subplots(2, 2, figsize=(12, 8))

# Histogram
df['feature1'].hist(bins=30, ax=axes[0, 0])
axes[0, 0].set_title('Histogram: feature1')

# Box plot (detect outliers)
df.boxplot(column='feature1', ax=axes[0, 1])
axes[0, 1].set_title('Box Plot: feature1')

# KDE plot
df['feature1'].plot(kind='kde', ax=axes[1, 0])
axes[1, 0].set_title('KDE: feature1')

# Q-Q plot (normality check); probplot does not skip NaN, so drop missing values first
from scipy import stats
stats.probplot(df['feature1'].dropna(), dist="norm", plot=axes[1, 1])
axes[1, 1].set_title('Q-Q Plot: feature1')

plt.tight_layout()
plt.show()


Shape: (1000, 10)
Missing values:
feature1     5
feature2     0
...
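The plots above give a visual impression; the same questions can also be answered with numbers. The sketch below is illustrative only: it uses synthetic log-normal data as a stand-in for `feature1` (an assumption, since the contents of `data.csv` are not shown), and the helper `iqr_outlier_mask` is a hypothetical name introduced here. It measures skewness, counts outliers with the 1.5 × IQR rule that box plots use for their whiskers, and tries a `log1p` transform, which is one common option for right-skewed, non-negative data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df['feature1'] (assumption: right-skewed, positive values)
rng = np.random.default_rng(0)
feature1 = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=1000), name="feature1")


def iqr_outlier_mask(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR], the box-plot whisker rule."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)


# Sample skewness: near 0 for symmetric data, positive for a long right tail
skew_before = feature1.skew()
n_outliers = int(iqr_outlier_mask(feature1).sum())

# log1p(x) = log(1 + x) compresses the right tail
transformed = np.log1p(feature1)
skew_after = transformed.skew()

print(f"Skewness before: {skew_before:.2f}, after log1p: {skew_after:.2f}")
print(f"IQR outliers: {n_outliers}")
```

On real data, replace the synthetic series with `df['feature1'].dropna()`. Whether a transform is appropriate depends on the model: tree-based models are generally insensitive to monotonic transforms, while linear models often benefit.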

Correlation & Relationships

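Correlation matrices and scatter plots reveal relationships between features. A high correlation between two features suggests multicollinearity (the features carry largely redundant information), while a high correlation between a feature and the target suggests predictive power. Note that the default Pearson correlation captures only linear relationships, so a low value does not rule out a nonlinear one; scatter plots help catch those.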

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

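# Correlation matrix (Pearson by default), restricted to numeric columns
# because recent pandas versions raise an error on non-numeric columns
corr_matrix = df.select_dtypes(include='number').corr()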

# Heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0)
plt.title('Feature Correlation Matrix')
plt.show()

# Scatter plot: feature vs target
plt.figure(figsize=(10, 6))
plt.scatter(df['feature1'], df['target'], alpha=0.5)
plt.xlabel('feature1')
plt.ylabel('target')
plt.title('Feature vs Target Relationship')
plt.show()

# Identify highly correlated features (multicollinearity)
high_corr_pairs = []
for i in range(len(corr_matrix.columns)):
    for j in range(i+1, len(corr_matrix.columns)):
        if abs(corr_matrix.iloc[i, j]) > 0.9:
            high_corr_pairs.append((corr_matrix.columns[i], corr_matrix.columns[j], corr_matrix.iloc[i, j]))

print("Highly correlated feature pairs:")
for feat1, feat2, corr in high_corr_pairs:
    print(f"{feat1} - {feat2}: {corr:.3f}")
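The idea that correlation with the target hints at predictive power can be checked by ranking features by their absolute correlation with the target. The following is a minimal sketch on a synthetic DataFrame named `toy` (a hypothetical stand-in for `df`), where `feature1` drives the target by construction and `feature2` is pure noise.

```python
import numpy as np
import pandas as pd

# Synthetic data (assumption): target depends linearly on feature1 only
rng = np.random.default_rng(42)
n = 500
toy = pd.DataFrame({
    "feature1": rng.normal(size=n),
    "feature2": rng.normal(size=n),
})
toy["target"] = 3.0 * toy["feature1"] + rng.normal(scale=0.5, size=n)

# Absolute Pearson correlation of every feature with the target, strongest first
target_corr = (
    toy.corr()["target"]
    .drop("target")          # the target's correlation with itself is always 1
    .abs()
    .sort_values(ascending=False)
)
print(target_corr)
```

A ranking like this is only a first screen: it misses nonlinear effects and interactions, so a low-ranked feature is not necessarily useless.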

💡 Tip: An absolute correlation above 0.9 (r > 0.9 or r < -0.9) is a common rule-of-thumb sign of multicollinearity, not a fixed cutoff. Consider removing one feature from each such pair to reduce redundancy.
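One way to act on this tip is sketched below, again on a synthetic `toy` DataFrame (a hypothetical stand-in for the feature columns of `df`). It keeps only the upper triangle of the absolute correlation matrix so that each pair is inspected once, then drops the later column of every pair above the threshold. Which member of a pair to drop is a judgment call; this sketch simply keeps the first one.

```python
import numpy as np
import pandas as pd

# Synthetic features (assumption): feature2 is a near-duplicate of feature1
rng = np.random.default_rng(0)
n = 300
base = rng.normal(size=n)
toy = pd.DataFrame({
    "feature1": base,
    "feature2": base + rng.normal(scale=0.05, size=n),
    "feature3": rng.normal(size=n),  # independent of the other two
})

threshold = 0.9
abs_corr = toy.corr().abs()

# Upper triangle only (k=1 excludes the diagonal); everything else becomes NaN
upper = abs_corr.where(np.triu(np.ones(abs_corr.shape, dtype=bool), k=1))

# A column is dropped if it is highly correlated with any earlier column
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = toy.drop(columns=to_drop)

print("Dropped:", to_drop)
print("Remaining:", list(reduced.columns))
```

Exclude the target column before running this on real data, otherwise a feature that correlates strongly with the target could cause the target itself to be dropped.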

❓ What is the primary purpose of EDA?

❓ What correlation threshold indicates multicollinearity?
