Exploratory Data Analysis (EDA)

Duration: 5 min

Exploratory Data Analysis is the critical first step in any ML project. EDA reveals data patterns, distributions, missing values, outliers, and relationships that inform preprocessing and model selection decisions. Skipping it risks training on data problems you never saw, such as unhandled missing values or extreme outliers, which tends to hurt model performance and waste effort.

Understanding Data Distribution

Visualizing feature distributions reveals skewness, multimodality, and outliers. Histograms, box plots, and KDE plots show the shape of a distribution, and a Q-Q plot compares it against a normal distribution, which helps decide whether a transformation is needed.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Load dataset
df = pd.read_csv('data.csv')

# Summary statistics
print(df.describe())
print(f"Shape: {df.shape}")
print(f"Missing values:\n{df.isnull().sum()}")

# Visualize distributions in a 2x2 grid
fig, axes = plt.subplots(2, 2, figsize=(12, 8))

# Histogram
df['feature1'].hist(bins=30, ax=axes[0, 0])
axes[0, 0].set_title('Histogram: feature1')

# Box plot (detect outliers)
df.boxplot(column='feature1', ax=axes[0, 1])
axes[0, 1].set_title('Box Plot: feature1')

# KDE plot
df['feature1'].plot(kind='kde', ax=axes[1, 0])
axes[1, 0].set_title('KDE: feature1')

# Q-Q plot (normality check); probplot does not skip NaNs, so drop them first
stats.probplot(df['feature1'].dropna(), dist="norm", plot=axes[1, 1])
axes[1, 1].set_title('Q-Q Plot: feature1')

plt.tight_layout()
plt.show()

Example output (abridged):

Shape: (1000, 10)
Missing values:
feature1     5
feature2     0
...
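
The plots above are visual checks; the same questions can also be put into numbers. The following is a minimal sketch, not part of the original lesson, that measures skewness, applies a log transform, and flags outliers with the 1.5 x IQR rule (the rule behind box plot whiskers). The synthetic `df_demo` frame, the column name `feature1_log`, and the choice of `np.log1p` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed data (assumption: stands in for a real 'feature1' column)
rng = np.random.default_rng(0)
df_demo = pd.DataFrame({"feature1": rng.lognormal(mean=0.0, sigma=1.0, size=1000)})

# Skewness near 0 suggests symmetry; large positive values indicate a long right tail
skew_before = df_demo["feature1"].skew()

# log1p compresses the right tail; it requires values greater than -1
df_demo["feature1_log"] = np.log1p(df_demo["feature1"])
skew_after = df_demo["feature1_log"].skew()

# 1.5 x IQR rule: points beyond the fences are flagged as potential outliers
q1, q3 = df_demo["feature1"].quantile([0.25, 0.75])
iqr = q3 - q1
outlier_mask = (df_demo["feature1"] < q1 - 1.5 * iqr) | (df_demo["feature1"] > q3 + 1.5 * iqr)

print(f"Skewness before: {skew_before:.2f}, after log1p: {skew_after:.2f}")
print(f"Potential outliers: {outlier_mask.sum()} of {len(df_demo)}")
```

Whether a flagged point is an error or a legitimate extreme value is a judgment call that depends on the dataset, so the mask is best treated as a list of rows to inspect rather than rows to delete.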

Correlation & Relationships

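Correlation matrices and scatter plots reveal relationships between features. High correlation between two features suggests multicollinearity, while a feature's correlation with the target hints at predictive power. The default Pearson coefficient captures only linear relationships, so a low value does not rule out a nonlinear one.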

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

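# Correlation matrix (numeric columns only; recent pandas versions raise an
# error if non-numeric columns are passed to corr())
corr_matrix = df.select_dtypes(include='number').corr()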

# Heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0)
plt.title('Feature Correlation Matrix')
plt.show()

# Scatter plot: feature vs target
plt.figure(figsize=(10, 6))
plt.scatter(df['feature1'], df['target'], alpha=0.5)
plt.xlabel('feature1')
plt.ylabel('target')
plt.title('Feature vs Target Relationship')
plt.show()

# Identify highly correlated features (multicollinearity)
high_corr_pairs = []
for i in range(len(corr_matrix.columns)):
    for j in range(i+1, len(corr_matrix.columns)):
        if abs(corr_matrix.iloc[i, j]) > 0.9:
            high_corr_pairs.append((corr_matrix.columns[i], corr_matrix.columns[j], corr_matrix.iloc[i, j]))

print("Highly correlated feature pairs:")
for feat1, feat2, corr in high_corr_pairs:
    print(f"{feat1} - {feat2}: {corr:.3f}")
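
To act on the point about correlation with the target, features can be ranked by the absolute value of that correlation. This is a minimal sketch on synthetic data; the `demo` frame and its column names are assumptions for illustration, and the ranking reflects linear association only.

```python
import numpy as np
import pandas as pd

# Synthetic data (assumption): 'target' depends on feature1 but not on feature2
rng = np.random.default_rng(7)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
demo = pd.DataFrame({
    "feature1": x1,
    "feature2": x2,
    "target": 3 * x1 + rng.normal(scale=0.5, size=n),
})

# Take the 'target' column of the correlation matrix, drop the self-correlation,
# and sort by absolute value so strong negative correlations rank high too
target_corr = (
    demo.corr()["target"]
    .drop("target")
    .abs()
    .sort_values(ascending=False)
)
print(target_corr)
```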

💡 Tip: Look for feature pairs with an absolute correlation above 0.9 (that is, above 0.9 or below -0.9), a common rule-of-thumb sign of multicollinearity. Removing one feature from each such pair reduces redundancy.
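
One way to apply this tip is sketched below. The helper `drop_highly_correlated` is hypothetical (it is not a pandas function), and the choice to keep the first column of each pair and drop the later one is an assumption; in practice you might keep whichever feature is cheaper to collect or easier to interpret.

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(frame, threshold=0.9):
    """Drop the later column of every pair whose |correlation| exceeds threshold."""
    corr = frame.select_dtypes(include="number").corr().abs()
    # Keep only the strict upper triangle so each pair is examined once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return frame.drop(columns=to_drop), to_drop

# Synthetic data (assumption): 'b' is almost a copy of 'a', 'c' is independent
rng = np.random.default_rng(42)
a = rng.normal(size=500)
demo = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.01, size=500),
    "c": rng.normal(size=500),
})

reduced, dropped = drop_highly_correlated(demo, threshold=0.9)
print(f"Dropped: {dropped}")
print(f"Remaining: {list(reduced.columns)}")
```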

❓ What is the primary purpose of EDA?

A) To train the model
B) To understand data patterns and inform preprocessing decisions
C) To evaluate model performance
D) To deploy the model

❓ Which absolute correlation threshold does this lesson use to flag multicollinearity?

A) > 0.5
B) > 0.7
C) > 0.9
D) > 0.95
