Exploratory Data Analysis (EDA)

What is EDA?

Exploratory Data Analysis is the detective work of data science. Before building models, you analyze data to:

Understand distributions and relationships

Find patterns and anomalies

Identify which features matter most

Generate hypotheses for modeling

The EDA Workflow

1. Load data → 2. Understand structure → 3. Univariate analysis
     ↓
4. Bivariate analysis → 5. Multivariate analysis → 6. Hypotheses
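
Steps 1 and 2 of the workflow have no code of their own in this section, so here is a minimal sketch of what they might look like. The small in-memory DataFrame and its column names are made up for illustration; a real dataset would more likely be loaded from a file with something like pd.read_csv.

```python
import pandas as pd

# Hypothetical toy data standing in for a real dataset
df = pd.DataFrame({
    "age": [25, 32, 47, None, 51],
    "salary": [40000, 52000, 88000, 61000, 95000],
    "country": ["US", "DE", "US", "IN", "DE"],
})

# Step 2: understand structure
n_rows, n_cols = df.shape      # number of rows and columns
dtypes = df.dtypes             # data type of each column
missing = df.isna().sum()      # count of missing values per column

print(n_rows, n_cols)
print(dtypes)
print(missing)
print(df.head())               # first few rows
```

Checking shape, types, and missing values first helps decide which columns to treat as numeric or categorical in the analyses that follow.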

Univariate Analysis (One Variable)

Numeric Features

import pandas as pd
import matplotlib.pyplot as plt
# Assumes df is a DataFrame you have already loaded (for example with pd.read_csv)
# Statistical summary
print(df['age'].describe())  # Count, mean, std, min, quartiles, max
# Visualize distribution
df['age'].hist(bins=30)
plt.title('Age Distribution')
plt.show()
# Skewness
print(df['age'].skew())  # Negative = left-skewed, Positive = right-skewed
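
To make the sign of the skewness value concrete, the sketch below uses two made-up series (the values are hypothetical, chosen only so that one has a long right tail and the other a long left tail). It also shows a common side effect of skew: the mean gets pulled toward the tail while the median does not.

```python
import pandas as pd

# Hypothetical values: mostly small, with one very large value (long right tail)
right_tailed = pd.Series([1, 2, 2, 3, 3, 3, 4, 50])
# Mirror image of the same values: long left tail
left_tailed = -right_tailed

print(right_tailed.skew())   # positive for a right-skewed sample
print(left_tailed.skew())    # negative for a left-skewed sample

# The mean is pulled toward the tail; the median stays near the bulk of the data
print(right_tailed.mean(), right_tailed.median())
```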

Categorical Features

# Value counts
print(df['country'].value_counts())
# Proportion
print(df['country'].value_counts(normalize=True))
# Visualize
df['country'].value_counts().plot(kind='bar')
plt.show()

Bivariate Analysis (Two Variables)

Numeric vs Numeric

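import seaborn as sns
# Correlation (Pearson by default)
correlation = df[['age', 'salary']].corr()
print(correlation)
# Scatter plot
plt.scatter(df['age'], df['salary'])
plt.xlabel('Age')
plt.ylabel('Salary')
plt.show()
# Heatmap of all numeric columns
# numeric_only=True skips text columns such as 'country', which would otherwise raise an error in pandas 2.x
sns.heatmap(df.corr(numeric_only=True), annot=True)
plt.show()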
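
As a sketch of how the correlation matrix can be read, the example below builds a toy DataFrame (the values are hypothetical), pulls out a single coefficient, and ranks column pairs by the strength of their correlation. The matrix is symmetric with 1.0 on the diagonal, so only one entry per pair is kept.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({
    "age": [23, 31, 38, 45, 52, 60],
    "salary": [38000, 50000, 61000, 70000, 82000, 90000],
    "experience": [1, 7, 12, 20, 27, 35],
})

corr = df.corr()                 # Pearson correlation by default
r = corr.loc["age", "salary"]    # a single coefficient from the matrix
print(corr.round(2))
print(f"age vs salary: r = {r:.2f}")

# Flatten the matrix and keep each unordered pair once (drops the diagonal too)
pairs = corr.abs().unstack()
first = pairs.index.get_level_values(0)
second = pairs.index.get_level_values(1)
pairs = pairs[first < second]
print(pairs.sort_values(ascending=False))
```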

Categorical vs Numeric

# Box plot
df.boxplot(column='salary', by='country')
plt.show()
# Grouped statistics
print(df.groupby('country')['salary'].agg(['mean', 'median', 'std']))

Categorical vs Categorical

from scipy.stats import chi2_contingency
# Cross-tabulation
crosstab = pd.crosstab(df['country'], df['department'])
print(crosstab)
# Chi-square test (are they independent?)
chi2, p, dof, expected = chi2_contingency(crosstab)
print(f"P-value: {p}")  # A low p-value (e.g. below 0.05) is evidence against independence
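
The following sketch runs the same test on made-up data in which department clearly depends on country, so the outputs are easier to interpret. The counts are hypothetical. The expected array holds the counts you would see if the two variables were independent; a common rule of thumb is that the test is more trustworthy when every expected count is at least 5.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical data: department depends strongly on country
df = pd.DataFrame({
    "country": ["US"] * 40 + ["DE"] * 40,
    "department": ["Sales"] * 30 + ["Eng"] * 10 + ["Sales"] * 10 + ["Eng"] * 30,
})

crosstab = pd.crosstab(df["country"], df["department"])
chi2, p, dof, expected = chi2_contingency(crosstab)

print(crosstab)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.4g}")
print(expected)  # counts expected under independence
print("min expected count:", expected.min())
```

Note that chi2_contingency applies a continuity correction by default for 2x2 tables, so the statistic may differ slightly from a hand calculation without it.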

Multivariate Analysis

# Pair plot (all listed numeric columns against each other)
sns.pairplot(df[['age', 'salary', 'experience']])
plt.show()
# 3D scatter
from mpl_toolkits.mplot3d import Axes3D  # Only required on Matplotlib older than 3.2
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(df['age'], df['salary'], df['experience'])
ax.set_xlabel('Age')
ax.set_ylabel('Salary')
ax.set_zlabel('Experience')
plt.show()

Key Takeaways

✓ EDA reveals patterns before modeling

✓ Use visualizations to communicate findings

✓ Test relationships statistically

✓ Generate data-driven hypotheses

---

Practice: Choose a dataset and create 5 different visualizations exploring different relationships.

Next: Statistical foundations—testing your hypotheses.