Exploratory Data Analysis (EDA)

Duration: 15 min

What is EDA?

Exploratory Data Analysis is the detective work of data science. Before building models, you analyze data to:

  • Understand distributions and relationships
  • Find patterns and anomalies
  • Identify which features matter most
  • Generate hypotheses for modeling

The EDA Workflow

1. Load data → 2. Understand structure → 3. Univariate analysis
     ↓
4. Bivariate analysis → 5. Multivariate analysis → 6. Hypotheses
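
Steps 1 and 2 of the workflow can be sketched as follows. The small DataFrame here is made-up illustration data (the column names mirror the ones used later in this lesson); with a real dataset you would load it with something like pd.read_csv instead.

```python
import pandas as pd

# Step 1: load data.
# Hypothetical sample data for illustration only; for real work,
# replace this with e.g. pd.read_csv("your_file.csv").
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29],
    "salary": [40000, 52000, 88000, 95000, 61000, 45000],
    "experience": [2, 7, 20, 25, 12, 4],
    "country": ["US", "UK", "US", "DE", "UK", "US"],
    "department": ["Sales", "IT", "IT", "Sales", "IT", "Sales"],
})

# Step 2: understand structure.
print(df.shape)    # (number of rows, number of columns)
print(df.dtypes)   # data type of each column
print(df.head())   # first few rows

# Count missing values per column before analyzing anything else.
missing = df.isna().sum()
print(missing)
```

Checking shape, types, and missing values first tells you which columns are numeric or categorical, which in turn decides which of the analyses below apply.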

Univariate Analysis (One Variable)

Numeric Features

import pandas as pd
import matplotlib.pyplot as plt

# Assumes df is an already loaded DataFrame with a numeric 'age' column

# Statistical summary: count, mean, std, min, quartiles, max
print(df['age'].describe())

# Visualize distribution
df['age'].hist(bins=30)
plt.title('Age Distribution')
plt.show()

# Skewness: negative = left-skewed, positive = right-skewed
print(df['age'].skew())
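
One common way to flag anomalies in a numeric feature is the 1.5 × IQR rule, the same rule most box plots use for their whiskers. The following is a minimal sketch on made-up ages; the threshold of 1.5 is a convention, not a requirement, and flagged values still need a human judgment call.

```python
import pandas as pd

# Made-up ages for illustration; 95 is the value we expect to be flagged.
ages = pd.Series([22, 25, 27, 29, 31, 33, 35, 38, 41, 95], name="age")

# Interquartile range: spread of the middle 50% of the data.
q1 = ages.quantile(0.25)
q3 = ages.quantile(0.75)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged as potential outliers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = ages[(ages < lower) | (ages > upper)]

print(f"Bounds: [{lower}, {upper}]")
print(outliers)
```

A single extreme value like this also pulls the skewness positive, which is one reason to look at the histogram and the skew together.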

Categorical Features

# Value counts
print(df['country'].value_counts())

# Proportion of each category
print(df['country'].value_counts(normalize=True))

# Visualize
df['country'].value_counts().plot(kind='bar')
plt.show()

Bivariate Analysis (Two Variables)

Numeric vs Numeric

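import seaborn as sns

# Correlation
correlation = df[['age', 'salary']].corr()
print(correlation)

# Scatter plot
plt.scatter(df['age'], df['salary'])
plt.xlabel('Age')
plt.ylabel('Salary')
plt.show()

# Heatmap of correlations between numeric columns
# (numeric_only=True skips text columns such as 'country'; requires pandas 1.5+)
sns.heatmap(df.corr(numeric_only=True), annot=True)
plt.show()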
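
By default, corr() computes the Pearson coefficient, which measures linear association only. The sketch below uses made-up data with a perfectly monotonic but non-linear relationship to show how the rank-based Spearman coefficient (method='spearman') can tell a different story. The column names x and y are placeholders.

```python
import pandas as pd

# Made-up data: y always increases with x, but not in a straight line.
data = pd.DataFrame({"x": range(1, 11)})
data["y"] = data["x"] ** 3

# Pearson (the default) measures how well a straight line fits.
pearson = data.corr(method="pearson").loc["x", "y"]

# Spearman works on ranks, so it measures any monotonic relationship.
spearman = data.corr(method="spearman").loc["x", "y"]

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

If the two coefficients differ noticeably on your own data, the scatter plot is usually the quickest way to see why.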

Categorical vs Numeric

# Box plot of salary for each country
df.boxplot(column='salary', by='country')
plt.show()

# Grouped statistics
print(df.groupby('country')['salary'].agg(['mean', 'median', 'std']))

Categorical vs Categorical

# Cross-tabulation
crosstab = pd.crosstab(df['country'], df['department'])
print(crosstab)

# Chi-square test (are they independent?)
from scipy.stats import chi2_contingency

chi2, p, dof, expected = chi2_contingency(crosstab)
print(f"P-value: {p}")  # A low p-value (e.g. below 0.05) suggests the variables are dependent
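
To make the chi-square test concrete, here is a small worked sketch on made-up counts where department clearly depends on country. The data and the 0.05 cutoff are illustrative assumptions. The check on expected counts reflects a common rule of thumb (each expected cell count of at least 5) rather than a hard requirement.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up data: 80 employees, with department strongly tied to country.
df = pd.DataFrame({
    "country": ["US"] * 40 + ["UK"] * 40,
    "department": ["Sales"] * 30 + ["IT"] * 10 + ["Sales"] * 10 + ["IT"] * 30,
})

# Observed counts for each country/department combination.
crosstab = pd.crosstab(df["country"], df["department"])
print(crosstab)

# expected holds the counts we would see if the two variables were independent.
chi2, p, dof, expected = chi2_contingency(crosstab)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p-value = {p:.6f}")

# Rule of thumb: the test is more trustworthy when all expected counts are >= 5.
print("Smallest expected count:", np.min(expected))

if p < 0.05:
    print("Evidence that country and department are not independent")
else:
    print("No evidence against independence")
```

Comparing crosstab with expected shows where the observed counts depart from independence, which is often more informative than the p-value alone.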

Multivariate Analysis

# Pair plot (each listed numeric column against every other)
sns.pairplot(df[['age', 'salary', 'experience']])
plt.show()

# 3D scatter
from mpl_toolkits.mplot3d import Axes3D  # only needed on older Matplotlib versions

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(df['age'], df['salary'], df['experience'])
ax.set_xlabel('Age')
ax.set_ylabel('Salary')
ax.set_zlabel('Experience')
plt.show()
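
Plots are not the only way to look at three variables at once. As a complementary sketch, a pivot table can summarize one numeric feature across two categorical ones. The data below is made up for illustration, and the choice of the mean as the aggregate is an assumption; the median may be a better choice for skewed salaries.

```python
import pandas as pd

# Made-up data for illustration only.
df = pd.DataFrame({
    "country": ["US", "US", "UK", "UK", "US", "UK"],
    "department": ["Sales", "IT", "Sales", "IT", "Sales", "IT"],
    "salary": [40000, 60000, 45000, 65000, 50000, 55000],
})

# Mean salary for every country/department combination.
pivot = pd.pivot_table(
    df,
    values="salary",
    index="country",
    columns="department",
    aggfunc="mean",
)
print(pivot)
```

Reading across a row compares departments within a country; reading down a column compares countries within a department.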

Key Takeaways

  ✓ EDA reveals patterns before modeling
  ✓ Use visualizations to communicate findings
  ✓ Test relationships statistically
  ✓ Generate data-driven hypotheses

---

Practice: Choose a dataset and create 5 different visualizations exploring different relationships.

Next: Statistical foundations—testing your hypotheses.