Exploratory Data Analysis (EDA)
Duration: 15 min
What is EDA?
Exploratory Data Analysis is the detective work of data science. Before building models, you analyze data to:
- Understand distributions and relationships
- Find patterns and anomalies
- Identify which features matter most
- Generate hypotheses for modeling
The EDA Workflow
1. Load data → 2. Understand structure → 3. Univariate analysis
↓
4. Bivariate analysis → 5. Multivariate analysis → 6. Hypotheses
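The first two steps of the workflow (load the data, then understand its structure) can be sketched as below. This is a minimal illustration, not a fixed recipe: the small in-memory DataFrame is invented for the example, with column names chosen to mirror those used later in this lesson. With a real dataset you would typically load a file instead, for example with pd.read_csv.

```python
import pandas as pd

# Hypothetical sample data standing in for a real dataset;
# in practice this would usually come from a file, e.g. via pd.read_csv(...).
df = pd.DataFrame({
    "age": [25, 32, 47, 51, None],          # one missing value on purpose
    "salary": [40000, 52000, 88000, 91000, 60000],
    "country": ["US", "DE", "US", "IN", "DE"],
})

# Step 2: understand the structure before analyzing anything
n_rows, n_cols = df.shape   # number of rows and columns
print(df.shape)
print(df.dtypes)            # data type of each column
print(df.head())            # first few rows

# Count missing values per column
missing = df.isna().sum()
print(missing)
```

Checking shape, dtypes, and missing values first tells you which columns are numeric or categorical, and therefore which of the analyses in the following sections apply to each.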
Univariate Analysis (One Variable)
Numeric Features
import pandas as pd
import matplotlib.pyplot as plt

# df is assumed to be a pandas DataFrame that has already been loaded

# Statistical summary
print(df['age'].describe())  # Count, mean, std, min, quartiles, max

# Visualize distribution
df['age'].hist(bins=30)
plt.title('Age Distribution')
plt.show()

# Skewness
print(df['age'].skew())  # Negative = left-skewed, Positive = right-skewed
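To make the sign of the skewness concrete, here is a small sketch on made-up numbers (the values are hypothetical, chosen so that one large observation creates a long right tail). It compares the mean and the median, which typically separate when a distribution is skewed.

```python
import pandas as pd

# Hypothetical right-skewed sample: most values are small, one is large
values = pd.Series([20, 22, 23, 25, 26, 28, 90])

mean = values.mean()
median = values.median()
skewness = values.skew()

print(f"mean={mean:.2f}, median={median:.2f}, skew={skewness:.2f}")

# The long right tail pulls the mean above the median,
# and skew() comes out positive for this sample.
```

When the mean and the median differ noticeably, the median is usually the safer summary of a "typical" value.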
Categorical Features
# Value counts
print(df['country'].value_counts())

# Proportion
print(df['country'].value_counts(normalize=True))

# Visualize
df['country'].value_counts().plot(kind='bar')
plt.show()
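One detail worth knowing when reading these counts: value_counts() leaves out missing values by default. The sketch below uses a made-up Series (hypothetical data) to show the difference that dropna=False makes.

```python
import pandas as pd

# Hypothetical categorical column with one missing entry
countries = pd.Series(["US", "US", "DE", "IN", "US", "DE", None])

counts = countries.value_counts()                    # missing values excluded
shares = countries.value_counts(normalize=True)      # proportions of non-missing rows
with_missing = countries.value_counts(dropna=False)  # missing values counted too

print(counts)
print(shares)
print(with_missing)
```

Here the proportions are computed over the six non-missing rows, so they can overstate a category's share of the full dataset when many values are missing.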
Bivariate Analysis (Two Variables)
Numeric vs Numeric
import seaborn as sns

# Correlation
correlation = df[['age', 'salary']].corr()
print(correlation)

# Scatter plot
plt.scatter(df['age'], df['salary'])
plt.xlabel('Age')
plt.ylabel('Salary')
plt.show()

# Heatmap (numeric_only=True skips text columns such as 'country')
sns.heatmap(df.corr(numeric_only=True), annot=True)
plt.show()
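The number that corr() reports by default is the Pearson correlation coefficient. As a rough sketch of what it measures, the example below computes it by hand on invented data (the age and salary values are hypothetical) and compares the result with pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical data: salary tends to rise with age
df = pd.DataFrame({
    "age": [25, 30, 35, 40, 45],
    "salary": [40000, 48000, 61000, 65000, 80000],
})

# Pearson correlation as computed by pandas
r_pandas = df["age"].corr(df["salary"])

# The same quantity by hand: the sum of products of deviations from the mean,
# divided by the square root of the product of the summed squared deviations
age_dev = df["age"] - df["age"].mean()
sal_dev = df["salary"] - df["salary"].mean()
r_manual = (age_dev * sal_dev).sum() / np.sqrt(
    (age_dev ** 2).sum() * (sal_dev ** 2).sum()
)

print(f"pandas: {r_pandas:.4f}, manual: {r_manual:.4f}")
```

A value close to 1 or -1 indicates a strong linear relationship. Pearson correlation only captures linear association, which is why the scatter plot is worth checking alongside the number.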
Categorical vs Numeric
# Box plot
df.boxplot(column='salary', by='country')
plt.show()

# Grouped statistics
print(df.groupby('country')['salary'].agg(['mean', 'median', 'std']))
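As an illustration of how to read the grouped statistics, the sketch below uses invented salaries (hypothetical data) in which one group contains a single very high value.

```python
import pandas as pd

# Hypothetical data: the US group contains one unusually high salary
df = pd.DataFrame({
    "country": ["US", "US", "US", "DE", "DE", "DE"],
    "salary": [50000, 60000, 130000, 55000, 60000, 65000],
})

summary = df.groupby("country")["salary"].agg(["mean", "median", "std"])
print(summary)

# Both groups share the same median, but the high US salary
# raises that group's mean and standard deviation.
```

A gap between a group's mean and median, together with a large standard deviation, is the numeric counterpart of the long whisker or outlier point a box plot would show for that group.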
Categorical vs Categorical
# Cross-tabulation
crosstab = pd.crosstab(df['country'], df['department'])
print(crosstab)

# Chi-square test (are they independent?)
from scipy.stats import chi2_contingency
chi2, p, dof, expected = chi2_contingency(crosstab)
print(f"P-value: {p}")  # A low p-value (e.g. below 0.05) suggests the variables are not independent
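To show where the expected counts in a chi-square test come from, here is a by-hand sketch of the arithmetic on a made-up table (the country and department values are hypothetical). It illustrates the idea rather than reproducing SciPy's implementation; in particular, chi2_contingency applies a continuity correction by default to 2x2 tables, which this sketch does not. The toy table is also far too small for the test to be reliable and is only meant to make the calculation visible.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 6 employees per country across three departments
df = pd.DataFrame({
    "country": ["US"] * 6 + ["DE"] * 6,
    "department": ["Eng", "Eng", "Eng", "Eng", "Sales", "HR",
                   "Eng", "Sales", "Sales", "Sales", "HR", "HR"],
})

observed = pd.crosstab(df["country"], df["department"])

row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
n = observed.to_numpy().sum()

# Expected count under independence: row total * column total / grand total
expected = pd.DataFrame(
    np.outer(row_totals, col_totals) / n,
    index=observed.index,
    columns=observed.columns,
)

# Chi-square statistic: sum over cells of (observed - expected)^2 / expected
chi2_stat = ((observed - expected) ** 2 / expected).to_numpy().sum()

# Degrees of freedom: (rows - 1) * (columns - 1)
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

print(observed)
print(expected)
print(f"chi2={chi2_stat:.4f}, dof={dof}")
```

The larger the gap between observed and expected counts, the larger the statistic and the smaller the resulting p-value.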
Multivariate Analysis
# Pair plot (all numeric columns against each other)
sns.pairplot(df[['age', 'salary', 'experience']])
plt.show()

# 3D scatter
from mpl_toolkits.mplot3d import Axes3D  # Only needed on older Matplotlib versions
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(df['age'], df['salary'], df['experience'])
ax.set_xlabel('Age')
ax.set_ylabel('Salary')
ax.set_zlabel('Experience')
plt.show()
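A numeric companion to the pair plot is the full correlation matrix of the numeric columns. The sketch below uses invented data (the values are hypothetical, and the column names follow the ones used above) and ranks the other features by their correlation with salary.

```python
import pandas as pd

# Hypothetical data with three numeric columns and one text column
df = pd.DataFrame({
    "age": [25, 30, 35, 40, 45, 50],
    "experience": [2, 6, 11, 15, 21, 26],
    "salary": [40000, 52000, 58000, 75000, 79000, 95000],
    "country": ["US", "DE", "US", "IN", "DE", "US"],
})

# Keep only numeric columns, then correlate every pair
numeric = df.select_dtypes(include="number")
corr = numeric.corr()
print(corr)

# Rank the remaining features by their correlation with salary
salary_corr = corr["salary"].drop("salary").sort_values(ascending=False)
print(salary_corr)
```

In this sample, age and experience are almost perfectly correlated with each other, so they carry largely overlapping information. Spotting this kind of overlap is one of the hypotheses EDA can hand over to the modeling stage.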
Key Takeaways
✓ EDA reveals patterns before modeling
✓ Use visualizations to communicate findings
✓ Test relationships statistically
✓ Generate data-driven hypotheses
---
Practice: Choose a dataset and create 5 different visualizations exploring different relationships.
Next: Statistical foundations—testing your hypotheses.