Statistical Foundations
Duration: 15 min
Probability Distributions
Understanding distributions is key to statistical analysis and machine learning.
Normal Distribution (Gaussian)
- Bell-shaped curve
- Mean = median = mode
- Used in many statistical tests
- Example: Height, IQ, measurement errors
Binomial Distribution
- Discrete outcomes (success/failure)
- Defined by n (trials) and p (probability)
- Example: Coin flips, click/no-click
Poisson Distribution
- Counts of events in fixed time
- λ (lambda) = expected count
- Example: Website visitors per hour, customer calls per day
import numpy as np
import matplotlib.pyplot as plt

# Generate normal distribution
data = np.random.normal(loc=100, scale=15, size=1000)
plt.hist(data, bins=50)
plt.title('Normal Distribution')
plt.show()
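The binomial and Poisson distributions described above can be sampled in much the same way as the normal one. The following is a minimal sketch, assuming NumPy's `default_rng` generator; the seed and the parameter values (n=10, p=0.3, λ=4) are illustrative choices, not values from this lesson.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # arbitrary seed, only for reproducibility

# Binomial: number of successes in n=10 trials, each succeeding with p=0.3
binomial_samples = rng.binomial(n=10, p=0.3, size=10_000)

# Poisson: event counts per interval with expected count lam=4
poisson_samples = rng.poisson(lam=4, size=10_000)

# Sample means should land near the theoretical means n*p = 3 and lam = 4
print("Binomial sample mean:", binomial_samples.mean())
print("Poisson sample mean:", poisson_samples.mean())
```

Both distributions produce non-negative integer counts, which is what distinguishes them from the continuous normal distribution.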
Hypothesis Testing
Null vs Alternative Hypothesis
- H0 (Null): No effect, no difference
- H1 (Alternative): There is an effect
P-values
- Probability of observing data at least as extreme as yours, assuming H0 is true
- p < 0.05 = conventionally significant (reject H0)
- p ≥ 0.05 = not significant (fail to reject H0)
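To make the p-value definition concrete, here is a small sketch that computes one exactly for a coin-flip experiment. The scenario (60 heads in 100 flips) and the one-sided alternative are assumptions chosen for illustration; only the Python standard library is used.

```python
from math import comb

n_flips = 100        # hypothetical experiment: 100 coin flips
observed_heads = 60  # hypothetical observed result

# Under H0 (fair coin), P(X = k) = C(n, k) / 2**n.
# One-sided p-value: probability of 60 or more heads if the coin is fair.
p_value = sum(comb(n_flips, k) for k in range(observed_heads, n_flips + 1)) / 2**n_flips

alpha = 0.05  # conventional significance threshold
print(f"p-value: {p_value:.4f}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```

The p-value here is a statement about the data under H0, not the probability that H0 itself is true.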
Common Tests
from scipy import stats

# T-test: Compare means of two groups
t_stat, p_value = stats.ttest_ind(group1, group2)

# Chi-square: Test independence (returns four values, not two)
chi2, p_value, dof, expected = stats.chi2_contingency(contingency_table)

# ANOVA: Compare multiple groups
f_stat, p_value = stats.f_oneway(group1, group2, group3)
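The calls above need actual data to run. The sketch below fills in synthetic inputs so the three tests can be tried end to end; the group means, sample sizes, and the 2x2 click table are all made-up values for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)  # arbitrary seed for reproducibility

# Hypothetical groups: group3 is drawn with a higher true mean than the others
group1 = rng.normal(loc=100, scale=15, size=200)
group2 = rng.normal(loc=100, scale=15, size=200)
group3 = rng.normal(loc=110, scale=15, size=200)

# T-test: compare the means of group1 and group3
t_stat, p_ttest = stats.ttest_ind(group1, group3)

# Chi-square: hypothetical table, rows = variant A/B, columns = click/no-click
contingency_table = np.array([[30, 70], [60, 40]])
chi2, p_chi2, dof, expected = stats.chi2_contingency(contingency_table)

# ANOVA: compare the means of all three groups at once
f_stat, p_anova = stats.f_oneway(group1, group2, group3)

print(f"t-test p = {p_ttest:.4g}")
print(f"chi-square p = {p_chi2:.4g} (dof = {dof})")
print(f"ANOVA p = {p_anova:.4g}")
```

Because the synthetic groups were generated with genuinely different means, all three tests should report small p-values here; with real data the outcome is, of course, not known in advance.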
Correlation vs Causation
- Correlation: Two variables move together
- Causation: One variable causes change in another
Example: Ice cream sales and drowning deaths are correlated (both rise in summer) but ice cream doesn't cause drowning.
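The ice cream example can be simulated to show how a shared cause produces correlation without causation. This is a sketch under invented assumptions: temperature drives both variables through made-up linear relationships, and the `residuals` helper is a hypothetical name introduced here.

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # arbitrary seed for reproducibility
n_days = 5000

# Temperature is the hidden common cause (all coefficients are invented)
temperature = rng.uniform(0, 35, size=n_days)
ice_cream_sales = 2.0 * temperature + rng.normal(0, 5, size=n_days)
drownings = 0.5 * temperature + rng.normal(0, 2, size=n_days)

# Raw correlation is strong even though neither variable affects the other
raw_corr = np.corrcoef(ice_cream_sales, drownings)[0, 1]

def residuals(y, x):
    """Remove the linear effect of x from y (hypothetical helper)."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return y - (slope * x + intercept)

# Correlation after removing temperature's effect from both variables
partial_corr = np.corrcoef(
    residuals(ice_cream_sales, temperature),
    residuals(drownings, temperature),
)[0, 1]

print(f"Raw correlation: {raw_corr:.2f}")
print(f"Correlation after controlling for temperature: {partial_corr:.2f}")
```

Once temperature is accounted for, the remaining correlation should be close to zero, which is what you would expect if ice cream sales have no direct effect on drownings.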
# Calculate correlation
correlation = df['var1'].corr(df['var2'])  # -1 to 1

# Pearson: Linear relationships
# Spearman: Monotonic relationships
correlation = df['var1'].corr(df['var2'], method='spearman')
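The difference between the two methods shows up on data that is monotonic but not linear. A minimal sketch, assuming a made-up DataFrame where `var2` grows exponentially with `var1`:

```python
import numpy as np
import pandas as pd

# Hypothetical data: var2 always increases with var1, but not in a straight line
x = np.linspace(0, 10, 50)
df = pd.DataFrame({"var1": x, "var2": np.exp(x)})

pearson = df["var1"].corr(df["var2"])                      # default method is Pearson
spearman = df["var1"].corr(df["var2"], method="spearman")  # rank-based

print(f"Pearson:  {pearson:.2f}")
print(f"Spearman: {spearman:.2f}")
```

Spearman works on ranks, so a perfectly monotonic relationship scores at (or extremely close to) 1, while Pearson is pulled down by the curvature.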
Key Takeaways
✓ Distributions describe data behavior
✓ Hypothesis tests assess whether observed differences could plausibly be due to chance
✓ Correlation ≠ Causation
---
Next: Data visualization principles.