Bayes' Theorem and Bayesian Statistics
Duration: 6 min
Bayes' Theorem
Formula
P(A|B) = P(B|A) × P(A) / P(B)
Where:
- P(A|B): Posterior probability (what we want to find)
- P(B|A): Likelihood (probability of evidence given hypothesis)
- P(A): Prior probability (initial belief)
- P(B): Evidence (total probability of observation)
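The evidence term P(B) is rarely given directly. When the hypothesis A either holds or does not, P(B) can be expanded with the law of total probability, a standard identity, written here with ¬A for "not A":

```latex
P(B) = P(B \mid A)\,P(A) + P(B \mid \neg A)\,P(\neg A)
\qquad\Longrightarrow\qquad
P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B \mid A)\,P(A) + P(B \mid \neg A)\,P(\neg A)}
```

This is the form used whenever a problem supplies a true positive rate and a false positive rate rather than P(B) itself.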
Intuition
- Start with prior belief P(A)
- Observe evidence B
- Update belief to posterior P(A|B)
- More evidence pointing the same way → a more confident posterior
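The update cycle above can be sketched as a small loop in which each posterior becomes the prior for the next observation. The function name `update`, the starting prior of 0.5, and the likelihoods 0.8 and 0.3 are made-up values for illustration only, and the sketch assumes the observations are independent given the hypothesis.

```python
def update(prior, p_evidence_if_true, p_evidence_if_false):
    """Return P(A|B) from P(A), P(B|A), and P(B|not A)."""
    numerator = p_evidence_if_true * prior
    # Total probability of the evidence under both possibilities.
    evidence = numerator + p_evidence_if_false * (1.0 - prior)
    return numerator / evidence


belief = 0.5  # illustrative prior P(A)
history = [belief]
for _ in range(3):
    # Each observation is 0.8 likely if A is true, 0.3 likely otherwise
    # (hypothetical numbers).
    belief = update(belief, 0.8, 0.3)
    history.append(belief)

print([round(b, 3) for b in history])
```

Because every observation here favors A, the belief rises with each step; evidence that favored "not A" would push it back down.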
Example: Medical Testing
Suppose:
- Disease prevalence: 1% (P(Disease) = 0.01)
- Test sensitivity (true positive rate): 95% (P(Positive|Disease) = 0.95)
- False positive rate: 10% (P(Positive|No Disease) = 0.10)
If the test is positive, what is the probability of having the disease?
P(Disease|Positive) = P(Positive|Disease) × P(Disease) / P(Positive)
= 0.95 × 0.01 / (0.95×0.01 + 0.10×0.99)
= 0.0095 / 0.1085
≈ 0.088 or 8.8%
Despite a positive test, there is only a ~9% chance of disease: with such low prevalence, false positives far outnumber true positives.
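One way to check this result is to count people instead of multiplying probabilities. The sketch below applies the same three rates to a hypothetical cohort of 10,000 people; the cohort size is an assumption chosen only so that the counts come out as whole numbers.

```python
population = 10_000          # hypothetical cohort size
prevalence = 0.01            # P(Disease)
sensitivity = 0.95           # P(Positive|Disease)
false_positive_rate = 0.10   # P(Positive|No Disease)

sick = population * prevalence
healthy = population - sick

true_positives = sick * sensitivity
false_positives = healthy * false_positive_rate

# Among everyone who tests positive, the fraction who are actually sick.
posterior = true_positives / (true_positives + false_positives)

print(f"true positives:  {true_positives:.0f}")
print(f"false positives: {false_positives:.0f}")
print(f"P(Disease|Positive) = {posterior:.4f}")
```

In this cohort roughly 95 sick people test positive against roughly 990 healthy people, which is why a positive result still leaves the disease unlikely.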
Bayesian vs Frequentist Statistics
Frequentist
- Probability = long-run frequency
- Parameters are fixed but unknown
- Confidence intervals describe the procedure: across repeated samples, 95% of the 95% intervals built this way contain the true value
- Example: p-value testing
Bayesian
- Probability = degree of belief
- Parameters are treated as random variables with probability distributions
- Incorporates prior knowledge
- Updates beliefs with data
Prior, Likelihood, Posterior
- Prior: Initial belief before data
- Likelihood: How likely data is under each hypothesis
- Posterior: Updated belief after observing data
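These three quantities can be shown together on a small discrete example. The sketch below assumes three candidate values for a coin's probability of heads, a uniform prior over them, and an observation of 7 heads in 10 flips; all of these numbers are hypothetical.

```python
from math import comb

hypotheses = [0.25, 0.50, 0.75]   # candidate values for P(heads), illustrative
prior = [1 / 3, 1 / 3, 1 / 3]     # equal initial belief in each hypothesis
heads, flips = 7, 10              # hypothetical observed data

# Likelihood: binomial probability of the data under each hypothesis.
likelihood = [
    comb(flips, heads) * h**heads * (1 - h) ** (flips - heads)
    for h in hypotheses
]

# Multiply prior by likelihood, then normalize so the posterior sums to 1.
unnormalized = [lik * p for lik, p in zip(likelihood, prior)]
evidence = sum(unnormalized)
posterior = [u / evidence for u in unnormalized]

for h, post in zip(hypotheses, posterior):
    print(f"P(bias={h} | data) = {post:.3f}")
```

The normalizing sum plays the role of P(B) in Bayes' theorem, and the hypothesis closest to the observed frequency of heads ends up with the most posterior weight.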
Updating Process
Posterior ∝ Likelihood × Prior
Conjugate Priors
- Prior and posterior belong to the same distribution family
- Gives a closed-form posterior, which keeps calculations tractable
- Examples:
  - Beta prior for binomial likelihood
  - Normal prior for the mean of a normal likelihood with known variance
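For the Beta-binomial pair, the conjugate update reduces to adding counts to the prior's two parameters. The sketch below assumes a Beta(2, 2) prior and data of 7 successes and 3 failures; the function name and the numbers are illustrative.

```python
def beta_binomial_update(alpha, beta, successes, failures):
    """Posterior Beta parameters after observing binomial data.

    A Beta(alpha, beta) prior combined with a binomial likelihood
    gives a Beta(alpha + successes, beta + failures) posterior.
    """
    return alpha + successes, beta + failures


# Hypothetical prior Beta(2, 2) and hypothetical data: 7 successes, 3 failures.
alpha_post, beta_post = beta_binomial_update(2, 2, successes=7, failures=3)
print(alpha_post, beta_post)  # 9 5
```

No integration or normalization is needed, which is the practical benefit of choosing a conjugate prior.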
Bayesian Inference
Point Estimation
- MAP (Maximum A Posteriori): The parameter value with the highest posterior density (the posterior mode)
- Posterior mean: The expected value of the parameter under the posterior distribution
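The two estimates generally differ when the posterior is skewed. As a minimal sketch, assume the posterior is a Beta(9, 5) distribution (a hypothetical choice), for which both the mode and the mean have standard closed forms.

```python
def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


def beta_mode(alpha, beta):
    """Mode of a Beta(alpha, beta) distribution (valid for alpha, beta > 1)."""
    if alpha <= 1 or beta <= 1:
        raise ValueError("mode formula requires alpha > 1 and beta > 1")
    return (alpha - 1) / (alpha + beta - 2)


alpha, beta = 9, 5  # hypothetical posterior parameters
map_estimate = beta_mode(alpha, beta)    # peak of the posterior density
posterior_mean = beta_mean(alpha, beta)  # average over the posterior

print(f"MAP estimate:   {map_estimate:.3f}")
print(f"Posterior mean: {posterior_mean:.3f}")
```

Here the MAP estimate is 8/12 and the posterior mean is 9/14, so the mean sits slightly below the mode because this posterior has a longer tail toward zero.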
Credible Intervals
- Bayesian equivalent of confidence intervals
- 95% credible interval: given the data and the prior, there is a 95% probability that the parameter lies in the range
- Direct probability interpretation
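A credible interval can be approximated by drawing samples from the posterior and reading off percentiles. The sketch below assumes a Beta(9, 5) posterior (a hypothetical choice) and computes an equal-tailed interval using only the standard library; the function name, number of draws, and seed are illustrative.

```python
import random


def beta_credible_interval(alpha, beta, level=0.95, draws=100_000, seed=0):
    """Approximate equal-tailed credible interval for a Beta posterior.

    Draws samples from Beta(alpha, beta), sorts them, and returns the
    values that cut off (1 - level) / 2 of the samples in each tail.
    """
    rng = random.Random(seed)
    samples = sorted(rng.betavariate(alpha, beta) for _ in range(draws))
    tail = (1.0 - level) / 2.0
    lower = samples[int(tail * draws)]
    upper = samples[int((1.0 - tail) * draws) - 1]
    return lower, upper


low, high = beta_credible_interval(9, 5)  # hypothetical posterior
print(f"95% credible interval: ({low:.3f}, {high:.3f})")
```

The equal-tailed interval is only one convention; a highest-density interval would be chosen differently and can be narrower for skewed posteriors.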
Applications in AI/ML
Naive Bayes Classifier
- Assumes features are conditionally independent given the class
- Fast and effective for text classification
- P(Class|Features) ∝ P(Features|Class) × P(Class)
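The proportionality above can be turned into a tiny text classifier. The sketch below is a minimal multinomial Naive Bayes with add-one (Laplace) smoothing, written from scratch rather than taken from any library; the training sentences, labels, and function names are made up for illustration.

```python
from collections import Counter, defaultdict
from math import log


def train(documents):
    """Count classes and per-class words from (words, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for words, label in documents:
        class_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return class_counts, word_counts, vocabulary


def predict(words, model):
    """Return the label maximizing log P(Class) + sum of log P(word|Class)."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        score = log(class_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in words:
            if word not in vocabulary:
                continue  # skip words never seen in training
            # Add-one smoothing avoids zero probabilities for unseen pairs.
            score += log(
                (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            )
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical toy training set.
training = [
    ("win cash prize now".split(), "spam"),
    ("cheap prize offer".split(), "spam"),
    ("meeting agenda for monday".split(), "ham"),
    ("lunch on monday".split(), "ham"),
]
model = train(training)
print(predict("cash prize offer".split(), model))  # spam
print(predict("monday meeting".split(), model))    # ham
```

Working in log space turns the product of per-word probabilities into a sum, which avoids numerical underflow on longer documents.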
Bayesian Networks
- Directed acyclic graphs of variables
- Encode conditional dependencies
- Used in reasoning and inference
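As a minimal sketch of inference in such a network, consider three binary variables where Rain and Sprinkler are independent parents of WetGrass. The structure and every probability below are hypothetical, and the query is answered by brute-force enumeration of the joint distribution, which is only practical for very small networks.

```python
from itertools import product

# Hypothetical network: Rain -> WetGrass <- Sprinkler
P_RAIN = 0.2
P_SPRINKLER = 0.1
P_WET = {  # P(WetGrass=True | Rain, Sprinkler)
    (True, True): 0.99,
    (True, False): 0.90,
    (False, True): 0.80,
    (False, False): 0.05,
}


def joint(rain, sprinkler, wet):
    """Joint probability, factored along the edges of the graph."""
    p = P_RAIN if rain else 1 - P_RAIN
    p *= P_SPRINKLER if sprinkler else 1 - P_SPRINKLER
    p_wet = P_WET[(rain, sprinkler)]
    p *= p_wet if wet else 1 - p_wet
    return p


def prob_rain_given_wet():
    """P(Rain=True | WetGrass=True), summing out Sprinkler."""
    numerator = sum(joint(True, s, True) for s in (True, False))
    evidence = sum(
        joint(r, s, True) for r, s in product((True, False), repeat=2)
    )
    return numerator / evidence


print(f"P(Rain | WetGrass) = {prob_rain_given_wet():.3f}")
```

The graph structure is what lets the joint distribution be written as a product of small conditional tables instead of one table over every combination of variables.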
Bayesian Optimization
- Searches a parameter space efficiently when each evaluation is expensive
- Uses a surrogate model and an acquisition function to choose the next point to evaluate
- Useful for hyperparameter tuning
❓ In Bayes' theorem, what is P(A)?