Resampling Methods: Bootstrap and Permutation Tests

Duration: 5 min

This module covers two resampling methods, the bootstrap and permutation tests, which support statistical inference in machine learning. They let you quantify how reliable a model or an estimate is, and make data-driven decisions, without relying solely on the distributional assumptions of traditional parametric tests.

Bootstrap Method

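The bootstrap is a resampling technique for estimating statistics on a population by repeatedly sampling a dataset with replacement. Each resample has the same size as the original dataset, and the statistic of interest is recomputed on every resample. The spread of those recomputed values can be used to estimate the bias, standard error, and confidence intervals of the statistic. This non-parametric approach is particularly useful when the underlying distribution is unknown or when no convenient formula exists for the statistic's standard error. With very small samples the resamples can only reuse the few values that were observed, so the resulting estimates should be treated with caution.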

import numpy as np
import matplotlib.pyplot as plt

# Original data
data = np.array([1, 2, 3, 4, 5])

# Number of bootstrap samples
n_bootstraps = 1000

# Seeded random generator so the results are reproducible
rng = np.random.default_rng(42)

# Bootstrap: resample with replacement and record the mean of each resample
bootstrap_means = []
for _ in range(n_bootstraps):
    resample = rng.choice(data, size=len(data), replace=True)
    bootstrap_means.append(np.mean(resample))

# Plotting the bootstrap distribution
plt.hist(bootstrap_means, bins=30, edgecolor='black')
plt.title('Bootstrap Distribution of Means')
plt.xlabel('Mean')
plt.ylabel('Frequency')
plt.show()

A histogram showing the distribution of bootstrap means.
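
The histogram shows the shape of the bootstrap distribution, but the bias, standard error, and confidence interval mentioned earlier can also be read off it numerically. The following is a minimal sketch, assuming the same toy data, a 95% percentile interval (one of several possible bootstrap intervals), and an arbitrary seed and resample count chosen only for illustration.

```python
import numpy as np

# Same toy data as the example above
data = np.array([1, 2, 3, 4, 5])

# Seeded generator so the run is reproducible (the seed value is arbitrary)
rng = np.random.default_rng(0)
n_bootstraps = 10_000  # illustrative choice, not a recommendation from the module

# Draw all resamples at once: each row is one bootstrap sample
resamples = rng.choice(data, size=(n_bootstraps, len(data)), replace=True)
bootstrap_means = resamples.mean(axis=1)

sample_mean = data.mean()

# Standard error: spread of the statistic across resamples
standard_error = bootstrap_means.std(ddof=1)

# Bias estimate: average bootstrap statistic minus the original statistic
bias = bootstrap_means.mean() - sample_mean

# 95% percentile interval: the middle 95% of the bootstrap distribution
ci_lower, ci_upper = np.percentile(bootstrap_means, [2.5, 97.5])

print(f"Sample mean:    {sample_mean:.3f}")
print(f"Standard error: {standard_error:.3f}")
print(f"Bias estimate:  {bias:.3f}")
print(f"95% percentile interval: [{ci_lower:.2f}, {ci_upper:.2f}]")
```

For the sample mean the bias estimate should be close to zero, since the mean is an unbiased estimator. The same pattern applies to other statistics, such as the median, by replacing the call to mean.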

Permutation Test

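A permutation test is a non-parametric test that assesses a null hypothesis by comparing the observed test statistic to a distribution of test statistics obtained by randomly permuting the labels of the data. If the null hypothesis holds and the groups do not differ, the labels are interchangeable, so the observed statistic should look like a typical value from that distribution. The p-value is the proportion of permuted statistics that are at least as extreme as the observed one. This method is useful for hypothesis testing when the assumptions of traditional parametric tests, such as normality, are violated.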

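import numpy as np

# Sample data
group1 = np.array([1, 2, 3, 4, 5])
group2 = np.array([2, 3, 4, 5, 6])

# Observed difference in means
observed_diff = np.mean(group1) - np.mean(group2)

# Pool the observations once; under the null hypothesis the labels are interchangeable
combined = np.concatenate([group1, group2])

# Seeded random generator so the results are reproducible
rng = np.random.default_rng(42)

# Permutation test
n_permutations = 1000
permutation_diffs = []
for _ in range(n_permutations):
    permuted = rng.permutation(combined)  # shuffled copy; combined is unchanged
    permuted_group1 = permuted[:len(group1)]
    permuted_group2 = permuted[len(group1):]
    permutation_diffs.append(np.mean(permuted_group1) - np.mean(permuted_group2))

# Two-sided p-value: share of permuted differences at least as extreme as the observed one.
# The +1 counts the observed labeling itself, so the estimate is never exactly 0.
n_extreme = np.sum(np.abs(permutation_diffs) >= np.abs(observed_diff))
p_value = (n_extreme + 1) / (n_permutations + 1)
print(f'P-value: {p_value}')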
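
With groups this small, random shuffling is not strictly necessary: every possible relabeling can be enumerated, which gives an exact p-value with no Monte Carlo noise. The sketch below is one way to do that using only the standard library, assuming the same two groups and the same two-sided difference-in-means statistic. The variable names are illustrative.

```python
from itertools import combinations
from statistics import mean

group1 = [1, 2, 3, 4, 5]
group2 = [2, 3, 4, 5, 6]

combined = group1 + group2
n1 = len(group1)
observed_diff = mean(group1) - mean(group2)

count_extreme = 0
n_splits = 0

# Enumerate every way of choosing which positions form the first group
for idx in combinations(range(len(combined)), n1):
    chosen = set(idx)
    perm_group1 = [combined[i] for i in chosen]
    perm_group2 = [combined[i] for i in range(len(combined)) if i not in chosen]
    diff = mean(perm_group1) - mean(perm_group2)
    # Small tolerance so floating-point rounding cannot break exact ties
    if abs(diff) >= abs(observed_diff) - 1e-12:
        count_extreme += 1
    n_splits += 1

exact_p_value = count_extreme / n_splits
print(f"Relabelings enumerated: {n_splits}")
print(f"Exact two-sided p-value: {exact_p_value:.4f}")
```

Enumeration is only practical for small samples, because the number of relabelings grows very quickly with the group sizes. Beyond that, random permutations as in the example above are the usual choice.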

💡 Tip: Use enough bootstrap or permutation samples to get a stable estimate of the statistic or p-value. The examples here use 1,000; if the result changes noticeably from one run to the next, increase the number.
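
One way to check stability is to repeat the whole procedure several times and see how much the estimate moves between runs. The sketch below does this for the bootstrap standard error of the mean. The helper name bootstrap_se, the two resample counts, and the number of repeats are assumptions made for illustration.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5])
rng = np.random.default_rng(0)  # arbitrary seed for reproducibility


def bootstrap_se(n_bootstraps):
    """Bootstrap standard error of the mean from n_bootstraps resamples."""
    resamples = rng.choice(data, size=(n_bootstraps, len(data)), replace=True)
    return resamples.mean(axis=1).std(ddof=1)


# Repeat the procedure for each setting and measure the run-to-run variation
n_repeats = 30
spread = {}
for n_bootstraps in (50, 5000):
    estimates = [bootstrap_se(n_bootstraps) for _ in range(n_repeats)]
    spread[n_bootstraps] = np.std(estimates)
    print(f"{n_bootstraps:>5} resamples: run-to-run std of SE estimate = "
          f"{spread[n_bootstraps]:.4f}")
```

The run-to-run variation should shrink as the number of resamples grows. Note that more resamples only reduce this simulation noise; they do not make up for a small or unrepresentative original sample.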

❓ What is the primary purpose of the Bootstrap method?

To perform hypothesis testing
To estimate statistics on a population by sampling a dataset with replacement
To visualize data distributions
To perform feature selection

❓ What does a Permutation Test help to assess?

The variance of a dataset
The null hypothesis by comparing the observed test statistic to a distribution of test statistics obtained by randomly permuting the labels of the data
The correlation between two variables
The accuracy of a machine learning model

Key Concepts

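Concept	Description
Resampling	Repeatedly drawing new samples from the observed data: with replacement for the bootstrap, by shuffling labels for permutation tests
Confidence	Confidence intervals for a statistic, estimated from its bootstrap distribution
Distribution	The bootstrap distribution of a statistic, or the permutation distribution of a test statistic under the null hypothesis
Estimation	Using the bootstrap to estimate the bias and standard error of a statistic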

Check Your Understanding

❓ How does Resampling handle edge cases?

Ignores them
Applies regularization
Removes them
Duplicates them

❓ What is the computational complexity of Resampling?

O(n)
O(n²)
O(log n)
Depends on implementation

❓ Which hyperparameter is most critical for Resampling?

Learning rate
Batch size
Epochs
All equally important
