Calculus & Gradient Descent: How Models Learn
Why This Matters
Every time you train a neural network, you're solving an optimization problem: find weights that minimize loss. Gradient descent is the algorithm that does this. Understanding the calculus behind it explains why learning rates matter, why gradients vanish, and why GPUs need so much VRAM.
The Core Problem: Optimization
Loss Function
A neural network learns by minimizing a loss function $L(\theta)$, where $\theta$ are the weights:
$$L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \ell(y_i, \hat{y}_i(\theta))$$
For binary classification, this is typically binary cross-entropy loss:
$$L(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1-y_i) \log(1-\hat{y}_i) \right]$$
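As a concrete check of the cross-entropy formula, here is a minimal sketch in plain Python. The helper name `binary_cross_entropy`, the clipping constant `eps`, and the sample labels and predictions are illustrative assumptions, not part of the original text.

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy over N samples (illustrative helper)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        # Clip predictions away from 0 and 1 so log() stays finite
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

# Made-up labels and predicted probabilities
labels = [1, 0, 1, 1]
confident = [0.9, 0.1, 0.8, 0.95]   # mostly right
uncertain = [0.5, 0.5, 0.5, 0.5]    # always 50/50

loss_confident = binary_cross_entropy(labels, confident)
loss_uncertain = binary_cross_entropy(labels, uncertain)
print(f"confident: {loss_confident:.4f}, uncertain: {loss_uncertain:.4f}")
```

A model that always predicts 0.5 scores $\ln 2 \approx 0.693$ on any labels, which makes a handy baseline: a loss above that is worse than guessing.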
The Goal
Find $\theta^*$ that minimizes $L(\theta)$:
$$\theta^* = \arg\min_{\theta} L(\theta)$$
Derivatives: The Slope of Loss
Intuition
The derivative $\frac{dL}{d\theta}$ tells us the slope of the loss function at the current weights. If the slope is negative, increasing $\theta$ reduces the loss. If it is positive, decreasing $\theta$ reduces the loss. Either way, moving against the sign of the slope moves downhill.
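A quick numeric check of how the sign of the slope relates to the direction that lowers the loss. The one-parameter loss $L(\theta) = (\theta - 2)^2$ and the step size are assumptions chosen purely for illustration.

```python
def loss(theta):
    # Hypothetical one-parameter loss with its minimum at theta = 2
    return (theta - 2.0) ** 2

def slope(theta):
    # Analytic derivative of the loss above
    return 2.0 * (theta - 2.0)

step = 0.1
for theta in (0.0, 5.0):
    s = slope(theta)
    # Move against the sign of the slope
    new_theta = theta - step * (1 if s > 0 else -1)
    print(f"theta={theta}: slope={s:+.1f}, loss {loss(theta):.2f} -> {loss(new_theta):.2f}")
```

At $\theta = 0$ the slope is negative and the loss falls when $\theta$ increases. At $\theta = 5$ the slope is positive and the loss falls when $\theta$ decreases.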
Mathematical Definition
$$\frac{dL}{d\theta} = \lim_{h \to 0} \frac{L(\theta + h) - L(\theta)}{h}$$
In Code: Numerical Gradient
```python
import numpy as np

def numerical_gradient(loss_fn, theta, epsilon=1e-5):
    """Compute gradient by central finite differences"""
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        theta_plus = theta.copy()
        theta_plus[i] += epsilon
        theta_minus = theta.copy()
        theta_minus[i] -= epsilon
        grad[i] = (loss_fn(theta_plus) - loss_fn(theta_minus)) / (2 * epsilon)
    return grad

# Example: minimize f(x) = x^2
def f(x):
    return np.sum(x**2)

theta = np.array([3.0, 4.0])
grad = numerical_gradient(f, theta)
print(f"Gradient at {theta}: {grad}")  # approximately [6. 8.] (correct: 2*x)
```
The Chain Rule: Backpropagation
The Problem with Numerical Gradients
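Computing gradients by finite differences is slow and only approximate. The central-difference version above needs two forward passes per parameter, so a network with 1 billion parameters would need 2 billion forward passes for a single gradient.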
The Solution: Automatic Differentiation
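Automatic differentiation applies the chain rule to each operation recorded in the forward pass, so every gradient comes out of one forward pass and one backward pass: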
$$\frac{dL}{d\theta} = \frac{dL}{dz} \cdot \frac{dz}{d\theta}$$
For a deep network:
$$\frac{dL}{d\theta_1} = \frac{dL}{dz_n} \cdot \frac{dz_n}{dz_{n-1}} \cdot \ldots \cdot \frac{dz_2}{dz_1} \cdot \frac{dz_1}{d\theta_1}$$
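The chain rule can be worked by hand for the smallest possible model. The sketch below assumes a one-weight model $\hat{y} = w \cdot x$ with squared error $L = (y - \hat{y})^2$, so $\frac{dL}{dw} = \frac{dL}{d\hat{y}} \cdot \frac{d\hat{y}}{dw} = -2(y - \hat{y}) \cdot x$. The function names and input values are made up for illustration, and a finite difference serves as an independent check.

```python
def forward(w, x, y):
    y_hat = w * x              # prediction
    return (y - y_hat) ** 2    # squared-error loss

def chain_rule_grad(w, x, y):
    y_hat = w * x
    dL_dyhat = -2.0 * (y - y_hat)   # outer derivative dL/dy_hat
    dyhat_dw = x                    # inner derivative dy_hat/dw
    return dL_dyhat * dyhat_dw

w, x, y = 0.5, 3.0, 2.0
analytic = chain_rule_grad(w, x, y)

# Central finite difference as an independent check
h = 1e-6
numeric = (forward(w + h, x, y) - forward(w - h, x, y)) / (2 * h)
print(f"chain rule: {analytic:.4f}, finite difference: {numeric:.4f}")
```

Note the minus sign: it comes from differentiating $(y - \hat{y})$ with respect to $\hat{y}$, and it is easy to drop by accident.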
In PyTorch: Autograd
```python
import torch

# Define a simple function of x
x = torch.tensor([2.0], requires_grad=True)
y = x**2 + 3*x + 1

# Compute gradient automatically
y.backward()
print(f"dy/dx at x=2: {x.grad}")  # tensor([7.]) (correct: 2*2 + 3)

# For a neural network
model = torch.nn.Linear(10, 1)
x = torch.randn(32, 10)
y_true = torch.randn(32, 1)

# Forward pass
y_pred = model(x)
loss = torch.nn.functional.mse_loss(y_pred, y_true)

# Backward pass (compute gradients)
loss.backward()

# Gradients are now in model.weight.grad and model.bias.grad
print(f"Weight gradient shape: {model.weight.grad.shape}")
```
Gradient Descent: The Update Rule
Algorithm
Starting with random weights $\theta_0$, repeatedly update:
$$\theta_{t+1} = \theta_t - \alpha \frac{dL}{d\theta_t}$$
where $\alpha$ is the learning rate.
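Before handing the update to an optimizer, it can help to see the rule written out by hand. This is a minimal sketch on the toy function $f(\theta) = \sum_i \theta_i^2$ used earlier. The starting point, learning rate, and step count are assumed values.

```python
def grad(theta):
    # Gradient of f(theta) = sum(theta_i^2) is 2 * theta_i
    return [2.0 * t for t in theta]

theta = [3.0, 4.0]   # starting point (arbitrary)
alpha = 0.1          # learning rate (assumed value)

for step in range(50):
    g = grad(theta)
    # theta_{t+1} = theta_t - alpha * gradient
    theta = [t - alpha * gi for t, gi in zip(theta, g)]

print(theta)  # both entries end up very close to 0, the minimizer
```

For this particular function each step multiplies every coordinate by $1 - 2\alpha = 0.8$, so the iterates shrink geometrically toward the minimum.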
In Code
```python
import torch
import torch.optim as optim

# Toy data (random, for illustration)
x = torch.randn(32, 10)
y_true = torch.randn(32, 1)

model = torch.nn.Linear(10, 1)
optimizer = optim.SGD(model.parameters(), lr=0.01)  # lr = learning rate

for epoch in range(100):
    # Forward pass
    y_pred = model(x)
    loss = torch.nn.functional.mse_loss(y_pred, y_true)

    # Backward pass
    optimizer.zero_grad()  # Clear old gradients
    loss.backward()        # Compute new gradients

    # Update weights
    optimizer.step()       # theta = theta - lr * grad

    if epoch % 10 == 0:
        print(f"Epoch {epoch}, Loss: {loss.item():.4f}")
```
Variants: Momentum & Adam
Problem: Gradient Descent Gets Stuck
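With mini-batches, each gradient is only a noisy estimate of the true gradient, and in narrow valleys of the loss surface it can point mostly across the valley rather than along it. Plain gradient descent then oscillates and converges slowly.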
Solution: Momentum
$$v_t = \beta v_{t-1} + (1-\beta) \frac{dL}{d\theta_t}$$
$$\theta_{t+1} = \theta_t - \alpha v_t$$
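Here $v_t$ is an exponential moving average of past gradients ($\beta$ is commonly around 0.9). Components that flip sign from step to step cancel out, while consistent directions build up, which smooths out noise.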
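The two momentum equations translate almost line for line into code. This is a minimal sketch on the toy loss $f(\theta) = \theta^2$. The values of `alpha`, `beta`, and the starting point are assumptions.

```python
def grad(theta):
    # Gradient of the toy loss f(theta) = theta^2
    return 2.0 * theta

theta = 3.0    # arbitrary starting point
v = 0.0        # velocity starts at zero
alpha = 0.1    # learning rate (assumed)
beta = 0.9     # momentum coefficient (assumed)

for t in range(200):
    v = beta * v + (1 - beta) * grad(theta)   # running average of gradients
    theta = theta - alpha * v                 # step along the averaged direction

print(f"theta after 200 steps: {theta:.6f}")
```

Conventions differ between libraries. PyTorch's `optim.SGD(momentum=...)`, for instance, by default accumulates `v = beta * v + grad` without the $(1-\beta)$ factor, so the same `lr` gives a larger effective step there.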
In Code: Adam Optimizer
```python
# Adam = Adaptive Moment Estimation
# Combines momentum with adaptive learning rates
# Reuses model, x, y_true, and optim from the previous example
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Same training loop as before
for epoch in range(100):
    y_pred = model(x)
    loss = torch.nn.functional.mse_loss(y_pred, y_true)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```
Why Adam is popular:
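- Often converges faster than plain SGD, especially early in training
- Less sensitive to the choice of learning rate, because each parameter's step is scaled by its own gradient history
- Works well out of the box for many problems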
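To show what "momentum plus adaptive learning rates" means in practice, here is a sketch of the standard Adam update for a single scalar parameter. The helper name `adam_step` and the toy loss $\theta^2$ are assumptions. `lr=0.1` is chosen for this toy problem and is larger than the `lr=0.001` used above, while `beta1`, `beta2`, and `eps` follow the commonly used defaults.

```python
import math

def adam_step(theta, g, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter (illustrative helper)."""
    m = beta1 * m + (1 - beta1) * g          # momentum: running mean of gradients
    v = beta2 * v + (1 - beta2) * g * g      # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)             # bias correction for the zero init
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = 3.0, 0.0, 0.0
for t in range(1, 501):            # t starts at 1 for the bias correction
    g = 2.0 * theta                # gradient of the toy loss theta^2
    theta, m, v = adam_step(theta, g, m, v, t)

print(f"theta after 500 steps: {theta:.4f}")
```

Because the step is divided by the running gradient magnitude, the very first update has size almost exactly `lr` whatever the scale of the gradient. That normalization is one reason Adam tends to be forgiving about the learning rate.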
GPU Memory: Why Gradients Are Expensive
The Memory Cost
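During training, PyTorch keeps the activations from the forward pass in memory because the backward pass needs them to compute gradients. For a batch of 32 samples through a 12-layer transformer:
- Forward activations: 32 × 12 × 768 × 768 ≈ 226 million values, or about 0.9 GB in fp32 (4 bytes per value), counting only one 768 × 768 tensor per layer per sample. Real blocks keep several intermediate tensors per layer, so the actual figure is a multiple of this
- Gradients: one value per weight, so the same size as the weights
- Optimizer state (Adam): 2x the weight size (momentum + variance)
- Total: activation memory that grows with batch size, plus roughly 4x the weight memory for weights, gradients, and Adam state
This is why:
- Larger batch sizes require more VRAM, since activation memory grows linearly with batch size
- Gradient checkpointing trades compute for memory by recomputing activations instead of storing them
- Quantization reduces memory, for example by about 4x when 32-bit floats are stored as 8-bit integers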
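A back-of-the-envelope sketch of this arithmetic, assuming fp32 storage (4 bytes per value) and reading 768 × 768 as sequence length × hidden size. The helper `tensor_bytes` and the 100-million parameter count are hypothetical, included only to show how the pieces add up.

```python
def tensor_bytes(num_values, bytes_per_value=4):
    """Memory for a tensor of fp32 values (4 bytes each by default)."""
    return num_values * bytes_per_value

# Shapes taken from the example in the text
batch, layers, seq_len, hidden = 32, 12, 768, 768
activation_values = batch * layers * seq_len * hidden
activation_gb = tensor_bytes(activation_values) / 1e9

# Hypothetical parameter count, used only to illustrate the per-weight costs
num_params = 100_000_000
weights_gb = tensor_bytes(num_params) / 1e9
grads_gb = weights_gb            # one gradient per weight
adam_state_gb = 2 * weights_gb   # momentum + variance per weight

print(f"activations: {activation_values:,} values = {activation_gb:.2f} GB")
print(f"weights + grads + Adam state: {weights_gb + grads_gb + adam_state_gb:.2f} GB")
```

Doubling `batch` doubles the activation figure but leaves the weight, gradient, and optimizer terms unchanged.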
Code: Gradient Checkpointing
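```python
import torch
from torch.utils.checkpoint import checkpoint

class TransformerBlock(torch.nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        # Placeholder sub-layers (assumed here; the original omits them)
        self.attention = torch.nn.Linear(dim, dim)
        self.ffn = torch.nn.Linear(dim, dim)

    def forward(self, x):
        # Without checkpointing: stores all activations
        # x = self.attention(x)
        # x = self.ffn(x)
        # With checkpointing: recomputes activations during backward
        x = checkpoint(self.attention, x, use_reentrant=False)
        x = checkpoint(self.ffn, x, use_reentrant=False)
        return x
```
Key Concepts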
| Concept | Formula | Meaning |
|---|---|---|
| Derivative | $\frac{dL}{d\theta}$ | Slope of loss w.r.t. weights |
| Chain Rule | $\frac{dL}{d\theta} = \frac{dL}{dz} \cdot \frac{dz}{d\theta}$ | Backpropagation |
| Gradient Descent | $\theta_{t+1} = \theta_t - \alpha \nabla L$ | Weight update rule |
| Learning Rate | $\alpha$ | Step size (too high = diverge, too low = slow) |
| Momentum | $v_t = \beta v_{t-1} + (1-\beta) \nabla L$ | Smooth gradient updates |
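The "too high = diverge, too low = slow" entry is easy to see numerically. This sketch runs plain gradient descent on the toy loss $f(\theta) = \theta^2$ with three learning rates. The helper `run_gd` and the specific values are assumptions picked to show each regime.

```python
def run_gd(alpha, steps=20, theta0=1.0):
    """Gradient descent on f(theta) = theta^2; returns the final |theta|."""
    theta = theta0
    for _ in range(steps):
        theta = theta - alpha * 2.0 * theta   # gradient of theta^2 is 2 * theta
    return abs(theta)

for alpha in (0.001, 0.1, 1.1):
    print(f"alpha={alpha}: |theta| after 20 steps = {run_gd(alpha):.4f}")
```

Each step multiplies $\theta$ by $1 - 2\alpha$. With $\alpha = 0.001$ that factor is 0.998 and progress is slow, with $\alpha = 0.1$ it is 0.8 and the iterate shrinks quickly, and with $\alpha = 1.1$ it is $-1.2$, so the iterate flips sign and grows on every step.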
Quizzes
Quiz 1: Chain Rule
Question: If $L = (y - \hat{y})^2$ and $\hat{y} = w \cdot x$, what is $\frac{dL}{dw}$?
- A) $-2(y - \hat{y}) \cdot x$ ✓
- B) $2(y - \hat{y})$
- C) $x$
- D) $2(y - \hat{y})^2$
Quiz 2: Learning Rate
Question: If your learning rate is too high, what happens?
- A) Loss oscillates or diverges ✓
- B) Training is too slow
- C) Gradients become zero
- D) Model overfits
Quiz 3: Gradient Descent Update
Question: In the update $\theta_{t+1} = \theta_t - \alpha \nabla L$, why do we subtract the gradient?
- A) Gradients point uphill; we move downhill to minimize loss ✓
- B) Subtraction is faster than addition
- C) It prevents overfitting
- D) It's arbitrary; addition would work too