Calculus & Gradient Descent: How Models Learn

Why This Matters

Every time you train a neural network, you're solving an optimization problem: find weights that minimize loss. Gradient descent is the algorithm that does this. Understanding the calculus behind it explains why learning rates matter, why gradients vanish, and why GPUs need so much VRAM.

The Core Problem: Optimization

Loss Function

A neural network learns by minimizing a loss function $L(\theta)$, where $\theta$ are the weights:

$$L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \ell(y_i, \hat{y}_i(\theta))$$

For binary classification, this is typically the binary cross-entropy loss:

$$L(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1-y_i) \log(1-\hat{y}_i) \right]$$
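To make the cross-entropy formula concrete, here is a minimal pure-Python sketch. The function name `binary_cross_entropy`, the clipping constant `eps`, and the sample labels and predictions are illustrative assumptions, not part of the original lesson.

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy over N examples.

    eps is an assumed clipping constant that keeps log() away from 0.
    """
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip the prediction into (0, 1)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

labels = [1, 0, 1, 1]
confident = [0.9, 0.1, 0.8, 0.95]  # correct and confident
unsure = [0.6, 0.4, 0.5, 0.55]     # mostly on the right side of 0.5, but hesitant

loss_confident = binary_cross_entropy(labels, confident)
loss_unsure = binary_cross_entropy(labels, unsure)
print(f"confident: {loss_confident:.4f}, unsure: {loss_unsure:.4f}")
```

Confident, correct predictions should give the lower loss, which is the behavior the optimizer exploits when it adjusts the weights.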

The Goal

Find $\theta^*$ that minimizes $L(\theta)$:

$$\theta^* = \arg\min_{\theta} L(\theta)$$

Derivatives: The Slope of Loss

Intuition

The derivative $\frac{dL}{d\theta}$ tells us the slope of the loss function at the current weights. If the slope is negative, increasing $\theta$ reduces loss. If it is positive, decreasing $\theta$ reduces loss. Either way, moving against the sign of the slope goes downhill.

Mathematical Definition

$$\frac{dL}{d\theta} = \lim_{h \to 0} \frac{L(\theta + h) - L(\theta)}{h}$$
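The limit can be watched numerically. The sketch below assumes a toy loss $L(\theta) = \theta^2$ (chosen here for illustration) and evaluates the difference quotient for shrinking values of $h$; the helper name `forward_difference` is hypothetical.

```python
def loss(theta):
    return theta ** 2  # toy loss L(theta) = theta^2

def forward_difference(fn, theta, h):
    """Slope estimate (fn(theta + h) - fn(theta)) / h for a finite h."""
    return (fn(theta + h) - fn(theta)) / h

theta = 3.0
exact = 2 * theta  # analytic derivative of theta^2

estimates = {h: forward_difference(loss, theta, h) for h in (1.0, 0.1, 0.01, 0.001)}
for h, est in estimates.items():
    print(f"h={h:<6} estimate={est:.6f} error={abs(est - exact):.6f}")
```

For this particular loss the estimate works out to $2\theta + h$, so the error shrinks in step with $h$.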

In Code: Numerical Gradient

import numpy as np

def numerical_gradient(loss_fn, theta, epsilon=1e-5):
    """Compute gradient by finite differences"""
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        theta_plus = theta.copy()
        theta_plus[i] += epsilon
        theta_minus = theta.copy()
        theta_minus[i] -= epsilon
        grad[i] = (loss_fn(theta_plus) - loss_fn(theta_minus)) / (2 * epsilon)
    return grad

# Example: gradient of f(x) = sum(x^2)
def f(x):
    return np.sum(x**2)

theta = np.array([3.0, 4.0])
grad = numerical_gradient(f, theta)
print(f"Gradient at {theta}: {grad}")  # approximately [6. 8.] (exact: 2*x)

The Chain Rule: Backpropagation

The Problem with Numerical Gradients

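Computing gradients by finite differences is slow and only approximate. Each parameter needs its own perturbed forward pass (two with central differences), so a network with 1 billion parameters would need at least 1 billion forward passes for a single gradient.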
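The cost is easy to measure by counting loss evaluations. This sketch re-implements the central-difference gradient in plain Python with a call counter; the function name `numerical_gradient_with_count` and the toy quadratic loss are assumptions made for the illustration.

```python
def numerical_gradient_with_count(num_params, eps=1e-5):
    """Central-difference gradient of sum(theta^2), plus the number of loss calls."""
    calls = 0

    def loss_fn(theta):
        nonlocal calls
        calls += 1
        return sum(t * t for t in theta)

    theta = [0.5] * num_params
    grad = []
    for i in range(num_params):
        plus = theta.copy()
        plus[i] += eps
        minus = theta.copy()
        minus[i] -= eps
        grad.append((loss_fn(plus) - loss_fn(minus)) / (2 * eps))
    return grad, calls

for n in (10, 100, 1000):
    _, calls = numerical_gradient_with_count(n)
    print(f"{n} parameters -> {calls} loss evaluations")
```

The call count grows linearly with the number of parameters, and each call is itself a full forward pass.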

The Solution: Automatic Differentiation

The chain rule lets us compute gradients efficiently:

$$\frac{dL}{d\theta} = \frac{dL}{dz} \cdot \frac{dz}{d\theta}$$

For a deep network:

$$\frac{dL}{d\theta_1} = \frac{dL}{dz_n} \cdot \frac{dz_n}{dz_{n-1}} \cdot \ldots \cdot \frac{dz_2}{dz_1} \cdot \frac{dz_1}{d\theta_1}$$
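As a hand-worked instance of this product of local derivatives, the sketch below assumes a tiny three-step chain (a linear step, a tanh, and a squared error) and multiplies the local derivatives from the last step back to the weight. The chain, the values of `w`, `x`, `y`, and the function names are all illustrative choices; the result is cross-checked against a finite difference.

```python
import math

def forward(w, x, y):
    z1 = w * x            # linear step
    z2 = math.tanh(z1)    # nonlinearity
    loss = (z2 - y) ** 2  # squared error
    return z1, z2, loss

def backward(w, x, y):
    """dL/dw assembled from local derivatives, last step first."""
    z1, z2, _ = forward(w, x, y)
    dL_dz2 = 2 * (z2 - y)              # derivative of (z2 - y)^2 w.r.t. z2
    dz2_dz1 = 1 - math.tanh(z1) ** 2   # derivative of tanh(z1)
    dz1_dw = x                         # derivative of w * x w.r.t. w
    return dL_dz2 * dz2_dz1 * dz1_dw

w, x, y = 0.7, 1.5, 0.2
analytic = backward(w, x, y)

# Cross-check against a central finite difference
eps = 1e-6
numeric = (forward(w + eps, x, y)[2] - forward(w - eps, x, y)[2]) / (2 * eps)
print(f"chain rule: {analytic:.6f}, finite difference: {numeric:.6f}")
```

One forward pass and one backward sweep give the gradient, regardless of how many parameters feed into the chain.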

In PyTorch: Autograd

import torch

# Define a simple function of x
x = torch.tensor([2.0], requires_grad=True)
y = x**2 + 3*x + 1

# Compute gradient automatically
y.backward()
print(f"dy/dx at x=2: {x.grad.item()}")  # 7.0 (correct: 2*2 + 3)

# For a neural network
model = torch.nn.Linear(10, 1)
x = torch.randn(32, 10)
y_true = torch.randn(32, 1)

# Forward pass
y_pred = model(x)
loss = torch.nn.functional.mse_loss(y_pred, y_true)

# Backward pass (compute gradients)
loss.backward()

# Gradients are now in model.weight.grad and model.bias.grad
print(f"Weight gradient shape: {model.weight.grad.shape}")

Gradient Descent: The Update Rule

Algorithm

Starting with random weights $\theta_0$, repeatedly update:

$$\theta_{t+1} = \theta_t - \alpha \frac{dL}{d\theta_t}$$

where $\alpha$ is the learning rate.
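The effect of $\alpha$ shows up even on a one-parameter problem. The sketch below applies the update rule to an assumed toy loss $L(\theta) = (\theta - 3)^2$ with three learning rates; the function name and the specific rates are illustrative.

```python
def gradient_descent(lr, steps=50, theta=0.0):
    """Minimize L(theta) = (theta - 3)^2 with plain gradient descent."""
    for _ in range(steps):
        grad = 2 * (theta - 3)     # dL/dtheta
        theta = theta - lr * grad  # theta_{t+1} = theta_t - alpha * grad
    return theta

results = {lr: gradient_descent(lr) for lr in (0.01, 0.1, 1.1)}
for lr, theta in results.items():
    print(f"lr={lr}: theta after 50 steps = {theta:.4f}")
```

With the smallest rate the parameter is still well short of the minimum at 3 after 50 steps, the middle rate lands close to it, and the largest rate overshoots further on every step and diverges.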

In Code

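import torch
import torch.optim as optim

# Toy regression data: 32 samples, 10 features each
x = torch.randn(32, 10)
y_true = torch.randn(32, 1)

model = torch.nn.Linear(10, 1)
optimizer = optim.SGD(model.parameters(), lr=0.01)  # lr = learning rate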

for epoch in range(100):
    # Forward pass
    y_pred = model(x)
    loss = torch.nn.functional.mse_loss(y_pred, y_true)
    
    # Backward pass
    optimizer.zero_grad()  # Clear old gradients
    loss.backward()        # Compute new gradients
    
    # Update weights
    optimizer.step()       # theta = theta - lr * grad
    
    if epoch % 10 == 0:
        print(f"Epoch {epoch}, Loss: {loss.item():.4f}")

Variants: Momentum & Adam

Problem: Gradient Descent Gets Stuck

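Mini-batch gradients are noisy estimates of the true gradient, and in narrow valleys of the loss surface the steepest direction keeps flipping from one wall to the other. Plain gradient descent then oscillates and converges slowly.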

Solution: Momentum

$$v_t = \beta v_{t-1} + (1-\beta) \frac{dL}{d\theta_t}$$
$$\theta_{t+1} = \theta_t - \alpha v_t$$

This accumulates gradients over time, smoothing out noise.
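The smoothing effect can be seen on synthetic data. This sketch assumes a constant true gradient of 1.0 corrupted by Gaussian noise (a made-up setup, with a fixed seed for repeatability) and compares the spread of the raw samples with the spread of $v_t$.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable
beta = 0.9
true_grad = 1.0

raw, smoothed = [], []
v = 0.0
for _ in range(2000):
    g = true_grad + rng.gauss(0.0, 1.0)  # noisy gradient sample
    v = beta * v + (1 - beta) * g        # v_t = beta * v_{t-1} + (1 - beta) * g_t
    raw.append(g)
    smoothed.append(v)

# Skip the warm-up steps where v is still climbing from its zero start
raw_spread = statistics.stdev(raw[1000:])
smooth_spread = statistics.stdev(smoothed[1000:])
print(f"raw stdev: {raw_spread:.3f}, smoothed stdev: {smooth_spread:.3f}")
```

The averaged signal stays centered on the true gradient while fluctuating far less than the individual samples, which is what lets the optimizer take steadier steps.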

In Code: Adam Optimizer

# Adam = Adaptive Moment Estimation
# Combines momentum with adaptive learning rates
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Same training loop as before
for epoch in range(100):
    y_pred = model(x)
    loss = torch.nn.functional.mse_loss(y_pred, y_true)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

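Why Adam is popular:

Often converges in fewer steps than plain SGD
Less sensitive to the choice of learning rate, because each parameter's step is rescaled by its recent gradient magnitudes
Works well with default settings on many problems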
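To show what "adaptive learning rates" means in practice, here is a scalar sketch of the standard Adam update, written out by hand rather than taken from the lesson. The defaults mirror those documented for `torch.optim.Adam`; the function name `adam_step` and the two test gradients are illustrative assumptions.

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad       # running mean of gradients (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2  # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)             # bias correction for the zero start
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# Two gradients that differ by a factor of 1000
small_theta, _, _ = adam_step(1.0, grad=0.002, m=0.0, v=0.0, t=1)
large_theta, _, _ = adam_step(1.0, grad=2.0, m=0.0, v=0.0, t=1)
small_step = 1.0 - small_theta
large_step = 1.0 - large_theta
print(f"step with small gradient: {small_step:.6f}")
print(f"step with large gradient: {large_step:.6f}")
```

Both first steps come out at roughly `lr`, because dividing by the square root of the squared-gradient average cancels the gradient's scale. Plain SGD would take a step 1000 times larger for the second gradient.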

GPU Memory: Why Gradients Are Expensive

The Memory Cost

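PyTorch keeps activations from the forward pass in memory so the backward pass can use them to compute gradients. As a rough illustration, take a batch of 32 samples through a 12-layer transformer, counting one 768 × 768 activation tensor per sample per layer (real blocks store several tensors per layer, so treat this as a lower bound):

Forward activations: 32 × 12 × 768 × 768 ≈ 226 million values, or about 0.9 GB at 4 bytes per value (fp32)
Gradients: one value per weight, so the same size as the model's weights
Optimizer state (Adam): two more values per weight (momentum + variance)
Total: weights, gradients, and Adam state come to about 4x the weight memory, on top of the activations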

This is why:

Larger batch sizes require more VRAM
Gradient checkpointing trades compute for memory
Quantization shrinks weight memory, by about 4x when going from 32-bit floats to 8-bit integers
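Estimates like these are easy to sanity-check with a few lines of arithmetic. The sketch below assumes 4 bytes per value (fp32) and takes the activation count 32 × 12 × 768 × 768 at face value. The parameter count of 110 million is a hypothetical figure chosen only to illustrate the weights, gradients, and Adam state.

```python
BYTES_FP32 = 4  # assumed: 32-bit floats

def gigabytes(num_values, bytes_per_value=BYTES_FP32):
    """Memory in GB (10^9 bytes) for a given number of stored values."""
    return num_values * bytes_per_value / 1e9

# Activation count: batch x layers x 768 x 768
activation_values = 32 * 12 * 768 * 768
activation_gb = gigabytes(activation_values)

# Hypothetical parameter count, for illustration only
num_params = 110_000_000
weights_gb = gigabytes(num_params)
grads_gb = gigabytes(num_params)     # one gradient per weight
adam_gb = gigabytes(2 * num_params)  # first and second moment per weight
training_state_gb = weights_gb + grads_gb + adam_gb

print(f"activations: {activation_gb:.2f} GB")
print(f"weights + grads + Adam state: {training_state_gb:.2f} GB")
```

Only the activation term depends on batch size, which is why shrinking the batch or recomputing activations is the usual first response to an out-of-memory error.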

Code: Gradient Checkpointing

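import torch
from torch.utils.checkpoint import checkpoint

class TransformerBlock(torch.nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        # Stand-in sublayers so the example runs; a real block would use
        # self-attention and a feed-forward network here
        self.attention = torch.nn.Linear(dim, dim)
        self.ffn = torch.nn.Linear(dim, dim)

    def forward(self, x):
        # Without checkpointing: stores all activations
        # x = self.attention(x)
        # x = self.ffn(x)

        # With checkpointing: recomputes activations during backward
        x = checkpoint(self.attention, x, use_reentrant=False)
        x = checkpoint(self.ffn, x, use_reentrant=False)
        return x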

Key Concepts

Concept	Formula	Meaning
Derivative	$\frac{dL}{d\theta}$	Slope of loss w.r.t. weights
Chain Rule	$\frac{dL}{d\theta} = \frac{dL}{dz} \cdot \frac{dz}{d\theta}$	Backpropagation
Gradient Descent	$\theta_{t+1} = \theta_t - \alpha \nabla L$	Weight update rule
Learning Rate	$\alpha$	Step size (too high = diverge, too low = slow)
Momentum	$v_t = \beta v_{t-1} + (1-\beta) \nabla L$	Smooth gradient updates

Quizzes

Quiz 1: Chain Rule

Question: If $L = (y - \hat{y})^2$ and $\hat{y} = w \cdot x$, what is $\frac{dL}{dw}$?

A) $-2(y - \hat{y}) \cdot x$ ✓
B) $2(y - \hat{y})$
C) $x$
D) $2(y - \hat{y})^2$

Quiz 2: Learning Rate

Question: If your learning rate is too high, what happens?

A) Loss oscillates or diverges ✓
B) Training is too slow
C) Gradients become zero
D) Model overfits

Quiz 3: Gradient Descent Update

Question: In the update $\theta_{t+1} = \theta_t - \alpha \nabla L$, why do we subtract the gradient?

A) Gradients point uphill; we move downhill to minimize loss ✓
B) Subtraction is faster than addition
C) It prevents overfitting
D) It's arbitrary; addition would work too

Resources & References

PyTorch Autograd - Automatic differentiation
Optimization Algorithms - SGD, Adam, etc.
3Blue1Brown: Backpropagation - Visual explanation
Papers with Code: Optimizers - Implementations