Module 1 of 16 · Maths and Statistics in AI · Beginner

Linear Algebra in Code: Tensors & Vectorization

Why This Matters for AI Engineers

Most math courses teach linear algebra through determinant calculations and abstract proofs. But in production AI systems, linear algebra is about efficient computation on GPUs. A single dense layer in a transformer performs millions of matrix multiplications per second. Understanding the math behind vectorization is the difference between a model that runs in 2 seconds and one that takes 20 seconds.

Core Concept: From Loops to Matrices

The Problem: Python Loops Are Slow

import time
import numpy as np

# Representing 1000 tokens, each with 768 dimensions (like BERT embeddings)
tokens = np.random.randn(1000, 768)
weights = np.random.randn(768, 768)

# Pure Python loop (DON'T DO THIS)
def slow_matmul(A, B):
    result = [[0] * len(B[0]) for _ in range(len(A))]
    for i in range(len(A)):
        for j in range(len(B[0])):
            for k in range(len(B)):
                result[i][j] += A[i][k] * B[k][j]
    return result

start = time.time()
# This would take ~30 seconds for 1000x768 matrices
# slow_result = slow_matmul(tokens, weights)
print("Pure Python: ~30 seconds (skipped)")

The Solution: Vectorized Matrix Multiplication

# NumPy vectorized (FAST)
start = time.time()
result = np.matmul(tokens, weights)  # or tokens @ weights
elapsed = time.time() - start
print(f"NumPy: {elapsed:.4f} seconds")  # ~0.01 seconds

# PyTorch on GPU (EVEN FASTER)
import torch
tokens_gpu = torch.randn(1000, 768, device="cuda")
weights_gpu = torch.randn(768, 768, device="cuda")

start = time.time()
result_gpu = torch.matmul(tokens_gpu, weights_gpu)
elapsed = time.time() - start
print(f"PyTorch GPU: {elapsed:.4f} seconds")  # ~0.001 seconds

Why is vectorization 1000x faster?


The Math: Matrix Multiplication

Definition

For matrices $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times p}$, the product $C = AB$ is:

$$C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}$$

In Code

# Manual implementation (for understanding)
def matmul_manual(A, B):
    m, n = A.shape
    n, p = B.shape
    C = np.zeros((m, p))
    for i in range(m):
        for j in range(p):
            C[i, j] = np.dot(A[i, :], B[:, j])  # Dot product of row i and column j
    return C

# Verify it matches NumPy
A = np.random.randn(3, 4)
B = np.random.randn(4, 5)
assert np.allclose(matmul_manual(A, B), A @ B)

Application: Embeddings & Cosine Similarity

In LLMs, sentences are converted to high-dimensional vectors (embeddings). To find similar sentences, we compute cosine similarity:

$$\text{similarity}(u, v) = \frac{u \cdot v}{|u| |v|} = \frac{u^T v}{\sqrt{u^T u} \sqrt{v^T v}}$$

Code Example: RAG Retrieval

import torch
import torch.nn.functional as F

# Query embedding (768-dim, like from a BERT encoder)
query = torch.randn(768)

# Document embeddings (1000 documents, each 768-dim)
documents = torch.randn(1000, 768)

# Compute cosine similarity for all documents at once
# This is a single matrix multiplication!
similarities = F.cosine_similarity(query.unsqueeze(0), documents)

# Get top-5 most similar documents
top_5_indices = torch.topk(similarities, k=5).indices
print(f"Most similar documents: {top_5_indices}")

Why this matters:


Hardware Awareness: Memory vs Compute

Dense Matrix Multiplication is Compute-Bound

For a $1000 \times 1000$ matrix multiplication:

Result: GPUs can keep cores busy; memory bandwidth is not the bottleneck.

On Apple Silicon (MPS Backend)

import torch
import time

# Test on Apple Silicon GPU
device = "mps" if torch.backends.mps.is_available() else "cpu"

A = torch.randn(4096, 4096, device=device)
B = torch.randn(4096, 4096, device=device)

start = time.time()
C = torch.matmul(A, B)
elapsed = time.time() - start

print(f"4096x4096 matmul on {device}: {elapsed:.4f}s")
# MPS: ~0.05s | CPU: ~2s

Key Concepts

Concept Definition Use Case
Dot Product $u \cdot v = \sum u_i v_i$ Similarity, attention scores
Matrix Multiplication $C = AB$ where $C_{ij} = \sum A_{ik}B_{kj}$ Dense layers, embeddings
Transpose $A^T$ swaps rows and columns Reshaping for operations
Norm $|u| = \sqrt{\sum u_i^2}$ Normalization, similarity
Vectorization Batch operations instead of loops 1000x speedup on GPUs

Quizzes

Quiz 1: Matrix Dimensions

Question: If $A$ is $100 \times 50$ and $B$ is $50 \times 200$, what is the shape of $AB$?

Quiz 2: Vectorization Speedup

Question: Why is NumPy matrix multiplication ~1000x faster than Python loops?

Quiz 3: Cosine Similarity

Question: Cosine similarity between vectors $u$ and $v$ is computed as $\frac{u \cdot v}{|u| |v|}$. What does this measure?


Resources & References

Continue interactively → Next →

Related Courses