Linear Algebra in Code: Tensors & Vectorization
Why This Matters for AI Engineers
Most math courses teach linear algebra through determinant calculations and abstract proofs. But in production AI systems, linear algebra is about efficient computation on GPUs. A single dense layer in a transformer performs millions of matrix multiplications per second. Understanding the math behind vectorization is the difference between a model that runs in 2 seconds and one that takes 20 seconds.
Core Concept: From Loops to Matrices
The Problem: Python Loops Are Slow
import time
import numpy as np
# Representing 1000 tokens, each with 768 dimensions (like BERT embeddings)
tokens = np.random.randn(1000, 768)
weights = np.random.randn(768, 768)
# Pure Python loop (DON'T DO THIS)
def slow_matmul(A, B):
result = [[0] * len(B[0]) for _ in range(len(A))]
for i in range(len(A)):
for j in range(len(B[0])):
for k in range(len(B)):
result[i][j] += A[i][k] * B[k][j]
return result
start = time.time()
# This would take ~30 seconds for 1000x768 matrices
# slow_result = slow_matmul(tokens, weights)
print("Pure Python: ~30 seconds (skipped)")The Solution: Vectorized Matrix Multiplication
# NumPy vectorized (FAST)
start = time.time()
result = np.matmul(tokens, weights) # or tokens @ weights
elapsed = time.time() - start
print(f"NumPy: {elapsed:.4f} seconds") # ~0.01 seconds
# PyTorch on GPU (EVEN FASTER)
import torch
tokens_gpu = torch.randn(1000, 768, device="cuda")
weights_gpu = torch.randn(768, 768, device="cuda")
start = time.time()
result_gpu = torch.matmul(tokens_gpu, weights_gpu)
elapsed = time.time() - start
print(f"PyTorch GPU: {elapsed:.4f} seconds") # ~0.001 secondsWhy is vectorization 1000x faster?
- NumPy uses optimized C/Fortran libraries (BLAS, LAPACK)
- GPUs have thousands of cores designed for parallel matrix ops
- Memory bandwidth is maximized when operations are batched
The Math: Matrix Multiplication
Definition
For matrices $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times p}$, the product $C = AB$ is:
$$C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}$$
In Code
# Manual implementation (for understanding)
def matmul_manual(A, B):
m, n = A.shape
n, p = B.shape
C = np.zeros((m, p))
for i in range(m):
for j in range(p):
C[i, j] = np.dot(A[i, :], B[:, j]) # Dot product of row i and column j
return C
# Verify it matches NumPy
A = np.random.randn(3, 4)
B = np.random.randn(4, 5)
assert np.allclose(matmul_manual(A, B), A @ B)Application: Embeddings & Cosine Similarity
In LLMs, sentences are converted to high-dimensional vectors (embeddings). To find similar sentences, we compute cosine similarity:
$$\text{similarity}(u, v) = \frac{u \cdot v}{|u| |v|} = \frac{u^T v}{\sqrt{u^T u} \sqrt{v^T v}}$$
Code Example: RAG Retrieval
import torch
import torch.nn.functional as F
# Query embedding (768-dim, like from a BERT encoder)
query = torch.randn(768)
# Document embeddings (1000 documents, each 768-dim)
documents = torch.randn(1000, 768)
# Compute cosine similarity for all documents at once
# This is a single matrix multiplication!
similarities = F.cosine_similarity(query.unsqueeze(0), documents)
# Get top-5 most similar documents
top_5_indices = torch.topk(similarities, k=5).indices
print(f"Most similar documents: {top_5_indices}")Why this matters:
- Without vectorization: 1000 separate similarity calculations = slow
- With vectorization: 1 matrix operation = fast
- This is how RAG systems retrieve relevant documents in milliseconds
Hardware Awareness: Memory vs Compute
Dense Matrix Multiplication is Compute-Bound
For a $1000 \times 1000$ matrix multiplication:
- Operations: $1000^3 = 10^9$ multiplications
- Memory transfers: $3 \times 1000^2 = 3 \times 10^6$ elements
- Ratio: $\frac{10^9}{3 \times 10^6} = 333$ operations per memory transfer
Result: GPUs can keep cores busy; memory bandwidth is not the bottleneck.
On Apple Silicon (MPS Backend)
import torch
import time
# Test on Apple Silicon GPU
device = "mps" if torch.backends.mps.is_available() else "cpu"
A = torch.randn(4096, 4096, device=device)
B = torch.randn(4096, 4096, device=device)
start = time.time()
C = torch.matmul(A, B)
elapsed = time.time() - start
print(f"4096x4096 matmul on {device}: {elapsed:.4f}s")
# MPS: ~0.05s | CPU: ~2sKey Concepts
| Concept | Definition | Use Case |
|---|---|---|
| Dot Product | $u \cdot v = \sum u_i v_i$ | Similarity, attention scores |
| Matrix Multiplication | $C = AB$ where $C_{ij} = \sum A_{ik}B_{kj}$ | Dense layers, embeddings |
| Transpose | $A^T$ swaps rows and columns | Reshaping for operations |
| Norm | $|u| = \sqrt{\sum u_i^2}$ | Normalization, similarity |
| Vectorization | Batch operations instead of loops | 1000x speedup on GPUs |
Quizzes
Quiz 1: Matrix Dimensions
Question: If $A$ is $100 \times 50$ and $B$ is $50 \times 200$, what is the shape of $AB$?
- A) $100 \times 200$ ✓
- B) $50 \times 50$
- C) $200 \times 100$
- D) Not defined
Quiz 2: Vectorization Speedup
Question: Why is NumPy matrix multiplication ~1000x faster than Python loops?
- A) NumPy uses compiled C code and GPUs can parallelize ✓
- B) NumPy uses a better algorithm
- C) Python is just slow
- D) It's not actually faster
Quiz 3: Cosine Similarity
Question: Cosine similarity between vectors $u$ and $v$ is computed as $\frac{u \cdot v}{|u| |v|}$. What does this measure?
- A) The angle between vectors (normalized dot product) ✓
- B) The distance between vectors
- C) The magnitude of vector $u$
- D) The sum of all elements
Resources & References
- NumPy Linear Algebra - Official NumPy linalg module
- PyTorch Matrix Operations - GPU-accelerated operations
- 3Blue1Brown: Essence of Linear Algebra - Visual intuition
- Papers with Code: Embeddings - Real implementations