Linear Algebra in Code: Tensors & Vectorization

Why This Matters for AI Engineers

Most math courses teach linear algebra through determinant calculations and abstract proofs. But in production AI systems, linear algebra is about efficient computation on GPUs. A single dense layer in a transformer performs millions of matrix multiplications per second. Understanding the math behind vectorization is the difference between a model that runs in 2 seconds and one that takes 20 seconds.

Core Concept: From Loops to Matrices

The Problem: Python Loops Are Slow

import time
import numpy as np

# Representing 1000 tokens, each with 768 dimensions (like BERT embeddings)
tokens = np.random.randn(1000, 768)
weights = np.random.randn(768, 768)

# Pure Python loop (DON'T DO THIS)
def slow_matmul(A, B):
    result = [[0] * len(B[0]) for _ in range(len(A))]
    for i in range(len(A)):
        for j in range(len(B[0])):
            for k in range(len(B)):
                result[i][j] += A[i][k] * B[k][j]
    return result

start = time.time()
# This would take ~30 seconds for 1000x768 matrices
# slow_result = slow_matmul(tokens, weights)
print("Pure Python: ~30 seconds (skipped)")

The Solution: Vectorized Matrix Multiplication

# NumPy vectorized (FAST)
start = time.time()
result = np.matmul(tokens, weights)  # or tokens @ weights
elapsed = time.time() - start
print(f"NumPy: {elapsed:.4f} seconds")  # ~0.01 seconds

# PyTorch on GPU (EVEN FASTER)
import torch
tokens_gpu = torch.randn(1000, 768, device="cuda")
weights_gpu = torch.randn(768, 768, device="cuda")

start = time.time()
result_gpu = torch.matmul(tokens_gpu, weights_gpu)
elapsed = time.time() - start
print(f"PyTorch GPU: {elapsed:.4f} seconds")  # ~0.001 seconds

Why is vectorization 1000x faster?

NumPy uses optimized C/Fortran libraries (BLAS, LAPACK)
GPUs have thousands of cores designed for parallel matrix ops
Memory bandwidth is maximized when operations are batched

The Math: Matrix Multiplication

Definition

For matrices $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times p}$, the product $C = AB$ is:

$$C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}$$

In Code

# Manual implementation (for understanding)
def matmul_manual(A, B):
    m, n = A.shape
    n, p = B.shape
    C = np.zeros((m, p))
    for i in range(m):
        for j in range(p):
            C[i, j] = np.dot(A[i, :], B[:, j])  # Dot product of row i and column j
    return C

# Verify it matches NumPy
A = np.random.randn(3, 4)
B = np.random.randn(4, 5)
assert np.allclose(matmul_manual(A, B), A @ B)

Application: Embeddings & Cosine Similarity

In LLMs, sentences are converted to high-dimensional vectors (embeddings). To find similar sentences, we compute cosine similarity:

$$\text{similarity}(u, v) = \frac{u \cdot v}{|u| |v|} = \frac{u^T v}{\sqrt{u^T u} \sqrt{v^T v}}$$

Code Example: RAG Retrieval

import torch
import torch.nn.functional as F

# Query embedding (768-dim, like from a BERT encoder)
query = torch.randn(768)

# Document embeddings (1000 documents, each 768-dim)
documents = torch.randn(1000, 768)

# Compute cosine similarity for all documents at once
# This is a single matrix multiplication!
similarities = F.cosine_similarity(query.unsqueeze(0), documents)

# Get top-5 most similar documents
top_5_indices = torch.topk(similarities, k=5).indices
print(f"Most similar documents: {top_5_indices}")

Why this matters:

Without vectorization: 1000 separate similarity calculations = slow
With vectorization: 1 matrix operation = fast
This is how RAG systems retrieve relevant documents in milliseconds

Hardware Awareness: Memory vs Compute

Dense Matrix Multiplication is Compute-Bound

For a $1000 \times 1000$ matrix multiplication:

Operations: $1000^3 = 10^9$ multiplications
Memory transfers: $3 \times 1000^2 = 3 \times 10^6$ elements
Ratio: $\frac{10^9}{3 \times 10^6} = 333$ operations per memory transfer

Result: GPUs can keep cores busy; memory bandwidth is not the bottleneck.

On Apple Silicon (MPS Backend)

import torch
import time

# Test on Apple Silicon GPU
device = "mps" if torch.backends.mps.is_available() else "cpu"

A = torch.randn(4096, 4096, device=device)
B = torch.randn(4096, 4096, device=device)

start = time.time()
C = torch.matmul(A, B)
elapsed = time.time() - start

print(f"4096x4096 matmul on {device}: {elapsed:.4f}s")
# MPS: ~0.05s | CPU: ~2s

Key Concepts

Concept	Definition	Use Case
Dot Product	$u \cdot v = \sum u_i v_i$	Similarity, attention scores
Matrix Multiplication	$C = AB$ where $C_{ij} = \sum A_{ik}B_{kj}$	Dense layers, embeddings
Transpose	$A^T$ swaps rows and columns	Reshaping for operations
Norm	$\|u\| = \sqrt{\sum u_i^2}$	Normalization, similarity
Vectorization	Batch operations instead of loops	1000x speedup on GPUs

Quizzes

Quiz 1: Matrix Dimensions

Question: If $A$ is $100 \times 50$ and $B$ is $50 \times 200$, what is the shape of $AB$?

A) $100 \times 200$ ✓
B) $50 \times 50$
C) $200 \times 100$
D) Not defined

Quiz 2: Vectorization Speedup

Question: Why is NumPy matrix multiplication ~1000x faster than Python loops?

A) NumPy uses compiled C code and GPUs can parallelize ✓
B) NumPy uses a better algorithm
C) Python is just slow
D) It's not actually faster

Quiz 3: Cosine Similarity

Question: Cosine similarity between vectors $u$ and $v$ is computed as $\frac{u \cdot v}{|u| |v|}$. What does this measure?

A) The angle between vectors (normalized dot product) ✓
B) The distance between vectors
C) The magnitude of vector $u$
D) The sum of all elements

Resources & References

NumPy Linear Algebra - Official NumPy linalg module
PyTorch Matrix Operations - GPU-accelerated operations
3Blue1Brown: Essence of Linear Algebra - Visual intuition
Papers with Code: Embeddings - Real implementations