What is inference in AI?

Inference is the process of running a trained model on new inputs to produce outputs. The model's weights stay fixed, unlike in training. For LLMs, this means generating text from a prompt.
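
To make "generating text from a prompt" concrete, here is a minimal sketch of a greedy decoding loop. It is not any particular library's API: `toy_forward`, `generate`, and the five-word `VOCAB` are hypothetical stand-ins, and the "model" is a fixed rule rather than a trained network. What it does show is the shape of the loop: one call to the model per generated token.

```python
from typing import Callable, List, Tuple

# Hypothetical toy vocabulary; index 0 is the end-of-sequence marker.
VOCAB = ["<eos>", "the", "cat", "sat", "down"]


def toy_forward(token_ids: List[int]) -> List[float]:
    """Stand-in for a trained model's forward pass (an assumption, not a real model).

    Returns one score per vocabulary entry. This toy rule always favors the
    token whose id follows the last token's id, wrapping around to <eos>.
    """
    next_id = (token_ids[-1] + 1) % len(VOCAB)
    return [1.0 if i == next_id else 0.0 for i in range(len(VOCAB))]


def generate(
    forward: Callable[[List[int]], List[float]],
    prompt_ids: List[int],
    max_new_tokens: int,
    eos_id: int = 0,
) -> Tuple[List[int], int]:
    """Greedy decoding: repeatedly pick the highest-scoring next token."""
    token_ids = list(prompt_ids)
    forward_passes = 0
    for _ in range(max_new_tokens):
        scores = forward(token_ids)  # one forward pass per new token
        forward_passes += 1
        next_id = max(range(len(scores)), key=scores.__getitem__)
        if next_id == eos_id:
            break
        token_ids.append(next_id)
    return token_ids, forward_passes


ids, passes = generate(toy_forward, prompt_ids=[1], max_new_tokens=10)
print(" ".join(VOCAB[i] for i in ids))  # the cat sat down
print(passes)  # 4 (three generated tokens plus the pass that produced <eos>)
```

A real LLM would replace `toy_forward` with a neural network and would usually sample from the scores rather than always taking the maximum, but the token-by-token structure is the same.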

Why is LLM inference expensive?

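LLMs generate tokens one at a time, and each new token requires a full forward pass through the model, so compute and latency scale with output length. The KV cache, which stores the attention keys and values of earlier tokens so they do not have to be recomputed, grows linearly with sequence length and takes up accelerator memory.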
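
As a rough illustration of how KV cache memory grows, the sketch below estimates its size under a common assumption: the cache holds one key vector and one value vector per token, per layer, per KV head. The function name `kv_cache_bytes` and the model dimensions (32 layers, 32 KV heads, head dimension 128, 16-bit values) are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def kv_cache_bytes(
    seq_len: int,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,  # assumes 16-bit (fp16/bf16) cache entries
    batch_size: int = 1,
) -> int:
    """Estimate KV cache size in bytes for a decoder-only transformer.

    The leading factor of 2 accounts for storing both keys and values.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size


# Hypothetical model dimensions, not taken from any specific model card.
config = dict(num_layers=32, num_kv_heads=32, head_dim=128)

for seq_len in (1, 1024, 4096):
    size = kv_cache_bytes(seq_len, **config)
    print(f"{seq_len:>5} tokens: {size / 2**20:,.1f} MiB")
```

With these assumed dimensions each token costs 0.5 MiB of cache, so a 4096-token sequence needs about 2 GiB per sequence in the batch, on top of the model weights.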