What is Inference?


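Inference: the process of running a trained model to generate predictions or outputs. For large language models (LLMs), inference means generating text token by token from a prompt, with each new token conditioned on the prompt and on all tokens generated so far.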
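A minimal sketch of greedy token-by-token decoding follows. It assumes a hypothetical toy model (`toy_model`, backed by a made-up bigram table `BIGRAM_SCORES`) in place of a real LLM. A real model scores the whole context with a neural network and often samples instead of always taking the top token; the loop structure is what matters here.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical toy "model": next-token scores keyed by the previous token.
# A real LLM would run a forward pass over the entire context instead.
BIGRAM_SCORES: Dict[str, Dict[str, float]] = {
    "the": {"cat": 0.8, "mat": 0.2},
    "cat": {"sat": 0.7, "<eos>": 0.3},
    "sat": {"on": 0.9, "<eos>": 0.1},
    "on": {"a": 0.6, "the": 0.4},
    "a": {"mat": 0.9, "cat": 0.1},
    "mat": {"<eos>": 1.0},
}


def toy_model(context: List[str]) -> Dict[str, float]:
    """Return next-token scores for the context (this toy only looks at the last token)."""
    return BIGRAM_SCORES[context[-1]]


def greedy_generate(
    model: Callable[[List[str]], Dict[str, float]],
    prompt: List[str],
    max_new_tokens: int = 10,
) -> Tuple[List[str], int]:
    """Generate tokens one at a time; return the tokens and the number of model calls."""
    tokens = list(prompt)
    forward_passes = 0
    for _ in range(max_new_tokens):
        scores = model(tokens)  # one model call per generated token
        forward_passes += 1
        next_token = max(scores, key=scores.get)  # greedy: highest-scoring token
        if next_token == "<eos>":  # end-of-sequence token stops generation
            break
        tokens.append(next_token)
    return tokens, forward_passes


tokens, passes = greedy_generate(toy_model, ["the"])
print(" ".join(tokens), passes)  # the cat sat on a mat 6
```

Each iteration depends on the token chosen in the previous one, which is why the steps cannot simply run in parallel.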

FAQ

What is inference in AI?

Running a trained model to produce outputs. For LLMs, this means generating text from a prompt.

Why is LLM inference expensive?

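LLMs generate text autoregressively: tokens are produced one after another, and each new token requires a forward pass through every layer of the model. On top of that, the key-value (KV) cache, which stores the attention keys and values of earlier tokens so they are not recomputed at every step, grows linearly with sequence length and batch size and can take up a large share of GPU memory.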
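A rough way to size the KV cache is to count one key and one value vector per layer, per KV head, per token. The sketch below assumes standard attention with no cache compression or quantization. The configuration (32 layers, 32 KV heads, head dimension 128, 16-bit values) is an illustrative assumption, roughly in the range of a 7B-parameter model, not a specific model's published figures.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int = 1,
    bytes_per_elem: int = 2,  # 2 bytes assumes fp16/bf16 storage
) -> int:
    """Approximate KV cache size in bytes for standard multi-head attention."""
    # Factor 2: one key tensor and one value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


# Illustrative (assumed) configuration; memory grows linearly with seq_len.
for seq_len in (1024, 2048, 4096):
    gib = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=seq_len) / 2**30
    print(f"{seq_len:>5} tokens: {gib:.2f} GiB")
# Prints 0.50, 1.00 and 2.00 GiB for 1024, 2048 and 4096 tokens.
```

Under these assumptions each token adds 0.5 MiB, and doubling either the sequence length or the batch size doubles the cache. Models that share key/value heads across query heads (grouped-query attention) use a smaller `num_kv_heads` and therefore a smaller cache.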

