What is Inference?
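Inference: the process of running a trained model, with its weights held fixed, to generate predictions or outputs. In LLMs, inference means generating text token by token from a prompt, with each new token appended to the input before the next one is predicted.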
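The token-by-token loop can be sketched in a few lines of Python. This is a minimal illustration, not a real LLM: `toy_next_token_logits` is a hypothetical stand-in for a trained model's forward pass, and `greedy_generate` assumes greedy decoding (always pick the highest-scoring token), which is only one of several decoding strategies.

```python
from typing import Callable, List, Optional


def toy_next_token_logits(tokens: List[int], vocab_size: int = 5) -> List[float]:
    """Hypothetical stand-in for a trained model's forward pass.

    Gives the highest score to the token (last_token + 1) % vocab_size.
    """
    target = (tokens[-1] + 1) % vocab_size
    return [1.0 if i == target else 0.0 for i in range(vocab_size)]


def greedy_generate(
    prompt: List[int],
    forward: Callable[[List[int]], List[float]],
    max_new_tokens: int,
    eos_token: Optional[int] = None,
) -> List[int]:
    """Generate tokens one at a time, feeding each output back as input."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        logits = forward(tokens)  # one forward pass per generated token
        next_token = max(range(len(logits)), key=logits.__getitem__)
        tokens.append(next_token)
        if next_token == eos_token:  # stop early at end-of-sequence
            break
    return tokens


output = greedy_generate([0, 1], toy_next_token_logits, max_new_tokens=4)
print(output)  # [0, 1, 2, 3, 4, 0]
```

The loop calls the model once per generated token, which is why generation time grows with output length.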
FAQ
What is inference in AI?
Inference is running a trained model on new input to produce outputs. For LLMs, this means generating text from a prompt.
Why is LLM inference expensive?
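LLMs generate tokens one at a time, and each new token requires a forward pass through every layer of the model, so the steps cannot be run in parallel. A KV cache stores the attention keys and values of earlier tokens so they are not recomputed at every step, but the memory it needs grows linearly with sequence length.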
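To make the memory growth concrete, here is a rough back-of-the-envelope estimate of KV cache size. It assumes standard multi-head attention, where each layer stores one key vector and one value vector per head per token. The configuration (32 layers, 32 KV heads, head dimension 128, 2 bytes per value for fp16) is an illustrative assumption, not a figure for any specific model, and `kv_cache_bytes` is a hypothetical helper name.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int = 1,
    bytes_per_value: int = 2,  # assumes fp16/bf16 storage
) -> int:
    """Estimate KV cache size: keys + values for every layer, head, and token."""
    keys_and_values = 2
    return (
        keys_and_values
        * n_layers
        * n_kv_heads
        * head_dim
        * seq_len
        * batch_size
        * bytes_per_value
    )


# Illustrative configuration (an assumption, not a specific model).
for seq_len in (1024, 4096):
    size_gib = kv_cache_bytes(32, 32, 128, seq_len) / 2**30
    print(f"{seq_len} tokens: {size_gib:.1f} GiB")
# 1024 tokens: 0.5 GiB
# 4096 tokens: 2.0 GiB
```

Under these assumptions each token adds 0.5 MiB to the cache, so quadrupling the sequence length quadruples the memory, and the total scales again with the number of requests served in a batch.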