What is Tokenization?

NLP

Tokenization — The process of splitting text into tokens (subwords, words, or characters) that a model can process. Common tokenizers include BPE (GPT), WordPiece (BERT), and SentencePiece.

FAQ

What is tokenization?

Splitting text into tokens (subwords) that a model can process. The word unhappiness might become [un, happiness].

Why not just split by words?

Subword tokenization handles unknown words, multiple languages, and keeps vocabulary size manageable (32K-100K tokens).

Related Terms

Learn Tokenization in depth

Free hands-on course with code examples and Google Colab notebooks.

Start Course →