What is Tokenization?
NLP
Tokenization — The process of splitting text into tokens (subwords, words, or characters) that a model can process. Common tokenizers include BPE (GPT), WordPiece (BERT), and SentencePiece.
FAQ
What is tokenization?
Splitting text into tokens (subwords) that a model can process. The word unhappiness might become [un, happiness].
Why not just split by words?
Subword tokenization handles unknown words, multiple languages, and keeps vocabulary size manageable (32K-100K tokens).
Related Terms
Learn Tokenization in depth
Free hands-on course with code examples and Google Colab notebooks.
Start Course →