Tokenization is the process of breaking text into smaller units called tokens, which serve as the fundamental building blocks that language models read and generate. Tokens can represent whole words, subwords, individual characters, or even byte sequences.
Language models do not process raw text directly. Instead, they work with tokens: numerical representations of text segments. The tokenization process converts input text into a sequence of token IDs from a fixed vocabulary, and the model's output token IDs are converted back into text. This conversion is handled by a tokenizer that is trained alongside the model or built specifically for it.
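As a concrete sketch, here is that round trip using the open-source tiktoken library (an illustrative choice; any tokenizer library exposes equivalent encode and decode calls, and the encoding name here is one OpenAI vocabulary standing in for whichever tokenizer your model uses):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

ids = enc.encode("Language models read tokens, not characters.")
print(ids)              # a sequence of integer token IDs from the fixed vocabulary
print(enc.decode(ids))  # converting the IDs back recovers the original text
```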
Modern LLMs predominantly use subword tokenization algorithms such as Byte Pair Encoding (BPE) or the unigram model implemented in SentencePiece. These algorithms learn a vocabulary of common subword units from statistical patterns in training data, then split text into units from that vocabulary. Common words like "the" become single tokens, while rare words are split into multiple subword tokens. For example, "unhappiness" might be tokenized as ["un", "happiness"] or ["un", "happ", "iness"], depending on the vocabulary.
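These splits are easy to inspect directly. The sketch below uses tiktoken's cl100k_base vocabulary (an illustrative choice; the exact pieces differ from tokenizer to tokenizer):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for word in ["the", "unhappiness", "defenestration"]:
    ids = enc.encode(word)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{word!r} -> {pieces}")

# Frequent words typically map to a single token, while rarer words
# split into several subword pieces.
```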
Tokenization has significant practical implications. The number of tokens in a text determines its cost (most APIs charge per token), whether it fits within the model's context window, and processing time. Different models use different tokenizers, so the same text may produce different token counts across models. For instance, GPT-4 and Claude use different tokenization schemes.
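As a rough sketch of this divergence, the snippet below counts tokens for the same text under two openly available tiktoken vocabularies (Claude's tokenizer is not publicly distributed, so two OpenAI-family encodings stand in here):

```python
import tiktoken

text = "Tokenization differs across models and vocabularies."
for name in ["gpt2", "cl100k_base"]:
    enc = tiktoken.get_encoding(name)
    print(name, len(enc.encode(text)))

# Older and newer vocabularies segment the same text differently,
# so token counts (and therefore costs) are model-specific.
```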
The quality of tokenization also affects model performance. Models handle languages and domains better when their tokenizer efficiently represents the relevant text. Tokenizers trained primarily on English text may split non-Latin scripts or specialized terminology into many more tokens, potentially degrading performance and increasing costs for those inputs.
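If multilingual or domain-specific inputs matter to your application, it is worth measuring this inflation directly. A minimal sketch, assuming tiktoken and a rough translation pair (results depend on the tokenizer and the text):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

english = "The weather is nice today."
japanese = "今日はいい天気ですね。"  # roughly the same sentence in Japanese

print("English: ", len(enc.encode(english)), "tokens")
print("Japanese:", len(enc.encode(japanese)), "tokens")
```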
During tokenizer training, an algorithm like BPE repeatedly finds the most frequent adjacent pair of symbols (initially individual characters or bytes) in a large text corpus and merges it into a single new token, building a vocabulary of a fixed size (e.g., 50,000 or 100,000 tokens).
When input text is provided, the tokenizer applies its learned merge rules to segment the text into a sequence of tokens from its vocabulary, then converts each token to its corresponding numerical ID.
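The two steps above can be illustrated end to end with a deliberately tiny, character-level sketch (for intuition only; production tokenizers operate on bytes over huge corpora, with optimized data structures):

```python
from collections import Counter

def most_frequent_pair(words):
    pairs = Counter()
    for w in words:
        pairs.update(zip(w, w[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def apply_merge(word, pair):
    out, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
            out.append(word[i] + word[i + 1])  # fuse the pair into one symbol
            i += 2
        else:
            out.append(word[i])
            i += 1
    return out

# Training: iteratively merge the most frequent adjacent pair.
corpus = ["low", "lower", "lowest", "newest", "widest"]
words = [list(w) for w in corpus]  # start from individual characters
merges = []
for _ in range(4):  # real vocabularies use tens of thousands of merges
    pair = most_frequent_pair(words)
    if pair is None:
        break
    merges.append(pair)
    words = [apply_merge(w, pair) for w in words]

# Encoding: replay the learned merges in order, then map pieces to IDs.
vocab = {}
for w in words:
    for piece in w:
        vocab.setdefault(piece, len(vocab))

def tokenize(text):
    pieces = list(text)
    for pair in merges:  # merge rules apply in the order they were learned
        pieces = apply_merge(pieces, pair)
    return pieces

pieces = tokenize("lowest")
print(pieces, "->", [vocab[p] for p in pieces])
```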
The sequence of token IDs is processed by the model's neural network. Each token ID maps to a learned embedding vector, and the model operates on these vectors through its layers.
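A minimal sketch of that lookup, with illustrative shapes and hypothetical IDs (a real model's embedding matrix is learned during training, not random):

```python
import numpy as np

vocab_size, d_model = 50_000, 768
rng = np.random.default_rng(0)
embedding_matrix = rng.standard_normal((vocab_size, d_model), dtype=np.float32)

token_ids = [464, 2746, 9743]          # hypothetical IDs for a 3-token input
vectors = embedding_matrix[token_ids]  # shape (3, 768): one vector per token
print(vectors.shape)
```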
The model's output token IDs are mapped back to their text representations and concatenated to produce the final text output, reversing the tokenization process.
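In code, decoding is little more than a reverse lookup and a join (hypothetical vocabulary fragment; byte-level tokenizers decode to bytes first, then to UTF-8 text):

```python
id_to_piece = {101: "un", 102: "happ", 103: "iness"}  # tiny vocabulary fragment
output_ids = [101, 102, 103]
print("".join(id_to_piece[i] for i in output_ids))  # -> "unhappiness"
```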
A development team uses a tokenizer library to count tokens in their prompts and expected responses before making API calls, enabling accurate cost forecasting and budget planning for their LLM-powered application.
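A sketch of such a pre-call estimate, assuming tiktoken for counting; the model name and per-token prices are placeholders to be replaced with your provider's actual rates:

```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

def estimate_cost(prompt, expected_output_tokens, in_price_per_1k, out_price_per_1k):
    """Rough pre-call cost estimate; prices are in $ per 1K tokens."""
    n_input = len(enc.encode(prompt))
    return (n_input / 1000) * in_price_per_1k \
         + (expected_output_tokens / 1000) * out_price_per_1k

# Hypothetical prices for illustration only.
print(f"${estimate_cost('Draft a welcome email for new users.', 300, 0.03, 0.06):.4f}")
```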
A RAG system tokenizes retrieved documents to ensure the total input stays within the model's context window. When documents would exceed the limit, the system truncates or removes the least relevant chunks based on token count.
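A minimal sketch of that budgeting logic, assuming an 8K context window, a 1K-token reservation for the response, and chunks already ordered by relevance:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
CONTEXT_LIMIT = 8_000        # hypothetical context window
RESERVED_FOR_OUTPUT = 1_000  # leave room for the model's response

def fit_chunks(chunks_by_relevance):
    """Keep the most relevant chunks that fit the remaining token budget."""
    budget = CONTEXT_LIMIT - RESERVED_FOR_OUTPUT
    kept, used = [], 0
    for chunk in chunks_by_relevance:  # most relevant first
        n = len(enc.encode(chunk))
        if used + n > budget:
            break  # alternatively, truncate this chunk to the remaining budget
        kept.append(chunk)
        used += n
    return kept

docs = ["First retrieved passage...", "Second retrieved passage..."]
print(len(fit_chunks(docs)), "chunks kept")
```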
A company building a multilingual chatbot discovers that their tokenizer uses roughly three times as many tokens for Japanese text as for English. They switch to a model with a multilingual-optimized tokenizer to reduce costs and improve Japanese language performance.
Tokenization is foundational to every LLM interaction. It determines API costs, context window utilization, and model performance across different languages and domains. Understanding tokenization is essential for optimizing LLM applications and managing their economics.
Respan provides detailed token-level analytics for your LLM operations. Monitor input and output token counts per request, track token costs across models and providers, identify prompts with unexpectedly high token usage, and optimize your token budget with data-driven insights.
Try Respan free