A token limit is the maximum number of tokens (words, subwords, or characters) that a Large Language Model can process in a single request, encompassing both the input prompt and the generated output within the model's context window.
Every LLM has a finite context window measured in tokens. A token is the fundamental unit the model processes, typically representing about 3-4 characters or roughly 0.75 words in English. The token limit defines the hard boundary on how much text can fit into a single interaction, including the system prompt, user message, any injected context (like RAG results), and the model's generated response.
Token limits vary significantly across models and providers. The original GPT-3 offered just 2,048 tokens, with later GPT-3.5 models reaching 4,096. GPT-4 Turbo expanded to 128K tokens, and Claude models support up to 200K tokens. Some newer models push toward 1M+ tokens. However, a larger context window does not automatically mean better performance; models can struggle with information retrieval in very long contexts, a phenomenon known as the "lost in the middle" problem.
From a practical standpoint, token limits affect application architecture in fundamental ways. They determine how much context you can inject in RAG systems, how many few-shot examples fit in a prompt, how long a conversation history can be maintained, and how lengthy the model's response can be. Most APIs separate the limit into input tokens and max output tokens, giving developers control over the response length budget.
Costs are also tied directly to token counts. API pricing is per-token for both input and output, with output tokens typically costing 3-5x more than input tokens. Understanding and optimizing token usage is therefore both a technical and financial imperative for production AI applications.
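As a rough illustration of how token counts translate into spend, the sketch below applies hypothetical per-token prices to a single request; the actual rates depend entirely on your provider and model.

```python
# Rough cost estimate for a single request.
# The per-token prices below are hypothetical placeholders; check your
# provider's current pricing page for real numbers.
INPUT_PRICE_PER_1K = 0.003   # dollars per 1,000 input tokens (assumed)
OUTPUT_PRICE_PER_1K = 0.015  # dollars per 1,000 output tokens (assumed)

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated dollar cost of one request."""
    return (input_tokens / 1000) * INPUT_PRICE_PER_1K + \
           (output_tokens / 1000) * OUTPUT_PRICE_PER_1K

# e.g. a 3,000-token prompt with a 500-token response
print(f"${estimate_cost(3000, 500):.4f}")  # -> $0.0165
```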
The input text is broken into tokens using the model's tokenizer (e.g., BPE, SentencePiece). Each model family has its own tokenizer, so the same text may produce different token counts across models. Tools like tiktoken (for OpenAI models) or Anthropic's token-counting API let developers count tokens before submitting a request.
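A minimal counting sketch using OpenAI's tiktoken library is shown below; other model families (Claude, Llama, etc.) use their own tokenizers, so counts from this snippet only apply to OpenAI-style encodings.

```python
# Count tokens before sending a request, using OpenAI's tiktoken library.
import tiktoken

def count_tokens(text: str, model: str = "gpt-4o") -> int:
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        # Fall back to a general-purpose encoding for unknown model names.
        encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(text))

print(count_tokens("Token limits constrain every LLM request."))
```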
The total context window is divided between input tokens (system prompt + user message + injected context) and output tokens (the model's response). Developers set a max_tokens parameter to reserve space for the response, and the remaining budget is available for input.
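The budget arithmetic itself is simple. The sketch below splits an example 128K context window between reserved output and the various input components; the specific numbers are illustrative, not a recommendation.

```python
# Illustrative token budget split for a 128K-context model.
CONTEXT_WINDOW = 128_000   # total tokens the model can handle (example value)
MAX_OUTPUT_TOKENS = 4_000  # space reserved for the response via max_tokens

def available_input_budget(system_tokens: int, history_tokens: int,
                           context_window: int = CONTEXT_WINDOW,
                           max_output: int = MAX_OUTPUT_TOKENS) -> int:
    """Tokens left for injected context (e.g. RAG chunks) after the
    system prompt, conversation history, and reserved output."""
    return context_window - max_output - system_tokens - history_tokens

print(available_input_budget(system_tokens=800, history_tokens=6_200))
# -> 117000 tokens remain for retrieved context
```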
If the combined input exceeds the model's context window minus the reserved output tokens, the API returns an error. Applications must implement truncation, summarization, or chunking strategies to stay within bounds.
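The crudest of these fallbacks is simple truncation. Here is a small sketch that cuts a text down to a token budget, reusing the tiktoken encoding from the counting example above.

```python
# Truncate a text so that it fits inside a token budget.
import tiktoken

def truncate_to_budget(text: str, max_tokens: int,
                       encoding_name: str = "cl100k_base") -> str:
    encoding = tiktoken.get_encoding(encoding_name)
    tokens = encoding.encode(text)
    if len(tokens) <= max_tokens:
        return text
    # Keep only the first max_tokens tokens and decode back to text.
    return encoding.decode(tokens[:max_tokens])
```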
Production systems implement strategies like sliding window conversations (dropping older messages), dynamic context pruning (removing low-relevance RAG chunks), and prompt compression to maximize the value of each token within the limit.
A chatbot application maintains conversation history by appending each user message and assistant response. As the conversation grows, it approaches the token limit. The system implements a sliding window that summarizes older messages and retains only the most recent exchanges in full, keeping the total under the model's context window.
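A minimal sketch of that sliding-window idea follows. It assumes a hypothetical summarize() helper that would wrap an LLM call, plus the count_tokens helper from the earlier tiktoken example.

```python
# Sliding-window conversation history: keep recent messages verbatim,
# fold older ones into a running summary once the budget is exceeded.
# `summarize` is a hypothetical helper that would call an LLM;
# `count_tokens` is the tiktoken-based counter from the earlier example.

def trim_history(messages: list[dict], summary: str,
                 budget: int, keep_recent: int = 6) -> tuple[list[dict], str]:
    def total_tokens() -> int:
        return count_tokens(summary) + sum(
            count_tokens(m["content"]) for m in messages)

    while total_tokens() > budget and len(messages) > keep_recent:
        # Pop the oldest message and merge it into the running summary.
        oldest = messages.pop(0)
        summary = summarize(summary + "\n" + oldest["content"])  # hypothetical LLM call
    return messages, summary
```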
A document Q&A system retrieves 20 relevant chunks from a vector database, but including all of them would exceed the token limit. The system uses a re-ranker to score chunk relevance, then greedily selects the top chunks that fit within the remaining token budget after accounting for the system prompt and reserved output tokens.
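The greedy packing step can be as simple as the sketch below, which assumes the chunks have already been scored by a re-ranker and reuses the count_tokens helper from earlier.

```python
# Greedily pack the highest-scoring chunks into the remaining token budget.
# `chunks` are (score, text) pairs already scored by a re-ranker;
# `count_tokens` is the tiktoken-based counter from the earlier example.

def select_chunks(chunks: list[tuple[float, str]], budget: int) -> list[str]:
    selected, used = [], 0
    for score, text in sorted(chunks, key=lambda c: c[0], reverse=True):
        cost = count_tokens(text)
        if used + cost <= budget:
            selected.append(text)
            used += cost
    return selected
```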
An analyst needs to summarize a 200-page report that far exceeds any model's token limit. The application splits the document into overlapping segments, summarizes each segment independently, then runs a final synthesis pass over all the segment summaries to produce a cohesive overall summary.
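That split-then-synthesize pattern (often called map-reduce summarization) might look like the sketch below, where call_llm is a hypothetical wrapper around a chat completion request and the segment size and overlap are illustrative rather than tuned values.

```python
# Map-reduce summarization of a document that exceeds the context window.
# `call_llm` is a hypothetical wrapper around a chat completion request.

def split_with_overlap(words: list[str], size: int = 3000, overlap: int = 200):
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

def summarize_document(text: str) -> str:
    segments = split_with_overlap(text.split())
    # Map: summarize each segment independently.
    partials = [call_llm(f"Summarize this section:\n\n{seg}") for seg in segments]
    # Reduce: synthesize the partial summaries into one cohesive summary.
    return call_llm("Combine these section summaries into one summary:\n\n"
                    + "\n\n".join(partials))
```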
Token limits are a fundamental constraint that shapes how every LLM application is designed. Understanding token budgets is essential for building reliable AI systems, controlling costs, and ensuring the model has sufficient context to generate accurate, relevant responses without exceeding boundaries that cause errors or degraded performance.
Respan automatically tracks token consumption across every LLM call, breaking down input and output tokens by model, endpoint, and user. Set up alerts when requests approach token limits, identify prompts that waste token budget, and optimize your context allocation strategy. Respan's cost analytics tie token usage directly to spend, helping you stay within budget while maximizing model performance.
Try Respan free