Intent classification is the task of mapping a user query to one of a small, fixed set of categories. "What's my balance?" maps to account_balance. "How do I cancel?" maps to cancellation. The intent label drives what happens next: which prompt fires, which tool gets called, which team gets paged. Get it right and the rest of the pipeline has a fighting chance. Get it wrong and the best downstream prompt in the world is answering the wrong question.
Until recently, intent classification meant fine-tuning a BERT-class encoder on a labeled dataset. That still works at high QPS or when you have ten thousand labeled examples sitting around. But cheap, fast LLMs with structured outputs have made few-shot LLM classification practical for most teams, often with better quality at lower cost in engineering time.
This guide covers what intent classification is, the three approaches that work in 2026 (encoder fine-tune, few-shot LLM, structured-output LLM), code, eval setup, and production routing patterns. For broader context, see LLM Observability and How to Evaluate an LLM.
TL;DR
- What it is. Map a query to one of N predefined intents. The first step in any router, support bot, or smart inbox.
- Three approaches. Fine-tune an encoder (BERT, DeBERTa). Few-shot prompt an LLM. Structured-output LLM with an enum schema.
- Cost profile. Encoder is cheapest per query, expensive to train. LLM is more expensive per query, free to iterate.
- Accuracy. For 5 to 30 intents with clear definitions, a modern LLM with structured outputs typically matches or beats a fine-tuned encoder out of the box.
- Production pattern. Classify, route to a specialized prompt or tool, log the intent on the trace, eval precision/recall per class weekly.
- The trap. Treating intents as static. The label set is product surface; revise it as your product changes.
What is intent classification
Two things define an intent classifier:
- The label set. A fixed list of categories. Five to fifty is the usual range. Categories must be mutually exclusive in practice; if two categories overlap, your model will be confused and so will your evals.
- The decision function. Given the query and any context, return the most likely label, plus optionally a confidence and an out-of-distribution bucket like unknown.
That is it. Intent classification is one of the simplest and oldest tasks in NLP. What changed between 2024 and 2026 is the tooling: a solo engineer can now ship a production intent classifier in an afternoon with no labeled data.
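In code, the contract is tiny. A minimal sketch of the interface (the names here are illustrative, not from any particular library):
from dataclasses import dataclass

@dataclass
class Prediction:
    label: str          # one of the N intents, or "unknown" for out-of-distribution inputs
    confidence: float   # optional; lets the router threshold or escalate

def classify(query: str, context: str = "") -> Prediction:
    ...  # an encoder forward pass or an LLM call; see the three approaches below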
Where it shows up
- Support bots. Route tickets to the right specialized prompt (billing, bug, feature request).
- Smart inboxes. Triage email or messages by intent before generating a reply.
- Agent routers. A planner agent classifies the user message and dispatches to a sub-agent.
- Prompt routers. Pick which prompt template to use based on intent.
- Safety routing. Detect sensitive intents (medical, legal, self-harm) and short-circuit to a safe handler.
Approach 1: fine-tune an encoder
The traditional path. Take DeBERTa-v3-base or a domain-tuned variant, attach a classification head, fine-tune on a few thousand labeled examples per class.
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, DataCollatorWithPadding, Trainer, TrainingArguments

model_name = "microsoft/deberta-v3-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# intents.csv needs a "text" column and an integer "label" column
ds = load_dataset("csv", data_files="intents.csv")
ds = ds.map(lambda x: tokenizer(x["text"], truncation=True), batched=True)

model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=8)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=3),
    train_dataset=ds["train"],
    data_collator=DataCollatorWithPadding(tokenizer),  # pad dynamically per batch
)
trainer.train()

Pros. Cheapest inference per query (sub-millisecond on a T4). Deterministic. No prompt drift. Easy to deploy on-prem.
Cons. Needs labeled data (1k to 5k per class for good results). Retraining loop on every taxonomy change. Cold start is painful: you need a labeled dataset before you can even start.
Use it when. You have a stable taxonomy, high QPS (more than 50 queries per second), strict latency budget (under 10 ms), and a labeling pipeline already in place.
Approach 2: few-shot LLM
Hand the model the label set, a few examples per label, and the query. Ask for the label.
from openai import OpenAI

client = OpenAI(base_url="https://api.respan.ai/v1")

PROMPT = """Classify the user message into one of these intents:
- billing: questions about charges, invoices, refunds
- bug_report: something is broken or unexpected
- feature_request: asks for a capability we do not have
- cancellation: wants to cancel or downgrade
- other: none of the above
Examples:
"My card was charged twice" -> billing
"The export button does nothing on Safari" -> bug_report
"Can you add SAML?" -> feature_request
"How do I close my account?" -> cancellation
Message: {message}
Intent:"""

def classify(message: str) -> str:
    out = client.chat.completions.create(
        model="openai/gpt-5.4-nano",
        messages=[{"role": "user", "content": PROMPT.format(message=message)}],
        max_tokens=10,
    )
    return out.choices[0].message.content.strip().lower()

Pros. Zero training. Iterate on the prompt in minutes. Often surprisingly accurate even with three to five examples per label.
Cons. The model can return labels outside your enum ("billing question" instead of "billing"). It can hallucinate. Parsing free-form output is fragile.
Use it when. You want to start tomorrow and you have no labels. Good as a bootstrap to generate labeled data that later trains an encoder.
Approach 3: LLM with structured outputs
The 2026 default. Use the structured-outputs feature of your model to constrain the output to a strict enum. The model cannot return anything outside the enum, parsing is free, and you can add a confidence field to threshold your routing on.
from openai import OpenAI
from pydantic import BaseModel
from typing import Literal

client = OpenAI(base_url="https://api.respan.ai/v1")

class Intent(BaseModel):
    intent: Literal["billing", "bug_report", "feature_request", "cancellation", "other"]
    confidence: Literal["low", "medium", "high"]

PROMPT = """You are a support triage classifier. Read the message and emit
intent and your confidence. Use 'other' if none of the labels fit.
Definitions:
- billing: charges, invoices, refunds, payment methods
- bug_report: a feature is broken or behaves unexpectedly
- feature_request: a missing capability
- cancellation: wants to cancel or downgrade
- other: none of the above
Message: {message}"""

def classify(message: str) -> Intent:
    completion = client.chat.completions.parse(
        model="openai/gpt-5.4-nano",
        messages=[{"role": "user", "content": PROMPT.format(message=message)}],
        response_format=Intent,
    )
    return completion.choices[0].message.parsed

Pros. Guaranteed valid enum. Confidence built in. Type-safe in Python. Iterate by editing the prompt or the schema.
Cons. Slightly higher latency than raw few-shot (the constrained decoding adds a few ms). Cost per call is non-trivial at scale.
Use it when. You are building a new classifier in 2026 and you do not have a labeling team. This is the right starting point for the vast majority of cases.
A trick worth knowing: ask the model for both the label and a one-sentence rationale. The rationale gives you free interpretability and dramatically improves recall on edge cases, at the cost of more output tokens.
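A sketch of that variant, extending the schema from the example above. Putting the rationale field first means it is decoded before the label, so the reasoning can condition the prediction:
class IntentWithRationale(BaseModel):
    rationale: str  # decoded first: one sentence explaining the choice
    intent: Literal["billing", "bug_report", "feature_request", "cancellation", "other"]
    confidence: Literal["low", "medium", "high"]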
Setting up evals
A classifier without an eval suite is a vibe check. Run a small fixture (200 to 1000 labeled examples spread across your classes) and compute precision, recall, and F1 per class. Watch for class imbalance.
from sklearn.metrics import classification_report
import json

with open("fixture.jsonl") as f:
    fixture = [json.loads(line) for line in f]

y_true = [row["intent"] for row in fixture]
y_pred = [classify(row["message"]).intent for row in fixture]
print(classification_report(y_true, y_pred, digits=3))

A solid baseline on a well-defined 8-class taxonomy with GPT-5.4-nano and structured outputs typically lands at 0.92 to 0.96 macro F1 with zero training. A fine-tuned DeBERTa on 8 classes with a few thousand labeled examples per class lands around 0.95 to 0.97. The LLM closes most of the gap and is dramatically faster to ship.
Three eval practices to put in place from day one:
- Per-class breakdown. Macro F1 hides the class you keep missing. Look at the confusion matrix (see the sketch after this list).
- Eval the rationale. If you collected rationales, sample 20 and read them. Wrong rationale on a right label is a precursor to future regressions.
- Run eval on every prompt change. Promote a prompt only if the macro F1 holds and no individual class regresses by more than two points. See What Is Prompt Evaluation.
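The per-class breakdown takes a few extra lines with scikit-learn, reusing y_true and y_pred from the eval script above; the explicit label order keeps rows comparable across runs:
from sklearn.metrics import confusion_matrix

LABELS = ["billing", "bug_report", "feature_request", "cancellation", "other"]
cm = confusion_matrix(y_true, y_pred, labels=LABELS)
for label, row in zip(LABELS, cm):
    print(f"{label:>16}", row)  # row: true label; columns: predicted labels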
Production routing patterns
Classification is rarely the end of the pipeline. It is the first hop.
Classify then specialized prompt
The most common pattern. The classifier picks the intent; a specialized prompt for that intent fires next.
PROMPTS = {
    "billing": BILLING_PROMPT,
    "bug_report": BUG_PROMPT,
    "feature_request": FEATURE_REQUEST_PROMPT,
    "cancellation": CANCELLATION_PROMPT,
    "other": GENERIC_PROMPT,
}

def respond(message: str) -> str:
    intent = classify(message)
    prompt = PROMPTS[intent.intent].format(message=message)
    return llm(prompt)

This is a two-hop prompt chain. Each specialized prompt is shorter, more focused, and easier to evaluate than a single mega-prompt that tries to handle every intent.
Classify then tool call
If the intent maps to a specific action (lookup, write, escalate), skip the second prompt and call the tool directly.
def respond(message: str):
    intent = classify(message)
    match intent.intent:
        case "billing": return billing_lookup(message)
        case "cancellation": return open_cancellation_flow(message)
        case "bug_report": return create_ticket(message)
        case _: return generic_reply(message)

This is fast and cheap. It works when each intent has a deterministic action.
Classify with low-confidence fallback
When confidence is low, escalate to a stronger classifier or to a human.
def respond(message: str):
    # assumes classify() accepts a model override
    intent = classify(message, model="openai/gpt-5.4-nano")
    if intent.confidence == "low":
        intent = classify(message, model="anthropic/claude-sonnet-4.6")
    if intent.confidence == "low":
        return escalate_to_human(message)
    return llm(PROMPTS[intent.intent].format(message=message))

This is the "small model first, big model on hard cases" pattern. It routes more than 90 percent of traffic to a cheap model while preserving accuracy on the long tail.
Common mistakes
Overlapping classes. If your intents are "billing" and "refund," every refund question is also a billing question. Pick one and document the precedence rule.
Too many classes. Past 30 intents, accuracy starts to drop and your eval fixture gets thin per class. Group rare intents into broader buckets and re-classify within them only if needed.
No other bucket. Without an "other" or "unknown" class, the model is forced to guess on out-of-distribution inputs. Always include one.
Treating intents as static. Your product evolves. Intents must evolve too. Schedule a quarterly review of the label set against real traffic.
No trace on the intent. If you do not log the predicted intent on each request, you have no way to debug routing failures later. Capture intent as a span attribute or trace tag. See LLM Tracing.
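One way to do that with OpenTelemetry, assuming your service already emits traces (the attribute names are illustrative):
from opentelemetry import trace

tracer = trace.get_tracer("intent-router")

def classify_traced(message: str) -> Intent:
    with tracer.start_as_current_span("classify") as span:
        intent = classify(message)
        span.set_attribute("intent.label", intent.intent)
        span.set_attribute("intent.confidence", intent.confidence)
        return intent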
Using the wrong model. GPT-5.5 for classification is expensive overkill. Use a small, fast model (GPT-5.4-nano, Haiku 4.5) for the classifier; reserve the flagship for the downstream response.
When to choose which approach
A simple decision tree.
- High QPS (more than 50/s), tight latency (under 10 ms), labels available? Fine-tune an encoder.
- Need to ship this week, no labels yet? Structured-output LLM.
- Prototyping, getting a feel for the taxonomy? Few-shot LLM, then graduate to structured outputs once the taxonomy is stable.
- Highly regulated, audit logs required for every decision? Structured-output LLM with rationale, logged. Or encoder with explicit feature attributions.
- Multilingual? Modern LLMs win comfortably over a single encoder fine-tune.
FAQ
Do I need labeled data for LLM intent classification? No. Few-shot or structured outputs work with the label definitions alone. You will want a labeled eval set, but it can be 200 to 500 examples and you can write them in an afternoon.
How big should my eval fixture be? At least 30 examples per class for stable per-class metrics. 200 to 1000 total is a comfortable working range for most teams.
Can the LLM return a probability? Not directly, but you can ask it for a categorical confidence (low, medium, high) as a second field, or extract logprobs from the API if available. Categorical confidence is more useful for routing logic in practice.
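If your provider does expose logprobs, a rough numeric score looks like this (a sketch against the few-shot classifier above; the first-token logprob only approximates the label probability when a label spans several tokens):
import math

out = client.chat.completions.create(
    model="openai/gpt-5.4-nano",
    messages=[{"role": "user", "content": PROMPT.format(message=message)}],
    max_tokens=10,
    logprobs=True,
)
first = out.choices[0].logprobs.content[0]
score = math.exp(first.logprob)  # ~ P(first token of the predicted label)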
What about multi-intent messages? A message can carry two intents ("My card was charged twice and the app crashes when I open billing"). Use multi-label classification: the schema returns a list of intents. Be careful with eval; precision and recall now apply per (message, intent) pair.
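A sketch of the multi-label schema, extending the structured-output example above; instruct the prompt to return every intent that applies and treat an empty list as other:
class Intents(BaseModel):
    intents: list[Literal["billing", "bug_report", "feature_request", "cancellation", "other"]]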
How does this interact with embeddings-based routing? You can compute an embedding of the query and compare to embeddings of intent descriptions, picking the closest. This is cheap and surprisingly effective for coarse intent. Use it as a fast first-pass before a more expensive LLM classifier.
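A sketch of that first pass. The embedding model name is an assumption; substitute whatever your provider serves:
import numpy as np

INTENT_DESCRIPTIONS = {
    "billing": "charges, invoices, refunds, payment methods",
    "bug_report": "a feature is broken or behaves unexpectedly",
    "feature_request": "a missing capability",
    "cancellation": "wants to cancel or downgrade",
}

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

INTENT_VECS = {name: embed(desc) for name, desc in INTENT_DESCRIPTIONS.items()}

def coarse_intent(message: str) -> str:
    q = embed(message)
    def cos(v): return float(q @ v) / (np.linalg.norm(q) * np.linalg.norm(v))
    return max(INTENT_VECS, key=lambda name: cos(INTENT_VECS[name]))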
Should the classifier and the responder share a model? Usually no. A small model classifies, a bigger model responds. Separating them lets you tune cost and latency per hop.
Does fine-tuning beat structured outputs at scale? On well-bounded English-only classification with thousands of labeled examples per class, yes, marginally. For most production systems with 5 to 30 classes, the structured-output LLM is close enough that the engineering savings dominate.