Agentic RAG is retrieval-augmented generation where an LLM agent decides what to retrieve, when to retrieve, how to refine the query, and when to stop — instead of running a single fixed retrieve-then-generate pipeline. It's RAG with an agent loop in the middle, and it's how serious 2026 RAG systems are built.
The classic RAG pattern (embed query → vector search → stuff results into prompt → generate answer) works well for simple Q&A. It struggles when questions need multi-step reasoning, follow-up retrieval, or query rewriting. Agentic RAG closes that gap by letting the model think.
TL;DR
Classic RAG:

```
query → embed → vector search → top-k → prompt → answer
```

Agentic RAG:

```
query → agent → [retrieve, refine, retrieve again, reason, ...] → answer
          ↑ ↓
          └────── multi-step, decision-driven, possibly looped ─────┘
```
The agent has tools (retrievers, search APIs, web search, calculators) and decides which to call, in what order, and when it has enough information. The result is a more intelligent retrieval flow at the cost of higher latency and more LLM calls per query.
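That loop can be sketched in a few lines of Python. This is a minimal skeleton, not any framework's API: the `decide`, `retrieve`, and `synthesize` callables stand in for LLM and search calls, and the `DONE` convention is an assumption made for the sketch.

```python
from typing import Callable

def agentic_rag(question: str,
                decide: Callable[[str, list[str]], str],
                retrieve: Callable[[str], str],
                synthesize: Callable[[str, list[str]], str],
                max_steps: int = 5) -> str:
    """Agent loop: ask `decide` for the next retrieval query until it
    signals DONE (enough information) or the step cap is hit."""
    memory: list[str] = []                     # partial results across steps
    for _ in range(max_steps):                 # hard cap = termination guard
        next_query = decide(question, memory)  # agent picks the next action
        if next_query == "DONE":
            break
        memory.append(retrieve(next_query))    # accumulate evidence
    return synthesize(question, memory)
```

In production each callable is an LLM or search call; the `max_steps` cap is the cheapest insurance against a runaway loop.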
How agentic RAG differs from classic RAG
| Dimension | Classic RAG | Agentic RAG |
|---|---|---|
| Retrieval calls per query | 1 | 1-N (N can be 10+) |
| Query rewriting | None or fixed | Agent decides |
| Multiple retrievers | Hard-coded | Agent picks |
| Termination | Always after retrieve | Agent decides when enough |
| Latency | Single-shot | Multi-step (slower) |
| Cost | Low (1 LLM + 1 retrieval) | Higher (multiple LLM + retrievals) |
| Quality on hard queries | Limited | Better |
| Quality on simple queries | Fine | Sometimes worse (overcomplicated) |
When agentic RAG wins
Three scenarios:
- Multi-hop questions. "Who was the CEO of the company that acquired Anthropic in 2027?" requires multiple retrievals chained together.
- Ambiguous queries. "Tell me about cost optimization" — the agent can rewrite to a more specific query before retrieving.
- Domain with multiple knowledge sources. Agent picks the right source (docs vs FAQ vs blog vs API) based on the question.
When classic RAG is better
- Simple Q&A on a single knowledge base. Single-shot RAG is faster and cheaper.
- High-volume / cost-sensitive workloads. Agentic adds 5-10× more LLM calls per query.
- Latency-sensitive products. A multi-step agent loop adds seconds; classic RAG is sub-second.
A typical agentic RAG flow
User: "What was Stripe's gross margin in Q3 2025 vs their main payment competitor?"
Agent:
1. Decide: I need Stripe Q3 2025 financials AND a competitor comparison
2. Retrieve: search for "Stripe Q3 2025 earnings"
3. Reason: I have Stripe's gross margin (45%). Now identify their main competitor.
4. Retrieve: search for "stripe main competitor payment"
5. Reason: PayPal is the main competitor. Need their Q3 2025 financials.
6. Retrieve: search for "PayPal Q3 2025 gross margin"
7. Compose: synthesize the comparison
Output: structured answer with citations.
A classic RAG system would have done one retrieval and likely missed half the question. Agentic RAG handles the decomposition naturally.
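The walkthrough above can be sketched as sub-question decomposition over a toy corpus. The documents are invented placeholders (the 45% figure and PayPal come from the walkthrough itself; the 47% figure is made up for the sketch), and a real agent would generate each sub-question with an LLM, often conditioned on the previous hop's result, rather than take a scripted list.

```python
# Toy corpus keyed by topic; contents are hypothetical placeholders.
DOCS = {
    "stripe q3 2025 earnings": "Stripe gross margin Q3 2025: 45%",
    "stripe main competitor payment": "PayPal is Stripe's main payments competitor",
    "paypal q3 2025 gross margin": "PayPal gross margin Q3 2025: 47%",  # figure invented
}

def search(query: str) -> str:
    """Toy retriever: best word-overlap match against the corpus keys."""
    words = set(query.lower().split())
    return DOCS[max(DOCS, key=lambda key: len(words & set(key.split())))]

def answer_multi_hop(sub_questions: list[str]) -> list[str]:
    """One retrieval per hop; a real agent derives each hop from the last."""
    return [search(q) for q in sub_questions]
```

A single-shot retriever would run only the first search and never discover that PayPal is the comparison target.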
Architecture
A production agentic RAG system has:
- Retriever(s) — vector search, keyword search, web search, structured query (SQL)
- Query rewriter — agent step that refines the query before retrieval
- Agent loop — orchestration that decides which retriever to call, when to stop
- Memory — partial results accumulated across steps
- Synthesizer — final step that produces the answer with citations
In a framework, this looks like:
- LangGraph: state graph with retrieve / reason / decide nodes
- LlamaIndex: RAG-first framework whose agent abstractions are built around retrieval
- CrewAI: multi-agent variation where one agent retrieves, another reasons
Common patterns
- Router — agent picks one of several retrievers based on query type
- Iterative refinement — retrieve, check if enough, refine query, retrieve again
- Sub-question decomposition — break complex query into sub-questions, solve each, combine
- Self-correcting — retrieve, draft answer, check answer quality with LLM-as-judge, retry if low
These are not exclusive — a serious agentic RAG system uses several.
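The router pattern fits in a few lines. In this sketch the keyword rules stand in for an LLM classifier, and the retriever names (`api`, `faq`, `docs`) are hypothetical stubs rather than real indexes.

```python
# Each retriever is stubbed; in production these hit different indexes.
RETRIEVERS = {
    "api":  lambda q: f"api-reference results for {q!r}",
    "faq":  lambda q: f"faq results for {q!r}",
    "docs": lambda q: f"docs results for {q!r}",
}

def route(query: str) -> str:
    """Heuristic stand-in for an LLM router."""
    q = query.lower()
    if "endpoint" in q or "parameter" in q:
        return "api"
    if q.endswith("?") and len(q.split()) <= 8:
        return "faq"
    return "docs"

def routed_retrieve(query: str) -> str:
    return RETRIEVERS[route(query)](query)
```

Swapping the heuristic for a small LLM call keeps the same structure: classify once, then dispatch to exactly one retriever.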
Cost and latency
Agentic RAG is roughly 5-10× more expensive and 5-10× slower than classic RAG for the same query. The win is on hard queries where classic RAG would have failed; for easy queries, agentic is overkill.
The right architecture for production routes simple queries to classic RAG and complex ones to agentic RAG. A query classifier (a small, cheap LLM) decides at the entry point; the savings on simple queries pay for the added complexity.
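That entry-point routing can be sketched as follows. The keyword heuristic stands in for the small classifier LLM, and the cue phrases are invented for illustration.

```python
def classify(query: str) -> str:
    """Stand-in for a small LLM classifier at the entry point."""
    q = f" {query.lower()} "
    multi_hop_cues = (" vs ", " compare ", " versus ", " and also ")
    if any(cue in q for cue in multi_hop_cues) or len(q.split()) > 20:
        return "agentic"
    return "classic"

def handle(query: str, classic_rag, agentic_rag) -> str:
    """Send simple queries down the cheap path, complex ones to the agent."""
    pipeline = agentic_rag if classify(query) == "agentic" else classic_rag
    return pipeline(query)
```

Because most traffic in a typical product is simple, even a mediocre classifier shifts the bulk of queries onto the cheap single-shot path.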
Common pitfalls
- Agentic everything, even simple queries. Use a classifier; don't run an agent loop on trivial questions.
- No termination criteria. Without a step cap or an "enough information" check, agents can loop indefinitely.
- No retrieval evaluation. You can't tell if the agent is actually retrieving better than classic RAG without evals.
- Skipping observability. Multi-step agents are impossible to debug without tracing.
- Optimizing latency before quality. Get quality right first; latency optimizations come second.
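On the observability pitfall: in practice you would use a dedicated tracing tool, but the underlying principle (record every step's name, inputs, and output) is small enough to sketch with a generic wrapper. The function names here are illustrative.

```python
def traced(trace: list, name: str, fn):
    """Wrap a step function so every call is appended to `trace`."""
    def inner(*args, **kwargs):
        result = fn(*args, **kwargs)
        trace.append({
            "step": name,
            "args": args,
            "output": str(result)[:80],  # truncated preview for readability
        })
        return result
    return inner
```

Wrapping each retriever and reasoning step this way turns an opaque multi-step run into a replayable log.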
Tools that handle agentic RAG
- LangGraph (deep dive) — best for production stateful agents including agentic RAG
- LlamaIndex (deep dive) — RAG-first framework with strong agent abstractions
- CrewAI — fastest prototyping
- OpenAI Agents SDK — strong handoff patterns
Observability and evaluation deserve special mention: agentic RAG breaks down if you can't see and score the agent's decisions. Use Respan or an equivalent tool to trace every step.
How to start
If you have classic RAG working:
- Identify queries where classic RAG fails. Sample bad outputs from production traces.
- Build a query classifier that flags those queries for agentic flow.
- Build the agentic flow — start with iterative refinement (retrieve → check → refine → retrieve → answer).
- Wire evals comparing classic vs agentic on the failure-mode queries.
- Roll out gradually — start with 5% of traffic on agentic, monitor latency / cost / quality.
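The eval wiring in step 4 can start as simple as scoring both pipelines on the known failure cases. Exact substring matching stands in here for a real grader (LLM-as-judge or retrieval metrics); the function names are illustrative.

```python
def score(pipeline, cases: list[tuple[str, str]]) -> float:
    """Fraction of cases where the expected fact appears in the answer."""
    hits = sum(1 for query, expected in cases if expected in pipeline(query))
    return hits / len(cases)

def compare(classic, agentic, cases: list[tuple[str, str]]) -> dict[str, float]:
    """Run the same failure-mode cases through both pipelines."""
    return {"classic": score(classic, cases), "agentic": score(agentic, cases)}
```

If the agentic pipeline doesn't clearly beat classic RAG on the queries you built it for, the extra cost and latency aren't justified.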
FAQ
Is agentic RAG always better than classic RAG? No. For simple queries, classic RAG is faster and cheaper. Agentic RAG wins on multi-hop, ambiguous, or multi-source queries.
How much more does agentic RAG cost? Roughly 5-10× more LLM calls per query. The cost is justified for queries classic RAG handles poorly.
Should I use it for chat applications? Selectively. Route simple turns to classic RAG, complex turns to agentic. A classifier at the entry decides.
Does agentic RAG work with any LLM? Yes — it's an architectural pattern, not a model feature. Better models with tool-use capability work better. Claude Sonnet 4.6, GPT-5.5, and Gemini 3.1 Pro all do agentic RAG well.
What's the difference between agentic RAG and an agent that uses RAG as a tool? They overlap heavily. "Agentic RAG" usually emphasizes RAG as the primary activity; "agent with RAG tools" emphasizes a broader agent that includes retrieval. The implementations look similar.
Which framework is best? LangGraph for production stateful agents; LlamaIndex if your stack is RAG-first. See our framework comparison.