Agentic RAG is retrieval-augmented generation where an LLM agent decides what to retrieve, when to retrieve, how to refine the query, and when to stop, instead of running a single fixed retrieve-then-generate pipeline. It's RAG with an agent loop in the middle, and it's how serious 2026 RAG systems are built.
The classic RAG pattern (embed query → vector search → stuff results into prompt → generate answer) works well for simple Q&A. It struggles when questions need multi-step reasoning, follow-up retrieval, or query rewriting. Agentic RAG closes that gap by letting the model decide, between steps, whether to retrieve again, rewrite the query, or answer.
TL;DR
Classic RAG:
query → embed → vector search → top-k → prompt → answer
Agentic RAG:
query → agent → [retrieve, refine, retrieve again, reason, ...] → answer
(the bracketed steps are multi-step and decision-driven, and may loop back several times before the answer)
The agent has tools (retrievers, search APIs, web search, calculators) and decides which to call, in what order, and when it has enough information. The result is a more intelligent retrieval flow at the cost of higher latency and more LLM calls per query.
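A minimal sketch of that loop, next to classic RAG for contrast. Everything here is hypothetical: the toy corpora, the keyword `make_retriever` (a stand-in for vector search), and the rule-based `decide` function, which stands in for the LLM choosing the next tool or deciding it has enough information.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable

# A tool takes a query and returns matching document ids.
Tool = Callable[[str], list[str]]


def make_retriever(corpus: dict[str, str]) -> Tool:
    """Naive keyword retriever over {doc_id: text}; a stand-in for vector search."""
    def retrieve(query: str) -> list[str]:
        terms = set(query.lower().split())
        return [doc_id for doc_id, text in corpus.items()
                if terms & set(text.lower().split())]
    return retrieve


@dataclass
class AgentState:
    query: str
    evidence: list[str] = field(default_factory=list)  # ids gathered so far
    steps: list[str] = field(default_factory=list)     # tools called, in order


def decide(state: AgentState, tools: dict[str, Tool]) -> str | None:
    """Hypothetical policy standing in for the LLM: stop once any evidence
    exists, otherwise try the next tool not yet called."""
    if state.evidence:
        return None  # enough information
    for name in tools:
        if name not in state.steps:
            return name
    return None  # every tool tried; give up


def agentic_rag(query: str, tools: dict[str, Tool], max_steps: int = 5) -> AgentState:
    """Agent loop: ask the policy for the next tool, call it, repeat until it says stop."""
    state = AgentState(query)
    for _ in range(max_steps):
        tool_name = decide(state, tools)
        if tool_name is None:
            break
        state.steps.append(tool_name)
        state.evidence.extend(tools[tool_name](query))
    return state


def classic_rag(query: str, retrieve: Tool) -> list[str]:
    """Classic RAG for contrast: exactly one retrieval, no decisions."""
    return retrieve(query)


tools = {
    "docs": make_retriever({"d1": "refund policy for annual plans"}),
    "faq": make_retriever({"f1": "how to reset a password"}),
}
state = agentic_rag("how do I reset my password", tools)
print(state.steps, state.evidence)  # ['docs', 'faq'] ['f1']
```

The docs retriever finds nothing, so the policy moves on to the FAQ retriever, then stops once it has evidence. `classic_rag` with the docs retriever alone would have returned an empty list.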
How agentic RAG differs from classic RAG
| Dimension | Classic RAG | Agentic RAG |
|---|---|---|
| Retrieval calls per query | 1 | 1-N (N can be 10+) |
| Query rewriting | None or fixed | Agent decides |
| Multiple retrievers | Hard-coded | Agent picks |
| Termination | Always after retrieve | Agent decides when enough |
| Latency | Low (single pass) | Higher (multiple sequential steps) |
| Cost | Low (1 LLM + 1 retrieval) | Higher (multiple LLM + retrievals) |
| Quality on hard queries | Limited | Better |
| Quality on simple queries | Fine | Sometimes worse (overcomplicated) |
When agentic RAG wins
Three scenarios:
- Multi-hop questions. "Who was the CEO of the company that acquired Anthropic in 2027?" requires multiple retrievals chained together.
- Ambiguous queries. "Tell me about cost optimization." The agent can rewrite to a more specific query before retrieving.
- Domains with multiple knowledge sources. The agent picks the right source (docs, FAQ, blog, or API) based on the question.
When classic RAG is better
- Simple Q&A on a single knowledge base. Single-shot RAG is faster and cheaper.
- High-volume / cost-sensitive workloads. Agentic RAG typically makes 5-10× as many LLM calls per query.
- Latency-sensitive products. A multi-step agent loop adds seconds; classic RAG is sub-second.
A typical agentic RAG flow
User: "What was Stripe's gross margin in Q3 2025 vs their main payment competitor?"
Agent:
1. Decide: I need Stripe Q3 2025 financials AND a competitor comparison
2. Retrieve: search for "Stripe Q3 2025 earnings"
3. Reason: I have Stripe's gross margin (45%). Now identify their main competitor.
4. Retrieve: search for "stripe main competitor payment"
5. Reason: PayPal is the main competitor. Need their Q3 2025 financials.
6. Retrieve: search for "PayPal Q3 2025 gross margin"
7. Compose: synthesize the comparison
Output: structured answer with citations.
A classic RAG system would have done one retrieval and likely missed half the question. Agentic RAG handles the decomposition naturally.
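The decomposition above can be sketched as a chain in which one hop's answer is needed to write the next hop's query. To avoid inventing real financials, this sketch uses made-up companies ("Acme", "Globex") and margins, and a hypothetical exact-match `search` stand-in rather than a real search tool.

```python
from __future__ import annotations

# Hypothetical fact store keyed by normalized query. "Acme", "Globex", and the
# margins are invented for illustration; a real agent would call a search tool.
FACTS = {
    "acme q3 gross margin": "40%",
    "acme main competitor": "Globex",
    "globex q3 gross margin": "35%",
}


def search(query: str) -> str | None:
    """Stand-in retriever: exact lookup on the lowercased query."""
    return FACTS.get(query.lower())


def compare_with_competitor(company: str) -> dict[str, str]:
    """Multi-hop chain: the answer to hop 2 (who the competitor is) is
    needed to even write the query for hop 3."""
    own_margin = search(f"{company} q3 gross margin")       # hop 1
    competitor = search(f"{company} main competitor")       # hop 2
    if own_margin is None or competitor is None:
        # A real agent would rewrite the query and retry here.
        raise LookupError(f"missing evidence for {company}")
    rival_margin = search(f"{competitor} q3 gross margin")  # hop 3 depends on hop 2
    if rival_margin is None:
        raise LookupError(f"missing margin for {competitor}")
    return {
        "company": company,
        "margin": own_margin,
        "competitor": competitor,
        "competitor_margin": rival_margin,
    }
```

A single-shot retriever given the original question has no way to issue the hop 3 query, because the competitor's name is not in the question.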
Architecture
A production agentic RAG system has:
- Retriever(s): vector search, keyword search, web search, structured query (SQL)
- Query rewriter: agent step that refines the query before retrieval
- Agent loop: orchestration that decides which retriever to call, when to stop
- Memory: partial results accumulated across steps
- Synthesizer: final step that produces the answer with citations
In a framework, this looks like:
- LangGraph: state graph with retrieve / reason / decide nodes
- LlamaIndex: RAG-first framework whose agent abstractions are built for this pattern
- CrewAI: multi-agent variation where one agent retrieves, another reasons
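Independent of framework, the query rewriter, memory, and synthesizer components listed above can be wired roughly like this. `rewrite_query`, `Evidence`, `Memory`, and `synthesize` are hypothetical names: the rewriter is a stop-word stripper standing in for an LLM rewrite, and the synthesizer just concatenates numbered snippets where a real one would prompt an LLM to write the answer and cite them.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Hypothetical filler words removed by the stand-in rewriter.
STOP_WORDS = {"tell", "me", "about", "please", "the", "a", "what", "is"}


def rewrite_query(query: str) -> str:
    """Stand-in for an LLM query rewriter: drop filler words."""
    return " ".join(w for w in query.lower().split() if w not in STOP_WORDS)


@dataclass(frozen=True)
class Evidence:
    source: str  # e.g. a URL or document id, kept for citations
    text: str


@dataclass
class Memory:
    """Partial results accumulated across agent steps."""
    items: list[Evidence] = field(default_factory=list)

    def add(self, ev: Evidence) -> None:
        if ev not in self.items:  # skip exact duplicates from repeated retrievals
            self.items.append(ev)


def synthesize(question: str, memory: Memory) -> str:
    """Answer with numbered citations that map back to sources."""
    numbered = list(enumerate(memory.items, start=1))
    body = " ".join(f"{ev.text} [{i}]" for i, ev in numbered)
    refs = "\n".join(f"[{i}] {ev.source}" for i, ev in numbered)
    return f"Q: {question}\nA: {body}\nSources:\n{refs}"
```

Keeping the source on every piece of evidence from the first retrieval onward is what makes citations possible at the end; deduplicating in memory keeps repeated retrievals from inflating the context.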
Common patterns
- Router: agent picks one of several retrievers based on query type
- Iterative refinement: retrieve, check if enough, refine query, retrieve again
- Sub-question decomposition: break complex query into sub-questions, solve each, combine
- Self-correcting: retrieve, draft answer, check answer quality with LLM-as-judge, retry if low
These patterns are not mutually exclusive. A serious agentic RAG system usually combines several.
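A sketch combining two of these patterns: a router that picks a retriever by query type, and a self-correcting loop that refines the query and retries when a judge score is low. The keyword routing rules, the term-coverage `judge`, the `ABBREVIATIONS` table used by `refine`, and the 0.6 threshold are all hypothetical stand-ins; in practice the router, judge, and rewrite are usually LLM calls.

```python
from __future__ import annotations

from typing import Callable

Retriever = Callable[[str], list[str]]

# Hypothetical abbreviation table used by the stand-in refinement step.
ABBREVIATIONS = {"sso": "single sign-on", "k8s": "kubernetes"}


def route(query: str) -> str:
    """Router: pick a retriever name with keyword rules (an LLM classifier in practice)."""
    q = query.lower()
    if any(word in q.split() for word in ("count", "total", "sum")):
        return "sql"
    if q.startswith(("how do i", "how to")):
        return "faq"
    return "docs"


def judge(query: str, passages: list[str]) -> float:
    """Stand-in for LLM-as-judge: share of query terms found in some passage."""
    terms = query.lower().split()
    if not terms:
        return 0.0
    hits = sum(1 for t in terms if any(t in p.lower() for p in passages))
    return hits / len(terms)


def refine(query: str) -> str:
    """Stand-in for an LLM rewrite: expand known abbreviations."""
    return " ".join(ABBREVIATIONS.get(t, t) for t in query.lower().split())


def self_correcting_retrieve(query: str, retrievers: dict[str, Retriever],
                             threshold: float = 0.6, max_attempts: int = 3) -> dict:
    """Route, retrieve, judge; refine and retry while the score is below threshold."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    for attempt in range(1, max_attempts + 1):
        name = route(query)
        passages = retrievers[name](query)
        score = judge(query, passages)
        if score >= threshold or attempt == max_attempts:
            return {"retriever": name, "query": query, "passages": passages,
                    "score": score, "attempts": attempt}
        query = refine(query)
    raise AssertionError("unreachable")
```

With a passage like "single sign-on setup lives under security settings", the query "sso setup" scores 0.5 on the first pass, gets expanded to "single sign-on setup", and passes on the second attempt.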
Cost and latency
Agentic RAG is roughly 5-10× more expensive and 5-10× slower than classic RAG for the same query. The win is on hard queries where classic RAG would have failed; for easy queries, agentic is overkill.
The right architecture for production: route simple queries to classic RAG and complex ones to agentic RAG. A query classifier at the entry (a small LLM, priced around $0.20 input / $1.25 output per million tokens) decides which path each query takes. The savings from keeping simple queries on the cheap path pay for the added routing complexity.
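A sketch of the entry router with a back-of-envelope cost model. `looks_complex` is a hypothetical keyword heuristic standing in for the small-LLM classifier, and the numbers in the example (20% complex traffic, a 7.5× agentic multiplier from the middle of the 5-10× range, a classifier costing 5% of a classic query) are illustrative assumptions, not measurements.

```python
from __future__ import annotations


def looks_complex(query: str) -> bool:
    """Hypothetical stand-in for the small-LLM classifier: flag comparisons,
    multi-part questions, and very long queries."""
    q = f" {query.lower()} "
    markers = (" vs ", " versus ", " compare", " and then ", " both ")
    return any(m in q for m in markers) or q.count("?") > 1 or len(q.split()) > 25


def route_query(query: str) -> str:
    return "agentic" if looks_complex(query) else "classic"


def expected_cost_per_query(p_complex: float, classic_cost: float,
                            agentic_multiplier: float, classifier_cost: float) -> float:
    """Average cost when every query pays for the classifier and a fraction
    p_complex of traffic goes down the agentic path."""
    classic_share = (1 - p_complex) * classic_cost
    agentic_share = p_complex * agentic_multiplier * classic_cost
    return classifier_cost + classic_share + agentic_share


# Costs in units of "one classic RAG query".
routed = expected_cost_per_query(0.2, 1.0, 7.5, 0.05)
all_agentic = expected_cost_per_query(1.0, 1.0, 7.5, 0.0)
print(f"routed: {routed:.2f}, all agentic: {all_agentic:.2f}")
```

Under these assumptions the routed setup averages about 2.35 classic-query units per query, against 7.5 if every query went agentic; the gap shrinks as the share of complex traffic grows.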
Common pitfalls
- Making everything agentic, even simple queries. Use a classifier and keep trivial queries on classic RAG.
- No termination criteria. Without a step budget or stopping condition, agents can loop indefinitely.
- No retrieval evaluation. You can't tell if the agent is actually retrieving better than classic RAG without evals.
- Skipping observability. Multi-step agents are impossible to debug without tracing.
- Optimizing latency before quality. Get quality right first; latency optimizations come second.
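For the termination pitfall, a minimal guard combines a hard step budget with a no-progress check that stops when recent steps add no new evidence. `StopGuard`, `run_loop`, and their defaults are hypothetical; frameworks expose their own limits (LangGraph, for example, has a recursion limit), so this illustrates the logic rather than any library's API.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class StopGuard:
    max_steps: int = 8         # hard budget on agent steps
    max_stale_steps: int = 2   # consecutive steps allowed to add nothing new
    steps: int = 0
    stale: int = 0
    seen: set[str] = field(default_factory=set)

    def should_stop(self, new_evidence_ids: list[str]) -> tuple[bool, str]:
        """Record one step's retrieved ids; return (stop?, reason)."""
        self.steps += 1
        fresh = set(new_evidence_ids) - self.seen
        self.seen |= fresh
        self.stale = 0 if fresh else self.stale + 1
        if self.steps >= self.max_steps:
            return True, "step budget exhausted"
        if self.stale >= self.max_stale_steps:
            return True, "no new evidence"
        return False, ""


def run_loop(batches: list[list[str]], guard: StopGuard) -> str:
    """Feed pre-scripted retrieval results to the guard; a real loop would call tools."""
    for ids in batches:
        stop, reason = guard.should_stop(ids)
        if stop:
            return reason
    return "policy finished"
```

The no-progress rule catches the common failure where the agent keeps reissuing near-identical queries; the step budget is the backstop for everything else.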
Tools that handle agentic RAG
- LangGraph: best for production stateful agents including agentic RAG
- LlamaIndex: RAG-first framework with strong agent abstractions
- CrewAI: fastest prototyping
- OpenAI Agents SDK: strong handoff patterns
For observability and evaluation specifically, agentic RAG breaks if you can't see and score the agent's decisions. Use Respan or equivalent to trace every step.
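A minimal, standard-library-only sketch of per-step tracing: a decorator that records each step's name, arguments, latency, and any error into an in-memory list. `traced`, `TRACE`, and the two example steps are hypothetical; a production setup would export these spans to an observability backend instead of keeping them in memory.

```python
from __future__ import annotations

import functools
import time
from typing import Any, Callable

TRACE: list[dict[str, Any]] = []  # one record ("span") per step call


def traced(step_name: str) -> Callable:
    """Decorator that appends a span for every call of the wrapped step."""
    def decorator(fn: Callable) -> Callable:
        @functools.wraps(fn)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            start = time.perf_counter()
            error = None
            try:
                return fn(*args, **kwargs)
            except Exception as exc:
                error = repr(exc)
                raise
            finally:
                # Runs on success and failure, so failed steps are traced too.
                TRACE.append({
                    "step": step_name,
                    "args": args,
                    "ms": (time.perf_counter() - start) * 1000,
                    "error": error,
                })
        return wrapper
    return decorator


@traced("rewrite")
def rewrite(query: str) -> str:
    return query.strip().lower()


@traced("retrieve")
def retrieve(query: str) -> list[str]:
    return [f"passage for {query}"]


retrieve(rewrite("  Stripe Q3  "))
for span in TRACE:
    print(span["step"], span["args"], f"{span['ms']:.3f} ms")
```

Because `rewrite` runs first, its span is recorded before the `retrieve` span, so the trace reads in execution order.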
How to start
If you have classic RAG working:
- Identify queries where classic RAG fails. Sample bad outputs from production traces.
- Build a query classifier that flags those queries for agentic flow.
- Build the agentic flow. Start with iterative refinement (retrieve → check → refine → retrieve → answer).
- Wire evals comparing classic vs agentic on the failure-mode queries.
- Roll out gradually. Start with 5% of traffic on agentic, monitor latency / cost / quality.
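A sketch of the last two steps: deterministic percentage bucketing, so a given user consistently lands in the same arm as the rollout grows, and a paired win-rate over a labelled failure-mode eval set. `in_rollout`, `win_rate`, the salt string, and the scores are hypothetical; hashing a salted user id with SHA-256 is one common bucketing choice, not a requirement.

```python
from __future__ import annotations

import hashlib


def in_rollout(user_id: str, percent: float, salt: str = "agentic-rag-v1") -> bool:
    """Stable bucketing: hash the salted id into [0, 100) and compare with percent.
    Raising percent only adds users; nobody already in the rollout drops out."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10_000 / 100  # 0.00 .. 99.99
    return bucket < percent


def win_rate(classic_scores: list[float], agentic_scores: list[float]) -> float:
    """Share of paired eval queries where agentic beats classic; ties count half."""
    if len(classic_scores) != len(agentic_scores) or not classic_scores:
        raise ValueError("need paired, non-empty score lists")
    wins = sum(1.0 if a > c else 0.5 if a == c else 0.0
               for c, a in zip(classic_scores, agentic_scores))
    return wins / len(classic_scores)


share = sum(in_rollout(f"user-{i}", 5) for i in range(10_000)) / 10_000
print(f"share of users in a 5% rollout: {share:.3f}")
```

Bucketing by user rather than by request keeps each user's experience consistent, which also makes latency, cost, and quality comparisons between the two arms cleaner.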
FAQ
Is agentic RAG always better than classic RAG? No. For simple queries, classic RAG is faster and cheaper. Agentic RAG wins on multi-hop, ambiguous, or multi-source queries.
How much more does agentic RAG cost? Roughly 5-10× more LLM calls per query. The cost is justified for queries classic RAG handles poorly.
Should I use it for chat applications? Selectively. Route simple turns to classic RAG, complex turns to agentic. A classifier at the entry decides.
Does agentic RAG work with any LLM? Yes. It's an architectural pattern, not a model feature, though models with stronger tool-use capability perform better. Claude Sonnet 4.6, GPT-5.5, and Gemini 3.1 Pro all do agentic RAG well.
What's the difference between agentic RAG and an agent that uses RAG as a tool? They overlap heavily. "Agentic RAG" usually emphasizes RAG as the primary activity; "agent with RAG tools" emphasizes a broader agent that includes retrieval. The implementations look similar.
Which framework is best? LangGraph for production stateful agents; LlamaIndex if your stack is RAG-first. See our framework comparison.