A RAG pipeline is the end-to-end system that combines a knowledge source with a large language model to produce grounded, cited answers. Documents in, embeddings out, retrievals at query time, augmented prompts, generated answers — that's the loop. Building a working RAG pipeline is the first thing many engineering teams ship in their LLM journey; building one that survives production is where most stumble.
This guide is the architectural map. For specific implementation choices, see the related guides linked throughout.
TL;DR
A RAG pipeline has five components:
- Document ingestion — load and clean source documents
- Chunking — split documents into retrievable pieces
- Embedding — convert chunks to vectors and store in a vector DB
- Retrieval — at query time, embed the query and find relevant chunks
- Generation — pass retrieved chunks + user query to an LLM, return grounded answer
That's the basic shape. Production systems add re-ranking, hybrid search (vector + keyword), agentic retrieval loops, citation tracking, and evaluation pipelines. The classic pattern works for simple Q&A; complex domains need the additions.
The five components
1. Document ingestion
Load your source data — PDFs, web pages, internal docs, transcripts, code, structured data. Each source has its own ingestion pattern:
- PDFs: parsing libraries (pypdf, pdfplumber, LlamaParse for complex layouts)
- Web: HTML cleaners + readability extraction
- Structured data: SQL or API → Markdown
- Multimodal: separate pipeline for image OCR, video transcript, audio transcript
The hardest part is the long tail of document types. Multi-column PDFs, scanned tables, embedded images — these break naive parsers. Most production RAG pipelines spend more engineering time on ingestion quality than on any other component.
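As a sketch of the happy path, here is a minimal ingestion loop using pypdf, assuming a folder of simple single-column PDFs; the folder path and the record shape are placeholders, and complex layouts would route through pdfplumber or LlamaParse instead.

```python
from pathlib import Path

from pypdf import PdfReader


def ingest_pdfs(folder: str) -> list[dict]:
    """Load every PDF in a folder and return one record per document."""
    docs = []
    for path in Path(folder).glob("*.pdf"):
        reader = PdfReader(path)
        # extract_text() can return None for image-only pages, hence the fallback
        text = "\n".join(page.extract_text() or "" for page in reader.pages)
        docs.append({"source": path.name, "text": text})
    return docs
```

Scanned or multi-column documents will come out garbled from this kind of naive extraction, which is exactly the long-tail problem described above.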
2. Chunking
Split documents into retrievable units (chunks). Common strategies:
- Fixed-size — N tokens per chunk, with overlap (simplest, often good enough)
- Semantic — split on natural boundaries (paragraphs, sections, code blocks)
- Hierarchical — chunks at multiple granularities (full doc, section, paragraph)
- Late chunking — embed full document, then derive chunk embeddings (newer technique, better recall)
Chunk size matters. Too small: chunks lack context. Too large: retrieval is imprecise. Common sweet spot: 200-500 tokens per chunk with ~50-token overlap.
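A minimal fixed-size chunker might look like the sketch below. It approximates tokens with whitespace-separated words to stay self-contained; a real pipeline would count tokens with the embedding model's tokenizer (e.g. tiktoken).

```python
def chunk_text(text: str, chunk_size: int = 400, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks of ~chunk_size tokens with overlap."""
    words = text.split()  # crude stand-in for real tokenization
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
    return chunks
```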
3. Embedding
Convert chunks to dense vectors using an embedding model (OpenAI text-embedding-3-large, Cohere, Voyage, or open-source like BGE). Store in a vector database:
- Pinecone — managed, simple
- Weaviate — open source, hybrid search built in
- Qdrant — open source, performant
- pgvector — Postgres extension, simplest if you already use Postgres
- Turbopuffer — newer, cost-effective
Choice depends on scale, cost sensitivity, existing infrastructure. For most teams, pgvector or Pinecone is the starting point.
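A minimal sketch of the embed-and-store step, assuming the pgvector route with psycopg and OpenAI's text-embedding-3-large (3,072 dimensions); the table name and connection string are placeholders.

```python
import psycopg
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def embed(texts: list[str]) -> list[list[float]]:
    """Embed a batch of chunks with text-embedding-3-large."""
    resp = client.embeddings.create(model="text-embedding-3-large", input=texts)
    return [item.embedding for item in resp.data]


def store(chunks: list[str], dsn: str = "dbname=rag") -> None:
    """Store chunk text and embeddings in a pgvector-backed Postgres table."""
    vectors = embed(chunks)
    with psycopg.connect(dsn) as conn:
        conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
        conn.execute(
            "CREATE TABLE IF NOT EXISTS chunks "
            "(id bigserial PRIMARY KEY, content text, embedding vector(3072))"
        )
        for chunk, vec in zip(chunks, vectors):
            conn.execute(
                "INSERT INTO chunks (content, embedding) VALUES (%s, %s)",
                (chunk, str(vec)),  # pgvector accepts the '[x, y, ...]' text form
            )
```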
4. Retrieval
At query time:
- Embed the user query with the same embedding model
- Vector search to find top-k similar chunks
- Optionally re-rank with a cross-encoder for higher precision
- Optionally hybrid search (combine vector with BM25 keyword search)
Top-k is usually 5-20. Re-ranking with a model like Cohere Rerank or BGE Reranker meaningfully improves answer quality but adds 100-300ms latency.
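Continuing the pgvector sketch above, query-time retrieval can be as small as embedding the query and ordering by cosine distance (pgvector's <=> operator); top-k and the connection string are placeholders.

```python
def retrieve(query: str, k: int = 10, dsn: str = "dbname=rag") -> list[str]:
    """Embed the query and return the k nearest chunks by cosine distance."""
    qvec = embed([query])[0]  # embed() from the embedding step above
    with psycopg.connect(dsn) as conn:
        rows = conn.execute(
            "SELECT content FROM chunks ORDER BY embedding <=> %s::vector LIMIT %s",
            (str(qvec), k),
        ).fetchall()
    return [content for (content,) in rows]
```

Re-ranking would slot in after this call: take the top 50 or so from vector search, score each (query, chunk) pair with a cross-encoder, and keep the best 5-10.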
For complex queries, agentic RAG replaces single-shot retrieval with a multi-step agent that decides when to retrieve, what to retrieve, and when to stop.
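A heavily simplified sketch of that loop, assuming an llm() helper that wraps whatever chat model you use and the retrieve() function from the retrieval sketch above; neither is a real framework API.

```python
def agentic_answer(question: str, max_steps: int = 3) -> str:
    """Let the model issue follow-up searches before answering (simplified)."""
    context: list[str] = []
    query = question
    for _ in range(max_steps):
        context += retrieve(query, k=5)
        decision = llm(
            f"Question: {question}\nContext so far: {context}\n"
            "Reply 'SEARCH: <new query>' if more context is needed, else 'ANSWER'."
        )
        if decision.startswith("SEARCH:"):
            query = decision.removeprefix("SEARCH:").strip()
        else:
            break
    return llm(f"Answer using ONLY this context: {context}\nQuestion: {question}")
```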
5. Generation
Pass retrieved chunks + user query to an LLM with a prompt like:
Answer the user's question using ONLY the retrieved context.
If the context doesn't contain the answer, say so.
Context: {retrieved_chunks}
Question: {user_query}
Answer (with citations):
The model produces an answer grounded in the retrieved chunks, ideally with citations linking claims back to source documents.
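Wiring that template to a chat model takes only a few lines. The sketch below assumes an OpenAI-style client and leaves the model name as a placeholder; numbering the context blocks makes it easy for the model to cite sources as [1], [2], and so on.

```python
from openai import OpenAI

client = OpenAI()
GENERATION_MODEL = "..."  # set to whichever chat model you deploy


def generate(question: str, chunks: list[str]) -> str:
    """Build the grounded prompt and ask the model for a cited answer."""
    context = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(chunks))
    prompt = (
        "Answer the user's question using ONLY the retrieved context.\n"
        "If the context doesn't contain the answer, say so.\n\n"
        f"Context: {context}\n\nQuestion: {question}\n\nAnswer (with citations):"
    )
    resp = client.chat.completions.create(
        model=GENERATION_MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```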
Common production additions
On top of the basic pipeline, production RAG systems usually add:
- Re-ranking — cross-encoder ranks the top-k from vector search for higher precision
- Hybrid search — combine vector similarity with BM25 keyword search; better recall on specific terms and acronyms (see the fusion sketch after this list)
- Query rewriting — agent rewrites query before retrieval (handles ambiguity, expansion)
- Metadata filtering — filter retrieval by tags (date, source, language) before semantic ranking
- Citation extraction — parse model output to extract which chunks were referenced
- Eval pipeline — score retrieval quality and answer quality separately (see /llm-evals)
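For hybrid search, one common way to merge the two result lists is reciprocal rank fusion. A minimal sketch, assuming each retriever returns an ordered list of chunk ids:

```python
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists (e.g. vector and BM25 results) by reciprocal rank."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)


# fused = reciprocal_rank_fusion([vector_ids, bm25_ids])
```

The constant k dampens the influence of any single list; 60 is the conventional default from the original RRF paper.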
Common RAG pipeline mistakes
- Bad chunking — too small, too large, or breaking on bad boundaries (mid-sentence). Chunking quality dominates retrieval quality.
- Wrong embedding model for the domain. General-purpose embeddings struggle with specialized vocabularies (legal, medical). Sometimes a fine-tuned embedding model wins.
- No re-ranking. Top-k vector search has decent recall but mediocre precision. Re-ranking is cheap relative to the quality boost.
- No retrieval evaluation. Many teams evaluate end-to-end answer quality but never measure whether the right docs got retrieved. Decompose: retrieval evals separate from generation evals (see the sketch after this list).
- One-shot retrieval for complex queries. Multi-hop questions need agentic retrieval.
- Skipping production observability. Tracing every retrieval + generation step is essential for debugging.
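As a concrete example of a retrieval-only eval, the sketch below computes recall@k over a labeled test set. The test-set shape and the retrieve_ids helper (which is assumed to return the source ids of the top-k chunks) are illustrative, not part of any framework.

```python
def retrieval_recall_at_k(test_set: list[dict], retrieve_ids, k: int = 10) -> float:
    """Fraction of test queries whose labeled source doc appears in the top-k.

    Each test case looks like {"query": "...", "relevant_doc": "pricing.pdf"};
    retrieve_ids(query, k) returns the source ids of the top-k retrieved chunks.
    """
    hits = sum(
        case["relevant_doc"] in retrieve_ids(case["query"], k)
        for case in test_set
    )
    return hits / len(test_set)
```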
When RAG isn't the right answer
Three scenarios:
- Information fits in the prompt. If your knowledge base is under ~50 pages, just stuff it in the system prompt with prompt caching. Cheaper and simpler than RAG.
- Real-time information is needed. Use Perplexity or web search instead of indexed documents.
- Structured data Q&A. SQL or API querying often beats RAG for "what's our customer count last quarter?"
How to start
If you're building your first RAG pipeline:
- Day 1: ingest a small corpus (50-200 documents). Use OpenAI text-embedding-3-large and pgvector or Pinecone. Fixed-size chunks of 400 tokens with 50-token overlap.
- Day 2-3: build the retrieval + generation loop. Use a modern model (Sonnet 4.6 or GPT-5.4) for generation.
- Day 4-5: build a 50-example test set sampled from real user queries. Score retrieval (was the right doc retrieved?) and generation (was the answer correct?) separately.
- Week 2: add re-ranking with Cohere Rerank or BGE Reranker. Compare scores.
- Week 3: add hybrid search (vector + BM25). Compare scores.
- Week 4: production deploy with observability and online evals.
This path gets you to a working production RAG pipeline in a month. Going faster is possible, but the scoring/eval steps are what separate "demo" from "production," so don't skip them.
Tools and frameworks
- LlamaIndex — RAG-first framework, best out of the box
- LangChain — broad framework, retrievers + vectorstore wrappers
- LlamaParse — best parser for complex documents (paid)
- Respan — observability + evals over the full RAG loop
FAQ
What does RAG stand for? Retrieval-Augmented Generation. Retrieve relevant context, augment the LLM prompt with it, generate the answer.
Do I need a vector database? For RAG at any meaningful scale, yes. For under ~10k documents, pgvector inside an existing Postgres works. Above that, dedicated vector DBs (Pinecone, Weaviate, Qdrant) are the right call.
Which embedding model should I use? OpenAI's text-embedding-3-large is a strong default. Voyage and Cohere are competitive. For specialized domains (legal, medical, code), test fine-tuned alternatives.
How do I evaluate a RAG pipeline? Split into retrieval evals (was the right document retrieved?) and generation evals (was the answer correct given the retrieved docs?). Different failure modes. See our LLM Evals guide.
What's the difference between RAG and fine-tuning? RAG provides knowledge at inference time via retrieval. Fine-tuning bakes knowledge into the model weights. RAG is faster to update (re-index docs) and cheaper than fine-tuning. Most teams should start with RAG.
Should I use agentic RAG? For simple queries, no — overkill. For multi-hop, ambiguous, or multi-source queries, yes. See our agentic RAG explainer.
How big should chunks be? Common sweet spot: 200-500 tokens with 50-token overlap. Test on your specific data.