Semantic search is the technique of retrieving documents by meaning rather than by keyword overlap. A query for "how do I refund a charge" finds a help-desk article titled "issuing a customer reimbursement," even though the words do not match. The trick is to encode both query and documents into a shared vector space using an embedding model, then retrieve documents whose vectors are closest to the query vector. Geometry replaces string matching.
That sounds simple, and the proof of concept is small (one embedding model, one vector database, one cosine-similarity query). But production semantic search has more moving parts than people expect: chunking strategy, embedding choice, hybrid scoring with BM25, reranking with cross-encoders, evaluation harness, freshness and re-indexing. Get any of these wrong and your search either returns nothing useful or buries the right answer at rank 47.
In 2026, semantic search is the retrieval layer underneath nearly every RAG application, every "chat with your docs" feature, every agentic workflow that pulls context from a knowledge base. This guide is the engineer's view: what each component does, what to pick in 2026, and the code to wire it up.
TL;DR
- What it is: retrieving documents by vector similarity in an embedding space, instead of by keyword overlap.
- vs BM25: BM25 is fast, exact, and great at proper nouns and rare terms. Semantic search is robust to paraphrasing and synonyms. Hybrid is almost always better than either alone.
- Embeddings: OpenAI text-embedding-3-large or text-embedding-3-small for general English, Cohere embed-v4 for multilingual, BAAI bge-large-en-v1.5 for open-source.
- Vector databases: Pinecone (managed, scale), pgvector (in your Postgres), Qdrant (open-source self-host), Chroma (local dev), Weaviate (enterprise features).
- Hybrid + rerank: BM25 + dense, then cross-encoder rerank, is the production default in 2026.
- Evaluate: recall@k, MRR, NDCG. Build a small labeled set, run it nightly.
Semantic vs keyword search
Two failure modes motivate semantic search.
Vocabulary mismatch. A user types "my package never arrived." Your knowledge base says "shipment lost in transit." BM25 will not match those. An embedding model knows both phrases mean the same thing.
Multi-word concepts. "How do I write tests for async functions in Python" should retrieve the pytest-asyncio guide even if it never uses the literal phrase "write tests for async functions." Embeddings collapse the concept to a region in vector space; keyword search treats every word as independent.
Two failure modes motivate keeping keyword search.
Rare exact terms. A user types "SKU-7841-A". Embeddings of made-up identifiers are essentially random; BM25 retrieves the exact match instantly.
Speed and explainability. BM25 on an indexed corpus is sub-millisecond and the score decomposes into per-term contributions. Dense retrieval is opaque.
That is why nearly everyone runs both. BM25 catches the rare-term cases, semantic catches the paraphrase cases, and a rank fusion or reranker reconciles them.
How embeddings power semantic search
An embedding model is a neural network trained so that semantically similar inputs land near each other in a fixed-dimensional vector space. text-embedding-3-large outputs 3072-dimensional vectors; bge-large-en-v1.5 outputs 1024-dimensional vectors; Cohere embed-v4 supports flexible dimensions.
The training objective varies. Contrastive learning is the dominant approach: pull positive pairs (query and matching document) together, push negative pairs apart. The result is a function from text to a unit vector where cosine similarity correlates with semantic relatedness.
For semantic search, you embed every document chunk at index time and store the vectors. At query time, you embed the query and find the K nearest document vectors. "Nearest" is usually cosine similarity (equivalent to dot product if vectors are normalized), occasionally Euclidean or inner product.
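Stripped of the database, the core retrieval step is just a matrix-vector product over normalized vectors. A minimal sketch with numpy (array shapes and names are illustrative):
import numpy as np

def nearest_k(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 10):
    # Normalize so the dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                  # one cosine score per document
    top = np.argsort(-scores)[:k]   # indices of the k most similar documents
    return top, scores[top]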
The quality of your retrieval is bounded by the embedding model. A great database and a mediocre embedding model give mediocre results. Pick the embedding model first.
Choosing an embedding model in 2026
Three default picks.
OpenAI text-embedding-3-large. 3072 dimensions, strong general English, fine on common code. Cheap per token. Matryoshka representation: you can truncate to 1024 or 256 dimensions and lose only a little quality, which saves storage and improves search speed.
Cohere embed-v4. Multilingual, strong on long documents, supports input-type hints ("search_query" vs "search_document"). Good when your corpus is non-English or mixed-language.
bge-large-en-v1.5 (BAAI). Open-source, runs locally or on your own GPU. Competitive with OpenAI on MTEB. Use when you cannot send data to a hosted API.
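The Matryoshka truncation mentioned above is a single request parameter. A minimal sketch against the OpenAI embeddings API:
from openai import OpenAI

client = OpenAI()

# Ask for a 1024-dimensional vector instead of the full 3072; the model is
# trained so the leading dimensions carry most of the signal.
resp = client.embeddings.create(
    model="text-embedding-3-large",
    input="issuing a customer reimbursement",
    dimensions=1024,
)
vec = resp.data[0].embedding  # len(vec) == 1024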
A few practical rules.
- Match the embedding to the corpus language. English-only models on Japanese documents fail silently.
- Embed queries and documents with the same model. Mixing is a common bug.
- Some models distinguish "query" vs "document" mode. Honor it (see the sketch after this list); ignoring the distinction costs 1 to 3 points on recall@10.
- Re-embed everything when you change models. There is no migration path between embedding spaces.
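With bge-large-en-v1.5, for example, the model card recommends an instruction prefix on queries but not on documents. A minimal sketch with sentence-transformers:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5")

# Recommended query instruction for bge v1.5 English models;
# documents are embedded without a prefix.
QUERY_PREFIX = "Represent this sentence for searching relevant passages: "

doc_vecs = model.encode(["shipment lost in transit"], normalize_embeddings=True)
query_vec = model.encode([QUERY_PREFIX + "my package never arrived"],
                         normalize_embeddings=True)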
Vector database comparison
A short, opinionated tour.
Pinecone. Managed, scales to billions of vectors, hybrid search built in, namespaces for multi-tenancy. Default pick for teams that want managed and have budget.
pgvector. Postgres extension. Your vectors live in the same database as the rest of your data; you can JOIN against application tables. ivfflat and HNSW indexes are both supported. Scales to roughly 10M vectors comfortably; beyond that, sharding gets fiddly. Default pick for teams already on Postgres.
Qdrant. Open-source, written in Rust, fast HNSW, good filtering. Self-hostable or managed. Default pick for teams that want open-source plus production polish.
Chroma. Local-first, easy to start, great for prototypes and notebooks. Production deployments are possible but the operational story is younger.
Weaviate. GraphQL API, multi-tenancy, hybrid search, modules for external models. Enterprise-friendly. Strong on filtering and aggregation queries.
Most teams should not start with a dedicated vector database at all. Start with pgvector or Chroma, and migrate when you outgrow them. The query APIs are similar enough that switching is days, not weeks.
Code example: pgvector with text-embedding-3-large
End to end: schema, indexing, querying. To stay under pgvector's 2000-dimension index limit, the example truncates text-embedding-3-large vectors to 1024 dimensions.
CREATE EXTENSION IF NOT EXISTS vector;

-- pgvector's HNSW and ivfflat indexes support at most 2000 dimensions,
-- so store text-embedding-3-large vectors truncated to 1024.
CREATE TABLE docs (
    id BIGSERIAL PRIMARY KEY,
    content TEXT NOT NULL,
    embedding VECTOR(1024)
);

CREATE INDEX docs_embedding_hnsw
    ON docs USING hnsw (embedding vector_cosine_ops);

import os
import psycopg
from openai import OpenAI
client = OpenAI()
conn = psycopg.connect(os.environ["DATABASE_URL"])
def embed(text: str) -> list[float]:
    # text-embedding-3 models accept a dimensions parameter and return the
    # vector truncated (and re-normalized) to that length; 1024 matches the schema above.
    resp = client.embeddings.create(
        model="text-embedding-3-large",
        input=text,
        dimensions=1024,
    )
    return resp.data[0].embedding
def index(content: str):
    vec = embed(content)
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO docs (content, embedding) VALUES (%s, %s::vector)",
            (content, vec),
        )
    conn.commit()
def search(query: str, k: int = 10):
    qvec = embed(query)
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT id, content, 1 - (embedding <=> %s::vector) AS score
            FROM docs
            ORDER BY embedding <=> %s::vector
            LIMIT %s
            """,
            (qvec, qvec, k),
        )
        return cur.fetchall()

The <=> operator is pgvector's cosine distance; 1 - distance is cosine similarity. HNSW gives you sub-100ms queries up to a few million vectors.
For real-world chunking, split documents on semantic boundaries (paragraphs, sections), aim for 200 to 800 tokens per chunk, and keep document IDs so you can show the source.
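A minimal chunker along those lines, assuming paragraph breaks are meaningful in your corpus (token counts via tiktoken's cl100k_base encoding):
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def chunk(doc_id: str, text: str, max_tokens: int = 500) -> list[dict]:
    # Greedily pack whole paragraphs into chunks of at most max_tokens,
    # keeping the source document id with every chunk. A single paragraph
    # longer than the budget becomes its own oversized chunk.
    chunks, current, count = [], [], 0
    for para in text.split("\n\n"):
        n = len(enc.encode(para))
        if current and count + n > max_tokens:
            chunks.append({"doc_id": doc_id, "content": "\n\n".join(current)})
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append({"doc_id": doc_id, "content": "\n\n".join(current)})
    return chunks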
Hybrid search: BM25 plus dense
Hybrid search runs both retrievers and combines them. Two common strategies.
Reciprocal Rank Fusion (RRF). Run BM25 and dense independently, get a ranked list from each, score each document by sum(1 / (k + rank_i)) across retrievers. Robust, parameter-light, hard to outperform without tuning.
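A minimal RRF sketch, assuming each retriever hands back an ordered list of document ids (k = 60 is the constant commonly used in practice):
def rrf(rankings: list[list[int]], k: int = 60, top_n: int = 10):
    # rankings: one ordered list of doc ids per retriever, best first.
    scores: dict[int, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]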
Weighted score combination. Normalize BM25 and cosine scores to the same range, take a weighted sum. Needs tuning per corpus.
In Postgres, you can run both retrievers and combine them in one query, using a tsvector column for BM25-ish ranking and pgvector for dense. The docs table from earlier needs the text-search column and a GIN index first:
-- A generated column keeps the tsvector in sync with content automatically.
ALTER TABLE docs ADD COLUMN tsv TSVECTOR
    GENERATED ALWAYS AS (to_tsvector('english', content)) STORED;
CREATE INDEX docs_tsv_gin ON docs USING gin (tsv);

WITH bm25 AS (
    SELECT id, ts_rank(tsv, plainto_tsquery('english', %(q)s)) AS score
    FROM docs
    WHERE tsv @@ plainto_tsquery('english', %(q)s)
    ORDER BY score DESC
    LIMIT 50
),
dense AS (
    SELECT id, 1 - (embedding <=> %(qvec)s::vector) AS score
    FROM docs
    ORDER BY embedding <=> %(qvec)s::vector
    LIMIT 50
)
SELECT d.id, d.content,
       COALESCE(b.score, 0) * 0.4 + COALESCE(de.score, 0) * 0.6 AS score
FROM docs d
LEFT JOIN bm25 b ON b.id = d.id
LEFT JOIN dense de ON de.id = d.id
WHERE b.id IS NOT NULL OR de.id IS NOT NULL
ORDER BY score DESC
LIMIT 10;

Test with and without hybrid on your own eval set. The lift over pure dense is usually 5 to 15 points of recall@10 for English business content. Note that ts_rank is not bounded to the same range as cosine similarity, so treat the 0.4/0.6 weights as starting points to tune rather than a calibrated split.
Reranking with cross-encoders
After hybrid retrieval, you typically have 20 to 100 candidates. The next step is reranking: a small, expensive model reads (query, document) pairs and scores them in context. This catches cases where the embedding got the gist but missed a detail.
Two production options.
Cohere Rerank (rerank-v3 in 2026). Hosted API. Send query plus up to 1000 documents, get scored results back. Fast and consistent.
bge-reranker-v2-m3 (BAAI). Open-source, multilingual, runs locally on GPU. Roughly competitive with hosted rerankers on public reranking benchmarks.
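A local sketch using sentence-transformers' CrossEncoder wrapper, which can load bge reranker checkpoints (BAAI's own FlagEmbedding library is an alternative):
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-v2-m3", max_length=512)

def rerank_local(query: str, candidates: list[str], top_k: int = 5):
    # Score every (query, candidate) pair, then keep the highest-scoring ones.
    scores = reranker.predict([(query, c) for c in candidates])
    ranked = sorted(enumerate(scores), key=lambda x: x[1], reverse=True)
    return ranked[:top_k]  # list of (candidate index, score)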
import os

import cohere

co = cohere.Client(os.environ["COHERE_API_KEY"])

def rerank(query: str, candidates: list[str], top_k: int = 5):
    resp = co.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=candidates,
        top_n=top_k,
    )
    # Each result carries the index into `candidates` plus a relevance score.
    return [(r.index, r.relevance_score) for r in resp.results]

Wire reranking after hybrid retrieval. The standard pipeline is: BM25 (top 50) + dense (top 50), fuse, rerank, take the top 5 or 10. That stack is the 2026 production default for RAG.
Evaluation: recall@k, MRR, NDCG
You cannot improve search you cannot measure. Build a labeled set early.
Recall@k. Of the relevant documents that exist, how many did you retrieve in the top k? Most important early metric for RAG.
Mean Reciprocal Rank (MRR). Reciprocal rank of the first relevant result, averaged across queries. Good when there is usually one correct answer.
NDCG (Normalized Discounted Cumulative Gain). Rewards relevant results higher in the ranking, with graded relevance labels. Use when relevance is not binary.
A minimal harness.
def recall_at_k(queries, retrieve, k=10):
    # queries: [{"query": str, "relevant": [doc_id, ...]}, ...]
    # retrieve: function returning rows whose first field is the doc id.
    total = 0.0
    for q in queries:
        retrieved = {row[0] for row in retrieve(q["query"], k=k)}
        relevant = set(q["relevant"])
        total += len(retrieved & relevant) / len(relevant)
    return total / len(queries)

Run nightly. Track metrics over time. When recall@10 drops, something broke (a model swap, a chunking change, a stale index). When it climbs, your latest tweak worked. See prompt evaluation for the broader pattern of evals as a first-class artifact.
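MRR follows the same pattern; a minimal sketch assuming the same query and retriever interface:
def mrr(queries, retrieve, k=10):
    total = 0.0
    for q in queries:
        retrieved = [row[0] for row in retrieve(q["query"], k=k)]
        relevant = set(q["relevant"])
        # Reciprocal rank of the first relevant hit, 0 if none in the top k.
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(queries)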
Integration with RAG pipelines
Semantic search is the retrieval layer of RAG. The end-to-end shape (sketched in code after the list):
- Chunk and embed documents at index time.
- At query time, embed the user question, retrieve top-K candidates with hybrid search.
- Rerank to top 5 or 10.
- Format retrieved chunks into the LLM prompt.
- Generate the answer.
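A compressed sketch of steps 2 through 5, assuming the search and rerank helpers defined earlier and a hypothetical call_llm() wrapper around your chat-completion client:
def answer(query: str) -> str:
    # Retrieve candidates (hybrid retrieval would slot in here).
    candidates = [content for _, content, _ in search(query, k=50)]
    # Rerank and keep the best few chunks.
    top = [candidates[i] for i, _ in rerank(query, candidates, top_k=5)]
    # Format the retrieved chunks into the prompt.
    context = "\n\n---\n\n".join(top)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    # Generate; call_llm is a hypothetical helper around your LLM of choice.
    return call_llm(prompt)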
Agentic RAG extends this: the agent decides when to retrieve, what query to send, whether to re-retrieve based on partial answers. Semantic search is still the underlying retrieval primitive; the agent just calls it as a tool.
For implementation details on embeddings specifically, including model selection and dimensionality, see the OpenAI embeddings guide.
FAQ
Can I skip BM25 and just use dense retrieval? You can, but you will leave recall on the table. Add BM25 as soon as your corpus contains proper nouns, product codes, or rare terms. The marginal complexity is small; the gain is real.
How long should my chunks be? 200 to 800 tokens is the usual range. Shorter chunks give precise matches but lose context; longer chunks give context but dilute the embedding. Test on your eval set.
Do I need to embed metadata? Sometimes. Including the document title in the chunk text often helps. Including unrelated metadata (author, timestamp) usually hurts because it dilutes the semantic signal.
Is cosine similarity always the right metric? For normalized embeddings, cosine and dot product give the same ranking. Most embedding models are L2-normalized by default. Euclidean distance also gives the same ranking for unit vectors (squared distance = 2 - 2 * cosine similarity). Pick cosine and move on.
What about MMR (Maximal Marginal Relevance)? MMR re-ranks results to balance relevance and diversity. Useful when the top 10 hits are all near-duplicates and the user needs varied results. Cheaper than a cross-encoder rerank but solves a different problem.
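A minimal MMR sketch over normalized embedding vectors, where lambda_ trades off relevance against diversity:
import numpy as np

def mmr(query_vec, doc_vecs, lambda_=0.7, top_k=10):
    # Greedily pick documents that are relevant to the query but
    # dissimilar to documents already selected. Vectors assumed normalized.
    relevance = doc_vecs @ query_vec
    selected: list[int] = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < top_k:
        if not selected:
            best = candidates[int(np.argmax(relevance[candidates]))]
        else:
            sims_to_selected = doc_vecs[candidates] @ doc_vecs[selected].T
            penalty = sims_to_selected.max(axis=1)
            scores = lambda_ * relevance[candidates] - (1 - lambda_) * penalty
            best = candidates[int(np.argmax(scores))]
        selected.append(best)
        candidates.remove(best)
    return selected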
How fresh does my index need to be? Depends on the corpus. Static documentation can re-index weekly. Customer support tickets need near-real-time. For pgvector, an incremental indexer with a timestamp watermark is the simplest approach.
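A minimal watermark loop, assuming a hypothetical source_articles table with an updated_at column and the embed helper from earlier:
import datetime

def incremental_index(conn, last_run: datetime.datetime):
    # Re-embed only rows that changed since the previous run.
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, content FROM source_articles WHERE updated_at > %s",
            (last_run,),
        )
        rows = cur.fetchall()
    for doc_id, content in rows:
        vec = embed(content)
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO docs (id, content, embedding) VALUES (%s, %s, %s::vector) "
                "ON CONFLICT (id) DO UPDATE SET content = EXCLUDED.content, "
                "embedding = EXCLUDED.embedding",
                (doc_id, content, vec),
            )
    conn.commit()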
Do I need a GPU? For embeddings: no, hosted APIs work fine for most teams. For reranking on bge models at scale: yes. For Cohere Rerank: no, it is hosted.