OpenAI embeddings are dense vectors that turn text into numbers a machine can compare. Two texts about the same topic end up close in vector space; two texts about different topics end up far apart. That is the whole game, and almost every retrieval-augmented generation (RAG) pipeline, semantic search box, clustering job, and classification pipeline in production today runs on embeddings of some flavor.
In 2026 the practical choice on the OpenAI side is simple: text-embedding-3-large for accuracy, text-embedding-3-small for cost and throughput. The legacy text-embedding-ada-002 is still callable but you should not start a new project on it. The interesting decisions are downstream: which dimension count to truncate to, how to batch, how to score similarity, and which vector store to land in.
This guide is the engineer's version. Code in Python and TypeScript, real prices, the trade-offs that show up in production. For the bigger picture on retrieval architecture, see what is a RAG pipeline and what is agentic RAG.
TL;DR
- Default to text-embedding-3-small. It costs $0.02 per million input tokens and handles most retrieval tasks well.
- Reach for text-embedding-3-large when recall matters more than cost. It scores 64.6 on MTEB, about 3.6 points higher than ada-002, at $0.13 per million tokens.
- Use the dimensions parameter to shrink vectors. A 3-large vector truncated to 256 dims still beats ada-002 at 1536, at a fraction of the storage cost.
- Batch aggressively. The Embeddings API accepts arrays. One request with 100 inputs is dramatically cheaper in wall-clock time than 100 sequential requests.
- Cosine similarity is the default. Normalized OpenAI vectors mean cosine and dot product give identical rankings.
- Land in a real vector store: pgvector if you already run Postgres; Pinecone, Qdrant, or Weaviate if you need scale or filtering features.
The three models, and which to pick
OpenAI ships three embedding models in 2026:
| Model | Default dims | Max input (tokens) | Price per 1M tokens | MTEB score |
|---|---|---|---|---|
| text-embedding-3-large | 3072 | 8191 | $0.13 | 64.6 |
| text-embedding-3-small | 1536 | 8191 | $0.02 | 62.3 |
| text-embedding-ada-002 (legacy) | 1536 | 8191 | $0.10 | 61.0 |
A few observations.
text-embedding-3-small is the obvious starting point. It is 5x cheaper than ada-002 for better quality, full stop. There is no reason to default to ada-002 for a new project.
text-embedding-3-large earns its keep on harder retrieval problems: technical documentation, multi-lingual corpora, anything where the top-5 result quality directly drives a downstream LLM. The price gap (6.5x) sounds large until you realize embedding costs are usually a single-digit percentage of total RAG cost. Generation tokens dominate.
The legacy ada-002 model is still callable. The only reason to keep using it is migration cost: you have millions of vectors stored, the recall is acceptable, and you have not budgeted a re-embedding job.
The dimensions parameter is the real lever
Both 3-large and 3-small were trained with Matryoshka representation learning, which means you can truncate the output vector and lose less quality than a naive PCA projection would imply. You set this via the dimensions parameter in the API call.
```python
from openai import OpenAI

client = OpenAI()

# Default: 3072 dimensions
full = client.embeddings.create(
    model="text-embedding-3-large",
    input="The quick brown fox jumps over the lazy dog.",
)

# Truncated: 512 dimensions, much cheaper to store
small = client.embeddings.create(
    model="text-embedding-3-large",
    input="The quick brown fox jumps over the lazy dog.",
    dimensions=512,
)

print(len(full.data[0].embedding))   # 3072
print(len(small.data[0].embedding))  # 512
```

OpenAI's published benchmark: text-embedding-3-large truncated to 256 dimensions still outperforms ada-002 at its native 1536 dimensions on MTEB. That is a 12x storage reduction versus the full 3072-dimension vector, with better quality than ada-002.
Practical heuristic for picking a dimension count:
- 256: prototyping, latency-sensitive search, embedded devices
- 512 to 1024: most production RAG systems
- 1536 to 3072: when you have the budget and recall is everything
The catch: once you commit a corpus to a particular dimension count, re-embedding to change it is non-trivial. Pick one, validate on a held-out eval set, then commit.
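One mitigation, as a hedged sketch: if you already store full 3072-dimension vectors, you can approximate a shorter embedding client-side by truncating and re-normalizing, which is essentially what the dimensions parameter does for these Matryoshka-trained models. You cannot go the other way, and you should still validate recall at the shorter length before relying on it.

```python
import numpy as np

def shorten_embedding(vec: list[float], dims: int = 256) -> np.ndarray:
    """Truncate a stored full-length embedding and re-normalize to unit length."""
    truncated = np.asarray(vec[:dims], dtype=np.float32)
    return truncated / np.linalg.norm(truncated)
```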
Calling the API: Python and TypeScript
The Python SDK is the canonical path.
```python
from openai import OpenAI

client = OpenAI()

response = client.embeddings.create(
    model="text-embedding-3-small",
    input=[
        "How do I reset my password?",
        "Where can I find my invoice?",
        "Is two-factor authentication supported?",
    ],
    dimensions=512,
)

vectors = [item.embedding for item in response.data]
print(f"Got {len(vectors)} vectors of dim {len(vectors[0])}")
```

The TypeScript SDK is the same shape.
```typescript
import OpenAI from "openai";

const client = new OpenAI();

const response = await client.embeddings.create({
  model: "text-embedding-3-small",
  input: [
    "How do I reset my password?",
    "Where can I find my invoice?",
    "Is two-factor authentication supported?",
  ],
  dimensions: 512,
});

const vectors = response.data.map((d) => d.embedding);
```

Three things to know about the input parameter.
First, it accepts either a single string or an array of strings. Always pass arrays in production. The marginal cost of adding inputs to one request is dominated by token count, not request count.
Second, the per-request token budget is large but not unbounded. The hard cap is 8191 tokens per individual input string. For batch requests, the API will return an error if any single input exceeds that. Chunk your documents before embedding; do not assume the API will do it for you.
Third, newlines in input are now handled correctly by the 3-series models. The old advice to strip newlines (which existed for ada-002) is obsolete.
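As an illustration of that chunking step, here is a minimal token-based splitter using tiktoken. The 500-token window and 50-token overlap are arbitrary choices for this sketch, not API requirements; tune them against your own corpus.

```python
import tiktoken

encoder = tiktoken.get_encoding("cl100k_base")  # tokenizer used by the 3-series models

def chunk_text(text: str, max_tokens: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping windows that each fit well under the 8191-token cap."""
    tokens = encoder.encode(text)
    chunks, start = [], 0
    while start < len(tokens):
        window = tokens[start : start + max_tokens]
        chunks.append(encoder.decode(window))
        if start + max_tokens >= len(tokens):
            break
        start += max_tokens - overlap
    return chunks
```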
Batching for cost and throughput
The Embeddings endpoint accepts up to 2048 inputs per request, capped by total token count. The right pattern for backfilling a corpus is to chunk into batches that respect both limits.
```python
from openai import OpenAI
import tiktoken

client = OpenAI()
encoder = tiktoken.encoding_for_model("text-embedding-3-small")

def batch_inputs(texts, max_tokens=8000, max_count=128):
    # Group texts into batches that respect a token budget and an input-count cap
    batch, batch_tokens = [], 0
    for text in texts:
        tokens = len(encoder.encode(text))
        if batch and (batch_tokens + tokens > max_tokens or len(batch) >= max_count):
            yield batch
            batch, batch_tokens = [], 0
        batch.append(text)
        batch_tokens += tokens
    if batch:
        yield batch

def embed_corpus(texts, model="text-embedding-3-small", dims=512):
    all_vectors = []
    for batch in batch_inputs(texts):
        response = client.embeddings.create(
            model=model, input=batch, dimensions=dims,
        )
        all_vectors.extend(item.embedding for item in response.data)
    return all_vectors
```

For very large corpora (millions of documents), use the Batch API. It cuts embedding costs in half: text-embedding-3-small drops to $0.01 per million tokens, text-embedding-3-large drops to $0.065 per million tokens. Turnaround is up to 24 hours, which is fine for nightly index rebuilds.
Cosine vs dot product
OpenAI's embedding vectors come out L2-normalized. That means cosine similarity and dot product produce identical rankings:
```python
import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def dot(a, b):
    return np.dot(a, b)
```

If your vectors are normalized (OpenAI's are), prefer dot product. It is a single fused-multiply-add per dimension; cosine adds two norms and a division. At billion-vector scale that matters. Pinecone, Qdrant, and pgvector all expose both metrics; pick dot for OpenAI embeddings and you save the wasted math.
If you mix OpenAI embeddings with embeddings from another provider that does not normalize, switch back to cosine to be safe.
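If you want to convince yourself of the identical-ranking claim, here is a quick check with random unit vectors standing in for real embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
docs = rng.normal(size=(1000, 512))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)   # L2-normalize, like OpenAI vectors
query = rng.normal(size=512)
query /= np.linalg.norm(query)

dot_rank = np.argsort(-docs @ query)
cos_rank = np.argsort(-(docs @ query) / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query)))

assert (dot_rank == cos_rank).all()  # identical ordering once vectors are normalized
```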
Vector store integration
The three sane options in 2026, by deployment model.
pgvector (Postgres)
If you already run Postgres, this is the answer. pgvector is a Postgres extension; vectors are a column type, similarity search is a SQL operator.
```sql
CREATE EXTENSION vector;

CREATE TABLE docs (
  id bigserial PRIMARY KEY,
  body text,
  embedding vector(512)
);

CREATE INDEX docs_embedding_idx
  ON docs USING hnsw (embedding vector_ip_ops);

-- Find the 10 most similar docs to a query embedding
SELECT id, body
FROM docs
ORDER BY embedding <#> $1
LIMIT 10;
```

The <#> operator is negative inner product, which gives the right ordering for normalized vectors. HNSW indexes (available since pgvector 0.5) dramatically outperform IVFFlat for typical RAG workloads.
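On the application side, a minimal query path might look like the sketch below. It assumes psycopg 3 and passes the embedding as a pgvector-style text literal; the pgvector Python package's register_vector adapter is an alternative. The table and column names match the SQL above.

```python
import psycopg
from openai import OpenAI

client = OpenAI()

def search(question: str, conn: psycopg.Connection, top_k: int = 10):
    emb = client.embeddings.create(
        model="text-embedding-3-small", input=question, dimensions=512,
    ).data[0].embedding
    # pgvector accepts a '[x, y, ...]' text literal cast to vector
    vec_literal = "[" + ",".join(str(x) for x in emb) + "]"
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, body FROM docs ORDER BY embedding <#> %s::vector LIMIT %s",
            (vec_literal, top_k),
        )
        return cur.fetchall()
```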
Pinecone
Pinecone is the managed option that requires the least operational thought. You create an index, upsert vectors with metadata, query with filters. The pricing is per-pod or per-serverless-usage; serverless is the right default for most teams.
```python
import os
from pinecone import Pinecone

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index = pc.Index("support-docs")

# "vector" and "query_vector" are embeddings from the OpenAI calls above
index.upsert(vectors=[
    {"id": "doc1", "values": vector, "metadata": {"source": "kb"}},
])

results = index.query(vector=query_vector, top_k=10, include_metadata=True)
```

Qdrant
Qdrant is the strong open-source competitor: Rust, fast, good filtering, available both self-hosted and managed. The API is similar to Pinecone's, and the operational surface is similar to running any modern database.
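A rough sketch with the qdrant-client Python package; the collection name and payload are placeholders, vector and query_vector are embeddings produced by the calls above, and newer client versions also expose a query_points API.

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

client = QdrantClient(url="http://localhost:6333")

# Dot-product distance, since OpenAI vectors are already normalized
client.create_collection(
    collection_name="support-docs",
    vectors_config=VectorParams(size=512, distance=Distance.DOT),
)

client.upsert(
    collection_name="support-docs",
    points=[PointStruct(id=1, vector=vector, payload={"source": "kb"})],
)

hits = client.search(collection_name="support-docs", query_vector=query_vector, limit=10)
```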
The right choice between the three is usually a question of operational stack, not technology. If you live in Postgres, use pgvector. If you want managed and do not want to think about it, use Pinecone. If you need self-hosting with strong filtering, use Qdrant.
Real use cases beyond RAG
Embeddings show up everywhere; RAG is the loud one.
Semantic search. Replace BM25 (or augment it with reciprocal rank fusion) for queries where users phrase things conceptually rather than by keyword. Most production systems use hybrid: BM25 for lexical match, embeddings for semantic match, fused at query time.
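A minimal reciprocal rank fusion sketch for that hybrid pattern. The bm25_search and vector_search functions are hypothetical and assumed to each return a ranked list of document ids; k=60 is the conventional damping constant.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of doc ids; each doc scores sum(1 / (k + rank))."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical usage: fuse lexical and semantic result lists
# fused = reciprocal_rank_fusion([bm25_search(query), vector_search(query)])
```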
Clustering. Embed support tickets, run HDBSCAN or k-means, look at the cluster sizes weekly. New large clusters are usually new product problems. Tag the clusters with an LLM pass and you have a free taxonomy.
Classification. For small label sets with limited training data, embed the input, compute cosine similarity to a labeled centroid per class, pick the closest. This beats a fine-tuned model when you have fewer than ~50 examples per class.
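A sketch of that centroid approach; embed_texts is a hypothetical helper that wraps the batched embedding call above and returns one normalized vector per input text.

```python
import numpy as np

def build_centroids(examples: dict[str, list[str]]) -> dict[str, np.ndarray]:
    """examples maps label -> example texts; each centroid is the normalized mean embedding."""
    centroids = {}
    for label, texts in examples.items():
        vecs = np.array(embed_texts(texts))  # hypothetical helper, see batching code above
        centroid = vecs.mean(axis=0)
        centroids[label] = centroid / np.linalg.norm(centroid)
    return centroids

def classify(text: str, centroids: dict[str, np.ndarray]) -> str:
    vec = np.array(embed_texts([text])[0])
    return max(centroids, key=lambda label: float(np.dot(vec, centroids[label])))
```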
Deduplication. Find near-duplicates by chunking, embedding, and comparing pairwise against a similarity threshold. Faster and more accurate than shingle-based MinHash on natural language.
Recommendation. "More like this" buttons. Embed the seed item, find the nearest neighbors, filter by business rules, return.
Common mistakes
A short list, ordered by how often I see them.
- Embedding the wrong thing. You embedded titles instead of titles + body, or chunks too small to carry context. Always look at five random retrievals before declaring a system done.
- No chunking strategy. Dumping 50,000-token PDFs as single embeddings buries the signal. Chunk to 200 to 800 tokens with 10 to 20% overlap for prose.
- Forgetting to truncate. Defaulting to 3072 dims for a 10M-vector index costs 4-6x more storage than necessary for the same recall.
- Cosine on already-normalized vectors. Wasted CPU. Use dot product.
- Single-input requests in a loop. Batch.
- No eval harness. You have no idea if your retrieval improved or regressed when you swapped models. Build a 200-query gold set and run it on every change; a minimal harness is sketched below.
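A minimal version of that harness, assuming a gold set of (query, relevant_doc_id) pairs and a search function (hypothetical here) that returns ranked doc ids:

```python
def evaluate(gold: list[tuple[str, str]], search, k: int = 10) -> dict[str, float]:
    """gold: (query, relevant_doc_id) pairs; search(query, k) returns ranked doc ids."""
    hits, reciprocal_ranks = 0, []
    for query, relevant_id in gold:
        results = search(query, k)
        if relevant_id in results:
            hits += 1
            reciprocal_ranks.append(1.0 / (results.index(relevant_id) + 1))
        else:
            reciprocal_ranks.append(0.0)
    return {
        f"recall@{k}": hits / len(gold),
        "mrr": sum(reciprocal_ranks) / len(gold),
    }
```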
Observability for embeddings
Embedding calls are a noticeable line item once you cross a few million documents. Track per-feature spend, latency, and token counts the same way you track chat completions. If you route through an LLM gateway, you get this for free, plus the ability to swap providers without code changes. See our Respan tracing docs for capturing embedding calls as spans in your traces.
FAQ
Should I use text-embedding-3-large or text-embedding-3-small?
Start with 3-small. Switch to 3-large only after you have an eval harness that shows the large model materially improves recall on your data.
What dimensions should I use? 512 for most production RAG. 256 for very large corpora or latency-sensitive paths. 1024+ when recall is everything and storage is cheap.
Is ada-002 still worth using?
Only for legacy systems where re-embedding is too expensive. For new projects, no.
Do I need to normalize the vectors? OpenAI returns L2-normalized vectors already. You can use dot product directly.
How do I evaluate embedding quality? Build a small set of (query, expected-relevant-doc-id) pairs. Compute recall@10 and MRR. Re-run on every change.
Can I fine-tune embeddings?
OpenAI does not currently support fine-tuning of text-embedding-3-* models. For domain-specific tuning, train a small projection layer on top, or use a self-hosted model like bge-m3 or e5-mistral.
How much will my project cost? Rough rule: 1M documents at 500 tokens each, embedded once with 3-small, is 500M tokens, which is $10. Re-embedding for a model change is the same. Production cost is almost always dominated by query embeddings (cheap) and downstream LLM calls (not cheap).