OpenAI embeddings are dense vectors that turn text into numbers a machine can compare. Two texts about the same topic end up close in vector space; two texts about different topics end up far apart. That is the whole game, and almost every retrieval-augmented generation (RAG) pipeline, semantic search box, clustering job, and classification pipeline in production today runs on embeddings of some flavor.
In 2026 the practical choice on the OpenAI side is simple: text-embedding-3-large for accuracy, text-embedding-3-small for cost and throughput. The legacy text-embedding-ada-002 is still callable but you should not start a new project on it. The interesting decisions are downstream: which dimension count to truncate to, how to batch, how to score similarity, and which vector store to land in.
This guide is the engineer's version. Code in Python and TypeScript, real prices, the trade-offs that show up in production. For the bigger picture on retrieval architecture, see what is a RAG pipeline and what is agentic RAG.
TL;DR
- Default to text-embedding-3-small. It costs $0.02 per million input tokens and handles most retrieval tasks well.
- Reach for text-embedding-3-large when recall matters more than cost. It scores 64.6 on MTEB, about 3.6 points higher than ada-002, at $0.13 per million tokens.
- Use the dimensions parameter to shrink vectors. A 3-large vector truncated to 256 dims still beats ada-002 at 1536, at a fraction of the storage cost.
- Batch aggressively. The Embeddings API accepts arrays. One request with 100 inputs is dramatically cheaper in wall-clock time than 100 sequential requests.
- Cosine similarity is the default. Normalized OpenAI vectors mean cosine and dot product give identical rankings.
- Land in a real vector store: pgvector if you already run Postgres; Pinecone, Qdrant, or Weaviate if you need scale or filtering features.
The three models, and which to pick
OpenAI ships three embedding models in 2026:
| Model | Default dims | Max input (tokens) | Price per 1M tokens | MTEB score |
|---|---|---|---|---|
| text-embedding-3-large | 3072 | 8191 | $0.13 | 64.6 |
| text-embedding-3-small | 1536 | 8191 | $0.02 | 62.3 |
| text-embedding-ada-002 (legacy) | 1536 | 8191 | $0.10 | 61.0 |
A few observations.
text-embedding-3-small is the obvious starting point. It is 5x cheaper than ada-002 for better quality, full stop. There is no reason to default to ada-002 for a new project.
text-embedding-3-large earns its keep on harder retrieval problems: technical documentation, multi-lingual corpora, anything where the top-5 result quality directly drives a downstream LLM. The price gap (6.5x) sounds large until you realize embedding costs are usually a single-digit percentage of total RAG cost. Generation tokens dominate.
The legacy ada-002 model is still callable. The only reason to keep using it is migration cost: you have millions of vectors stored, the recall is acceptable, and you have not budgeted a re-embedding job.
The dimensions parameter is the real lever
Both 3-large and 3-small were trained with Matryoshka representation learning, which means you can truncate the output vector and lose less quality than a naive PCA projection would imply. You set this via the dimensions parameter in the API call.
```python
from openai import OpenAI

client = OpenAI()

# Default: 3072 dimensions
full = client.embeddings.create(
    model="text-embedding-3-large",
    input="The quick brown fox jumps over the lazy dog.",
)

# Truncated: 512 dimensions, much cheaper to store
small = client.embeddings.create(
    model="text-embedding-3-large",
    input="The quick brown fox jumps over the lazy dog.",
    dimensions=512,
)

print(len(full.data[0].embedding))   # 3072
print(len(small.data[0].embedding))  # 512
```

OpenAI's published benchmark: text-embedding-3-large truncated to 256 dimensions still outperforms ada-002 at its native 1536 dimensions on MTEB. That is a 12x storage reduction versus the full 3072-dimension vector, with better quality than ada-002.
Practical heuristic for picking a dimension count:
- 256: prototyping, latency-sensitive search, embedded devices
- 512 to 1024: most production RAG systems
- 1536 to 3072: when you have the budget and recall is everything
The catch: once you commit a corpus to a particular dimension count, re-embedding to change it is non-trivial. Pick one, validate on a held-out eval set, then commit.
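One mitigation, as a hedged sketch: if you already store full 3072-dimension vectors, you can approximate a shorter embedding client-side by truncating and re-normalizing, which is essentially what the dimensions parameter does for these Matryoshka-trained models. You cannot go the other way, and you should still validate recall at the shorter length before relying on it.

```python
import numpy as np

def shorten_embedding(vec: list[float], dims: int = 256) -> np.ndarray:
    """Truncate a stored full-length embedding and re-normalize to unit length."""
    truncated = np.asarray(vec[:dims], dtype=np.float32)
    return truncated / np.linalg.norm(truncated)
```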
Calling the API: Python and TypeScript
The Python SDK is the canonical path.
```python
from openai import OpenAI

client = OpenAI()

response = client.embeddings.create(
    model="text-embedding-3-small",
    input=[
        "How do I reset my password?",
        "Where can I find my invoice?",
        "Is two-factor authentication supported?",
    ],
    dimensions=512,
)

vectors = [item.embedding for item in response.data]
print(f"Got {len(vectors)} vectors of dim {len(vectors[0])}")
```

The TypeScript SDK is the same shape.
```typescript
import OpenAI from "openai";

const client = new OpenAI();

const response = await client.embeddings.create({
  model: "text-embedding-3-small",
  input: [
    "How do I reset my password?",
    "Where can I find my invoice?",
    "Is two-factor authentication supported?",
  ],
  dimensions: 512,
});

const vectors = response.data.map((d) => d.embedding);
```

Three things to know about the input parameter.
First, it accepts either a single string or an array of strings. Always pass arrays in production. The marginal cost of adding inputs to one request is dominated by token count, not request count.
Second, the per-request token budget is large but not unbounded. The hard cap is 8191 tokens per individual input string. For batch requests, the API will return an error if any single input exceeds that. Chunk your documents before embedding; do not assume the API will do it for you.
Third, newlines in input are now handled correctly by the 3-series models. The old advice to strip newlines (which existed for ada-002) is obsolete.
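As an illustration of that chunking step, here is a minimal token-based splitter using tiktoken. The 500-token window and 50-token overlap are arbitrary choices for this sketch, not API requirements; tune them against your own corpus.

```python
import tiktoken

encoder = tiktoken.get_encoding("cl100k_base")  # tokenizer used by the 3-series models

def chunk_text(text: str, max_tokens: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping windows that each fit well under the 8191-token cap."""
    tokens = encoder.encode(text)
    chunks, start = [], 0
    while start < len(tokens):
        window = tokens[start : start + max_tokens]
        chunks.append(encoder.decode(window))
        if start + max_tokens >= len(tokens):
            break
        start += max_tokens - overlap
    return chunks
```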
Batching for cost and throughput
The Embeddings endpoint accepts up to 2048 inputs per request, capped by total token count. The right pattern for backfilling a corpus is to chunk into batches that respect both limits.
```python
from openai import OpenAI
import tiktoken

client = OpenAI()
encoder = tiktoken.encoding_for_model("text-embedding-3-small")

def batch_inputs(texts, max_tokens=8000, max_count=128):
    # Group texts into batches that respect a token budget and an input-count cap
    batch, batch_tokens = [], 0
    for text in texts:
        tokens = len(encoder.encode(text))
        if batch and (batch_tokens + tokens > max_tokens or len(batch) >= max_count):
            yield batch
            batch, batch_tokens = [], 0
        batch.append(text)
        batch_tokens += tokens
    if batch:
        yield batch

def embed_corpus(texts, model="text-embedding-3-small", dims=512):
    all_vectors = []
    for batch in batch_inputs(texts):
        response = client.embeddings.create(
            model=model, input=batch, dimensions=dims,
        )
        all_vectors.extend(item.embedding for item in response.data)
    return all_vectors
```

For very large corpora (millions of documents), use the Batch API. It cuts embedding costs in half: text-embedding-3-small drops to $0.01 per million tokens, text-embedding-3-large drops to $0.065 per million tokens. Turnaround is up to 24 hours, which is fine for nightly index rebuilds.
Cosine vs dot product
OpenAI's embedding vectors come out L2-normalized. That means cosine similarity and dot product produce identical rankings:
```python
import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def dot(a, b):
    return np.dot(a, b)
```

If your vectors are normalized (OpenAI's are), prefer dot product. It is a single fused-multiply-add per dimension; cosine adds two norms and a division. At billion-vector scale that matters. Pinecone, Qdrant, and pgvector all expose both metrics; pick dot for OpenAI embeddings and you save the wasted math.
If you mix OpenAI embeddings with embeddings from another provider that does not normalize, switch back to cosine to be safe.
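If you want to convince yourself of the identical-ranking claim, here is a quick check with random unit vectors standing in for real embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
docs = rng.normal(size=(1000, 512))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)   # L2-normalize, like OpenAI vectors
query = rng.normal(size=512)
query /= np.linalg.norm(query)

dot_rank = np.argsort(-docs @ query)
cos_rank = np.argsort(-(docs @ query) / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query)))

assert (dot_rank == cos_rank).all()  # identical ordering once vectors are normalized
```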
Vector store integration
The three sane options in 2026, by deployment model.
pgvector (Postgres)
If you already run Postgres, this is the answer. pgvector is a Postgres extension; vectors are a column type, similarity search is a SQL operator.
```sql
CREATE EXTENSION vector;

CREATE TABLE docs (
  id bigserial PRIMARY KEY,
  body text,
  embedding vector(512)
);

CREATE INDEX docs_embedding_idx
  ON docs USING hnsw (embedding vector_ip_ops);

-- Find the 10 most similar docs to a query embedding
SELECT id, body
FROM docs
ORDER BY embedding <#> $1
LIMIT 10;
```

The <#> operator is negative inner product, which gives the right ordering for normalized vectors. HNSW indexes (available since pgvector 0.5) dramatically outperform IVFFlat for typical RAG workloads.
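On the application side, a minimal query path might look like the sketch below. It assumes psycopg 3 and passes the embedding as a pgvector-style text literal; the pgvector Python package's register_vector adapter is an alternative. The table and column names match the SQL above.

```python
import psycopg
from openai import OpenAI

client = OpenAI()

def search(question: str, conn: psycopg.Connection, top_k: int = 10):
    emb = client.embeddings.create(
        model="text-embedding-3-small", input=question, dimensions=512,
    ).data[0].embedding
    # pgvector accepts a '[x, y, ...]' text literal cast to vector
    vec_literal = "[" + ",".join(str(x) for x in emb) + "]"
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, body FROM docs ORDER BY embedding <#> %s::vector LIMIT %s",
            (vec_literal, top_k),
        )
        return cur.fetchall()
```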
Pinecone
Pinecone is the managed option that requires the least operational thought. You create an index, upsert vectors with metadata, query with filters. The pricing is per-pod or per-serverless-usage; serverless is the right default for most teams.
```python
import os
from pinecone import Pinecone

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index = pc.Index("support-docs")

# "vector" and "query_vector" are embeddings from the OpenAI calls above
index.upsert(vectors=[
    {"id": "doc1", "values": vector, "metadata": {"source": "kb"}},
])

results = index.query(vector=query_vector, top_k=10, include_metadata=True)
```

Qdrant
Qdrant is the strong open-source competitor: Rust, fast, good filtering, available both self-hosted and managed. The API is similar to Pinecone's, and the operational surface is similar to running any modern database.
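A rough sketch with the qdrant-client Python package; the collection name and payload are placeholders, vector and query_vector are embeddings produced by the calls above, and newer client versions also expose a query_points API.

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

client = QdrantClient(url="http://localhost:6333")

# Dot-product distance, since OpenAI vectors are already normalized
client.create_collection(
    collection_name="support-docs",
    vectors_config=VectorParams(size=512, distance=Distance.DOT),
)

client.upsert(
    collection_name="support-docs",
    points=[PointStruct(id=1, vector=vector, payload={"source": "kb"})],
)

hits = client.search(collection_name="support-docs", query_vector=query_vector, limit=10)
```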
The right choice between the three is usually a question of operational stack, not technology. If you live in Postgres, use pgvector. If you want managed and do not want to think about it, use Pinecone. If you need self-hosting with strong filtering, use Qdrant.
Real use cases beyond RAG
Embeddings show up everywhere; RAG is the loud one.
Semantic search. Replace BM25 (or augment it with reciprocal rank fusion) for queries where users phrase things conceptually rather than by keyword. Most production systems use hybrid: BM25 for lexical match, embeddings for semantic match, fused at query time.
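A minimal reciprocal rank fusion sketch for that hybrid pattern. The bm25_search and vector_search functions are hypothetical and assumed to each return a ranked list of document ids; k=60 is the conventional damping constant.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of doc ids; each doc scores sum(1 / (k + rank))."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical usage: fuse lexical and semantic result lists
# fused = reciprocal_rank_fusion([bm25_search(query), vector_search(query)])
```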
Clustering. Embed support tickets, run HDBSCAN or k-means, look at the cluster sizes weekly. New large clusters are usually new product problems. Tag the clusters with an LLM pass and you have a free taxonomy.
Classification. For small label sets with limited training data, embed the input, compute cosine similarity to a labeled centroid per class, pick the closest. This beats a fine-tuned model when you have fewer than ~50 examples per class.
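A sketch of that centroid approach; embed_texts is a hypothetical helper that wraps the batched embedding call above and returns one normalized vector per input text.

```python
import numpy as np

def build_centroids(examples: dict[str, list[str]]) -> dict[str, np.ndarray]:
    """examples maps label -> example texts; each centroid is the normalized mean embedding."""
    centroids = {}
    for label, texts in examples.items():
        vecs = np.array(embed_texts(texts))  # hypothetical helper, see batching code above
        centroid = vecs.mean(axis=0)
        centroids[label] = centroid / np.linalg.norm(centroid)
    return centroids

def classify(text: str, centroids: dict[str, np.ndarray]) -> str:
    vec = np.array(embed_texts([text])[0])
    return max(centroids, key=lambda label: float(np.dot(vec, centroids[label])))
```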
Deduplication. Find near-duplicates by chunking, embedding, and comparing pairwise against a similarity threshold. Faster and more accurate than shingle-based MinHash on natural language.
Recommendation. "More like this" buttons. Embed the seed item, find the nearest neighbors, filter by business rules, return.
Common mistakes
A short list, ordered by how often I see them.
- Embedding the wrong thing. You embedded titles instead of titles + body, or chunks too small to carry context. Always look at five random retrievals before declaring a system done.
- No chunking strategy. Dumping 50,000-token PDFs as single embeddings buries the signal. Chunk to 200 to 800 tokens with 10 to 20% overlap for prose.
- Forgetting to truncate. Defaulting to 3072 dims for a 10M-vector index costs 4-6x more storage than necessary for the same recall.
- Cosine on already-normalized vectors. Wasted CPU. Use dot product.
- Single-input requests in a loop. Batch.
- No eval harness. You have no idea if your retrieval improved or regressed when you swapped models. Build a 200-query gold set and run it on every change; a minimal harness is sketched below.
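A minimal version of that harness, assuming a gold set of (query, relevant_doc_id) pairs and a search function (hypothetical here) that returns ranked doc ids:

```python
def evaluate(gold: list[tuple[str, str]], search, k: int = 10) -> dict[str, float]:
    """gold: (query, relevant_doc_id) pairs; search(query, k) returns ranked doc ids."""
    hits, reciprocal_ranks = 0, []
    for query, relevant_id in gold:
        results = search(query, k)
        if relevant_id in results:
            hits += 1
            reciprocal_ranks.append(1.0 / (results.index(relevant_id) + 1))
        else:
            reciprocal_ranks.append(0.0)
    return {
        f"recall@{k}": hits / len(gold),
        "mrr": sum(reciprocal_ranks) / len(gold),
    }
```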
Observability for embeddings
Embedding calls are a noticeable line item once you cross a few million documents. Track per-feature spend, latency, and token counts the same way you track chat completions. If you route through an LLM gateway, you get this for free, plus the ability to swap providers without code changes. See our Respan tracing docs for capturing embedding calls as spans in your traces.
FAQ
Should I use text-embedding-3-large or text-embedding-3-small?
Start with 3-small. Switch to 3-large only after you have an eval harness that shows the large model materially improves recall on your data.
What dimensions should I use? 512 for most production RAG. 256 for very large corpora or latency-sensitive paths. 1024+ when recall is everything and storage is cheap.
Is ada-002 still worth using?
Only for legacy systems where re-embedding is too expensive. For new projects, no.
Do I need to normalize the vectors? OpenAI returns L2-normalized vectors already. You can use dot product directly.
How do I evaluate embedding quality? Build a small set of (query, expected-relevant-doc-id) pairs. Compute recall@10 and MRR. Re-run on every change.
Can I fine-tune embeddings?
OpenAI does not currently support fine-tuning of text-embedding-3-* models. For domain-specific tuning, train a small projection layer on top, or use a self-hosted model like bge-m3 or e5-mistral.
How much will my project cost? Rough rule: 1M documents at 500 tokens each, embedded once with 3-small, is 500M tokens, which is $10. Re-embedding for a model change is the same. Production cost is almost always dominated by query embeddings (cheap) and downstream LLM calls (not cheap).