Reproducibility means: same prompt + same parameters → same output. In practice, cloud APIs and self-hosted stacks still show run-to-run drift due to sampling, batching/scheduling, and numeric quirks. A key insight from recent engineering work is batch invariance: making kernel results independent of batch size and scheduling collapses most of that drift (even at temperature=0).
The Thinking Machines article "Defeating Nondeterminism in LLM Inference" explains which kernels matter (RMSNorm, matmul, attention) and reports measured trade-offs.
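To make that definition measurable, one option is to call the same generation function several times and count the distinct outputs. The sketch below is a minimal, hypothetical harness: `count_distinct_outputs`, `is_repeatable`, and the `generate` callable are illustrative names, not part of any provider SDK. In practice `generate` would wrap one of the API calls shown in the sections that follow.

```python
import hashlib
from collections import Counter
from typing import Callable


def count_distinct_outputs(generate: Callable[[str], str], prompt: str, runs: int = 5) -> Counter:
    """Call generate(prompt) `runs` times and tally outputs by SHA-256 digest."""
    tally: Counter = Counter()
    for _ in range(runs):
        text = generate(prompt)
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        tally[digest] += 1
    return tally


def is_repeatable(tally: Counter) -> bool:
    """True when every run produced byte-identical text."""
    return len(tally) == 1


# Stand-in for a real model call: a pure function is trivially repeatable.
def fake_model(prompt: str) -> str:
    return prompt.upper()


tally = count_distinct_outputs(fake_model, "Explain CRDTs in 3 bullets.")
print(is_repeatable(tally))  # True
```

Comparing digests rather than raw strings keeps logs small and makes a single-character difference just as visible as a completely different answer.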
Why Outputs Drift Even with "Greedy" Settings
- Sampling & ties: temperature=0 reduces variance, but near-tied logits can still resolve differently, so identical outputs aren't guaranteed in all stacks.
- Provider & model updates: The "same" model name can point to a new snapshot.
- Batching & server load: Different batch sizes or scheduling can change numeric paths, hence the push for batch-invariant kernels.
- Numeric/hardware differences: Framework docs note that full determinism across devices/versions isn't guaranteed; deterministic algorithms just tighten it.
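The batching and numeric bullets share a root cause: floating-point addition is not associative, so the same values reduced in a different order can round differently. A minimal illustration in plain Python (no GPU involved):

```python
# Floating-point addition is not associative: grouping changes the rounding.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

print(left_to_right == right_to_left)  # False
print(left_to_right, right_to_left)    # 0.6000000000000001 0.6
```

A kernel that changes its reduction order with batch size or load can therefore nudge a near-tied logit and, with it, the chosen token.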
OpenAI (ChatGPT): "Mostly" Reproducible with seed
OpenAI's Chat Completions API exposes a seed parameter that helps repeat outputs when all inputs and parameters are identical (model snapshot, prompt bytes, temperature/top-p, penalties, max tokens). It's a practical way to get "mostly deterministic" behavior.
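Since repeatability depends on every input being identical, it can help to fingerprint the full request before sending it and log the digest alongside the response. The helper below is a hypothetical sketch (`request_fingerprint` is not an OpenAI API, and the model name is a placeholder); it hashes a canonical JSON encoding of the request parameters so that any change, even a single space in the prompt, produces a different digest.

```python
import hashlib
import json


def request_fingerprint(params: dict) -> str:
    """Return a SHA-256 hex digest of a canonical JSON encoding of `params`."""
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


base = {
    "model": "example-model-snapshot",  # placeholder, not a real model ID
    "messages": [{"role": "user", "content": "Explain CRDTs in 3 bullets."}],
    "temperature": 0,
    "top_p": 1,
    "seed": 42,
    "max_tokens": 300,
}
reordered = dict(reversed(list(base.items())))  # same content, different key order
changed = {**base, "max_tokens": 301}

print(request_fingerprint(base) == request_fingerprint(reordered))  # True
print(request_fingerprint(base) == request_fingerprint(changed))    # False
```

Chat Completions responses also carry a `system_fingerprint` field that identifies the backend configuration; logging it next to the request digest helps explain outputs that change despite an identical request.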
Python (Chat Completions API)
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o-mini-2024-07-18",  # pin the exact snapshot you deploy
    messages=[{"role": "user", "content": "Explain CRDTs in 3 bullets."}],
    temperature=0,
    top_p=1,
    seed=42,  # helps make outputs repeatable when all else is identical
    max_tokens=300,
    presence_penalty=0,
    frequency_penalty=0,
)
print(resp.choices[0].message.content)

Google Gemini (Vertex AI): seed Is "Best Effort"
Gemini supports a seed, but the docs describe it as best-effort: deterministic output isn't guaranteed, and changing the model or parameters can change results even with the same seed.
Python (Vertex AI)
from vertexai import init
from vertexai.generative_models import GenerativeModel, GenerationConfig

init(project="YOUR_PROJECT", location="us-central1")
model = GenerativeModel("gemini-1.5-pro")
cfg = GenerationConfig(
    temperature=0,
    top_p=1,
    seed=42,  # best-effort repeatability
    max_output_tokens=300,
)
resp = model.generate_content("Explain CRDTs in 3 bullets.", generation_config=cfg)
print(resp.text)

Anthropic Claude: No Official seed; Minimize Randomness
Claude exposes temperature, top_p, and top_k but no official seed in the public API today. Minimize variance by pushing toward greedy decoding and holding everything else constant; pin exact model snapshots where your platform allows (e.g., Bedrock model IDs).
Python (Messages API)
from anthropic import Anthropic

client = Anthropic()
resp = client.messages.create(
    model="claude-3-7-sonnet-2025-05-xx",  # placeholder; use your exact snapshot ID
    max_tokens=300,
    temperature=0,  # minimize randomness
    top_p=1,
    messages=[{"role": "user", "content": "Explain CRDTs in 3 bullets."}],
)
print(resp.content[0].text)

vLLM (Self-Hosted): Seeds + Scheduling Controls
For local serving, vLLM provides a reproducibility guide. In current versions you typically (a) set per-request or global seeds and (b) turn off multiprocessing in V1 to make scheduling deterministic.
Python (engine + per-request seed)
import os

from vllm import LLM, SamplingParams

# Reduce scheduling nondeterminism in V1 engines:
os.environ["VLLM_ENABLE_V1_MULTIPROCESSING"] = "0"

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")
sp = SamplingParams(
    temperature=0,
    top_p=1,
    seed=42,  # per-request seed
    max_tokens=300,
)
out = llm.generate(["Explain CRDTs in 3 bullets."], sampling_params=sp)
print(out[0].outputs[0].text)

Under concurrency, the strongest gains come from batch-invariant kernels (RMSNorm/matmul/attention), which remove batch-size dependencies at some throughput cost. See the Thinking Machines article for design details and evidence.
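To illustrate the idea (this is a toy, not the article's actual kernels), the reduction below sums the same row two ways. `sum_batch_dependent` picks its chunk size from the batch size, the way a kernel might switch strategies under load; `sum_batch_invariant` always uses one fixed chunk size. The function names and the chunking rule are assumptions made up for this sketch.

```python
def naive_sum(values):
    """Left-to-right float accumulation (explicit loop, so the order is fixed)."""
    total = 0.0
    for v in values:
        total += v
    return total


def chunked_sum(values, chunk):
    """Reduce fixed-size chunks first, then add the partial sums in order."""
    partials = [naive_sum(values[i:i + chunk]) for i in range(0, len(values), chunk)]
    return naive_sum(partials)


def sum_batch_dependent(values, batch_size):
    # Hypothetical strategy switch: a lone request splits the reduction.
    chunk = 2 if batch_size == 1 else len(values)
    return chunked_sum(values, chunk)


def sum_batch_invariant(values, batch_size):
    # Same reduction order no matter how many requests share the batch.
    return chunked_sum(values, chunk=2)


row = [1e16, 1.0, -1e16, 1.0]  # the exact sum is 2.0

print(sum_batch_dependent(row, batch_size=1), sum_batch_dependent(row, batch_size=8))  # 0.0 1.0
print(sum_batch_invariant(row, batch_size=1), sum_batch_invariant(row, batch_size=8))  # 0.0 0.0
```

Neither result is exact. The point is that the invariant version returns the same bits for a request regardless of what it is batched with, which is the property that matters for reproducibility; the throughput cost comes from giving up the freedom to adapt the strategy to the batch.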
Hugging Face Transformers + PyTorch: Greedy + Deterministic Algorithms
Greedy decoding (do_sample=False) is deterministic when you keep the software/hardware stack fixed. Set framework seeds and opt into PyTorch's deterministic algorithms; PyTorch cautions that bit-identical results across devices/versions aren't promised.
Python
import os
import random

import numpy as np
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, set_seed

# 1) Seed everything
SEED = 42
set_seed(SEED)
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)

# 2) Prefer deterministic algorithms (may reduce speed)
torch.use_deterministic_algorithms(True)
torch.backends.cudnn.benchmark = False
# Optional (per PyTorch docs) for CUDA/cuBLAS:
# os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
mdl = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct").eval()
inp = tok("Explain CRDTs in 3 bullets.", return_tensors="pt")
out_ids = mdl.generate(**inp, max_new_tokens=300, do_sample=False)  # greedy
print(tok.decode(out_ids[0], skip_special_tokens=True))

Ollama: Deterministic Generation with seed (Best-Effort)
Ollama's Modelfile/options support a seed: "setting this to a specific number will make the model generate the same text for the same prompt." In practice, treat it as best-effort and keep temperature/top-p/top-k and context length constant; pin your build.
REST (JSON body options)
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Explain CRDTs in 3 bullets.",
  "options": {
    "temperature": 0,
    "top_p": 1,
    "top_k": 1,
    "seed": 42
  }
}'

llama.cpp / llama-cpp-python: Seed + Strict Greedy
llama.cpp and llama-cpp-python expose a seed; for strict greedy decoding, recent guidance is to set a single top-k sampler with k=1 (don't rely on temp=0 alone). As with other stacks, different backends/builds can still cause minor drift, so pin binaries, drivers, and quantization.
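Why temp=0 alone is shaky can be seen from the math: temperature scaling divides the logits by T, which is undefined at T=0, so every implementation has to special-case it in its own way. Top-k with k=1, by contrast, is simply an argmax. The plain-Python sketch below illustrates the distinction; `softmax_with_temperature` and `greedy_pick` are illustrative helpers, not llama.cpp functions, and the lowest-index tie-break is an assumption chosen for this example.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Textbook temperature softmax; raises ZeroDivisionError when temperature == 0."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(logits):
    """Top-k with k=1: index of the largest logit, lowest index winning ties."""
    best = 0
    for i, x in enumerate(logits):
        if x > logits[best]:
            best = i
    return best


logits = [1.5, 3.0, 3.0, -0.5]
print(greedy_pick(logits))  # 1 (the first of the two tied maxima)

try:
    softmax_with_temperature(logits, 0)
except ZeroDivisionError:
    print("temperature=0 needs a special case")
```

An explicit tie-break rule matters for the same reason as everything else here: two tied logits must resolve the same way on every run.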
CLI (llama.cpp)
./llama-cli -m ./models/llama3.gguf \
  --seed 42 \
  --sampling-seq k --top-k 1 \
  --temp 0 \
  -p "Explain CRDTs in 3 bullets."

Python (llama-cpp-python)
from llama_cpp import Llama

llm = Llama(model_path="models/llama3.gguf", seed=42)
resp = llm.create_completion(
    prompt="Explain CRDTs in 3 bullets.",
    max_tokens=300,
    temperature=0,
    top_p=1,
    top_k=1,
)
print(resp["choices"][0]["text"])

NVIDIA TensorRT-LLM: Reuse the Same Engine + Fixed Inputs
At runtime, using the same built engine with the same inputs should be deterministic; however, engine building isn't deterministic (tactic selection can differ), and some performance features can lead to slightly different outputs under varying load. Reuse the exact engine file for repeatability.
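One way to check that every run really uses the same engine is to record a digest of the engine directory at build time and compare it before serving. The sketch below uses only the Python standard library; `engine_digest` is a hypothetical helper, not a TensorRT-LLM API, and it assumes the engine lives as ordinary files in one directory.

```python
import hashlib
from pathlib import Path


def engine_digest(engine_dir: str) -> str:
    """SHA-256 over relative paths and contents of all files under engine_dir, in sorted order."""
    root = Path(engine_dir)
    h = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        # Include the relative path so renamed or moved files change the digest.
        h.update(path.relative_to(root).as_posix().encode("utf-8"))
        h.update(b"\0")
        with path.open("rb") as f:
            # Read in 1 MiB blocks so large engine files are not loaded whole.
            for block in iter(lambda: f.read(1 << 20), b""):
                h.update(block)
    return h.hexdigest()
```

Storing the digest next to evaluation results makes it possible to tell later whether two runs shared an engine or only shared a build recipe.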
Serve (conceptual example)
# Build once, reuse the same engine file for all runs:
trtllm-build --checkpoint_dir ./weights --gpt_attention_plugin float16 --output_dir ./engines
trtllm-serve --engine_dir ./engines --max_batch_size 1
# Then call your endpoint with temperature=0, top_p=1, fixed max tokens, etc.

SGLang: Fast Server, Acknowledged Non-Determinism Under Load
SGLang is optimized for speed; its FAQ notes results may differ even at temperature=0 due to dynamic batching and prefix caching. If determinism matters, reduce concurrency features or run single-request batches; follow project issues for seed controls.
Launch (single-tenant style to reduce variance)
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.2-3B-Instruct \
  --host 0.0.0.0 --port 8080 \
  --context-length 8192
# keep load low; avoid dynamic batching if possible in your version

Hugging Face TGI (Text Generation Inference): Client-Side Control
TGI serves Transformers models with high throughput. Determinism mainly comes from your client's generation params (e.g., greedy vs sampling) and a stable container/model snapshot; TGI itself doesn't promise determinism under batching. Keep decoding params fixed and pin images/tags.
cURL (TGI's native /generate route)
curl http://localhost:8080/generate \
  -X POST -H "Content-Type: application/json" \
  -d '{
    "inputs": "Explain CRDTs in 3 bullets.",
    "parameters": {
      "do_sample": false,
      "max_new_tokens": 300
    }
  }'

Further Reading
- Defeating Nondeterminism in LLM Inference - batch-invariant kernels and why batching causes drift.
- OpenAI Cookbook: Reproducible outputs with seed - how to use seed and keep other params fixed.
- Google Vertex AI: seed is best-effort - helps, but not a guarantee.
- vLLM Reproducibility - seeds + scheduling controls for self-hosted.
- PyTorch Reproducibility Notes - what deterministic algorithms do (and don't) guarantee.
- Ollama Modelfile Reference - seed option and other knobs.
- llama-cpp-python API / llama.cpp guidance - seed parameter and strict greedy setup.
- TensorRT-LLM & TensorRT forums - reuse the same engine; building isn't deterministic.
- SGLang FAQ - why results can differ even at temperature=0.



