Reproducibility means: same prompt + same parameters → same output. In practice, cloud APIs and self-hosted stacks still show run-to-run drift due to sampling, batching/scheduling, and numeric quirks. A key insight from recent engineering work is batch invariance: making kernel results independent of batch size and scheduling collapses most of that drift (even at temperature=0).

The Thinking Machines article "Defeating Nondeterminism in LLM Inference" explains which kernels matter (RMSNorm, matmul, attention) and shows measured trade-offs.

Why Outputs Drift Even with "Greedy" Settings

Sampling & ties: temperature=0 reduces variance but doesn't guarantee identical outputs in all stacks.
Provider & model updates: The "same" model name can point to a new snapshot.
Batching & server load: Different batch sizes or scheduling can change the order of floating-point reductions, and with it the low-order bits of the logits; hence the push for batch-invariant kernels.
Numeric/hardware differences: Framework docs note that full determinism across devices/versions isn't guaranteed; deterministic algorithms just tighten it.
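
The numeric point is easy to reproduce without a GPU. A minimal sketch in plain Python, assuming nothing beyond standard IEEE-754 doubles: floating-point addition is not associative, so the same three numbers summed in a different grouping differ in the last bit. When a kernel's reduction order changes with batch size or scheduling, logits can shift the same way, and a near-tie between two tokens can resolve differently.

```python
# Floating-point addition is not associative: grouping changes the rounded result.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

print(left_to_right == right_to_left)      # False
print(abs(left_to_right - right_to_left))  # tiny, but nonzero
```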

OpenAI (ChatGPT): "Mostly" Reproducible with seed

OpenAI's Chat Completions API exposes a seed that helps repeat outputs when all inputs and parameters are identical (model snapshot, prompt bytes, temperature/top-p, penalties, max tokens). It's a practical way to get "mostly deterministic" behavior.

Python (Chat Completions API)

from openai import OpenAI
client = OpenAI()
 
resp = client.chat.completions.create(
    model="gpt-4o-mini-2024-07-18",  # pin the exact dated snapshot you deploy
    messages=[{"role": "user", "content": "Explain CRDTs in 3 bullets."}],
    temperature=0,
    top_p=1,
    seed=42,               # helps make outputs repeatable when all else is identical
    max_tokens=300,
    presence_penalty=0,
    frequency_penalty=0,
)
print(resp.choices[0].message.content)
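
The seed only helps when every other input is byte-identical, and that is easy to get wrong (a trailing space in the prompt is enough). One way to check is to hash the request before sending it and log the digest next to the response. This is a minimal sketch: request_key is a hypothetical helper, not part of the OpenAI SDK, and it assumes the parameters are JSON-serializable.

```python
import hashlib
import json

def request_key(params: dict) -> str:
    """Return a stable SHA-256 digest of the request parameters."""
    # Sorted keys and fixed separators make the serialization independent of dict order.
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

base = {
    "model": "gpt-4o-mini-2024-07-18",
    "messages": [{"role": "user", "content": "Explain CRDTs in 3 bullets."}],
    "temperature": 0,
    "top_p": 1,
    "seed": 42,
    "max_tokens": 300,
}
reordered = dict(reversed(list(base.items())))  # same content, different key order
changed = {**base, "messages": [{"role": "user", "content": "Explain CRDTs in 3 bullets. "}]}

print(request_key(base) == request_key(reordered))  # True
print(request_key(base) == request_key(changed))    # False: one trailing space
```

If two runs share a digest but still differ, the cause is on the serving side rather than in your inputs; the system_fingerprint field in the response is one signal that the backend configuration changed.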

Google Gemini (Vertex AI): seed Is "Best Effort"

Gemini supports a seed, but the docs describe it as best-effort: deterministic output isn't guaranteed, and changing the model or parameters can vary results even with the same seed.

Python (Vertex AI)

from vertexai import init
from vertexai.generative_models import GenerativeModel, GenerationConfig
 
init(project="YOUR_PROJECT", location="us-central1")
model = GenerativeModel("gemini-1.5-pro")
 
cfg = GenerationConfig(
    temperature=0,
    top_p=1,
    seed=42,               # best-effort repeatability
    max_output_tokens=300,
)
 
resp = model.generate_content("Explain CRDTs in 3 bullets.", generation_config=cfg)
print(resp.text)
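
With a best-effort seed, it helps to measure repeatability rather than assume it. A minimal sketch, assuming you have already collected the text of N responses to the same request; agreement_rate is a hypothetical helper and the sample list below is a stand-in, not real model output.

```python
from collections import Counter

def agreement_rate(outputs: list[str]) -> float:
    """Fraction of runs that produced the most common output."""
    if not outputs:
        raise ValueError("need at least one output")
    _, count = Counter(outputs).most_common(1)[0]
    return count / len(outputs)

runs = ["A", "A", "A", "B"]  # stand-in for four responses to one request
print(agreement_rate(runs))  # 0.75
```

A rate of 1.0 over a few dozen runs is evidence of repeatability under that load, not a guarantee.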

Anthropic Claude: No Official seed; Minimize Randomness

Claude exposes temperature/top-p(/top-k) but no official seed in the public API today. Minimize variance by pushing toward greedy decoding and holding everything else constant; pin exact model snapshots where your platform allows (e.g., Bedrock model IDs).

Python (Messages API)

from anthropic import Anthropic
client = Anthropic()
 
resp = client.messages.create(
    model="claude-3-7-sonnet-2025-05-xx",  # placeholder: substitute the dated snapshot ID you deploy
    max_tokens=300,
    temperature=0,   # minimize randomness
    top_p=1,
    messages=[{"role": "user", "content": "Explain CRDTs in 3 bullets."}],
)
print(resp.content[0].text)

vLLM (Self-Hosted): Seeds + Scheduling Controls

For local serving, vLLM provides a reproducibility guide. In current versions you typically (a) set per-request or global seeds and (b) turn off multiprocessing in V1 to make scheduling deterministic.

Python (engine + per-request seed)

import os
from vllm import LLM, SamplingParams
 
# Reduce scheduling nondeterminism in V1 engines:
os.environ["VLLM_ENABLE_V1_MULTIPROCESSING"] = "0"
 
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")
sp = SamplingParams(
    temperature=0,
    top_p=1,
    seed=42,                  # per-request seed
    max_tokens=300,
)
out = llm.generate(["Explain CRDTs in 3 bullets."], sampling_params=sp)
print(out[0].outputs[0].text)

Under concurrency, the strongest gains come from batch-invariant kernels (RMSNorm/matmul/attention), which remove batch-size dependencies at some throughput cost. See the Thinking Machines article for design details and evidence.
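
To make the idea concrete, here is a toy simulation in plain Python. It is a sketch under stated assumptions, not how vLLM or any real kernel is written: adaptive_kernel uses a made-up heuristic that picks a different reduction chunk size when a request runs alone, while batch_invariant_kernel always uses the same chunk size.

```python
def chunked_sum(row, chunk_size):
    """Sum `row` by reducing fixed-size chunks, then adding the partial sums in order."""
    total = 0.0
    for start in range(0, len(row), chunk_size):
        partial = 0.0
        for x in row[start:start + chunk_size]:
            partial += x
        total += partial
    return total

def adaptive_kernel(batch):
    # Hypothetical heuristic: a lone request splits its reduction into smaller
    # chunks, while larger batches reduce each row in a single pass.
    chunk_size = 2 if len(batch) == 1 else 4
    return [chunked_sum(row, chunk_size) for row in batch]

def batch_invariant_kernel(batch):
    # Same chunk size no matter how many rows share the batch.
    return [chunked_sum(row, 4) for row in batch]

row = [0.0, 0.1, 0.2, 0.3]
other = [1.0, 2.0, 3.0, 4.0]

adaptive_alone = adaptive_kernel([row])[0]
adaptive_batched = adaptive_kernel([row, other])[0]
print(adaptive_alone == adaptive_batched)    # False: the result depends on batch-mates

invariant_alone = batch_invariant_kernel([row])[0]
invariant_batched = batch_invariant_kernel([row, other])[0]
print(invariant_alone == invariant_batched)  # True
```

The invariant version gives up the freedom to choose the fastest strategy per batch size, which is where the throughput cost comes from.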

Hugging Face Transformers + PyTorch: Greedy + Deterministic Algorithms

Greedy decoding (do_sample=False) is deterministic when you keep the software/hardware stack fixed. Set framework seeds and opt into PyTorch's deterministic algorithms; PyTorch cautions that bit-identical results across devices/versions aren't promised.

Python

import os, random, numpy as np, torch
from transformers import AutoTokenizer, AutoModelForCausalLM, set_seed
 
# 1) Seed everything
SEED = 42
set_seed(SEED)
random.seed(SEED); np.random.seed(SEED); torch.manual_seed(SEED)
 
# 2) Prefer deterministic algorithms (may reduce speed)
torch.use_deterministic_algorithms(True)
torch.backends.cudnn.benchmark = False
# Optional (per PyTorch docs) for CUDA/cuBLAS:
# os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
 
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
mdl = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct").eval()
 
inp = tok("Explain CRDTs in 3 bullets.", return_tensors="pt")
out_ids = mdl.generate(**inp, max_new_tokens=300, do_sample=False)  # greedy
print(tok.decode(out_ids[0], skip_special_tokens=True))
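
Keeping the stack fixed is easier to enforce when each run records what the stack was. A minimal sketch using only the standard library; environment_snapshot is a hypothetical helper, and the package list is an assumption you would adapt. It does not capture GPU driver or CUDA versions, which you would need to record separately.

```python
import json
import platform
import sys
from importlib import metadata

def environment_snapshot(packages):
    """Collect interpreter, OS, and package versions for a run manifest."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "machine": platform.machine(),
        "packages": versions,
    }

snapshot = environment_snapshot(["torch", "transformers", "numpy"])
print(json.dumps(snapshot, indent=2))
```

Store the snapshot next to the generated output; when two runs disagree, diff the snapshots first.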

Ollama: Deterministic Generation with seed (Best-Effort)

Ollama's Modelfile/options support a seed: "setting this to a specific number will make the model generate the same text for the same prompt." In practice, treat this as best-effort: keep temperature/top-p/top-k and context length constant, and pin your build.

REST (JSON body options)

curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Explain CRDTs in 3 bullets.",
  "options": {
    "temperature": 0,
    "top_p": 1,
    "top_k": 1,
    "seed": 42
  }
}'

llama.cpp / llama-cpp-python: Seed + Strict Greedy

llama.cpp and llama-cpp-python expose a seed; for strict greedy decoding, recent guidance is to set a single top-k sampler with k=1 (don't rely on temp=0 alone). As with other stacks, different backends/builds can still cause minor drift, so pin binaries, drivers, and quantization.

CLI (llama.cpp)

./llama-cli -m ./models/llama3.gguf \
  --seed 42 \
  --sampling-seq k --top-k 1 \
  --temp 0 \
  -p "Explain CRDTs in 3 bullets."

Python (llama-cpp-python)

from llama_cpp import Llama
llm = Llama(model_path="models/llama3.gguf", seed=42)
 
resp = llm.create_completion(
    prompt="Explain CRDTs in 3 bullets.",
    max_tokens=300,
    temperature=0,
    top_p=1,
    top_k=1,
)
print(resp["choices"][0]["text"])
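
Why top-k with k=1 amounts to strict greedy can be shown with a toy sampler. This is an illustrative sketch in plain Python, not llama.cpp's sampler code; sample_top_k and the logits are made up for the example. Once only one candidate survives the top-k filter, the seed and temperature no longer affect the pick.

```python
import math
import random

def sample_top_k(logits, k, temperature, rng):
    """Keep the k highest logits, apply a temperature softmax, then sample one index."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]  # subtract the max for stability
    return rng.choices(top, weights=weights, k=1)[0]

logits = [1.2, 3.4, 3.3, -0.5]
greedy = max(range(len(logits)), key=lambda i: logits[i])

picks_k1 = {sample_top_k(logits, 1, 1.0, random.Random(seed)) for seed in range(100)}
picks_k4 = {sample_top_k(logits, 4, 1.0, random.Random(seed)) for seed in range(100)}

print(picks_k1 == {greedy})  # True: every seed picks the argmax
print(len(picks_k4) > 1)     # True: with four candidates, the seed matters
```

Note that this toy sampler divides by the temperature, so temperature=0 would need a special case; that is one reason not to rely on temp=0 alone.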

NVIDIA TensorRT-LLM: Reuse the Same Engine + Fixed Inputs

At runtime, using the same built engine with the same inputs should be deterministic; however, engine building isn't deterministic (tactic selection can differ), and some performance features can lead to slightly different outputs under varying load. Reuse the exact engine file for repeatability.

Serve (conceptual example)

# Build once, reuse the same engine file for all runs:
trtllm-build --checkpoint_dir ./weights --gpt_attention_plugin float16 --output_dir ./engines
trtllm-serve --engine_dir ./engines --max_batch_size 1
# Then call your endpoint with temperature=0, top_p=1, fixed max tokens, etc.
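
Since rebuilding can produce a different engine, it is worth verifying that the engine directory you serve is the one you tested. A minimal sketch using the standard library; directory_digest is a hypothetical helper, and the temporary directory and file name below only stand in for a real engine directory.

```python
import hashlib
import tempfile
from pathlib import Path

def directory_digest(root):
    """SHA-256 over every file under `root`, visited in sorted path order."""
    root = Path(root)
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        with path.open("rb") as fh:
            for block in iter(lambda: fh.read(1 << 20), b""):
                digest.update(block)
    return digest.hexdigest()

with tempfile.TemporaryDirectory() as tmp:
    engine = Path(tmp) / "rank0.engine"     # hypothetical file name
    engine.write_bytes(b"engine-bytes-v1")
    first = directory_digest(tmp)
    unchanged = directory_digest(tmp)
    engine.write_bytes(b"engine-bytes-v2")  # simulates a rebuild
    rebuilt = directory_digest(tmp)

print(first == unchanged)  # True
print(first == rebuilt)    # False
```

Record the digest at build time and compare it at server start-up.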

SGLang: Fast Server, Acknowledged Non-Determinism Under Load

SGLang is optimized for speed; its FAQ notes results may differ even at temperature=0 due to dynamic batching and prefix caching. If determinism matters, reduce concurrency features or run single-request batches; follow project issues for seed controls.

Launch (single-tenant style to reduce variance)

python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.2-3B-Instruct \
  --host 0.0.0.0 --port 8080 \
  --context-length 8192
  # keep load low; avoid dynamic batching if possible in your version

Hugging Face TGI (Text Generation Inference): Client-Side Control

TGI serves Transformers models with high throughput. Determinism mainly comes from your client's generation params (e.g., greedy vs sampling) and a stable container/model snapshot; TGI itself doesn't promise determinism under batching. Keep decoding params fixed and pin images/tags.

cURL (TGI's native /generate route)

curl http://localhost:8080/generate \
  -X POST -H "Content-Type: application/json" \
  -d '{
    "inputs": "Explain CRDTs in 3 bullets.",
    "parameters": {
      "do_sample": false,
      "max_new_tokens": 300
    }
  }'
