Fine-tuning is the most over-recommended technique in the LLM toolkit. Engineers reach for it because it sounds like the real thing: you are training the model, not just talking to it. In practice, 80% of the cases where someone proposes a fine-tune are better solved with a better prompt, a few-shot example, or a RAG pipeline. The remaining 20% are where fine-tuning earns its place, and on those problems nothing else gets close.
This guide is the engineering playbook: which OpenAI models you can actually fine-tune in 2026, the three methods (supervised, DPO, reinforcement), data prep, cost math, the deployment loop, and the decision framework I use before kicking off a job. The big lesson up front: do not fine-tune until you can clearly state what your eval metric is and why prompting cannot move it.
If you are still in the exploratory phase, read how to evaluate an LLM and how to reduce OpenAI API costs first. Most "I need fine-tuning" requests turn into "I need an eval harness and a better prompt" once measured.
TL;DR
- Fine-tunable models in 2026: GPT-4.1 and GPT-4.1-mini for SFT and DPO. o4-mini for reinforcement fine-tuning (RFT). The GPT-5.4 family is not currently available for fine-tuning.
- Three methods: Supervised Fine-Tuning (SFT) for behaviors you can demonstrate. Direct Preference Optimization (DPO) when you can rank pairs but not write the perfect answer. Reinforcement Fine-Tuning (RFT) for verifiable-correct tasks on reasoning models.
- Skip fine-tuning if the task is fact retrieval (use RAG), the failure is wrong tone (try prompting), or your eval set has fewer than ~100 examples (collect more data first).
- Data format: JSONL of chat messages. Hold out 10-20% as validation. Start with 50-500 examples for SFT; more is not always better.
- Cost: GPT-4.1-mini training runs around $5 to $50 for a typical SFT job; deployed inference is roughly 2x the base model price. Always compare against base-model price plus a longer prompt before committing.
When to fine-tune (and when not to)
The decision tree I use, in order:
- Is the task fact retrieval? Use RAG, not fine-tuning. Fine-tuned models hallucinate facts they "knew" at training time when you need fresh data.
- Is the failure tone, format, or style? Try a stricter prompt and a structured output schema. Fine-tuning is overkill for "always respond in JSON."
- Can you write a clear rubric for what good looks like? If no, you cannot evaluate the fine-tune, so do not start it.
- Do you have at least 50-100 high-quality training examples? If no, generate them or get human labelers; do not start with junk data.
- Have you tried few-shot prompting with 5-10 examples in the system prompt? If that works, ship it. Prompt-engineering is a one-line change; fine-tuning is a deployment.
- Is the prompt cost killing you? Fine-tuning bakes the instruction into the model, so you can drop the long system prompt at inference. This is a real cost play if you run high QPS.
- Do you need consistent reasoning behavior on a verifiable task? Look at RFT on o4-mini.
If you pass items 1 through 5 and have a concrete answer for 6 or 7, fine-tuning is on the table. Otherwise, fix the upstream problem first.
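Item 5 is the one people skip most often, and it is the cheapest to test. A few-shot baseline is just demonstrations inlined into the messages; a minimal sketch (the SKU-extraction task and examples are invented for illustration):

from openai import OpenAI

client = OpenAI()

# Few-shot baseline: demonstrations go straight into the prompt,
# no training job required.
few_shot = [
    {"role": "user", "content": "I ordered SKU-7821 yesterday."},
    {"role": "assistant", "content": "{\"sku\": \"SKU-7821\"}"},
    {"role": "user", "content": "Any update on item SKU-0042?"},
    {"role": "assistant", "content": "{\"sku\": \"SKU-0042\"}"},
]

response = client.chat.completions.create(
    model="gpt-4.1-mini",
    messages=[
        {"role": "system", "content": "You extract product SKUs from text. Reply with JSON only."},
        *few_shot,
        {"role": "user", "content": "Please cancel SKU-1199 from my cart."},
    ],
)
print(response.choices[0].message.content)

If that hits your quality bar on the eval set, you are done before you started.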
The three methods
OpenAI supports three fine-tuning techniques in 2026. They optimize for different signals and need different data.
Supervised Fine-Tuning (SFT)
SFT is the default. You provide input/output pairs that demonstrate the behavior you want, and the model learns to imitate them. Use SFT when:
- The desired output is well-defined and you can write it.
- You have 50 to a few thousand examples.
- The signal is "match this style or format or domain language."
Classic SFT wins: structured extraction for a specific schema, support agent tone matched to your brand, code completion for an internal DSL, classification with a long fixed taxonomy.
Direct Preference Optimization (DPO)
DPO uses pairwise comparisons. You provide a prompt, a preferred response, and a rejected response. The model learns to favor the preferred pattern. Use DPO when:
- You can rank outputs as better or worse, but it is hard to write "the perfect" answer.
- You have an existing SFT-tuned model and want to nudge its behavior.
- You have user feedback (thumbs up, thumbs down) you can convert to pairs.
DPO is usually a second pass after SFT, not a replacement. The pattern: SFT on demonstrations, then DPO on preference data to polish.
Reinforcement Fine-Tuning (RFT)
RFT is the newest method and the most powerful for the right problem. You define a grader (a reward function that scores outputs) and the model trains against it. It is generally available on o-series reasoning models, currently scoped to o4-mini in 2026.
Use RFT when:
- The task is verifiable: math, code that passes tests, structured output that parses against a schema.
- You have a programmatic grader, not a human-judgment loop.
- You need reasoning quality on a narrow domain.
RFT shines on things like: theorem proving, competitive coding subdomains, scientific data extraction, deterministic agent tool-use. It is not the right tool for open-ended generation where there is no clean grader.
Data prep: the JSONL format
OpenAI's fine-tuning format is JSONL: one example per line, each line a chat-style payload (wrapped below for readability; in the actual file each example sits on a single line).
{"messages": [
  {"role": "system", "content": "You extract product SKUs from text."},
  {"role": "user", "content": "I ordered SKU-7821 yesterday."},
  {"role": "assistant", "content": "{\"sku\": \"SKU-7821\"}"}
]}

A few practical rules I have learned the painful way.
Make the system prompt match what production will use. If your fine-tune sees one system prompt during training and a different one at inference, you will get drift. Pick the production prompt before you start labeling.
Hold out a validation set. OpenAI's API accepts a separate validation_file. Aim for 10-20% of your data, picked to be representative (not just the last 10% of the file, which often has temporal bias).
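A minimal way to build that split, shuffling first so the holdout is not just the tail of the file (file names here are placeholders):

import json
import random

# Shuffle before splitting so the validation set is representative,
# not just the most recent (often temporally biased) examples.
with open("data.jsonl") as f:
    examples = [json.loads(line) for line in f]

random.seed(42)
random.shuffle(examples)

split = int(len(examples) * 0.85)  # 85% train, 15% validation
with open("train.jsonl", "w") as f:
    for ex in examples[:split]:
        f.write(json.dumps(ex) + "\n")
with open("val.jsonl", "w") as f:
    for ex in examples[split:]:
        f.write(json.dumps(ex) + "\n")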
Quality matters more than quantity. 200 hand-curated examples beat 2000 synthetic ones for most SFT tasks. The fine-tune will learn your typos, your inconsistent formatting, and your one bad example with the wrong label.
For DPO, use the preference format.
{
  "input": {"messages": [{"role": "user", "content": "..."}]},
  "preferred_output": [{"role": "assistant", "content": "good answer"}],
  "non_preferred_output": [{"role": "assistant", "content": "bad answer"}]
}

For RFT, you write a grader: either an inline grader (string match, JSON schema check) or a hosted grader model. The OpenAI docs walk through this in detail; the key intuition is that the grader needs to be robust to surface variation but strict about correctness.
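As a standalone illustration of that intuition only (not OpenAI's grader config format), a grader for a JSON-extraction task might normalize away surface noise and then check exact correctness:

import json

def grade(output_text: str, expected: dict) -> float:
    # Robust to surface variation: whitespace, key order, pretty-printing.
    # Strict about correctness: the parsed payload must match exactly.
    try:
        parsed = json.loads(output_text.strip())
    except json.JSONDecodeError:
        return 0.0
    return 1.0 if parsed == expected else 0.0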
Kicking off a fine-tune
The API surface is straightforward. You upload a training file, optionally a validation file, then create a job.
from openai import OpenAI
import time

client = OpenAI()

# 1. Upload training data
train_file = client.files.create(
    file=open("train.jsonl", "rb"),
    purpose="fine-tune",
)
val_file = client.files.create(
    file=open("val.jsonl", "rb"),
    purpose="fine-tune",
)

# 2. Create the job
job = client.fine_tuning.jobs.create(
    training_file=train_file.id,
    validation_file=val_file.id,
    model="gpt-4.1-mini-2025-04-14",
    hyperparameters={"n_epochs": 3},
    suffix="extraction-v1",
)

# 3. Poll for completion
while True:
    job = client.fine_tuning.jobs.retrieve(job.id)
    print(job.status, job.trained_tokens)
    if job.status in ("succeeded", "failed", "cancelled"):
        break
    time.sleep(60)

print(f"Fine-tuned model: {job.fine_tuned_model}")

The suffix parameter shows up in the deployed model name, which makes life easier when you ship multiple variants. The default n_epochs is "auto"; specify 3 to 5 for small datasets, 1 to 2 for large ones to avoid overfitting.
In TypeScript the same flow:
import OpenAI from "openai";
import fs from "fs";

const client = new OpenAI();

const trainFile = await client.files.create({
  file: fs.createReadStream("train.jsonl"),
  purpose: "fine-tune",
});

const job = await client.fineTuning.jobs.create({
  training_file: trainFile.id,
  model: "gpt-4.1-mini-2025-04-14",
  suffix: "extraction-v1",
});

console.log(job.id);

Cost estimation
The price model has three components:
- Training: charged per million tokens of training data, multiplied by epochs. For GPT-4.1-mini this is roughly $0.80 per million tokens, so a 5MB JSONL file (~1.5M tokens) at 3 epochs is around $3.60.
- Deployed inference: the tuned model is roughly 2x the base model's inference price.
- Idle hosting: no extra charge in OpenAI's current model; you pay only per call.
A back-of-envelope for a typical fine-tune: 500 examples at ~1,000 tokens each over 3 epochs is 1.5M training tokens, which comes to roughly $1.20 on GPT-4.1-mini, comfortably under $5. Production inference at 10M calls/month with ~500 tokens per call costs roughly twice what base GPT-4.1-mini would, but that premium is usually offset by being able to drop the long system prompt.
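A quick way to run that math on your own file before uploading anything; a sketch that assumes tiktoken's o200k_base encoding approximates the training tokenizer and uses a placeholder per-million-token rate you should swap for current pricing:

import json
import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # approximation of the real tokenizer

def estimate_training_cost(path, n_epochs=3, usd_per_million=0.80):
    # Count message tokens across the whole JSONL file, then scale by epochs.
    tokens = 0
    with open(path) as f:
        for line in f:
            for msg in json.loads(line)["messages"]:
                tokens += len(enc.encode(msg["content"]))
    return tokens, tokens * n_epochs / 1_000_000 * usd_per_million

tokens, cost = estimate_training_cost("train.jsonl")
print(f"{tokens:,} tokens -> ~${cost:.2f} for 3 epochs")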
Always do the math before kicking off the job. The base-model-plus-better-prompt baseline is the one to beat.
Evaluating against the base model
A fine-tune is only worth shipping if it beats the alternatives on your eval set. The minimum eval loop:
def score(model_id, eval_examples):
    correct = 0
    for ex in eval_examples:
        response = client.chat.completions.create(
            model=model_id,
            messages=ex["messages"][:-1],
        )
        if response.choices[0].message.content == ex["messages"][-1]["content"]:
            correct += 1
    return correct / len(eval_examples)

base_acc = score("gpt-4.1-mini-2025-04-14", eval_set)
ft_acc = score(job.fine_tuned_model, eval_set)
print(f"Base: {base_acc:.3f} | Tuned: {ft_acc:.3f}")

For non-exact-match tasks, use an LLM-as-judge with a strong evaluator model (GPT-5.5 typically). Capture every call as a span and tag it with the model version so you can A/B fairly. See how to evaluate an LLM for the full eval playbook, and the LLM evals product page if you want to skip building this in-house.
A real gotcha: fine-tunes regress on tasks outside your training distribution. Include a "general competence" slice in your eval set, drawn from production traffic.
Deployment patterns
Once the job succeeds you have a model ID like ft:gpt-4.1-mini:your-org:extraction-v1:abc123. You call it the same as any other model.
response = client.chat.completions.create(
    model="ft:gpt-4.1-mini:your-org:extraction-v1:abc123",
    messages=[{"role": "user", "content": "..."}],
)

In production, two patterns matter:
Shadow traffic first. For one to two weeks, send a copy of production traffic to the new tuned model, log both outputs, and compare offline. Do not switch the user-facing path until the shadow comparison shows the tuned model matching or beating the incumbent on your eval metrics.
Canary with a kill switch. Route 1% to 5% of traffic to the tuned model, watch your KPIs (latency, error rate, business metric), and have a feature flag ready to revert. The right place to do this is your LLM gateway, which can do percentage-based routing without code changes.
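If you are not running a gateway, a stripped-down version of the canary lives fine in application code; a sketch where the rollout percentage and kill switch are environment-driven placeholders:

import hashlib
import os

CANARY_MODEL = "ft:gpt-4.1-mini:your-org:extraction-v1:abc123"
BASE_MODEL = "gpt-4.1-mini-2025-04-14"

def pick_model(user_id: str) -> str:
    # Kill switch: flip one env var to route everything back to base.
    if os.environ.get("CANARY_DISABLED") == "1":
        return BASE_MODEL
    percent = int(os.environ.get("CANARY_PERCENT", "5"))
    # Stable bucketing: the same user always lands in the same bucket,
    # so their experience does not flip between models per request.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return CANARY_MODEL if bucket < percent else BASE_MODEL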
When NOT to fine-tune
A few patterns where fine-tuning is the wrong answer, in order of how often I see them.
- You want the model to "know" your docs. Use RAG. Fine-tunes leak old facts and hallucinate confidently. This is the single most common misuse.
- Your eval set is your training set. You will measure the model's ability to memorize, not generalize. Always hold out a fresh test set the model never sees.
- You have 10 examples. That is not a fine-tune, that is a prompt. Put them in the system message.
- You want better reasoning on hard math. Use a reasoning model (o4-mini, GPT-5.5 reasoning) and prompting; SFT does not give you reasoning capacity you did not start with.
- The base model is fine and the prompt is fine but you "want to fine-tune anyway." Stop. Spend the budget on evals instead.
FAQ
Which OpenAI models can I fine-tune in 2026? GPT-4.1 and GPT-4.1-mini for SFT and DPO. o4-mini for RFT. The GPT-5.4 family (5.4, 5.4-mini, 5.4-nano) is not currently available for fine-tuning; check OpenAI's current docs before assuming this has not changed.
How many examples do I need? 50 to 100 is the floor for SFT to show measurable improvement. 200 to 500 hand-curated examples is the sweet spot for most tasks. More than 5,000 examples rarely helps unless the task covers a genuinely broad input distribution.
SFT or DPO first? SFT first, then DPO on top if you have preference data. DPO alone usually does not produce a coherent style; it polishes one.
How long does a fine-tune take? Typically 30 minutes to a few hours for SFT on a few thousand examples. RFT runs are longer (hours to a day) because the rollouts and grading are expensive.
Can I delete a fine-tuned model? Yes, via the API or dashboard. You should delete experimental tunes once a successor ships; they count against deployment quotas.
What about cost compared to a longer prompt? The fine-tune wins at high volume because you can drop the long system prompt. Below ~1M calls/month, a stronger prompt is usually cheaper end-to-end.
Does fine-tuning improve safety? SFT on bad data makes the model less safe. Fine-tuning is not a safety mechanism; it is a behavior modifier. Add evals for jailbreaks and harmful output to your post-tune validation.