In 2024 Berkeley's Pardos and Bhandari ran ChatGPT against an OpenStax-aligned set of math problems and found that 32% of generated hints contained both incorrect work and an incorrect solution across elementary algebra, intermediate algebra, college algebra, and statistics (PLOS One 2024). When they sampled 10 responses per problem and took the majority vote, the algebra error rate dropped to near zero. Statistics still sat at 13%. Frontier models in 2026 do better on benchmarks, but GSM8K-Platinum showed that the headline 99% on GSM8K masks measurable error rates that noise in the benchmark itself had been hiding.
Hallucination in tutoring is not a generic LLM problem. It is the load-bearing failure mode that determines whether your product helps a student learn or builds durable misconceptions. This piece is for engineers shipping AI tutors who need to drive hallucination rates down without waiting for a better model. It covers the data on where models actually fail, the mislearning literature on why a confident wrong answer is more harmful than no answer at all, the OpenAI paper arguing that hallucination is mathematically inevitable under accuracy-only evals, and the six engineering fixes that close the gap.
For the wider Education cluster: the pillar covers the five build patterns at the platform level, the evaluation spoke covers the eval framework, the FERPA/COPPA spoke covers the regulatory layer, and the essay grader spoke walks through a complete build.
The data: where AI tutors actually fail
The Berkeley study is the canonical citation in K-12 AI tutoring: 32% of ChatGPT-generated hints had both incorrect work and an incorrect solution. Crucially, learning gains from ChatGPT-generated help were statistically indistinguishable from those of human-tutor-authored help even before mitigation. But the errors happened at scale, and the mitigation matters more than the headline.
The mitigation: sample N=10 responses per problem, group by final answer, take majority vote. Result: algebra error rate dropped to near zero. Statistics held at 13%. This is the most important number in the dossier. Self-consistency gets you most of the way for procedural domains where wrong answers are randomly distributed. It plateaus where wrong answers cluster into a single shared misconception that majority voting reinforces.
Frontier models in 2026 have moved benchmarks forward. GSM8K is functionally saturated: GPT-5 99.7%, o1 99.2%. MATH (Hendrycks, 12,500 competition problems) is near ceiling for top reasoning models. The live frontiers are AIME 2025 (GPT-5 94.6% no tools, GPT-5.2 Thinking 100%, Gemini 3 Pro Deep Think 98 to 99) and FrontierMath where even GPT-5.2 Thinking sits at ~40% on Tiers 1 to 3.
But GSM8K-Platinum re-evaluated frontier models on a cleaned and disambiguated subset of GSM8K and found that measurable error rates remain even when the headline says 99%. Label noise in the original benchmark had been inflating the headline.
In edtech specifically, KQED got Khanmigo to accept an incorrect answer to a simple subtraction problem ("Excellent!") in one session, then reject a correct answer to a similar problem in another. Khan Academy's own engineering blog acknowledged they had to build a calculator and pipe Khanmigo's numerical work through it rather than rely on token prediction, and to engineer separate advanced-math capabilities for symbolic work in geometry, calculus, and trig. They have not published their internal accuracy numbers.
Why a confident wrong answer is worse than no answer
The cognitive science literature does not use the phrase "mislearning event," but the concept is well-grounded in two adjacent literatures.
Misconception persistence. Intuitive conceptions (e.g., that heavier objects sink faster) survive instruction and coexist with the correct model. Students take longer to answer correctly because they are suppressing an interfering misconception (Springer 2014; JRST 2020). High-confidence misconceptions are "more strongly represented in memory," and correcting them requires dismantling an internally consistent mental framework, not just adding a new fact (UC Berkeley primer).
Hypercorrection effect. Metcalfe and Butterfield (2006, 2011) showed that high-confidence errors are corrected 70 to 90% of the time when explicit feedback is given, because the metacognitive surprise focuses attention. The 2011 follow-up showed that without retrieval practice, high-confidence wrong answers re-emerge after a week.
The implication for AI tutors: a confident wrong answer from the AI may stick more than a tentative one, and corrections fade unless paired with spaced retrieval. No one has published a comparative AI-vs-human-teacher study on mislearning rates yet (as of May 2026 this remains a real research gap). What is established is that confidence calibration matters more for tutors than for many other LLM applications, because the cost of a confident wrong answer is durable mislearning, not a bad demo.
OpenAI's argument: the bug is in your eval suite
In September 2025 OpenAI published Why Language Models Hallucinate (Kalai, Nachum, Zhang, Vempala). The core argument is that hallucinations are mathematically inevitable under the current training and eval regime, because:
- Pretraining on next-token cross-entropy gives a baseline error floor.
- Post-training evals are dominated by accuracy benchmarks that penalize "I don't know" identically to wrong answers.
So a model that abstains gets a zero. A model that guesses gets a positive expected score. Training on the scores trains the model to guess. The proposed fix is socio-technical: re-grade the dominant benchmarks to give partial credit for calibrated abstention, rather than introduce more hallucination evals (which get drowned out).
The implication for edtech: if your offline eval rewards the model for guessing (and most do, because they grade only on final-answer correctness), fine-tuning on it will increase tutor hallucination. The fix is not just "use a smarter base model." It is also "re-score your evals to give partial credit for 'I don't know, let me check' before retraining."
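A sketch of what that re-scoring can look like. The abstention phrases and the 0.3 partial credit are illustrative assumptions, not values from the paper; the only property that matters is that calibrated abstention scores strictly better than a wrong answer.

ABSTENTIONS = {"i don't know", "i don't know, let me check"}

def score_accuracy_only(predicted: str, gold: str) -> float:
    # The regime the paper criticizes: abstaining scores the same as
    # being wrong, so a model that guesses has higher expected reward.
    return 1.0 if predicted.strip().lower() == gold.strip().lower() else 0.0

def score_with_abstention_credit(predicted: str, gold: str) -> float:
    # Re-scored eval: partial credit for a calibrated "I don't know".
    p = predicted.strip().lower()
    if p in ABSTENTIONS:
        return 0.3  # illustrative; any value in (0, 1) changes the incentive
    return 1.0 if p == gold.strip().lower() else 0.0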
The companion piece is R-Tuning (Zhang, Diao et al., NAACL 2024 Outstanding Paper). Fine-tune on a refusal-aware dataset built from the model's own knowledge intersection: ask the model questions, label correct ones as "answer," incorrect ones as "I don't know," and SFT on the union. The refusal ability generalizes out of domain. It is a meta-skill, not just task-specific calibration.
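A minimal sketch of the R-Tuning-style dataset split, assuming a generate callable for the base model and exact-match grading; the completion templates are illustrative, not the paper's exact wording.

from typing import Callable

def is_correct(pred: str, gold: str) -> bool:
    return pred.strip().lower() == gold.strip().lower()

def build_refusal_dataset(generate: Callable[[str], str], questions, answers):
    # Split on the model's own knowledge: questions it already answers
    # correctly become "answer" examples, the rest become refusals.
    sft_examples = []
    for q, gold in zip(questions, answers):
        if is_correct(generate(q), gold):
            target = gold                # model knows this: answer
        else:
            target = "I don't know."     # model does not: refuse
        sft_examples.append({"prompt": q, "completion": target})
    return sft_examples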
Six engineering fixes
Each fix is a pattern, not a product. Mix and match based on your subject domain and latency budget.
Fix 1: Self-consistency plus voting
Wang et al., ICLR 2023. Sample N reasoning paths at temperature > 0, marginalize by final answer. Reported lifts: GSM8K +17.9%, SVAMP +11.0%, AQuA +12.2%. Pardos's tutoring deployment used N=10 and got near-zero algebra error.
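A minimal sketch of the voting loop. The sample callable stands in for any chat completion at temperature > 0, and the ANSWER: convention is an assumed prompt format, not part of the Wang et al. recipe.

from collections import Counter
from typing import Callable

def final_answer(completion: str) -> str:
    # Assumes the prompt instructs the model to end with "ANSWER: <x>".
    return completion.rsplit("ANSWER:", 1)[-1].strip()

def self_consistent_answer(sample: Callable[[], str], n: int = 10):
    votes = Counter(final_answer(sample()) for _ in range(n))
    best, count = votes.most_common(1)[0]
    # Vote share doubles as a crude confidence signal: low agreement is
    # a cue to defer to a tool or a human rather than answer outright.
    return best, count / n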
When it fails: statistics-style problems (Pardos 13% residual), open-ended or non-verifiable answers, and any task where wrong answers are correlated (a shared misconception means majority vote reinforces the error rather than canceling it). For procedural domains with diverse failure modes it is the cheapest meaningful fix. For domains with a single dominant misconception it is theater.
Fix 2: Tool calling
Toolformer (Schick et al., NeurIPS 2023) showed self-supervised API-call learning. MathSensei (Bing search + Python + Wolfram Alpha) hit +8.1% over solution-generator-only when both ran on GPT-3.5. Davis and Aaronson's GPT-4-with-Wolfram report found Wolfram and Code Interpreter "significantly enhanced" math performance, but the bottleneck moved upstream: GPT often misformulated the problem for the tool.
The lesson is hybrid. LLM for parsing and explanation, symbolic for computation, with the LLM never asserting an arithmetic fact it did not tool-compute. Flint explicitly does this (Claude 4 Sonnet plus code-execution math plus translation tool plus web search). Khanmigo wires a calculator and a separate symbolic-math module. Photomath represents the pre-LLM tradition: trained-on-math OCR plus symbolic solver, deterministic, narrow.
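A sketch of that invariant in code, using SymPy on the symbolic side. Extracting (expression, asserted value) pairs from the model's output is assumed to happen upstream; this is only the verification rule.

import sympy

def verify_arithmetic(expression: str, asserted_value: str) -> bool:
    # Recompute the model's arithmetic symbolically. Any mismatch means
    # the turn is regenerated or routed to review, never shown as-is.
    try:
        computed = sympy.sympify(expression)
        claimed = sympy.sympify(asserted_value)
        return sympy.simplify(computed - claimed) == 0
    except (sympy.SympifyError, TypeError):
        # If the expression does not parse, treat it as unverified:
        # the model never gets the benefit of the doubt.
        return False

verify_arithmetic("272 - 172", "100") passes; anything the model asserted that the tool cannot confirm fails closed.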
A non-obvious eval target falls out of this. The interesting metric is not "did the model use the calculator," it is "did the model formulate the call correctly." Process reward models address this directly.
Fix 3: RAG over curriculum and textbooks
Levonian et al., EDM 2024 used vetted open-source math textbooks as the retrieval corpus to ground LLM responses to real student questions, with citation enforcement. The HICSS 2024 course-grounded tutor design is the practitioner reference for chunking and citation linking.
When it fails: worked-example synthesis (textbook chunks rarely contain the student's exact problem), curriculum drift (state standards change), and retrieval recall failures on novel student phrasings. RAG is also the most forgiving of the patterns: the worst case is irrelevant retrieval, which is recoverable. Math reasoning, by contrast, is the least forgiving domain, where a wrong answer destroys trust instantly.
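A sketch of the citation-enforcement half, assuming each retrieved chunk carries a stable id and the system prompt requires citations in a [chunk:<id>] tag format (the tag syntax is an illustrative choice):

import re

CITE = re.compile(r"\[chunk:([\w-]+)\]")

def check_citations(response: str, retrieved_ids: set[str]):
    # A grounded response must cite at least one retrieved chunk and
    # must never cite an id that was not actually retrieved.
    cited = CITE.findall(response)
    phantom = [c for c in cited if c not in retrieved_ids]
    ok = bool(cited) and not phantom
    return ok, phantom  # phantom citations are a hallucination signal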
Fix 4: Process Reward Models and LLM-as-judge
Lightman et al., "Let's Verify Step by Step" (OpenAI, ICLR 2024) is foundational. Process supervision (label each step) beats outcome supervision (label only the final answer); their PRM solved 78% of MATH on a representative subset. They released PRM800K (800K step-level human labels). This work fed directly into o1-class reasoning training.
The 2025-2026 follow-ups make PRMs cheap: ThinkPRM (generative PRM trained on 1% of PRM800K labels, beats LLM-as-judge and discriminative verifiers), R-PRM (+11.9 F1 on ProcessBench), FOVER (auto-labels using formal verifiers like Z3 and Isabelle).
For tutoring, this means you do not need to depend on majority-vote self-consistency for math correctness. A trained PRM gives you per-step verdicts, which is also exactly the signal your hint engine wants when the student is stuck.
Inter-rater data: trained LLM judges hit ~89% agreement with humans (parity with human-human IRR). GPT-4 judges agree with humans 76% of the time on math feedback quality. Use a 0-5 scale, not 1-10 (both the Hugging Face cookbook and Databricks' grading-notes guidance recommend it).
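A sketch of a step-quality judge on the 0-5 scale, assuming an OpenAI-compatible chat client; the rubric anchors and model name are illustrative, not from the cookbook.

JUDGE_PROMPT = """Grade the tutor's step-by-step solution on a 0-5 scale:
0 = wrong answer, wrong reasoning
1 = wrong answer, mostly wrong reasoning
2 = wrong answer, partially sound reasoning
3 = correct answer, flawed or skipped steps
4 = correct answer, sound steps, unclear explanation
5 = correct answer, every step sound and clearly explained

Problem: {problem}
Tutor solution: {solution}

Reply with only the integer grade."""

def judge_step_quality(client, problem: str, solution: str) -> int:
    result = client.chat.completions.create(
        model="gpt-5",  # placeholder: use whatever judge model you trust
        temperature=0,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(problem=problem, solution=solution)}],
    )
    # Raises if the judge strays from the format; count that as a failed
    # grade rather than guessing what the judge meant.
    return int(result.choices[0].message.content.strip())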
Fix 5: Refusal training and "Know What You Don't Know"
R-Tuning (above) is the reference SFT recipe. The OpenAI hallucination paper argues the eval-side fix matters more than the training-side fix, because as long as your eval rewards guessing, the model will learn to guess no matter how cleanly you fine-tune it.
No edtech vendor has publicly released a refusal-tuned model card as of May 2026. Khanmigo's "doing math..." hand-off to its calculator is functionally a domain-specific refusal: the LLM refuses to compute and defers to a tool. The pattern is correct even without the formal SFT recipe. Wherever you can architecturally route the model away from making a confident assertion it cannot verify, you should.
Fix 6: Production monitoring and dataset capture
The flywheel that closes the loop. The pattern is now standard:
- Trace every session.
- Capture explicit (thumbs) and implicit (rephrase, abandon, repeat-attempt) negative signals.
- Route flagged traces to human review.
- Label correct outputs.
- Add to offline regression eval suite.
- Score the next prompt or model version against the regression set.
The non-obvious bit is that "technically valid but wrong for your domain" is the hard class. Self-consistency catches randomness, RAG catches missing context, PRMs catch step-level errors, but the case where the model gives a textbook-correct answer that misses the student's actual misunderstanding is only caught by human-in-the-loop review.
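A sketch of the implicit-signal heuristics, with illustrative thresholds and an assumed per-turn dict shape rather than any specific tracing schema:

def similarity(a: str, b: str) -> float:
    # Token-set Jaccard: crude, but enough to flag near-identical rephrasings.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)

def implicit_negative_signals(turns: list[dict]) -> list[str]:
    # Any hit flags the trace for human review; none of these prove
    # the tutor was wrong on their own.
    signals = []
    for prev, cur in zip(turns, turns[1:]):
        if similarity(prev["student_msg"], cur["student_msg"]) > 0.8:
            signals.append("rephrase")
        if (cur.get("problem_id") == prev.get("problem_id")
                and cur.get("attempt", 0) > prev.get("attempt", 0)):
            signals.append("repeat-attempt")
    if turns and turns[-1].get("abandoned"):
        signals.append("abandon")
    return signals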
For edtech specifically, Eedi deserves a separate mention. They have 60,000+ diagnostic questions where each wrong answer choice is mapped to a specific named misconception (a "Misconception Map"). They are running a constrained-AI-Tutor RCT in partnership with Google DeepMind starting April 2026. An earlier Eedi RCT showed students below median gained three months of additional progress. Their architectural innovation is human-in-the-loop: human tutors review and edit AI messages before they reach students.
Wiring the fixes on Respan
A practical setup for an AI math tutor that combines several of these fixes:
import os

from respan import Respan

client = Respan(api_key=os.environ["RESPAN_API_KEY"])

@client.workflow(name="math-tutor-turn")
def tutor_step(student_token, question, history):
    # 1. Retrieve relevant curriculum chunks
    context = retrieve_curriculum(question, grade_level=history["grade"])

    # 2. Self-consistency: sample N reasoning paths
    paths = []
    for _ in range(5):
        result = client.chat.completions.create(
            model="auto",
            customer_id=student_token,
            temperature=0.7,
            messages=build_tutor_messages(question, context, history),
        )
        paths.append(parse(result))

    # 3. Tool-verify the numerical work
    verified = []
    for p in paths:
        if p["needs_calc"]:
            p["calc_result"] = sympy_eval(p["expression"])
        verified.append(p)

    # 4. Process Reward Model: score each path's reasoning
    scored = client.evals.run(
        evaluator="prm_step_verifier",
        candidates=verified,
    )

    # 5. Pick the highest-scoring path with a consistent final answer
    return select_best(scored)

Five fixes wired into one trace: retrieval, self-consistency, tool-grounded computation, PRM scoring, and a final selector. Every span is captured with sub-80ms ingestion latency. When a student says "this is wrong," the trace replays the entire decision and surfaces which step failed.
For monitoring, the patterns that matter for tutors:
- Per-subject hallucination rate. Self-consistency gets algebra to ~0% but plateaus at 13% on statistics. The per-subject view tells you where to invest.
- Calculator-deferral rate. If the model is computing arithmetic in tokens instead of routing to the calculator, that is a leading indicator of regressions.
- Refusal correctness. When the model abstains, was the abstention right? Calibration is the metric.
- Teacher and student override rate. Where humans correct the AI, that is the next eval.
# Online sampling: score 10% of live tutor turns nightly
client.monitors.create(
    name="math-tutor-hallucination-rate",
    workflow="math-tutor-turn",
    sample_rate=0.10,
    evaluators=["prm_step_verifier", "calc_deferral_correctness"],
    alert_on={"prm_pass_rate": "<0.85", "calc_deferral_correctness": "<0.95"},
)

The alert on calc_deferral_correctness < 0.95 will catch the silent regression where a model upgrade starts producing arithmetic in tokens instead of deferring to the tool. That is exactly the kind of drift that surfaces only weeks later in customer reports if you are not monitoring it.
A short reference architecture for an AI tutor
If you are starting today, the smallest defensible math-tutor setup combines:
- Retrieval over a curated curriculum corpus, chunked at the worked-example level, with skill-graph-aware retrieval if you have one.
- Tool-grounding for every arithmetic operation. Calculator for elementary, SymPy for symbolic, Wolfram Alpha for advanced. The LLM never asserts a numerical fact it did not tool-compute.
- Self-consistency at N=5 to N=10 for the reasoning paths, with majority vote on the final answer.
- A Process Reward Model scoring step-level coherence. Cheap to train with ThinkPRM-style approaches.
- Refusal logic wired at the architecture level (route to tools, defer to teacher, escalate to human review) rather than relying on prompt-engineered abstention; a routing sketch follows this list.
- Tracing every turn with retrieval, tool-call, and reasoning-path spans.
- Online sampling of 5 to 20% of live traffic through PRM and calc-deferral evaluators with weekly drift alerts.
- Teacher and student override capture going back into the regression eval set.
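The refusal logic from the list above, as a routing sketch. The route names and thresholds are illustrative; the point is that abstention is an architectural decision, not a prompt instruction.

def route_turn(turn: dict) -> str:
    # Architecture-level refusal: the model never makes a confident
    # assertion it cannot verify; the turn gets routed instead.
    if turn["needs_computation"]:
        return "tool"                     # calculator / SymPy / Wolfram
    if turn["vote_agreement"] < 0.6:      # illustrative threshold
        return "escalate"                 # self-consistency disagreed
    if turn["prm_score"] < 0.85:          # illustrative threshold
        return "defer"                    # step verifier unsure
    return "answer"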
CTA
To wire the fixes above on Respan, start tracing for free, read the docs, or talk to us. For the rest of the Education cluster, see the pillar, the evaluation spoke, the FERPA/COPPA spoke, and the essay grader walkthrough.
How Respan fits
AI tutors fail in load-bearing ways: confident wrong arithmetic, fabricated worked examples, and step errors that compound into durable student misconceptions. Respan is the platform layer that lets you trace, evaluate, gate, and monitor every tutor turn so hallucinations get caught before they reach a student.
- Tracing: every tutor turn captured as one connected trace. Auto-instrumented for LangChain, LlamaIndex, Vercel AI SDK, CrewAI, AutoGen, OpenAI Agents SDK. Retrieval, self-consistency samples, tool calls (calculator, SymPy, Wolfram), PRM scores, and the final selector all show up as spans on the same trace, so when a student says "this is wrong" you replay the exact decision path.
- Evals: ten built-in evaluators (faithfulness, citation accuracy, refusal correctness, harmfulness) plus LLM-as-judge and custom Python evaluators. Production traffic flows directly into datasets. CI-aware experiments block regressions on tutor failures (token-arithmetic instead of calculator deferral, uncited claims, miscalibrated confidence on statistics problems) before deploys ship.
- Gateway: 500+ models behind an OpenAI-compatible interface, semantic caching, fallback chains, per-customer spending caps. Route easy turns to a cheap model, escalate hard math to o1-class or GPT-5 Thinking, and cap per-student spend so a runaway loop on one account does not blow your unit economics.
- Prompt management: versioned registry, dev/staging/prod environments with approval workflows, A/B testing in production with one-click rollback. Tutor system prompts, hint templates, and refusal scaffolds live in the registry so a prompt tweak never ships through a code deploy and a regression rolls back in seconds.
- Monitors and alerts: per-subject hallucination rate, calculator-deferral rate, refusal correctness, citation grounding rate, teacher and student override rate. Slack, email, PagerDuty, webhook. The silent regression where a model upgrade starts computing arithmetic in tokens instead of routing to the calculator surfaces in hours, not weeks.
A reasonable starter loop for AI tutor builders:
- Instrument every LLM call with Respan tracing including retrieval, self-consistency sample, tool-call, and PRM-scoring spans.
- Pull 200 to 500 production tutor turns into a dataset and label them for correctness, calibration, calculator-deferral, and pedagogy quality.
- Wire two or three evaluators that catch the failure modes you most fear (token-arithmetic instead of calculator deferral, uncited curriculum claims, confident wrong answers on statistics).
- Put your tutor system prompts and hint templates behind the registry so you can version, A/B, and roll back without a deploy.
- Route through the gateway so you can A/B reasoning models on hard turns while keeping easy turns cheap and capping per-student spend.
That loop turns hallucinations from a customer-reported incident into a monitored metric you can drive down release over release.
To wire any of the patterns above on Respan, start tracing for free, read the docs, or talk to us.
FAQ
How much does self-consistency actually help? On algebra, in the Berkeley study, it dropped error rates from 32% to near zero with N=10. On statistics, it plateaued at 13%. Self-consistency works when wrong answers are randomly distributed and fails when wrong answers cluster into a shared misconception. Plan accordingly per subject.
Should I train my own PRM? At low scale, no. Use frontier reasoning models (o1-class, GPT-5 Thinking, Claude Opus 4.7) with structured step-by-step output and a generic LLM-as-judge for step verification. At scale (millions of tutor turns per week), training a small PRM with the ThinkPRM recipe on 1% of PRM800K-style labels is the right move and pays back in inference cost.
Why do frontier models still fail on basic arithmetic? Token-prediction is not a calculation engine. GPT-4 with Wolfram Alpha works fine when the model formulates the tool call correctly; the failure mode has moved upstream from execution to formulation. Architecturally, do not let the model assert any numerical fact it did not tool-compute. Even within frontier models in May 2026, that is the safest invariant.
Is the OpenAI hallucination paper actually saying my model is fine? Not quite. It is saying the training reward structure is the deeper bug, and that no amount of bigger-model investment fixes it as long as your benchmarks penalize abstention identically to wrong answers. For edtech specifically, this means you should re-score your offline eval to give partial credit for calibrated "I don't know, let me check" before you fine-tune.
Is there a comparative study on AI vs human teacher mislearning rates? Not as of May 2026. The closest results are the Berkeley study showing learning gains from ChatGPT-help statistically indistinguishable from human-tutor-authored help even before mitigation, and the WestEd Khanmigo RCT showing positive effects when the architecture is right. A direct mislearning-rate comparison remains an open research gap.
