If you ship an AI tutor in 2026, you ship a system that 50+ international research teams and the best edtech eval programs are quietly telling you is not yet good enough for unsupervised use. The BEA 2025 Shared Task on Pedagogical Ability Assessment ran 50+ teams across five tracks. Best macro-F1 for "providing pedagogical guidance" landed at 58.34 on a three-class problem. Best macro-F1 for "tutor identity detection" was 96.98. Translation: state-of-the-art AI tutors are easy to fingerprint and hard to make pedagogically sound. The gap between those two numbers is the gap between "this works as a demo" and "this works in a classroom."
This piece is for engineers building tutoring AI who need an evaluation stack that actually catches the regressions. It covers a three-layer framework (correctness, pedagogy, safety), the ASAP-AES baseline that is still canonical, the judge biases that quietly inflate your eval scores, and the production patterns the leading edtech teams are running today.
For the wider Education cluster, this is the eval spoke. The compliance spoke (FERPA and COPPA for AI Tutoring) is published. Pillar, hallucination, and essay-grader pieces are next.
The three-layer eval framework
Academic literature in 2025-2026 has converged on a triad you can operationalize. The clearest single source is SHAPE: Unifying Safety, Helpfulness and Pedagogy for Educational LLMs (2026), which formalizes the same three concepts as a 9,087-pair benchmark of student-tutor dialogues under adversarial pressure. SafeTutors extends the safety dimension into an 11-axis, 48-sub-risk taxonomy specific to learning. MathTutorBench (EMNLP 2025) is the strongest pedagogy-only benchmark.
The three layers are:
- Correctness: is the answer factually and mathematically right
- Pedagogy: does the response teach (Socratic, scaffolded, age-appropriate), or does it just hand over the answer
- Safety: child-appropriate, on-topic, refusing manipulation
Each needs its own dataset, its own evaluator, its own threshold. Conflating them in a single "quality" score is the most common mistake. A model can be correct and pedagogically terrible (gives the answer instead of guiding). A model can be pedagogically gentle and unsafe (refuses to address self-harm signals). The eval suite has to score them independently and gate deploys on each.
Layer 1: Correctness
Math, science, factual
For math correctness, the modern reference is process supervision. Lightman et al., "Let's Verify Step by Step" showed that grading every reasoning step (with a Process Reward Model) beats grading only the final answer, and released PRM800K (800K step-level human labels). The 2025-2026 follow-ups make PRMs cheap to train: ThinkPRM hits state-of-the-art with 1% of PRM800K labels, R-PRM adds reasoning-driven verification, FOVER auto-labels using formal verifiers (Z3, Isabelle).
For tutoring specifically, the implication is that you do not need to depend on majority-vote self-consistency for math correctness. A trained PRM gives you per-step verdicts, which is also the signal your hint engine wants. The hallucination spoke covers this in depth.
For essay correctness, see Layer 2. Essays sit awkwardly across correctness and pedagogy because there is rarely a single right answer, only a defensible one.
ASAP-AES is still the canonical essay benchmark
The Hewlett ASAP-AES corpus from 2012 (~12,978 essays across 8 prompts, double-rated by humans, scored on prompt-specific scales) remains the canonical public dataset in 2026 because nothing else has its combination of double-rating, scale, and a baked-in agreement metric (Quadratic Weighted Kappa).
The successors:
- ASAP++ (LREC 2018) adds trait-level scores across 10 dimensions (Content, Organization, Word Choice, Sentence Fluency, Conventions, Prompt Adherence, Language, Narrativity). This is what you want if you are building rubric-anchored grading.
- ASAP 2.0 (Learning Agency Lab, July 2024) is ~24,000 student-written argumentative essays aligned to current standards. The modern successor, but not yet as established.
- TOEFL11 for cross-prompt generalization on ESL writing.
- DREsS for rubric-based EFL writing.
What you should expect from QWK
The ETS standard for AES systems is QWK ≥ 0.70, derived from the Praxis I e-rater report. Human-human QWK on the same Praxis I task was 0.74. For essays generally, human-human QWK lands in the 0.6 to 0.8 band.
The most recent comprehensive synthesis (2025 review of 65 LLM-AES studies, Jan 2022 to Aug 2025) puts human-LLM QWK at "moderate to good," 0.30 to 0.80, with frontier models on the high end when given criteria and justification prompts. Translation: ASAP-AES is not solved. Your aggregate QWK can land at 0.75 and still hide a regime where the model is below the ETS acceptability threshold.
The most credible recent improvement is Reflect-and-Revise (2025), which reports +0.47 QWK on ASAP and +0.19 on TOEFL11 using GPT-4.1, Gemini 2.5 Pro, and Qwen-3-Next via iterative rubric refinement. Not bigger model, smarter scoring loop. LCES (May 2025) shows pairwise comparison beats zero-shot pointwise on both ASAP and TOEFL11.
Layer 2: Pedagogy
This is the layer where AI tutors fail in production and where most teams have no formal eval.
The pedagogy-vs-expertise trade-off
The single most important finding in the 2025 literature is from MathTutorBench: subject expertise and pedagogical ability are a measurable trade-off, not a free lunch. Models that are better at solving the math problem tend to be worse at teaching it, at the margin. Top MMLU does not equal top MathTutorBench.
The implication for production: choosing the model with the highest reasoning benchmark is not the right default. You need to evaluate on a pedagogy-specific reward model, not a general capability one. BEA 2025 made this concrete with five tracks of pedagogical ability tasks (mistake identification, mistake location, guidance, actionability, tutor-identity detection), and the F1 ceilings are sobering even after a 50-team competition.
What pedagogy eval looks like
For tutoring, you want at minimum:
- Hint quality: does the response help the student think, or does it just produce the answer? Khan Academy adopted the ICAP framework (Interactive, Constructive, Active, Passive) as their cognitive-engagement metric, chosen because prior efficacy research showed engagement predicts skill gains. They built an LLM-as-judge trained against human-expert ICAP labels, and it now scores ~20% of Khanmigo conversations nightly.
- Refusal correctness: does the model give the answer when it shouldn't, and refuse when it should? SHAPE's knowledge-mastery graph is the academic frame. The practical version is: build a 200-case set of "student should not get the answer" prompts, and score the rate at which the model holds the line.
- Step-by-step coherence: for math, does the explanation follow from the previous step? PRMs are the tool.
- Tone and age-appropriateness: is the language at the student's grade level, encouraging without being condescending?
The "do my homework" attack
This is a pedagogy attack as much as a safety one. Students will phrase requests in ways that look legitimate ("can you check my work" while presenting an empty page, "I already know this, just verify"). Your eval set has to include these. SHAPE provides a starting corpus. The eval question is not "did the model produce a correct answer" but "did the model maintain its pedagogical role under pressure."
Layer 3: Safety
For child-facing tutors, safety is regulated as well as pedagogical. The FERPA and COPPA spoke covers the regulatory side. For eval specifically, you want:
- SafeTutors taxonomy (arXiv 2603.17373) , 11 dimensions, 48 sub-risks, drawn from learning-science literature. Far more granular than generic content-safety classifiers.
- Adversarial robustness: How to Trick Your AI TA (Dec 2025) coined "academic jailbreaking" and showed how easy it is to manipulate AI graders for unearned grades on GPT-4.1-Mini, Gemini 2.5 Flash, and LLaMA-3.2.
- Prompt injection: OWASP LLM01:2025 is the canonical taxonomy. The adversarial poetry jailbreak (late 2025) achieves >60% success across 25 major models. Tutors get attacked not just on rubrics but on system prompts and cross-student data extraction.
- Cross-student data exposure: students asking "what did Sarah write?" is a FERPA event. Your safety eval needs a synthetic test set that probes this directly.
The non-obvious eval move is to run safety as a separate gate from quality. A response can be high-quality on correctness and pedagogy, and still fail on safety. Don't average them.
The judge bias trap
LLM-as-judge is the cheapest way to scale eval, and the biases are well-documented enough that you can no longer plead ignorance.
Self-preference bias is mechanistic
Panickssery et al. (NeurIPS 2024) and the 2026 follow-up showed that GPT-4 measurably prefers its own outputs. The mechanism: judges favor outputs with lower perplexity under their own distribution, regardless of source. Cross-family judging (Claude judges OpenAI outputs and vice versa) is therefore not a fix on its own. Both judges still favor whatever each finds more familiar.
Mitigation: calibrate against held-out human-graded anchors, not just "use a different vendor."
Length bias is conditional, not constant
A widely cited finding (Nature Scientific Reports 2025) is that GPT-4 essay-scoring inflation is most pronounced on longer answers and disappears for short ones. Length-stratified eval is mandatory. An aggregate QWK of 0.75 can hide a 0.85-on-short and 0.55-on-long split, which is the regime where students are gaming the system by writing more.
Positivity bias in education specifically
Studies on GPT-4o and L1 groups and GPT-4 placement test scoring show models are systematically more lenient than human raters, with the score distribution compressed upward. There is also early evidence of demographic bias against Asian-American writing in AI grading. Cause unclear, mitigation requires demographic-split testing, not aggregate metrics.
Position bias scales with quality similarity
The 150K-judgment study on MTBench and DevBench found position bias is strongest when the two candidates are close in quality: exactly the regime where you most need the judge to be reliable, like model-vs-model regression testing. Always swap-and-aggregate, especially on close pairs. Pointwise judging is not immune; the 2026 paper Am I More Pointwise or Pairwise? shows even rubric-based pointwise scoring leaks position effects.
Calibration techniques that actually work
Synthesizing from Autorubric, LLM-Rubric, and the AWS Nova rubric-judge guidance:
- Behavioral anchoring: locked rubrics with explicit textual evidence rules per score level
- Few-shot exemplars maintained per evaluation method (separate sets for pointwise vs pairwise)
- Logprob-weighted scoring on the score tokens, not greedy decode
- Justification-required scoring: force the judge to write rationale before the score; this reliably lifts QWK
- Pairwise > pointwise for open-ended quality (LCES result)
- Position swap and aggregate for any pairwise eval
- 0-5 scale, not 1-10 (Hugging Face cookbook, Databricks grading-notes both confirm)
Production patterns from the leading teams
Khan Academy: chat-thread randomization
The single most copyable pattern. Khan Academy A/B-tests at the chat-thread level, not the user level. Each new conversation is its own experimental unit. This roughly 10x's effective sample size relative to user-level randomization and removes user-level confounders. They also run the LLM-as-judge on roughly 20% of conversations nightly with results piped into live dashboards.
MagicSchool: three-stage AI Safety Loop
Framing → Auditing → Refining, with K-12-tuned moderation layers. SOC 2, FERPA, COPPA. Heavy on guardrails, lighter on quantitative pedagogy eval as published. (FERPA/COPPA in detail in the compliance spoke.)
Anthropic and OpenAI cross-evaluation
Joint safety eval of each other's models. First-of-its-kind cross-lab pattern that edtech vendors are likely to imitate as a stronger baseline than self-eval.
Two pipelines, different cadence
The production consensus across Khan, Anthropic, and the LLM-monitoring vendors:
- Offline regression: runs on every prompt or model change. Deterministic, blocks deploys. Frozen ground-truth set. Catches obvious quality regressions.
- Online sampling: 1 to 10% of live traffic (Khan runs ~20% but they have the budget) routed through judges nightly. Drift alarms when scores drop more than 10-20% week over week.
- Shadow traffic: replay production prompts against the candidate model in parallel before promotion.
- Monthly full-suite re-runs against archived ground truth, to catch judge drift.
Chen, Zaharia, Zou , "How is ChatGPT's Behavior Changing Over Time?" documented GPT-4 prime-number identification dropping from 84% to 51% in three months. Drift is not a hypothetical.
Wiring this stack with Respan
A practical eval setup looks like this:
from respan import Respan
client = Respan(api_key=os.environ["RESPAN_API_KEY"])
# 1. Build datasets directly from production traffic, scoped by workflow
asap_set = client.datasets.from_production(
filter={"workflow": "essay-grading", "rubric_version": "v3"},
limit=500,
)
# 2. Run experiments with multiple evaluators per row
exp = client.experiments.run(
name="prompt-v9-vs-v10-pointwise-and-pairwise",
dataset=asap_set,
evaluators=[
"rubric_anchored_pointwise", # 0-5, justification required
"pairwise_swapped", # both orders, aggregated
"faithfulness_to_rubric", # ground in retrieved rubric chunks
"length_stratified_qwk", # bin by output length, surface per-bin
],
)Three details matter:
- Datasets pulled from production keep your eval set in distribution. Frozen 2012 ASAP-AES alone catches some regressions and misses many. Pair it with weekly samples from your own traffic.
- Length-stratified eval is the cheapest fix for length bias. Output a per-bin QWK rather than only the aggregate.
- Run pairwise with position swap on the same row twice and aggregate. The compute cost is 2x; the signal is several times cleaner on close-quality comparisons.
Tracing every eval run with span context lets you go back and see why a specific case was scored low: which prompt, which retrieval, which evaluator. That is the difference between "our QWK dropped 5 points last week" and "our QWK dropped 5 points last week because retrieval started missing trait 3 on prompts longer than 600 tokens."
A reference eval stack for an AI tutor
If you are starting from zero today, this is the smallest defensible setup:
- A frozen ground truth set of 200-500 cases per high-stakes path (math hint, essay grade, science explanation). Double-rated by humans. Tag by grade band and topic. Recompute QWK on every prompt change.
- A pedagogy eval set of 100-200 adversarial cases (do-my-homework, rubric-leak, "just tell me the answer", hint-vs-answer trade-off). Score by an LLM-as-judge calibrated against human teacher labels.
- A safety eval set of 100-200 cases covering SafeTutors taxonomy slices most relevant to your audience (K-8, K-12, undergrad). Block deploys on regressions.
- An online judge running on 5-10% of live traffic, scoring correctness and pedagogy independently, alerting on weekly drops.
- A length-stratified report appended to every offline run.
- A swap-and-aggregate pairwise judge for prompt-vs-prompt experiments.
- A monthly full-suite re-run against the archived ground truth, to catch judge drift before model drift.
For the rest of the Education cluster, see the FERPA/COPPA spoke on compliance. Pillar (industry overview), hallucination spoke, and essay-grader walkthrough are next. To wire the eval stack on Respan, start tracing for free, read the docs, or talk to us.
How Respan fits
Tutor evals fail in production because the three layers (correctness, pedagogy, safety) need independent scoring, length-stratified reporting, and judges calibrated against held-out human anchors. Respan was built for exactly this kind of multi-evaluator, multi-cadence loop.
- Tracing: every tutor turn captured as one connected trace. Auto-instrumented for LangChain, LlamaIndex, Vercel AI SDK, CrewAI, AutoGen, OpenAI Agents SDK. Retrieval spans, rubric-fetch spans, PRM step verdicts, and judge calls all hang off the same root, so when QWK drops you can see whether it was the prompt, the retrieval, or the judge.
- Evals: ten built-in evaluators (faithfulness, citation accuracy, refusal correctness, harmfulness) plus LLM-as-judge and custom Python evaluators. Production traffic flows directly into datasets. CI-aware experiments block regressions on hint-vs-answer leaks, rubric-leak refusals, and length-stratified QWK collapses before deploys ship.
- Gateway: 500+ models behind an OpenAI-compatible interface, semantic caching, fallback chains, per-customer spending caps. Run pedagogy-tuned and reasoning-tuned models side by side without touching the application code, and cap per-school spend so a single classroom cannot blow your budget.
- Prompt management: versioned registry, dev/staging/prod environments with approval workflows, A/B testing in production with one-click rollback. Rubric prompts, Socratic-hint prompts, and refusal prompts each get their own version history and can be rolled back independently when a cohort starts gaming a new rubric.
- Monitors and alerts: aggregate QWK, length-stratified QWK on the worst bin, refusal-correctness rate, do-my-homework leak rate, SafeTutors taxonomy hit rate. Slack, email, PagerDuty, webhook. Drift alarms fire when any tracked metric drops more than 10 to 20% week over week, matching the Khan-style nightly cadence.
A reasonable starter loop for AI-tutor builders:
- Instrument every LLM call with Respan tracing including retrieval, rubric-fetch, PRM step, and judge spans.
- Pull 200 to 500 production tutor conversations into a dataset and label them for correctness, pedagogy (ICAP), and safety.
- Wire two or three evaluators that catch the failure modes you most fear (hint-vs-answer leaks, length-bias inflation, cross-student data exposure).
- Put your rubric and Socratic-hint prompts behind the registry so you can version, A/B, and roll back without a deploy.
- Route through the gateway so you can swap pedagogy-tuned models in and out and enforce per-school spending caps without code changes.
That gives you a defensible eval stack on day one and a clean path to the full Khan-style production cadence as you grow.
To wire any of the patterns above on Respan, start tracing for free, read the docs, or talk to us.
FAQ
Is ASAP-AES still the right benchmark in 2026? Yes, with caveats. It remains the canonical public essay corpus because of double-rating and the QWK convention. But it is 2012 vintage, all English, narrow grade bands, and there is growing concern that frontier models have effectively memorized the prompts. Pair it with ASAP++ (trait-level), ASAP 2.0 (newer argumentative essays), and a held-out sample of your own production traffic.
What QWK should I aim for? ETS sets 0.70 as the acceptability threshold for AES systems. Human-human QWK on essays is typically 0.6 to 0.8. The 2025 synthesis of 65 LLM-AES studies puts human-LLM QWK at 0.30 to 0.80 with frontier models on the high end. Aim for at minimum 0.70 aggregate, and check that your worst length-stratified bin is also above 0.65.
Can I just use one LLM judge for everything? Not safely. Self-preference bias is mechanistic, length bias is conditional, position bias scales with quality similarity. The minimum viable judge stack is one calibrated against human anchors, with position swap on pairwise tasks and length-stratified reporting on pointwise tasks.
How often should I run online evals? Khan Academy runs ~20% sampled nightly. Industry default is 1 to 10% sampled nightly with weekly aggregated dashboards and monthly full-suite re-runs. The minimum is daily sampled judge runs on a fixed slice, not weekly batch.
What about pedagogy benchmarks? MathTutorBench is the strongest published pedagogy-only benchmark as of 2026. BEA 2025 Shared Task results give you the field's macro-F1 ceilings. SHAPE is the integrated safety-helpfulness-pedagogy benchmark. None of these are saturated; the BEA guidance F1 is still 58.34.
Does cross-vendor judging fix self-preference bias? Partially. The mechanism is perplexity-under-self, so a Claude judge of OpenAI output still favors what Claude finds more familiar. The fix is calibration against held-out human anchors, not just rotating vendors.
