In a PNAS paper from 2025, high-school students given ungrounded GPT-4 access during practice scored worse on subsequent independent tests than students with no AI. Same year, the WestEd longitudinal RCT of Khan Academy's Khanmigo found a +0.23 SD math effect (and +0.31 SD for English Language Learners) when students used the tool 30 minutes per week or more, with engagement collapsing 60% after week three when teachers did not actively orchestrate use.
Both numbers are about the same product category. The difference is the architecture. The PNAS subjects had raw ChatGPT. Khanmigo students had RAG over a curated curriculum, a Socratic system prompt, a calculator hop for arithmetic, eval pipelines that scored ~20% of conversations nightly, and teachers in the loop. The build pattern is not a footnote in edtech AI. It is the variable that separates products that improve outcomes from products that degrade them.
This pillar covers how the leading edtech AI teams actually build. It is not a market overview. It is the patterns they share, the stacks underneath, and the failure modes that show up in production. For deeper work on specific layers, see the spokes:
- Why AI Tutors Give Wrong Answers: hallucination control for math and factual tasks (coming next)
- How to Evaluate AI Tutoring: three-layer eval, judge biases, production cadence
- FERPA and COPPA for AI Tutoring: the regulatory layer post-April 22, 2026
- Building an AI Essay Grader: end-to-end build walkthrough
What changed in 2025-2026
Edtech VC has not snapped back to its 2021 highs, but AI is the only thing keeping the category alive. Total edtech funding landed near $2.8B globally in 2025, flat year over year, but with over 60% of funded companies building with AI and a sharply more concentrated deal mix. Q1 2025 raised $410M globally, with three companies (Leap Scholar $65M, Campus $46M, and MagicSchool $45M Series B) absorbing nearly half.
The headline rounds:
- MagicSchool: $45M Series B (Jan 2025), 6M+ educators signed up, called the "fastest-growing technology platform for schools ever" by their lead investor.
- Flint: $15M Series A (Nov 2025), 400K+ users across hundreds of independent schools.
- SigIQ.ai: $9.5M seed (Apr 2025), PadhAI 200K learners in six months in India, EverTutor.ai 10K in three months in the US.
- Speak: $78M Series C at a $1B valuation (Dec 2024), $100M revenue run-rate, Live Roleplays running on OpenAI's Realtime API.
Strategic M&A swallowed more capital in 2025 than VC did. PowerSchool to Bain Capital ($5.6B), Instructure to KKR ($4.8B take-private), Coursera-Udemy ($2.5B), and Sana Labs to Workday ($1.1B) total roughly $14B in deals. The strategic acquirers, not VCs, are now setting the AI roadmap for the category.
Foundation-model adoption signals are the more interesting data:
- Khanmigo reached 2.0M students, educators, and parents during SY24-25, with 770K students inside paid district partnerships at $5/student. Teachers in 70+ countries on the free tier.
- ChatGPT Edu has sold over 700K licenses to ~35 public universities. The Indiana University rollout (120K seats) is OpenAI's second-largest deployment ever. Oxford became the first UK university to go campus-wide. Telemetry across 20 campuses: 14M ChatGPT uses in September 2025 alone, 176 interactions per user that month.
- Claude for Education launched April 2025 and shipped Learning Mode, a Socratic-by-default system prompt that refuses to give direct answers. Northeastern, LSE, Champlain were design partners. Syracuse and Pitt rolled out campus-wide later in 2025.
Teacher adoption: 53% of US ELA, math, and science teachers used AI for school in 2025, up 15+ percentage points YoY. 60% of K-12 public school teachers used AI in SY24-25, 32% weekly. By grade band: HS 69%, MS 64%, ES 42%. Teachers using AI weekly save ~6 weeks per year.
Student adoption: 54% of US teens use AI chatbots for schoolwork (Pew, Feb 2026). 59% say AI cheating is a regular occurrence at their school. 67% agree "the more students use AI, the more it harms critical thinking," up 10 percentage points in ten months. The cheating debate has flipped from "is it happening" to "is it harming cognition."
Five build patterns that are working
The leading products do not all do the same thing, but they all combine some subset of five patterns. Each pattern has a canonical example, a stack underneath, and a hard part that separates v0 demos from production systems.
Pattern 1: RAG over curriculum content
The canonical example is Khanmigo plus the Khan Academy library. Khanmigo retrieves from the full Khan Academy content tree (math, humanities, coding, social studies). The non-obvious design choice is difficulty-aware retrieval: when a student is stuck on calculus, the retriever can pull foundational algebra explanations rather than only calculus chapters. Context selection is conditioned on inferred student level, not just topic match.
MagicSchool retrieves over standards (Common Core, NGSS, state-by-state). Lesson planner, IEP generator, and rubric builder outputs are tagged to specific standards codes. Quizlet added an upload-to-grounded-quiz workflow: students upload notes or PDFs, and the system generates flashcards, practice tests, and study guides constrained to the uploaded source.
Hard parts. Chunking pedagogical content is harder than chunking documentation. A worked example should not be split mid-step, and prerequisite skills must be linked, not flat-indexed. Khan Academy's content has a skill graph that doubles as a retrieval scaffold. Versioning matters because curriculum standards drift state by state and year by year. Citations are weak across the category. Most products do not show the student which chunk was retrieved.
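The difficulty-aware retrieval described above can be sketched in a few lines. Everything here is illustrative: the `Chunk` fields, the 1-5 level scale, and the prerequisite fallback are assumptions for exposition, not Khan Academy's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    topic: str
    difficulty: int  # 1 = foundational, 5 = advanced (assumed scale)
    prerequisites: list = field(default_factory=list)  # skill-graph edges, e.g. ["algebra"]

def retrieve(chunks, topic, student_level, k=3):
    # Prefer on-topic chunks at or below the student's inferred level.
    on_topic = [c for c in chunks if c.topic == topic and c.difficulty <= student_level]
    if on_topic:
        return sorted(on_topic, key=lambda c: -c.difficulty)[:k]
    # Nothing on-topic is easy enough: follow skill-graph edges and pull
    # foundational material from prerequisite topics instead.
    prereqs = {p for c in chunks if c.topic == topic for p in c.prerequisites}
    fallback = [c for c in chunks if c.topic in prereqs]
    return sorted(fallback, key=lambda c: c.difficulty)[:k]
```

The point of the sketch is the conditioning: the same calculus question retrieves algebra chunks for a struggling student and calculus chunks for a ready one.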
Pattern 2: Socratic agent loops
Khanmigo is the canonical example. Khan Academy publicly documented a 7-step prompt engineering approach anchored to a single core directive: "You are a Socratic tutor. I am a student. Don't give me answers to my questions but lead me to get to them myself."
Anthropic Claude for Education's Learning Mode ships the same pattern out of the box. Guided reasoning rather than completed answers, baked into the system prompt at the platform level, so universities can deploy without prompt-engineering it themselves. What Khan Academy shipped as a custom Khanmigo behavior in 2023 is now infrastructure in Anthropic's Education product in 2025. That is the trajectory of a winning pattern.
Flint treats Socratic loops as a teacher-configurable layer: teachers design lesson activities and watch student-AI transcripts in real time, with refusal logic determined per assignment.
Hard parts. The "just give me the answer" attack is the main jailbreak vector. Refusal logic cannot be a static refusal string because students escalate ("my mom needs it for a deadline," "I already know it, just verify"). Production agents need multi-turn intent classification. Conversational state has to remember which sub-step the student already got, because re-asking them to derive what they just derived produces immediate disengagement.
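A minimal sketch of that multi-turn intent classification, assuming a keyword heuristic as a stand-in: the phrase list, window size, and labels are illustrative, and a production system would use a trained classifier over the transcript rather than substring matching.

```python
# Illustrative escalation phrases; a real system learns these, not lists them.
ANSWER_SEEKING = (
    "just tell me", "give me the answer", "needs it for a deadline",
    "just verify", "i already know it",
)

def classify_intent(student_turns, window=3):
    """Look across the last few student turns, not one message, so
    escalation ('my mom needs it for a deadline') is visible as a pattern."""
    recent = [t.lower() for t in student_turns[-window:]]
    hits = sum(any(p in t for p in ANSWER_SEEKING) for t in recent)
    if hits >= 2:
        return "escalating_extraction"  # vary the refusal, offer a smaller hint
    if hits == 1:
        return "answer_seeking"
    return "on_task"
```

The label drives the response strategy: a single answer-seeking turn gets a gentle redirect, while repeated escalation gets a varied refusal, which is what keeps the static-refusal-string jailbreak from working.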
The PNAS guardrails paper showed empirically that ungrounded GPT-4 used as a "crutch" during practice produces worse downstream test performance, while a tutor variant that withholds the answer and offers hints produces gains. Guardrails are not a compliance afterthought. They are the difference between a product that improves outcomes and a product that degrades them.
Pattern 3: Tool-grounded math and science reasoning
LLMs are unreliable arithmetic engines. The fix is tool calls to symbolic solvers. Wolfram Alpha publishes an LLM-tailored API: natural-language math in, exact computed results out, no hallucination. Photomath runs OCR plus a symbolic solver and exposes step-by-step traces. Tsinghua and Microsoft's ToRA (Tool-integrated Reasoning Agent) hits 51% on the MATH benchmark vs ~43% for vanilla GPT-4: a meaningful gap entirely from tool grounding.
Khanmigo wires a calculator and symbolic-math module into the loop. Their team publicly described building these because token-prediction-only math was unacceptable. Flint pipes Claude 4 Sonnet plus code-execution math plus translation tool plus web search.
Hard parts. Deciding when to defer is the unsolved part. Cheap heuristics (regex for numbers and equations) are noisy. Better systems use a small classifier or self-consistency vote. The other unsolved problem is presenting solver output pedagogically. Wolfram's dense notation needs an LLM rewriter pass to become a readable hint.
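The regex gate plus self-consistency vote mentioned above can be combined into one routing decision. This is a sketch under stated assumptions: the surface-feature pattern, the 0.7 agreement threshold, and the function shape are all illustrative.

```python
import re
from collections import Counter

# Cheap surface check: digits, operators, or math verbs (assumed pattern).
MATH_SURFACE = re.compile(r"[\d=^]|[+\-*/]\s|\b(solve|integral|derivative|simplify)\b")

def should_defer_to_solver(question, sampled_answers, agreement=0.7):
    """Route to a symbolic solver when (a) the question looks like math
    and (b) k sampled model answers disagree (weak self-consistency)."""
    if not MATH_SURFACE.search(question.lower()):
        return False  # no math surface features: let the LLM answer directly
    top_count = Counter(sampled_answers).most_common(1)[0][1]
    return top_count / len(sampled_answers) < agreement
```

The regex alone is the noisy heuristic the section warns about; the self-consistency vote is what makes it usable, at the cost of k extra samples per question.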
For depth on math hallucinations, self-consistency, and process reward models, see the hallucination spoke (coming next in this cluster).
Pattern 4: Roleplay and conversational simulation
Duolingo Max Video Call with Lily is the canonical product. The architecture is publicly described: ASR → text → GPT-4-class LLM with a three-role prompt frame (System = Duolingo learning designer instructions, Assistant = Lily persona, User = learner) → TTS. Real-time, but with noticeable latency that creates awkward pauses.
Speak moved past the round-trip pattern by adopting OpenAI's Realtime API directly for Live Roleplays, doing speech to speech without an intermediate text hop. This is the architectural shift to watch in 2026. Duplex voice closes the latency gap that made Duolingo's V1 feel sluggish. Speak's $1B valuation is partly a bet on this architecture being defensible.
Hard parts. Voice-to-voice latency target in 2026 is roughly 800ms median, 1500ms acceptable for V1. For kid-facing products, persona-stable safety filtering matters more than for adults. The model has to refuse jailbreaks while staying in character (Lily cannot break character to refuse). Pronunciation scoring is a separate model layer typically not discussed in marketing.
Pattern 5: Teacher copilot content generation
MagicSchool is the clear category leader: 80+ specialized tools (lesson plan generator, IEP generator, rubric builder, behavior intervention scripts, parent-communication templates). Reported time savings: 7 to 10 hours per teacher per week, with IEPs going from 2-3 hours to 30-45 minutes. Eduaide has 100+ tools. Khanmigo for Teachers (free, distributed via the Microsoft partnership) covers lesson plans, exit tickets, and differentiation.
Hard parts. Teacher intent capture is a UX problem more than a model problem. The winning pattern is structured forms over free-text prompts. MagicSchool's IEP tool, for example, asks for grade, subject, present level, goals, and accommodations as separate fields. Output formats matter: lesson plans need printable PDFs, IEPs need state-specific compliance fields, worksheets need answer keys. The editing workflow is critical: teachers will not ship raw LLM output, so every tool needs an in-place editor with a regenerate-this-section option.
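The structured-forms-over-free-text pattern is simple to sketch. The field names follow the IEP example above; the prompt wording and function are illustrative, not MagicSchool's implementation.

```python
from dataclasses import dataclass

@dataclass
class IEPForm:
    grade: str
    subject: str
    present_level: str
    goals: str
    accommodations: str

def build_iep_prompt(form: IEPForm) -> str:
    # Each form field lands in a labeled slot, so intent capture happens
    # in the UI instead of being guessed from a free-text prompt.
    return (
        "Draft an IEP section for a teacher to edit.\n"
        f"Grade: {form.grade}\n"
        f"Subject: {form.subject}\n"
        f"Present level of performance: {form.present_level}\n"
        f"Annual goals: {form.goals}\n"
        f"Accommodations: {form.accommodations}\n"
        "Return one clearly headed section per field so the teacher "
        "can regenerate sections individually."
    )
```

Asking the model for one headed section per field is also what makes a regenerate-this-section editor possible downstream.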
What is hard across all five patterns
A short audit of the failure modes that cross patterns and break products in production.
Hallucination tolerance varies by pattern. RAG over curriculum is the most forgiving (worst case: irrelevant retrieval). Math reasoning is the least forgiving (a wrong answer destroys trust instantly). Roleplay is medium (a factual error in a French dialogue is recoverable; a safety lapse is not).
Latency budgets are pattern-specific. Voice agents 800-1500ms voice to voice. Chat tutors 3-5s for first token before students disengage. Teacher tools can take 10-30s if a progress indicator is shown.
Cost and pricing. Khanmigo's $5/student/year district pricing is only viable because most usage is light. Heavy users (>30 min/week) are the minority but capture most of the learning gain. SigIQ explicitly markets "elite tutoring at the cost of computation, not hundreds of dollars per hour."
Eval methodology is uneven. WestEd's Khanmigo RCT used standardized math achievement as outcome (gold standard). Most vendors publish only engagement metrics or self-reported teacher time savings, which are weak proxies. The eval gap is the single biggest open problem in the category. The evaluation spoke covers this in depth.
Production traffic breaks products in predictable places: cheating attempts (system prompt extraction, "just tell me the answer"), off-topic chatter (especially K-8), inappropriate content from the student side (the model has to refuse without lecturing), and authentication, SSO, and LMS sync. The LAUSD AllHere collapse broke specifically at LMS data integration.
A cautionary tale
AllHere's "Ed" chatbot for LAUSD launched March 2024 on a $6.2M five-year contract. LAUSD pulled the plug on June 14, 2024 after AllHere furloughed staff. Founder Joanna Smith-Griffin was arrested in November 2024 in Raleigh and charged in U.S. District Court (Manhattan) with securities fraud, wire fraud, and aggravated identity theft, alleged to have defrauded investors of ~$10M. In February 2026 the FBI served a search warrant on Superintendent Carvalho's home and office and raided LAUSD HQ.
The technical lesson: AllHere promised a chatbot that pulled from "nearly every critical learning resource." Engineers later said this would have cost ~10x the contracted amount and the company had no system-integration experience. In edtech AI, the gap between what a sales deck promises and what an LMS integration actually costs is where products go to die.
The broader lesson: the edtech backlash has arrived. Inside Higher Ed (Oct 2025) and EdWeek (Apr 2026) both ran "the ed-tech backlash is here" pieces. The narrative shifted from "AI will save teachers" to "vendors are putting AI before educator expertise." This is the macro environment that 2026 builds ship into.
How Respan fits
Respan is built around five primitives that map cleanly onto the patterns above. None of them are pattern-specific. They are the substrate the patterns sit on.
- Tracing: every turn, every retrieval, every tool call, every safety classifier hit. Auto-instrumented for LangChain, LlamaIndex, Vercel AI SDK, CrewAI, AutoGen, OpenAI Agents SDK. Session IDs let you view a multi-turn tutor conversation as one connected trace, which is the difference between a debuggable production system and a black box.
- Evals: ten built-in evaluators (faithfulness, citation accuracy, refusal correctness, harmfulness) plus LLM-as-judge with custom Python evaluators. Datasets pulled directly from production traffic. CI-aware experiments that block deploys on regressions.
- Gateway: 500+ models behind an OpenAI-compatible interface, semantic caching (~35% cost reduction on repeated context), fallback chains with health checks, per-customer spending caps and token budgets. PII redaction at the gateway is the foundation of the FERPA/COPPA architecture.
- Prompt management: versioned registry, dev/staging/prod environments with approval workflows, A/B testing in production with one-click rollback. Curriculum and rubric prompts live here, not in code.
- Monitors and alerts: error rate, cost, latency, token usage, custom metrics. Slack, email, PagerDuty, webhook. For voice agents: P95 round-trip latency, transfer-to-human rate, safety-classifier hits as first-class signals.
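The fallback-chain primitive above can be sketched provider-agnostically. This shows the pattern only, under assumed semantics: the gateway does this server-side, and real chains add health checks, timeouts, and retry budgets.

```python
def complete_with_fallback(providers, messages):
    """Try each (name, callable) provider in order; return the first
    success. Health checks, timeouts, and retry budgets are elided."""
    last_err = None
    for name, call in providers:
        try:
            return name, call(messages)
        except Exception as err:  # a real chain catches narrower error types
            last_err = err
    raise RuntimeError("all providers in the chain failed") from last_err
```

The value of having this behind one OpenAI-compatible interface is that call sites never encode the chain: swapping a primary model is a config change, not a deploy.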
A reasonable starter loop for an edtech AI build:
- Instrument every LLM call with Respan tracing including retrieval and tool-call spans.
- Pull 200 to 500 production conversations into a dataset and have a teacher or curriculum lead label them.
- Wire two or three evaluators that catch the failure mode you most fear (hallucinated math, gives-the-answer-too-soon, off-topic drift).
- Put your prompts behind the registry so you can version, A/B, and roll back without a deploy.
- Route through the gateway so you can switch models when GPT-6 ships or a competitor drops their price.
That loop, running on real traffic, is the difference between a demo and a system that survives parent inquiry, teacher review, and the FERPA audit trail.
Where to go next in this cluster
- The evaluation spoke is the deepest dive. Three-layer eval, judge biases, ASAP-AES, MathTutorBench, the Khan Academy chat-thread randomization pattern, and a reference eval stack you can copy.
- The FERPA/COPPA spoke covers the regulatory layer post-April 22, 2026. PII redaction at the gateway, audit logging that is itself FERPA-compliant, prompt injection as a disclosure vector.
- The essay grader walkthrough is the end-to-end build. Decomposed rubrics, RAG over exemplars, faithfulness checks, prompt-injection defenses, demographic split testing, and a build-vs-buy framework.
- The hallucination spoke (coming next) covers self-consistency, tool-grounding, process reward models, refusal training, and the OpenAI hallucination paper that says the bug is in your eval suite, not your model.
To wire the patterns above on Respan, start tracing for free, read the docs, or talk to us. If you are building edtech AI and want a hand wiring up the eval, gateway, and tracing layers for the patterns in this post, that is what the team is here for.
FAQ
Does AI tutoring actually improve learning? Yes when it is built right, no when it is not. The Khanmigo + WestEd RCT showed +0.23 SD in math (50th to ~59th percentile) and +0.31 SD for ELLs at >=30 min/week. The PNAS guardrails paper showed ungrounded GPT-4 produced worse test outcomes than no AI. Architecture is the variable. Curriculum-grounded RAG, Socratic loops with refusal logic, tool-grounded math, and teacher orchestration are what separate the two.
What's the most copyable production pattern from a leading edtech team? Khan Academy's chat-thread-level A/B randomization. Each new conversation is its own experimental unit. This roughly 10x's effective sample size relative to user-level randomization and removes user-level confounders from prompt experiments.
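A minimal sketch of thread-level assignment, assuming the mechanics (a salted hash of the conversation ID), which the source does not specify:

```python
import hashlib

def assign_arm(thread_id, arms=("control", "variant"), salt="prompt-exp-01"):
    """Hash the conversation ID, not the user ID: the same student can
    land in different arms across threads, but each thread is stable."""
    digest = hashlib.sha256(f"{salt}:{thread_id}".encode()).hexdigest()
    return arms[int(digest, 16) % len(arms)]
```

Changing the salt per experiment re-randomizes assignments, so consecutive prompt experiments stay independent.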
Can I just use raw ChatGPT in a classroom? The PNAS paper says no. Ungrounded LLMs produced worse downstream test performance than no AI at all. The economics of guardrails (eval pipelines, refusal logic, RAG over curriculum, tool-grounding for math) are not optional for a product that wants to improve outcomes.
What is voice architecture's near-term direction? Older products (Duolingo Max V1, Khanmigo) chain ASR → LLM → TTS and feel slow. New entrants (Speak Live Roleplays) use OpenAI Realtime API speech-to-speech and feel native. By end of 2026 the older pattern will look obsolete in roleplay and tutor categories.
What is the silent killer of edtech AI products? Engagement decay. Khanmigo + WestEd: 60% of student engagement disappeared after three weeks when teachers did not structure use. AI tutors are not a self-serve consumer product. They only work as a teacher-orchestrated tool. This rewrites the GTM thesis for any company selling D2C.
