On March 9, 2026, a federal magistrate judge in the District of Minnesota ordered UnitedHealth Group to produce internal documents on whether nH Predict, the AI algorithm it deploys to manage Medicare Advantage claims, was designed to override the clinical judgment of doctors. The court granted plaintiffs' motion to compel discovery across six of seven document categories. The lawsuit, originally filed November 2023 and surviving multiple motions to dismiss, alleges that nH Predict produced a 90% error rate measured against appeal reversals, that UnitedHealth pressured employees to keep patient rehabilitation stays within 1% of the algorithm's predicted length of stay, and that the AI was deployed in lieu of clinical professionals making coverage determinations. Plaintiffs include the estates of beneficiaries who died after coverage was terminated.
The case is not isolated. Cigna faces a parallel suit alleging its PXDX algorithm allowed doctors to deny claims in batches of hundreds or thousands without individual review. Humana faces similar accusations. Together, these cases signal a regulatory and litigation environment where the mere fact that an AI system operates inside a claims process invites discovery, even where the system produces correct decisions. Carriers without audit-grade infrastructure cannot adequately defend the systems they deploy.
For engineering teams building or operating claims AI in 2026, the lesson is concrete: claims AI is not a low-stakes operational application. It is a Tier 1 system under any reasonable reading of the NAIC Model Bulletin, subject to bias testing requirements, audit trail expectations, and adverse-outcome documentation that will be examined under the AI Evaluation Tool. The architectural and evaluation patterns that work are different from those for less-regulated AI applications.
This post covers what claims AI actually does, where the failure modes that produced the nH Predict lawsuit emerged, the architectural patterns that limit similar exposure, and the evaluation discipline that distinguishes defensible systems from systems that show up in a complaint.
What claims AI actually does
The category covers a wider range of workflows than "AI denies your claim." The taxonomy:
| Workflow | What the AI does | Regulatory exposure |
|---|---|---|
| Claim intake and structuring | Extract structured fields from FNOL submissions, photos, documents | Low: read-only data extraction |
| Triage and routing | Categorize claims by severity, complexity, fraud risk; route to appropriate adjuster queue | Medium: process-level decisions affecting timing |
| Coverage determination | Evaluate whether the claim falls within coverage terms | High: directly affects whether claim is paid |
| Reserves and severity prediction | Estimate likely cost of the claim for accounting | Medium: financial reporting accuracy |
| Utilization review (health) | Determine medical necessity and appropriate care duration | High: directly affects what care is covered |
| Subrogation identification | Identify third-party liability for recovery | Low to medium: affects cost recovery |
| Fraud detection | Flag suspicious patterns for special investigations | Medium-high: affects whether claim is paid and whether customer is reported |
| Settlement valuation | Estimate appropriate settlement amount | High: directly affects payment |
| Adjuster copilot | Surface relevant precedent, guidelines, similar claims | Low: aids human decision |
| Customer service | Handle claim status inquiries, basic FAQ | Low to medium: depending on whether actions are taken |
Most of these are reasonable LLM applications. A few are not, and the "not" category is what produces the nH Predict-style cases.
The pattern that goes wrong: an AI tool starts as a decision-support aid and gradually becomes the decision-maker, either because operational pressure pushes adjusters to defer to its recommendations or because business logic explicitly subordinates human judgment to its outputs. UnitedHealth's alleged practice of measuring adjuster compliance against nH Predict's predicted length of stay is the canonical example. The tool was technically advisory; the operational deployment made it determinative.
How nH Predict-style failures emerge
Three patterns repeat across the claims AI cases that have reached litigation.
The model is treated as ground truth. The claims process is structured so that the AI's output is the default decision and human review is the exception. Adjusters who deviate from the AI face management pressure or performance metrics that punish deviation. Over time, deviation rates drop and the AI is functionally the decisioner.
Validation evidence is thin or absent. When the algorithm is finally examined, the carrier cannot produce robust pre-deployment validation, ongoing performance monitoring against ground truth, or bias testing across protected groups. The 90% reversal rate alleged in the nH Predict suit is exactly the kind of finding that validation should have caught and the kind of finding that suggests it was not running.
Human appeals are filtered out. The carrier knows the appeal reversal rate is high but does not let that signal feed back into the model. The system optimizes for initial denial throughput rather than correctness. When discovery later compels the metrics, the gap between initial decisions and appealed-and-corrected decisions becomes the story.
The architectural defense against these patterns is structural, not procedural. Process changes ("we'll review the AI more carefully") do not survive operational pressure. Architecture changes ("the AI cannot be the decisioner without human approval logged in the system") do.
Architectural patterns that work
Three patterns describe defensible claims AI architectures. The right one depends on the workflow and the stakes.
Pattern A: AI as decision support with mandatory human disposition
The AI surfaces information, recommendations, and risk flags. The human adjuster makes the determination and records their reasoning. The AI's output is captured but is not the determinative record.
When this works. Coverage determination, utilization review, fraud flagging, and any other workflow where the carrier has direct adverse exposure if the AI is functionally the decisioner.
Implementation discipline.
- The AI's recommendation and the adjuster's decision are separate fields in the claims record. Both are required.
- When the adjuster's decision matches the AI's recommendation, that is logged and visible.
- Operational metrics do not penalize adjusters for deviating from the AI. Quality metrics measure adjuster decisions against ground truth (appeal outcomes, audit reviews), not against AI agreement.
- When the appeal reversal rate against the AI exceeds a threshold, the model is automatically flagged for retraining or revalidation, not allowed to continue as-is.
This pattern is what carriers experienced in the nH Predict litigation should have implemented. The technical architecture supports adjuster judgment as the system of record; the operational practice does not penalize judgment that diverges from the AI.
Pattern B: AI as classifier, human as decisioner
The AI classifies claims into operational categories (route to fast-track, route to investigation, route to senior adjuster) without making coverage determinations directly. Human decisioners handle the determinations within whichever queue the AI routed to.
When this works. Triage, intake, routing. The AI's output is operational, not adjudicative.
Implementation discipline.
- The AI's categorization is logged but does not appear in the coverage determination record.
- Misrouting (claims that the AI sent to fast-track but should have gone to senior review) is tracked and triggers retraining when rates exceed thresholds.
- The classifier is bias-tested across demographic groups; routing rate disparities surface as findings.
This pattern provides operational efficiency without creating the determinative-AI exposure of Pattern A applied incorrectly.
Pattern C: AI as deterministic-rule augmentation
The AI's role is to extract structured information that feeds into deterministic rule-based decisioning. The actual coverage determination is made by code that applies policy terms; the AI's contribution is converting unstructured inputs (photos, documents, narrative descriptions) into structured fields.
When this works. Workflows where coverage determination is rule-based and the AI's role is preprocessing, not adjudication.
Implementation discipline.
- The deterministic rules are documented, versioned, and inspectable.
- The AI's structured outputs are validated against source documents (each extracted field is grounded in specific source text).
- Errors in AI extraction are caught downstream by rule consistency checks.
- The AI does not have the authority to bypass or override rule outputs.
This is the most defensible pattern when the underlying coverage decision is genuinely rule-based. It does not work when the coverage decision requires judgment, since the rule engine cannot exercise judgment and the team will be tempted to use the AI to fill the gap.
The evaluation framework specific to claims AI
Claims AI eval has dimensions that less-regulated LLM applications do not.
1. Coverage decision accuracy
The straightforward dimension: does the AI's coverage recommendation align with what a properly trained adjuster would conclude?
Construct the eval set from historical claims with known dispositions, including:
- Claims initially denied and not appealed (presumably correctly denied or at least uncontested)
- Claims initially denied and reversed on appeal (correctly identified as wrongly denied)
- Claims paid in full at the original decision (presumably correctly paid)
- Claims paid after litigation or external review (correctly paid; the original was wrong)
Stratify by claim type, severity, demographic of claimant, and channel. Aggregate metrics hide the failure modes. A model that performs well overall but systematically denies a specific demographic is the model that produces a class action.
2. Reversal rate against final disposition
This is the metric that haunted nH Predict. For every claim the AI was involved in, track:
- The AI's recommendation
- The adjuster's initial decision
- The final disposition after any appeal, external review, or litigation
- The gap between initial and final
A 90% reversal rate means the AI is wrong 9 times out of 10 about claims that get appealed. Carriers with this metric and no remediation plan are buying litigation. The eval framework should compute this continuously, alert when it exceeds thresholds, and feed the disagreement cases back into model retraining.
3. Bias testing across protected characteristics
Required by NAIC Model Bulletin Section 4 and operationalized by the AI Evaluation Tool. The minimum:
- Denial rates per demographic group, with impact ratios
- Appeal-reversal rates per demographic group
- Settlement amounts per demographic group, controlled for claim characteristics
- Claim duration and process metrics per demographic group
Bias can show up at any stage. A claims process where the AI is unbiased but the appeal review is biased still produces disparate outcomes; the audit framework should examine the full pipeline, not just the AI's contribution.
4. Hallucination rate on factual claims
LLM-based claims AI specifically produces text that summarizes claim facts, cites policy provisions, or references case history. Each of these is a factual claim that can be wrong. The eval framework should:
- Validate cited policy provisions against the actual policy
- Validate referenced case history against the actual case file
- Validate factual summaries against source documents
For health claims AI specifically, the litigation environment makes hallucination risk acute. A denial letter that cites a clinical guideline incorrectly is evidence of inadequate AI oversight. Test for this and remediate.
5. Adversarial robustness against gaming
Customers and their representatives increasingly use AI to construct claim narratives. Test the system against:
- Prompt injection in claim narrative fields
- Synthetic but plausible claim photos
- Inconsistent narrative across submitted documents (which a careful human notices but a fast LLM may miss)
- Claim patterns designed to fit known fraud detection blind spots
The defense is not paranoia; it is testing that legitimate claims still process correctly while obvious gaming gets caught.
Audit trail requirements
Every claims AI decision produces a record that supports later examination, internal audit, and litigation defense. The minimum:
claims_ai_decision_record:
decision_id: <uuid>
claim_id: <reference>
policyholder_id: <reference>
timestamp: <ISO>
ai_inputs:
structured_data: <map of claim fields>
documents_referenced: [<list>]
policy_terms_consulted: [<list>]
ai_processing:
model_version: <id>
prompt_version: <id if applicable>
retrieval_context: [<list of sources>]
raw_model_output: <text>
ai_recommendation:
recommended_action: pay | deny | partial | escalate
confidence: <float>
rationale: <structured>
cited_authorities: [<list with source references>]
human_decision:
adjuster_id: <reference>
decision: pay | deny | partial | escalate
rationale: <text>
deviation_from_ai: <boolean>
deviation_reason: <text if applicable>
outcome:
initial_disposition: <text>
appeal_status: <enum>
final_disposition: <text>
final_amount: <decimal>
days_to_resolution: <int>
audit_metadata:
legal_hold: <boolean>
retention_expires_at: <ISO>
cryptographic_hash: <bytes>Records retained for the regulatory period (typically 7-10 years for insurance claims), indexed for query, and exportable on demand. The cryptographic hash supports tamper-evidence; alterations to historical records get caught.
The Estate of Lokken v. UnitedHealth discovery order specifically targeted documents about model design, training, and operational practice. Carriers whose audit trail includes these elements can produce them efficiently. Carriers whose audit trail is partial face discovery costs in the millions and findings that drive settlement amounts higher.
What separates defensible claims AI from the next class action
After watching the litigation environment through 2024-2026:
The AI is decision support, not decisioner. Architecture enforces the distinction; operational practice does not erode it.
Validation evidence is current and complete. Pre-deployment validation, ongoing monitoring, bias testing, and adverse outcome tracking are all in place from day one.
Reversal rates are tracked and acted on. When initial decisions get reversed on appeal, the disagreement informs model retraining. Persistent high reversal rates trigger model retirement, not continued operation.
Bias testing is continuous. Annual is too slow; continuous monitoring with alerts catches issues before they become litigation.
Audit trail supports discovery defense. Decision records are queryable, indexed, and tamper-evident. Discovery requests can be answered in days, not months.
Adjuster judgment is protected operationally. Quality metrics measure decisions against ground truth, not against AI agreement. Adjusters who deviate from the AI are not punished; they are studied for what they saw that the AI missed.
These are the practices that produce claims AI systems that pass examination and survive litigation. Without them, the system is not less effective; it is more exposed.
Build order
Claims AI hardening is sequential, not parallel. The nH Predict pattern emerges when teams ship coverage recommendations before the audit trail and reversal monitoring exist to catch them. Each step below assumes the prior one is in place; skipping forward is what produces the discovery exposure UnitedHealth is now defending.
| Order | What you build | Eval gate before moving on |
|---|---|---|
| 1 | Decision record schema and immutable audit log capturing AI inputs, model version, retrieval context, AI recommendation, adjuster disposition, and final outcome with cryptographic hash | 100% of production claims AI decisions write a complete record; sample of 100 records reproduces the trace end to end |
| 2 | Clinical guideline and policy citation grounding, with every AI-cited authority linked to a specific source passage | Hallucinated citation rate under 1% on a labeled set of 200 production decisions |
| 3 | Adjuster override transparency in the UI and data model, separating AI recommendation from human decision with required deviation rationale | Override rationale captured on 100% of deviations; quality metrics no longer reward AI agreement |
| 4 | Continuous reversal rate monitoring tied to model version, with alerts when initial-vs-final disposition gap exceeds threshold | Reversal rate dashboard live for 30 days; threshold breach routes to retraining queue, not silent continuation |
| 5 | Bias testing across protected characteristics on denial rates, appeal-reversal rates, and settlement amounts | Impact ratios computed weekly; any group below 0.8 four-fifths threshold blocks deploy |
| 6 | Adversarial and gaming test suite plus litigation hold infrastructure that activates on inquiry | Adversarial pass rate above 95%; legal hold tested end to end on a sample claim within 24 hours |
Steps 4 through 6 depend on 1 through 3 being correct; reversal monitoring without a complete decision record cannot attribute disagreement, and bias testing without override transparency cannot separate model bias from operational pressure. Carriers that ship coverage AI before step 1 are the ones whose discovery responses arrive months late and whose models cannot be defended on the record.
How Respan fits
Claims AI lives or dies by what you can produce in discovery, and that means every model recommendation, adjuster decision, and reversal needs to land somewhere queryable. Respan is the substrate underneath the patterns above: trace, eval, gateway, prompt registry, and monitoring wired into a record that survives examination.
- Tracing: every claims AI decision captured as one connected trace, from FNOL ingestion through retrieval, model recommendation, adjuster disposition, and final outcome. Auto-instrumented for LangChain, LlamaIndex, Vercel AI SDK, CrewAI, AutoGen, OpenAI Agents SDK. When discovery asks what nH Predict-style system did on a specific claim on a specific day, you produce the trace in minutes instead of reconstructing it from logs over months.
- Evals: ten built-in evaluators (faithfulness, citation accuracy, refusal correctness, harmfulness) plus LLM-as-judge and custom Python evaluators. Production traffic flows directly into datasets stratified by claim type, severity, demographic, and channel. CI-aware experiments block regressions on coverage decision accuracy, hallucinated policy citations, biased denial rates, and reversal-rate drift before deploys ship.
- Gateway: 500+ models behind an OpenAI-compatible interface, semantic caching, fallback chains, per-customer spending caps. The gateway gives you a single audit-friendly chokepoint where every model call carries the version, prompt, and policyholder reference that the NAIC AI Evaluation Tool will eventually ask for.
- Prompt management: versioned registry, dev/staging/prod environments with approval workflows, A/B testing in production with one-click rollback. Coverage determination prompts, adjuster copilot prompts, denial letter generators, and clinical guideline summarizers belong in the registry so every change is versioned, reviewable, and reversible without a deploy.
- Monitors and alerts: appeal reversal rate against AI recommendation, demographic denial rate disparities, hallucinated policy citation rate, adjuster deviation rate, and length-of-stay prediction drift. Slack, email, PagerDuty, webhook. The system that haunted UnitedHealth was one where a 90% reversal rate ran for years without an alert; this is the inverse.
A reasonable starter loop for claims AI builders:
- Instrument every LLM call with Respan tracing including retrieval spans, policy citation spans, and adjuster disposition spans.
- Pull 200 to 500 production claim decisions into a dataset and label them for coverage accuracy, citation grounding, and demographic distribution.
- Wire two or three evaluators that catch the failure modes you most fear (the model becoming determinative in practice, hallucinated clinical or policy citations, and disparate denial rates across protected groups).
- Put your coverage determination prompts, denial letter templates, and adjuster copilot prompts behind the registry so you can version, A/B, and roll back without a deploy.
- Route through the gateway so model versions, fallback behavior, and per-line spending caps are enforced and logged in one place that downstream audit can reach.
The carriers that survive the next nH Predict-style discovery order are the ones whose tracing, evals, and audit trail were running before the complaint was filed.
To wire any of the patterns above on Respan, start tracing for free, read the docs, or talk to us.
Related reading
- The NAIC AI Evaluation Tool: Engineering for the 2026 Pilot: the regulatory framework
- Evaluating Underwriting LLMs: adjacent insurance AI
- Building an AI Claims Processing Agent: full architecture walkthrough
- How Insurance Teams Build LLM Apps in 2026: pillar overview
- Building Adverse Action Explainability for LLM-Driven Credit Decisions: adjacent fintech compliance pattern
