On March 9, 2026, a federal magistrate judge in the District of Minnesota ordered UnitedHealth Group to produce internal documents on whether nH Predict, the AI algorithm it deploys to manage Medicare Advantage claims, was designed to override the clinical judgment of doctors. The court granted plaintiffs' motion to compel discovery across six of seven document categories. The lawsuit, originally filed November 2023 and surviving multiple motions to dismiss, alleges that nH Predict produced a 90% error rate measured against appeal reversals, that UnitedHealth pressured employees to keep patient rehabilitation stays within 1% of the algorithm's predicted length of stay, and that the AI was deployed in lieu of clinical professionals making coverage determinations. Plaintiffs include the estates of beneficiaries who died after coverage was terminated.

The case is not isolated. Cigna faces a parallel suit alleging its PXDX algorithm allowed doctors to deny claims in batches of hundreds or thousands without individual review. Humana faces similar accusations. Together, these cases signal a regulatory and litigation environment where the mere fact that an AI system operates inside a claims process invites discovery, even where the system produces correct decisions. Carriers without audit-grade infrastructure cannot adequately defend the systems they deploy.

For engineering teams building or operating claims AI in 2026, the lesson is concrete: claims AI is not a low-stakes operational application. It is a Tier 1 system under any reasonable reading of the NAIC Model Bulletin, subject to bias testing requirements, audit trail expectations, and adverse-outcome documentation that will be examined under the AI Evaluation Tool. The architectural and evaluation patterns that work are different from those for less-regulated AI applications.

This post covers what claims AI actually does, where the failure modes that produced the nH Predict lawsuit emerged, the architectural patterns that limit similar exposure, and the evaluation discipline that distinguishes defensible systems from systems that show up in a complaint.

What claims AI actually does

The category covers a wider range of workflows than "AI denies your claim." The taxonomy:

Workflow	What the AI does	Regulatory exposure
Claim intake and structuring	Extract structured fields from FNOL submissions, photos, documents	Low: read-only data extraction
Triage and routing	Categorize claims by severity, complexity, fraud risk; route to appropriate adjuster queue	Medium: process-level decisions affecting timing
Coverage determination	Evaluate whether the claim falls within coverage terms	High: directly affects whether claim is paid
Reserves and severity prediction	Estimate likely cost of the claim for accounting	Medium: financial reporting accuracy
Utilization review (health)	Determine medical necessity and appropriate care duration	High: directly affects what care is covered
Subrogation identification	Identify third-party liability for recovery	Low to medium: affects cost recovery
Fraud detection	Flag suspicious patterns for special investigations	Medium-high: affects whether claim is paid and whether customer is reported
Settlement valuation	Estimate appropriate settlement amount	High: directly affects payment
Adjuster copilot	Surface relevant precedent, guidelines, similar claims	Low: aids human decision
Customer service	Handle claim status inquiries, basic FAQ	Low to medium: depending on whether actions are taken

Most of these are reasonable LLM applications. A few are not, and the "not" category is what produces the nH Predict-style cases.

The pattern that goes wrong: an AI tool starts as a decision-support aid and gradually becomes the decision-maker, either because operational pressure pushes adjusters to defer to its recommendations or because business logic explicitly subordinates human judgment to its outputs. UnitedHealth's alleged practice of measuring adjuster compliance against nH Predict's predicted length of stay is the canonical example. The tool was technically advisory; the operational deployment made it determinative.

How nH Predict-style failures emerge

Three patterns repeat across the claims AI cases that have reached litigation.

The model is treated as ground truth. The claims process is structured so that the AI's output is the default decision and human review is the exception. Adjusters who deviate from the AI face management pressure or performance metrics that punish deviation. Over time, deviation rates drop and the AI is functionally the decisioner.

Validation evidence is thin or absent. When the algorithm is finally examined, the carrier cannot produce robust pre-deployment validation, ongoing performance monitoring against ground truth, or bias testing across protected groups. The 90% reversal rate alleged in the nH Predict suit is exactly the kind of finding that validation should have caught and the kind of finding that suggests it was not running.

Human appeals are filtered out. The carrier knows the appeal reversal rate is high but does not let that signal feed back into the model. The system optimizes for initial denial throughput rather than correctness. When discovery later compels the metrics, the gap between initial decisions and appealed-and-corrected decisions becomes the story.

The architectural defense against these patterns is structural, not procedural. Process changes ("we'll review the AI more carefully") do not survive operational pressure. Architecture changes ("the AI cannot be the decisioner without human approval logged in the system") do.

Architectural patterns that work

Three patterns describe defensible claims AI architectures. The right one depends on the workflow and the stakes.

Pattern A: AI as decision support with mandatory human disposition

The AI surfaces information, recommendations, and risk flags. The human adjuster makes the determination and records their reasoning. The AI's output is captured but is not the determinative record.

When this works. Coverage determination, utilization review, fraud flagging, and any other workflow where the carrier has direct adverse exposure if the AI is functionally the decisioner.

Implementation discipline.

The AI's recommendation and the adjuster's decision are separate fields in the claims record. Both are required.
When the adjuster's decision matches the AI's recommendation, that is logged and visible.
Operational metrics do not penalize adjusters for deviating from the AI. Quality metrics measure adjuster decisions against ground truth (appeal outcomes, audit reviews), not against AI agreement.
When the appeal reversal rate against the AI exceeds a threshold, the model is automatically flagged for retraining or revalidation, not allowed to continue as-is.

This pattern is what carriers experienced in the nH Predict litigation should have implemented. The technical architecture supports adjuster judgment as the system of record; the operational practice does not penalize judgment that diverges from the AI.

Pattern B: AI as classifier, human as decisioner

The AI classifies claims into operational categories (route to fast-track, route to investigation, route to senior adjuster) without making coverage determinations directly. Human decisioners handle the determinations within whichever queue the AI routed to.

When this works. Triage, intake, routing. The AI's output is operational, not adjudicative.

Implementation discipline.

The AI's categorization is logged but does not appear in the coverage determination record.
Misrouting (claims that the AI sent to fast-track but should have gone to senior review) is tracked and triggers retraining when rates exceed thresholds.
The classifier is bias-tested across demographic groups; routing rate disparities surface as findings.

This pattern provides operational efficiency without creating the determinative-AI exposure of Pattern A applied incorrectly.

Pattern C: AI as deterministic-rule augmentation

The AI's role is to extract structured information that feeds into deterministic rule-based decisioning. The actual coverage determination is made by code that applies policy terms; the AI's contribution is converting unstructured inputs (photos, documents, narrative descriptions) into structured fields.

When this works. Workflows where coverage determination is rule-based and the AI's role is preprocessing, not adjudication.

Implementation discipline.

The deterministic rules are documented, versioned, and inspectable.
The AI's structured outputs are validated against source documents (each extracted field is grounded in specific source text).
Errors in AI extraction are caught downstream by rule consistency checks.
The AI does not have the authority to bypass or override rule outputs.

This is the most defensible pattern when the underlying coverage decision is genuinely rule-based. It does not work when the coverage decision requires judgment, since the rule engine cannot exercise judgment and the team will be tempted to use the AI to fill the gap.

The evaluation framework specific to claims AI

Claims AI eval has dimensions that less-regulated LLM applications do not.

1. Coverage decision accuracy

The straightforward dimension: does the AI's coverage recommendation align with what a properly trained adjuster would conclude?

Construct the eval set from historical claims with known dispositions, including:

Claims initially denied and not appealed (presumably correctly denied or at least uncontested)
Claims initially denied and reversed on appeal (correctly identified as wrongly denied)
Claims paid in full at the original decision (presumably correctly paid)
Claims paid after litigation or external review (correctly paid; the original was wrong)

Stratify by claim type, severity, demographic of claimant, and channel. Aggregate metrics hide the failure modes. A model that performs well overall but systematically denies a specific demographic is the model that produces a class action.

2. Reversal rate against final disposition

This is the metric that haunted nH Predict. For every claim the AI was involved in, track:

The AI's recommendation
The adjuster's initial decision
The final disposition after any appeal, external review, or litigation
The gap between initial and final

A 90% reversal rate means the AI is wrong 9 times out of 10 about claims that get appealed. Carriers with this metric and no remediation plan are buying litigation. The eval framework should compute this continuously, alert when it exceeds thresholds, and feed the disagreement cases back into model retraining.

3. Bias testing across protected characteristics

Required by NAIC Model Bulletin Section 4 and operationalized by the AI Evaluation Tool. The minimum:

Denial rates per demographic group, with impact ratios
Appeal-reversal rates per demographic group
Settlement amounts per demographic group, controlled for claim characteristics
Claim duration and process metrics per demographic group

Bias can show up at any stage. A claims process where the AI is unbiased but the appeal review is biased still produces disparate outcomes; the audit framework should examine the full pipeline, not just the AI's contribution.

4. Hallucination rate on factual claims

LLM-based claims AI specifically produces text that summarizes claim facts, cites policy provisions, or references case history. Each of these is a factual claim that can be wrong. The eval framework should:

Validate cited policy provisions against the actual policy
Validate referenced case history against the actual case file
Validate factual summaries against source documents

For health claims AI specifically, the litigation environment makes hallucination risk acute. A denial letter that cites a clinical guideline incorrectly is evidence of inadequate AI oversight. Test for this and remediate.

5. Adversarial robustness against gaming

Customers and their representatives increasingly use AI to construct claim narratives. Test the system against:

Prompt injection in claim narrative fields
Synthetic but plausible claim photos
Inconsistent narrative across submitted documents (which a careful human notices but a fast LLM may miss)
Claim patterns designed to fit known fraud detection blind spots

The defense is not paranoia; it is testing that legitimate claims still process correctly while obvious gaming gets caught.

Audit trail requirements

Every claims AI decision produces a record that supports later examination, internal audit, and litigation defense. The minimum:

claims_ai_decision_record:
  decision_id: <uuid>
  claim_id: <reference>
  policyholder_id: <reference>
  timestamp: <ISO>
  
  ai_inputs:
    structured_data: <map of claim fields>
    documents_referenced: [<list>]
    policy_terms_consulted: [<list>]
  
  ai_processing:
    model_version: <id>
    prompt_version: <id if applicable>
    retrieval_context: [<list of sources>]
    raw_model_output: <text>
  
  ai_recommendation:
    recommended_action: pay | deny | partial | escalate
    confidence: <float>
    rationale: <structured>
    cited_authorities: [<list with source references>]
  
  human_decision:
    adjuster_id: <reference>
    decision: pay | deny | partial | escalate
    rationale: <text>
    deviation_from_ai: <boolean>
    deviation_reason: <text if applicable>
  
  outcome:
    initial_disposition: <text>
    appeal_status: <enum>
    final_disposition: <text>
    final_amount: <decimal>
    days_to_resolution: <int>
  
  audit_metadata:
    legal_hold: <boolean>
    retention_expires_at: <ISO>
    cryptographic_hash: <bytes>

Records retained for the regulatory period (typically 7-10 years for insurance claims), indexed for query, and exportable on demand. The cryptographic hash supports tamper-evidence; alterations to historical records get caught.

The Estate of Lokken v. UnitedHealth discovery order specifically targeted documents about model design, training, and operational practice. Carriers whose audit trail includes these elements can produce them efficiently. Carriers whose audit trail is partial face discovery costs in the millions and findings that drive settlement amounts higher.

What separates defensible claims AI from the next class action

After watching the litigation environment through 2024-2026:

The AI is decision support, not decisioner. Architecture enforces the distinction; operational practice does not erode it.

Validation evidence is current and complete. Pre-deployment validation, ongoing monitoring, bias testing, and adverse outcome tracking are all in place from day one.

Reversal rates are tracked and acted on. When initial decisions get reversed on appeal, the disagreement informs model retraining. Persistent high reversal rates trigger model retirement, not continued operation.

Bias testing is continuous. Annual is too slow; continuous monitoring with alerts catches issues before they become litigation.

Audit trail supports discovery defense. Decision records are queryable, indexed, and tamper-evident. Discovery requests can be answered in days, not months.

Adjuster judgment is protected operationally. Quality metrics measure decisions against ground truth, not against AI agreement. Adjusters who deviate from the AI are not punished; they are studied for what they saw that the AI missed.

These are the practices that produce claims AI systems that pass examination and survive litigation. Without them, the system is not less effective; it is more exposed.

Build order

Claims AI hardening is sequential, not parallel. The nH Predict pattern emerges when teams ship coverage recommendations before the audit trail and reversal monitoring exist to catch them. Each step below assumes the prior one is in place; skipping forward is what produces the discovery exposure UnitedHealth is now defending.

Order	What you build	Eval gate before moving on
1	Decision record schema and immutable audit log capturing AI inputs, model version, retrieval context, AI recommendation, adjuster disposition, and final outcome with cryptographic hash	100% of production claims AI decisions write a complete record; sample of 100 records reproduces the trace end to end
2	Clinical guideline and policy citation grounding, with every AI-cited authority linked to a specific source passage	Hallucinated citation rate under 1% on a labeled set of 200 production decisions
3	Adjuster override transparency in the UI and data model, separating AI recommendation from human decision with required deviation rationale	Override rationale captured on 100% of deviations; quality metrics no longer reward AI agreement
4	Continuous reversal rate monitoring tied to model version, with alerts when initial-vs-final disposition gap exceeds threshold	Reversal rate dashboard live for 30 days; threshold breach routes to retraining queue, not silent continuation
5	Bias testing across protected characteristics on denial rates, appeal-reversal rates, and settlement amounts	Impact ratios computed weekly; any group below 0.8 four-fifths threshold blocks deploy
6	Adversarial and gaming test suite plus litigation hold infrastructure that activates on inquiry	Adversarial pass rate above 95%; legal hold tested end to end on a sample claim within 24 hours

Steps 4 through 6 depend on 1 through 3 being correct; reversal monitoring without a complete decision record cannot attribute disagreement, and bias testing without override transparency cannot separate model bias from operational pressure. Carriers that ship coverage AI before step 1 are the ones whose discovery responses arrive months late and whose models cannot be defended on the record.

How Respan fits

Claims AI lives or dies by what you can produce in discovery, and that means every model recommendation, adjuster decision, and reversal needs to land somewhere queryable. Respan is the substrate underneath the patterns above: trace, eval, gateway, prompt registry, and monitoring wired into a record that survives examination.

Tracing: every claims AI decision captured as one connected trace, from FNOL ingestion through retrieval, model recommendation, adjuster disposition, and final outcome. Auto-instrumented for LangChain, LlamaIndex, Vercel AI SDK, CrewAI, AutoGen, OpenAI Agents SDK. When discovery asks what nH Predict-style system did on a specific claim on a specific day, you produce the trace in minutes instead of reconstructing it from logs over months.
Evals: ten built-in evaluators (faithfulness, citation accuracy, refusal correctness, harmfulness) plus LLM-as-judge and custom Python evaluators. Production traffic flows directly into datasets stratified by claim type, severity, demographic, and channel. CI-aware experiments block regressions on coverage decision accuracy, hallucinated policy citations, biased denial rates, and reversal-rate drift before deploys ship.
Gateway: 500+ models behind an OpenAI-compatible interface, semantic caching, fallback chains, per-customer spending caps. The gateway gives you a single audit-friendly chokepoint where every model call carries the version, prompt, and policyholder reference that the NAIC AI Evaluation Tool will eventually ask for.
Prompt management: versioned registry, dev/staging/prod environments with approval workflows, A/B testing in production with one-click rollback. Coverage determination prompts, adjuster copilot prompts, denial letter generators, and clinical guideline summarizers belong in the registry so every change is versioned, reviewable, and reversible without a deploy.
Monitors and alerts: appeal reversal rate against AI recommendation, demographic denial rate disparities, hallucinated policy citation rate, adjuster deviation rate, and length-of-stay prediction drift. Slack, email, PagerDuty, webhook. The system that haunted UnitedHealth was one where a 90% reversal rate ran for years without an alert; this is the inverse.

A reasonable starter loop for claims AI builders:

Instrument every LLM call with Respan tracing including retrieval spans, policy citation spans, and adjuster disposition spans.
Pull 200 to 500 production claim decisions into a dataset and label them for coverage accuracy, citation grounding, and demographic distribution.
Wire two or three evaluators that catch the failure modes you most fear (the model becoming determinative in practice, hallucinated clinical or policy citations, and disparate denial rates across protected groups).
Put your coverage determination prompts, denial letter templates, and adjuster copilot prompts behind the registry so you can version, A/B, and roll back without a deploy.
Route through the gateway so model versions, fallback behavior, and per-line spending caps are enforced and logged in one place that downstream audit can reach.

The carriers that survive the next nH Predict-style discovery order are the ones whose tracing, evals, and audit trail were running before the complaint was filed.

To wire any of the patterns above on Respan, start tracing for free, read the docs, or talk to us.

The NAIC AI Evaluation Tool: Engineering for the 2026 Pilot: the regulatory framework
Evaluating Underwriting LLMs: adjacent insurance AI
Building an AI Claims Processing Agent: full architecture walkthrough
How Insurance Teams Build LLM Apps in 2026: pillar overview
Building Adverse Action Explainability for LLM-Driven Credit Decisions: adjacent fintech compliance pattern

What claims AI actually does

The category covers a wider range of workflows than "AI denies your claim." The taxonomy:

Workflow	What the AI does	Regulatory exposure
Claim intake and structuring	Extract structured fields from FNOL submissions, photos, documents	Low: read-only data extraction
Triage and routing	Categorize claims by severity, complexity, fraud risk; route to appropriate adjuster queue	Medium: process-level decisions affecting timing
Coverage determination	Evaluate whether the claim falls within coverage terms	High: directly affects whether claim is paid
Reserves and severity prediction	Estimate likely cost of the claim for accounting	Medium: financial reporting accuracy
Utilization review (health)	Determine medical necessity and appropriate care duration	High: directly affects what care is covered
Subrogation identification	Identify third-party liability for recovery	Low to medium: affects cost recovery
Fraud detection	Flag suspicious patterns for special investigations	Medium-high: affects whether claim is paid and whether customer is reported
Settlement valuation	Estimate appropriate settlement amount	High: directly affects payment
Adjuster copilot	Surface relevant precedent, guidelines, similar claims	Low: aids human decision
Customer service	Handle claim status inquiries, basic FAQ	Low to medium: depending on whether actions are taken

Most of these are reasonable LLM applications. A few are not, and the "not" category is what produces the nH Predict-style cases.

How nH Predict-style failures emerge

Three patterns repeat across the claims AI cases that have reached litigation.

Architectural patterns that work

Three patterns describe defensible claims AI architectures. The right one depends on the workflow and the stakes.

Pattern A: AI as decision support with mandatory human disposition

The AI surfaces information, recommendations, and risk flags. The human adjuster makes the determination and records their reasoning. The AI's output is captured but is not the determinative record.

When this works. Coverage determination, utilization review, fraud flagging, and any other workflow where the carrier has direct adverse exposure if the AI is functionally the decisioner.

Implementation discipline.

The AI's recommendation and the adjuster's decision are separate fields in the claims record. Both are required.
When the adjuster's decision matches the AI's recommendation, that is logged and visible.
Operational metrics do not penalize adjusters for deviating from the AI. Quality metrics measure adjuster decisions against ground truth (appeal outcomes, audit reviews), not against AI agreement.
When the appeal reversal rate against the AI exceeds a threshold, the model is automatically flagged for retraining or revalidation, not allowed to continue as-is.

Pattern B: AI as classifier, human as decisioner

When this works. Triage, intake, routing. The AI's output is operational, not adjudicative.

Implementation discipline.

The AI's categorization is logged but does not appear in the coverage determination record.
Misrouting (claims that the AI sent to fast-track but should have gone to senior review) is tracked and triggers retraining when rates exceed thresholds.
The classifier is bias-tested across demographic groups; routing rate disparities surface as findings.

This pattern provides operational efficiency without creating the determinative-AI exposure of Pattern A applied incorrectly.

Pattern C: AI as deterministic-rule augmentation

When this works. Workflows where coverage determination is rule-based and the AI's role is preprocessing, not adjudication.

Implementation discipline.

The deterministic rules are documented, versioned, and inspectable.
The AI's structured outputs are validated against source documents (each extracted field is grounded in specific source text).
Errors in AI extraction are caught downstream by rule consistency checks.
The AI does not have the authority to bypass or override rule outputs.

The evaluation framework specific to claims AI

Claims AI eval has dimensions that less-regulated LLM applications do not.

1. Coverage decision accuracy

The straightforward dimension: does the AI's coverage recommendation align with what a properly trained adjuster would conclude?

Construct the eval set from historical claims with known dispositions, including:

Claims initially denied and not appealed (presumably correctly denied or at least uncontested)
Claims initially denied and reversed on appeal (correctly identified as wrongly denied)
Claims paid in full at the original decision (presumably correctly paid)
Claims paid after litigation or external review (correctly paid; the original was wrong)

2. Reversal rate against final disposition

This is the metric that haunted nH Predict. For every claim the AI was involved in, track:

The AI's recommendation
The adjuster's initial decision
The final disposition after any appeal, external review, or litigation
The gap between initial and final

3. Bias testing across protected characteristics

Required by NAIC Model Bulletin Section 4 and operationalized by the AI Evaluation Tool. The minimum:

Denial rates per demographic group, with impact ratios
Appeal-reversal rates per demographic group
Settlement amounts per demographic group, controlled for claim characteristics
Claim duration and process metrics per demographic group

4. Hallucination rate on factual claims

Validate cited policy provisions against the actual policy
Validate referenced case history against the actual case file
Validate factual summaries against source documents

5. Adversarial robustness against gaming

Customers and their representatives increasingly use AI to construct claim narratives. Test the system against:

Prompt injection in claim narrative fields
Synthetic but plausible claim photos
Inconsistent narrative across submitted documents (which a careful human notices but a fast LLM may miss)
Claim patterns designed to fit known fraud detection blind spots

The defense is not paranoia; it is testing that legitimate claims still process correctly while obvious gaming gets caught.

Audit trail requirements

Every claims AI decision produces a record that supports later examination, internal audit, and litigation defense. The minimum:

claims_ai_decision_record:
  decision_id: <uuid>
  claim_id: <reference>
  policyholder_id: <reference>
  timestamp: <ISO>
  
  ai_inputs:
    structured_data: <map of claim fields>
    documents_referenced: [<list>]
    policy_terms_consulted: [<list>]
  
  ai_processing:
    model_version: <id>
    prompt_version: <id if applicable>
    retrieval_context: [<list of sources>]
    raw_model_output: <text>
  
  ai_recommendation:
    recommended_action: pay | deny | partial | escalate
    confidence: <float>
    rationale: <structured>
    cited_authorities: [<list with source references>]
  
  human_decision:
    adjuster_id: <reference>
    decision: pay | deny | partial | escalate
    rationale: <text>
    deviation_from_ai: <boolean>
    deviation_reason: <text if applicable>
  
  outcome:
    initial_disposition: <text>
    appeal_status: <enum>
    final_disposition: <text>
    final_amount: <decimal>
    days_to_resolution: <int>
  
  audit_metadata:
    legal_hold: <boolean>
    retention_expires_at: <ISO>
    cryptographic_hash: <bytes>

What separates defensible claims AI from the next class action

After watching the litigation environment through 2024-2026:

The AI is decision support, not decisioner. Architecture enforces the distinction; operational practice does not erode it.

Validation evidence is current and complete. Pre-deployment validation, ongoing monitoring, bias testing, and adverse outcome tracking are all in place from day one.

Bias testing is continuous. Annual is too slow; continuous monitoring with alerts catches issues before they become litigation.

Audit trail supports discovery defense. Decision records are queryable, indexed, and tamper-evident. Discovery requests can be answered in days, not months.

These are the practices that produce claims AI systems that pass examination and survive litigation. Without them, the system is not less effective; it is more exposed.

Build order

Order	What you build	Eval gate before moving on
1	Decision record schema and immutable audit log capturing AI inputs, model version, retrieval context, AI recommendation, adjuster disposition, and final outcome with cryptographic hash	100% of production claims AI decisions write a complete record; sample of 100 records reproduces the trace end to end
2	Clinical guideline and policy citation grounding, with every AI-cited authority linked to a specific source passage	Hallucinated citation rate under 1% on a labeled set of 200 production decisions
3	Adjuster override transparency in the UI and data model, separating AI recommendation from human decision with required deviation rationale	Override rationale captured on 100% of deviations; quality metrics no longer reward AI agreement
4	Continuous reversal rate monitoring tied to model version, with alerts when initial-vs-final disposition gap exceeds threshold	Reversal rate dashboard live for 30 days; threshold breach routes to retraining queue, not silent continuation
5	Bias testing across protected characteristics on denial rates, appeal-reversal rates, and settlement amounts	Impact ratios computed weekly; any group below 0.8 four-fifths threshold blocks deploy
6	Adversarial and gaming test suite plus litigation hold infrastructure that activates on inquiry	Adversarial pass rate above 95%; legal hold tested end to end on a sample claim within 24 hours

How Respan fits

Tracing: every claims AI decision captured as one connected trace, from FNOL ingestion through retrieval, model recommendation, adjuster disposition, and final outcome. Auto-instrumented for LangChain, LlamaIndex, Vercel AI SDK, CrewAI, AutoGen, OpenAI Agents SDK. When discovery asks what nH Predict-style system did on a specific claim on a specific day, you produce the trace in minutes instead of reconstructing it from logs over months.
Evals: ten built-in evaluators (faithfulness, citation accuracy, refusal correctness, harmfulness) plus LLM-as-judge and custom Python evaluators. Production traffic flows directly into datasets stratified by claim type, severity, demographic, and channel. CI-aware experiments block regressions on coverage decision accuracy, hallucinated policy citations, biased denial rates, and reversal-rate drift before deploys ship.
Gateway: 500+ models behind an OpenAI-compatible interface, semantic caching, fallback chains, per-customer spending caps. The gateway gives you a single audit-friendly chokepoint where every model call carries the version, prompt, and policyholder reference that the NAIC AI Evaluation Tool will eventually ask for.
Prompt management: versioned registry, dev/staging/prod environments with approval workflows, A/B testing in production with one-click rollback. Coverage determination prompts, adjuster copilot prompts, denial letter generators, and clinical guideline summarizers belong in the registry so every change is versioned, reviewable, and reversible without a deploy.
Monitors and alerts: appeal reversal rate against AI recommendation, demographic denial rate disparities, hallucinated policy citation rate, adjuster deviation rate, and length-of-stay prediction drift. Slack, email, PagerDuty, webhook. The system that haunted UnitedHealth was one where a 90% reversal rate ran for years without an alert; this is the inverse.

A reasonable starter loop for claims AI builders:

Instrument every LLM call with Respan tracing including retrieval spans, policy citation spans, and adjuster disposition spans.
Pull 200 to 500 production claim decisions into a dataset and label them for coverage accuracy, citation grounding, and demographic distribution.
Wire two or three evaluators that catch the failure modes you most fear (the model becoming determinative in practice, hallucinated clinical or policy citations, and disparate denial rates across protected groups).
Put your coverage determination prompts, denial letter templates, and adjuster copilot prompts behind the registry so you can version, A/B, and roll back without a deploy.
Route through the gateway so model versions, fallback behavior, and per-line spending caps are enforced and logged in one place that downstream audit can reach.

The carriers that survive the next nH Predict-style discovery order are the ones whose tracing, evals, and audit trail were running before the complaint was filed.

To wire any of the patterns above on Respan, start tracing for free, read the docs, or talk to us.

The NAIC AI Evaluation Tool: Engineering for the 2026 Pilot: the regulatory framework
Evaluating Underwriting LLMs: adjacent insurance AI
Building an AI Claims Processing Agent: full architecture walkthrough
How Insurance Teams Build LLM Apps in 2026: pillar overview
Building Adverse Action Explainability for LLM-Driven Credit Decisions: adjacent fintech compliance pattern

Building Claims AI Without Becoming the Next nH Predict

What claims AI actually does

How nH Predict-style failures emerge

Architectural patterns that work

Pattern A: AI as decision support with mandatory human disposition

Pattern B: AI as classifier, human as decisioner

Pattern C: AI as deterministic-rule augmentation

The evaluation framework specific to claims AI

1. Coverage decision accuracy

2. Reversal rate against final disposition

3. Bias testing across protected characteristics

4. Hallucination rate on factual claims

5. Adversarial robustness against gaming

Audit trail requirements

What separates defensible claims AI from the next class action

Build order

How Respan fits

Built for AI agents.
Break less.
Ship more.

Building Claims AI Without Becoming the Next nH Predict

What claims AI actually does

How nH Predict-style failures emerge

Architectural patterns that work

Pattern A: AI as decision support with mandatory human disposition

Pattern B: AI as classifier, human as decisioner

Pattern C: AI as deterministic-rule augmentation

The evaluation framework specific to claims AI

1. Coverage decision accuracy

2. Reversal rate against final disposition

3. Bias testing across protected characteristics

4. Hallucination rate on factual claims

5. Adversarial robustness against gaming

Audit trail requirements

What separates defensible claims AI from the next class action

Build order

How Respan fits

Built for AI agents.
Break less.
Ship more.

Building Claims AI Without Becoming the Next nH Predict

What claims AI actually does

How nH Predict-style failures emerge

Architectural patterns that work

Pattern A: AI as decision support with mandatory human disposition

Pattern B: AI as classifier, human as decisioner

Pattern C: AI as deterministic-rule augmentation

The evaluation framework specific to claims AI

1. Coverage decision accuracy

2. Reversal rate against final disposition

3. Bias testing across protected characteristics

4. Hallucination rate on factual claims

5. Adversarial robustness against gaming

Audit trail requirements

What separates defensible claims AI from the next class action

Build order

How Respan fits

Related reading

Built for AI agents. Break less. Ship more.

Building Claims AI Without Becoming the Next nH Predict

What claims AI actually does

How nH Predict-style failures emerge

Architectural patterns that work

Pattern A: AI as decision support with mandatory human disposition

Pattern B: AI as classifier, human as decisioner

Pattern C: AI as deterministic-rule augmentation

The evaluation framework specific to claims AI

1. Coverage decision accuracy

2. Reversal rate against final disposition

3. Bias testing across protected characteristics

4. Hallucination rate on factual claims

5. Adversarial robustness against gaming

Audit trail requirements

What separates defensible claims AI from the next class action

Build order

How Respan fits

Related reading

Built for AI agents. Break less. Ship more.

Built for AI agents.
Break less.
Ship more.

Built for AI agents.
Break less.
Ship more.