If you are building an AI claims processing agent in 2026, the architecture is not a research question. EvolutionIQ runs disability claims at scale; Five Sigma processes claims for major P&C carriers; Cytora Autopilot connects underwriting through to claims with end-to-end agentic workflows; Sixfold integrates into existing carrier workbenches. The patterns have stabilized.
The hard parts are not the LLM call or the document parser. They are the things that determine whether the agent survives the operating environment that produced the UnitedHealth nH Predict litigation: defensible coverage logic, audit-grade decision lineage, bias monitoring across demographic groups, and the human-in-the-loop architecture that protects adjuster judgment as the system of record. A platform built without these properties looks fast in a demo but exposes the carrier the moment a complaint or a state examination arrives.
This post walks through the architecture, identifies where teams typically go wrong, and lays out a 90-day build plan. It assumes you have read the related posts on the NAIC AI Evaluation Tool, claims AI defensibility, and underwriting LLM eval. Those define the context; this is the build.
Architecture overview
The simplified production architecture:
```text
[FNOL or claim submission]
            |
            v
[Document and data ingestion]
  (PDFs, photos, audio, structured forms)
            |
            v
[Structured extraction with provenance]
            |
            v
[Coverage analysis]
  - Policy retrieval
  - Coverage term matching
  - Exclusions and conditions check
            |
            v
[Recommendation generation]
  - Pay / partial / deny / refer / investigate
  - Cited policy provisions
  - Confidence and rationale
            |
            v
[Verification pass]
  - Citations match policy
  - Extracted data matches source documents
  - No fabricated facts
            |
            v
[Adjuster review]
  - AI recommendation visible
  - Adjuster decision recorded separately
  - Deviation captured
            |
            v
[Decision and outcome capture]
            |
            v
[Audit trail + continuous evaluation + bias monitoring]
```
Each block is its own subsystem. The hard parts cluster in three places: document extraction (where errors propagate), coverage analysis (where the regulatory exposure concentrates), and the audit trail (the difference between defensible and indefensible).
Document and data ingestion
Claims come in many shapes. A health claim arrives as structured EDI plus supporting clinical documentation. An auto claim has photos, repair estimates, police reports, and witness statements. A property claim has photos, contractor estimates, weather data, and policy documents. A disability claim has medical records, vocational assessments, and employer statements.
The ingestion layer needs format-aware processing. Three patterns:
PDF and document parsing. Layout-aware parsing that preserves document structure (Unstructured, Reducto, LlamaParse). Tables, headers, footers, signatures, attachments. The parser produces structured representations of the document content; downstream extraction works on the structure rather than flattened text.
Image and photo analysis. Damage photos, ID documents, scene photos. Computer vision models extract quantitative information (damage area, vehicle identification, building type) and route to LLMs for qualitative reasoning. V7 Go and similar tools provide visual grounding so extracted facts link back to specific image regions.
Audio and call processing. Recorded statements, voicemails, calls with broker or insured. Speech-to-text plus structured extraction. Transcripts retained as evidence; extracted facts link to specific timestamps in the audio.
The ingestion layer needs to capture and preserve every input received. The pattern that fails in litigation: the carrier received a photo or document that was relevant but the system did not process it, and discovery later reveals the gap. Build the ingestion to capture everything; let downstream filtering decide what is material.
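A minimal sketch of that capture-first discipline in Python; the names (`EvidenceRecord`, `ingest_artifact`) are illustrative assumptions, not a real API:

```python
import hashlib
import uuid
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class EvidenceRecord:
    evidence_id: str
    claim_id: str
    media_type: str       # "pdf" | "photo" | "audio" | "form"
    sha256: str           # content hash: proves later exactly what was received
    received_at: str

def ingest_artifact(claim_id: str, media_type: str, payload: bytes,
                    inventory: list) -> EvidenceRecord:
    """Record every artifact unconditionally; materiality is decided downstream."""
    record = EvidenceRecord(
        evidence_id=str(uuid.uuid4()),
        claim_id=claim_id,
        media_type=media_type,
        sha256=hashlib.sha256(payload).hexdigest(),
        received_at=datetime.now(timezone.utc).isoformat(),
    )
    inventory.append(record)  # capture happens before any relevance filtering
    return record
```

The hash and timestamp are what answer the discovery question: even if downstream filtering decides an artifact is immaterial, the inventory proves it was received and considered.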
Structured extraction with provenance
LLM-based extraction takes the ingested documents and produces structured fields. The structured representation is what the rest of the agent operates on.
The schema for a claim's structured extraction:
```yaml
claim_extraction:
  claim_id: <uuid>
  extraction_timestamp: <ISO>
  extraction_model_version: <id>
  parties:
    - role: claimant | insured | beneficiary | third_party
      name: <text>
      identifier: <if applicable>
      contact: <text>
      source_documents: [<list>]
      provenance: <field-level>
  loss_details:
    date_of_loss: <ISO>
    location: <text>
    reported_cause: <text>
    initial_severity: low | medium | high | catastrophic
    source_documents: [<list>]
    provenance: <field-level>
  coverage_relevance:
    line_of_business: <enum>
    policy_id: <reference>
    coverage_periods: [<list>]
  evidence_inventory:
    - evidence_type: photo | document | statement | report | other
      evidence_id: <ref>
      summary: <text>
      relevance_score: <float>
      provenance: <full source reference>
  extracted_facts:
    - claim: <text>
      source_evidence: [<list of references>]
      confidence: <float>
      verified: <boolean>
  flags:
    - flag_type: missing_information | inconsistency | fraud_indicator | coverage_concern
      description: <text>
      requires_action: <boolean>
```

Two properties of this schema matter:
Field-level provenance. Every extracted field links to specific source evidence and a specific location within that evidence. A claim that "damage occurred on January 15" cites the police report and the line where that date appears. Hallucinated dates fail provenance validation and surface as flags.
Evidence inventory. All received evidence is logged with relevance assessment, even evidence that does not surface in the recommendation. Discovery later asks "what did you receive and what did you do with it"; the answer is in the inventory.
The verification pass on extraction:
- Every cited evidence reference resolves to a real document or location
- Every claimed fact is supported by the cited evidence
- Confidence scores correlate with verification outcomes (low-confidence extractions get flagged for human review)
- Inconsistencies across documents (date mismatches, factual conflicts) surface as flags rather than getting silently resolved
Failed verification on extraction stops the pipeline rather than passing forward. The downstream coverage analysis cannot trust extractions that did not verify.
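A sketch of the first two checks against the schema above; `verify_extraction` and the confidence floor are assumptions, and the semantic "fact is supported by the cited evidence" check would typically be a separate LLM-as-judge or entailment step, stubbed out here:

```python
CONFIDENCE_FLOOR = 0.70  # assumed threshold below which humans review

def verify_extraction(extraction: dict, evidence_store: dict) -> list:
    """Return failures; any failure stops the pipeline instead of passing forward."""
    failures = []
    for fact in extraction["extracted_facts"]:
        # Check 1: every cited evidence reference resolves to a real artifact.
        for ref in fact["source_evidence"]:
            if ref not in evidence_store:
                failures.append(f"unresolved citation {ref!r}: {fact['claim']!r}")
        # Check 2: low-confidence extractions are flagged for human review.
        if fact["confidence"] < CONFIDENCE_FLOOR:
            failures.append(f"low confidence {fact['confidence']:.2f}: {fact['claim']!r}")
        # Check 3 (elided): semantic support of the fact by the cited evidence.
    return failures

evidence_store = {"ev-1": "police_report.pdf"}
extraction = {"extracted_facts": [
    {"claim": "damage occurred on January 15",
     "source_evidence": ["ev-1", "ev-9"], "confidence": 0.95},
]}
print(verify_extraction(extraction, evidence_store))
# -> ["unresolved citation 'ev-9': 'damage occurred on January 15'"]
```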
Coverage analysis
The coverage analysis layer takes structured extraction and produces a coverage recommendation. This is where the regulatory exposure concentrates.
Three architectural patterns for this layer:
Pattern A: Rule engine with AI input preparation
The coverage analysis is performed by a deterministic rule engine that applies policy terms. The AI's role is preparing the structured inputs the rule engine consumes. The actual coverage determination logic is inspectable and testable independently of the LLM.
When this works. Lines with formal policy language and well-defined coverage rules. Personal lines auto, homeowners, life insurance.
Architecture. Policy terms encoded as rule logic (with references back to the policy text). The AI structures the claim into the inputs the rule engine needs (occurrence type, covered peril, policy period, coverage limits applied). The rule engine produces the coverage decision; the AI generates the human-readable explanation.
Tradeoff. Cleanest defensibility. Rule engine logic is auditable. AI's role is bounded.
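A sketch of the division of labor, with assumed names and hypothetical policy section references: the AI's only job is filling `RuleEngineInput`; the decision function is deterministic and unit-testable with no model in the loop.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class RuleEngineInput:          # populated by the AI from the claim file
    peril: str
    date_of_loss: date
    policy_start: date
    policy_end: date
    claimed_amount: float
    coverage_limit: float
    excluded_perils: frozenset

def decide(inp: RuleEngineInput) -> tuple:
    """Deterministic coverage decision; citations reference policy text."""
    if not (inp.policy_start <= inp.date_of_loss <= inp.policy_end):
        return "deny", ["Declarations: loss outside policy period"]
    if inp.peril in inp.excluded_perils:
        return "deny", [f"Exclusions: '{inp.peril}' is an excluded peril"]
    if inp.claimed_amount > inp.coverage_limit:
        return "pay_partial", ["Declarations: payment capped at coverage limit"]
    return "pay_in_full", ["Coverage A: covered peril within policy period and limits"]

print(decide(RuleEngineInput("hail", date(2026, 1, 15), date(2025, 7, 1),
                             date(2026, 7, 1), 4500.0, 250000.0,
                             frozenset({"flood"}))))
# -> ('pay_in_full', ['Coverage A: covered peril within policy period and limits'])
```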
Pattern B: AI as coverage recommender with deterministic verification
The AI evaluates the claim against policy terms and produces a recommendation. A separate deterministic verification pass checks the recommendation against critical policy provisions (in-force status, coverage limits, exclusions explicitly listed). The verification can override the recommendation if it conflicts with hard policy requirements.
When this works. Lines where coverage analysis requires judgment that does not reduce cleanly to rules. Commercial lines, complex coverage, claims involving multiple policies.
Architecture. The AI produces a structured recommendation (decision, rationale, cited policy provisions). The verification layer checks each cited provision against the actual policy and confirms the cited language supports the asserted conclusion.
Tradeoff. More AI involvement, more verification work. Strong if the verification is comprehensive; weak if verification gaps allow AI errors to flow through.
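A sketch of what "deterministic verification" means concretely; the field names and the override-to-refer behavior are assumptions:

```python
def verify_recommendation(rec: dict, policy: dict) -> dict:
    """Check an AI recommendation against hard policy facts; override on conflict."""
    overrides = []
    # Hard check: the policy was in force on the date of loss.
    if not policy["in_force_on_loss_date"]:
        overrides.append("policy not in force on date of loss")
    # Hard check: every cited provision exists in the actual policy text.
    for cited in rec["cited_policy_provisions"]:
        if cited["provision_id"] not in policy["provisions"]:
            overrides.append(f"cited provision {cited['provision_id']!r} does not exist")
    # Hard check: the recommended amount cannot exceed the coverage limit.
    if (rec.get("recommended_amount") or 0.0) > policy["coverage_limit"]:
        overrides.append("recommended amount exceeds coverage limit")
    if overrides:
        # Conflicts with hard policy requirements route to a human, never forward.
        return {"recommendation": "refer", "override_reasons": overrides}
    return rec
```

The strength of Pattern B lives entirely in how many of these hard checks exist; each gap is a path for an AI error to flow through.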
Pattern C: AI as adjuster decision support
The AI surfaces relevant policy provisions, similar prior claims, and potential considerations without producing a coverage recommendation. The adjuster reviews the surfaced context and makes the determination.
When this works. High-stakes coverage with significant judgment requirements. Litigation-likely claims, complex commercial, large losses.
Architecture. Retrieval-heavy. The AI's role is finding what the adjuster needs to consider; the adjuster makes the call. The system records what was surfaced and what the adjuster did with it.
Tradeoff. Slowest but most defensible. The adjuster is unambiguously the decision-maker; the AI is unambiguously support.
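Pattern C is mostly a recording problem. A sketch of the record, with illustrative names; note the adjuster's decision is its own field, never derived from anything the system surfaced:

```python
from dataclasses import dataclass, field

@dataclass
class SurfacedItem:
    item_type: str            # "policy_provision" | "prior_claim" | "consideration"
    reference: str            # link back to the source material
    summary: str
    adjuster_opened: bool = False   # did the adjuster actually look at it

@dataclass
class DecisionSupportRecord:
    claim_id: str
    surfaced: list = field(default_factory=list)  # what the AI put in front of the adjuster
    adjuster_decision: str = ""                   # the determination, made by the adjuster
    adjuster_rationale: str = ""                  # in the adjuster's own words
```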
Choosing among patterns
| If your claims workflow is | Use pattern |
|---|---|
| Personal lines auto/home/life with clear policy language | A (Rule engine with AI input prep) |
| Commercial lines or claims with judgment requirements | B (AI recommends, deterministic verification) |
| Complex commercial, large losses, litigation-likely | C (AI as decision support) |
| Health claims (utilization review especially) | A or C; avoid B given the nH Predict litigation environment |
| Mixed portfolio | A for clear-cut cases, B for moderate, C for high-stakes |
The pattern that produced nH Predict appears to have been Pattern B without adequate verification, deployed operationally in a way that made the AI's recommendation determinative. The architectural pattern is not inherently flawed; the operational deployment is what created the legal exposure.
Recommendation generation with grounding
Whatever pattern is used, the AI's output is structured:
```yaml
coverage_recommendation:
  recommendation: pay_in_full | pay_partial | deny | refer | investigate
  rationale:
    primary_basis: <text>
    cited_policy_provisions:
      - provision_id: <ref to policy>
        provision_text: <quoted text>
        relevance: <text explaining how it applies>
    cited_evidence:
      - evidence_id: <ref to claim extraction>
        relevance: <text>
  confidence: <float>
  considerations_flagged:
    - flag_type: <enum>
      description: <text>
      recommended_action: <text>
  recommended_amount: <decimal or null>
  recommended_amount_basis: <text if applicable>
  alternative_dispositions:
    - disposition: <enum>
      conditions_under_which_appropriate: <text>
```

Every cited policy provision and every cited evidence point links back to specific source material. The verification pass confirms each citation resolves and supports the recommendation. Recommendations that fail verification do not reach the adjuster.
The recommended amount, when applicable, includes its basis. "Pay $4,250 based on contractor estimate of $4,500 less $250 deductible" is grounded; "pay $4,250" is not.
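Groundedness of an amount is checkable arithmetic. A trivial sketch (function name and tolerance are assumptions):

```python
def amount_is_grounded(recommended: float, cited_estimate: float,
                       deductible: float, tolerance: float = 0.01) -> bool:
    """True if the recommended amount equals the cited estimate less the deductible."""
    return abs(recommended - (cited_estimate - deductible)) <= tolerance

print(amount_is_grounded(4250.00, 4500.00, 250.00))  # True: the worked example above
```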
Adjuster review architecture
The adjuster review is where the system does or does not protect adjuster judgment. The architectural choices that matter:
Recommendation visible but not pre-filled. The AI recommendation is shown to the adjuster but the decision field is empty. The adjuster has to actively choose, not just confirm a default.
Decision and rationale captured separately. The adjuster's decision and reasoning are recorded as their own fields. Whether the decision matches the AI is a derived fact, not the primary record.
Deviation tracking without penalty. When the adjuster disagrees with the AI, that is logged. Operational quality metrics measure decisions against ground truth (appeal outcomes, audit reviews), not against AI agreement. Adjusters who deviate from the AI in cases where the deviation was correct should be celebrated, not penalized.
Time spent on review tracked. Excessive auto-confirm patterns (adjuster confirming AI recommendations in seconds without review) get flagged. The defense against nH Predict-style operational pressure is detecting when review has degraded into rubber-stamping.
Reversal feedback loop. When a claim is appealed and reversed, the original AI recommendation is reviewed against the corrected disposition. Persistent disagreement between AI and corrected outcomes triggers retraining or model retirement, not continued operation.
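A sketch of the auto-confirm detector from the "time spent on review" point above; the five-second floor and ten-percent flag rate are assumptions, not established thresholds:

```python
MIN_REVIEW_SECONDS = 5.0   # assumed floor for a review that plausibly happened
FLAG_RATE = 0.10           # assumed share of fast confirms that triggers a flag

def rubber_stamp_rate(reviews: list) -> float:
    """Share of AI-agreeing reviews completed faster than the review floor."""
    confirms = [r for r in reviews if r["agrees_with_ai"]]
    if not confirms:
        return 0.0
    fast = sum(1 for r in confirms if r["review_seconds"] < MIN_REVIEW_SECONDS)
    return fast / len(confirms)

reviews = [
    {"agrees_with_ai": True, "review_seconds": 3.1},
    {"agrees_with_ai": True, "review_seconds": 48.0},
    {"agrees_with_ai": False, "review_seconds": 120.0},
]
if rubber_stamp_rate(reviews) > FLAG_RATE:   # 0.5 here: one of two confirms was fast
    print("flag: review may be degrading into rubber-stamping")
```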
Audit trail completeness
Every claim decision produces a record that supports examination, internal audit, and litigation defense. The structure was outlined in Building Claims AI Without Becoming the Next nH Predict; the implementation here is what produces records of that quality.
Key engineering points:
Records are immutable. Once written, a decision record cannot be edited. Corrections are new records that reference the prior. Tamper-evidence (cryptographic hashing) makes alterations detectable.
Cross-referenced. A claim record references its underlying extractions, the policy versions consulted, the model versions used, the adjuster who reviewed, the appeal records if any. Discovery requests for "everything related to claim X" produce a complete graph.
Indexed for query. Common queries ("all claims processed by model version Y in date range Z that resulted in denial") return results in seconds. Without this, examination response is a multi-week project.
Retained for the regulatory period. Insurance retention periods are long (7-10 years is typical, some states require longer). Storage tiering (hot for recent, cold for older) keeps cost reasonable.
Exportable on demand. State examinations and discovery requests come with production deadlines. The audit infrastructure produces compliant exports without disrupting operations.
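The immutability and tamper-evidence points above reduce to a hash chain. A minimal sketch (the record shape is illustrative; a production system would use an append-only store rather than an in-memory list):

```python
import hashlib
import json

def append_record(chain: list, payload: dict) -> dict:
    """Append an immutable record whose hash covers the previous record's hash."""
    prev_hash = chain[-1]["record_hash"] if chain else "genesis"
    body = {"payload": payload, "prev_hash": prev_hash}
    body["record_hash"] = hashlib.sha256(
        json.dumps(body, sort_keys=True).encode()
    ).hexdigest()
    chain.append(body)
    return body

def chain_is_intact(chain: list) -> bool:
    """Recompute every hash; editing any record breaks the chain from that point."""
    prev = "genesis"
    for rec in chain:
        expected = hashlib.sha256(json.dumps(
            {"payload": rec["payload"], "prev_hash": prev}, sort_keys=True
        ).encode()).hexdigest()
        if rec["record_hash"] != expected:
            return False
        prev = rec["record_hash"]
    return True

chain = []
append_record(chain, {"claim_id": "c-1", "decision": "deny"})
append_record(chain, {"claim_id": "c-1", "decision": "pay_in_full",
                      "corrects": chain[0]["record_hash"]})  # correction = new record
print(chain_is_intact(chain))                     # True
chain[0]["payload"]["decision"] = "pay_in_full"   # tamper with history
print(chain_is_intact(chain))                     # False: alteration is detectable
```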
The Estate of Lokken v. UnitedHealth discovery order specifically targeted documents about model design, training, operational practice, and decision history. Carriers whose audit infrastructure produces all of these efficiently can defend; carriers whose infrastructure has gaps face escalating discovery costs.
Bias monitoring infrastructure
Continuous monitoring across protected demographic groups for the metrics that matter:
| Metric | Frequency | Alert threshold |
|---|---|---|
| Claim denial rate per protected group | Weekly | Disparate impact ratio below 0.80 |
| Average settlement amount per group, controlled for severity | Monthly | Significant disparity at p < 0.05 |
| Time-to-resolution per group | Monthly | Significant disparity |
| Appeal rate per group | Monthly | Disparate appeal patterns |
| Appeal-reversal rate per group | Monthly | Disparate reversal rates |
| Special investigations referral rate per group | Monthly | Disparate referral patterns |
Demographic data flows through an isolated path that does not influence the AI's decisions. The bias monitoring layer joins demographic data to claim outcomes for measurement; the claim processing path never sees demographic data as model input.
The architectural separation matters legally. A model that has read access to demographic data, even if that data goes unused, creates direct disparate treatment exposure. A model that physically cannot access demographic data has a stronger defense and produces cleaner monitoring.
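A sketch of the four-fifths computation behind the first alert row in the table above; the group labels and the offline join are illustrative, and the demographic join lives only in this monitoring path, never in inference:

```python
from collections import defaultdict

def disparate_impact_ratios(outcomes: list) -> dict:
    """Approval-rate ratio of each group against the most favorably treated group."""
    approvals, totals = defaultdict(int), defaultdict(int)
    for o in outcomes:                 # joined offline: decision + demographic label
        totals[o["group"]] += 1
        approvals[o["group"]] += 0 if o["denied"] else 1
    rates = {g: approvals[g] / totals[g] for g in totals}
    reference = max(rates.values())    # highest approval rate = reference group
    return {g: (r / reference if reference else 1.0) for g, r in rates.items()}

weekly = [
    {"group": "A", "denied": False}, {"group": "A", "denied": True},
    {"group": "B", "denied": True},  {"group": "B", "denied": True},
]
alerts = {g: r for g, r in disparate_impact_ratios(weekly).items() if r < 0.80}
print(alerts)   # {'B': 0.0} -> group B trips the 0.80 threshold
```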
What separates defensible builds from exposed ones
After watching the claims AI category through 2024-2026, the difference between defensible and exposed builds reduces to six properties:
The AI is decision support; the architecture enforces it. Pattern A or C as the baseline; Pattern B only with rigorous verification.
Provenance everywhere. Every extracted fact, every citation, every recommendation traces to source material. Hallucinated facts fail verification.
Audit trail is queryable in seconds. Discovery and examination response is fast and complete. The infrastructure was built before the inquiry, not after.
Bias monitoring is continuous and segregated. Demographic data isolated from inference path; bias metrics computed weekly with alerts.
Reversal feedback loops. AI recommendations contradicted by appeals or external review feed back into model retraining. Persistent gaps trigger retirement.
Human review is real. Adjuster judgment is the system of record. Quality metrics measure decisions, not AI agreement. Auto-confirm patterns are detected and addressed.
Build order
Claims AI fails when higher layers are evaluated before lower layers are correct. Coverage recommendations cannot be trusted if extractions hallucinate; bias monitoring is meaningless if the audit trail is incomplete. Build in dependency order and gate each stage on a measurable property of the layer beneath it.
| Order | What you build | Eval gate before moving on |
|---|---|---|
| 1 | Architecture decision and policy sources of truth (Pattern A, B, or C per workflow; canonical policy version registry) | Every line of business mapped to a pattern with rationale; 100% of in-force policies addressable by versioned ID |
| 2 | Document and data ingestion with evidence inventory (PDF, photo, audio; every input captured) | Evidence inventory captures 100% of submitted artifacts on a 200-claim replay; zero silent drops |
| 3 | Structured extraction with field-level provenance and verification pass | Citation resolution rate above 99%; fabricated-fact rate below 0.5% on labeled gold set of 300 claims |
| 4 | Coverage analysis per chosen pattern (rule engine, AI plus deterministic verification, or decision support) | Coverage decision agreement with historical adjudicated outcomes above 92%; zero recommendations citing nonexistent policy provisions |
| 5 | Adjuster review UI and immutable audit trail with cross-referenced indexing | Discovery query ("all decisions by model version Y in date range Z") returns complete results in under 10 seconds; auto-confirm-under-5-seconds rate below 10% |
| 6 | Bias monitoring with isolated demographic data path and reversal feedback loop | Disparate impact ratio computed weekly across protected groups with alerts firing below 0.80; appeal-reversal disagreement signal wired to retraining queue |
Continuous evaluation, quarterly portfolio review, and external validation come after stage 6. Skipping this order means a carrier ships the LLM call first and spends the next year rebuilding the foundation while regulatory and litigation risk accumulates.
How Respan fits
Building an AI claims processing agent that survives state examination and nH Predict-style discovery means every extraction, coverage recommendation, adjuster review, and bias metric has to be inspectable years later. Respan is the substrate that makes that traceability, evaluation, and rollback discipline operationally cheap.
- Tracing: every claim journey captured as one connected trace from FNOL through ingestion, structured extraction, coverage analysis, verification, adjuster review, and decision capture. Auto-instrumented for LangChain, LlamaIndex, Vercel AI SDK, CrewAI, AutoGen, OpenAI Agents SDK. When a discovery request asks "what did the system see, what did it conclude, and who decided," the trace is the answer in seconds rather than a multi-week reconstruction.
- Evals: ten built-in evaluators (faithfulness, citation accuracy, refusal correctness, harmfulness) plus LLM-as-judge and custom Python evaluators. Production traffic flows directly into datasets. CI-aware experiments block regressions on hallucinated extracted facts, citations that do not resolve to the cited policy provision, and coverage recommendations that conflict with hard exclusions before deploys ship.
- Gateway: 500+ models behind an OpenAI-compatible interface, semantic caching, fallback chains, per-customer spending caps. Pin the exact extraction and coverage model versions a claim was decided under, route health utilization-review-style workflows away from Pattern B models entirely, and keep model version lineage in the audit trail by default.
- Prompt management: versioned registry, dev/staging/prod environments with approval workflows, A/B testing in production with one-click rollback. Extraction schemas, coverage rationale prompts, verification-pass prompts, and adjuster-facing rationale generators all belong in the registry so a regulator can be told exactly which prompt produced which decision on which date.
- Monitors and alerts: extraction provenance failure rate, citation resolution rate, verification override rate, adjuster auto-confirm time, appeal-reversal disagreement rate, and disparate impact ratio per protected group. Slack, email, PagerDuty, webhook. Bias drift and rubber-stamp patterns surface as alerts the same week they appear, not the same quarter the examination opens.
A reasonable starter loop for claims AI builders:
- Instrument every LLM call with Respan tracing including ingestion, extraction, coverage analysis, verification, and adjuster review spans.
- Pull 200 to 500 production claim decisions into a dataset and label them for citation accuracy, extraction faithfulness, and adjuster-vs-AI agreement against appeal outcomes.
- Wire two or three evaluators that catch the failure modes you most fear (fabricated facts in extraction, citations that do not support the recommendation, and coverage decisions that ignore explicit exclusions).
- Put your extraction prompts, coverage rationale prompts, and verification-pass prompts behind the registry so you can version, A/B, and roll back without a deploy.
- Route through the gateway so model version, fallback behavior, and per-line spending controls are recorded with every claim and pinned for the regulatory retention period.
Skip this loop and the next nH Predict-style complaint lands on a system whose audit trail cannot answer what it saw, what it concluded, or why. That is the point where a carrier stops being able to defend the build.
To wire any of the patterns above on Respan, start tracing for free, read the docs, or talk to us.
Related reading
- The NAIC AI Evaluation Tool: Engineering for the 2026 Pilot: the regulatory framework
- Building Claims AI Without Becoming the Next nH Predict: the cautionary tale
- Evaluating Underwriting LLMs: adjacent insurance AI
- How Insurance Teams Build LLM Apps in 2026: pillar overview
