In the first four months of 2026, three things happened in insurance AI regulation that are concrete enough to require engineering work and ambiguous enough to require judgment.
In January 2026, the NAIC launched a multistate pilot of the AI Systems Evaluation Tool, running through September 2026. Twelve states are participating: California, Colorado, Connecticut, Florida, Iowa, Louisiana, Maryland, Pennsylvania, Rhode Island, Vermont, Virginia, and Wisconsin. The Tool gives examiners a structured framework for reviewing insurer AI governance during market conduct examinations. Industry trade groups have pushed back on specific provisions. Adoption at the NAIC's Fall 2026 National Meeting is anticipated.
On December 11, 2025, the federal government issued Executive Order 14365 asserting federal authority over AI regulation in ways that, on their face, threaten state insurance department oversight. The NAIC publicly opposed the EO within days, asking the administration to reconsider and at minimum affirm state regulation of AI in the business of insurance. Carriers now operate under both regimes simultaneously. State departments continue issuing bulletins, demanding inventories, and running examinations. The federal government asserts authority that may or may not preempt those activities, depending on litigation that has not yet reached the appellate level.
On March 23, 2026, the NAIC's Third-Party Data and Models Working Group sketched the contours of a vendor registry. The framework, if adopted later in 2026, will require AI vendors selling into insurance to register and disclose. Carrier accountability does not transfer to the registered vendor. The registry creates visibility, not safe harbor.
For carriers and the vendors selling into them, the engineering implications are real. The NAIC Model Bulletin, adopted by 24 states plus DC by late 2025, has been principle-based and lightly enforced through 2024 and 2025. The 2026 Evaluation Tool pilot is what operationalizes it. By the end of 2026, examiners in the bulletin states will arrive with a structured workbook of questions about your AI governance program. Carriers that built the documentation as a byproduct of operations get through cleanly. Carriers that scramble to assemble it during the exam find gaps.
This post is the engineering translation. It covers what the AI Evaluation Tool actually inspects, what evidence each section requires, where the dual-regime complexity matters, and what the vendor registry will mean for procurement.
What the Evaluation Tool inspects
The Tool is a structured framework that gives examiners a consistent way to review insurer AI governance during market conduct examinations. Based on the Model Bulletin's AIS Program requirements, it covers governance, risk management, third-party oversight, and outcome monitoring. The pilot states are using it on actual carriers; the feedback informs the version adopted in Fall 2026.
What examiners are documented to ask, based on Model Bulletin Section 4 and the Tool's structure:
| Examination area | Specific questions |
|---|---|
| AI inventory | What models are in production? What business functions do they serve? Who owns each one? |
| Governance program | Who has accountability for the AIS program? What policies govern AI use? How are they reviewed? |
| Pre-deployment validation | How was each model validated before production? What testing was done? Who approved deployment? |
| Ongoing monitoring | How do you detect drift, bias, or performance degradation? What metrics? What thresholds? |
| Third-party models | What due diligence did you perform on vendor models? What contractual rights do you have? |
| Adverse outcomes | How are AI-driven decisions appealed? How do you measure complaint rates? |
| Bias and fair lending | What testing for unfair discrimination? What categories? What corrective actions? |
| Documentation | Can you produce model cards, validation evidence, and audit trails on demand? |
The questions are deceptively simple. Producing acceptable answers requires infrastructure. An examiner who asks "show me everything that touches underwriting decisions" needs an answer in hours, not weeks. An examiner who asks "demonstrate that this model has been validated against bias for the past two years" needs evidence, not assertions.
What "AIS Program" actually requires
The Model Bulletin's AIS Program (AI Systems Program) is the documented program that makes carriers' AI use defensible under examination. The Bulletin describes it as principle-based; the Evaluation Tool is what makes it concrete.
The minimum components of an AIS Program that survives the 2026 examination cycle:
1. Model inventory
A complete record of every AI model in production, with sufficient metadata for an examiner to understand what each does and how it fits into the business. The minimum fields:
```yaml
ai_model_inventory_entry:
  model_id: <uuid>
  model_name: <text>
  business_purpose: <text>   # underwriting, claims triage, fraud detection, etc.
  business_owner: <person or role>
  technical_owner: <person or role>
  classification:
    line_of_business: [<list>]   # auto, homeowners, life, health, etc.
    decision_type: pricing | underwriting | claims | utilization | fraud | marketing | servicing
    impact_level: high | medium | low   # bulletin's tiering
    consumer_facing: <boolean>
  source:
    development_type: in_house | third_party
    vendor_name: <if third_party>
    vendor_model_version: <if third_party>
  deployment:
    initial_deployment_date: <ISO>
    last_updated: <ISO>
    pilot_status: pilot | production | retired
  validation:
    pre_deployment_evidence: <reference to validation file>
    last_revalidation_date: <ISO>
    next_scheduled_revalidation: <ISO>
  monitoring:
    monitoring_metrics: [<list>]
    monitoring_frequency: <text>
    threshold_alerts: [<list>]
```
This is the foundation. The Bulletin (and the Evaluation Tool) explicitly require a model inventory that an examiner can query. Carriers whose inventory lives in spreadsheets distributed across business units fail this. Carriers with a single registry pass.
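To make the "queryable in hours, not weeks" bar concrete, here is a minimal sketch of inventory queries over entries shaped like the schema above. All names (`InventoryEntry`, `models_touching`, the sample model IDs) are illustrative, not part of the Bulletin or the Tool; a real registry would sit behind a database, not an in-memory list.

```python
from dataclasses import dataclass

# Illustrative subset of the inventory schema's fields.
@dataclass
class InventoryEntry:
    model_id: str
    model_name: str
    decision_type: str     # pricing | underwriting | claims | fraud | marketing | ...
    impact_level: str      # high | medium | low (Tier 1/2/3)
    business_owner: str
    development_type: str  # in_house | third_party

# Hypothetical sample data.
INVENTORY = [
    InventoryEntry("m-001", "quote_triage_llm", "underwriting", "high", "VP Underwriting", "third_party"),
    InventoryEntry("m-002", "claims_fraud_score", "fraud", "high", "SIU Director", "in_house"),
    InventoryEntry("m-003", "renewal_churn_model", "marketing", "low", "Marketing Analytics", "in_house"),
]

def models_touching(decision_type: str) -> list[InventoryEntry]:
    """Answer the examiner question 'show me everything that touches X' in one query."""
    return [e for e in INVENTORY if e.decision_type == decision_type]

def unowned_or_untiered() -> list[InventoryEntry]:
    """Gate check: no model may have an unknown owner or an unknown tier."""
    return [e for e in INVENTORY
            if not e.business_owner or e.impact_level not in {"high", "medium", "low"}]
```

The second function is the kind of check worth running on every inventory change, since a single "unknown owner" entry is exactly the gap an examiner will find first.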
2. Pre-deployment validation file
Every Tier 1 (high-impact) model needs a documented validation file before it reaches production. The file contains, at minimum:
- Model card describing architecture, training data, intended use, known limitations
- Validation methodology and results
- Bias and fairness testing across protected characteristics
- Performance benchmarks against alternative approaches
- Approval signatures with dates
- Validation reviewer's independence statement
For LLM-based systems, the model card is harder to produce because the underlying model is from a third party and the training data is not yours. The validation file shifts emphasis to:
- How the LLM was prompted, with prompt versioning
- Retrieval-augmented generation source documents and their freshness
- Output schema and constraints
- Hallucination rate from validation testing
- Performance on domain-specific evaluation sets
- Bias testing on outputs (since training data testing is not available)
The Evaluation Tool examiners are aware of LLM-specific limitations and will accept evidence patterns appropriate to the architecture. What they will not accept is the absence of any pre-deployment validation file.
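One of the LLM-specific evidence items above, hallucination rate from validation testing, reduces to simple arithmetic over a labeled evaluation set. The sketch below assumes a hypothetical set of outputs already graded for groundedness (whether each output was supported by its source documents); the 5 percent gate is an illustrative threshold, not a number from the Bulletin or the Tool.

```python
# Hypothetical validation results: each LLM output graded for groundedness
# against its retrieval sources during pre-deployment testing.
results = [
    {"case_id": "c1", "grounded": True},
    {"case_id": "c2", "grounded": True},
    {"case_id": "c3", "grounded": False},  # e.g. a hallucinated coverage term
    {"case_id": "c4", "grounded": True},
]

def hallucination_rate(graded: list[dict]) -> float:
    """Share of outputs not supported by source documents."""
    ungrounded = sum(1 for r in graded if not r["grounded"])
    return ungrounded / len(graded)

rate = hallucination_rate(results)
PASSES_GATE = rate <= 0.05  # illustrative pre-deployment threshold
```

The point of writing the gate down as code is that the threshold, the grading method, and the result all land in the validation file as dated, reproducible evidence.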
3. Ongoing monitoring evidence
Continuous metrics tracked over the model's production life:
- Prediction or output distribution over time (drift detection)
- Performance against ground truth where available
- Bias metrics across demographic groups (where applicable)
- User override rates (cases where humans changed the model's recommendation)
- Adverse outcome rates (claims denials, coverage refusals, etc.)
- System availability and error rates
The frequency requirements are not specified by the Bulletin but are implied: Tier 1 models need at least monthly review; Tier 2 quarterly; Tier 3 annually or as significant changes occur. The Evaluation Tool examines the monitoring evidence; carriers without continuous monitoring infrastructure cannot produce it.
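For the drift-detection item in the list above, one common heuristic (not mandated by the Bulletin) is the Population Stability Index over score-bucket distributions, compared between validation time and the current review period. A minimal sketch, with the conventional PSI bands as assumed thresholds:

```python
import math

def psi(expected: list[float], actual: list[float]) -> float:
    """Population Stability Index between baseline and current bucket shares.
    Conventional bands: <0.1 stable, 0.1-0.25 watch, >0.25 investigate."""
    return sum((a - e) * math.log(a / e)
               for e, a in zip(expected, actual) if e > 0 and a > 0)

baseline = [0.25, 0.25, 0.25, 0.25]  # score-bucket shares at validation time
current  = [0.30, 0.25, 0.25, 0.20]  # this month's production shares

drift = psi(baseline, current)
ALERT = drift > 0.25  # a Tier 1 alert wired to the monthly review cadence
```

A monitoring run that computes this monthly for every Tier 1 model, stores the value, and records the disposition of any alert is exactly the evidence trail the Evaluation Tool examines.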
4. Adverse outcome and consumer complaint records
When an AI-driven decision adversely affects a consumer (claim denial, coverage refusal, premium increase), the record needs to show:
- The decision and the AI model that contributed
- The data inputs the model used
- The model's output and confidence
- Any human review applied
- The notice provided to the consumer (adverse action notice for credit-relevant decisions, claim denial explanation, etc.)
- Any appeal received and its disposition
The UnitedHealth nH Predict litigation (Estate of Lokken v. UnitedHealth) made this requirement vivid. On March 9, 2026, a federal magistrate ordered UnitedHealth to produce internal documents on whether nH Predict was designed to override clinical judgment. The court granted broad discovery across six of seven document categories. The lesson for carriers: assume that a court will order disclosure of how the AI worked, what evidence supported its outputs, and how those outputs were used.
Carriers that built audit-grade decision logs from day one can produce this material. Carriers that built logging as an afterthought face discovery costs in the millions and findings that drive settlement.
5. Third-party model oversight
The Bulletin's Section 4 places the diligence obligation on the carrier, regardless of whether the model is built in-house or licensed from a vendor. The vendor registry, if adopted, adds a regulator-side data layer but does not transfer responsibility.
The minimum vendor evaluation file:
- Model card from the vendor with architecture, training data sources and date ranges, intended use, and limitations
- Vendor's bias testing artifacts and methodology
- Validation evidence specific to your deployment context
- Contractual rights to inspect, audit, and demand updates
- Monitoring SLAs with the vendor
- Remediation pathway if the model produces biased or inaccurate outcomes
- Version history and change notification process
The pattern that has emerged in 2026: carriers are renegotiating vendor contracts to include explicit audit rights, version pinning with change notification, and contractual obligations to participate in the carrier's bias monitoring. Vendors that resist these terms get dropped in favor of those that cooperate.
The dual-regime complexity
Executive Order 14365 created the dual-regime problem. The state departments continue running their AI programs; the federal government asserts authority. Carriers have to comply with both without provoking either side.
The defensible posture for engineering and compliance teams:
Continue meeting state requirements. The 24+ states with bulletins have not withdrawn them. Examination cycles continue. Failing to respond, citing federal preemption, creates an examination posture problem that lands well before any preemption litigation reaches resolution.
Reserve preemption defenses through formal language. Regulatory responses can include reservation-of-rights language preserving the ability to assert preemption later if the EO's authority is upheld. This is legal counsel's job; engineering does not need to think about it directly but should be aware that responses pass through legal review for this reason.
Document dual compliance costs. If the preemption fight eventually reaches the Supreme Court, the cost of dual compliance becomes evidence relevant to the merits. Carriers should be able to produce specific dollar and operational impact figures.
Track litigation. The trade groups will lead amicus filings; carrier counsel monitors and coordinates. Engineering does not lead this work but should be prepared for guidance to shift if a major ruling lands.
Add EO 14365 contingency to vendor contracts. Renewals signed in 2026 should include language addressing the possibility of changed federal requirements during the contract term.
For most engineering teams, the dual-regime complexity translates to: build for the state requirements, document everything, and trust that legal counsel will handle the federal track. The engineering work needed for state compliance is also the engineering work needed for any plausible federal regime that emerges.
State-by-state variation
Twenty-four states plus DC have adopted the Model Bulletin or substantially similar guidance. A few have additional or stricter requirements.
| State | Notable requirement |
|---|---|
| Colorado | SB 21-169 (life insurance) expanded to auto and health Oct 2025; algorithm inventories, bias testing, annual compliance reports with chief risk officer attestation. Colorado AI Act effective February 1, 2026 includes insurance carve-out (insurance regulated under SB 21-169 instead). |
| New York | DFS Circular Letter No. 7 (2024) requires bias testing and explainability for insurance AI; underwriting and pricing focus. |
| California | SB 1120 (effective January 2025) prohibits AI-only health claim denial; requires physician review of medical necessity decisions. CA Department of Insurance issued formal guidance May 2025. |
| Connecticut, Maryland, Massachusetts, etc. | Adopted NAIC bulletin with minor customization. |
The practical engineering implication: build to the strictest interpretation. Colorado SB 21-169's bias testing across consumer demographic groups, NY DFS Circular 7's explainability requirements, and California SB 1120's human-in-the-loop requirements together produce a baseline that satisfies most other regimes. Building separate processes per state is more expensive than a single configurable platform that defaults to strict.
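The "single configurable platform that defaults to strict" pattern can be sketched as a requirements table where every flag defaults on and a state override must be explicit. The flags, the override example, and the state codes below are illustrative assumptions, not a catalog of actual state rules:

```python
# Default to the strictest combined interpretation; a state relaxes a
# requirement only through an explicit, reviewable override.
STRICT_DEFAULT = {
    "bias_testing": True,           # Colorado SB 21-169 pattern
    "explainability_report": True,  # NY DFS Circular Letter No. 7 pattern
    "human_review_required": True,  # California SB 1120 pattern
    "annual_attestation": True,
}

# Hypothetical override for illustration only.
STATE_OVERRIDES = {
    "WI": {"annual_attestation": False},
}

def requirements_for(state: str) -> dict:
    """Requirements for one state: strict defaults plus explicit overrides."""
    cfg = dict(STRICT_DEFAULT)
    cfg.update(STATE_OVERRIDES.get(state, {}))
    return cfg
```

The design choice that matters is the direction of the override: a new state costs nothing until counsel documents a reason to loosen a flag, which keeps the default posture examination-ready.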
Build order
Each artifact below is a prerequisite for the next. The sequence is what gets a carrier through the NAIC AI Evaluation Tool pilot across the twelve states without a scramble in the weeks before an examiner arrives.
| Order | What you build | Eval gate before moving on |
|---|---|---|
| 1 | Model inventory and tiering: every production AI model categorized into the inventory schema, with business owner, technical owner, line of business, decision type, and Tier 1/2/3 classification recorded | A single registry returns a complete list of underwriting, claims, pricing, and servicing models in under one hour, with zero models flagged "unknown owner" or "unknown tier" |
| 2 | AIS Program governance documents: written policy, accountability assignments, board or committee charter, review cadence, and the formal AIS Program description that maps to Model Bulletin Section 4 | Counsel and the accountable executive sign the AIS Program document; the policy references each Tier 1 model by ID from step 1 |
| 3 | Pre-deployment validation files for every Tier 1 model: model card, validation methodology, bias testing across protected characteristics, approval signatures, independence statement | 100 percent of Tier 1 models have a current validation file on record; any model without one is either revalidated or scheduled for retirement with a dated plan |
| 4 | Continuous monitoring and decision-lineage infrastructure: drift detection, bias metrics, override rates, adverse outcome capture, and per-decision audit trail with inputs, outputs, confidence, and human review | A randomly sampled adverse action notice from the last 90 days can be reconstructed end to end (inputs, model version, output, reviewer) in under one business day |
| 5 | Third-party data and models documentation: vendor evaluation file per Tier 1 vendor model, contractual audit rights, version pinning, change notification SLA, and a remediation pathway | Every Tier 1 third-party model has a vendor evaluation file dated within the last 12 months and a contract clause covering audit, version, and EO 14365 contingency |
| 6 | Exam-cycle readiness drill: a tabletop run of the NAIC AI Evaluation Tool questionnaire against the inventory, validation files, monitoring evidence, and vendor files | The internal drill closes with no Tier 1 finding and a documented response to every Tool section, archived as the baseline for the real exam |
After step 6, the program shifts into steady state: quarterly evidence review, annual Tier 1 revalidation, continuous monitoring with documented disposition of alerts, and vendor contract renewals that incorporate EO 14365 contingency language. Carriers that skip the order, for example standing up monitoring before the inventory exists or chasing vendor files before validation files are current, end the pilot exam explaining gaps that an inventory entry or a signed validation file would have closed.
What about the vendor registry
The Third-Party Data and Models Working Group's March 23, 2026 session sketched a registration regime rather than a licensure regime. Vendors will file information with regulators on a defined cadence; the filings will be visible to insurance departments. The registry creates a regulator-side data layer; it does not change the carrier's diligence obligation under Section 4 of the Bulletin.
For vendors selling into insurance, the implications:
- A registry filing will be required, with content that vendors can largely control (model card, intended use, validation evidence)
- The filing does not satisfy carrier diligence obligations; carriers will continue requiring their own vendor evaluation files
- The cadence (likely quarterly or biannual updates) drives a recurring documentation work stream
- Adoption timeline puts first state implementations in late 2026 or early 2027
For carriers, the implications are smaller. The registry is regulator-facing; the carrier's diligence file is the same one they were already building. The carrier's negotiating leverage with vendors increases somewhat (vendors that cannot or will not produce registry-quality documentation are easier to identify), but the carrier's own work does not change.
The conversation worth having with vendors now: their registry preparation timeline. Vendors that have not started preparing will have a difficult late 2026; carriers procuring from those vendors inherit some of that difficulty. Vendors with documented model cards, validation evidence, and clear update cadences are positioned for the registry; carriers should prefer them in renewals.
How Respan fits
The NAIC AI Evaluation Tool inspects the same artifacts that LLM observability produces as exhaust: model inventories, validation files, monitoring evidence, decision lineage, and vendor oversight. Respan is the substrate that makes those artifacts a byproduct of operations rather than a fire drill before an examiner arrives.
- Tracing: every AI-driven underwriting, claims, and pricing decision captured as one connected trace. Auto-instrumented for LangChain, LlamaIndex, Vercel AI SDK, CrewAI, AutoGen, OpenAI Agents SDK. When an examiner asks for the data inputs, model output, confidence, and human review applied to a specific adverse action notice, the trace is the answer.
- Evals: ten built-in evaluators (faithfulness, citation accuracy, refusal correctness, harmfulness) plus LLM-as-judge and custom Python evaluators. Production traffic flows directly into datasets. CI-aware experiments block regressions on bias drift across demographic groups, hallucinated coverage terms, and unjustified denial rationales before deploys ship.
- Gateway: 500+ models behind an OpenAI-compatible interface, semantic caching, fallback chains, per-customer spending caps. The gateway is where version pinning, change notification, and per-line-of-business model routing get enforced for the third-party LLMs that show up in vendor evaluation files.
- Prompt management: versioned registry, dev/staging/prod environments with approval workflows, A/B testing in production with one-click rollback. Underwriting triage prompts, claims summarization prompts, RAG retrieval prompts, and adverse action explanation templates all belong in the registry so each version maps to a validation file and an approval signature.
- Monitors and alerts: prediction distribution drift, bias metrics across protected characteristics, user override rates, adverse outcome rates, hallucination rate against domain evaluation sets. Slack, email, PagerDuty, webhook. Tier 1 thresholds wire directly to the AIS Program's monthly review cadence.
A reasonable starter loop for insurance AI builders:
- Instrument every LLM call with Respan tracing including underwriting decisions, claims triage spans, and RAG retrieval steps.
- Pull 200 to 500 production claims and underwriting decisions into a dataset and label them for bias, factual accuracy, and adverse action defensibility.
- Wire two or three evaluators that catch the failure modes you most fear (hallucinated coverage terms, biased denials across protected classes, AI overriding clinical or claims judgment without human review).
- Put your underwriting, claims summarization, and adverse action prompts behind the registry so you can version, A/B, and roll back without a deploy.
- Route through the gateway so vendor model versions are pinned, swappable, and auditable when the vendor registry filing arrives.
The 2026 examination cycle rewards carriers whose evidence is generated by the system rather than reconstructed for the exam, and that is what this loop produces.
To wire any of the patterns above on Respan, start tracing for free, read the docs, or talk to us.
Related reading
- Building Claims AI Without Becoming the Next nH Predict: cautionary architecture and eval framework
- Evaluating Underwriting LLMs: Cytora and Sixfold patterns
- Building an AI Claims Processing Agent: full architecture walkthrough
- How Insurance Teams Build LLM Apps in 2026: pillar overview
- The April 2026 Model Risk Overhaul: adjacent fintech regulatory framework
