Sixfold raised $30 million in January 2026 to scale its AI Underwriter, with deployments across Zurich North America, Guardian, Generali Global Corporate & Commercial, and Skyward Specialty. Zurich reported saving up to two hours per submission across 200+ underwriters; Skyward reduced quote response times by 35%. Cytora launched Autopilot in March 2026 to enable end-to-end agentic underwriting workflows; partnerships with Markel, Chubb, Arch, and an alliance with LexisNexis Risk Solutions in April 2026 extended the deployment footprint. Artificial Labs' Ava, Federato, V7 Go, Eigen Technologies, Ki, and DQPro fill out the Lloyd's market and global commercial underwriting stack. The category has stabilized; the differentiation is in execution depth.
The hard parts of underwriting LLM eval are not the same as those of customer-facing LLM eval. Underwriters are sophisticated reviewers who notice when the AI's risk assessment misses a material consideration. Bad recommendations do not just lose conversion; they bind the insurer to a risk profile that affects loss ratios for years. The NAIC Model Bulletin and AI Evaluation Tool require bias testing and audit trails for any model that supports underwriting decisions. The eval framework has to address all of this.
This post covers the four dimensions specific to underwriting LLM evaluation: risk selection accuracy, calibration, bias and adverse impact, and audit grounding. It includes dataset construction patterns, the metrics that matter for each dimension, and the operational practice that turns one-time benchmarks into continuous evaluation.
What underwriting LLMs actually do
The category covers a wider workflow than "AI underwrites the policy." A practical taxonomy:
| Workflow | What the AI does | Vendor examples |
|---|---|---|
| Submission intake and triage | Parse broker submissions (emails, documents, calls), extract structured risk data, route to the right underwriter | Sixfold, Cytora, Artificial Labs Ava |
| Document extraction | Parse MRC slips, cover notes, broker presentations into structured CDR-compliant data | Eigen Technologies, V7 Go (with visual grounding) |
| Risk enrichment | Pull external data (satellite imagery, loss histories, supply chain, regulatory filings), build full risk picture | Cytora + LexisNexis, Cytora + Warren Group |
| Pricing and risk scoring | Take enriched data, produce indicative pricing recommendations | Carrier-specific implementations on Cytora/Sixfold |
| Portfolio monitoring | Watch the live book, alert on accumulation thresholds, rate adequacy drift | Federato, Cytora |
| Compliance and data quality | Check that bound terms match the policy administration system | DQPro (45% of the Lloyd's market) |
| Algorithmic follow and auto-bind | Autonomously bind risks within pre-set parameters | Ki, Artificial Labs Smart Follow |
| Underwriter copilot | Surface relevant precedent, similar risks, appetite guidance | Internal at major carriers |
Most production underwriting LLM systems combine several of these. Cytora's Autopilot connects intake through to underwriting decisions; Sixfold integrates into existing workbenches and policy administration systems. The eval framework has to address the system as it is configured at the carrier, not just any single component.
Why underwriting LLM eval is different
Several properties make standard LLM eval frameworks insufficient.
The training signal is delayed and noisy. A risk written today does not produce a definitive "this was a good or bad risk" signal until claims emerge over months and years. The signals that arrive immediately (underwriter approval, broker satisfaction, win rate) are weak proxies for the signal that matters (loss ratio).
Underwriters are sophisticated reviewers. They will catch obvious errors. The errors that matter for evaluation are subtle: a risk factor the AI missed, a comparable that is the wrong shape, a regulatory consideration that did not surface. Generic evaluation patterns (recall, precision against gold labels) miss these.
Adverse selection exposure. A risk-selection model that is biased in ways that admit bad risks while screening good ones produces adverse selection. Loss ratios degrade in ways that are visible only at the portfolio level over months. Eval has to look at portfolio-level outcomes, not just per-risk accuracy.
Bias testing requirements are explicit. Colorado SB 21-169 (extended to auto and health in October 2025), NY DFS Circular Letter 2024-7, and the broader NAIC Model Bulletin all require bias testing for protected characteristics in underwriting. The eval framework includes these as primary metrics, not afterthoughts.
Auditability matters. When a state insurance department examines underwriting decisions during a market conduct exam, they expect to be able to reconstruct individual risk decisions, see what data the AI relied on, understand how the recommendation was reached, and verify it complies with applicable laws.
These properties shape the eval framework that has emerged.
Dimension 1: Risk selection accuracy
The first-order question: does the AI's risk assessment align with what an experienced underwriter would conclude?
Construct the eval set
Four sources for evaluation cases:
Historical bound risks with known outcomes. Risks the carrier wrote in past years with claim history and loss ratio attached. The AI's recommendation on these (had the AI been operating then) is compared to what actually happened. This is the gold standard but is biased toward the historical underwriting selection (you only see outcomes for risks you wrote).
Historical declined risks where outcomes are knowable. Risks the carrier declined that were placed elsewhere. If the eventual market outcome is observable (the writing carrier had losses), this is signal. Hard to obtain but valuable for catching systematic over-decline patterns.
Senior underwriter annotated cases. A panel of senior underwriters reviews submissions and produces gold-standard recommendations. The AI is evaluated against those recommendations. Captures judgment in a way that historical outcomes alone do not.
Adversarial cases. Submissions designed to test specific failure modes: unusual industry classifications, atypical exposures, edge cases in the appetite, submissions with deliberate ambiguity. The AI's handling of these reveals where its training generalizes and where it does not.
For commercial lines, 500 to 2,000 cases is the typical starting eval set size. Stratification by industry, line of business, premium size, and complexity is essential.
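A minimal sketch of that stratification with pandas; the column names and the floor of five cases per stratum are illustrative assumptions, not a fixed schema:

```python
import pandas as pd

def build_eval_set(cases: pd.DataFrame, target_size: int = 1000,
                   seed: int = 42) -> pd.DataFrame:
    """Sample proportionally within each stratum, keeping a small floor
    so rare segments (large premiums, unusual industries) are represented."""
    strata = ["line_of_business", "premium_band", "complexity"]
    frac = min(1.0, target_size / len(cases))

    def take(group: pd.DataFrame) -> pd.DataFrame:
        n = max(5, int(round(len(group) * frac)))  # floor of 5 per stratum
        return group.sample(n=min(n, len(group)), random_state=seed)

    return cases.groupby(strata, group_keys=False).apply(take)
```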
Compute the metrics
For ranking-style outputs (the system surfaces top-N risks to the underwriter):
- Recall@K. Of risks that should have been bound, what fraction surface in the top K?
- Precision@K. Of the top K, what fraction were actually written profitably?
For classification-style outputs (decline, refer, auto-quote):
- Accuracy by class. Performance on declines vs auto-quotes vs referrals, separately
- Confusion matrix. Where the AI most often disagrees with the underwriter
- Cost-weighted accuracy. Weighted by the dollar exposure of the risk; getting big risks right matters more than getting small ones right
For pricing recommendations:
- Pricing alignment. How close are AI-recommended premiums to underwriter-set premiums on bound risks?
- Loss ratio prediction. Among bound risks, do AI predictions of loss ratio correlate with actual loss ratios over time?
Stratify all metrics. Aggregate accuracy hides the cases where the system fails most.
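Minimal sketches of two of these metrics, assuming NumPy arrays aligned by submission (names illustrative):

```python
import numpy as np

def recall_at_k(scores: np.ndarray, should_bind: np.ndarray, k: int) -> float:
    """Of risks the underwriter panel says should bind, what fraction
    does the AI surface in its top K?"""
    top_k = np.argsort(scores)[::-1][:k]
    return float(should_bind[top_k].sum() / max(1, should_bind.sum()))

def cost_weighted_accuracy(pred: np.ndarray, gold: np.ndarray,
                           exposure_usd: np.ndarray) -> float:
    """Accuracy weighted by dollar exposure: a wrong call on a $2M
    account costs more than a wrong call on a $20K account."""
    correct = (pred == gold).astype(float)
    return float((correct * exposure_usd).sum() / exposure_usd.sum())
```

Run both per stratum (mask the arrays by segment) rather than only on the aggregate, for the reason above.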
Dimension 2: Calibration
A well-calibrated risk score is one where the score corresponds to a real probability or rate. A 70/100 risk score for "expected loss ratio above target" should mean roughly a 70% probability of that outcome.
Calibration matters in underwriting because:
Underwriters interpret scores as probabilities. When a risk score reads "high" or "85/100," underwriters incorporate it as confidence. Poorly calibrated scores deceive even sophisticated users.
Threshold-based decisions depend on calibration. Carriers configure auto-decline thresholds, referral thresholds, and auto-bind thresholds based on score. Miscalibration means thresholds operate on scores that do not mean what they claim.
Calibration affects regulatory defense. When a regulator examines whether AI-driven pricing is fair, calibrated scores are easier to defend than scores that drift in unpredictable ways.
Measure calibration
- Reliability diagrams. Plot predicted probability against observed frequency in bins. Production systems target Expected Calibration Error (ECE) below 5%.
- Calibration over time. Models drift as risk environment changes. Track calibration on rolling windows; investigate when ECE rises.
- Per-segment calibration. Calibration can be good on aggregate while failing in specific segments (small commercial, specific industries). Stratified ECE catches these.
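A minimal ECE computation under stated assumptions (equal-width bins, binary outcomes; array names are illustrative). Per-segment ECE is the same computation over each stratum's subset:

```python
import numpy as np

def expected_calibration_error(pred: np.ndarray, outcome: np.ndarray,
                               n_bins: int = 10) -> float:
    """Equal-width-bin ECE for binary outcomes; the target above is below 0.05."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        upper = (pred <= hi) if hi == 1.0 else (pred < hi)
        mask = (pred >= lo) & upper
        if mask.any():
            # bin weight * |mean predicted probability - observed frequency|
            ece += mask.mean() * abs(pred[mask].mean() - outcome[mask].mean())
    return float(ece)
```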
When LLM-based risk scoring is miscalibrated, post-hoc methods (Platt scaling, isotonic regression) can recalibrate. Per-segment recalibration is technically sound but raises concerns when the segments correlate with protected characteristics; consult counsel.
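A sketch of the recalibration step using scikit-learn's IsotonicRegression (a real API); the arrays here are placeholders, and the fit must use a held-out calibration split, never the eval set itself:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Placeholder calibration split: raw model scores in [0, 1] and binary outcomes.
raw_cal = np.array([0.2, 0.4, 0.6, 0.8, 0.9])
outcome_cal = np.array([0, 0, 1, 1, 1])

iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(raw_cal, outcome_cal)               # learn a monotone score -> rate map
calibrated = iso.predict(np.array([0.7]))   # order-preserving, so rankings survive
```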
Dimension 3: Bias and adverse impact
The legal frame: state insurance departments under the NAIC Model Bulletin require carriers to test for "unfair discrimination" in AI-driven underwriting and pricing. Colorado SB 21-169 (now covering life, auto, and health) is the strictest in operationalizing this, requiring algorithm inventories, bias testing across consumer demographic groups, and annual compliance reports with chief risk officer attestation. NY DFS Circular Letter 2024-7 requires bias testing and explainability for AI in underwriting and pricing. The EU AI Act classifies insurance pricing as high-risk.
What to test
Selection rate parity. For each protected characteristic relevant to the line of business and jurisdiction, compute the rate at which submissions are accepted, declined, or referred. The four-fifths rule (any group's selection rate below 80% of the highest group's rate) is a starting reference, though insurance bias law is stricter than employment law in some respects, and the specific protected categories may differ.
Pricing parity. Among accepted risks, are premiums systematically different across demographic groups for similar risks? "Similar risks" requires careful definition; using actuarially valid risk factors is permissible, using protected characteristics or their proxies is not.
Proxy detection. Variables that correlate with protected characteristics and are used in pricing (zip code as wealth/race proxy, occupation patterns as gender proxy) require examination. Some are actuarially defensible; some are not. The framework should surface them for review rather than simply allow or block.
Disparate adverse impact. Beyond selection and pricing, are downstream outcomes (claim handling speed, settlement amounts, appeal success rates) different across groups for similar claims? Bias can enter at any stage.
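A minimal sketch of the first of these checks, assuming a pandas DataFrame with `group` and `accepted` columns (hypothetical names). Per the division of labor below, the flags go to counsel for review; nothing auto-blocks:

```python
import pandas as pd

def selection_rate_parity(decisions: pd.DataFrame) -> pd.DataFrame:
    """Four-fifths-rule check per protected group."""
    rates = decisions.groupby("group")["accepted"].mean()
    ratios = rates / rates.max()
    return pd.DataFrame({
        "selection_rate": rates,
        "ratio_to_highest": ratios,
        "four_fifths_flag": ratios < 0.8,  # flag for counsel review, not auto-block
    })
```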
For Colorado specifically, SB 21-169 requires:
- Inventory of every algorithm and external data source used in pricing
- Testing for discriminatory outcomes
- Annual compliance reports with chief risk officer attestation
- Now covers life, auto, and health insurance
Carriers writing in Colorado need this in production; carriers writing nationally often find Colorado the strictest binding constraint, so building to it covers most other states.
What to avoid
Using protected characteristics directly as model inputs is direct discrimination, prohibited regardless of bias monitoring. Variables that are proxies for protected characteristics require a three-part analysis: is the variable actuarially valid (does it predict loss)? Does its use produce disparate impact? If so, is that impact justified by the actuarial validity?
This analysis is contestable and case-by-case. The eval framework's job is to surface the variables and their effects, not to make the legal conclusion. Counsel and compliance make the legal call; engineering implements whatever decision is reached.
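A crude screen for surfacing those variables, under stated assumptions (numeric pricing variables, plus `group` and `loss_ratio` columns available in a controlled eval environment; all names hypothetical). It ranks variables by how much they shift across groups and reports correlation with realized loss as an actuarial-validity signal; it surfaces candidates, it decides nothing:

```python
import pandas as pd

def surface_proxy_candidates(df: pd.DataFrame,
                             variables: list[str]) -> pd.DataFrame:
    """Build a review queue of potential proxy variables for counsel."""
    rows = []
    for var in variables:
        by_group = df.groupby("group")[var].mean()
        spread = float(by_group.max() - by_group.min())        # movement across groups
        loss_corr = float(df[var].corr(df["loss_ratio"]))      # actuarial-validity signal
        rows.append({"variable": var, "group_mean_spread": spread,
                     "loss_correlation": loss_corr})
    return pd.DataFrame(rows).sort_values("group_mean_spread", ascending=False)
```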
Dimension 4: Audit grounding
The audit trail dimension specific to underwriting LLMs.
For every risk evaluation, the system needs to produce all of the following (one possible record shape is sketched after the list):
- The submission inputs as received (broker email, documents, attachments)
- The structured data extraction (with provenance back to source)
- Any external data enrichment (with sources and timestamps)
- The model and prompt versions used
- The model's output and rationale
- The underwriter's decision and reasoning
- Any deviation from AI recommendation, with explanation
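A sketch of that record as a dataclass; field names and types are assumptions, and a real system would persist this append-only with the raw artifacts in object storage:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class RiskAuditRecord:
    submission_id: str
    raw_inputs: list[str]        # object-store URIs: broker email, docs, attachments
    extracted_fields: dict       # field -> {value, source_doc, page, span}
    enrichments: list[dict]      # each: {source, retrieved_at, payload_uri}
    model_version: str
    prompt_version: str
    model_output: str
    model_rationale: str
    underwriter_decision: str
    underwriter_reasoning: str
    deviation_explanation: str | None = None  # required when decision != AI recommendation
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
```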
Cytora's published documentation specifically emphasizes "explainable agentic reasoning" with "every workflow step fully auditable with transparent reasoning records." This is not marketing fluff; it is the architectural requirement for surviving market conduct examinations under the NAIC AI Evaluation Tool.
The eval framework includes audit grounding as a quality dimension:
- Grounding rate. What fraction of the AI's claims about a risk trace to specific source data?
- Hallucination rate. What fraction of the AI's claims cannot be verified or are incorrect?
- Provenance completeness. Can every structured field be traced back to a specific source document and location?
V7 Go's "visual grounding" approach (linking every extracted field back to its exact source location in the document) is one implementation. Carriers that adopt similar approaches produce evidence that withstands scrutiny.
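A crude grounding check under stated assumptions: each extracted field carries `quoted_span` and `source_doc` entries (hypothetical keys), and exact substring match stands in for the offset-based or fuzzy matching a production system would use:

```python
def grounding_rate(fields: list[dict], sources: dict[str, str]) -> float:
    """Fraction of extracted fields whose quoted span actually appears
    in the cited source document."""
    grounded = sum(
        1 for f in fields
        if f.get("quoted_span")
        and f["quoted_span"] in sources.get(f["source_doc"], "")
    )
    return grounded / max(1, len(fields))
```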
Putting the framework together
A continuous eval pipeline for underwriting LLMs:
```text
      Production submissions and outcomes
                      |
                      v
             Stratified sampling
 (by line, premium size, complexity, demographic)
                      |
                      v
+-----------------------------------------------+
| Daily / weekly evaluation                     |
|                                               |
| [Risk selection accuracy]                     |
|   - Recall, precision, accuracy               |
|   - Cost-weighted metrics                     |
|   - Stratified by segment                     |
|                                               |
| [Calibration]                                 |
|   - Reliability diagrams                      |
|   - ECE per segment                           |
|   - Calibration drift over time               |
|                                               |
| [Bias and adverse impact]                     |
|   - Selection rate per protected group        |
|   - Pricing parity                            |
|   - Proxy variable analysis                   |
|   - Downstream outcome disparities            |
|                                               |
| [Audit grounding]                             |
|   - Grounding rate                            |
|   - Hallucination rate                        |
|   - Provenance completeness                   |
|                                               |
+-----------------------------------------------+
                      |
                      v
  Dashboards, alerts, regression catches in CI
                      |
                      v
[Annual external audit + NAIC AI Eval Tool examination]
```
The continuous eval is the engineering practice that prevents findings; the annual examination is downstream of it.
Operational practice
Pre-deployment gate. No new model version, prompt change, or retrieval pipeline update reaches production without passing the four-dimension eval. Thresholds are documented and reviewed quarterly.
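A sketch of that gate as a pytest check; the thresholds are illustrative, and `run_four_dimension_eval` is a hypothetical harness entry point, not a real library call:

```python
from eval_harness import run_four_dimension_eval  # hypothetical harness module

# Illustrative thresholds; real values live in the documented,
# quarterly-reviewed gate config.
THRESHOLDS = {
    "recall_at_25": 0.85,           # risk selection
    "worst_segment_ece": 0.05,      # calibration
    "min_four_fifths_ratio": 0.80,  # bias
    "grounding_rate": 0.95,         # audit grounding
}

def test_candidate_model_passes_gate():
    r = run_four_dimension_eval(model="candidate")
    assert r["recall_at_25"] >= THRESHOLDS["recall_at_25"]
    assert r["worst_segment_ece"] <= THRESHOLDS["worst_segment_ece"]
    assert r["min_four_fifths_ratio"] >= THRESHOLDS["min_four_fifths_ratio"]
    assert r["grounding_rate"] >= THRESHOLDS["grounding_rate"]
```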
Continuous monitoring with alerts. Bias metrics are computed weekly, calibration metrics monthly, and grounding rate continuously on sampled production traffic. Threshold-breach alerts are investigated within five business days.
Quarterly portfolio review. The slow signals (loss ratios on bound risks) emerge over time. Quarterly review of how AI-influenced selection has affected portfolio performance, with feedback loops back to the model.
Annual external validation. Independent third-party validation of the eval framework itself: are the right metrics being measured, are thresholds appropriate, is the eval set representative? This is separate from the NAIC AI Evaluation Tool exam, which focuses on governance.
Litigation hold readiness. The infrastructure to preserve specific traces and eval data on demand. When an inquiry arrives, the team can isolate relevant records without disrupting ongoing operations.
Build order
Underwriting LLM eval (for systems following the Sixfold and Cytora Autopilot patterns) only works if each layer is solid before the next is loaded on top. The sequence below is dependency-ordered, not calendar-ordered.
| Order | What you build | Eval gate before moving on |
|---|---|---|
| 1 | Algorithm inventory and tier classification under NAIC Model Bulletin, with validation evidence assessment per system | 100% of underwriting AI systems catalogued; tier and validation status signed off by chief risk officer |
| 2 | Stratified eval set: historical bound risks with claim outcomes, declined risks, senior underwriter annotated cases, adversarial cases | 500 to 2,000 cases stratified by line, premium size, complexity, and demographic; counsel review of demographic data handling complete |
| 3 | Risk selection accuracy and calibration metrics on the eval set | Recall@K, precision@K, cost-weighted accuracy, and ECE below 5% per segment, with baseline documented |
| 4 | Bias and adverse impact infrastructure: selection rate per protected group, pricing parity, proxy variable analysis | Four-fifths rule compliance reviewed by counsel; proxy variables flagged and dispositioned; Colorado SB 21-169 inventory ready |
| 5 | Audit grounding: provenance from every extracted field to source location, hallucination detection on production samples | Grounding rate above 95% and hallucination rate below 2% on sampled production traffic; provenance reconstructable end to end |
| 6 | Operational integration: CI pre-deployment eval gates, alert routing, quarterly portfolio review cadence, independent annual validator engaged | Pre-deployment evals block on regression; alerts on bias, calibration, and grounding breaches reach owners within minutes with dispositions tracked |
After order 6, the loop is continuous: monitoring, quarterly portfolio review feeding back into model iteration, annual external validation, and preparation for NAIC AI Evaluation Tool examination. Skipping ahead in the order, especially shipping risk scoring before bias and grounding gates exist, is how teams end up explaining themselves to regulators or under oath.
How Respan fits
Underwriting LLMs sit at the intersection of regulatory scrutiny, delayed loss-ratio signals, and underwriter trust. Respan provides the eval, observability, and governance primitives that turn the four-dimension framework above into a continuous practice rather than an annual scramble.
- Tracing: every underwriting submission captured as one connected trace. Auto-instrumented for LangChain, LlamaIndex, Vercel AI SDK, CrewAI, AutoGen, OpenAI Agents SDK. Intake, extraction, enrichment, scoring, and underwriter decision spans hang together so you can reconstruct a bound risk during a market conduct exam without grepping logs.
- Evals: ten built-in evaluators (faithfulness, citation accuracy, refusal correctness, harmfulness) plus LLM-as-judge and custom Python evaluators. Production traffic flows directly into datasets. CI-aware experiments block regressions on miscalibrated risk scores, ungrounded extractions, and selection-rate disparities before deploys ship.
- Gateway: 500+ models behind an OpenAI-compatible interface, semantic caching, fallback chains, per-customer spending caps. Route extraction, scoring, and rationale generation through different models without rewriting integration code, and keep a fallback chain ready when a primary model degrades mid-quarter.
- Prompt management: versioned registry, dev/staging/prod environments with approval workflows, A/B testing in production with one-click rollback. Risk-scoring prompts and appetite guidance templates get reviewed by compliance, A/B tested against senior underwriter annotated cases, and rolled back the moment loss-ratio signals turn.
- Monitors and alerts: grounding rate, hallucination rate, Expected Calibration Error per segment, selection rate per protected group, pricing parity drift. Slack, email, PagerDuty, webhook. Threshold breaches reach the underwriting AI lead and compliance officer within minutes, with the offending trace attached for disposition.
A reasonable starter loop for underwriting LLM builders:
- Instrument every LLM call with Respan tracing including submission intake, document extraction, enrichment, risk scoring, and underwriter decision spans.
- Pull 200 to 500 production bound and declined risks into a dataset and label them for risk selection accuracy, calibration, bias exposure, and audit grounding.
- Wire two or three evaluators that catch the failure modes you most fear (ungrounded risk claims, miscalibrated scores at threshold boundaries, selection-rate disparities across protected groups).
- Put your risk-scoring and rationale prompts behind the registry so you can version, A/B, and roll back without a deploy.
- Route through the gateway so you can swap extraction and scoring models, hold semantic cache for repeated submissions, and cap spend per carrier program.
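Because the gateway exposes an OpenAI-compatible interface (per above), the stock OpenAI client works with a swapped base URL. A minimal sketch; the URL, API key, model alias, and prompt here are placeholders, not documented Respan values:

```python
from openai import OpenAI

# Placeholder gateway URL and key; routing and fallback chains are
# configured gateway-side, not in application code.
client = OpenAI(base_url="https://gateway.example.com/v1", api_key="YOUR_KEY")

submission_text = "..."  # broker email body, placeholder

resp = client.chat.completions.create(
    model="extraction-primary",  # gateway alias, swappable without code changes
    messages=[{
        "role": "user",
        "content": f"Extract structured risk data from this submission:\n{submission_text}",
    }],
)
print(resp.choices[0].message.content)
```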
The result is an eval pipeline that survives NAIC AI Evaluation Tool examination, holds up to plaintiff discovery, and gives underwriters scores they can actually trust.
To wire any of the patterns above on Respan, start tracing for free, read the docs, or talk to us.
Related reading
- The NAIC AI Evaluation Tool: Engineering for the 2026 Pilot: the regulatory framework
- Building Claims AI Without Becoming the Next nH Predict: the cautionary tale
- Building an AI Claims Processing Agent: full architecture walkthrough
- How Insurance Teams Build LLM Apps in 2026: pillar overview
