Fraud detection has been a machine learning workflow longer than most other applications in finance. Stripe Radar, Sift, Sardine, Resistant AI, and the in-house systems at every major payment processor all use gradient-boosted trees or neural networks trained on hundreds of millions of historical transactions. These systems work, they hit single-digit-millisecond inference latencies, and they cost a tiny fraction of an LLM call per decision.
The question for engineering teams in 2026 is not "should we use an LLM for fraud" framed as a binary. It is "which fraud workflows get LLMs, which do not, and how do we evaluate the difference." Recent research consistently shows that LLMs alone do not beat domain-tuned tabular ML for raw fraud classification on transaction data. Benchmarks across multiple 2025 papers found that LLMs applied directly to tabular fraud detection produced only marginal improvements over random guessing on common test sets. The structural reason is that tabular fraud signals are dense numeric patterns over long histories; LLMs are designed for sequences over text and pay an efficiency tax for tabular inputs.
But this is not the whole picture. LLMs unlock workflows that tabular ML cannot do, and evaluating fraud systems in 2026 means evaluating the workflows separately, not the models against a single benchmark. This post covers the workflow taxonomy, the evaluation dimensions that actually matter for each, the cost and latency math at scale, and the architectural patterns that have stabilized across the industry.
The workflow taxonomy
Fraud detection in a modern fintech is not one problem. It is six or seven distinct workflows, each with different latency budgets, different stakes, and different tolerance for errors. The right LLM-vs-ML decision differs across them.
| Workflow | Latency budget | Stakes | Best primary model | LLM role |
|---|---|---|---|---|
| Card auth fraud (real-time decline / approve) | 100ms hard | Customer experience, false-positive impact | Tabular ML | None in the auth path |
| Account takeover detection (login, session) | 200-500ms | Account security | Tabular ML + rules | Risk explanation only |
| Synthetic identity at onboarding | 1-3 seconds | KYC, regulatory | Tabular ML + document AI | Document understanding, anomaly explanation |
| AML alert triage (post-alert) | Seconds to minutes | Regulatory, investigator efficiency | LLM | Primary investigator copilot |
| SAR (Suspicious Activity Report) drafting | Minutes to hours | Regulatory accuracy | LLM | Primary drafting role |
| Investigation deep-dive (case work) | Hours | Investigation quality | LLM | Primary research and synthesis role |
| Customer communication (decline explanation) | Seconds | Customer experience, fair lending | LLM | Primary explanation role |
The three workflows where LLMs are genuinely the right primary tool (alert triage, SAR drafting, investigations) all share properties: latency budgets that allow seconds-to-minutes, output that is text or structured rationale, and tasks where the LLM's strength (reading and synthesizing varied evidence) is what produces value.
The four workflows where LLMs are not the primary tool share different properties: latency budgets in milliseconds, decisions on numeric tabular signals, and tasks where domain-tuned ML has a 10-year head start in accuracy. Trying to put an LLM in the auth path because "AI is the future" produces worse fraud detection at higher latency at significantly higher cost.
Cost math at scale
For fintechs running material transaction volume, the cost difference between tabular ML and LLMs in the auth path is not a small optimization. It is a difference of three to four orders of magnitude.
A simple back-of-the-envelope.
| Metric | Tabular ML (real-time inference) | LLM call (frontier model) |
|---|---|---|
| Cost per inference | $0.00001 to $0.0001 | $0.001 to $0.05 |
| Latency p99 | 5-50ms | 500-5000ms |
| Throughput per host | 10,000+ TPS | 5-50 TPS |
For a fintech processing 1 billion transactions per month, the cost differential ranges from $10K (tabular) to $50M (frontier LLM at 5 cents per call) per month. The latency differential makes the LLM unsuitable for the auth path regardless of cost.
For workflows where the volume is not 1 billion but 10,000 (AML alerts per month at a mid-sized fintech), the math inverts. At 10,000 alerts per month, even an expensive LLM call ($0.05) costs $500/month. The investigator time saved (5 minutes of triage saved per alert, at $50/hour fully loaded) is $40,000/month. The ROI is not close.
The principle: LLM cost-effectiveness in fraud workflows is a function of two things, the cost per decision and the value per decision. Card auth has a small value per decision and a huge volume; LLMs lose. AML triage has a much larger value per decision and a smaller volume; LLMs win.
Evaluation dimensions per workflow
The evaluation metrics for an LLM in fraud workflows have to fit the workflow. A precision-recall curve on tabular fraud data is not the right evaluation for an investigator copilot. The table below is what teams have converged on.
| Workflow | Primary metric | Secondary metrics | LLM-specific concerns |
|---|---|---|---|
| AML alert triage | Investigator time saved | False clear rate (alerts incorrectly cleared), recall vs prior process | Hallucinated transaction details, missed sanctions hits |
| SAR drafting | Regulatory accuracy of draft | Investigator edit rate, draft completeness | Fabricated narrative details, missing required SAR sections |
| Investigation deep-dive | Investigation quality (sampled review) | Time to completion, evidence comprehensiveness | Source grounding, conflation of similar entities |
| Decline explanation | Specificity of reason produced | Customer comprehension scores, complaint rate | Generic reasons, fair lending exposure |
A few of these deserve detail.
AML alert triage
The standard pattern: an LLM-based agent takes an alert, retrieves the relevant transaction history, sanctions screening hits, prior alerts on the entity, and produces a structured triage output (clear, escalate to investigator, escalate to immediate response). The investigator reviews the output rather than the raw alert.
Why this works. AML alerts are dense in context and require synthesis. The work is reading a history and forming a judgment. LLMs are good at that.
The hard part. False clears. An alert that should have been escalated but the LLM cleared. This is a regulatory issue (the bank has an SAR filing obligation that survives the LLM's recommendation) and a process issue (the human-in-the-loop has to be structured to catch false clears).
Eval design. Maintain a dataset of historical alerts with known correct dispositions (the bank's prior decisions, validated). The LLM's triage outputs are scored against the gold dispositions. False clear rate is the primary regulatory metric. Investigator time saved is the productivity metric. Both are tracked over time; a model update that improves time saved at the cost of false clear rate is a regression, not an improvement.
A simplified evaluation flow:
Test alert (from gold dataset)
|
v
LLM triage agent produces:
disposition: clear | investigator | immediate_response
rationale: <text>
evidence_cited: [<list of facts from retrieved context>]
|
v
Compare:
- disposition == gold disposition
- rationale references correct evidence
- no fabricated evidence
|
v
Aggregate over 1,000+ alerts:
- confusion matrix
- false clear rate stratified by alert type
- rationale grounding rate
SAR drafting
LLMs draft Suspicious Activity Reports based on the case file an investigator has assembled. The draft is reviewed and edited by the investigator before submission to FinCEN.
Why this works. SAR drafting is repetitive, structured, and time-consuming. A high-quality draft saves 30 to 60 minutes of investigator time per SAR.
The hard part. Regulatory accuracy. A SAR that fabricates a transaction detail or misstates a date can create regulatory exposure for the bank. Hallucination tolerance is essentially zero.
Eval design. Compare LLM drafts against gold-standard human-written SARs for the same case files. Score on completeness (all required sections), factual accuracy (every claim grounded in case file evidence), and regulatory compliance (correct narrative structure per FinCEN guidance). The investigator edit rate is a useful productivity metric but not a quality metric; investigators may edit fluent-but-wrong drafts the same as fluent-and-right ones.
Decline explanation
When a transaction is declined, an LLM produces the consumer-facing explanation. The decline decision was made by the tabular ML model in the auth path; the LLM's job is the translation step, taking the model's feature attributions and producing a specific, fair-lending-compliant explanation.
Why this works. ML model attributions are inscrutable to consumers. LLMs translate them into actionable text.
The hard part. The same fair lending and specificity issues covered in Building Adverse Action Explainability for LLM-Driven Credit Decisions. The LLM cannot introduce new reasons, only translate existing ones. Validation and testing follow the patterns in that post.
Latency budget and architectural placement
The single most common architectural mistake is putting an LLM call inside a latency-critical path. The right pattern is asynchronous: the auth decision is made by tabular ML in 30ms; the LLM-generated explanation is computed in parallel or after the fact and surfaced when the customer asks.
A correct architecture for high-volume fraud:
Transaction request
|
v
Tabular ML scoring (30-50ms)
|
----+----
| |
v v
Decision Async LLM enrichment
returned (explanation generation,
to user case file synthesis,
alert routing rationale)
|
v
Persisted to case management
for downstream investigator
or customer communication
The latency-critical decision and the LLM-augmented context are separate concerns. The synchronous path stays fast; the asynchronous path adds the value LLMs provide without holding up customer-facing latency.
The same pattern applies to AML monitoring. The alert generation runs on a real-time stream with rule-based or ML-based scoring. The alert triage with an LLM happens after the alert is generated, with seconds-to-minutes latency budget. The investigator never sees the raw alert; they see the LLM-triaged alert with structured rationale.
What to evaluate before deployment
For each LLM-bearing fraud workflow, the pre-deployment evaluation has to cover at minimum these dimensions.
Accuracy on a held-out gold set. A representative sample of historical cases with known correct dispositions, large enough to detect performance differences (typically 500-2000 cases per workflow). The LLM is run on the cases without seeing the labels. Aggregate metrics (precision, recall, F1 for triage; edit rate, error rate for drafting) and stratified metrics (by alert type, transaction value, customer segment).
Adversarial robustness. A specific test suite of edge cases and attack patterns: prompt injection attempts in transaction memo fields, malformed inputs, conflicting context, off-distribution alerts. Each should produce a graceful failure (escalation to human) rather than a confident wrong answer.
Hallucination rate. For workflows that cite evidence (triage rationales, SAR drafts), measure the rate at which cited evidence is verifiable in the case file vs fabricated. Tolerance for fabricated evidence in fraud workflows is essentially zero.
Latency under load. p99 latency at expected production throughput, including upstream context retrieval and downstream post-processing. If the LLM call is fast but the surrounding pipeline is slow, the workflow is still slow.
Cost per decision. Total cost (LLM call plus retrieval plus post-processing) per workflow execution, aggregated to expected monthly volume. Compared against the value per decision (investigator time saved, customer experience improvement, regulatory risk reduction).
Drift detection plan. Production traffic distribution changes over time as fraudsters adapt. The eval set has to be refreshed on a defined cadence (typically quarterly), with new cases pulled from recent production traffic and annotated. A frozen eval set from launch is a false signal six months in.
Vendor model risk for fraud LLMs
Fraud workflows in regulated finance are subject to the same April 2026 model risk framework that covers credit decisioning. Specific implications for the LLM-based fraud workflows above:
Tier 1 versus Tier 2. AML triage and SAR drafting are typically Tier 1 (regulatory exposure if they fail). Investigation copilots used internally are typically Tier 2 (productivity tool, but with potential to influence regulatory work product). Decline explanations may be Tier 1 if they constitute adverse action notices, Tier 2 otherwise.
Vendor model contracts. Same requirements as covered in The April 2026 Model Risk Overhaul: zero-data-retention, no-train, version pinning, change notification, fallback path. For AML specifically, ensure your model provider's data handling supports OFAC and BSA/AML data classifications.
Evidence as byproduct. Every fraud LLM execution produces a trace that can be queried later. For AML triage, the trace lets the bank reconstruct why the LLM cleared an alert if a regulator later asks. For SAR drafting, the trace shows exactly which case file evidence the LLM relied on for each section of the draft.
Common architectural mistakes in 2026
Three patterns that show up in audits and post-mortems.
LLM in the auth path. A team replaces or augments tabular ML with LLM calls in the synchronous transaction path. The auth latency p99 jumps from 50ms to 1500ms. Customer experience degrades. Cost spikes. The team rolls back. This is increasingly rare but still happens to teams new to LLM economics.
No human-in-the-loop on triage. A team deploys LLM-based AML triage as fully automatic alert disposition. False clear rate creates regulatory exposure. The fix is straightforward (require human approval for triage decisions) but should have been the architecture from day one.
Eval set is the demo set. A team evaluates the fraud LLM against the same set of cases they used to demo the product to leadership. Performance looks great. Production performance is much worse because the demo set was not representative. The fix is a held-out gold set built from random production samples, annotated independently.
Fluency-truth tradeoff in SAR drafts. A team optimizes SAR drafting for investigator-perceived quality (which correlates with fluency) without separately measuring factual accuracy. Investigators accept fluent-but-fabricated drafts. The bank submits SARs with errors. Audit catches it. Two evals, separately tracked, fix this: fluency and accuracy are different metrics with different acceptance thresholds.
What to ship and in what order
If you are introducing LLMs into a fraud stack today, the priority order:
- AML triage with mandatory human approval. Highest-leverage workflow, lowest engineering risk if structured correctly. Produces immediate productivity gains and gives the team experience with LLM evaluation patterns.
- SAR drafting from triaged cases. Builds on the first investment. Eval is harder (factual accuracy verification) but the value per case is large enough to justify it.
- Investigation deep-dive copilot. Lower regulatory exposure, primarily a productivity tool. Useful for teams that want LLM experience without immediate regulatory pressure.
- Decline explanation. Cross-functional with the credit decisioning team if you have one. Implementation depends on how the auth-path ML model produces feature attributions.
- Auth-path augmentation. Probably never. If you are tempted, run the cost and latency math first.
Tabular ML stays the engine of the auth path. LLMs are the engine of everything that comes after.
How Respan fits
Evaluating LLMs across fraud workflows (AML triage, SAR drafting, investigation copilots, decline explanation) requires the same substrate underneath each one: traces of every LLM call, evals tied to gold dispositions, and a way to roll back when a model update regresses false clear rate. Respan is that substrate.
- Tracing: every fraud LLM execution captured as one connected trace, including the alert payload, retrieved transaction history, sanctions hits, prior alerts, and the final disposition with rationale. Auto-instrumented for LangChain, LlamaIndex, Vercel AI SDK, CrewAI, AutoGen, OpenAI Agents SDK. When a regulator later asks why the LLM cleared an alert, the trace is the evidence the bank reconstructs the decision from.
- Evals: ten built-in evaluators (faithfulness, citation accuracy, refusal correctness, harmfulness) plus LLM-as-judge and custom Python evaluators. Production traffic flows directly into datasets. CI-aware experiments block regressions on false clear rate, fabricated SAR narrative details, missing required SAR sections, and ungrounded triage rationales before deploys ship.
- Gateway: 500+ models behind an OpenAI-compatible interface, semantic caching, fallback chains, per-customer spending caps. AML triage and SAR drafting are Tier 1 workloads where vendor failure matters, so the gateway gives you provider redundancy and version pinning without rewriting the agent code.
- Prompt management: versioned registry, dev/staging/prod environments with approval workflows, A/B testing in production with one-click rollback. The triage agent prompt, the SAR drafting template, and the decline explanation prompt all belong in the registry so risk and compliance can review changes before they touch a Tier 1 path.
- Monitors and alerts: false clear rate stratified by alert type, rationale grounding rate, SAR draft factual accuracy, p99 latency under load, cost per decision against expected monthly volume. Slack, email, PagerDuty, webhook. Drift on any of these is the first signal that the eval set has gone stale and needs a refresh from recent production traffic.
A reasonable starter loop for fraud LLM builders:
- Instrument every LLM call with Respan tracing including the alert payload, retrieved evidence, sanctions screening hits, and the structured disposition output.
- Pull 200 to 500 production AML alerts and SAR cases into a dataset and label them for correct disposition, evidence grounding, and regulatory completeness.
- Wire two or three evaluators that catch the failure modes you most fear (false clears on alerts that should escalate, fabricated transaction details in SAR drafts, ungrounded rationales citing evidence not in the case file).
- Put your triage prompt, SAR drafting template, and decline explanation prompt behind the registry so you can version, A/B, and roll back without a deploy.
- Route through the gateway so you get provider redundancy, version pinning, and per-workflow cost caps that match the cost-per-decision math above.
Tabular ML stays the engine of the auth path; Respan is how you keep the LLM-augmented workflows around it auditable, evaluated, and regulator-ready.
To wire any of the patterns above on Respan, start tracing for free, read the docs, or talk to us.
Related reading
- The April 2026 Model Risk Overhaul: the regulatory framework for LLM fraud workflows
- Building Adverse Action Explainability for LLM-Driven Credit Decisions: for decline explanation workflows
- Building a Financial Research Agent: adjacent agent architecture patterns
- How Fintech Teams Build LLM Apps in 2026: pillar overview
