On April 17, 2026, the Federal Reserve, FDIC, and OCC jointly rescinded SR 11-7 and the related issuances (OCC 2011-12, FIL-22-2017) that had governed bank model risk for fifteen years. The new framework is principles-based, risk-tiered, and explicitly intended to cover the systems banks actually use today, including LLM-based underwriting assistants, AML triage agents, and customer-facing copilots.
If you build LLM applications inside a bank, or you sell LLM tooling into one, this is not an academic shift. The new framework formalizes what supervisors had been doing informally since 2023: applying model risk expectations to GenAI systems "by analogy." With the analogy now codified, vendor security reviews and internal validation processes are being rewritten across the industry. Teams that built LLM products without MRM scaffolding are scrambling. Teams that already built lineage, validation evidence, and effective challenge into their stack are getting through procurement faster.
This post is the engineering translation. Each section maps a piece of the new framework to the technical artifact your team has to produce. The code sketches are illustrative rather than integration code; the goal is to be useful to anyone evaluating or implementing observability for LLMs in regulated finance, regardless of stack.
What changed in one paragraph
The old SR 11-7 was prescriptive in some places and silent in others, and the silences were the problem. It defined "model" in 2011 terms (a quantitative method that processes input data into quantitative estimates), which forced supervisors to stretch the definition to cover LLM systems whose outputs are text, decisions, or actions rather than numbers. The new framework explicitly states that models, including GenAI and agentic systems, share one risk-management substrate. It is risk-tiered by materiality (Tier 1 models get strong controls; lower-tier models get proportional ones), it requires lifecycle evidence to be produced as a byproduct of how systems are built rather than reconstructed after the fact, and it expects "effective challenge" to be operationalized rather than narrative. Vendor and third-party models are explicitly in scope.
Tiering: where does your LLM use case land?
The first practical question is which tier your LLM application falls into. The framework is principles-based, so tiering is the bank's decision, but supervisors have signaled what they expect to see. The pattern below reflects what the major banks are using.
| Tier | Materiality | Examples | Implications |
|---|---|---|---|
| 1 | High | Credit underwriting decisions, AML alert disposition, fraud auto-decline at scale, customer-facing compliance disclosures, regulatory filings | Full validation, MRM approval before production, dual control on changes, monthly outcome analysis, reproducible challenger models |
| 2 | Medium | Internal research copilots used in deal decisions, AML investigator assistants, KYC document extraction, customer service AI with limited scope | Validation by independent reviewer, quarterly outcome analysis, change logs, vendor risk assessment if third-party |
| 3 | Low | Code generation for non-customer-facing systems, marketing content drafts, internal knowledge search not used for decisions | Process verification, periodic spot checks, lighter documentation |
Two things to notice. First, an LLM that helps an investigator triage AML alerts is not Tier 3 just because a human still makes the final decision. If the LLM's triage determines whether the human looks at an alert at all, the LLM is in the decision path and likely Tier 1 or 2. Second, the tier assignment itself is auditable. If you assign a use case to Tier 3 and the regulator disagrees, you defend the assignment with evidence (volume, dollar exposure, customer impact). A vague "we considered it low-risk" does not survive examination.
The most common mistake under the new framework is under-tiering customer-facing copilots. An LLM that answers customer questions about account terms is Tier 1 if its responses can constitute a regulatory disclosure or a misrepresentation. It does not matter that the model is "just answering questions."
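One way to make the assignment defensible is to encode the rubric as data and code rather than as a judgment call. A minimal sketch; the factor names and thresholds are illustrative assumptions, not anything the framework prescribes:

```python
from dataclasses import dataclass

@dataclass
class UseCaseRisk:
    """Scoring inputs for a tiering decision. Fields and thresholds
    below are illustrative, not prescribed by the framework."""
    monthly_decision_volume: int
    max_dollar_exposure: float        # per-decision exposure in USD
    in_decision_path: bool            # does the LLM gate what a human sees?
    customer_facing: bool
    outputs_can_be_disclosures: bool  # responses may constitute regulatory disclosures

def assign_tier(u: UseCaseRisk) -> int:
    # Customer-facing output that can constitute a disclosure is Tier 1
    # regardless of volume -- the "just answering questions" trap above.
    if u.customer_facing and u.outputs_can_be_disclosures:
        return 1
    # High-volume or high-exposure decision-path use cases are Tier 1.
    if u.in_decision_path and (u.monthly_decision_volume > 10_000
                               or u.max_dollar_exposure > 100_000):
        return 1
    # Anything else in the decision path gets at least Tier 2.
    if u.in_decision_path or u.customer_facing:
        return 2
    return 3
```

The value is not the particular thresholds; it is that every registry entry carries its scoring inputs alongside the assigned tier, which is exactly the evidence the examiner asks for.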
Lifecycle evidence: what each stage produces
The new framework expects every model, including LLMs, to produce evidence at each lifecycle stage. The evidence is not retrospective documentation; it is the natural output of how the model is built and operated. The table below is what a Tier 1 LLM application has to produce.
| Stage | Required evidence | What this looks like for LLMs |
|---|---|---|
| Conceptual soundness | Documented purpose, alternative approaches considered, design rationale | "Why an LLM agent here vs a rule engine vs a classical ML model"; benchmarked alternatives |
| Data and inputs | Provenance, data quality controls, scope of use | Training and fine-tuning data lineage, RAG corpus inventory, PII handling, prompt template versioning |
| Development | Implementation matches design, code review, testing | Prompt engineering history, retrieval pipeline tests, structured output validation |
| Pre-deployment validation | Independent assessment, benchmark performance, edge case behavior | Held-out eval set scores, adversarial prompt suite results, fairness analysis if applicable |
| Production deployment | Approved configuration, dual-control sign-off, rollback plan | Versioned prompts in production, model and prompt fingerprint per request, canary or blue-green deployment evidence |
| Ongoing monitoring | Performance over time, drift detection, outcome analysis | Live metrics on grounding rate, refusal rate, user feedback rate; eval re-runs on model provider updates |
| Effective challenge | Versioned challenger models, sensitivity analysis | Alternative model A/B tests, prompt ablations, model provider switch tests |
| Retirement | Deprecation plan, archive of decisions made | Audit log retention policy, ability to reconstruct historical decisions for the retention period |
The phrase that matters most is "evidence as byproduct." Banks under the old framework spent quarters of effort reconstructing documentation from existing systems for examination. The new framework treats that pattern as a failure of architecture. If your team is putting together a binder for the validators after the model is in production, you have already failed the test. The instrumentation has to capture evidence as the model runs.
Effective challenge for non-deterministic systems
"Effective challenge" was the heart of SR 11-7 and remains central to the new framework. It means a critical, technically competent review of the model by parties independent of the model developers, who can identify limitations and propose changes. Under SR 11-7, effective challenge for a logistic regression underwriting model meant: comparing it against a champion-challenger framework, running sensitivity analysis on key features, validating against held-out periods.
For LLMs, the structure of effective challenge has to adapt. Three patterns are emerging as the standard.
Versioned challenger models. At any time, the production LLM has at least one challenger LLM running in shadow mode on a sample of production traffic. The challenger uses a different model provider, a different prompt structure, or both. Outputs are compared, divergences are logged, and the challenger periodically becomes the candidate for the next production version. This requires infrastructure (a routing layer that supports shadow traffic, evaluation that compares structured outputs across runs), but the underlying discipline is the same champion-challenger principle SR 11-7 always required.
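A minimal sketch of the shadow-routing pattern, assuming generic champion/challenger callables rather than any particular SDK; the bare string comparison stands in for whatever structured comparison your eval layer actually does:

```python
import json
import logging
import random
from typing import Callable

ModelFn = Callable[[str], str]  # hypothetical: user input -> model output

def handle_request(user_input: str,
                   champion: ModelFn,
                   challenger: ModelFn,
                   shadow_rate: float = 0.05) -> str:
    champion_out = champion(user_input)
    # Mirror a sample of traffic to the challenger; never block the user on it.
    # In production the shadow call runs async or out-of-band.
    if random.random() < shadow_rate:
        try:
            challenger_out = challenger(user_input)
            if challenger_out.strip() != champion_out.strip():
                logging.info("challenger divergence: %s", json.dumps({
                    "input": user_input,
                    "champion": champion_out,
                    "challenger": challenger_out}))
        except Exception:
            logging.exception("challenger shadow call failed")
    return champion_out
```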
Prompt ablation as sensitivity analysis. Under SR 11-7, sensitivity analysis tested how much the model's output changed when an input feature changed. For LLMs, the equivalent is prompt ablation: hold the user input constant, vary the prompt structure (system message phrasing, few-shot examples, retrieval instructions), and measure how much the output distribution moves. A model that is highly sensitive to prompt phrasing has hidden risk; the validator should see this.
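A sketch of the ablation loop; `generate` and `diverges` are hypothetical stand-ins for your model client and output comparator:

```python
from itertools import combinations
from typing import Callable

def ablation_sensitivity(
    user_inputs: list[str],
    prompt_variants: dict[str, str],       # variant name -> system prompt text
    generate: Callable[[str, str], str],   # hypothetical: (system_prompt, user_input) -> output
    diverges: Callable[[str, str], bool],  # hypothetical output comparator
) -> dict[tuple[str, str], float]:
    """For each pair of prompt variants, the fraction of held-constant user
    inputs whose outputs diverge. High divergence = hidden phrasing risk."""
    outputs = {name: [generate(sp, u) for u in user_inputs]
               for name, sp in prompt_variants.items()}
    return {(a, b): sum(diverges(x, y) for x, y in zip(outputs[a], outputs[b])) / len(user_inputs)
            for a, b in combinations(prompt_variants, 2)}
```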
Adversarial robustness testing. A pre-deployment test suite that probes specific failure modes: prompt injection, jailbreaks, off-distribution inputs, conflicting context. The test suite is versioned, reused on every model upgrade, and the results are part of the validation package.
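The suite itself can be plain data plus a runner, versioned alongside the prompts. A sketch, under the assumption that a `refuses` classifier is the pass criterion (real suites score injection and off-distribution cases against richer criteria):

```python
from dataclasses import dataclass
from typing import Callable

SUITE_VERSION = "adv-2026.04"  # version the suite itself; reuse it on every model upgrade

@dataclass
class AdversarialCase:
    case_id: str
    category: str   # "prompt_injection", "jailbreak", "off_distribution", "conflicting_context"
    payload: str
    must_refuse: bool

def run_suite(cases: list[AdversarialCase],
              generate: Callable[[str], str],  # hypothetical model client
              refuses: Callable[[str], bool],  # hypothetical refusal classifier
              ) -> list[dict]:
    """One result row per case; the rows go into the validation package."""
    results = []
    for c in cases:
        out = generate(c.payload)
        # Refusal-only pass criterion for brevity; see the caveat above.
        passed = refuses(out) == c.must_refuse
        results.append({"suite": SUITE_VERSION, "case": c.case_id,
                        "category": c.category, "passed": passed})
    return results
```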
The validators reviewing the package want to see that all three were run, what the results were, and how the team responded. "We tested it and it worked" is not a defensible answer.
Vendor LLM oversight
The new framework explicitly extends model risk principles to vendor and third-party models. For LLMs, this means: if you use OpenAI, Anthropic, Google, or any other model provider, you are responsible for managing the risk of that vendor's model in your system. You cannot delegate the responsibility by saying "we use a frontier model."
The practical requirements that have emerged in the first two weeks of the new framework:
Provider contract terms. Banks need explicit zero-data-retention agreements, no-train clauses, data residency commitments, and breach notification SLAs from every model provider. These have been negotiable for a year; under the new framework they are required, and procurement teams are auditing existing contracts for gaps.
Version pinning and change notification. A frontier model that "improves" overnight without warning is a risk that has to be managed. Banks are insisting on version pinning (a specific model version stays available for a contractually defined period) and change notification (the provider notifies the bank before the underlying model behavior shifts in production-relevant ways).
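Pinning is contractual, but it is cheap to also enforce in code by checking the version the provider actually served. A sketch against the OpenAI-compatible response shape, where `response.model` echoes the serving model; the pinned value is illustrative:

```python
import logging

PINNED_MODEL = "gpt-4.1-2025-04-14"  # illustrative pin; use your contract's version string

def check_served_version(response) -> None:
    # OpenAI-compatible responses echo the version that actually served the
    # request in response.model; a mismatch with the pin is an incident,
    # not a log line.
    served = getattr(response, "model", None)
    if served != PINNED_MODEL:
        logging.critical("model drift: pinned %s, served %s", PINNED_MODEL, served)
        raise RuntimeError(f"unpinned model version served: {served}")
```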
Fallback and continuity. Single-provider dependency is now a documented risk. Tier 1 LLM applications need a documented fallback to a secondary provider, with a tested switchover path. Some banks are running active-active across two providers for the most critical workflows.
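A minimal fallback sketch; both provider clients are hypothetical callables, and the label returned alongside the output exists so the lineage record shows which path served the request:

```python
import logging
from typing import Callable

def call_with_fallback(prompt: str,
                       primary: Callable[[str], str],    # hypothetical provider clients
                       secondary: Callable[[str], str]) -> tuple[str, str]:
    """Returns (output, provider_label); the label belongs in the lineage
    record so switchovers are visible after the fact."""
    try:
        return primary(prompt), "primary"
    except Exception:
        logging.exception("primary provider failed; failing over to secondary")
        return secondary(prompt), "secondary"
```

The part that fails examination is not this code; it is never having run the switchover end to end, which is why the build-order table below makes the drill an explicit gate.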
Independent evaluation. The bank's MRM function needs to be able to evaluate the vendor model on the bank's own data, against the bank's own metrics. This means access to a model endpoint that supports bank-side evaluation runs without contractual hurdles.
Audit trail and lineage as architecture
The new framework's strongest expectation, and the one most banks are weakest on, is end-to-end lineage. Every Tier 1 LLM decision needs to be reconstructable: what the input was, which prompt template version was used, which model and version generated the output, what context was retrieved (with sources and retrieval scores), what post-processing was applied, what was returned to the user, and what action the user took.
This is a tracing problem first and a logging problem second. Logs that are not structured around the unit of work (the request, the agent step, the tool call) cannot be reassembled into a coherent decision history. Structured tracing, where each step in the agent's execution is captured as a span with its inputs, outputs, and metadata, is the architecture that makes lineage possible without after-the-fact reconstruction.
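A sketch of that span structure using the OpenTelemetry Python API, one common substrate for structured tracing; `retrieve`, `generate`, and `check_grounding` are hypothetical application functions, and the attribute names simply mirror the lineage record below:

```python
from opentelemetry import trace

tracer = trace.get_tracer("llm.underwriting_assistant")

def answer(user_input: str) -> str:
    # One trace per decision; each pipeline step is a child span whose
    # attributes become the lineage fields in the record below.
    with tracer.start_as_current_span("llm_decision") as decision:
        decision.set_attribute("prompt_template_version", "underwriting_v12")
        with tracer.start_as_current_span("retrieval") as retrieval:
            docs = retrieve(user_input)        # hypothetical retriever
            retrieval.set_attribute("doc_ids", [d.id for d in docs])
        with tracer.start_as_current_span("generation") as generation:
            out = generate(user_input, docs)   # hypothetical model call
            generation.set_attribute("model_version", out.model_version)
        with tracer.start_as_current_span("guardrails") as guardrails:
            guardrails.set_attribute("grounding_check",
                                     check_grounding(out, docs))  # hypothetical check
        return out.text
```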
The minimum lineage record for a Tier 1 LLM decision:
```
decision_id: <uuid>
timestamp: 2026-04-22T14:32:11Z
user_id, session_id, matter_or_account_id
input:
  raw_user_input
  enriched_context (retrieved docs, structured data)
processing:
  prompt_template_version
  retrieval_pipeline_version
  model_provider, model_version
  model_parameters (temperature, max_tokens, etc.)
  raw_model_output
decision:
  parsed_decision
  post_processing_rules_applied
  final_action
verification:
  guardrails_passed
  grounding_check_result
  policy_check_result
outcome:
  user_action (accepted, rejected, escalated)
  downstream_outcome (where applicable)
```
Records are retained for the regulatory period, indexed for query, and exportable to the validators on demand. The retention period varies by use case (5 to 7 years is typical for credit decisions, longer for some fraud and AML records).
Common mistakes in the first wave of MRM redesigns
Two weeks into the new framework, several patterns are already visible across banks doing the migration.
Conflating logs with lineage. A bank that has 500 GB of LLM request logs but cannot answer "what was the prompt template version on April 14 at 9:32 AM" has logging, not lineage. The fix is structured tracing with explicit version capture, not more logs.
Treating eval as a pre-deployment exercise. A 200-case eval that ran once before launch and has not been touched since is not "ongoing monitoring." The framework expects continuous eval, including re-runs against a frozen golden set on every model provider update (a sketch of that gate follows this list).
Ignoring the third-party clause. Teams using a model API and assuming the provider's compliance is sufficient. The bank is responsible for vendor model risk; the provider's compliance does not transfer.
Tier inflation or deflation. Putting everything in Tier 1 paralyzes the team; putting decision-supporting LLMs in Tier 3 fails examination. Both are common in the first month. The fix is a documented tiering rubric that is itself reviewed by MRM.
No challenger. Production LLMs running with no parallel challenger model. This was acceptable under SR 11-7 in some interpretations; under the new framework it is hard to defend for Tier 1 use cases.
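On the frozen-golden-set point above, the gate is mechanical enough to sketch; `run_eval` is a hypothetical eval runner and the threshold is illustrative:

```python
from typing import Callable

def gate_on_golden_set(
    golden_set: list[dict],            # frozen cases: {"input": ..., "expected": ...}
    served_version: str,
    last_validated_version: str,
    run_eval: Callable[[list[dict], str], float],  # hypothetical: (cases, version) -> score
    min_score: float = 0.95,           # illustrative threshold
) -> None:
    """Re-run the frozen golden set whenever the served model version changes;
    block promotion on regression rather than logging it and moving on."""
    if served_version == last_validated_version:
        return
    score = run_eval(golden_set, served_version)
    if score < min_score:
        raise RuntimeError(
            f"golden-set regression on {served_version}: {score:.3f} < {min_score}")
```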
Build order
Aligning Tier 1 LLM applications with the April 2026 Fed/FDIC/OCC framework is a dependency chain, not a calendar. Each step produces the artifact the next step depends on, and skipping order is what fails examination.
| Order | What you build | Eval gate before moving on |
|---|---|---|
| 1 | Use case registry with tier assignment for every LLM application in the bank, scored against the tiering rubric | 100% of LLM use cases registered; tier rationale documented and reviewed by MRM for every Tier 1 and Tier 2 entry |
| 2 | Lineage gap analysis per Tier 1 use case against the minimum decision record (input, prompt template version, model and version, retrieval context, post-processing, final action, outcome) | Documented gap list with named owner per gap; zero Tier 1 use cases without a gap report on file |
| 3 | Structured tracing instrumentation for Tier 1 use cases, with prompt template version, model version, retrieval spans, and guardrail spans captured per request | Reconstruct any sampled Tier 1 decision from the last 30 days in under 60 seconds via trace query; 99%+ trace completeness on Tier 1 traffic |
| 4 | Pre-deployment validation package: held-out eval set, adversarial prompt suite, prompt ablation results, fairness analysis where applicable | Validation report signed by independent reviewer; adversarial suite passes with documented residual risk for every Tier 1 model in production |
| 5 | Effective challenge infrastructure: shadow traffic to a challenger model on a different provider or prompt structure, divergence logging, scheduled prompt ablation runs | Challenger live on 5%+ of Tier 1 traffic; divergence rate measured weekly; one challenger-to-candidate promotion cycle completed |
| 6 | Vendor controls: zero-data-retention contracts, version pinning, change notification, documented fallback to a secondary provider with tested switchover | Contract gaps closed for every Tier 1 provider; switchover drill executed end-to-end with measured time-to-recover under the documented target |
Continuous eval re-runs on every model provider update, quarterly outcome analysis, and annual full validation refresh follow once the chain above is in place. Reordering or skipping (for example, shipping challenger infrastructure before lineage is reconstructable) leaves the validation package incomplete and is the pattern supervisors are flagging in the first wave of examinations under the new framework.
How Respan fits
The April 2026 framework reframes LLM model risk as a lineage and effective-challenge problem, and Respan is the substrate underneath the Tier 1 controls your team has to ship. Tracing, evals, gateway, prompt registry, and monitors are the pieces that turn "evidence as byproduct" from a slogan into an architecture.
- Tracing: every LLM decision captured as one connected trace, from raw user input through retrieval, prompt template version, model and version, structured output, post-processing, and final action. Auto-instrumented for LangChain, LlamaIndex, Vercel AI SDK, CrewAI, AutoGen, OpenAI Agents SDK. For a Tier 1 underwriting assistant or AML triage agent, this is what makes a single decision_id reconstructable years later when validators come asking.
- Evals: ten built-in evaluators (including faithfulness, citation accuracy, refusal correctness, and harmfulness) plus LLM-as-judge and custom Python evaluators. Production traffic flows directly into datasets. CI-aware experiments block regressions on grounding rate drops, customer-disclosure misrepresentation, and prompt-injection bypass before deploys ship.
- Gateway: 500+ models behind an OpenAI-compatible interface, semantic caching, fallback chains, per-customer spending caps. Fallback chains and version pinning are exactly the controls examiners now ask for when a single-provider Tier 1 dependency comes up, and the OpenAI-compatible interface lets MRM run independent evaluation without renegotiating contracts.
- Prompt management: versioned registry, dev/staging/prod environments with approval workflows, A/B testing in production with one-click rollback. Customer-facing copilot prompts, AML triage system messages, and adverse-action explanation templates belong in the registry so prompt_template_version is a first-class field on every lineage record and dual-control sign-off has a real artifact behind it.
- Monitors and alerts: grounding rate, refusal rate, user feedback rate, challenger divergence rate, and provider drift on every model upgrade. Slack, email, PagerDuty, webhook. Quarterly outcome analysis becomes a standing dashboard instead of a binder reconstructed at examination time.
A reasonable starter loop for fintech MRM builders:
- Instrument every LLM call with Respan tracing including retrieval spans, guardrail spans, and post-processing spans so the lineage record above is captured by construction.
- Pull 200 to 500 production AML triage and underwriting decisions into a dataset and label them for grounding, refusal correctness, and policy compliance.
- Wire two or three evaluators that catch the failure modes you most fear (prompt-injection bypass on customer copilots, ungrounded citations in adverse-action letters, off-distribution AML alert dispositions).
- Put your customer-facing disclosure templates and AML triage system prompts behind the registry so you can version, A/B, and roll back without a deploy.
- Route through the gateway so a frontier model "improvement" overnight does not silently ship into a Tier 1 path, and so a secondary provider is one config away when a primary outage hits.
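Because the gateway speaks the OpenAI-compatible interface, routing through it is a base-URL change in the standard client. The endpoint, key, and model alias below are placeholders, not Respan's actual configuration:

```python
from openai import OpenAI

# Placeholder endpoint, key, and model alias; substitute your gateway's values.
client = OpenAI(base_url="https://gateway.example.com/v1", api_key="YOUR_GATEWAY_KEY")

resp = client.chat.completions.create(
    model="aml-triage-pinned",  # gateway-side alias pinning provider + model version
    messages=[{"role": "user", "content": "Summarize the disposition factors for alert 4417."}],
)
print(resp.choices[0].message.content)
print(resp.model)  # the version actually served; feed this into the pin check above
```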
The teams that get through procurement faster under the new framework are the ones who built lineage and effective challenge into the stack instead of bolting them on for examination.
To wire any of the patterns above on Respan, start tracing for free, read the docs, or talk to us.
Related reading
- Building Adverse Action Explainability for LLM-Driven Credit Decisions: the FCRA and ECOA technical translation
- Evaluating LLMs for Real-Time Fraud Detection: where LLMs fit in fraud workflows and where they do not
- Building a Financial Research Agent: long-horizon agent architecture
- How Fintech Teams Build LLM Apps in 2026: pillar overview
