In February 2024, the British Columbia Civil Resolution Tribunal ordered Air Canada to pay damages to a customer named Jake Moffatt after the airline's chatbot invented a bereavement-fare refund policy that did not exist. Moffatt had asked the chatbot whether he could apply for a discounted bereavement fare retroactively after booking. The bot told him yes, with a 90-day window. Air Canada's actual policy required the discount to be requested before travel. When Moffatt filed for the refund, the airline refused, then argued in tribunal that the chatbot was a "separate legal entity" responsible for its own statements. The tribunal called that defense "remarkable" and ruled that Air Canada was responsible for all information on its website, whether it came from a static page or a generative model. Moffatt got the refund the chatbot had promised, plus tribunal fees, plus interest. The case has been cited ever since as the moment policy hallucinations stopped being a quality issue and became a liability issue.
The Air Canada ruling crystallized something every support AI vendor and customer-experience leader had been worrying about: policy hallucinations create binding customer expectations, sometimes legally enforceable ones. An AI that promises a 90-day return window when the actual policy is 30 days has shipped a real liability event, not just a bad answer. Klarna's public retreat from its all-AI customer-support push, walked back through 2024 and into 2025 as the company rebuilt human capacity, shows the operational dimension. Even a vendor that publicly celebrated AI replacing 700 agents found that policy edge cases, escalation judgment, and customer trust required humans back in the loop.
This piece is for engineers building customer support AI products. It covers the failure modes, why policy hallucination is structurally harder than general-domain QA, and the six engineering fixes that close the gap. For the wider Customer Support cluster, see the pillar, the privacy spoke, the build walkthrough, and the eval spoke.
How a policy hallucination escapes
A policy hallucination becomes a liability event the moment it crosses from the model into a customer-binding statement. Most support stacks have a verification gap between those two points. The sketch below shows where the gap usually lives.
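```
[Customer query]
        |
        v
[Retrieval + policy lookup]
        |
        v
[Model output: "90-day retroactive refund window"]
        |
        v
[ verification gap: no grounding check before send ]
        |
        v
[Send: customer-binding statement]
```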
The gap node is the verification step most stacks never run. In the Air Canada case there was no verifier: the chatbot's output went straight to the customer, and "you can apply for the bereavement fare retroactively within 90 days" became a binding promise the moment Moffatt read it. Every fix in this guide is, at some level, a different way of closing that gap.
Trace the verifier, not just the model
The verification gap is invisible until you can replay the full chain: retrieval, policy lookup, model output, grounding check, send. Wire every span into Respan tracing so when a dispute arrives you can reproduce exactly which policy version the model saw, whether the verifier fired, and why a claim slipped through.
The failure modes
Five recurring patterns of policy hallucination in shipped support AI.
Generous policy fabrication. The AI invents a more generous policy than the company actually has. "Our return window is 90 days" when actual is 30. "We accept returns on used items" when policy is unopened only. The customer holds the company to the AI's promise.
Conditional policy drift. The AI states a policy without the conditions. "Yes, you can get a refund" without mentioning the proof-of-purchase requirement, the time window, or the channel-specific rules. The customer expects an unconditional refund, the company has to argue against the AI's commitment.
Account-specific drift. The AI applies a generic policy when the customer has a specific contract that overrides it. Enterprise customers with negotiated terms get told the standard policy, the AI gets it wrong because it does not know about the contract.
Outdated policy. The AI's training or RAG corpus has a stale version of the policy. Policy changed last quarter, the AI is still quoting last year's version. The customer holds the company to the older, often more generous, terms.
Policy ambiguity over-confidence. The actual policy has gray areas (case-by-case manager discretion). The AI confidently states a black-and-white answer, foreclosing the human judgment the policy was designed to allow.
Failure pattern reference
The five patterns above show up in five concrete liability surfaces. Engineering owners need to know what each one looks like in production, what it costs when it escapes, and which fix closes it.
| Hallucination type | Example | Liability | Engineering fix |
|---|---|---|---|
| Refund window | Bot tells customer "you have 90 days to request a refund" when policy is 30 days | Direct cash refund honored to keep customer trust, tribunal exposure (Air Canada precedent) | Deterministic policy lookup keyed on customer.region and product.category, hedged LLM presentation |
| Return policy | Bot says "used items can be returned" when policy is unopened only, or omits restocking fee | Reverse logistics cost, restocking fee waived, repeat customer complaints | Structured returns engine with eligibility flags, citation enforcement on returns claims |
| Warranty terms | Bot promises "two-year manufacturer warranty" when actual is one year on category | Replacement or repair costs in the second year, FTC and state-AG attention on warranty misstatements | Versioned warranty registry per SKU, refusal on missing SKU lookup |
| Shipping commitment | Bot promises "free overnight shipping" or "delivery by Friday" without checking carrier or cutoff | Service credit or refund, expedited-shipping cost, NPS damage on missed dates | Real-time carrier API call, no LLM-generated shipping promises, calibrated abstention on cutoff edge cases |
| Service-level commitment | Bot tells enterprise customer "24/7 follow-the-sun support" when contract is business-hours only | Breach claim against the master agreement, account credits, churn on renewal | Account-aware policy resolution merging contract overrides, contract-engine lookup before any SLA statement |
Every row maps to one of the six engineering fixes below. The pattern of failures is small, the surface area is bounded, and a disciplined team can close the highest-cost rows first.
Why this is structurally hard
Five reasons.
Policies are written for humans, not models. Help center articles and policy docs are full of conditional language ("usually", "in most cases", "exceptions apply"). Models trained on this content learn to confidently state conditional rules as absolutes.
Policies change. Return windows shift, warranty terms update, refund policies tighten or loosen. The AI's grounding has to update with the policy, not stay at training time.
Customer pressure shifts the response. A frustrated customer asking "you promised a refund" gets a different LLM response than a calm customer asking "what is your refund policy". Sycophancy bias compounds with policy ambiguity to produce overly generous answers. The OpenAI 2025 hallucination paper documents this: as long as the eval rewards completion, the model fills gaps with fabrications.
Account context matters. Enterprise contracts, loyalty tiers, regional variations, channel-specific rules. Generic policy retrieval misses all of this.
Liability is asymmetric. A wrong-but-generous policy answer creates a customer commitment. A wrong-but-strict policy answer creates a customer complaint and possibly a public-relations incident. Either way the company eats the cost.
Six engineering fixes
1. Deterministic policy lookup, not LLM generation
Policies are structured data. Refund window: 30 days. Restocking fee: 15%. Eligible categories: list. The LLM does not generate this from policy documents at inference time; it looks the answer up in a structured policy database and presents the result.
```python
import os
from datetime import datetime

from respan import Respan

client = Respan(api_key=os.environ["RESPAN_API_KEY"])

# customers_db, orders_db, and policy_engine stand in for your own
# data-access layers and structured policy store.

@client.workflow(name="refund-eligibility")
def check_refund_eligibility(customer_id: str, order_id: str):
    customer = customers_db.get(customer_id)
    order = orders_db.get(order_id)
    # Deterministic lookup, not LLM generation
    policy = policy_engine.lookup(
        policy_type="refund",
        customer_tier=customer.tier,
        product_category=order.product.category,
        region=customer.region,
        contract_overrides=customer.enterprise_contract,
    )
    days_since_purchase = (datetime.utcnow() - order.purchased_at).days
    eligible = days_since_purchase <= policy.window_days
    return {
        "eligible": eligible,
        "policy_version": policy.version,
        "window_days": policy.window_days,
        "days_used": days_since_purchase,
        "days_remaining": max(policy.window_days - days_since_purchase, 0),
        "restocking_fee_pct": policy.restocking_fee_pct,
        "manager_discretion": policy.manager_discretion,
    }
```

The LLM does not say "yes you can get a refund" from intuition. The LLM presents the structured eligibility result with appropriate hedging. This pattern is the single biggest reduction in policy hallucination liability, because it removes generation entirely from the high-stakes path.
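What "presents the structured result with appropriate hedging" can look like: the model sees the eligibility dict as the only source of truth and is instructed to restate it, never extend it. A minimal sketch, assuming an OpenAI-compatible chat client (the gateway described later speaks that interface); the prompt wording is illustrative, not a tested template:

```python
HEDGED_PRESENTATION_PROMPT = """You are a customer support agent. Present the
refund eligibility result below to the customer. Rules:
- State only facts present in the result. Never add conditions, promises,
  or exceptions that are not in it.
- Quote the window and restocking fee exactly as given.
- If manager_discretion is true, do not give a yes/no answer; offer to
  escalate to a human agent instead.

Eligibility result (authoritative):
{result}
"""

def present_refund_result(llm_client, result: dict) -> str:
    # The structured engine already decided; the LLM only phrases the decision.
    response = llm_client.chat.completions.create(
        model="gpt-4o",  # placeholder; route through your gateway
        messages=[{
            "role": "user",
            "content": HEDGED_PRESENTATION_PROMPT.format(result=result),
        }],
        temperature=0,  # no creativity on policy-bearing text
    )
    return response.choices[0].message.content
```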
2. Policy RAG with citation enforcement
For policy questions that go beyond what the structured engine handles (clarifications, edge cases, complex scenarios), the AI retrieves from authoritative policy docs and cites them. Every policy claim in the AI response is sourced to a specific section. Claims that fail to ground get flagged for human review or rephrased to remove the unsourced assertion.
The eval that catches ungrounded claims before they ship is an LLM-as-judge over (claim, policy_corpus). Run it against every candidate response in CI, and on a sampled stream in production:
```python
import json
import os

from respan import Respan
from respan.evaluators import llm_as_judge

client = Respan(api_key=os.environ["RESPAN_API_KEY"])

POLICY_GROUNDING_RUBRIC = """
You are auditing a customer support response for policy claims.

Inputs:
- response: the agent's answer to the customer
- policy_snippets: authoritative policy passages retrieved for this query
- policy_engine_result: the structured policy lookup (or null)

For every policy claim in the response (refund window, return rules,
warranty terms, shipping commitments, SLA), decide:
- GROUNDED: the claim is directly supported by policy_snippets or policy_engine_result
- CONTRADICTED: the claim conflicts with policy_snippets or policy_engine_result
- UNSUPPORTED: the claim has no source in either input

Return JSON: {"claims": [{"text", "verdict", "evidence"}], "overall": "pass|fail"}
A response with any CONTRADICTED or UNSUPPORTED claim must overall=fail.
"""

@client.evaluator(name="policy-claim-grounding")
def policy_claim_grounding(example, output):
    verdict = llm_as_judge(
        model="claude-opus-4-7",
        rubric=POLICY_GROUNDING_RUBRIC,
        inputs={
            "response": output["text"],
            "policy_snippets": example["policy_snippets"],
            "policy_engine_result": example.get("policy_engine_result"),
        },
        response_format={"type": "json_object"},
    )
    parsed = json.loads(verdict)
    failed_claims = [c for c in parsed["claims"] if c["verdict"] != "GROUNDED"]
    return {
        "score": 1.0 if parsed["overall"] == "pass" else 0.0,
        "metadata": {"failed_claims": failed_claims},
    }
```

Wire this evaluator into your CI experiments and your production sampling. A regression in policy-claim-grounding blocks a deploy. A spike in production failures pages on-call before disputes start landing.
Block fabrications in CI, not in customer threads
Pair the grounding evaluator above with Respan experiments so every prompt or model change runs against your production-mined dataset before merge. CI-aware experiments mean a regression on the refund-window claim never ships, instead of being discovered by the customer who screenshots it.
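One concrete shape for "blocks a deploy" is a CI test over a frozen, production-mined eval set. A sketch, assuming the decorated evaluator above stays directly callable and each stored example carries the response under test in a (hypothetical) candidate_output field:

```python
import json

# Frozen eval set exported from production traces: one JSON object per line,
# each with policy_snippets, an optional policy_engine_result, and the
# candidate response under test.
with open("evalsets/policy_grounding.jsonl") as f:
    EXAMPLES = [json.loads(line) for line in f]

def test_no_ungrounded_policy_claims():
    failures = []
    for example in EXAMPLES:
        result = policy_claim_grounding(example, example["candidate_output"])
        if result["score"] < 1.0:
            failures.append(result["metadata"]["failed_claims"])
    # Zero tolerance: one CONTRADICTED or UNSUPPORTED claim fails the build.
    assert not failures, f"{len(failures)} responses contain ungrounded policy claims"
```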
3. Refusal on policy ambiguity
The OpenAI 2025 hallucination paper applies here too: as long as your eval rewards completion, your model learns to fill policy gaps with fabrications. Train your eval to reward calibrated abstention.
Practical pattern: when the policy is ambiguous (case-by-case manager discretion), the AI says so explicitly. "Your situation may qualify for a refund under our manager-discretion policy. I am escalating to a senior agent who can make that call." Refusal with escalation is a correct answer, not a failure.
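In code, the refusal branch is a deterministic check on the manager_discretion flag the refund-eligibility workflow already returns; a sketch, with the customer-facing wording purely illustrative:

```python
def answer_or_escalate(result: dict) -> dict:
    # Ambiguity is data, not a gap for the model to fill: the policy engine
    # marks discretionary cases, and the bot declines to rule on them.
    if result["manager_discretion"]:
        return {
            "action": "escalate",
            "message": (
                "Your situation may qualify for a refund under our "
                "manager-discretion policy. I am escalating to a senior "
                "agent who can make that call."
            ),
        }
    # Clear-cut cases proceed to the hedged LLM presentation step.
    return {"action": "answer", "eligibility": result}
```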
4. Account-aware policy resolution
Every policy lookup includes the customer's account context. Tier, contract overrides, region, channel. The policy engine handles overlapping rules:
```python
def resolve_policy(policy_type, customer):
    base = policy_engine.base(policy_type, region=customer.region)
    if customer.enterprise_contract:
        contract = contract_engine.lookup(customer.enterprise_contract.id)
        base = merge_policies(base, contract.overrides)
    if customer.tier in ["gold", "platinum"]:
        # Applied after the contract merge, so tier overrides win ties here.
        # Make that precedence an explicit, documented decision.
        tier_overrides = tier_engine.lookup(customer.tier, policy_type)
        base = merge_policies(base, tier_overrides)
    return base
```

The merge logic is policy-specific. Document it. Test it. The most expensive bug class is the merge logic getting overrides backwards, where an enterprise customer gets the standard policy instead of their negotiated terms, or vice versa.
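A minimal merge_policies under one plausible precedence rule, later source wins field by field, plus the regression test that catches the backwards-override bug. The dataclass shape is an assumption; your precedence rules will differ, which is exactly why they deserve their own tests:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class RefundPolicy:
    window_days: int
    restocking_fee_pct: float
    version: str

def merge_policies(base: RefundPolicy, overrides: dict) -> RefundPolicy:
    # Later source wins, field by field; None means "no override".
    return replace(base, **{k: v for k, v in overrides.items() if v is not None})

def test_enterprise_override_beats_base():
    base = RefundPolicy(window_days=30, restocking_fee_pct=15.0, version="2025-01")
    merged = merge_policies(base, {"window_days": 60})
    assert merged.window_days == 60           # negotiated term wins
    assert merged.restocking_fee_pct == 15.0  # untouched fields persist
```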
5. Policy version freshness
The policy engine has a clear authoritative source (the policy team's database, not the help center, not the chatbot's training data). When the policy team updates the database, the AI's behavior updates within minutes.
Audit: every AI response that cites a policy logs the policy version it referenced. When a customer disputes a response, you can reproduce exactly what the AI saw and why it said what it said. Without versioned logging, every dispute becomes "we cannot verify what the bot told you," which is the worst possible posture in a tribunal-adjacent conversation.
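A sketch of the audit record, one per policy-bearing response, written alongside the trace. Field names here are assumptions; the non-negotiable part is that policy_version travels with every response:

```python
from datetime import datetime, timezone

def audit_policy_response(trace_id: str, policy, response_text: str) -> dict:
    # Enough to replay a dispute: which policy version the AI saw,
    # what it said, and when.
    return {
        "trace_id": trace_id,
        "policy_type": policy.policy_type,
        "policy_version": policy.version,
        "responded_at": datetime.now(timezone.utc).isoformat(),
        "response_text": response_text,
    }
```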
6. Continuous capture from disputes and overrides
Every customer dispute, every agent override, every "the AI said something wrong" report becomes a labeled datum. The dataset is both your eval set and your prompt-iteration corpus.
Specifically capture (a record sketch follows the list):
- The original AI response
- The actual policy at the time
- The customer's claim
- The agent's resolution
- The cost (refund amount, escalation time, customer-satisfaction impact)
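As a stored record, one per dispute or override (a sketch; the field types are assumptions):

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class DisputeRecord:
    ai_response: str        # the original AI response, verbatim
    policy_version: str     # the actual policy in force at the time
    customer_claim: str     # what the customer says was promised
    agent_resolution: str   # how the human resolved it
    cost_usd: float         # refund honored plus escalation cost
    occurred_at: datetime
```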
Pattern analysis on this dataset surfaces whether hallucinations cluster in specific policy areas, customer cohorts, or prompts. The Klarna walk-back in 2024 and 2025 was, at root, a story about a feedback loop that did not exist: the system shipped, the disputes accumulated, and there was no labeled stream feeding back into evals fast enough to catch the regressions.
Disputes are your highest-value dataset
Pipe every override and every "the AI got the policy wrong" ticket into a Respan dataset with the response, the policy version, and the resolution. That dataset becomes the eval set the next prompt change has to beat, and the prompt registry holds the hedged-language and citation templates that win it.
A reference architecture
```
[Customer query]
        |
        v
[Authentication + customer context load]
        |
        v
[Intent classification: policy question, account-specific, general]
        |
        v
[Branch: policy -> deterministic engine, general -> KB RAG]
        |
        v
[Policy engine: structured lookup with account overrides]
        |
        v
[LLM presentation: hedged language, citation to policy version]
        |
        v
[Pre-send compliance check: policy claim grounded?]
        |
        v
[Customer response with policy citation]
        |
        v
[Trace capture: policy version, customer context, response]
        |
        v
[Continuous capture from disputes and agent overrides]
```
What to ship and in what order
A staged rollout:
- Week 1. Structured policy database for the highest-volume policy questions (refunds, returns, warranties). Auth-aware customer context.
- Week 2. Replace LLM-generated policy answers with policy-engine-driven responses for the high-volume policies. Hedged-language prompts for ambiguous cases.
- Week 3. Policy RAG for the long tail of policy questions outside the structured engine. Citation enforcement and the grounding evaluator wired into CI.
- Week 4. Continuous capture from agent overrides. Eval suite for policy-claim accuracy running in production sampling, with alerts on regression.
Account-aware policy merging is a stretch goal for week 5; it is the most expensive layer and the highest-stakes bug surface.
How Respan fits
Policy hallucination is the failure mode where a single ungrounded sentence becomes a binding refund. Respan gives you the trace, eval, and prompt-control surface to keep policy claims grounded in your structured policy engine and to catch fabrications before they reach a customer.
- Tracing: every support conversation captured as one connected trace. Auto-instrumented for LangChain, LlamaIndex, Vercel AI SDK, CrewAI, AutoGen, OpenAI Agents SDK. Spans cover policy-engine lookups, contract-override merges, RAG retrievals, citation checks, and the final LLM presentation, so you can replay exactly which policy version the AI saw when it made a claim.
- Evals: ten built-in evaluators (faithfulness, citation accuracy, refusal correctness, harmfulness) plus LLM-as-judge and custom Python evaluators. Production traffic flows directly into datasets. CI-aware experiments block regressions on policy fabrications, conditional drift, account-context misses, and over-confident answers on ambiguous policies before deploys ship.
- Gateway: 500+ models behind an OpenAI-compatible interface, semantic caching, fallback chains, per-customer spending caps. Route policy-presentation calls through the gateway so you can swap the underlying model when a new release hallucinates less, cap spend per enterprise account, and keep a unified audit log of every policy-bearing response.
- Prompt management: versioned registry, dev/staging/prod environments with approval workflows, A/B testing in production with one-click rollback. Hedged-language prompts, refusal templates, and citation-enforcement instructions live in the registry, so a policy-team change to refund language ships without a redeploy and rolls back in one click if dispute rates spike.
- Monitors and alerts: ungrounded policy-claim rate, unsourced refund/return assertions, policy-version staleness, dispute-driven correction rate, account-override miss rate. Slack, email, PagerDuty, webhook. Wire alerts to the on-call channel so a sudden uptick in policy fabrications pages a human before it becomes the next Air Canada headline.
A reasonable starter loop for support AI builders:
- Instrument every LLM call with Respan tracing including policy-engine lookup spans, contract-override merge spans, and citation-check spans.
- Pull 200 to 500 production support conversations into a dataset and label them for policy-claim accuracy, citation grounding, and account-context correctness.
- Wire two or three evaluators that catch the failure modes you most fear (generous policy fabrication, conditional policy drift, account-specific override misses).
- Put your hedged-language and citation-enforcement prompts behind the registry so you can version, A/B, and roll back without a deploy.
- Route through the gateway so every policy-bearing response is logged, capped per enterprise account, and swappable across models when a better-grounded option ships.
The result is a support AI where every refund-eligibility answer is traceable to a policy version, every fabrication shows up in an eval before it reaches production, and every dispute becomes a labeled datum that hardens the next release.
To wire the policy stack on Respan, start tracing for free, read the docs, or talk to us. For the rest of the Customer Support cluster: the pillar, the privacy spoke, the build walkthrough, and the eval spoke.
FAQ
Can my AI's answer create a binding promise? In some jurisdictions and circumstances, yes. The Air Canada Civil Resolution Tribunal ruling is the canonical example. Even where the AI's promise is not legally binding, it creates customer expectations the company often has to honor for relationship reasons.
Should the LLM ever generate policy text? Not for high-stakes policies (refunds, returns, warranties). Use a deterministic policy engine. The LLM presents and explains the structured result; it does not invent the policy.
How do I handle policy ambiguity? Build the ambiguity into the policy engine output. "Manager discretion: case-by-case" as an explicit field. The AI surfaces this and offers escalation, instead of confabulating a black-and-white answer.
What's the right way to handle enterprise contract overrides? A separate contract engine that produces overrides, merged with the base policy at lookup time. Test the merge logic carefully; it is the highest-stakes bug surface.
Can I just train a model on our policy docs? You can, but training bakes in a snapshot. When policy changes, the trained model is wrong until retrained. Policy RAG with versioned source-of-truth is the maintainable architecture.
What did Klarna's 2024-2025 walk-back actually teach the field? Klarna's public announcement that AI did the work of 700 agents was followed in 2024 and into 2025 by a reversal where the company rehired humans for the long tail of policy edge cases and escalation judgment. The lesson for engineering teams is that volume metrics flatter the easy 80%, and the failure modes in this guide live in the remaining 20% where policy ambiguity, account context, and customer pressure compound.
