Customer support AI is unusual because the success metric is contested. Vendor pitches lead with deflection rate (the percentage of tickets resolved without human escalation). Customers measure CSAT (the percentage of resolutions where the customer was actually satisfied). The two correlate weakly, sometimes negatively. A deflection agent that resolves 60% of tickets with a 20-point CSAT drop has shipped a worse product than the human team it replaced.
The most public version of this story is Klarna's 2024 announcement that its OpenAI-powered assistant was handling two-thirds of customer service chats, equivalent to the work of 700 agents, followed by a 2025 reversal in which the company hired humans back after CSAT and quality declined. The pattern is now common enough that the engineering question has shifted from "can AI deflect" (yes) to "what architecture, eval loop, and operational discipline catches CSAT regressions before they hit the press release."
This piece is for engineers and PMs at support orgs (in-house) and customer-experience platforms (vendors) building LLM features. It covers the five architectural patterns, the metrics that actually matter, and the engineering loop that catches policy hallucinations and CSAT regressions before they reach customers.
For deeper engineering work on specific layers, see the spokes:
- Why Support AI Hallucinates Refund Policies: the highest-liability failure mode and how to fix it
- Customer Data Privacy for Support AI: GDPR, CCPA, CASL, multi-jurisdiction
- Building a Customer Support Deflection Agent: end-to-end build walkthrough
- Evaluating Customer Support AI: the eval framework
The vendor landscape
Customer support AI sorts into four product shapes plus a shared retrieval substrate. Each shape has a different eval target, compliance exposure, and pricing model.
| Vendor | Shape | Pricing | Strongest at |
|---|---|---|---|
| Retell AI | Voice deflection | Per-minute | Voice-first support agents |
| Decagon | Deflection | Annual + per-deflection | Mid-market consumer apps |
| Sierra | Deflection + voice | Outcome-based | Enterprise consumer voice |
| Intercom Fin | Deflection | $0.99 per resolution | Intercom-native orgs |
| Ada | Deflection | Annual | Mid-market chat |
| Salesforce Einstein | Agent assist | Per-seat add-on | Salesforce orgs |
| Zendesk AI | Agent assist + deflect | Per-seat add-on | Zendesk orgs |
| PolyAI / Replicant | Voice | Per-call | Phone IVR replacement |
| Klaus / MaestroQA | Ticket QA | Per-seat | Post-resolution scoring |
Pricing and feature data are directional; verify with the vendor before procurement. The product fit decision is "which shape do you need first" before "which vendor in that shape."
Pattern 1: Deflection agents
The flagship use case. A customer asks a question; the AI responds; the customer gets resolution without a human ticket.
The naive implementation is "system prompt with the company's KB plus the user's question." It produces the demo. It also produces three classes of failure: hallucinated policies (the AI invents a refund or return policy that does not exist), wrong-account responses (the AI gives information about the wrong customer's account), and false confidence (the AI confidently gives wrong information when it should escalate).
A production deflection agent has three layers under the chat UI: a grounding layer (retrieval plus authenticated customer context), a policy-enforcement layer, and a verification layer that checks the response before it ships.
Each layer maps to a different failure mode: weak grounding produces hallucinated policies, weak policy enforcement produces wrong-account responses, and weak verification ships false confidence to customers.
The five things that actually matter:
RAG over the actual knowledge base. The AI retrieves from the company's authoritative help center, policy docs, and product documentation. It does not generate policies from training data.
Customer context grounding. The AI knows which customer is asking, scoped to their account. It can fetch their order history, subscription status, recent activity, but only for the authenticated customer.
Policy enforcement at the response layer. Refund policies, return windows, and escalation thresholds are enforced in code, not in the LLM prompt. The LLM does not decide whether to issue a refund; it gathers context and triggers the appropriate workflow (see the code sketch after these five points).
Escalation logic. The AI knows when to escalate. Confidence below threshold, customer frustration signals, account-sensitive issues, complex compound issues all trigger handoff to a human agent with full context.
Conversation memory across channels. A customer who was on chat yesterday and email today should not have to re-explain. Multi-channel memory is non-trivial and where many vendors fall short.
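A minimal sketch of what "policy in code, not in the prompt" looks like, combined with the escalation checks. The policy values, thresholds, and function names here are illustrative, not a real API; the point is that the LLM proposes an action and deterministic code decides.

```python
# Hypothetical sketch: the LLM proposes an action, deterministic code decides.
# Policy values, thresholds, and names are illustrative, not a real API.
from dataclasses import dataclass

REFUND_WINDOW_DAYS = 30            # lives in config/code, never in the prompt
AUTO_REFUND_LIMIT = 50.00          # above this, always hand off to a human
ESCALATION_CONFIDENCE_FLOOR = 0.7

@dataclass
class ProposedAction:
    kind: str                      # "refund", "answer", ... (from the LLM's tool call)
    amount: float
    confidence: float
    days_since_purchase: int
    frustration_signal: bool

def decide(action: ProposedAction) -> str:
    # Escalation logic runs first: low confidence or frustration always wins.
    if action.confidence < ESCALATION_CONFIDENCE_FLOOR or action.frustration_signal:
        return "escalate_to_human"
    if action.kind == "refund":
        # Policy enforced here, regardless of what the model generated.
        if action.days_since_purchase > REFUND_WINDOW_DAYS:
            return "deny_refund_cite_policy"
        if action.amount > AUTO_REFUND_LIMIT:
            return "escalate_to_human"
        return "trigger_refund_workflow"
    return "send_grounded_answer"
```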
The deflection-vs-CSAT trade-off
The single most-cited number in vendor pitches is "deflection rate." It is the wrong target if not paired with CSAT. The right target is something like (deflected_volume × CSAT_score), and the curve is highly cohort-specific.
Approximate ranges from public deployments and what we have seen at customers:
| Customer cohort | Deflection at neutral CSAT | Aggressive deflection (CSAT impact) |
|---|---|---|
| Consumer SaaS, billing FAQ | 60-70% | 80% (-2 to -4 CSAT) |
| Consumer marketplace, returns | 35-45% | 60% (-6 to -10 CSAT) |
| Mobile app, account-help | 50-60% | 70% (-3 to -5 CSAT) |
| B2B integration support | 20-30% | 45% (-1 to -3 CSAT, but volume too low to matter) |
| Financial services, complex inquiries | 15-25% | not recommended |
These ranges are directional, not benchmarked. Run your own per-cohort segmentation before setting a global threshold.
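A back-of-the-envelope comparison of two deflection settings on the composite target above, deflected_volume × CSAT. All numbers are made up; plug in your own per-cohort A/B measurements.

```python
# Toy comparison of two deflection settings on (deflected_volume x CSAT).
# All numbers are illustrative; use your own per-cohort A/B measurements.
monthly_tickets = 10_000

settings = {
    "neutral":    {"deflection": 0.65, "csat": 82},   # CSAT on a 0-100 scale
    "aggressive": {"deflection": 0.80, "csat": 79},   # higher deflection, CSAT dip
}

for name, s in settings.items():
    deflected = monthly_tickets * s["deflection"]
    composite = deflected * s["csat"]
    print(f"{name:>10}: deflected={deflected:,.0f}  composite={composite:,.0f}")

# Pick the setting that maximizes the composite per cohort; a single global
# deflection target hides the cohorts where the CSAT hit dominates.
```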
A working LLM-as-judge eval for the highest-liability failure mode (hallucinated policy claims) looks like this:
```python
# evals/hallucinated_policy.py
from respan import evaluate
JUDGE = """You grade a customer-support reply for hallucinated policy claims.
A policy claim is any statement about refund windows, return policies,
shipping times, warranty terms, or service-level commitments.
Mark `pass` only if every policy claim in the reply is supported by the
authoritative KB excerpts. Otherwise mark `fail` and quote the unsupported
claim.
KB excerpts retrieved by the agent:
{kb}
Agent reply:
{reply}
"""
@evaluate(name="hallucinated_policy_claim", model="claude-sonnet-4-6")
def hallucinated_policy(trace):
return JUDGE.format(
kb="\n\n".join(trace.retrieval.chunks),
reply=trace.output,
    )
```

Wire this on a 1% production sample. The pass rate becomes a first-class deploy gate.
Respan ships this evaluator out of the box
Hallucinated-policy-claim eval grades responses against retrieved KB chunks, flags claims with no retrieval support, and pushes the pass-rate to your dashboard. The Klarna 2024 launch did not have this gate.
Decagon, Ada, Intercom Fin, and Sierra are platform leaders. Each has a slightly different opinion on the deflection-vs-CSAT trade-off and how aggressive to be on automated resolution.
Pattern 2: Agent assist
The augmentation layer. Human agents stay in the loop; AI suggests responses, surfaces relevant knowledge, summarizes long ticket threads.
The pattern that works:
- Suggested responses with explicit citations to the knowledge base. The agent reviews and accepts, edits, or rejects.
- Real-time summarization of long customer conversations. The agent gets a one-paragraph "what is going on" before they engage.
- Knowledge surfacing in-context. As the agent types, the AI surfaces relevant KB articles. The agent does not have to search.
- Customer sentiment and risk signals. Live indicators of customer frustration, churn risk, escalation likelihood.
Why agent assist often beats deflection on CSAT: the human still owns the resolution. The AI accelerates the agent's work without putting itself in the position of being wrong. For high-stakes verticals (financial services, healthcare-adjacent, complex B2B), agent assist is the right architecture.
The acceptance rate is the eval to watch. Below 30% and the suggestion engine is hurting more than it helps. Approximate acceptance targets by capability:
| Capability | Healthy acceptance | Action if below |
|---|---|---|
| Suggested reply | 45-65% | Tune retrieval; add brand-voice prompt |
| Real-time summarization | 70-85% | Reduce summary length; pin facts to source |
| Knowledge surfacing | 30-50% (clicked) | Improve query expansion + ranking |
| Sentiment / risk flag | 60-80% (no false-flag) | Calibrate threshold; gate on confidence |
Acceptance below the floor is a signal that the model is generating noise the agent has to clean up, which is strictly worse than no AI for that capability.
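A minimal sketch of the acceptance-rate check, assuming you log one event per suggestion with the capability name and whether the agent accepted it. The event schema is hypothetical; the floors mirror the table above.

```python
# Hypothetical event log: one record per AI suggestion shown to an agent.
# Schema is illustrative; the floors mirror the table above.
from collections import defaultdict

ACCEPTANCE_FLOORS = {
    "suggested_reply": 0.45,
    "summarization": 0.70,
    "knowledge_surfacing": 0.30,
    "risk_flag": 0.60,
}

def acceptance_report(events):
    shown, accepted = defaultdict(int), defaultdict(int)
    for e in events:                       # e = {"capability": str, "accepted": bool}
        shown[e["capability"]] += 1
        accepted[e["capability"]] += e["accepted"]
    report = {}
    for cap, floor in ACCEPTANCE_FLOORS.items():
        rate = accepted[cap] / shown[cap] if shown[cap] else None
        report[cap] = {"rate": rate, "below_floor": rate is not None and rate < floor}
    return report

# Usage: acceptance_report([{"capability": "suggested_reply", "accepted": True}, ...])
```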
Pattern 3: Voice agents
The phone-channel layer. Retell AI leads on voice-first deployments with sub-800ms round-trip and barge-in handling out of the box. PolyAI, Replicant, and Voiceflow compete at the enterprise tier, alongside internal builds such as Bank of America's Erica and the voice agents Sierra rolled out across enterprise consumer brands in 2025.
A typical production stack:
| Stage | Component | Latency budget | Notes |
|---|---|---|---|
| ASR | Deepgram Nova-3 streaming, AssemblyAI Realtime | 150-250ms | Streaming, partial transcripts |
| Endpoint detection | Custom or built into ASR | 200-400ms | The "did the customer stop talking" decision |
| LLM TTFT | GPT-4o-mini, Claude Haiku, Gemini Flash via gateway | 300-500ms | First-token latency matters more than total |
| TTS | ElevenLabs Flash, Cartesia Sonic, OpenAI TTS-1 | 150-300ms | Streaming TTS, sub-200ms first-byte |
| Telephony | Twilio Media Streams, Vonage, Plivo | 50-150ms | Network round-trip |
The total round-trip target is sub-1 second to first audible token. Above 1.2 seconds, the conversation feels broken. Customers start talking over the agent. The fix is not "use a faster model" but architectural: stream tokens, start TTS on partial LLM output, handle barge-in by canceling in-flight TTS when the customer interrupts.
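A stripped-down asyncio sketch of that pattern: stream LLM tokens into TTS as they arrive, and cancel the in-flight TTS task the moment the customer starts talking. `llm_stream()` and `speak()` are stand-ins for real streaming ASR/LLM/TTS clients, not any vendor's SDK.

```python
# Barge-in sketch: stream tokens to TTS, cancel in-flight TTS on interruption.
# llm_stream() and speak() are stand-ins, not a real SDK's API.
import asyncio

async def llm_stream(prompt):
    # Stand-in for a streaming LLM client; yields tokens as they arrive.
    for token in ["Your ", "refund ", "was ", "issued ", "today."]:
        await asyncio.sleep(0.05)
        yield token

async def speak(queue):
    # Stand-in for a streaming TTS client consuming text chunks.
    while (chunk := await queue.get()) is not None:
        await asyncio.sleep(0.1)           # simulated playback of the chunk
        print(f"[tts] {chunk}")

async def respond(prompt, barge_in: asyncio.Event):
    queue = asyncio.Queue()
    tts_task = asyncio.create_task(speak(queue))
    async for token in llm_stream(prompt):
        if barge_in.is_set():              # customer interrupted: cancel in-flight TTS
            tts_task.cancel()
            return "interrupted"
        await queue.put(token)             # TTS starts on partial LLM output
    await queue.put(None)                  # signal end of the reply
    await tts_task
    return "completed"

asyncio.run(respond("Where is my refund?", asyncio.Event()))
```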
Voice has its own constraints:
- Latency is unforgiving. Above 1.2 seconds round-trip, the conversation feels broken.
- Accent and dialect handling. Production voice AI has measurable accuracy gaps on accented English, AAVE, and non-English. Test bias and accuracy on your specific customer population.
- Disclosure laws. Some states (and the EU AI Act) require disclosing the caller is AI. Build the disclosure as a first-class feature.
- Escalation to human. Voice handoffs are harder than text. The human picks up cold without the conversation context unless you engineer the handoff carefully.
Voice trace as one span tree
Respan captures ASR partials, endpoint decisions, LLM TTFT, TTS first-byte, and telephony RTT as a single span tree, so when P95 round-trip jumps you can see which stage regressed instead of guessing.
Pattern 4: Ticket QA
The post-resolution layer. Review completed tickets for quality, compliance, and coachable moments. Surfaces coaching opportunities to managers and flags compliance issues to QA leads.
Pattern:
- Auto-grading on rubric. Every closed ticket scored against a rubric (resolution quality, tone, policy adherence, escalation correctness). Replaces or augments the QA team's manual sampling.
- Compliance flags. PII handling, regulatory disclosures, complaint handling. Auto-flag for QA review.
- Coaching surface. Aggregate patterns across an agent's tickets surface to their manager. "Agent X is missing the escalation criteria on billing disputes."
A working ticket-QA rubric (the kind teams paste into an LLM-as-judge prompt):
| Dimension | Pass criterion | Failure example |
|---|---|---|
| Resolution quality | Customer's stated problem is addressed and verified | "Issue resolved" with no verification step |
| Policy adherence | Quoted policies match the authoritative KB | Refund window stated as 60 days when policy is 30 |
| Escalation correctness | Account-sensitive, fraud, or legal flags handed to human | Agent issues refund above approval threshold |
| Tone match | Matches brand-voice reference set (30-50 approved replies) | Casual reply on a complaint ticket |
| Compliance | PII handled per region; required disclosures present | EU customer reply with no GDPR disclosure |
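Pasted into a judge, the rubric becomes a prompt shaped like the hallucinated-policy evaluator earlier. The wording below, and the assumption that the closed ticket's transcript lands on `trace.output` as in that example, are illustrative.

```python
# evals/ticket_qa.py -- same shape as the hallucinated-policy evaluator above;
# rubric dimensions mirror the table, prompt wording is illustrative.
from respan import evaluate

RUBRIC_JUDGE = """You grade a closed support ticket against five dimensions:
resolution quality, policy adherence, escalation correctness, tone match,
and compliance. For each dimension answer `pass` or `fail` with a one-line
reason, quoting the ticket where relevant.

Authoritative KB excerpts:
{kb}

Full ticket transcript:
{transcript}
"""

@evaluate(name="ticket_qa_rubric", model="claude-sonnet-4-6")
def ticket_qa(trace):
    return RUBRIC_JUDGE.format(
        kb="\n\n".join(trace.retrieval.chunks),
        transcript=trace.output,  # assumption: transcript stored as in the earlier example
    )
```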
Klaus (acquired by Zendesk), MaestroQA, and Tymeshift compete here. The build-vs-buy decision turns on whether your QA rubric is generic enough to fit a vendor's schema or specific enough to need custom evaluators (regulated verticals, multi-brand orgs, and unusual escalation criteria usually need custom).
Pattern 5: Knowledge base RAG
The infrastructure layer underneath every other pattern. The KB has to be searchable, retrievable, and current.
What separates production-grade KB RAG:
- Versioned content. When policies change, the AI's knowledge changes. Retrieval pulls the current version, not a stale embedding.
- Multi-format ingestion. Help center articles, internal SOPs, product documentation, past resolved tickets, customer-facing emails. Everything that documents the company's truth gets indexed.
- Permission-aware retrieval. Internal docs are not surfaced to external customers. The retrieval engine respects the security boundary.
- Citation enforcement. AI responses cite the specific KB articles they draw on. Customers (and agents) can verify the source.
The eval that catches the most damage: retrieval recall on a held-out set of "tickets that should have been answered from the KB but weren't." Track this weekly. When recall drops, the KB has stale or missing content, not a model regression.
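A minimal sketch of that weekly recall check, assuming each held-out ticket is labeled with the KB article that should have answered it. `retrieve()` and the chunk's `article_id` field stand in for your retrieval stack.

```python
# Weekly retrieval-recall check on a held-out set of "should have been answered
# from the KB" tickets. retrieve() and article_id stand in for your own stack.
def recall_at_k(held_out, retrieve, k=5):
    hits = 0
    for ticket in held_out:   # ticket = {"query": str, "expected_article_id": str}
        retrieved_ids = [c.article_id for c in retrieve(ticket["query"], top_k=k)]
        hits += ticket["expected_article_id"] in retrieved_ids
    return hits / len(held_out)

# Track the number weekly; a drop usually means stale or missing KB content,
# not a model regression.
```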
Why 2024-style deployments failed: the Klarna lessons
Klarna's reversal was the most public version of four failure modes the category has now seen repeatedly. Treat each as a constraint on your architecture, not a postmortem to study after launch.
Aggregate metrics hid edge-case failures. Klarna's AI handled 2.3 million chats and reported strong overall numbers. The 5-10% of interactions that involved fee disputes, payment hardships, or fraud claims got the same treatment as the 90% routine queries. Average CSAT looked fine; edge-case CSAT was catastrophic. The customers experiencing those bad cases were the ones who churned and complained publicly. Architectural fix: stratify every metric by query type and customer segment; never publish a single global CSAT number internally.
Hallucinations on the cases that mattered most. Routine questions get answered from grounded retrieval. Complex questions invite the model to extend beyond grounded sources. In fintech, an AI that confidently states an incorrect fee or fabricates a policy creates immediate compliance damage. Industry hallucination rates of 15-27% in live deployments translate directly to customer harm. Architectural fix: explicit refusal/escalation paths for high-stakes query types; structured policy data that the LLM renders rather than interprets.
Deflection became the optimization target. When containment is the headline KPI, the system biases toward not escalating. The customer with a complex dispute who got "resolved" by the AI is recorded as a containment success and a customer-service failure simultaneously. Architectural fix: track (deflected_volume × CSAT) and the false-deflection rate (cases AI handled that should have escalated) as first-class metrics.
Human agent tacit knowledge was eliminated alongside the team. Experienced agents carry pattern recognition the AI cannot replicate from logs alone: "this kind of question usually means the customer is actually asking about X." When Klarna laid off the team, that institutional knowledge went with it. Rebuilding it required not just rehiring but retraining. Architectural fix: hybrid by design from day one (AI handles routine, humans own complex), not a fallback retrofitted after the AI fails.
The 2026 reset that the leaders run (Sierra at $10B / $150M ARR, Decagon at $4.5B, Intercom Fin past $100M ARR) treats each of these as architectural constraints, not retrospective lessons.
What is hard across patterns
The deflection-vs-CSAT trade-off. Aggressive deflection drops CSAT. Conservative escalation keeps CSAT but hurts deflection. The right balance is product-specific, customer-cohort-specific, and time-of-day-specific. A single global threshold misses.
Hallucinated policies. "Our return window is 90 days" when it is actually 30 days. "We accept returns on used items" when the policy is unopened only. These are the highest-liability hallucinations because they create binding promises (or at least customer expectations) the company has to honor or argue against. The 2024 Air Canada chatbot ruling is the legal anchor: the tribunal found the airline liable for what its chatbot promised.
Multi-channel memory. A customer who started on chat, escalated to email, and is now on phone has a continuous context that most vendor stacks fail to maintain. Memory that crosses channels with appropriate redaction is non-trivial.
Privacy across jurisdictions. GDPR, CCPA, LGPD, CASL, PIPEDA, plus emerging state laws. Customer data has to be handled to the strictest applicable standard.
Tone calibration. AI responses that are technically correct but tonally wrong (corporate-speak when the customer is frustrated, casual when the customer is formal) drop CSAT. Brand voice tuning is harder than people think; the practical pattern is a brand-voice prompt registry plus an LLM-as-judge evaluator that grades tone match against 30-50 reference responses your brand team has already approved.
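A sketch of that reference-set judge: the prompt is assembled from the approved replies, and everything below (function name, wording) is illustrative rather than a fixed API.

```python
# Sketch of a tone-match judge prompt built from an approved reference set.
# The reference replies come from your brand team; the wording is illustrative.
def tone_judge_prompt(reference_replies, candidate_reply, customer_message):
    refs = "\n---\n".join(reference_replies[:50])   # 30-50 approved replies
    return f"""You grade whether a support reply matches the brand voice shown
in the reference replies. Consider formality, warmth, and how it handles a
frustrated customer. Answer `pass` or `fail` with one sentence of reasoning.

Reference replies (approved brand voice):
{refs}

Customer message:
{customer_message}

Candidate reply:
{candidate_reply}
"""
```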
Auto-grade every closed ticket
Respan runs the rubric above as an LLM-as-judge eval over 100% of closed tickets, pushes aggregated patterns to the QA team, and flags individual tickets for human review. Replaces the 1-3% manual sampling most QA orgs run today.
How Respan fits
The same observability and evaluation backbone applies across all five patterns; the failure modes change but the substrate is constant. Tracing makes the multi-step deflection or voice flow debuggable. Evals catch hallucinated policies, missed escalations, and CSAT drift. The gateway routes between cheap and frontier models depending on confidence. Prompt management versions the policy and brand-voice prompts that change weekly. Monitors fire when deflection-vs-CSAT divergence crosses a threshold.
A starter loop tied to the patterns above:
- Instrument every customer interaction with Respan tracing including auth, retrieval, policy lookup, LLM, verifier, and escalation spans (a generic span sketch appears below).
- Pull 200 to 500 production conversations into a dataset and have QA leads label them on resolution quality, hallucinated policy claims, and escalation correctness.
- Wire two or three evaluators that catch the failure mode you most fear: hallucinated policy claim, false confidence, wrong account, missed escalation.
- Put your policy enforcement prompts and brand-voice templates in the registry so support ops can update them without a deploy.
- Route through the gateway so per-customer spending caps, fallback chains, and BAA-compliant model selection live in one place.
That loop is the difference between a launch press release and a Klarna-style reversal twelve months later.
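A generic sketch of the span structure from the first bullet, written against the OpenTelemetry API rather than any vendor SDK; the exporter wiring to your tracing backend is an assumption and is not shown.

```python
# Generic span structure for one deflection turn, written against the
# OpenTelemetry API; wire the exporter to whichever tracing backend you use.
from opentelemetry import trace

tracer = trace.get_tracer("support.deflection")

def handle_turn(customer_id, message):
    with tracer.start_as_current_span("deflection_turn") as turn:
        turn.set_attribute("customer.id", customer_id)
        with tracer.start_as_current_span("auth"):
            pass        # verify the session maps to this customer account
        with tracer.start_as_current_span("retrieval") as r:
            r.set_attribute("kb.top_k", 5)
        with tracer.start_as_current_span("policy_lookup"):
            pass        # fetch structured policy values (refund window, limits)
        with tracer.start_as_current_span("llm"):
            pass        # grounded generation over retrieved chunks
        with tracer.start_as_current_span("verifier"):
            pass        # hallucinated-policy and confidence checks
        with tracer.start_as_current_span("escalation_decision") as e:
            e.set_attribute("escalated", False)
```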
To wire any of the patterns above on Respan, start tracing for free, read the docs, or talk to us.
Where to go next
- Why Support AI Hallucinates Refund Policies: the highest-liability failure mode
- Customer Data Privacy for Support AI: the multi-jurisdiction build
- Building a Customer Support Deflection Agent: end-to-end walkthrough
- Evaluating Customer Support AI: the four-dimension eval framework
- Customer Service Agent Architecture: Decagon AOPs, Sierra Agent OS, helpdesk-native patterns
FAQ
How can AI be used for customer support? Five patterns: deflection (AI resolves the ticket), agent assist (AI augments a human), voice (phone-channel AI), ticket QA (AI grades closed tickets), and KB RAG (the retrieval substrate underneath). Most orgs start with agent assist (lower risk) before layering deflection on the easy-resolve subset.
Which AI is best for customer service? The right answer depends on which pattern fits your operation. For Intercom-native orgs, Fin. For voice-first enterprise consumer brands, Sierra. For deep customization with technical CX teams, Decagon. For agent assist on Salesforce or Zendesk, the native AI products. The vendor comparison table above is a directional starting point.
Is AI going to replace customer service workers? The empirical answer from 2024-2025 is no. Klarna's reversal, the Air Canada ruling, and broad CSAT-vs-deflection data point at AI augmentation outperforming AI replacement on the metrics that matter (CSAT, retention, complaint volume). The right framing is "what work does AI do well" plus "what does the human still own."
What is the 10-20-70 rule for AI? A consulting heuristic for AI program success: 10% on algorithms, 20% on technology and data, 70% on people and processes. Useful as a budget-allocation reminder but not a substitute for the engineering questions: which pattern, which vendor, which evals.
What deflection rate should I target? Depends on the product and the cohort. The cohort table above gives directional ranges. The right target is whatever maximizes (deflected_volume × CSAT) rather than deflection alone. Run a 4-week A/B before committing to a global threshold.
Is agent assist or deflection the right starting point? For most orgs, agent assist first. It improves agent productivity and CSAT immediately, with much lower failure-mode risk than deflection. Once agent assist is solid, layer deflection on the easy-to-resolve subset.
What's the most common production failure? Hallucinated policies. The AI generates a refund window, return policy, or warranty claim that does not exist or contradicts the actual policy. The customer holds the company to it. See the policy hallucination spoke.
How do I handle customer data across regions? GDPR, CCPA, and emerging state and provincial laws apply. Build to the strictest standard, route data based on the customer's region. PII redaction at the gateway, audit logging, retention policies aligned to each jurisdiction. Full breakdown in the privacy spoke.
What's the right escalation logic? Confidence below threshold, customer frustration signals, account-sensitive issues (refunds above a threshold, fraud claims, legal threats), repeat ticket on the same issue. Escalate with full context, not as a cold handoff.
