Ambient clinical documentation is the largest production GenAI category in healthcare. Abridge closed a $300M Series E in mid-2025 at a $5.3B valuation. Ambience Healthcare raised $243M Series C at $1.25B. Nabla is in 150+ health systems. Microsoft bundles DAX Copilot into Microsoft 365. Epic shipped its own native scribe in 2025, putting commoditization pressure on standalone players.
The interesting question for builders is not whether to build an ambient scribe. It is what the architecture has to look like to survive a pilot. In 2024, researchers documented Whisper hallucinations: invented medications and whole sentences in otherwise clear audio, across millions of clinical conversations. The CREOLA framework measured production scribes at a 1.47% hallucination rate and a 3.45% omission rate, with 44% of hallucinations classified "major" and 20% landing in the Plan section. Clinicians can spot bad notes fast, and a tool that produces them gets uninstalled within the pilot.
This piece is the build walkthrough. It assumes you have read the hallucination spoke for the safety framework and the HIPAA spoke for the compliance architecture. It covers ASR selection, speaker diarization, structured note generation, ICD-10 and CPT coding, EHR write-back, longitudinal record retrieval (the Contextual Reasoning Engine pattern), eval, and the build-vs-buy line.
For context on where ambient scribes sit in the broader healthcare AI stack, the pillar covers all seven core use cases.
The architecture in one diagram
[Encounter audio]
|
v
[1. ASR with diarization] reject low-confidence segments
|
v
[2. Transcript redaction] strip PHI not needed for note
|
v
[3. Specialty-aware extractor] subjective, objective, assessment, plan entities
|
v
[4. Longitudinal context] retrieve prior visits, problems, meds, labs
|
v
[5. Note generator] specialty-specific template
|
v
[6. Coding suggestion] ICD-10, CPT, HCC
|
v
[7. Faithfulness check] every note claim grounded in transcript or chart
|
v
[8. Clinician review and edit] capture as labeled data
|
v
[9. EHR write-back] FHIR / HL7v2 / vendor API
Nine components, four safety layers, one feedback loop. The two most commonly skipped pieces are ASR confidence handling (1) and the faithfulness check (7), which are also the two that surface most often in production failures.
Step 1: Pick the ASR carefully
The ASR layer is its own hallucination surface, separate from the LLM. Whisper is the most accessible option but has documented hallucinations: invented medications ("hyperactivated antibiotics"), fabricated speaker turns, racial commentary that was never said. Downstream LLM grounding cannot recover what the ASR fabricated. Treat ASR plus LLM as a chained risk.
The 2026 options:
- Whisper (OpenAI). Strong general-purpose transcription, but documented hallucinations on clear audio. Use only with confidence thresholding and re-listen verification on flagged segments.
- Deepgram Nova / Medical. Healthcare-tuned, strong on noisy environments.
- AssemblyAI Universal Medical. Healthcare-tuned, includes diarization.
- NVIDIA Parakeet / Riva Medical. On-prem option, strong for VPC-only deployments.
- Nuance / Microsoft Dragon Medical One ASR layer. Mature, deep medical vocabulary; integration with Microsoft Dragon Copilot (March 2025).
- Proprietary medical ASR. Abridge, Nabla, and Suki have invested in domain-tuned ASR; the investment only pays off at scale.
Selection criteria for a build:
- Word error rate on medical terminology (drug names, anatomy, procedure terms). Test on your specialty mix, not generic benchmarks.
- Speaker diarization accuracy. Mistaking patient speech for clinician speech (or family speech for patient speech) corrupts the downstream note.
- Confidence scores per segment. You need them to flag re-listen candidates. ASR layers that do not expose per-segment confidence make defensive engineering impossible.
- Streaming vs batch. Streaming enables real-time draft updates during the encounter. Batch is simpler and often sufficient post-encounter.
- HIPAA eligibility. Verify per-provider and per-deployment.
import os

from respan import Respan

client = Respan(api_key=os.environ["RESPAN_API_KEY"])

@client.workflow(name="ambient-scribe")
def transcribe_and_diarize(audio_uri, encounter_id, attending_id):
    transcript = client.asr.transcribe(
        provider="deepgram-nova-medical",
        audio_uri=audio_uri,
        diarize=True,
        confidence_threshold=0.85,
        on_low_confidence="flag_for_review",
    )
    # Reject segments below the confidence threshold
    flagged = [s for s in transcript.segments if s.confidence < 0.85]
    if len(flagged) / max(len(transcript.segments), 1) > 0.10:
        # >10% low-confidence segments: do not auto-generate the note
        return require_clinician_re_listen(transcript, flagged)
    return transcript

Block-on-low-confidence matters. An ASR that silently passes garbled audio through to the LLM produces fluent fabrications.
Step 2: Specialty-aware extraction
A pediatric encounter, a surgical post-op note, and an emergency triage have different structures. SOAP fits primary care; APSO is preferred in some specialties; surgery has its own templates; ED uses different headings. Building a single generic note template and applying it everywhere is the most common cause of clinician dissatisfaction.
The pattern that works:
- Specialty classifier at the start of the workflow. Often known from the EHR context; if not, classify from the audio metadata or first 60 seconds of speech.
- Specialty-specific entity schema. What entities to extract differs: a peds note needs growth percentiles; a surgical note needs operative findings; an ED note needs disposition reasoning.
- Template library with the structured schema each specialty expects. Each template is versioned in your prompt registry.
specialty = classify_specialty(transcript, encounter_metadata)
template = client.prompts.get(f"scribe/template/{specialty}", env="prod")
schema = client.prompts.get(f"scribe/schema/{specialty}", env="prod")

extracted = client.chat.completions.create(
    model="auto",
    messages=build_extraction_prompt(transcript, schema),
    response_format={"type": "json_schema", "schema": schema},
)

The structured-output requirement on extraction is the seed of your faithfulness check. If the schema requires every clinical claim to include a transcript_span field with verbatim text from the transcript, you can verify each claim against a substring match. If the schema allows freeform prose, fluent fabrications slip through.
Step 3: Longitudinal record retrieval
The 2025-2026 frontier in ambient scribes is integrating the patient's longitudinal record into the note. Abridge calls this the Contextual Reasoning Engine, launched March 2026. Ambience calls theirs Chart Awareness, launched February 2026. The pattern: pull from prior visits, problem list, medications, labs, imaging, and external knowledge bases (UpToDate via the Abridge / Wolters Kluwer partnership).
Architecturally:
- FHIR-based retrieval. Most modern EHRs expose FHIR APIs (R4 is the standard). Pull the relevant resources (Patient, Condition, MedicationStatement, Observation, Encounter, DiagnosticReport) at note-generation time, scoped to the encounter context.
- Patient-scoped retrieval. Cross-patient retrieval should be impossible architecturally. The retrieval layer enforces the patient_id filter at the database level, not in the prompt.
- Time-bounded retrieval. A 5-year-old lab result is relevant for some encounters, irrelevant for others. The retrieval pipeline understands the encounter type and time-bounds accordingly.
- External knowledge. UpToDate, NCCN guidelines, Cochrane reviews. Licensed sources only, with citation enforcement so every clinical recommendation in the note traces to a source.
@client.workflow(name="longitudinal-context")
def fetch_chart_context(patient_token, encounter_id, specialty):
    fhir_resources = ehr_client.fetch(
        patient_id=patient_token,  # hashed; mapping table stored separately
        resource_types=specialty_relevant_resources(specialty),
        time_window=specialty_time_bounds(specialty),
    )
    knowledge = client.retrieve(
        index="uptodate-nccn",
        query=build_clinical_question(fhir_resources),
        require_citations=True,
    )
    return ChartContext(fhir=fhir_resources, knowledge=knowledge)

Step 4: Note generation with faithfulness check
The generation step takes the extracted entities, the longitudinal context, and the specialty template, and produces the structured note. Two patterns matter.
Two-pass generation. Pass 1 extracts entities and grounds them in transcript spans. Pass 2 writes the note prose using the grounded entities. The two passes catch different failure modes: pass 1 catches hallucinations from the LLM inventing entities not in the transcript; pass 2 catches stylistic regressions where the prose drifts from the grounded entities.
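A minimal sketch of how the two passes compose, assuming the Respan client, specialty template, and extraction output from the earlier steps; build_note_prompt, the transcript_span field, and the response attributes are illustrative names rather than a documented API.

def generate_note_two_pass(transcript, chart_context, template, extracted_entities):
    # Pass 1 output: keep only entities whose verbatim span appears in the transcript
    grounded = [
        e for e in extracted_entities
        if e.get("transcript_span") and e["transcript_span"] in transcript.text
    ]
    # Pass 2: write the note prose from grounded entities only, so an entity
    # the LLM invented in pass 1 can never reach the published note
    note_response = client.chat.completions.create(
        model="auto",
        messages=build_note_prompt(template, grounded, chart_context),
    )
    return note_response, grounded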
Faithfulness check. Every clinical claim in the generated note must trace to a transcript span or a chart resource. The check is a substring match on the transcript and an entity match on the chart. Claims that fail validation are flagged for clinician review rather than silently dropped.
def faithfulness_check(note, transcript, chart):
    for claim in note.claims:
        source = claim.get("source")
        if source is None:
            claim["faithfulness"] = "fail_unsourced"
        elif source["type"] == "transcript":
            # The verbatim span must appear in the transcript text
            if source["span"] not in transcript.text:
                claim["faithfulness"] = "fail_no_transcript_grounding"
        elif source["type"] == "chart":
            # The cited FHIR resource must exist in the retrieved chart context
            if not chart.has_resource(source["resource_id"]):
                claim["faithfulness"] = "fail_no_chart_grounding"
        else:
            claim["faithfulness"] = "fail_unsourced"
    return note

The CREOLA data tells you 20% of hallucinations land in the Plan section, and 44% are major. Apply the strictest faithfulness threshold there. A Plan section claim with no transcript or chart grounding should never auto-publish.
Step 5: ICD-10, CPT, and HCC coding
Coding suggestion is one of the highest-value scribe outputs because it accelerates revenue cycle. It is also one of the highest-risk: a hallucinated ICD-10 code corrupts the patient's record and triggers payer scrutiny.
The pattern:
- Deterministic codebook lookup. Every suggested code resolves against the canonical ICD-10-CM, CPT, and HCC tables. Codes that do not resolve are dropped, not silently passed.
- Confidence threshold per code. Suggested codes below a threshold are flagged for clinician review, not auto-applied.
- Specialty-specific code priors. A peds encounter has a different distribution of expected codes than a cardiology encounter. The classifier uses specialty as a feature.
- Linkage to transcript spans. Every suggested code traces to a specific clinical statement in the transcript or chart. Codes without linkage are not safe to auto-apply.
The compliance angle: payer audits target codes. A scribe that suggests a code without linked clinical evidence is creating documentation that does not stand up to audit.
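A minimal sketch of that pattern, with the codebook tables, CodeSuggestion fields, and thresholds as illustrative assumptions; the canonical tables themselves come from the licensed ICD-10-CM, CPT, and HCC releases.

from dataclasses import dataclass

@dataclass
class CodeSuggestion:
    system: str                # "ICD-10-CM", "CPT", or "HCC"
    code: str
    confidence: float
    evidence_span: str | None  # verbatim transcript or chart text backing the code

def validate_code_suggestions(suggestions, codebooks, min_confidence=0.90):
    """Drop codes that do not resolve; route unlinked or low-confidence codes
    to clinician review; auto-apply only resolved, linked, confident codes."""
    auto_apply, needs_review = [], []
    for s in suggestions:
        if s.code not in codebooks.get(s.system, set()):
            continue  # does not resolve against the canonical table: dropped
        if s.evidence_span is None or s.confidence < min_confidence:
            needs_review.append(s)  # never auto-applied without linkage and confidence
        else:
            auto_apply.append(s)
    return auto_apply, needs_review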
Step 6: EHR write-back
Three integration paths, in order of effort:
- FHIR R4 API. Modern Epic, Cerner, and Athenahealth installations expose this. The cleanest write-back path, with structured DocumentReference and Encounter resources.
- HL7v2 messaging. Older but ubiquitous. ADT and ORM messages for orders; MDM for documents.
- Vendor-specific APIs. Epic App Orchard, Athenahealth Marketplace, and Cerner's API portal each have their own conventions. Tighter integration, more deployment work per customer.
Write-back is where the business model lives. A scribe that drafts notes well but cannot push them into the EHR loses to one that integrates natively.
The supervisory layer matters here too. ABA 512-style "supervisory responsibilities" translate to healthcare as: a clinician must explicitly accept the AI-generated note before it writes back. It should be impossible to skip the human-in-the-loop checkpoint without an explicit override.
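A minimal sketch of the FHIR R4 path with the acceptance gate in front of it, assuming a generic fhir_client with a create method and an acceptance record captured from the clinician review step; real Epic or Cerner installations add their own conventions on top.

import base64

def write_back_note(fhir_client, note, encounter_id, patient_token, acceptance):
    # Hard gate: no write-back without an explicit clinician acceptance record
    if acceptance is None or not acceptance.accepted:
        raise PermissionError("Note not accepted by a clinician; write-back blocked")

    document_reference = {
        "resourceType": "DocumentReference",
        "status": "current",
        "type": {"coding": [{"system": "http://loinc.org",
                             "code": "11506-3", "display": "Progress note"}]},
        "subject": {"reference": f"Patient/{patient_token}"},
        "context": {"encounter": [{"reference": f"Encounter/{encounter_id}"}]},
        "content": [{"attachment": {
            "contentType": "text/plain",
            "data": base64.b64encode(note.text.encode("utf-8")).decode("ascii"),
        }}],
    }
    return fhir_client.create("DocumentReference", document_reference)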
Step 7: Eval and the clinician edit rate
Most ambient scribe products have a demo that looks good and a roadmap that says "improve quality." That is not an eval.
The production-grade signal is clinician edit rate per section, sliced by specialty and clinician. CREOLA's data tells you the Plan section is the hot spot for hallucinations. Tracking edit rate there as a first-class metric, with weekly drift alarms, surfaces problems before they become sentinel events.
A reasonable eval suite:
- Faithfulness pass rate on every note pre-write-back.
- Section-level edit rate sliced by specialty, clinician, and EHR.
- Hallucinated entity rate measured against transcript span grounding.
- Omission rate measured against a clinician-annotated gold set of 100-300 encounters.
- Coding accuracy measured against the clinician-finalized codes.
- Time-to-completion from encounter end to clinician acceptance.
client.monitors.create(
    name="scribe-clinical-safety",
    workflow="ambient-scribe",
    sample_rate=0.05,
    evaluators=[
        "plan_section_edit_rate",
        "hallucinated_entity_rate",
        "coding_accuracy",
        "asr_low_confidence_rate",
    ],
    alert_on={
        "plan_section_edit_rate": ">0.30",  # >30% of plan sections edited = drift
        "hallucinated_entity_rate": ">0.05",
    },
    slice_by=["specialty", "clinician_id"],
)

The slice-by is essential. Aggregate metrics hide the truth. You want to see surgical scribe performance separately from primary care scribe performance, because the failure modes are different.
Real failure modes from production
A short list of the failure modes that actually break ambient scribe products in deployment.
Hallucinated medications, dosages, allergies. The CREOLA Plan-section concentration is real. Generators invent medications when the transcript is ambiguous about what was prescribed, or fabricate dosages when the prescription is verbal and inexact.
Mishearing in noisy environments. ED, OR, and labor and delivery are noisy. ASR error rates climb 2-3x in these environments. Specialty-aware ASR tuning helps; defensive thresholding helps more.
Hallucinations on non-English speakers and accented speech. Same content, lower accuracy on accented and non-native English. The bias compounds because safety guardrails are also weaker on nonstandard English inputs.
Family chatter mistakenly attributed to patient. Diarization that confuses speakers corrupts the patient's history. A spouse's comment about a different condition can end up in the patient's note.
Schema rigidity breaking on atypical encounters. A standard SOAP template fails on unusual encounters (a complex multi-problem visit, a patient transitioning between specialties, a follow-up that turns into a new diagnosis). The template needs to flex.
Drift across model updates. The same prompt plus the same audio produces different notes after a provider's silent model upgrade. Pin model versions in the gateway and run a weekly quadratic weighted kappa (QWK) regression on a frozen golden set.
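A minimal sketch of that weekly regression, assuming a frozen golden set with ordinal clinician quality ratings and a generate_and_score function that runs the pinned pipeline; the QWK itself is scikit-learn's cohen_kappa_score with quadratic weights.

from sklearn.metrics import cohen_kappa_score

def golden_set_regression(golden_encounters, generate_and_score, qwk_floor=0.85):
    """Re-run the pinned pipeline over the frozen golden set and compare its
    section quality ratings against the clinician-assigned reference ratings."""
    reference, current = [], []
    for enc in golden_encounters:
        result = generate_and_score(enc.audio_uri, enc.encounter_id)
        reference.append(enc.gold_quality_rating)  # ordinal 1-5 clinician rating
        current.append(result.quality_rating)      # same rubric, this week's run
    qwk = cohen_kappa_score(reference, current, weights="quadratic")
    if qwk < qwk_floor:
        raise RuntimeError(f"Golden-set QWK {qwk:.2f} below floor {qwk_floor}: block the deploy")
    return qwk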
Build vs buy
The line flips around 100,000 patient encounters per month. Below that, vendor pricing wins on time-to-value. Above it, build economics improve.
Buy when:
- Single hospital or small health system, generic specialty mix
- No dedicated MLOps team
- Time-to-value matters. Vendor pricing typically lands at $100-300 per clinician per month for the leading products
- You are okay with vendor model drift and your data flowing through their infrastructure
Build when:
- Large health system or multi-state network where licensing 5,000+ clinicians becomes meaningful annual spend
- Specialty mix that vendors cover poorly (subspecialties, surgical specialties with unusual workflows)
- Tight EHR integration requirements (proprietary data models, custom decision support hooks)
- Existing labeled corpus from your historical clinician edits that becomes a moat
Hybrid pattern (most common in 2026). License a strong domain-tuned ASR (Deepgram Medical, AssemblyAI Universal Medical, or Microsoft Dragon Medical One ASR layer). Build the LLM layer, faithfulness checks, longitudinal context, and EHR integration in-house. Lower compliance burden than full build, more differentiation than full buy.
A reference build checklist
Before you ship a scribe to clinicians:
- ASR with per-segment confidence scoring; block-on-low-confidence at the workflow level
- Speaker diarization tested on multi-speaker recordings including family-present encounters
- Specialty classifier with a registered specialty-specific template per supported specialty
- Two-pass generation (extract grounded entities first, write prose second)
- Faithfulness check requiring transcript-span or chart-resource grounding for every clinical claim
- Longitudinal record retrieval scoped to patient_id at the database level
- Citation enforcement on any external-knowledge claim (UpToDate, NCCN, etc.)
- ICD-10 / CPT / HCC suggestion with deterministic codebook lookup and per-code confidence threshold
- EHR write-back via FHIR R4 with explicit clinician acceptance gate before publish
- Tracing every step with hashed patient identifiers; HIPAA-aligned audit logs (see HIPAA spoke)
- Clinician edit-rate dashboards sliced by specialty, clinician, and section
- Frozen golden set of 100-300 encounters for offline regression testing
- Online sampling (5-10%) through faithfulness, hallucinated-entity, and coding-accuracy evaluators with weekly drift alerts
- Pinned model versions with weekly regression against the frozen golden set
- Adversarial regression suite of planted-fact transcripts (see the hallucination spoke)
CTA
To wire the scribe stack on Respan, start tracing for free, read the docs, or talk to us. For the rest of the Healthcare cluster: the pillar, the hallucination spoke, and the HIPAA / BAA engineering spoke. The clinical eval spoke is next.
How Respan fits
Ambient scribe stacks fail in production when the ASR, retrieval, generation, coding, and write-back layers are observed in isolation. Respan gives you one connected view across the whole nine-step workflow so faithfulness regressions and Plan-section drift surface before clinicians uninstall.
- Tracing: every encounter captured as one connected trace. Auto-instrumented for LangChain, LlamaIndex, Vercel AI SDK, CrewAI, AutoGen, OpenAI Agents SDK. ASR confidence segments, FHIR retrieval calls, two-pass generation, ICD-10 lookups, and EHR write-back all hang off the same encounter span with hashed patient identifiers preserved end to end.
- Evals: ten built-in evaluators (faithfulness, citation accuracy, refusal correctness, harmfulness) plus LLM-as-judge and custom Python evaluators. Production traffic flows directly into datasets. CI-aware experiments block regressions on hallucinated medications, ungrounded Plan-section claims, and mis-attributed speaker turns before deploys ship.
- Gateway: 500+ models behind an OpenAI-compatible interface, semantic caching, fallback chains, per-customer spending caps. Pin specific model versions per specialty template, fall back from a primary clinical model to a secondary if latency or error budgets blow, and cap spend per health system without rewriting code.
- Prompt management: versioned registry, dev/staging/prod environments with approval workflows, A/B testing in production with one-click rollback. Specialty templates, extraction schemas, and coding prompts stay versioned per specialty so a peds template change cannot silently affect surgical notes.
- Monitors and alerts: Plan-section edit rate, hallucinated entity rate, ASR low-confidence rate, coding accuracy, faithfulness pass rate. Slack, email, PagerDuty, webhook. Slice every alert by specialty and clinician_id so a regression in cardiology surfaces without being averaged out by primary care volume.
A reasonable starter loop for ambient scribe builders:
- Instrument every LLM call with Respan tracing including ASR, diarization, FHIR retrieval, extraction, generation, and coding spans.
- Pull 200 to 500 production encounter records into a dataset and label them for transcript faithfulness, omission, and Plan-section accuracy.
- Wire two or three evaluators that catch the failure modes you most fear (hallucinated medications, ungrounded Plan claims, mis-coded ICD-10).
- Put your specialty templates and extraction schemas behind the registry so you can version, A/B, and roll back without a deploy.
- Route through the gateway so model versions stay pinned across health systems and silent provider upgrades cannot drift your golden-set scores.
This is the same loop the leading ambient scribe teams converge on; Respan compresses the wiring so you can spend cycles on clinical quality rather than plumbing.
To wire any of the patterns above on Respan, start tracing for free, read the docs, or talk to us.
FAQ
Should I use Whisper for the ASR layer? With caution. Whisper is accessible and good on general transcription but has documented hallucinations on clear audio, including invented medications. If you use it, run with confidence thresholding, low-confidence flagging, and a re-listen verification path. For production, healthcare-tuned ASR (Deepgram, AssemblyAI Universal Medical, NVIDIA Parakeet Medical, Microsoft Dragon Medical One) is the safer floor.
One-pass or two-pass generation? Two-pass. Pass 1 extracts entities and grounds them in transcript spans. Pass 2 writes the prose using grounded entities only. The two passes catch different failure modes and combine to give you a workable faithfulness check.
How do I handle non-English encounters? Healthcare-tuned ASR has better non-English support than Whisper. For multilingual encounters, run language detection up front and route to a language-appropriate ASR plus LLM stack. Test bias and accuracy on your specific patient population, not just general benchmarks.
Can I auto-publish notes without clinician review? No. ABA 512-style supervisory responsibilities translate directly: a clinician must explicitly accept the AI-generated note before it writes back to the EHR. Auto-publish without review is a patient safety violation, not just a compliance one.
What is the most important production metric? Clinician edit rate per section, sliced by specialty and clinician. Track Plan-section edit rate as a primary safety metric. CREOLA data shows 20% of hallucinations land there, and 44% of hallucinations overall are classified major.
How do longitudinal record patterns differ across vendors? Abridge's Contextual Reasoning Engine and Ambience's Chart Awareness both pull from prior visits, problem list, medications, labs, and external knowledge bases. The implementation differs in retrieval scoring (recency, specialty relevance, encounter type), in time-bounding, and in citation surface UX. The architectural pattern is the same: FHIR-based retrieval scoped to the patient at the database level, with external knowledge layered on top.
