Healthcare is the single largest production GenAI vertical right now. In Q1 2026 alone, digital health startups raised $4B, and 54% of all digital health funding in 2025 went to AI-enabled companies, up from 37% the year before. Generative AI in healthcare is on a 26.7% CAGR through 2035, climbing from $4.7B in 2026 toward roughly $40B in a decade.
But the gap between a working prototype and a clinically deployed product is wider here than almost anywhere else. HIPAA, FDA, EU AI Act, plus a fast-moving patchwork of state laws that now restrict AI in payer decisions. Hallucination rates that are merely inconvenient in a marketing chatbot can cause real harm in a clinical note. Latency budgets that look comfortable in a SaaS app are deal-breakers for a voice intake agent.
This playbook is for the founders and engineering leads building inside that pressure. It walks through the AI use cases that are actually working in healthcare today, the companies shipping them, the technical patterns underneath, and how to use Respan to get from prototype to production without rebuilding your observability and eval stack from scratch.
For deeper engineering work on specific layers, see the spokes:
- Clinical AI Hallucination: six layers of defense for the Mount Sinai 83% planted-fact failure mode
- HIPAA, BAAs, and PHI Engineering: the BAA tier comparison and the redaction architecture every healthcare AI builder has to ship
- Building an AI Medical Scribe: end-to-end ambient scribe walkthrough, from ASR to FHIR write-back
- How to Evaluate Clinical AI: benchmarks, judges, and the production cadence
The seven use cases that are actually shipping
1. Ambient clinical documentation
The breakout category. AI scribes listen to the patient-clinician conversation and write structured SOAP notes, billing codes, and after-visit summaries directly into the EHR. Adoption is moving from pilot to system-wide rollouts at major academic centers.
The leaders are pulling away. Abridge raised a $300M Series E in June 2025 at a $5.3B valuation, is deeply integrated with Epic, and shipped a Contextual Reasoning Engine that pulls from the patient's longitudinal record and insurer guidelines. Ambience Healthcare raised a $243M Series C in late 2025 at $1.25B and is rolling out across Cleveland Clinic, UCSF, Houston Methodist, Memorial Hermann, and Ardent Health. Nabla is in 150+ health systems, including a recent system-wide M Health Fairview deployment, with five-second note generation and 35+ language support. Suki and Microsoft DAX Copilot round out the field, and Epic launched its own native scribe in 2025, putting commoditization pressure on standalone players.
Stack: speaker-diarized ASR feeds a frontier LLM with domain-tuned heads, structured output schemas for SOAP and ICD-10/CPT, and rule-based post-processing for billing accuracy. Almost all production traffic flows through Azure OpenAI or AWS Bedrock under a BAA. The frontier teams have moved beyond single-pass generation to RAG over the patient's chart and external knowledge bases like UpToDate.
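The structured-output and rule-based post-processing steps can be made concrete. A minimal sketch using stdlib dataclasses; the field names and the code-validation heuristic are illustrative, not any vendor's schema:

```python
# Illustrative structured-output schema for a SOAP note plus a cheap
# rule-based pass that flags billing codes the note never supports.
from dataclasses import dataclass, field

@dataclass
class BillingCode:
    system: str        # "ICD-10" or "CPT"
    code: str
    description: str

@dataclass
class SoapNote:
    subjective: str
    objective: str
    assessment: str
    plan: str
    billing_codes: list = field(default_factory=list)

def validate_codes(note: SoapNote) -> list:
    """Flag codes whose description never appears in the note text --
    a crude guard against hallucinated billing codes."""
    text = " ".join(
        [note.subjective, note.objective, note.assessment, note.plan]
    ).lower()
    return [c.code for c in note.billing_codes
            if c.description.lower() not in text]
```

In production the description-matching heuristic would be replaced by a trained coder model or payer-specific rules, but the shape (constrained schema in, deterministic validation out) is the pattern the stack description implies.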
Hard parts: clinician edit rates are the metric that actually matters, not BLEU scores. Specialty coverage is non-trivial because peds, surgery, and emergency medicine each have their own conventions. Hallucinated medications or diagnoses creeping into a chart is a patient-safety event, not a bug ticket.
2. Voice agents for intake, scheduling, and follow-up
Hippocratic AI is the category leader and has moved fast. They closed a $126M Series C in November 2025 at $3.5B valuation, and have completed 115M+ patient interactions with no reported safety issues across 50+ health systems and payers in six countries. Notable Health, Infinitus, and Decagon all have meaningful healthcare deployments too.
Stack: real-time ASR plus a low-latency LLM (sub-second round-trip is the target) plus TTS, with a constitutional safety layer to prevent hallucinations on clinical content. Telephony lives on Twilio or Retell. The agent reads from and writes back to the EHR or CRM.
Hard parts: latency. The voice industry treats 800ms total round-trip as the threshold where conversation feels natural. That budget evaporates fast once you add safety checks, RAG, and tool calls. Accent and dialect handling matters more than in text. Several states now require disclosing that the caller is an AI agent.
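The budget math is worth making concrete. A back-of-envelope sketch against the 800ms threshold, with illustrative per-stage numbers:

```python
# Rough latency budget for one voice turn. The 800 ms threshold comes from
# the text above; the per-stage figures are illustrative assumptions.
BUDGET_MS = 800

def remaining_budget(stages: dict) -> int:
    """Headroom left for the LLM's time-to-first-token after fixed stages."""
    return BUDGET_MS - sum(stages.values())

stages = {
    "asr_final_transcript": 150,  # streaming ASR endpointing
    "safety_classifier": 80,      # constitutional check on the turn
    "rag_retrieval": 120,         # chart / knowledge-base lookup
    "tts_first_audio": 150,       # time-to-first-byte of synthesized speech
}
# Leaves roughly 300 ms for generation -- which is why streaming at every
# stage and a low-latency model are non-negotiable here.
```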
3. Clinical decision support and medical Q&A
The breakout story of 2026. OpenEvidence closed a $250M Series D in January 2026 at a $12B valuation, with backers including Google Ventures, Nvidia, Sequoia, Blackstone, and Mayo Clinic. As of late 2025 they had 760K registered US physicians and roughly 18M consultations per month, with a NEJM and JAMA licensed corpus and a Sutter Health partnership to embed the answer engine inside Epic. Glass Health does narrower differential-diagnosis generation. iatroX and Medwise serve the UK and EU markets.
Stack: RAG over peer-reviewed literature with citation-required generation, frontier LLMs (GPT-5 class), and strict provenance checks. Teams typically run a retrieval pipeline against a curated medical corpus, then a generation step constrained to cite the retrieved passages.
Hard parts: hallucinated citations are the recurring failure mode. Outdated evidence is the next one, since recommendations change and the model can confidently quote a guideline that has been revised. The line between an "answer engine" and an FDA-regulated CDS device is fuzzy and matters for go-to-market.
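A citation-required pipeline needs a provenance gate between generation and the user. A minimal sketch, assuming numeric bracket markers like [1] (the marker format is an assumption, not any vendor's convention):

```python
# Strict provenance check: every citation marker in the answer must point
# at a passage the retriever actually returned for this query.
import re

def unsupported_citations(answer: str, retrieved_ids: set) -> set:
    """Return citation indices the model invented."""
    cited = {int(m) for m in re.findall(r"\[(\d+)\]", answer)}
    return cited - retrieved_ids

answer = "First-line therapy is X [1], per the 2024 guideline [3]."
# The retriever returned passages 1 and 2; the model fabricated [3],
# so the answer should be blocked or regenerated.
```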
4. Medical imaging analysis
In January 2026, Aidoc got FDA clearance for CARE, the first comprehensive multi-condition foundation model for CT triage, covering 14 acute findings on a single abdominal CT at 97% mean sensitivity and 98% specificity. That marks a real shift from single-condition CNNs to multi-task foundation models. Rad AI, Viz.ai, and PathAI cover reporting, stroke triage, and digital pathology respectively.
Stack: vision foundation models (CARE, RadFM, MedSAM derivatives), DICOM pipelines, PACS integration, and an FDA 510(k) for every cleared indication. Reimbursement is improving but slow, with builders relying on CPT category III codes and NTAP add-on payments.
Hard parts: generalization across scanner vendors and protocols remains brittle. Alert fatigue is a real clinical concern when triage models flag too aggressively. The regulatory burden of submitting and maintaining clearances per indication is non-trivial, which is exactly why Aidoc's multi-condition approach matters.
5. Prior authorization and claims
The $31B/year prior-auth burden is one of the highest-ROI applications of LLMs in healthcare, and also the one under the most regulatory scrutiny. Cohere Health, Humata Health (now embedded in Optum's Digital Auth Complete), Rhyme, and Innovaccer are the active players. Optum and other payer-side teams use ML for utilization review and denial workflows.
Stack: LLM extraction from clinical notes, structured matching against payer requirement schemas, RAG over payer policy documents, and automated form generation.
Hard parts: regulation is the headwind. Texas SB 815 (effective January 1, 2026) and similar laws in California, Arizona, Maryland, and Nebraska prohibit AI as the sole decision-maker for medical-necessity denials. CMS's WISeR Model, also live January 1, 2026 in six states, requires human clinical review. The CMS electronic prior-auth mandates for Medicare Advantage and Medicaid coming online in 2026 are a tailwind for builders, but compliance architecture has to assume human-in-the-loop and full auditability from day one.
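That human-in-the-loop requirement translates into a hard routing rule, not a feature flag. A sketch with illustrative names and thresholds:

```python
# Compliance-shaped routing: the model may recommend, but a denial can
# never be auto-finalized. Field names and thresholds are illustrative.
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class AuthDecision:
    recommendation: str        # "approve" or "deny"
    confidence: float
    policy_citations: list     # payer policy passages the model matched

def route(decision: AuthDecision) -> dict:
    """Every decision produces an audit record; only high-confidence
    approvals skip the human reviewer."""
    record = {
        "recommendation": decision.recommendation,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "policy_citations": decision.policy_citations,
    }
    if decision.recommendation == "approve" and decision.confidence >= 0.95:
        record["disposition"] = "auto_approved"
    else:
        # Denials and low-confidence approvals always get clinical review.
        record["disposition"] = "human_review"
    return record
```

The audit record existing for approvals too is the point: the state laws and the WISeR Model both assume you can reconstruct every decision path after the fact.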
6. Drug discovery
Different shape from the others, longer time horizons, larger capital requirements. Insilico Medicine posted positive Phase IIa data on Rentosertib for IPF (+98.4 mL FVC vs −20.3 mL placebo over 12 weeks, published in Nature Medicine), with an Eli Lilly partnership. Isomorphic Labs showed their IsoDDE engine beats AlphaFold 3 by more than 2x on protein-ligand benchmarks and has Eli Lilly, Novartis, and J&J collaborations, though no compounds in clinic yet. Recursion is deeper into clinical readouts but had its lead REC-994 discontinued in May 2025.
Stack: AlphaFold derivatives, diffusion models for molecule generation, foundation models for cell imaging (Recursion's Phenom-2), reinforcement learning for synthesis routes, plus large GPU clusters.
Hard parts: 2026 is the year of truth as multiple AI-discovered drugs hit Phase II/III readouts. Even wins do not return capital quickly. The translational gap between a model that proposes a molecule and a drug that holds up in humans remains real.
7. Mental health support
The space has consolidated hard. Wysa holds an FDA Breakthrough Device Designation and has published RCTs in JMIR for chronic pain, depression, and anxiety. Woebot has 14 RCTs and a Breakthrough Device for postpartum depression, but shut down its direct-to-consumer app and pivoted to enterprise and health plans only. Youper and Limbic compete in similar lanes.
Stack: hybrid by necessity. Scripted CBT modules handle the safety-critical paths, with safety-filtered LLMs covering free-form turns, plus strict suicidality detection and escalation. Validated symptom scales like PHQ-9 and GAD-7 anchor evaluation.
Hard parts: the FDA pathway for fully LLM-based therapy is not yet defined. The Tessa/NEDA shutdown chilled the consumer side. The economics work on enterprise and health-plan contracts, not direct-to-consumer.
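The hybrid stack implies a router that runs before any model call. A sketch in which the keyword check stands in for a trained risk classifier (the terms and route names are illustrative):

```python
# Hybrid routing: high-risk turns never reach the LLM -- they go to a
# scripted escalation path with human handoff. The keyword set is a
# placeholder for a trained suicidality classifier.
CRISIS_TERMS = {"suicide", "kill myself", "end my life", "self-harm"}

def route_turn(user_message: str) -> str:
    text = user_message.lower()
    if any(term in text for term in CRISIS_TERMS):
        return "crisis_escalation"   # scripted path + human, never the model
    if "exercise" in text or "module" in text:
        return "scripted_cbt"        # validated, deterministic CBT content
    return "safety_filtered_llm"     # free-form turns only
```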
How Respan fits
Builders shipping ambient scribes, voice intake agents, and clinical decision support have the same substrate problem: every clinician edit, every safety-classifier hit, and every BAA-bound model call has to be observable, evaluable, and routable. Respan is the layer underneath Abridge-style scribes, Hippocratic-style voice agents, and OpenEvidence-style answer engines that turns those requirements into infrastructure.
- Tracing: every patient encounter captured as one connected trace. Auto-instrumented for LangChain, LlamaIndex, Vercel AI SDK, CrewAI, AutoGen, OpenAI Agents SDK. When a clinician flags a hallucinated medication or a SOAP note that misattributed a chief complaint, you need to see the ASR transcript, the retrieved chart context, the prompt version, and the model response in one timeline rather than spelunking across five dashboards.
- Evals: ten built-in evaluators (faithfulness, citation accuracy, refusal correctness, harmfulness) plus LLM-as-judge and custom Python evaluators. Production traffic flows directly into datasets. CI-aware experiments block regressions on hallucinated medications, missed diagnoses, and hallucinated billing codes before deploys ship.
- Gateway: 500+ models behind an OpenAI-compatible interface, semantic caching, fallback chains, per-customer spending caps. Route PHI traffic only to BAA-tier endpoints (OpenAI Healthcare tier, Azure OpenAI, AWS Bedrock, Anthropic, Google Vertex) with redaction policies enforced at the gateway so a misconfigured downstream call cannot leak protected fields to a non-BAA model.
- Prompt management: versioned registry, dev/staging/prod environments with approval workflows, A/B testing in production with one-click rollback. SOAP note templates, ICD-10 and CPT coding prompts, and clinician handoff scripts belong in the registry so a clinical reviewer can tighten guidance without a code deploy and so an auditor can see the exact prompt that produced any given note.
- Monitors and alerts: clinician edit rate, hallucination rate, P95 latency for voice agents, cost per encounter, BAA compliance signals. Slack, email, PagerDuty, webhook. Latency drift on a voice intake agent past the 800ms conversational threshold should page on-call before patients notice the awkward pauses.
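The gateway's PHI-routing guarantee can be sketched as policy code. The endpoint names and config shape below are illustrative assumptions, not Respan's actual API:

```python
# Gateway-side policy sketch: PHI traffic can only land on a BAA-covered
# endpoint, even if the caller asks for something else. Names are illustrative.
BAA_TIER = {
    "openai-healthcare", "azure-openai", "aws-bedrock",
    "anthropic", "google-vertex",
}

def select_endpoint(contains_phi: bool, preferred: str,
                    fallback: str = "azure-openai") -> str:
    """Honor the caller's preference unless it would leak PHI."""
    if contains_phi and preferred not in BAA_TIER:
        return fallback
    return preferred
```

Enforcing this at the gateway rather than in each application means a misconfigured downstream call degrades to a compliant fallback instead of a breach.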
A reasonable starter loop for healthcare AI builders:
- Instrument every LLM call with Respan tracing including ASR, retrieval, FHIR write-back, and safety-classifier spans.
- Pull 200 to 500 production patient encounters into a dataset and label them for clinician edit rate, hallucinated facts, and billing code accuracy.
- Wire two or three evaluators that catch the failure modes you most fear (hallucinated medications, missed diagnoses, fabricated lab values).
- Put your SOAP, coding, and clinician handoff prompts behind the registry so you can version, A/B, and roll back without a deploy.
- Route through the gateway so PHI redaction, BAA-compliant model selection, and per-tenant spending caps live in one place.
Skip this loop in a HIPAA setting and the failure mode is not a bad demo; it is a patient-safety event in a chart that a regulator can subpoena.
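The evaluator step of that loop can be made concrete for the failure mode named first, hallucinated medications. A sketch in which a tiny lexicon stands in for a real drug vocabulary like RxNorm:

```python
# Negative evaluator sketch: any medication named in the generated note
# that never appears in the transcript is a candidate hallucination.
# The lexicon is illustrative; production would use RxNorm.
MED_LEXICON = {"metformin", "lisinopril", "warfarin", "amoxicillin"}

def hallucinated_meds(transcript: str, note: str) -> set:
    t, n = transcript.lower(), note.lower()
    meds_in_note = {m for m in MED_LEXICON if m in n}
    return {m for m in meds_in_note if m not in t}
```

Substring matching like this misses brand-name/generic aliasing and negation ("discontinue warfarin"), which is exactly what the labeled dataset from step two is for.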
To wire any of the patterns above on Respan, start tracing for free, read the docs, or talk to us.
What is different about shipping AI in healthcare
A few things change the architecture in ways that are easy to underestimate.
HIPAA and BAAs. Any LLM provider that processes PHI is a Business Associate. OpenAI (their Healthcare tier launched January 2026), Azure OpenAI, AWS Bedrock, Google Vertex, and Anthropic all offer BAAs. But "HIPAA-eligible" is not "HIPAA-compliant," and that distinction is where most builders trip. The BAA covers the infrastructure. You are still responsible for de-identification, audit logging, prompt-injection defenses, and ensuring no PHI leaks via system prompts, evals, or logs.
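Where a pre-call redaction pass sits is worth showing, even though regex alone is nowhere near Safe Harbor de-identification. A minimal sketch with illustrative patterns; real systems layer NER over the full 18-identifier list:

```python
# Minimal pre-call redaction sketch: scrub obvious identifiers before a
# prompt leaves your boundary. Patterns are illustrative, not exhaustive.
import re

PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "MRN": re.compile(r"\bMRN[:\s]*\d{6,10}\b", re.IGNORECASE),
}

def redact(text: str) -> str:
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```

The placement is the lesson: this runs on prompts, eval payloads, and logs alike, because the BAA covering your model endpoint does nothing for PHI that leaks through your own telemetry.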
FDA AI/ML guidance. The defining document is the FDA's December 2024 final guidance on Predetermined Change Control Plans (PCCPs). PCCPs let you pre-specify allowable model updates in your 510(k) submission, so you can refresh training data and retrain without filing a new submission for every change. If you are building anything that meets the SaMD definition, designing your eval and monitoring stack around a PCCP from day one will save you 12 to 18 months of regulatory pain.
EU AI Act. AI-enabled medical devices are categorically high-risk. The Digital Omnibus has pushed enforcement of the high-risk obligations to December 2027 for standalone systems and August 2028 for embedded medical-device AI, but the AI literacy obligation under Article 4 is enforceable August 2, 2026 regardless. If you sell into the EU, your team needs documented AI literacy training this year.
State laws on payer decisions. California SB 1120 (effective 2025), Texas SB 815 (effective 2026), and similar laws in Arizona, Maryland, and Nebraska prohibit AI as the sole decision-maker in medical-necessity denials. If you are building anything that touches utilization review, your architecture needs human-in-the-loop checkpoints with full audit trails as a load-bearing requirement, not a feature flag.
Hallucination tolerance is not a knob. A 1.47% hallucination rate sounds low until it shows up in 1 of every 70 charts. The mitigation patterns that work in production are constrained generation (force the model to cite retrieved passages), negative evaluators that hunt for unsupported claims, and shadow-traffic clinical review queues that catch what automated evals miss.
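Of the three mitigation patterns, the shadow-traffic review queue is the simplest to wire. A sketch with illustrative sampling rates:

```python
# Shadow-traffic sampling sketch: evaluator-flagged notes always go to a
# clinician review queue; clean notes are sampled at a low base rate so
# the queue also catches what automated evals miss. Rates are illustrative.
import random

def should_review(eval_flags, base_rate=0.02, rng=None):
    """Decide whether a production note enters the clinical review queue."""
    if eval_flags:          # e.g. ["hallucinated_med", "unsupported_claim"]
        return True
    rng = rng or random.Random()
    return rng.random() < base_rate
```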
Where to start
If you are building in healthcare, the smallest end-to-end loop that proves the system is worth investing in looks like this.
- Instrument every LLM call with Respan tracing, including the retrieval and tool-call spans.
- Pull 200 to 500 production cases into a dataset and have a clinician label them.
- Wire two or three evaluators that catch the failure mode you most fear (hallucinated meds, missed contraindications, wrong section in the note).
- Put your prompts behind the registry so you can version, A/B, and roll back without a deploy.
- Route through the gateway so you can switch models when GPT-6 ships or a competitor drops their price.
That loop, running on your real traffic, is the difference between a demo and a system you can defend in an FDA submission or a hospital security review.
Start tracing for free. Read the docs. Talk to us if you are building in healthcare and want a hand wiring up the eval stack.
