Legal AI is the rare vertical where the architecture has both stabilized and become economically interesting at the same time. Harvey closed a $200 million round at an $11 billion valuation in March 2026, putting it among the small set of AI startups past the ten-billion mark. Spellbook is in use across 4,000 legal teams in 80 countries. EvenUp's Series B at $350 million backs its personal injury demand letter product, now in use at 200+ law firms. ContractPodAi rebranded its platform to Leah, and Ironclad shipped Jurist agents into AmLaw 100 firms. Total legal tech funding crossed $2.4 billion in 2025.
At the same time, early 2026 brought a court sanctions wave: roughly $145,000 in penalties for AI-fabricated citations, a record $110,000 single sanction in Oregon, the first attorney license suspension in Nebraska, and a Sullivan & Cromwell apology to a federal bankruptcy judge in April. The Charlotin database now lists 1,353 incidents globally, and the pace is accelerating.
The combination matters because it reflects the real state of legal AI engineering: capital is flowing, products are landing in production, and the engineering loop that catches failures is still under construction at most firms. The teams that build the loop right are the ones that get to keep the deals. The teams that ship demos and call it a day end up on the next sanctions list.
This post is the engineering view. It covers the five architectural patterns that legal AI products converge on in 2026, who is using them, what the hard parts are, and where the loop usually breaks. It is written for the engineers and PMs who build legal AI, not the lawyers who use it.
The market in one paragraph
By mid-2026, the legal AI market has structured into four distinct product shapes. General-purpose legal copilots like Harvey, CoCounsel, and Spellbook augment the lawyer's existing workflow with research, drafting, and review. Workflow-embedded AI like Ironclad, Lexion, and ContractPodAi's Leah layers AI features on top of contract or matter management platforms. AI-native specialists like EvenUp (personal injury), DraftWise (precedent search), and Lex Machina (litigation analytics) target specific practice areas with deep verticalization. In-house counsel platforms like GC AI sit between these shapes, optimized for the corporate legal department's mix of contracts, research, and matter memory. Each shape has its own architecture, its own eval target, and its own competitive moat. Engineers building for this market need to know which shape they are building.
Pattern 1: RAG over case law and statutes
The foundation pattern. Every legal AI product that touches research or drafting includes some form of retrieval over a corpus of legal authority: case law, statutes, regulations, treatises, and the firm's own work product.
The naive implementation is a vector database over the corpus, top-k retrieval by embedding similarity, generation conditioned on the retrieved context. This produces the demo. It also produces the hallucinations Stanford's RegLab measured in 2024: even with this RAG architecture, Lexis+ AI and Westlaw AI-Assisted Research hallucinate on more than 1 in 6 queries.
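For contrast, here is roughly what that naive loop looks like in code. This is a sketch only; `vector_store.search` and `llm_complete` are placeholders for whatever embedding store and model client you actually use.

```python
# Naive RAG: top-k similarity retrieval, then generation over the raw context.
# `vector_store` and `llm_complete` are placeholders, not real library calls.

def naive_legal_rag(question: str, vector_store, llm_complete, k: int = 8) -> str:
    chunks = vector_store.search(question, top_k=k)          # similarity only
    context = "\n\n".join(chunk["text"] for chunk in chunks)
    prompt = (
        "Answer the legal research question using only the sources below.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )
    # Nothing here filters by jurisdiction, verifies that cited cases exist,
    # or checks that the cases support the propositions -- which is where the
    # 1-in-6 hallucination rate comes from.
    return llm_complete(prompt)
```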
What distinguishes production-grade legal RAG from the naive version:
Hybrid retrieval with jurisdictional filtering. Relevant authority is matched to a legal question by reasoning about the fact pattern, not by surface similarity to the question text. Production systems combine BM25 keyword search with dense embedding retrieval, then filter by the matter's jurisdiction (a 9th Circuit case is not relevant for a 2nd Circuit filing), then rerank by recency and authority weight.
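A sketch of that retrieval step, assuming the corpus is indexed twice (BM25 and dense) and each hit carries `jurisdiction`, `court_level`, and `decided_year` metadata; the field names and weights are illustrative, not recommendations.

```python
def hybrid_retrieve(query: str, jurisdiction: str, bm25_index, dense_index,
                    k: int = 50, final_k: int = 10) -> list[dict]:
    # 1. Run both retrievers and merge scores per document.
    merged: dict[str, dict] = {}
    for hit in bm25_index.search(query, top_k=k) + dense_index.search(query, top_k=k):
        doc = merged.setdefault(hit["doc_id"], {**hit, "score": 0.0})
        doc["score"] += hit["score"]

    # 2. Hard jurisdictional filter: a 9th Circuit case does not belong in the
    #    retrieval set for a 2nd Circuit filing, however similar it looks.
    in_scope = [d for d in merged.values()
                if d["jurisdiction"] in (jurisdiction, "US Supreme Court")]

    # 3. Rerank by authority weight, then recency.
    authority = {"supreme": 3.0, "appellate": 2.0, "trial": 1.0}
    in_scope.sort(key=lambda d: (authority.get(d["court_level"], 1.0),
                                 d["decided_year"]),
                  reverse=True)
    return in_scope[:final_k]
```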
Character-level citation grounding. Output sentences are constrained to point to specific spans in retrieved sources. This is what GC AI markets as "character-level citation" and what Harvey's Vault implements as part of its document workflows. The implementation forces the model to emit citation tokens that index into the retrieval set; a post-processing step validates each citation maps to a real source location.
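A minimal sketch of the post-processing validation, assuming the model was prompted to emit markers like `[S2:120-245]` meaning "source 2, characters 120 through 245"; the marker format is an assumption, not a standard.

```python
import re

CITE = re.compile(r"\[S(?P<src>\d+):(?P<start>\d+)-(?P<end>\d+)\]")

def validate_citation_spans(answer: str, sources: list[str]) -> list[dict]:
    """One record per citation token; flags tokens that do not map to a real
    character span in the retrieval set."""
    checks = []
    for m in CITE.finditer(answer):
        src, start, end = int(m["src"]), int(m["start"]), int(m["end"])
        valid = 0 <= src < len(sources) and 0 <= start < end <= len(sources[src])
        checks.append({
            "token": m.group(0),
            "valid": valid,
            # Keep the grounded text so the alignment judge (next step) can
            # compare it against the sentence that cites it.
            "span_text": sources[src][start:end] if valid else None,
        })
    return checks
```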
Citation existence and alignment verification. After generation, every citation runs through a two-stage check. Existence: does the cited case actually exist (validated against Westlaw, Lexis, CourtListener, or the firm's own indexed corpus)? Alignment: does the case actually support the asserted proposition (validated by an LLM-as-judge against the case text)?
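A sketch of the two-stage check as a single function; `lookup_case` (CourtListener, Westlaw, Lexis, or your own indexed corpus) and `judge_alignment` (an LLM-as-judge call) are placeholders for whatever clients you actually use.

```python
from dataclasses import dataclass

@dataclass
class CitationCheck:
    citation: str
    proposition: str
    exists: bool
    supports: bool | None   # None when the case does not exist at all

def check_citation(citation: str, proposition: str,
                   lookup_case, judge_alignment) -> CitationCheck:
    # Stage 1: existence. A fabricated case fails here and never reaches stage 2.
    case_text = lookup_case(citation)               # -> full text or None
    if case_text is None:
        return CitationCheck(citation, proposition, exists=False, supports=None)

    # Stage 2: alignment. The case is real, but does it support the proposition?
    verdict = judge_alignment(proposition=proposition, case_text=case_text)
    return CitationCheck(citation, proposition, exists=True,
                         supports=bool(verdict["supports"]))
```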
Continuous eval capture. Every citation rejected by a lawyer in production becomes an entry in the eval set. Every hallucination caught becomes a regression test. The eval set evolves, which means the regression catch rate improves over time rather than degrades.
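One way to make that capture concrete, assuming a plain JSONL file as the dataset store (any dataset tooling works):

```python
import json, time

def capture_rejection(trace_id: str, question: str, rejected_citation: str,
                      lawyer_note: str,
                      path: str = "citation_regressions.jsonl") -> None:
    # Each rejected citation becomes a permanent regression case, keyed back to
    # the trace so the failing run can be replayed against future pipelines.
    case = {
        "trace_id": trace_id,
        "input": question,
        "bad_citation": rejected_citation,
        "expected": "citation must exist and support the proposition",
        "annotation": lawyer_note,        # ground truth comes from the lawyer
        "captured_at": int(time.time()),
    }
    with open(path, "a") as f:
        f.write(json.dumps(case) + "\n")
```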
The teams that ship this pattern at scale (Harvey, GC AI, CoCounsel) all have a domain-specific evaluation benchmark behind it. Harvey publicly noted in 2025 that they scrapped their proprietary fine-tuned legal model after frontier models started outperforming it on their own internal BigLaw Bench. The benchmark mattered more than the custom model.
For the engineering depth on this pattern, see Why Legal AI Still Hallucinates Citations and Building a Citation Grounding Eval for Legal AI.
Pattern 2: Long-horizon agents for transactional workflows
The 2025-to-2026 shift in legal AI was the move from single-shot completions to agentic workflows. Harvey reports more than 25,000 custom agents running on its platform, processing 400,000+ agentic queries per day. These are not chat completions; they are multi-step workflows that execute over minutes or hours and complete tasks like fund formation, M&A due diligence, and SEC filing preparation.
The architectural pattern has converged on a recognizable shape:
Workflow specification layer. A no-code or low-code interface where users (Harvey calls them "legal engineers"; in-house teams might call them legal ops) define the workflow steps. Harvey's Agent Builder, launched March 2026, is the canonical reference. Spellbook Associate competes in adjacent territory, Ironclad Jurist on the workflow management side.
Step execution with state management. Each step runs an LLM call (or a tool call, or a retrieval), passes structured output to the next step, and persists the full state. State management matters because long-horizon agents need to be pausable and resumable; due diligence on a $500M acquisition can run for days, with human review checkpoints in between.
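A sketch of what pausable, resumable step execution can look like, using SQLite purely for illustration; the step signature (a callable that takes and returns the accumulated state) is an assumption.

```python
import json
import sqlite3

def run_workflow(matter_id: str, steps: list, db_path: str = "agent_state.db") -> dict:
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS state "
                "(matter_id TEXT PRIMARY KEY, next_step INTEGER, payload TEXT)")
    row = con.execute("SELECT next_step, payload FROM state WHERE matter_id = ?",
                      (matter_id,)).fetchone()
    next_step, payload = (row[0], json.loads(row[1])) if row else (0, {})

    for i in range(next_step, len(steps)):
        # Each step is an LLM call, tool call, or retrieval that returns
        # structured output; the full state is persisted after every step.
        payload = steps[i](payload)
        con.execute("REPLACE INTO state VALUES (?, ?, ?)",
                    (matter_id, i + 1, json.dumps(payload)))
        con.commit()
        if payload.get("awaiting_review"):
            break            # human checkpoint: pause here, resume after approval
    con.close()
    return payload
```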
Tool integration. The agents call external tools: document search, contract repository queries, public records lookup, financial filings databases, calendar systems, e-signature workflows. Each tool call is a span in the trace, with input, output, and timing.
Human review checkpoints. Long-horizon agents have explicit pause points where a senior lawyer reviews and approves before the agent continues. This is supervisory responsibility under ABA 512 (more on this in ABA Formal Opinion 512 for Engineers), but it is also product-grade quality control: the agent is more accurate when senior lawyers shape its mid-execution decisions.
Audit trail per matter. Every step, every tool call, every model output, every human edit, all attributed to the matter and queryable later. This is what allows the firm to answer "how was AI used on the Acme acquisition" with a complete reconstruction rather than a shrug.
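A sketch of the reconstruction query, assuming the audit events were written to a table keyed by matter (the schema is illustrative):

```python
import sqlite3

COLUMNS = ["ts", "actor", "event_type", "prompt_version", "model", "detail"]

def reconstruct_matter(matter_id: str, db_path: str = "audit.db") -> list[dict]:
    con = sqlite3.connect(db_path)
    rows = con.execute(
        f"SELECT {', '.join(COLUMNS)} FROM events "
        "WHERE matter_id = ? ORDER BY ts",
        (matter_id,),
    ).fetchall()
    con.close()
    # This is the query behind "how was AI used on the Acme acquisition":
    # every step, tool call, model output, and human edit in order.
    return [dict(zip(COLUMNS, row)) for row in rows]
```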
The architecture is straightforward to describe and hard to ship. The hard parts are state management at scale (Harvey's reported 9.75M active files implies a serious infrastructure investment), prompt versioning (a prompt change in step 3 of a 12-step agent affects all downstream behavior, and you need to know which version ran on which matter), and eval at the workflow level rather than the step level (a workflow can have every step pass eval and still produce a wrong final output).
Pattern 3: Playbook enforcement for contract review
Contract review is the highest-volume legal AI workflow. Spellbook, GC AI, Ivo, LegalOn, LEGALFLY, Definely, Gavel Exec, and Luminance Eve all compete for the in-house counsel buyer; Harvey, Spellbook, and Ironclad compete on the firm side. The architecture is similar across them.
Document parsing with structure preservation. A DOCX or PDF parser that preserves headings, numbered sections, defined terms, and cross-references. Flattening to plain text loses the structural cues clause segmentation depends on.
Clause segmentation and classification. Each clause in the counterparty draft is identified, extracted, and classified into a taxonomy (indemnification, governing law, IP assignment, limitation of liability, etc.). The classification step is a small fine-tuned model or a few-shot LLM prompt; it has to be accurate because everything downstream is conditioned on it.
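A sketch of the few-shot variant of that classifier; the taxonomy is truncated for illustration and `llm_complete` is a placeholder for your model client.

```python
CLAUSE_TYPES = ["indemnification", "governing_law", "ip_assignment",
                "limitation_of_liability", "confidentiality", "payment_terms"]

FEW_SHOT = '''Clause: "Each party shall defend, indemnify, and hold harmless the other..."
Type: indemnification

Clause: "This Agreement shall be governed by the laws of the State of Delaware."
Type: governing_law
'''

def classify_clause(clause_text: str, llm_complete) -> str:
    prompt = (
        f"Classify the clause into exactly one of: {', '.join(CLAUSE_TYPES)}.\n\n"
        f'{FEW_SHOT}\nClause: "{clause_text}"\nType:'
    )
    label = llm_complete(prompt).strip().lower().replace(" ", "_")
    # Everything downstream (playbook lookup, per-clause analysis) is keyed on
    # this label, so fail loudly rather than silently misroute.
    return label if label in CLAUSE_TYPES else "unclassified"
```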
Playbook retrieval, structured. For each classified clause, retrieve the matching playbook entry. The playbook is a clause-type-keyed structure containing preferred language, acceptable variations, unacceptable patterns, and risk levels. Spellbook's Compare to Market feature backs this with a database of 200,000+ executed agreements; a new product can seed with design partner contracts.
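A sketch of the clause-type-keyed structure; field names are illustrative, and real playbooks usually carry more metadata (fallback positions, approval thresholds, owner).

```python
from dataclasses import dataclass, field

@dataclass
class PlaybookEntry:
    clause_type: str                              # e.g. "limitation_of_liability"
    preferred_language: str                       # the house-standard clause text
    acceptable_variations: list[str] = field(default_factory=list)
    unacceptable_patterns: list[str] = field(default_factory=list)
    risk_level: str = "medium"                    # low / medium / high

PLAYBOOK: dict[str, PlaybookEntry] = {}           # keyed by clause type

def get_playbook_entry(clause_type: str) -> PlaybookEntry | None:
    # The "retrieval" here is a keyed lookup, not a similarity search: the
    # classification step already decided which entry applies.
    return PLAYBOOK.get(clause_type)
```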
Per-clause analysis with citation grounding. The model takes the clause text and the playbook context, produces a structured analysis (issues, risk level, redline, rationale). Every issue and every redline cites a specific section of the playbook (character-level, not document-level).
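The output shape matters as much as the prompt. A sketch of the structured analysis, with the playbook citation carried as a character span so each issue traces to the exact playbook language (the field names are assumptions):

```python
from dataclasses import dataclass

@dataclass
class ClauseIssue:
    description: str
    risk_level: str                      # low / medium / high
    suggested_redline: str
    rationale: str
    playbook_span: tuple[int, int]       # character span in the playbook entry

@dataclass
class ClauseAnalysis:
    clause_type: str
    issues: list[ClauseIssue]            # empty list means the clause is clean
```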
Cross-clause consistency check. A second pass catches conflicts between clauses: governing law in California with arbitration in New York; "Confidential Information" defined narrowly in one section but used broadly in indemnification; payment terms inconsistent between the body and the SOW. Rule-based checks catch the well-known patterns; an LLM pass catches novel conflicts.
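A sketch of one rule-based check from that pass, assuming the per-clause analyses have already extracted the governing-law state and the arbitration venue (the dictionary shape is an assumption):

```python
def check_forum_mismatch(extracted: dict[str, dict]) -> list[str]:
    findings = []
    governing = extracted.get("governing_law", {}).get("state")
    venue = extracted.get("arbitration", {}).get("venue_state")
    if governing and venue and governing != venue:
        findings.append(
            f"Governing law is {governing} but the arbitration venue is {venue}; "
            "confirm the split is intentional."
        )
    return findings
```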
Word-native or platform-native surface. Spellbook lives in Microsoft Word as an add-in, which is where most lawyers draft. Ironclad lives in its CLM platform. GC AI is web-based but integrates with Word via a plugin. The surface choice is strategic: tools that force lawyers to leave their existing workflow get adopted slowly. Tools that meet lawyers where they already work get adopted fast.
The full code-level walkthrough of this pattern is in Building an AI Contract Review Agent.
Pattern 4: Embedded surfaces and Microsoft 365 integration
A pattern that crystallized in 2025 and is the strategic battleground for 2026: legal AI inside Microsoft Word, Outlook, Teams, and increasingly Microsoft 365 Copilot itself.
Harvey's Microsoft 365 Copilot integration, announced for Q2 2026, lets lawyers invoke Harvey from inside Copilot for agreement analysis. Spellbook is Word-native by design. Definely lives in the same surface. The bet across these products is that lawyers will not adopt a separate application; they will adopt the AI that shows up in the application they already use.
Engineering implications:
Add-in architecture. Office add-ins have specific constraints (document model limits, manifest requirements, sideloading versus AppSource distribution, sandbox limitations). The team building this needs Office add-in expertise, which is not the same skill set as building a SaaS web app.
Copilot extension architecture. Microsoft 365 Copilot extensions follow a different model (Microsoft Graph integration, plugin manifest, Teams app packaging). Building for both the Word add-in and Copilot extension surfaces effectively means two products with a shared backend.
Latency budget. Embedded surfaces have lower user tolerance for latency than dedicated apps. A 30-second contract review is acceptable in a standalone web app where the lawyer expects to wait; in a Word add-in, that 30 seconds reads as the tool being broken. Streaming results progressively, parallelizing per-clause analysis, and caching aggressively are required, not optional.
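A sketch of the parallel, streaming shape this usually takes, assuming an async model client; `analyze_clause` and `on_result` are placeholder coroutines (the second pushes each finished clause to the add-in UI immediately).

```python
import asyncio

async def review_contract(clauses: list[str], analyze_clause, on_result,
                          max_concurrency: int = 8) -> None:
    sem = asyncio.Semaphore(max_concurrency)      # stay inside provider rate limits

    async def run_one(idx: int, text: str) -> None:
        async with sem:
            result = await analyze_clause(text)   # per-clause LLM analysis
        await on_result(idx, result)              # stream to the UI as it lands

    # Clauses run concurrently, so perceived latency is the slowest clause,
    # not the sum of all of them.
    await asyncio.gather(*(run_one(i, clause) for i, clause in enumerate(clauses)))
```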
Authentication and tenant isolation. Office add-ins often inherit identity from the user's Microsoft 365 tenant, which is good for SSO but requires careful security work around tenant-scoped data isolation. A multi-tenant SaaS that uses tenant identity as the authentication primitive needs to enforce data isolation at every storage layer.
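A sketch of what "enforce at every storage layer" means in practice: the tenant filter lives in one chokepoint the rest of the code cannot skip, and the tenant ID comes from the validated identity token, never from a client-supplied parameter.

```python
def fetch_matter_documents(db, tenant_id: str, matter_id: str) -> list[tuple]:
    # tenant_id is taken from the verified Microsoft 365 identity token upstream;
    # every table carries it, and every query filters on it.
    return db.execute(
        "SELECT id, title, text FROM documents "
        "WHERE tenant_id = ? AND matter_id = ?",
        (tenant_id, matter_id),
    ).fetchall()
```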
The strategic trade-off worth thinking about: building an embedded surface drives adoption but constrains differentiation. Spellbook is differentiated partly by being Word-native; if every legal AI tool eventually becomes a Word add-in, that differentiation erodes. The current view from product strategists is that the embedded surface is the entry point and the standalone product is where the depth lives.
Pattern 5: Domain eval as the moat
The single biggest shift in legal AI engineering between 2024 and 2026 is the recognition that the eval is the moat, not the model.
Harvey's BigLaw Bench story is the canonical example. Through 2024 and early 2025, Harvey invested in a proprietary fine-tuned legal model, the assumption being that domain-specific fine-tuning would beat frontier models on legal tasks. By mid-2025, frontier models from OpenAI, Anthropic, and Google had improved fast enough that they outperformed Harvey's custom model on Harvey's own benchmark. Harvey scrapped the custom model, kept the benchmark, and shifted to model routing: customers can route tasks to Claude, Gemini, or OpenAI through a Model Selector. The benchmark is what tells Harvey which model is best for which task at any given time.
The lesson generalizes. The model layer is becoming a commodity that improves under your feet without your help; the eval layer is a permanent asset that compounds over time. A team that invests in a deep, domain-specific eval set built from production failures and lawyer annotations will have an advantage that does not erode when frontier models improve. A team that invests in fine-tuning a custom model will have a depreciating asset that gets surpassed by GPT-6 in eighteen months.
Practically, this means:
- Treat your eval set as a strategic asset. Version it, expand it deliberately, document what each case tests.
- Annotate with lawyers, not engineers. The annotations are the ground truth and they need legal judgment.
- Capture from production. The most valuable eval cases are the ones that broke in production, because they encode failure modes you would not have predicted.
- Run on every change. Prompt change, model upgrade, retrieval pipeline update. CI runs the eval, and a regression blocks the deploy (a minimal version of that gate is sketched after this list).
- Publish the metrics that matter. Harvey publishes BigLaw Bench results; Spellbook publishes accuracy on standard contract clauses. Public benchmarks are competence documentation that lawyers cite when their AI use gets challenged.
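A minimal sketch of that CI gate, assuming a JSONL eval set and a `run_case` callable that executes one case against the current pipeline and returns pass or fail; the threshold is illustrative.

```python
import json
import sys

def run_eval(eval_path: str, run_case, min_pass_rate: float = 0.97) -> None:
    with open(eval_path) as f:
        cases = [json.loads(line) for line in f]
    passed = sum(1 for case in cases if run_case(case))
    rate = passed / len(cases)
    print(f"{passed}/{len(cases)} eval cases passed ({rate:.1%})")
    if rate < min_pass_rate:
        sys.exit(1)          # a regression fails the build and blocks the deploy
```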
The depth on this pattern is in Building a Citation Grounding Eval for Legal AI.
Where the loop usually breaks
Across the products in this market, the most common engineering failure modes are predictable.
No tracing. A surprising number of legal AI products have inadequate observability. When a lawyer reports a bad output, the team cannot reconstruct what the system did. They cannot answer "did the retrieval find the right cases" or "which prompt version was running" or "did the alignment judge catch this and we missed it in review." Without tracing, every other engineering layer is operating blind.
Eval is a one-time exercise. A team builds a 100-case eval set in month one, ships the product, and never updates the eval. By month six, the eval represents a snapshot of the product's behavior six months ago and tells the team nothing about current state. Eval has to be a living asset, with continuous capture from production and regular lawyer annotation.
Confusing "RAG" with "grounded." Adding a retrieval step to the pipeline does not make the output grounded. Grounding requires character-level citation, citation existence verification, alignment verification, and a refusal pathway when the retrieval set does not support the question. Stanford RegLab's 1-in-6 hallucination rate on Lexis+ AI and Westlaw is what RAG without grounding looks like.
Underestimating the supervisory layer. Legal AI products that ship without explicit human review checkpoints work fine in demos and fail in deployment. ABA 512 supervisory obligations are not optional; they are how the firm justifies using the tool. Build review gates as a workflow primitive, not as an afterthought.
Overinvesting in the model. Custom fine-tunes, custom architectures, expensive proprietary training runs. In a world where frontier models improve every quarter, this is depreciating capital. The margin is in the data, the eval, the workflow shape, and the surface integration. Not the model.
What to expect in the next twelve months
A short list of trends that will affect engineering choices.
Agent Builder competition. Harvey's Agent Builder defined the category for no-code workflow construction. Spellbook, GC AI, and Ironclad will all ship competing agent builders within twelve months. The differentiation will be in the prebuilt workflow library and the depth of tool integrations, not the builder itself.
Microsoft 365 Copilot integration becomes standard. Every legal AI product will need a Copilot integration. Products without one will look limited compared to products with one.
Citation grounding becomes a procurement requirement. Firms will start requiring vendors to demonstrate character-level citation grounding and continuous eval as part of the security review. Vendors that cannot show this will lose deals.
State-by-state ethics fragmentation continues. Illinois, Florida, North Carolina, DC, and Kentucky have published AI ethics opinions. California, New York, and Texas are drafting. The variance creates work for vendors that sell nationally; products that build to the strictest interpretation win the most deals. Details are in ABA Formal Opinion 512 for Engineers.
EU AI Act compliance window. Full effective date August 2026. Legal AI is not high-risk under the Act, but transparency and record-keeping requirements apply. The audit trail you build for U.S. ABA compliance will mostly satisfy EU requirements; teams that built without audit trails will scramble.
Per-matter cost transparency. As firms move from hourly billing to flat-fee or alternative fee arrangements on AI-augmented work, they need accurate per-matter cost attribution. Vendors that provide this become preferred; vendors that bill at the seat-month level without per-matter detail get pushback in renewals.
How to get started
If you are starting a legal AI build today, the priorities in roughly the order I would tackle them:
- Tracing first. Without it, nothing else is debuggable. Build it before the first production deploy.
- Eval set early. Lawyer-annotated, 100 cases minimum, stratified across deal types or research types. The week you spend on this is the highest-impact engineering week of the project.
- One pattern, deep. Pick one of the five patterns above (RAG, agents, contract review, embedded surface, domain eval) and ship it well. Resist the temptation to build all five at once.
- Citation grounding for any pattern that touches research or drafting. Existence + alignment verification, not just retrieval.
- ABA 512 architecture. Audit trail, ZDR routing, redaction, supervisory checkpoints. These are not features to add later; they are the architecture.
The detailed engineering guides for each piece live in the spoke posts:
- Why Legal AI Still Hallucinates Citations
- Building a Citation Grounding Eval for Legal AI
- ABA Formal Opinion 512 for Engineers
- Building an AI Contract Review Agent
How Respan fits
The five patterns above (RAG over case law, long-horizon transactional agents, playbook-driven contract review, embedded Word and Copilot surfaces, domain eval as moat) all run on the same observability and evaluation substrate. Respan is that substrate: tracing, evals, gateway, prompts, and monitors wired for legal AI engineering teams.
- Tracing: every legal workflow run captured as one connected trace, from clause segmentation through playbook retrieval, citation grounding, alignment judging, and human review checkpoints. Auto-instrumented for LangChain, LlamaIndex, Vercel AI SDK, CrewAI, AutoGen, OpenAI Agents SDK. When a lawyer flags a hallucinated cite or a missed redline, you can reconstruct the exact retrieval set, prompt version, model, and tool calls that produced it, which is what ABA 512 supervisory review actually requires.
- Evals: ten built-in evaluators (faithfulness, citation accuracy, refusal correctness, harmfulness) plus LLM-as-judge and custom Python evaluators. Production traffic flows directly into datasets. CI-aware experiments block regressions on citation existence, citation alignment, jurisdictional filter correctness, and playbook adherence before deploys ship. This is how you build the BigLaw Bench style asset that compounds while frontier models change underneath you.
- Gateway: 500+ models behind an OpenAI-compatible interface, semantic caching, fallback chains, per-customer spending caps. Route research queries to the model that wins on your citation eval, contract review to the model that wins on clause classification, and keep ZDR routing and PII redaction enforced before any token leaves your tenant.
- Prompt management: versioned registry, dev/staging/prod environments with approval workflows, A/B testing in production with one-click rollback. Citation grounding prompts, alignment judge prompts, clause classifiers, playbook comparison prompts, and per-step agent prompts all belong in the registry so you know which version ran on which matter.
- Monitors and alerts: citation existence failure rate, alignment judge rejection rate, refusal rate on out-of-jurisdiction questions, per-matter token spend, and long-horizon agent step latency. Slack, email, PagerDuty, webhook. A spike in alignment rejections after a model upgrade should page you before a partner sees it in a brief.
A reasonable starter loop for legal AI builders:
- Instrument every LLM call with Respan tracing including retrieval spans, citation verification spans, alignment judge spans, and human review checkpoint spans.
- Pull 200 to 500 production research and contract review traces into a dataset and label them for citation correctness, alignment, jurisdictional fit, and playbook adherence.
- Wire two or three evaluators that catch the failure modes you most fear (fabricated citations, misaligned cites that exist but do not support the proposition, and cross-clause inconsistencies in contract redlines).
- Put your citation grounding, alignment judge, and clause classification prompts behind the registry so you can version, A/B, and roll back without a deploy.
- Route through the gateway so jurisdiction-aware model selection, ZDR enforcement, and per-matter cost attribution are configured once rather than reimplemented per workflow.
When the next sanctions wave hits and a federal judge asks how AI was used on a specific filing, this loop is what lets you answer with a full reconstruction instead of a Sullivan & Cromwell-style apology.
To wire any of the patterns above on Respan, start tracing for free, read the docs, or talk to us.
