On April 29, 2026, Rogo closed a $160 million Series D at a valuation that puts it among the most-funded vertical AI companies in financial services. The headline product, Felix, handles multi-step financial workflows that previously took analysts and associates weeks: deal screening, CIM generation, buyer outreach, data room diligence. More than 35,000 financial professionals across 250+ institutions use the platform. Hebbia, the document analysis platform for asset managers and investment banks, hit a $700 million valuation in 2024 and continues to expand. Bridgewater's AIA-style internal tools, JP Morgan's research AI deployments, and a growing list of in-house tools at hedge funds and PE firms all share architectural DNA with Rogo and Hebbia.
The pattern has stabilized. If you are building a financial research agent in 2026, the architecture is not a research question; it is an engineering execution question. The hard parts are not the agent loop or the document parser; those are now solved problems. The hard parts are the things that determine whether a senior MD trusts the output: factual grounding to source filings, handling of conflicting documents, structured extraction at the precision an analyst would tolerate, and the eval discipline that catches regressions before they reach a deal team.
This post covers the architecture pattern, the failure modes that show up in production, the eval taxonomy that catches them, and what to ship in the first 90 days of a build.
What financial research agents actually do
The category is broader than "ChatGPT for finance." A practical taxonomy of the workflows financial research agents handle:
| Workflow | Inputs | Output | Time-on-task replacement |
|---|---|---|---|
| Deal screening | Industry parameters, deal criteria, internal portfolio | Ranked target list with rationale | Hours of associate time per screen |
| CIM generation | Target company filings, internal notes, comparable deals | Draft CIM (Confidential Information Memorandum) | Days of analyst time per CIM |
| Earnings analysis | 10-K, 10-Q, transcripts, prior coverage | Structured earnings summary, KPI extraction, anomaly flags | Half-day per name per quarter |
| Data room diligence | Hundreds of files in a virtual data room | Issue list, deal-breaker flags, structured extraction | Weeks of associate time per deal |
| Comparable transaction search | Deal parameters, target characteristics | Ranked list of comparable transactions with rationale | Hours per search |
| Investment memo drafting | Diligence outputs, financial model, market context | Draft memo following firm template | Days per memo |
| Buyer outreach | Target list, deal context, banker preferences | Tailored outreach drafts | Hours per outreach round |
| Question-and-answer over portfolio | Portfolio of documents (filings, internal memos) | Answers to natural-language questions with citations | Minutes per query |
The throughline is that all of these workflows share three properties:
- The inputs are unstructured or semi-structured text (filings, transcripts, memos, data room files).
- The outputs require structured synthesis across many documents.
- The audience is a sophisticated reader who will catch inaccuracies and lose trust in the system if too many surface.
These properties drive the architecture.
The shared architecture
A simplified architecture diagram for the Felix-and-Hebbia-style agent:
[User input: question, deal parameters, or task spec]
|
v
[Task planner: decompose into subtasks]
|
v
+---------------------------------------------------+
| Long-horizon agent loop (one or many iterations) |
| |
| [Retrieval] |
| - Vector search over indexed corpus |
| - Filtered by entity, date, document type |
| - Hybrid lexical + semantic |
| [Document understanding] |
| - Structured extraction with schemas |
| - Cross-document reconciliation |
| [Reasoning step] |
| - LLM call with retrieved context |
| - Tool calls (calculator, API, sub-agent) |
| [State update] |
| - Persist intermediate findings |
| - Update task ledger |
| |
| [Pause checkpoint, if applicable] |
| - Wait for human review |
| - Or proceed if confidence is high enough |
+---------------------------------------------------+
|
v
[Synthesis: structured output with citations]
|
v
[Verification: every claim grounded in source]
|
v
[Output to user with traceable provenance]
Each block is its own engineering subsystem. The hard parts cluster in three places: retrieval (which determines what the model can know), structured extraction (which determines whether the output is useful), and grounding (which determines whether the output is trustworthy).
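Stripped of the subsystems, the control flow of the loop is short. A minimal sketch in Python, where `planner`, `retriever`, `extractor`, `reconciler`, and `synthesizer` are placeholders for the blocks in the diagram rather than a real API:

```python
# Minimal control-flow sketch of the long-horizon loop above. Every subsystem is
# passed in as a callable; the names are placeholders for the diagram's blocks.

def run_research_task(task, state, state_store,
                      planner, retriever, extractor, reconciler, synthesizer):
    if not state.plan:
        state.plan = planner(task)                       # decompose into subtasks
        state_store.save(state)

    for subtask in state.plan:
        if subtask.done:
            continue                                     # resume skips finished work
        docs = retriever(subtask)                        # entity/date-filtered hybrid search
        findings = extractor(docs, subtask.schema)       # schema-constrained extraction
        findings = reconciler(findings, state.findings)  # cross-document reconciliation
        state.findings.extend(findings)
        subtask.done = True
        state_store.save(state)                          # persist every step as it happens

        if any(f.needs_review for f in findings):        # pause checkpoint
            state.status = "paused"
            state_store.save(state)
            return state                                 # human reviews, then resume

    return synthesizer(state.findings)                   # cited, verified output
```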
Retrieval over financial corpora
Financial research agents retrieve over corpora that look different from open-domain web text. The retrieval architecture has to reflect this.
Document types are heterogeneous and structured. A 10-K has a defined structure (Item 1, Item 1A, Item 7, etc.). An earnings transcript has speaker turns. A data room contains contracts, board minutes, financial statements, and operational memos with very different shapes. Naive chunking destroys structure and produces worse retrieval.
Entity disambiguation matters. "Apple" might be Apple Inc. or one of dozens of smaller entities. "Q3" depends on the company's fiscal calendar. The retrieval layer needs an entity resolution step that disambiguates references against an authoritative entity registry.
Recency is a first-class dimension. A finding from the most recent 10-K supersedes one from two years prior. Retrieval has to weight by date, with explicit handling of "what was true at time T" vs "what is true now."
Filings are versioned. A 10-K filed in March supersedes the 10-K filed last year. Amendments (10-K/A) supersede the original. Retrieval has to surface the right version, with awareness that downstream synthesis may need the historical version for trend analysis.
The retrieval architecture that handles this:
Query
|
v
Query understanding:
- Extract entities (companies, securities, dates)
- Resolve to canonical entity IDs
- Identify temporal scope
- Identify document types implied by query
|
v
Multi-stage retrieval:
- Lexical search filtered by entity + date + type
- Semantic search filtered by entity + date + type
- Merge and rerank
|
v
Document-aware chunking:
- Respect document structure (sections, items)
- Preserve metadata (filing date, source, section)
|
v
Reranking by relevance + recency + authority:
- Most-recent filing weighted up
- Primary sources weighted up
- Internal notes weighted by author credibility
|
v
Final context for reasoning step
A retrieval pipeline that does this well is what separates a research agent from a chat-with-PDFs demo.
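A sketch of those stages in Python, assuming a hypothetical entity registry plus lexical and vector indexes that accept metadata filters. The filtering and reranking logic is the point, not any particular search backend:

```python
# Sketch of the retrieval pipeline above; registry, lexical_index, and vector_index
# are assumed interfaces, and the recency/authority weights are illustrative.
from datetime import date

AUTHORITY = {"sec_filing": 1.0, "press_release": 0.5, "investor_deck": 0.3, "internal_note": 0.1}

def retrieve(query, registry, lexical_index, vector_index, k=20):
    # Query understanding: canonical entities, temporal scope, implied document types.
    entities = registry.resolve(query)                   # "Apple" -> canonical entity IDs
    as_of = registry.temporal_scope(query) or date.today()
    doc_types = registry.implied_doc_types(query)        # e.g. ["10-K", "transcript"]

    filters = {
        "entity_id": [e.id for e in entities],
        "doc_type": doc_types,
        "filing_date_lte": as_of,                        # "what was true at time T"
        "superseded": False,                             # amendments and newer filings win
    }

    # Multi-stage retrieval: lexical and semantic over the same filters, merged.
    candidates = (lexical_index.search(query, filters=filters, k=k)
                  + vector_index.search(query, filters=filters, k=k))

    # Rerank by relevance + recency + authority.
    def score(chunk):
        age_years = (as_of - chunk.filing_date).days / 365
        return (chunk.relevance
                + max(0.0, 1.0 - 0.5 * age_years)        # most-recent filings weighted up
                + AUTHORITY.get(chunk.source_type, 0.0)) # primary sources weighted up

    unique = {c.chunk_id: c for c in candidates}.values()  # dedupe by chunk ID
    return sorted(unique, key=score, reverse=True)[:k]
```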
Structured extraction with schemas
The output of a financial research agent is structured, not prose. A senior MD does not want a paragraph about Q3 revenue; they want a table with revenue, growth rate, segment breakdown, comparison to consensus, and the page reference for each number.
The pattern that works: define schemas for the output up front, prompt the model to fill the schema, validate the schema fill against the source documents.
Example schema for an earnings extraction:
quarterly_results:
reporting_period: <ISO date>
company_id: <entity ID>
source_documents:
- filing_type: 10-Q | 10-K | 8-K | transcript
filing_date: <ISO date>
url_or_internal_ref: <reference>
metrics:
revenue:
value: <number>
currency: <ISO code>
growth_rate_yoy: <number>
consensus_estimate: <number or null>
surprise: <number or null>
source_page: <reference into source doc>
operating_income:
...
eps:
...
segment_breakdown:
- segment_name: <string>
revenue: <number>
growth_yoy: <number>
source_page: <reference>
guidance:
next_quarter:
revenue_range:
low: <number>
high: <number>
source_page: <reference>
full_year:
...
notable_items:
- description: <string>
financial_impact: <number or qualitative>
source_page: <reference>
Each numeric field has a source_page reference. The reference points to a specific page or paragraph in a specific filing where the number appears. After extraction, a verification pass confirms that the cited page actually contains the number. Hallucinated numbers fail verification and are flagged.
The same pattern applies to deal screening output (each candidate has a structured rationale tied to specific filings), CIM generation (each section is sourced), data room diligence (each issue has a citation to the document where it surfaced).
The discipline that matters: every claim in the output is traceable back to a source. Free-form narrative summaries without citations are exactly the kind of output that breaks trust when an MD spot-checks it.
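A minimal sketch of that verification pass, assuming a get_page_text helper that fetches the cited page and a small set of formatting variants for matching numbers; both are illustrative, not a fixed interface:

```python
# Sketch of the post-extraction verification pass: confirm that the page each
# numeric field cites actually contains the extracted value. The accepted
# formatting variants (4.2 vs 4,219 vs 4219.0) are an illustrative assumption.

def numeric_variants(value: float) -> set[str]:
    return {f"{value:,.0f}", f"{value:,.1f}", f"{value:.1f}", f"{value:.2f}", f"{value:g}"}

def verify_extraction(extraction, get_page_text) -> list:
    """Return the numeric fields whose cited page does not contain the value."""
    failures = []
    for field in extraction.numeric_fields():           # every field with a source_page
        page_text = get_page_text(field.source_page)    # fetch the cited page/paragraph
        if not any(v in page_text for v in numeric_variants(field.value)):
            failures.append(field)                       # flagged, never shipped silently
    return failures
```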
Cross-document reconciliation
Real financial research involves conflicting documents. The 10-K says revenue is $4.2B, the press release says $4.21B, the management discussion says approximately $4.2B, the analyst day deck shows $4,219M. The agent has to handle this without producing nonsense.
Three patterns for handling conflict:
Authoritative source ranking. For a given metric, the agent has a ranking of which source to trust. SEC filings beat investor presentations beat internal notes. The agent picks the highest-ranked source and notes the others.
Reconciliation reporting. The agent surfaces the conflict explicitly. "Revenue: $4.2B per 10-K Item 7. Note: 10-K Item 7 states $4.2B; press release states $4.21B; difference attributable to rounding." This is the analyst-grade treatment.
Confidence flagging. When the conflict cannot be reconciled, the agent flags low confidence and surfaces the conflict for human review rather than picking a number.
The wrong pattern: silently picking one of the conflicting numbers without surfacing the others. This is what produces the "the AI was wrong" moments that destroy trust.
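A sketch of the first and third patterns combined, authoritative source ranking with explicit conflict surfacing; the ranking order and rounding tolerance are illustrative assumptions:

```python
# Sketch of authoritative source ranking with conflicts surfaced, never dropped.
SOURCE_RANK = {"sec_filing": 0, "press_release": 1, "investor_deck": 2, "internal_note": 3}

def reconcile_metric(metric_name, observations, tolerance=0.005):
    """observations: objects with .value, .source_type, .citation for one metric."""
    ranked = sorted(observations, key=lambda o: SOURCE_RANK.get(o.source_type, 99))
    chosen = ranked[0]                                    # highest-ranked source wins
    conflicts = [o for o in ranked[1:]
                 if abs(o.value - chosen.value) > tolerance * abs(chosen.value)]
    return {
        "metric": metric_name,
        "value": chosen.value,
        "source": chosen.citation,
        "conflicts": [{"value": o.value, "source": o.citation} for o in conflicts],
        "needs_review": bool(conflicts),                  # flag for human review
    }
```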
Long-horizon state and pause-resume
Data room diligence on a $500M acquisition can run for days. The agent processes hundreds of documents, asks many clarifying questions, surfaces issues, gets them addressed by the deal team, comes back to re-evaluate. This is not a single LLM call; it is a workflow with persistent state.
The state that has to persist:
- The original task specification
- All retrieved documents and their classification
- The structured findings produced so far
- The questions raised, with their resolution status
- The user's instructions and corrections at each checkpoint
- The current step in the workflow plan
The agent can be paused (the deal team goes home for the night, the MD wants to review intermediate findings before the agent continues) and resumed without losing context. Implementation is a state store keyed by task ID, with each step's output persisted as it happens.
The pattern that fails: a long-running agent that holds all state in the LLM's context window. Context windows are large but not infinite, and a multi-day workflow exceeds any reasonable context budget. The agent has to externalize state.
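A minimal sketch of that externalized state, assuming a document store with get/put semantics (the db interface here is a placeholder):

```python
# Sketch of a task state store keyed by task ID; the persisted fields mirror the
# state list above, and `db` stands in for whatever document store is used.
from dataclasses import dataclass, field, asdict

@dataclass
class TaskState:
    task_id: str
    task_spec: dict
    plan: list = field(default_factory=list)             # workflow steps + status
    findings: list = field(default_factory=list)         # structured, cited findings
    open_questions: list = field(default_factory=list)   # with resolution status
    user_corrections: list = field(default_factory=list) # checkpoint instructions
    status: str = "running"                               # running | paused | done

class TaskStateStore:
    def __init__(self, db):
        self.db = db

    def save(self, state: TaskState) -> None:
        self.db.put(state.task_id, asdict(state))         # persist after every step

    def load(self, task_id: str) -> TaskState:
        return TaskState(**self.db.get(task_id))          # resume without re-running

    def pause(self, state: TaskState, reason: str) -> None:
        state.status = "paused"
        state.open_questions.append({"reason": reason, "resolved": False})
        self.save(state)
```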
Eval taxonomy
A financial research agent has more dimensions to evaluate than a single-shot LLM. The eval framework that holds up:
Retrieval quality. Given a query and a corpus, did the retrieval surface the right documents? Measured against gold-standard retrieval sets (queries with expert-annotated correct documents). Recall and precision per query, aggregated.
Extraction accuracy. Given a document and a schema, did the extraction produce correct values? Measured against a gold-standard set of (document, schema, correct extraction) triples. Per-field accuracy, with stratification by field type (numeric, date, categorical, free-text).
Grounding rate. What fraction of claims in the output have valid citations to source documents? A claim with a citation that does not actually support it is ungrounded. Measured by LLM-as-judge or human spot-checks on sampled production outputs.
End-to-end task quality. For a given task type (CIM draft, deal screen, earnings summary), does the agent's output meet the quality bar a senior reviewer would set? Measured by sampled review with structured rubric (completeness, accuracy, prioritization, narrative quality).
Handling of conflicts. When source documents conflict, does the agent surface the conflict or silently resolve it? Measured against test cases with known conflicts.
Refusal behavior. When the corpus does not support an answer, does the agent refuse or fabricate? Measured against test queries where the correct answer is "not enough information."
Latency and cost. Wall-clock time and dollar cost per task type, at expected production volume. Tracked per workflow.
A team running these evals on every prompt change and every model upgrade catches regressions early. A team running a single end-to-end eval and ignoring the others ships subtle quality degradations.
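A sketch of two of these dimensions, retrieval quality against a gold set and grounding rate over sampled outputs; the data shapes are assumptions:

```python
# Sketch of retrieval recall/precision against expert-annotated gold queries, and
# grounding rate over sampled production outputs. Data shapes are illustrative.

def retrieval_metrics(gold_set, retriever, k=10):
    recalls, precisions = [], []
    for query, gold_doc_ids in gold_set:                  # expert-annotated correct docs
        retrieved = {d.doc_id for d in retriever(query)[:k]}
        hits = retrieved & set(gold_doc_ids)
        recalls.append(len(hits) / len(gold_doc_ids))
        precisions.append(len(hits) / k)
    return sum(recalls) / len(recalls), sum(precisions) / len(precisions)

def grounding_rate(sampled_outputs, citation_supports_claim):
    # citation_supports_claim: LLM-as-judge or human check on (claim, cited text)
    claims = [c for out in sampled_outputs for c in out.claims]
    grounded = [c for c in claims if c.citation and citation_supports_claim(c)]
    return len(grounded) / len(claims)
```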
Common production failures
Patterns that show up in deployed financial research agents.
Confident wrong numbers. The agent extracts $4.21B when the source says $4.2B, with a citation that points to the right page (which contains both numbers, in different contexts). The failure is in the extraction step, not retrieval. Fix: schema-constrained extraction with the source quote required adjacent to each numeric field.
Stale data. The agent confidently cites a 2024 10-K when the 2025 10-K is in the corpus and supersedes it. Fix: explicit recency handling in retrieval and reranking.
Entity confusion. The agent merges data from two different "Apple" entities. Fix: canonical entity resolution at the retrieval layer.
Missing context across documents. A finding in one document is contradicted by a footnote in another, but the agent only retrieved the first. Fix: cross-document reconciliation pass before synthesis, with the agent prompted to actively look for contradictions.
Unanchored synthesis. The agent produces a fluent paragraph that summarizes information from multiple sources, but the summary's claims do not all map cleanly back to the sources. Fix: structured output with per-claim citations, validation that each citation is verifiable.
Drift on long horizons. A multi-day diligence workflow produces good output for the first two days, then drifts as the state accumulates errors. Fix: periodic re-grounding, where the agent revalidates its accumulated state against source documents at defined checkpoints.
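A sketch of that re-grounding checkpoint, reusing the same citation verification pass described in the extraction section; the state shape mirrors the externalized task state above:

```python
# Sketch of periodic re-grounding at workflow checkpoints: re-run citation
# verification over everything accumulated in state and demote findings that no
# longer verify, instead of letting errors compound across a multi-day run.

def reground(state, verify_extraction, get_page_text):
    for finding in state.findings:
        failed = verify_extraction(finding, get_page_text)  # same pass as at extraction time
        if failed:
            finding.confidence = "needs_review"
            state.open_questions.append({
                "reason": f"re-grounding failed for {[f.name for f in failed]}",
                "resolved": False,
            })
    return state
```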
What separates Rogo from a chat-with-PDFs
After two years watching this category, the differentiation between products that institutions trust and products that get demoed and abandoned reduces to the same set of properties.
Source grounding is non-negotiable. Every claim cites a source. The citation is verifiable. Outputs without citations are flagged. Tools that produce confident prose without citations get rejected after the first error a senior reviewer catches.
Structured output beats prose. Tables, schemas, and structured findings are easier to verify, easier to integrate into existing analyst workflows, and harder to fake with fluency.
Conflict handling is explicit. When sources conflict, the system surfaces the conflict rather than picking a side silently.
Long-horizon workflows persist state. Multi-day, multi-step tasks can be paused, reviewed, and resumed. State is externalized, not held in context.
Domain depth in retrieval. Entity resolution, document type awareness, recency handling, jurisdictional filtering. The retrieval layer encodes financial knowledge that generic RAG does not.
Eval as discipline, not vibes. Per-workflow eval sets with regular updates, regression catches before deploy, analyst-grade ground truth.
The architecture above supports all of this. Built without these properties, the product is a clever demo that never makes it past the first deal team's pilot.
Build order
Financial research agents stack in dependency order. Each layer assumes the one below it works at an analyst-grade quality bar; skip the gate and the next layer inherits the error.
| Order | What you build | Eval gate before moving on |
|---|---|---|
| 1 | Entity registry and document-type schemas (companies, securities, fiscal calendars, filing types, target output shapes per workflow) | 100 percent of test queries resolve to canonical entity IDs; schema coverage validated against 50 real analyst outputs from one workflow |
| 2 | Retrieval pipeline with structure-preserving ingestion, entity-filtered hybrid search, recency-aware reranking, version-aware filing handling | Recall at 10 above 0.85 and precision at 5 above 0.7 on a 100-query gold set with expert-annotated correct documents |
| 3 | Schema-constrained extraction with source-quote requirements adjacent to every numeric field, plus citation verification pass | Per-field numeric accuracy above 0.95 on 200 gold extractions; citation verification pass catches 100 percent of seeded hallucinated numbers |
| 4 | Cross-document reconciliation with authoritative source ranking and explicit conflict surfacing on contested metrics | On 50 seeded conflict cases (10-K vs press release vs investor deck), conflicts surfaced 100 percent of the time; zero silent picks |
| 5 | One end-to-end workflow with externalized long-horizon state, pause-resume, and per-checkpoint re-grounding (start with deal screening or earnings analysis) | Senior reviewer rubric (completeness, accuracy, prioritization) clears 4 out of 5 on 25 sampled runs; grounding rate above 0.95 on production traffic |
| 6 | Second and third workflows (CIM generation, data room diligence, buyer outreach) layered on the same retrieval and extraction substrate | Each new workflow ships with its own gold set, regression evals wired to CI, and grounding monitor live before deal teams see output |
Add workflows only after the layer below clears its gate. Teams that ship the agent loop before retrieval and grounding spend the next year unwinding wrong numbers a senior MD already spot-checked.
How Respan fits
Financial research agents like Felix and Hebbia live or die on whether senior MDs trust the output, and that trust is built on the substrate underneath: traceable retrieval, grounded extraction, and eval discipline that catches regressions before a deal team sees them. Respan is that substrate.
- Tracing: every deal screen, CIM draft, earnings extraction, and data room diligence run captured as one connected trace. Auto-instrumented for LangChain, LlamaIndex, Vercel AI SDK, CrewAI, AutoGen, OpenAI Agents SDK. When a multi-day diligence workflow drifts, you can replay the entire task ledger, retrieval calls, and reconciliation steps to find where the agent picked the wrong filing version or merged two "Apple" entities.
- Evals: ten built-in evaluators (faithfulness, citation accuracy, refusal correctness, harmfulness) plus LLM-as-judge and custom Python evaluators. Production traffic flows directly into datasets. CI-aware experiments block regressions on confident wrong numbers, stale 10-K citations, ungrounded synthesis, and silent conflict resolution before deploys ship.
- Gateway: 500+ models behind an OpenAI-compatible interface, semantic caching, fallback chains, per-customer spending caps. Diligence runs that span hundreds of documents and many reasoning steps benefit from semantic caching on repeated entity-resolution calls and per-deal-team spend caps so a runaway agent loop does not blow through a quarter's budget on one CIM.
- Prompt management: versioned registry, dev/staging/prod environments with approval workflows, A/B testing in production with one-click rollback. Your task planner, retrieval query rewriter, schema-constrained extraction prompts, reconciliation prompts, and per-workflow synthesis templates all belong in the registry so analyst feedback can be wired into prompt versions without a code deploy.
- Monitors and alerts: grounding rate per workflow, retrieval recall against gold sets, extraction field accuracy, conflict-surface rate, refusal correctness on out-of-corpus queries, latency and cost per CIM or earnings summary. Slack, email, PagerDuty, webhook. When grounding rate drops on the earnings workflow after a model upgrade, the on-call analyst hears about it before the next quarterly close.
A reasonable starter loop for financial research agent builders:
- Instrument every LLM call with Respan tracing including retrieval spans, entity-resolution spans, structured extraction spans, and reconciliation spans.
- Pull 200 to 500 production extraction outputs (earnings summaries, CIM sections, deal screens) into a dataset and label them for citation validity, numeric accuracy, and conflict handling.
- Wire two or three evaluators that catch the failure modes you most fear (confident wrong numbers with valid-looking citations, stale-filing references when newer 10-Ks exist in corpus, unanchored synthesis where prose claims do not map back to sources).
- Put your task planner, schema-constrained extraction prompts, and per-workflow synthesis templates behind the registry so you can version, A/B, and roll back without a deploy.
- Route through the gateway so semantic caching absorbs repeated entity lookups, fallback chains keep diligence running through provider outages, and per-deal spend caps prevent a runaway long-horizon loop.
Skip this loop and the first wrong number a senior MD spot-checks ends the pilot; the institution that just signed on becomes the reference customer your competitor wins.
To wire any of the patterns above on Respan, start tracing for free, read the docs, or talk to us.
Related reading
- The April 2026 Model Risk Overhaul: if your research agent informs decisions, MRM applies
- Building Adverse Action Explainability for LLM-Driven Credit Decisions: for research agents that touch underwriting
- Evaluating LLMs for Real-Time Fraud Detection: adjacent investigation copilot patterns
- How Fintech Teams Build LLM Apps in 2026: pillar overview
