The agent copilot is the highest-volume entry point for real estate AI. Lofty (formerly Chime), Top Producer, kvCORE, Real Geeks, BoomTown, Compass, RealScout, and increasingly every CRM in the category have shipped some version of it. The market has stabilized; the engineering challenges are well documented.
This piece is the build walkthrough. It assumes you have read the hallucination spoke for the property-fact grounding framework and the Fair Housing spoke for the disparate-impact testing layer. It covers lead enrichment, conversation drafting, MLS-grounded recommendations, calendar and task automation, and the eval pipeline that catches regressions before they reach licensed agents.
For context on where copilots sit in the broader real estate AI stack, the pillar covers the five build patterns at the platform level.
The architecture in one diagram
[New lead arrives: form fill, listing inquiry, referral]
|
v
[Lead enrichment: public records, search history, intent signals]
|
v
[Lead scoring: gradient-boosted ML, NOT LLM-as-scorer]
|
v
[Disparate-impact gate: four-fifths rule check on score distribution]
|
v
[Conversation drafting: brand voice + lead context]
|
v
[Property recommendations: MLS-grounded, hard filters]
|
v
[Agent review + edit + send]
|
v
[Calendar automation: scheduling, reminders, logging]
|
v
[Continuous eval capture: edit rate, conversion, override rate]
Eight components, two compliance gates, one feedback loop.
Step 1: Lead enrichment
A new lead arrives with minimal data: name, phone, email, the property they inquired about. Enrichment fills in the picture.
Public records. Property ownership, transaction history, current mortgage, school district, neighborhood data. Sourced from county records and licensed data providers (ATTOM, CoreLogic).
Search and behavioral signals. Pages viewed on the brokerage site, listing alerts subscribed to, time-of-day patterns. Stay within consented data only.
Lead source attribution. Where did the lead come from? Zillow, Realtor.com, brokerage website, referral. Source affects both intent and the right outreach pattern.
Urgency markers. "Looking to move within 30 days" vs "just browsing." Extracted from form fields or message text via the LLM.
A practical implementation:
```python
import os

from respan import Respan

client = Respan(api_key=os.environ["RESPAN_API_KEY"])

@client.workflow(name="lead-enrichment")
def enrich_lead(lead_data):
    # public_records_client, behavior_db, EnrichedLead, URGENCY_EXTRACTION_PROMPT,
    # and parse are assumed to be defined elsewhere in the codebase
    public = public_records_client.lookup(lead_data["email"], lead_data["phone"])
    behavior = behavior_db.fetch_history(lead_data["session_id"])
    urgency = client.chat.completions.create(
        model="auto",
        messages=[
            {"role": "system", "content": URGENCY_EXTRACTION_PROMPT},
            {"role": "user", "content": lead_data["message"]},
        ],
        response_format={"type": "json_object"},
    )
    return EnrichedLead(
        lead=lead_data,
        public_records=public,
        behavior=behavior,
        urgency=parse(urgency),
    )
```

Compliance line: public records and behavioral data are subject to varying state-by-state restrictions. CCPA, CPRA, and state consumer protection laws apply. Get this reviewed by counsel before shipping.
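The enrichment snippet leaves URGENCY_EXTRACTION_PROMPT and parse undefined. One plausible shape for both, with the JSON schema and field names as illustrative assumptions rather than anything the article specifies:

```python
import json
from types import SimpleNamespace

# Hypothetical prompt; the exact fields are an assumption
URGENCY_EXTRACTION_PROMPT = (
    "Extract move urgency from the lead's message. Respond with JSON: "
    '{"timeline_days": <int or null>, "urgency_band": "hot"|"warm"|"browsing", '
    '"evidence": "<quoted phrase from the message>"}'
)

def parse(response):
    """Pull the JSON object out of a chat-completions-style response,
    falling back to a conservative default if the model returns junk."""
    content = response.choices[0].message.content
    try:
        data = json.loads(content)
    except (json.JSONDecodeError, TypeError):
        return {"timeline_days": None, "urgency_band": "browsing", "evidence": ""}
    data.setdefault("urgency_band", "browsing")
    return data

# Stubbed response object for local testing
fake = SimpleNamespace(choices=[SimpleNamespace(message=SimpleNamespace(
    content='{"timeline_days": 30, "urgency_band": "hot", '
            '"evidence": "move within 30 days"}'))])
```

Defaulting to "browsing" on a parse failure is the safe direction: a mis-parsed hot lead gets a slower cadence, not a pushy one.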
Step 2: Lead scoring (NOT LLM-as-scorer)
The scoring model is statistical (gradient boosting, random forest), not the LLM. Two reasons:
Disparate-impact testing requires the model to be inspectable. A statistical model with explicit features can be audited for which features drive disparate outcomes. An LLM that produces a score from text is harder to audit.
Calibration. Score distributions need to be calibrated; LLM-produced scores are typically not. Calibration matters because the score thresholds drive workflow decisions, and uncalibrated thresholds drift.
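To make the calibration point concrete, here is a minimal pure-Python sketch of an expected-calibration-error (ECE) check you might run on the scorer's held-out predictions; all names are illustrative:

```python
def expected_calibration_error(scores, outcomes, n_bins=10):
    """ECE: bin predicted conversion probabilities, compare each bin's mean
    prediction to its observed conversion rate, weight by bin size."""
    bins = [[] for _ in range(n_bins)]
    for s, y in zip(scores, outcomes):
        idx = min(int(s * n_bins), n_bins - 1)
        bins[idx].append((s, y))
    ece, total = 0.0, len(scores)
    for b in bins:
        if not b:
            continue
        mean_pred = sum(s for s, _ in b) / len(b)
        obs_rate = sum(y for _, y in b) / len(b)
        ece += (len(b) / total) * abs(mean_pred - obs_rate)
    return ece

# A perfectly calibrated toy scorer: predicts 0.5, half the leads convert
print(expected_calibration_error([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0]))  # 0.0
```

A scorer with low ECE means a threshold like "route scores above 0.7 to same-day outreach" keeps meaning what it says as the model or lead mix drifts.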
The LLM's job is not to score but to explain the score and draft outreach calibrated to the score band.
The four-fifths gate runs at every score generation:
```python
@client.eval(name="lead-scoring-disparate-impact-online")
def online_disparate_impact_check(score_batch, demographic_strata, threshold):
    # Four-fifths rule: each stratum's favorable-outcome rate must be at
    # least 80% of the highest stratum's rate
    rates = {}
    for stratum in demographic_strata:
        cases = score_batch.filter(stratum=stratum)
        rates[stratum] = sum(s >= threshold for s in cases.scores) / len(cases.scores)
    max_rate = max(rates.values())
    for stratum, rate in rates.items():
        ratio = rate / max_rate if max_rate > 0 else 1.0
        if ratio < 0.80:
            alert(f"Four-fifths violation: {stratum} ratio {ratio:.2f}")
            return "block_until_review"
    return "pass"
```

A four-fifths violation is a deploy blocker, not a notification.
Step 3: Conversation drafting
The copilot drafts a follow-up message tailored to the lead context. Three architectural concerns.
Brand voice tuning per agent or brokerage. A Sotheby's agent's voice is different from a Keller Williams agent's. Brand voice prompts live in the prompt registry, versioned per brokerage and per agent if needed.
Template library for common scenarios. Showing follow-up, listing alert, market update, off-market opportunity. Each scenario has a versioned template that pulls from the lead enrichment.
Compliance redlines. No protected-class language. No language that could be construed as steering. Disclosure language for state-specific requirements (e.g., dual agency disclosures in some states). Run a pre-send compliance check.
```python
@client.workflow(name="lead-outreach")
def draft_outreach(enriched_lead, scenario, agent_id):
    voice = client.prompts.get(f"voice/{agent_id}", env="prod")
    template = client.prompts.get(f"scenario/{scenario}", env="prod")
    draft = client.chat.completions.create(
        model="auto",
        messages=build_outreach_prompt(enriched_lead, voice, template),
    )
    compliance = client.evals.run(
        evaluator="real_estate_compliance_check",
        candidate=draft,
        checks=["no_steering_language", "no_protected_class_inference",
                "required_disclosures_present"],
    )
    if compliance["fail_count"] > 0:
        return require_agent_review(draft, compliance)
    return draft
```

The compliance evaluator looks for known steering patterns ("good neighborhood for families like yours", "this area has great schools"), implicit demographic inferences, and missing required disclosures. Run it on every outreach.
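From the caller's side the evaluator is a black box. A first-pass, regex-level sketch of what the steering-pattern portion might look like; the pattern list is illustrative, not exhaustive, and no substitute for the full evaluator or counsel review:

```python
import re

# Illustrative patterns only; a production list needs counsel review
STEERING_PATTERNS = {
    "familial_status_proxy": r"\b(great|good|perfect)\s+(schools|for families)\b",
    "demographic_proxy": r"\bfamilies like yours\b|\bpeople like you\b",
    "neighborhood_coding": r"\bup[- ]and[- ]coming\b|\bchanging neighborhood\b",
}

def steering_flags(draft: str) -> list[str]:
    """Return the names of any steering patterns found in a draft."""
    text = draft.lower()
    return [name for name, pat in STEERING_PATTERNS.items()
            if re.search(pat, text)]

print(steering_flags("This area has great schools and is up-and-coming!"))
# → ['familial_status_proxy', 'neighborhood_coding']
```

Regexes catch the known phrasings cheaply; the LLM-based evaluator layer exists for the paraphrases a pattern list misses.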
Step 4: Property recommendations
Given the lead's stated criteria and behavior signals, recommend properties from the MLS. Architectural notes:
Hard filters first. Geographic (zip code, school district), property type, price band, bed/bath, must-have features. Hard filters before any embedding similarity.
Embedding similarity within filters. Among properties passing hard filters, sort by similarity to the lead's expressed and behavioral preferences.
Recency reranking. A property listed today should rank above a property listed 60 days ago when other factors match.
Status filter. Only active listings. A pending or sold listing in the recommendation set is a customer-experience failure.
Anti-steering by design. The recommendation pipeline does not infer demographic preferences from the lead's name, address, or other features. Recommendations are based on stated criteria and explicit behavioral signals only.
```python
def recommend_properties(enriched_lead, k=10):
    hard_filtered = mls_client.search(
        zip_codes=enriched_lead.preferences.zip_codes,
        property_types=enriched_lead.preferences.property_types,
        price_min=enriched_lead.preferences.price_min,
        price_max=enriched_lead.preferences.price_max,
        bedrooms_min=enriched_lead.preferences.bedrooms_min,
        status="active",
    )
    if len(hard_filtered) < k:
        return hard_filtered  # do not loosen filters silently
    scored = embedding_score(hard_filtered, enriched_lead.preferences_text)
    reranked = recency_rerank(scored, weight=0.2)
    return reranked[:k]
```

Silently loosening hard filters when the result set is small produces recommendations that do not match what the lead asked for. Better to return fewer recommendations and tell the agent that the lead's criteria are tight than to surface off-criteria properties.
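The recency_rerank helper is left undefined above. One plausible sketch blends the similarity score with an exponential freshness decay; the half-life, weight, and field names are assumptions:

```python
from datetime import date

def recency_rerank(scored_listings, weight=0.2, half_life_days=30, today=None):
    """Blend each listing's similarity score with a freshness term that
    decays by half every `half_life_days` since the listing date.
    Input: (listing, similarity) pairs, similarity in [0, 1]."""
    today = today or date.today()
    def blended(item):
        listing, sim = item
        age_days = (today - listing["listed_date"]).days
        freshness = 0.5 ** (age_days / half_life_days)
        return (1 - weight) * sim + weight * freshness
    return sorted(scored_listings, key=blended, reverse=True)

listings = [
    ({"id": "A", "listed_date": date(2026, 1, 2)}, 0.80),  # 60 days old
    ({"id": "B", "listed_date": date(2026, 3, 3)}, 0.78),  # listed today
]
ranked = recency_rerank(listings, today=date(2026, 3, 3))
print([l["id"] for l, _ in ranked])  # → ['B', 'A']: fresh near-tie wins
```

The weight controls how large a similarity gap freshness is allowed to overcome; 0.2 lets a just-listed property beat a 60-day-old near-tie without burying genuinely better matches.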
Step 5: Calendar and task automation
The agentic surface for the agent's workflow.
Showing scheduling. Detect intent in lead messages ("can we see this Saturday?"), check the agent's calendar, propose specific times, send to the lead, confirm, log to the CRM.
Reminder cadence. Configurable per scenario: showing reminders 24 hours and 1 hour before; listing alert follow-ups every 3 days for 14 days; market update emails monthly.
Activity logging. Every interaction logs to the CRM with structured fields (touchpoint type, outcome, next action). Audit trail for compliance and performance review.
Human-in-the-loop on key actions. Sending an offer, scheduling a closing, submitting a listing. The agent reviews and approves, the AI does not act unilaterally.
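The reminder cadence above is best expressed as data rather than code, so it stays configurable per scenario. A hypothetical sketch mirroring the cadences in the text; the config structure is an assumption:

```python
from datetime import datetime, timedelta

# Cadence config mirrors the examples in the text; structure is illustrative
REMINDER_CADENCE = {
    "showing": [timedelta(hours=24), timedelta(hours=1)],        # before the event
    "listing_alert_followup": [timedelta(days=d) for d in range(3, 15, 3)],
}

def showing_reminders(showing_time: datetime) -> list[datetime]:
    """Compute reminder send times counting back from the showing."""
    return [showing_time - delta for delta in REMINDER_CADENCE["showing"]]

showing = datetime(2026, 3, 7, 14, 0)  # Saturday 2pm
print(showing_reminders(showing))
# → [datetime(2026, 3, 6, 14, 0), datetime(2026, 3, 7, 13, 0)]
```

Keeping cadence in config means a brokerage admin can tune follow-up rhythm per scenario without touching the scheduling code.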
Step 6: Eval and override capture
The production-grade signal is agent edit rate on AI-drafted communications, sliced by scenario and brokerage. Real estate has a particularly high override rate because agents are licensed and liable for what they send under their name.
A reasonable eval suite:
- Brand voice match rate measured against agent-edited final versions.
- Compliance flag rate on the pre-send check (low is good).
- Conversion rate per scenario measured at outreach to next-step (response, showing scheduled, offer made).
- Disparate-impact ratio measured monthly across protected classes.
- Property recommendation accept rate measured by agent or lead clicks vs ignores.
```python
client.monitors.create(
    name="copilot-quality",
    workflow="lead-outreach",
    sample_rate=0.10,
    evaluators=[
        "brand_voice_match",
        "compliance_flag_rate",
        "conversion_to_response",
        "disparate_impact_ratio",
    ],
    alert_on={
        "compliance_flag_rate": ">0.05",
        "disparate_impact_ratio": "<0.80",
    },
    slice_by=["scenario", "brokerage_id", "agent_tenure_band"],
)
```

The slice-by matters. A copilot that performs well for senior agents but poorly for new agents has a brand-voice transfer problem you cannot see in aggregate metrics.
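A toy illustration of why slicing matters, with invented numbers: the aggregate edit rate looks healthy while one tenure band is badly off.

```python
from collections import defaultdict

# Invented per-draft records: (agent_tenure_band, was_edited)
drafts = ([("senior", False)] * 90 + [("senior", True)] * 10
          + [("new", False)] * 10 + [("new", True)] * 10)

def edit_rate(rows):
    return sum(edited for _, edited in rows) / len(rows)

by_band = defaultdict(list)
for row in drafts:
    by_band[row[0]].append(row)

print(f"aggregate: {edit_rate(drafts):.2f}")    # 0.17 — looks fine
for band, rows in by_band.items():
    print(f"{band}: {edit_rate(rows):.2f}")     # new agents sit at 0.50
```

Senior agents dominate the volume, so their 10% edit rate drowns out the new agents' 50% in any unsliced dashboard.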
Build vs buy
Buy when:
- Single brokerage or small team, generic CRM workflow
- No dedicated MLOps team
- Time-to-value matters. Vendor pricing typically lands at $50-200 per agent per month for the leading copilots
- Generic brand voice is acceptable
Build when:
- Large brokerage or multi-state network with thousands of agents
- Specific brand voice differentiation matters
- Tight integration with proprietary CRM, MLS, or transaction management systems
- Existing customer data assets that become a moat
The hybrid pattern is most common in 2026. License the CRM with built-in copilot (Lofty, kvCORE, Real Geeks). Build a thin custom layer on top for brand-specific voice tuning, proprietary lead sources, or compliance workflows that vendors cover poorly.
A reference build checklist
Before you ship a copilot to licensed agents:
- Lead enrichment pipeline with public records, behavior, urgency extraction
- Statistical lead-scoring model (NOT LLM-as-scorer) with disparate-impact testing
- Four-fifths rule gate that blocks deploy on violation
- Conversation drafting with brokerage and agent brand voice in the prompt registry
- Compliance pre-send check (no steering, no protected-class inference, required disclosures)
- Property recommendations with hard filters first, embedding similarity within filters
- Anti-steering: no demographic inference in recommendations
- Calendar and task automation with human-in-the-loop on key actions
- Audit logs (full decision context, six-year retention minimum)
- Disparate-impact monitoring on production traffic, monthly subgroup analysis
- Agent edit rate dashboards sliced by scenario, brokerage, agent tenure
- CI eval on every prompt or model change including disparate-impact suite
How Respan fits
A real estate copilot fans a single lead event into enrichment, scoring, drafting, recommendation, and calendar steps, each with its own failure modes and compliance gates. Respan gives you one connected view across that pipeline plus the eval and prompt tooling to ship changes without breaking licensed agents.
- Tracing: every lead-to-outreach workflow captured as one connected trace. Auto-instrumented for LangChain, LlamaIndex, Vercel AI SDK, CrewAI, AutoGen, OpenAI Agents SDK. Enrichment calls, scoring decisions, MLS retrieval, drafting, and the pre-send compliance check appear as nested spans against a single lead ID, so you can debug an off-criteria recommendation or an agent edit without grepping logs.
- Evals: ten built-in evaluators (faithfulness, citation accuracy, refusal correctness, harmfulness) plus LLM-as-judge and custom Python evaluators. Production traffic flows directly into datasets. CI-aware experiments block regressions on steering language, missing disclosures, off-criteria recommendations, and four-fifths violations before deploys ship.
- Gateway: 500+ models behind an OpenAI-compatible interface, semantic caching, fallback chains, per-customer spending caps. Route urgency extraction to a cheap model, brand-voice drafting to a stronger one, and let the gateway fall back when a provider blips during peak Saturday showing requests.
- Prompt management: versioned registry, dev/staging/prod environments with approval workflows, A/B testing in production with one-click rollback. Brokerage-level brand voice prompts, agent-level overrides, and scenario templates (showing follow-up, listing alert, market update) live in the registry so brokerage admins can ship voice updates without a deploy.
- Monitors and alerts: agent edit rate, compliance flag rate, conversion-to-response, disparate-impact ratio, property recommendation accept rate. Slack, email, PagerDuty, webhook. Slice every metric by scenario, brokerage, and agent tenure band so you catch problems that hide in the aggregate.
A reasonable starter loop for real estate copilot builders:
- Instrument every LLM call with Respan tracing including enrichment, scoring, drafting, MLS retrieval, and compliance-check spans.
- Pull 200 to 500 production lead-outreach traces into a dataset and label them for brand voice match, compliance correctness, and recommendation fit.
- Wire two or three evaluators that catch the failure modes you most fear (steering language in drafts, off-criteria property recommendations, four-fifths violations on score distribution).
- Put your brokerage and agent voice prompts plus scenario templates behind the registry so you can version, A/B, and roll back without a deploy.
- Route through the gateway so brokerage-level cost caps, semantic caching on repeated market-update prompts, and provider fallbacks are enforced centrally.
The point is not to replace the licensed agent's judgment but to give the engineering team the same observability and guardrails the compliance team already expects on paper.
CTA
To wire the copilot stack on Respan, start tracing for free, read the docs, or talk to us. For the rest of the Real Estate cluster: the pillar, the hallucination spoke, the Fair Housing compliance spoke, and the eval spoke.
FAQ
Should I use the LLM as the lead scorer? No. Use a statistical model (gradient boosting, random forest) for the score, with the LLM as the explanation layer. Statistical models are inspectable for disparate-impact testing; LLM-produced scores are not.
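To make that division of labor concrete, a hedged sketch of the explanation layer: the statistical model produces the score and per-feature contributions, and the LLM only narrates them. All names here are hypothetical:

```python
def build_score_explanation_prompt(score: float, contributions: dict) -> str:
    """The LLM never scores raw features; it only explains a score the
    statistical model already produced, from that model's own contributions."""
    top = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)[:3]
    drivers = ", ".join(f"{name} ({value:+.2f})" for name, value in top)
    return (
        f"The lead scored {score:.2f} on a 0-1 scale. Top drivers: {drivers}. "
        "Explain this score to the agent in two sentences. Do not adjust it."
    )

prompt = build_score_explanation_prompt(
    0.82, {"site_visits_7d": 0.31, "urgency_band": 0.22, "lead_source": -0.05},
)
print(prompt)
```

Because the contributions come from the inspectable model, the explanation stays auditable in a way an LLM-invented score never would.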
How do I handle brand voice across many brokerages? Versioned prompts in a registry, with brokerage-specific and agent-specific layers. Brokerage admins should be able to update brand voice without a deploy. Test brand voice match rate as a first-class metric.
What's the most common compliance failure in copilots? Steering language in conversation drafts. "This area has great schools" reads to many readers as a familial-status proxy. "Up-and-coming neighborhood" can read as a race proxy. Build a compliance check that flags these patterns and require agent review before send.
How do I test the property recommendation pipeline for steering? Match-pair testing: construct synthetic leads with identical criteria but different names, addresses, or other demographic proxies. Run the recommendation pipeline. Compare the recommended properties. Significant differences indicate steering.
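A minimal harness for that match-pair test, with a stubbed recommender standing in for the real pipeline; everything here is illustrative:

```python
def match_pair_test(recommend, base_criteria, proxy_variants, min_jaccard=0.95):
    """Run identical criteria through the recommender under different
    demographic-proxy fields; flag variants whose result sets diverge
    from the baseline by Jaccard similarity."""
    def jaccard(a, b):
        a, b = set(a), set(b)
        return len(a & b) / len(a | b) if a | b else 1.0
    baseline = recommend({**base_criteria, **proxy_variants[0]})
    failures = []
    for variant in proxy_variants[1:]:
        result = recommend({**base_criteria, **variant})
        if jaccard(baseline, result) < min_jaccard:
            failures.append(variant)
    return failures

# Stub recommender that (correctly) ignores the name field entirely
def stub_recommend(lead):
    return ["mls-101", "mls-102", "mls-103"]

failures = match_pair_test(
    stub_recommend,
    {"zip": "78704", "price_max": 650_000, "beds_min": 3},
    [{"name": "Emily Walsh"}, {"name": "Lakisha Washington"}, {"name": "Jamal Jones"}],
)
print(failures)  # → []: a steering-free pipeline shows no divergence
```

Run the same harness against the real pipeline with a larger panel of name, address, and other proxy variants; any non-empty failure list is a finding for the compliance team, not a tuning knob.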
What's the right approach when the lead's criteria are too tight to find matches? Tell the agent. Do not loosen the criteria silently and surface off-criteria properties. Surface the gap explicitly: "Only 2 active listings match the lead's stated criteria. Want to expand the price band or geography?"
