There are no public benchmarks for real estate AI like there are for medical or legal AI. No MedQA, no ASAP-AES, no MathTutorBench. The eval set you build is the eval set you have, and the quality of your build directly determines whether your product holds up to a state real estate commission inquiry, a Fair Housing complaint, or a major brokerage's procurement security review.
This piece covers the eval framework that real estate AI products need. Property-fact accuracy testing, AVM and comp quality, disparate-impact testing under the four-fifths rule, and the production cadence the leading proptech teams run.
This piece sits in the wider Real Estate cluster alongside the pillar, the hallucination spoke, the Fair Housing compliance spoke, and the agent copilot build walkthrough.
Building the golden dataset
The eval is only as good as the dataset behind it. A few principles apply specifically to real estate.
Source from real listings, not synthetic
Do not generate the dataset by asking GPT to make up properties. Real properties have data inconsistencies (MLS vs assessor differences), edge cases (unusual lot configurations, mixed-use zoning), and demographic patterns that synthetic data does not capture. Pull from real MLS feeds, real county records, real listing remarks.
Stratify the dataset:
- Easy positives. Common single-family homes in stable suburban markets with clean MLS data. Sanity checks.
- Long-tail positives. Vacant land, commercial, mixed-use, manufactured housing, condos with HOA quirks, properties with assessor-MLS discrepancies. Test whether the system handles edge cases.
- Adversarial fact tests. Listings where the LLM is likely to hallucinate features. Closed-floor-plan layouts (test "open concept" hallucination), original-condition kitchens (test "luxury finishes" hallucination), unique architectural styles (test feature-claim drift).
- AVM stress cases. Recent sales in fast-moving markets, properties with no recent comps, custom homes, properties in submarkets with thin transaction volume.
- Disparate-impact pairs. For lead-scoring or recommendation models: matched pairs of cases that are identical on every substantive criterion and differ only on demographic proxy variables (zip code, name, surname language).
100 cases per stratum is enough to start. Scale to 300-500 once the eval pipeline is working.
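In code, each entry can carry its stratum and agent-verified ground truth explicitly. A minimal sketch of one record, assuming a plain dict shape (every field name here is illustrative, not a fixed schema):

```python
# One golden-set entry: a real listing plus agent-verified ground truth.
# The stratum labels mirror the list above; "annotations" holds whatever
# your evals score against, so its contents vary by stratum.
golden_entry = {
    "listing_id": "MLS-0000000",    # placeholder; use a real MLS/assessor-backed ID
    "stratum": "adversarial_fact",  # easy_positive | long_tail | adversarial_fact
                                    # | avm_stress | disparate_impact_pair
    "annotations": {
        "sqft": 1850,                           # verified against the assessor
        "year_built": 1962,                     # verified against the assessor
        "open_concept": False,                  # trap: closed floor plan
        "sale_price": 412_000,                  # for AVM error, if recently sold
        "comp_band_80pct": (385_000, 440_000),  # agent-supported valuation band
    },
}
```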
Annotation requires a licensed agent
Every entry in the golden dataset needs ground-truth annotations: which property facts are correct, which AVM estimates fall within reasonable comp-supported bands, which feature claims are supportable. This work cannot be done by an engineer alone. Get a licensed agent to spend the time. Their hourly rate is worth it; a contaminated golden set is worse than no golden set.
For disparate-impact testing specifically, annotation happens by demographic stratum rather than by individual case: each case is labeled as belonging to the protected-class group or the comparison group, with all other criteria matched.
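One way to construct the matched pairs, sketched under the assumption that cases are plain dicts and that the proxy placeholders below stand in for counsel-reviewed variables:

```python
import copy

# Illustrative proxy swaps only; real matched-pair testing needs
# counsel-reviewed proxy variables and a documented methodology.
PROXY_SWAPS = {
    "zip_code": ("ZIP_COMPARISON", "ZIP_PROTECTED"),
    "name": ("NAME_COMPARISON", "NAME_PROTECTED"),
}

def make_matched_pair(base_case: dict) -> tuple[dict, dict]:
    """Return two cases identical on every criterion except the proxies."""
    comparison, protected = copy.deepcopy(base_case), copy.deepcopy(base_case)
    for field, (comp_value, prot_value) in PROXY_SWAPS.items():
        comparison[field] = comp_value
        protected[field] = prot_value
    return comparison, protected
```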
Property-fact accuracy testing
The simplest layer. Every property fact in the AI output is verified against MLS, assessor, or other authoritative source.
```python
import os

from respan import Respan

client = Respan(api_key=os.environ["RESPAN_API_KEY"])

@client.eval(name="property-fact-accuracy")
def property_fact_accuracy(trace, gold):
    output = trace.output
    # extract_property_facts, lookup_authoritative, and normalize are your
    # own helpers: fact extraction from generated text, a lookup against
    # MLS/assessor records, and unit/format normalization respectively.
    facts = extract_property_facts(output.text)
    correct = 0
    fabricated = 0
    for fact in facts:
        source_value = lookup_authoritative(fact["field"], gold.listing_id)
        if source_value is None:
            # No authoritative source has this field: the model made it up.
            fabricated += 1
        elif normalize(source_value) == normalize(fact["value"]):
            correct += 1
        else:
            # Cited source exists but value disagrees
            fabricated += 1
    return {
        "accuracy": correct / max(len(facts), 1),
        "fabricated": fabricated,
    }
```

Track accuracy by fact type: square footage, year built, lot size, bed count, bath count, feature claims. Different fact types have different failure modes.
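To get the per-fact-type breakdown, tag each verified fact with its field and aggregate across the golden set. A sketch, assuming you collect (field, is_correct) pairs from the eval above:

```python
from collections import defaultdict

def accuracy_by_fact_type(results):
    """results: iterable of (field, is_correct) pairs, e.g.
    ("sqft", True), collected across golden-set runs."""
    tally = defaultdict(lambda: [0, 0])  # field -> [correct, total]
    for field, is_correct in results:
        tally[field][1] += 1
        if is_correct:
            tally[field][0] += 1
    return {field: c / t for field, (c, t) in tally.items()}
```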
AVM and comp quality eval
The valuation accuracy benchmark is straightforward: for each property in the golden dataset with a known recent sale price, measure the AVM error. Three metrics matter; a computation sketch for the first two follows the list.
- Median absolute error. The headline metric.
- Confidence band coverage. What percentage of actual sale prices fell within the AVM's stated 80% confidence band? Should be 80%; significantly less means the model is overconfident.
- Performance by submarket. AVM accuracy varies wildly by market. Track by zip code, school district, and price band.
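The first two metrics reduce to a few lines. A sketch, assuming each golden-set case carries the AVM estimate, its stated 80% band, and the actual sale price:

```python
import statistics

def avm_accuracy(cases):
    """cases: dicts with "estimate", "band_80" as (low, high), and
    "sale_price". Run per zip code or price band for the submarket cut."""
    abs_errors = [abs(c["estimate"] - c["sale_price"]) for c in cases]
    in_band = [c["band_80"][0] <= c["sale_price"] <= c["band_80"][1] for c in cases]
    return {
        "median_abs_error": statistics.median(abs_errors),
        # Should land near 0.80; materially lower means overconfident bands.
        "band_80_coverage": sum(in_band) / len(cases),
    }
```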
For comp quality:
- Submarket appropriateness. Of the comps the system selected, how many are from the same school district, same property type, same time window?
- Recency. What fraction of comps are within the right time window for the market?
- Agent agreement. Have an agent rate the comp set as "good" or "would have picked different ones." Track agreement rate.
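The first two comp-quality checks are mechanical once comps carry structured fields. A sketch, assuming each comp records its school district, property type, and sale date:

```python
from datetime import date

def comp_quality(subject, comps, max_age_days=180):
    """subject, comps: dicts with "school_district", "property_type",
    and (for comps) "sale_date" as a date. max_age_days is
    market-dependent; 180 is a placeholder, not a recommendation."""
    same_submarket = [
        c["school_district"] == subject["school_district"]
        and c["property_type"] == subject["property_type"]
        for c in comps
    ]
    recent = [(date.today() - c["sale_date"]).days <= max_age_days for c in comps]
    return {
        "submarket_appropriateness": sum(same_submarket) / len(comps),
        "recency": sum(recent) / len(comps),
    }
```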
Disparate-impact testing in production
The compliance-load-bearing eval. Run continuously on production traffic.
```python
@client.eval(name="lead-scoring-disparate-impact")
def disparate_impact_eval(score_batch, demographic_strata, threshold):
    """
    Run on a batch of recent scores. Compute the four-fifths ratio
    across protected-class strata. Alert and block deploy on violation.
    `threshold` is the score below which a lead counts as rejected.
    """
    rates = {}
    for stratum in demographic_strata:
        cases = score_batch.filter(stratum=stratum)
        if len(cases) < 50:
            continue  # too few cases to be statistically meaningful
        rates[stratum] = {
            "rejection_rate": sum(s < threshold for s in cases.scores) / len(cases),
            "n": len(cases),
        }
    if not rates:
        return {"status": "insufficient_data"}
    max_rate = max(r["rejection_rate"] for r in rates.values())
    violations = []
    for stratum, r in rates.items():
        # A ratio below 0.80 means some stratum is rejected far more often
        # than this one; the disparity itself is what gets flagged.
        ratio = r["rejection_rate"] / max_rate if max_rate > 0 else 1.0
        r["four_fifths_ratio"] = ratio
        if ratio < 0.80:
            violations.append((stratum, ratio))
    return {
        "status": "violation" if violations else "pass",
        "rates_by_stratum": rates,
        "violations": violations,
    }
```

Run monthly at minimum. Run on every model or prompt change for any model that affects access decisions (tenant screening, lead scoring, recommendations).
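Wiring the eval into a deploy gate can be as small as a CI step that exits nonzero on a violation. A sketch; `latest_score_batch`, `PROTECTED_STRATA`, and the threshold value are assumptions about your pipeline, not Respan API:

```python
import sys

# Assumed to exist in your pipeline: a batch of recent lead scores and
# the list of protected-class strata you test against.
result = disparate_impact_eval(latest_score_batch, PROTECTED_STRATA, threshold=0.4)

if result["status"] == "violation":
    for stratum, ratio in result["violations"]:
        print(f"four-fifths violation: {stratum} ratio={ratio:.2f}", file=sys.stderr)
    sys.exit(1)  # nonzero exit blocks the deploy; a human reviews before shipping
```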
Continuous capture from agent overrides
Agent edit rate is the production signal that matters most. Real estate has a particularly high override rate because agents are licensed and personally liable for the marketing they send under their name. That high override rate is a gold mine.
Capture every edit:
- What changed. Word-level diff between the AI draft and the agent-final version.
- Why it changed. Inferred from the diff: factual correction, brand-voice mismatch, compliance concern, scenario mismatch, length adjustment. A classifier sketch follows the capture snippet below.
- Trace ID. Link back to the full generation context (prompt version, model version, retrieval results).
```python
@client.workflow(name="agent-override-capture")
def record_override(trace_id, agent_id, brokerage_id, original, edited, scenario):
    # compute_diff and classify_edit_type are your own helpers; a sketch
    # of the classifier follows below.
    diff = compute_diff(original, edited)
    edit_type = classify_edit_type(diff)
    client.datasets.append(
        name="agent-overrides",
        record={
            "trace_id": trace_id,
            "agent_id": agent_id,
            "brokerage_id": brokerage_id,
            "scenario": scenario,
            "diff": diff,
            "edit_type": edit_type,
        },
    )
```

The dataset becomes both your eval set and your prompt-iteration corpus. When edit rate climbs, you go to the dataset and look at why.
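A first-pass `classify_edit_type` can be a keyword-and-heuristic pass that leaves ambiguous edits for human labeling. A minimal sketch; the diff shape and trigger terms are illustrative, not a vetted compliance lexicon:

```python
import re

# Assumed diff shape: {"removed": [tokens], "added": [tokens]}.
FAIR_HOUSING_FLAGS = ("family-friendly", "safe neighborhood", "exclusive")
NUMERIC = re.compile(r"\d")

def classify_edit_type(diff: dict) -> str:
    removed = " ".join(diff.get("removed", [])).lower()
    added = " ".join(diff.get("added", [])).lower()
    # Compliance first: an agent deleting flagged language is the signal
    # you least want to miss, so it wins ties.
    if any(term in removed for term in FAIR_HOUSING_FLAGS):
        return "compliance_concern"
    # A number swapped for another number is almost always a factual fix
    # (square footage, beds, price), not a style preference.
    if NUMERIC.search(removed) and NUMERIC.search(added):
        return "factual_correction"
    # Pure insertion or pure deletion usually means length adjustment.
    if bool(removed) != bool(added):
        return "length_adjustment"
    # Everything else gets a human (or LLM-judge) label later.
    return "unclassified"
```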
Production patterns
The cadence the leading proptech teams run:
- Offline regression on every prompt or model change. Frozen golden set, agent-annotated. Property-fact accuracy, AVM error, comp quality, disparate-impact ratio.
- Online sampling at 5-10% of live traffic. Agent edit rate, compliance flag rate, conversion rate per scenario. Drift alarms on >10% week-over-week drops (sketched after this list).
- Monthly disparate-impact analysis with formal documentation. Versioned reports, four-fifths ratio per protected class, remediation tracking.
- Quarterly external audit on the disparate-impact methodology. Some larger brokerages do this annually with outside counsel.
- Frozen-dataset re-runs monthly to catch judge drift before model drift.
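The week-over-week drift alarm from the online-sampling item reduces to a comparison of consecutive weekly values. A sketch:

```python
def week_over_week_alarm(weekly_values, drop_threshold=0.10):
    """weekly_values: chronologically ordered weekly metric values,
    e.g. conversion rate per scenario. True when the latest week
    dropped more than drop_threshold relative to the prior week."""
    if len(weekly_values) < 2:
        return False
    prev, curr = weekly_values[-2], weekly_values[-1]
    if prev == 0:
        return False  # avoid division by zero on a dead metric
    return (prev - curr) / prev > drop_threshold
```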
A reference eval stack
If you are starting from zero today, the smallest defensible setup combines:
- A 200-500 case golden set of real properties, agent-annotated, stratified across property type and submarket.
- Property-fact accuracy eval running on every prompt or model change.
- AVM accuracy eval with confidence-band coverage as a first-class metric.
- Comp quality eval with submarket-appropriateness scoring.
- Disparate-impact testing under the four-fifths rule for any model that affects access decisions.
- Agent override capture pipeline writing to a labeled dataset by edit type.
- Online monitors for agent edit rate, compliance flag rate, and disparate-impact ratio.
- Documented monthly disparate-impact report with remediation tracking.
How Respan fits
Real estate AI evals live or die on traceability: when a property-fact hallucination or four-fifths violation surfaces, you need to walk back from the agent edit or compliance flag to the exact MLS retrieval, prompt version, and model call that produced it. Respan ties tracing, evals, and prompt management into one loop so the eval stack above is observable end to end.
- Tracing: every property valuation or listing generation captured as one connected trace. Auto-instrumented for LangChain, LlamaIndex, Vercel AI SDK, CrewAI, AutoGen, OpenAI Agents SDK. MLS lookups, assessor queries, AVM scoring spans, and agent edit deltas all attach to the same trace ID, so the override-capture pipeline above writes a real artifact you can replay.
- Evals: ten built-in evaluators (including faithfulness, citation accuracy, refusal correctness, and harmfulness) plus LLM-as-judge and custom Python evaluators. Production traffic flows directly into datasets. CI-aware experiments block regressions on property-fact accuracy drops, AVM confidence-band undercoverage, or four-fifths-rule violations before deploys ship.
- Gateway: 500+ models behind an OpenAI-compatible interface, semantic caching, fallback chains, per-customer spending caps. Cap per-brokerage spend on listing-copy generation, fall back from a frontier model to a cheaper one for AVM narrative summaries, and cache repeat MLS-summary calls without rewriting your client.
- Prompt management: versioned registry, dev/staging/prod environments with approval workflows, A/B testing in production with one-click rollback. The compliance review your legal team runs on Fair Housing language becomes a real approval gate, and prompts that lift agent edit rate get rolled back the same day.
- Monitors and alerts: agent edit rate by scenario, property-fact fabrication rate, AVM median absolute error, four-fifths ratio per protected class, compliance flag rate. Slack, email, PagerDuty, webhook. Wire the disparate-impact monitor straight to legal so a four-fifths ratio dipping below 0.85 pages the right human before it crosses 0.80.
A reasonable starter loop for real estate AI builders:
- Instrument every LLM call with Respan tracing including MLS retrieval, assessor lookups, AVM scoring, and agent edit-capture spans.
- Pull 200 to 500 production property valuations and listing generations into a dataset and label them for property-fact accuracy, AVM error, comp appropriateness, and demographic stratum.
- Wire two or three evaluators that catch the failure modes you most fear (property-fact hallucinations, AVM confidence-band undercoverage, four-fifths-rule violations on lead scoring).
- Put your listing-generation and lead-scoring prompts behind the registry so you can version, A/B, and roll back without a deploy.
- Route through the gateway so per-brokerage cost caps, fallback chains, and semantic caching on repeat MLS summaries work without touching application code.
This is the same loop the proptech teams running the production cadence above already follow; Respan just collapses it into one platform instead of four.
CTA
To wire the eval stack on Respan, start tracing for free, read the docs, or talk to us. For the rest of the Real Estate cluster: the pillar, the hallucination spoke, the Fair Housing compliance spoke, and the agent copilot build walkthrough.
FAQ
Are there public benchmarks for real estate AI? Not at the level of MedQA or ASAP-AES. The eval set you build is the eval set you have. Build it from your own production traffic with agent annotations.
How big should the golden dataset be? Start with 200-300 properties stratified across property types and submarkets. Scale to 500-1,000 as the eval pipeline matures. Below 100 you do not have enough statistical power for stratified analysis.
What's the right disparate-impact methodology? Four-fifths rule on rejection or low-score rates across protected classes, plus matched-pair testing on synthetic cases with identical criteria that vary only on demographic proxies. Document the methodology, version the results, run monthly.
Should AVM accuracy be evaluated against the actual sale price or the appraisal? Both, but for different things. Sale price is the ground truth for the AVM's prediction quality. Appraisal is the ground truth for the AVM's defensibility in lending contexts. They differ in distribution; track both.
What's the most underrated production metric? Agent edit rate sliced by scenario and agent tenure. New agents and senior agents have different brand voice expectations and edit rates. Aggregate metrics hide this signal entirely.
How do I detect drift? Weekly QWK-style (quadratic weighted kappa) agreement regression on a frozen golden set; alarm on a >0.05 drop. Monthly four-fifths ratio re-computation; alarm on any subgroup dropping below 0.85 (early warning before a 0.80 violation). Quarterly judge-drift checks against archived ground truth.
