The shape of a product search query has changed. According to Klaviyo's 2026 AI Consumer Trends Report, 30% of AI shopping queries now contain eight or more words, and 78% include emotional or personal context at least some of the time. Shoppers no longer type "rain jacket"; they type "lightweight rain jacket that packs into its own pocket for hiking the Grand Canyon in shoulder season because I always overpack." The same report shows 39% of consumers have purchased a product based on an AI recommendation in the past six months, and another 27% used AI as a research starting point. The volume is real, the queries are nuanced, and standard relevance metrics break on them.
Why do they break? Traditional product search is evaluated on short keyword queries paired with binary relevance judgments. NDCG, MRR, and recall@k assume a small (query, document) primitive where the document either matches the keywords or does not. An eight-word query with emotional context, situational constraints, and an implied gift recipient does not fit that primitive. Worse, LLM-powered systems do not just rank; they generate summaries, infer attributes, and personalize on the fly, which means a result can score perfectly on relevance and still hallucinate a SKU that does not exist or invent a feature the product does not have.
The product search eval framework that has emerged for LLM-powered systems looks different from traditional e-commerce search eval. Five dimensions matter: relevance to long-tail intent, personalization fidelity, attribute extraction accuracy, hallucination control, and conversion correlation. This post covers each, with the dataset construction and operational practice that turns one-time benchmarks into continuous evaluation.
The eval pipeline at a glance
The pipeline is continuous: production traces feed sampling, sampling feeds the golden dataset, the dataset is graded by a multi-evaluator suite, and the dashboards drive the next sprint of fixes. Each loop tightens the system.
Trace every search call before you evaluate
You cannot grade what you cannot see. Respan auto-instruments LangChain, LlamaIndex, Vercel AI SDK, CrewAI, and the OpenAI Agents SDK so every attribute extraction, retrieval, ranking, and generation step is captured as one connected trace. Start at platform.respan.ai and replay any production query end to end before you write your first evaluator.
Why traditional search eval breaks on LLM systems
Four properties of LLM-powered product search make traditional metrics insufficient.
- Queries are conversational, not keyword-shaped. An LLM-powered system handles queries that span multiple sentences, carry prior conversation context, and combine specifications with situational constraints, so the (query, document) primitive is too small.
- Results are personalized in real time. A user with prior shin-splint queries gets different running shoe recommendations than one with marathon-training queries, so evaluation must either control for personalization or measure it explicitly.
- Generation is part of the result. The system produces summaries, comparisons, and recommendations in natural language, and the text around the product list affects trust and conversion.
- Multiple surfaces exist simultaneously. The same product needs to appear consistently across ChatGPT (ACP-backed), Google AI Mode (UCP-backed), merchant site search, and Comet-style browser agents, even though eval data may come from only one.

Off-the-shelf IR metrics like NDCG and MRR are necessary but not sufficient.
The evaluator suite at a glance
Most teams default to bullet lists of metrics. That collapses the operational picture: which evaluator catches which failure, what threshold should fire an alert, and what the on-call should do when it fires. The table below is the version we recommend pinning to the eval repo README.
| Evaluator | What it catches | Threshold | Action on fail |
|---|---|---|---|
| Relevance@k (NDCG@10) | Long-tail queries where top-k results miss the multi-attribute intent | NDCG@10 < 0.65 on long-tail stratum | Re-rank step regression test, prompt rollback, retrieve-more-then-rerank patch |
| Personalization correctness | Counterfactual flips where changing one user signal redirects rather than refines | Mismatch rate > 8% on persona suite | Audit personalization features, gate the new signal behind a flag |
| Long-tail recall | 8+ word emotional queries where no relevant product appears in top 50 | Recall@50 < 0.7 on tail stratum | Expand retrieval candidate pool, tune embedding model, add query rewrite |
| Hallucinated product attribute | LLM cites attributes (waterproof, leather, in-stock) the catalog does not back | Hallucination rate > 1.5% on sampled production | Enforce evidence-required prompt, post-generation verifier, block deploy |
| Conversion-rate proxy | Recommendations that score well on relevance but never get clicked or added to cart | CTR@1 drop > 15% week-over-week per surface | Re-run experiment with prior prompt, check ranker drift, inspect new model swap |
Read the table left to right when triaging an alert. The threshold column is a starting point: tune it to your traffic volume and to the precision/recall tradeoff that matches your team's response capacity.
Dimension 1: Long-tail relevance
The first-order question for product search: does the system return relevant products for the queries users actually issue?
Construct the eval set
Three sources for query construction:
- Production query logs are the natural starting point: sample queries from production traffic, stratified by query length, query type (browse, comparison, specific product), and intent (research, purchase, support).
- Synthesized adversarial queries test specific failure modes: unusual qualifiers ("vegan athletic shoes for someone with bunions"), conflicting constraints ("affordable luxury watch"), product type plus situational context ("something for my friend's housewarming who just moved to a tiny apartment").
- Edge cases from the category taxonomy cover queries that span multiple categories, brand names as descriptors ("Patagonia-style jackets"), and products you do not carry but for which you have substitutes.

500 to 2,000 queries is typical for a serious eval; below that, statistical power is limited.
For each query, the gold annotation is what an expert human who knows the catalog would consider relevant. Annotations can be binary (relevant or not), ordinal (highly relevant, relevant, marginal, irrelevant), or multi-attribute (which aspects of the query each result addresses). The richer the annotation, the more useful the eval; multi-attribute annotation catches partial matches where a result addresses the price constraint but misses the use case.
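To make the richer annotation concrete, here is a sketch of what one multi-attribute gold record could look like; the field names and stratum labels are illustrative assumptions, not a prescribed schema.

```python
# Hypothetical multi-attribute annotation record; adapt field names to your own harness.
example_annotation = {
    "query_id": "q-0042",
    "query": "vegan athletic shoes for someone with bunions",
    "stratum": {"length": "long_tail", "intent": "purchase", "category": "footwear"},
    "query_attributes": ["vegan materials", "athletic use", "bunion-friendly fit"],
    "judgments": [
        {
            "product_id": "SKU-1187",
            "grade": "relevant",  # binary or ordinal grade
            "attributes_covered": ["vegan materials", "athletic use"],
            "attributes_missed": ["bunion-friendly fit"],
        },
    ],
}
```

A record like this supports both ordinal NDCG-style grading and the attribute-coverage metric in the table below.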
Compute the metrics
| Metric | Output type | When to use | Stratify by |
|---|---|---|---|
| NDCG@K | Ranked list | Ordinal relevance with multiple acceptable answers | Query length, intent |
| Recall@K | Ranked list | Catalog coverage on multi-aspect queries | Category, tail vs head |
| Precision@K | Ranked list | High-confidence buy-box style surfaces | Surface (ACP, UCP, direct) |
| MRR | Ranked list | One clearly correct answer expected | Specific-product queries only |
| Coverage of query attributes | Ranked list + LLM | Multi-aspect long-tail queries | Number of attributes per query |
| Refusal correctness | Generated answer | Out-of-catalog or ambiguous queries | Query length, ambiguity score |
Stratify all metrics by query length, query type, and category. Aggregate metrics hide the queries the system does worst on.
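If you are writing the harness yourself, stratified NDCG@K is only a few lines. The sketch below uses one common DCG formulation (graded relevance divided by log2 of rank + 1) and assumes a simple example structure with a single stratum label per query; it is a starting point, not a definitive implementation.

```python
import math
from collections import defaultdict

def ndcg_at_k(relevances: list, k: int = 10) -> float:
    """relevances: gold grades in the order the system returned the results."""
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))
    ideal = sorted(relevances, reverse=True)[:k]
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

def stratified_ndcg(examples: list, k: int = 10) -> dict:
    """examples: [{"stratum": "long_tail", "relevances": [2, 0, 3, ...]}, ...]"""
    buckets = defaultdict(list)
    for ex in examples:
        buckets[ex["stratum"]].append(ndcg_at_k(ex["relevances"], k))
    return {stratum: sum(scores) / len(scores) for stratum, scores in buckets.items()}

print(stratified_ndcg([
    {"stratum": "long_tail", "relevances": [2, 0, 3, 1]},
    {"stratum": "head", "relevances": [3, 3, 2, 0]},
]))
```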
Dimension 2: Personalization fidelity
Personalized recommendations should reflect real user signals, not surface noise. Four checks matter.
Persona-based eval. Define a small number of persona profiles (a marathon runner with prior shin splint queries, a parent looking for kid-friendly products, a budget-conscious shopper). For the same query set, measure how recommendations differ across personas. The differences should match what a domain expert would expect.
Counterfactual stability. Run the same query with one detail changed. A change from "running shoes for me" to "running shoes for my partner" should produce different results in expected ways. A change from "blue shirt" to "blue T-shirt" should refine, not redirect.
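A counterfactual check can be scripted as a comparison of top-k overlap between the base query and its perturbed variant: refinements should keep most of the list, redirects should not. The sketch below assumes a hypothetical `search_fn` that returns ranked product IDs; the overlap thresholds are illustrative starting points to tune against labeled pairs.

```python
def result_overlap(results_a: list, results_b: list, k: int = 10) -> float:
    """Jaccard overlap of the top-k product IDs from two runs."""
    a, b = set(results_a[:k]), set(results_b[:k])
    return len(a & b) / len(a | b) if a | b else 1.0

counterfactual_cases = [
    # A wording refinement should keep most of the result list.
    {"base": "blue shirt", "variant": "blue T-shirt", "expect": "refine"},
    # A recipient change should change the results in expected ways.
    {"base": "running shoes for me", "variant": "running shoes for my partner", "expect": "redirect"},
]

def check_counterfactuals(search_fn, cases, refine_min=0.5, redirect_max=0.7):
    failures = []
    for case in cases:
        overlap = result_overlap(search_fn(case["base"]), search_fn(case["variant"]))
        if case["expect"] == "refine" and overlap < refine_min:
            failures.append({**case, "overlap": overlap})
        if case["expect"] == "redirect" and overlap > redirect_max:
            failures.append({**case, "overlap": overlap})
    return failures
```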
Personalization audit trail. For each personalized recommendation, what specific user signals contributed? A user is going to ask "why are you showing me this?" The answer needs to be coherent and the signals inspectable.
Privacy-respecting personalization. Distinguish opted-in personalization from inferred personalization (location proxies, device fingerprints, inferred demographics) and surface the latter for review.
Personalization that learns from biased historical data can produce discriminatory recommendations. A system that observes "users in zip code X buy more luxury items" and skews recommendations accordingly is using zip code as a wealth proxy, which can correlate with race. Include disparate impact testing for personalization, similar to the bias audit post in the HR cluster.
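One way to operationalize the disparate impact check is to measure, per persona or group, how often a sensitive outcome appears in the recommendations (for example, the share of runs where only premium-priced products are returned) and compare the lowest group rate to the highest. The record format below is illustrative, and the four-fifths screening threshold is a common heuristic, not a legal standard.

```python
from collections import defaultdict

def disparate_impact_ratio(records: list, group_key: str = "persona",
                           outcome_key: str = "premium_only") -> float:
    """records: one dict per (query, persona) run with a boolean outcome flag.
    Returns min group rate / max group rate; values below ~0.8 warrant review."""
    grouped = defaultdict(list)
    for r in records:
        grouped[r[group_key]].append(1.0 if r[outcome_key] else 0.0)
    rates = {group: sum(vals) / len(vals) for group, vals in grouped.items()}
    return min(rates.values()) / max(rates.values()) if max(rates.values()) > 0 else 1.0
```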
Dimension 3: Attribute extraction accuracy
LLM-powered search extracts structured attributes from natural-language queries before retrieval. Errors here propagate through everything downstream.
Test set construction
A separate eval set focused on parsing accuracy. Each example is (query, expected attributes). Coverage spans price ranges ("under $50," "premium"), category and subcategory ("running shoes" vs "trail running shoes"), brand mentions, use cases ("for hiking," "for cold weather"), constraints ("vegan," "made in USA"), comparative qualifiers ("alternative to X"), and negations ("not too tight," "without leather"). The system's extraction is compared to the gold structure each query implies.
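Scoring the parser can start as a per-field comparison between the gold structure and the extraction. The sketch below assumes flat attribute dicts with hypothetical field names; list-valued or nested fields would need a fuzzier comparison than exact match.

```python
def score_extraction(expected: dict, extracted: dict) -> dict:
    """Per-field exact-match scoring between gold and extracted query attributes."""
    fields = set(expected) | set(extracted)
    correct = sum(1 for f in fields if expected.get(f) == extracted.get(f))
    return {
        "accuracy": correct / len(fields) if fields else 1.0,
        "missing": [f for f in expected if f not in extracted],
        "spurious": [f for f in extracted if f not in expected],
        "wrong": [f for f in expected if f in extracted and expected[f] != extracted[f]],
    }

# Illustrative gold structure for "vegan running shoes under $50, not too narrow".
gold = {"category": "running shoes", "price_max": 50, "constraints": ["vegan"], "fit_exclude": "narrow"}
```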
Common failure modes
| Failure mode | Example query | What goes wrong |
|---|---|---|
| Negation flip | "shoes that are not too narrow" | Parser drops "not" and returns narrow shoes |
| Implicit attribute overshoot | "shoes for my grandmother" | System assumes all gifts for older relatives must be orthopedic |
| Conflicting constraints | "affordable luxury watch" | System guesses tier instead of asking, returns wrong half of catalog |
| Scope error | "Apple watch" vs "watch with apple logo" | Brand confused with descriptor, wrong category retrieved |
| Temporal anchoring | "latest iPhone" or "for my June wedding" | Stale training cutoff returns last year's model or out-of-season inventory |
Specific tests for each failure mode are necessary. Negation, in particular, is worth its own labeled set because the cost of inversion is high.
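A labeled negation set can stay small and still catch the inversion. The sketch below assumes a hypothetical `extract_fn` that returns the parsed filters; a case fails when a negated value survives as a positive filter.

```python
# Each case records the attribute value the parser must NOT keep as a positive filter.
negation_cases = [
    {"query": "shoes that are not too narrow", "must_not": {"width": "narrow"}},
    {"query": "jacket without leather", "must_not": {"material": "leather"}},
    {"query": "moisturizer that isn't scented", "must_not": {"scented": True}},
]

def negation_violations(extract_fn, cases):
    failures = []
    for case in cases:
        parsed = extract_fn(case["query"])
        if any(parsed.get(attr) == value for attr, value in case["must_not"].items()):
            failures.append({**case, "parsed": parsed})
    return failures
```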
Dimension 4: Hallucination control
The most legally and operationally consequential failure mode. LLM-powered product search hallucinates in four flavors: products that do not exist (a recommendation with price, image, and description that resolve to nothing), product attributes that are wrong ("waterproof" for water-resistant, "leather" for vegan), fabricated social proof ("one of our most popular products" with no review data backing it), and pricing or availability errors (a cached price in chat that does not match checkout, the bug that killed the original ChatGPT Instant Checkout).
LLM-as-judge for intent match on long-tail queries
For the emotional, multi-clause queries that traditional metrics miss, an LLM-as-judge step is the cleanest grader. The judge sees the original query, the returned product, and the catalog facts, and grades whether the product genuinely satisfies the user's intent. Below is a working Python snippet that runs through Respan's gateway so you get tracing, caching, and prompt versioning for free.
```python
import os
import json
from openai import OpenAI

# Respan gateway is OpenAI-compatible. Set RESPAN_API_KEY in env.
client = OpenAI(
    api_key=os.environ["RESPAN_API_KEY"],
    base_url="https://gateway.respan.ai/v1",
)

JUDGE_PROMPT = """You are evaluating whether a product search result satisfies a customer's intent.
The customer's query may include emotional context, situational constraints, or implicit needs.

Query: {query}

Returned product:
- Title: {title}
- Attributes: {attributes}
- Price: ${price}
- In stock: {in_stock}

Task:
1. Extract every constraint from the query (explicit and implicit).
2. For each constraint, mark whether the returned product satisfies, partially satisfies, or violates it.
3. Flag any attribute the product card claims that is NOT supported by the catalog attributes provided.
4. Return a JSON object with: score (0-1), constraints (list), hallucinated_attributes (list), reasoning.

Return JSON only.
"""


def judge_search_result(query: str, product: dict) -> dict:
    response = client.chat.completions.create(
        model="claude-opus-4-7",
        messages=[
            {
                "role": "user",
                "content": JUDGE_PROMPT.format(
                    query=query,
                    title=product["title"],
                    attributes=json.dumps(product["attributes"]),
                    price=product["price"],
                    in_stock=product["in_stock"],
                ),
            }
        ],
        # Respan-specific headers attach metadata to the trace.
        extra_headers={
            "x-respan-prompt-id": "search-judge-v3",
            "x-respan-tags": "eval,llm-judge,product-search",
        },
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)


if __name__ == "__main__":
    query = (
        "lightweight rain jacket that packs into its own pocket "
        "for hiking the Grand Canyon in shoulder season"
    )
    product = {
        "title": "TrailShell Packable Rain Shell",
        "attributes": {
            "weight_oz": 8.2,
            "packable": True,
            "waterproof_rating_mm": 10000,
            "season": ["spring", "fall"],
        },
        "price": 119.00,
        "in_stock": True,
    }

    verdict = judge_search_result(query, product)
    print(json.dumps(verdict, indent=2))

    # Block the result if score is too low or any attribute hallucinated.
    if verdict["score"] < 0.7 or verdict["hallucinated_attributes"]:
        print("REJECT: send back to ranker for fallback candidate")
```

The judge runs offline against eval datasets and online as a guardrail on a sampled fraction of production traffic. When the score drops or hallucinated attributes appear, the trace is already captured, the prompt version is pinned, and the on-call can replay the entire chain.
Test for hallucination
Four practices, layered:
- Evidence-required prompting forces every recommendation to cite a specific product ID and attribute values from the live catalog, with unevidenced recommendations filtered.
- Post-generation verification runs a separate process that validates each cited product and fact (does it exist, do attributes match, is the price current, is inventory available) and flags or removes failures (a minimal sketch follows this list).
- An adversarial query suite probes for fake brand names, very specific products that do not exist, and attributes uncommon in the category, expecting refusal or qualified substitutes.
- Production sampling continuously runs responses through the verification pipeline and alerts on hallucination-rate increases, since a frontier model release that improves overall accuracy can simultaneously raise edge-case hallucinations.
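The verifier does not need to be elaborate to catch all four flavors. A minimal sketch, assuming the generated recommendation and the catalog record are plain dicts with hypothetical field names; in production the catalog lookup would hit the live product and inventory services.

```python
def verify_recommendation(rec: dict, catalog: dict) -> list:
    """Return a list of violations; an empty list means the recommendation is evidenced."""
    product = catalog.get(rec.get("product_id"))
    if product is None:
        return ["product_id does not exist in catalog"]
    violations = []
    for attr, claimed in rec.get("claimed_attributes", {}).items():
        if product.get("attributes", {}).get(attr) != claimed:
            violations.append(f"attribute '{attr}' not backed by catalog")
    if "price" in rec and abs(rec["price"] - product["price"]) > 0.01:
        violations.append("quoted price does not match catalog price")
    if rec.get("in_stock") and not product.get("in_stock", False):
        violations.append("claimed in stock but catalog shows unavailable")
    return violations
```

Flagged results feed the same trace and dataset loop as the LLM-as-judge verdicts.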
Run the judge as a CI gate, not just a dashboard
Respan experiments let you run the LLM-as-judge above against a versioned dataset on every pull request, with a blocking threshold for hallucinated attributes and intent-match score. Wire it once at platform.respan.ai and a regression in long-tail relevance never reaches production.
Dimension 5: Conversion correlation
The ultimate metric for product search is whether it drives purchases, but conversion correlation is not the same as relevance.
| Signal | What it tells you | Watch out for |
|---|---|---|
| CTR by rank | Whether users investigate your recommendations | Position bias inflates top results |
| Add-to-cart rate | Whether users move toward purchase | High intent queries skew the metric |
| Purchase conversion | Whether they buy | Selection bias on agent surfaces |
| Return rate | Whether they keep what they bought | 30-day lag on the signal |
| LTV of agent-driven customers | Whether AI traffic is worth acquisition spend | Cohort takes weeks to mature |
Three traps recur. Optimizing solely for conversion can degrade relevance: a system that learns "expensive products convert better, recommend them more" is making a fairness error if price correlates with anything other than user value. Selection bias in production data: users arriving through agent surfaces are not representative, so conversion from that slice has to be normalized against the broader user base. Returns are a slow signal: today's purchases say little about quality until the return window closes, so the eval framework has to account for the lag.
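To keep position bias from contaminating the week-over-week comparison in the evaluator table, compute CTR within each rank position and compare like with like. A minimal sketch, assuming impression logs that carry a rank and a click flag:

```python
from collections import defaultdict

def ctr_by_rank(impressions: list) -> dict:
    """impressions: [{"rank": 1, "clicked": True}, ...] -> CTR per rank position."""
    clicks, views = defaultdict(int), defaultdict(int)
    for imp in impressions:
        views[imp["rank"]] += 1
        clicks[imp["rank"]] += int(imp["clicked"])
    return {rank: clicks[rank] / views[rank] for rank in views}

def week_over_week_drop(current: dict, previous: dict, rank: int = 1) -> float:
    """Fractional CTR drop at one rank; feeds the 15% alert threshold above."""
    return (previous[rank] - current[rank]) / previous[rank] if previous.get(rank) else 0.0
```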
What separates serious eval from compliance theater
After watching e-commerce LLM search through 2025 and 2026, the teams that build durable search experiences share six habits. The eval set is curated, versioned, and refreshed as a strategic asset that compounds over time. Metrics are multiple and stratified, not a single relevance score collapsing five distinct failure modes. Hallucination is actively tested with adversarial queries, evidence-required prompting, and post-generation verification, with the hallucination rate tracked rather than assumed low. Personalization is audited for fairness with disparate impact testing on recommendation distributions. Conversion correlation acknowledges its biases: optimization signals are balanced, selection bias is normalized, returns are tracked with their lag. And findings produce engineering work: eval is not a dashboard, it is the input to the next sprint. Teams that ship a quick LLM wrapper and call it search produce demos that erode trust the second time the user notices a hallucinated product.
Treat the eval set as a versioned asset
Production queries that fail today are training data tomorrow. Respan datasets version every example with the trace, judge verdict, and prompt that produced it, so the eval set compounds rather than drifts. Spin one up at platform.respan.ai and pipe failed production queries directly into your golden dataset.
Build order
| Step | What you are building | Why it matters |
|---|---|---|
| 1 | Trace every search call end to end | You cannot grade what you cannot see |
| 2 | Pull 200-500 production queries into a labeled dataset | The eval set is the asset that compounds |
| 3 | Wire 2-3 evaluators for the failure modes you fear most | Coverage beats sophistication early |
| 4 | Put extraction, recommendation, and refusal prompts in the registry | Versioning unlocks A/B and rollback |
| 5 | Route through the gateway with caching and fallback | Cost and consistency across surfaces |
| 6 | Monitor hallucination rate and conversion correlation per surface | Catch regressions before customers do |
How Respan fits
LLM-powered product search eval only works when every query, retrieval, and generated recommendation is captured with the structured signals to score them. Respan is the substrate that turns the five-dimension framework above into a continuous pipeline rather than a one-time benchmark.
- Tracing: every product search query captured as one connected trace, from attribute extraction through retrieval, ranking, generation, and post-generation verification. Auto-instrumented for LangChain, LlamaIndex, Vercel AI SDK, CrewAI, AutoGen, OpenAI Agents SDK. When a long-tail query like "lightweight rain jacket that packs into its own pocket for hiking the Grand Canyon in shoulder season" produces a hallucinated product, you can replay the full chain and see exactly which step invented the SKU.
- Evals: ten built-in evaluators (faithfulness, citation accuracy, refusal correctness, harmfulness) plus LLM-as-judge and custom Python evaluators. Production traffic flows directly into datasets. CI-aware experiments block regressions on hallucinated products, wrong attribute extraction (negation flips, conflicting constraints), incorrect refusals, and disparate impact in personalized recommendations before deploys ship.
- Gateway: 500+ models behind an OpenAI-compatible interface, semantic caching, fallback chains, per-customer spending caps. Semantic caching cuts cost on repeated browse queries while fallback chains keep ACP and UCP surfaces consistent when the primary model degrades on edge cases.
- Prompt management: versioned registry, dev/staging/prod environments with approval workflows, A/B testing in production with one-click rollback. The attribute extractor prompt, the evidence-required recommendation prompt, the refusal prompt, and the persona-conditioned ranking prompt all belong in the registry so eval scores can be tied to specific prompt versions.
- Monitors and alerts: hallucination rate, refusal correctness, NDCG@K per query stratum, attribute extraction accuracy, conversion correlation per surface (ACP, UCP, direct, Comet). Slack, email, PagerDuty, webhook. A frontier model swap that improves overall accuracy but spikes hallucination on obscure categories pages the on-call before customer support tickets land.
Search experiences users come back to are built on this loop, not on a one-time benchmark that drifts the moment a frontier model ships. To wire any of the patterns above on Respan, start tracing for free, read the docs, or talk to us.
Related reading
- Building for the Agentic Commerce Era: the protocol layer and agent traffic patterns
- LLM Customer Service in E-commerce: adjacent LLM application with similar failure modes
- Building an AI Shopping Assistant: full architecture walkthrough
- How E-commerce Teams Build LLM Apps in 2026: pillar overview
