If you are building an AI shopping assistant in 2026, the architecture is no longer a research question. The build pattern has stabilized into four layers: a catalog graph that represents every product with consistent attributes, hybrid retrieval that combines lexical and semantic signals, structured generation grounded in real catalog data with per-claim citations, and conversion tracking designed for a no-click world. Shopify Sidekick crossed 100 million conversations in its first year. Klaviyo's AI customer service agent reports response time reductions of 70 to 90 percent with answer quality on par with a senior support agent. Nosto, Rep AI, and Octopus AI run production agents at scale across thousands of merchants. The differentiation is execution depth.
The hard parts are not the LLM call or the embedding model. They are the things that determine whether the assistant converts: a catalog representation that handles millions of products with consistent attributes, retrieval that surfaces the right products for long-tail queries, structured generation grounded in real catalog data, personalization that uses real signals, and conversion tracking that works when the user buys without ever clicking a link. Get any of these wrong and the assistant either hallucinates products that do not exist or recommends generic items that fail to move purchase intent. McKinsey's 2025 commerce survey put conversion-rate uplift from grounded recommenders at 10 to 30 percent, with the top quartile clearing 40 percent on assisted SKUs.
This post walks through the architecture, identifies where teams typically miss, and lays out a build plan with eval gates between every stage. It assumes you have read the related posts on agentic commerce protocols, search evaluation, and customer service architecture. Those define the context; this is the build.
Architecture overview
The simplified production architecture, end to end: user query → intent classification → hybrid retrieval over the catalog graph → grounded generation with per-claim citations → verification → response, with personalization signals feeding retrieval and generation, and conversion tracking closing the loop.
Each block is its own subsystem. The hard parts cluster in three places: the catalog graph (the foundation everything else depends on), retrieval (where relevance lives), and verification (the difference between trusted and abandoned). Conversion tracking is the closing loop, and without it the team is iterating blind.
Trace every layer end to end
Respan auto-instruments LangChain, LlamaIndex, Vercel AI SDK, CrewAI, and OpenAI Agents SDK so a single shopping conversation is one connected trace from intent classification through retrieval, generation, verification, and conversion. Replay any hallucinated product or stale price citation in seconds at platform.respan.ai.
The catalog graph
The single most important piece of infrastructure for a shopping assistant is the catalog representation. Shopify spent two years on this before launching Sidekick at scale, using LLMs to categorize products and extract attributes consistently across millions of merchants, each with inconsistent data.
The pattern that works starts with a canonical product schema. Every product in your catalog has a standardized representation independent of how the merchant uploaded it. The schema below is the minimum that supports retrieval, grounded generation, and verification:
product:
  product_id: <uuid>
  merchant_id: <id>
  identity:
    title: <text>
    brand: <text or null>
    canonical_category: <reference to taxonomy>
    canonical_subcategory: <reference to taxonomy>
    gtin: <if available>
  description:
    short_description: <text>
    long_description: <text>
  structured_attributes:
    - attribute: <e.g., "material">
      value: <e.g., "merino wool">
      confidence: <float>
      source: extracted | provided
  variants:
    - variant_id: <id>
      sku: <text>
      attributes: <map>  # color, size, etc.
      price:
        amount: <decimal>
        currency: <ISO>
        sale_price: <decimal or null>
      inventory:
        available: <int or "in_stock" / "out_of_stock">
        last_updated: <ISO timestamp>
  media:
    images: [<list of refs>]
    primary_image: <ref>
  pricing_metadata:
    typical_price_range: [<low>, <high>]
    price_tier: budget | mid | premium | luxury
  social_proof:
    review_count: <int>
    average_rating: <float 0-5>
    most_recent_review_excerpt: <text>
  availability:
    shipping_estimate: <text>
    geographic_availability: [<list>]
    return_policy_summary: <text>
  agent_metadata:
    embeddings: <vector or ref>
    last_indexed: <ISO>
    parsing_confidence: <float>

What matters in this schema. Three fields carry most of the weight. structured_attributes with confidence and source is what lets the verifier trace every generated claim back to a real catalog field. last_updated on inventory and last_indexed on the agent metadata are how the system knows which records are stale enough to refuse. parsing_confidence is the trust signal you propagate into retrieval ranking, so noisy merchant uploads get downweighted before they reach the generation layer.
Taxonomy and attribute extraction. Products from every merchant get mapped into a standardized taxonomy of categories and a standardized vocabulary of attributes, so items described differently upstream land in consistent representations. An LLM-based extraction layer fills in attributes the merchant did not provide, tagging each with a confidence score.
This is the hard, expensive, ongoing infrastructure work. It is also the foundation everything downstream depends on. A retrieval system over a poorly normalized catalog returns inconsistent results. A generation layer working from inconsistent attributes hallucinates to fill gaps.
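A minimal sketch of that extraction call, assuming an OpenAI-compatible client and a constrained attribute vocabulary. The prompt, model name, and helper are illustrative, not a fixed interface:

import json
from openai import OpenAI

client = OpenAI()  # any OpenAI-compatible endpoint

EXTRACTION_PROMPT = """Extract structured attributes from this product listing.
Return JSON: {{"attributes": [{{"attribute": ..., "value": ..., "confidence": 0-1}}]}}.
Only use attributes from this vocabulary: {vocabulary}.
Listing:
{listing}"""

def extract_attributes(listing_text: str, vocabulary: list[str]) -> list[dict]:
    """Fill in structured_attributes the merchant did not provide. Extracted
    attributes are tagged source=extracted so verification can weigh them
    differently from merchant-provided values."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; use whatever extraction model you run
        messages=[{"role": "user", "content": EXTRACTION_PROMPT.format(
            vocabulary=", ".join(vocabulary), listing=listing_text)}],
        response_format={"type": "json_object"},
        temperature=0,
    )
    extracted = json.loads(resp.choices[0].message.content)["attributes"]
    return [
        {**attr, "source": "extracted"}
        for attr in extracted
        if attr.get("attribute") in vocabulary  # drop anything outside the vocabulary
    ]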
Versioning and freshness. Every product record has a last-indexed timestamp. Inventory and price changes propagate to the index in near real time (sub-minute is the target, minute-level is acceptable, hour-level is failing). Stale data is the most common cause of agent platform deprioritization, as documented in OpenAI's retreat from the original Instant Checkout.
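One way to enforce the freshness rule at serving time is a staleness gate in front of retrieval results. The field names below follow the product schema above; the thresholds and the function itself are a sketch, assuming timezone-aware ISO timestamps:

from datetime import datetime, timezone

PRICE_MAX_AGE_S = 60          # sub-minute target for price and inventory
INDEX_MAX_AGE_S = 24 * 3600   # re-index attributes at least daily

def is_servable(product: dict, now: datetime | None = None) -> bool:
    """Refuse to surface records whose price/inventory or index data is stale.
    Dropping a candidate is better than citing a price the cart will contradict."""
    now = now or datetime.now(timezone.utc)

    def age_seconds(ts: str) -> float:
        return (now - datetime.fromisoformat(ts)).total_seconds()

    inventory_fresh = all(
        age_seconds(v["inventory"]["last_updated"]) <= PRICE_MAX_AGE_S
        for v in product["variants"]
    )
    index_fresh = age_seconds(product["agent_metadata"]["last_indexed"]) <= INDEX_MAX_AGE_S
    return inventory_fresh and index_fresh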
Scale considerations. Shopify operates this across products from millions of merchants. A single-merchant build is much simpler. The architecture scales in two directions: catalog size (more products, more attributes, more variants) and merchant count (different schemas, different quality, different policies). Build for the dimension you actually face.
Retrieval
The retrieval layer takes a user query (and conversation context, and personalization signals) and returns a candidate set of products for the generation layer to rank and present.
The pattern that works in production combines lexical search (BM25 or equivalent), semantic search (dense embeddings), and a learned reranker. Each catches different query types. Lexical matches keyword-specific queries ("Patagonia jacket"), semantic matches descriptive queries ("warm jacket for cold weather hiking"), and the reranker combines them.
A real implementation of the hybrid call looks like this:
from typing import Iterable


def hybrid_retrieve(
    query: str,
    extracted_attrs: dict,
    user_signals: dict,
    k: int = 50,
) -> list[dict]:
    """Combine BM25 and dense retrieval, filter by attributes,
    then rerank with a cross-encoder. Returns candidates with
    score breakdowns so generation can cite retrieval reasons."""
    bm25_hits = bm25_index.search(query, k=k * 2, filters=extracted_attrs)
    dense_hits = vector_index.search(
        embed(query, user=user_signals.get("user_id")),
        k=k * 2,
        filters=extracted_attrs,
    )
    fused = reciprocal_rank_fusion(bm25_hits, dense_hits, k_rrf=60)
    fused = apply_personalization(fused, user_signals)

    # Rerank only the fused head; candidates past k * 2 never get a combined score.
    reranked = fused[: k * 2]
    pairs = [(query, _doc_text(c["product_id"])) for c in reranked]
    rerank_scores = cross_encoder.score(pairs)
    for c, s in zip(reranked, rerank_scores):
        c["retrieval_scores"]["combined"] = s
    reranked.sort(key=lambda c: c["retrieval_scores"]["combined"], reverse=True)
    return diversify(reranked, attribute="canonical_subcategory", k=k)


def reciprocal_rank_fusion(
    *result_lists: Iterable[dict], k_rrf: int = 60
) -> list[dict]:
    scores: dict[str, float] = {}
    by_id: dict[str, dict] = {}
    for results in result_lists:
        for rank, hit in enumerate(results):
            pid = hit["product_id"]
            scores[pid] = scores.get(pid, 0.0) + 1.0 / (k_rrf + rank)
            by_id[pid] = hit
    out = []
    for pid, s in sorted(scores.items(), key=lambda x: -x[1]):
        hit = by_id[pid]
        hit.setdefault("retrieval_scores", {})["fused"] = s
        out.append(hit)
    return out

The pieces that matter: reciprocal rank fusion gives a calibrated way to merge BM25 and dense scores without hand-tuned weights, the cross-encoder rerank pass is where most of the precision gain shows up, and diversify enforces attribute spread so the top-K are not five near-duplicate variants of the same SKU.
Filtering by extracted attributes. The intent classifier produces structured attributes from the query (price range, category, color, etc.). Retrieval filters the candidate set against these. A query for "blue running shoes under $80" filters by color = blue, category = running shoes, price under $80 before semantic search runs.
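A sketch of how classifier output might become a pre-retrieval filter. Field names follow the retrieval contract below; the filter format is whatever your index expects, and build_filters is illustrative:

def build_filters(extracted_attrs: dict) -> dict:
    """Turn intent-classifier output into hard pre-retrieval filters.
    'blue running shoes under $80' -> color=blue, category=running shoes,
    price <= 80. Attributes the classifier did not extract stay open."""
    filters: dict = {}
    if extracted_attrs.get("category"):
        filters["canonical_category"] = {"in": extracted_attrs["category"]}
    if extracted_attrs.get("color"):
        filters["variants.attributes.color"] = {"in": extracted_attrs["color"]}
    if extracted_attrs.get("brand"):
        filters["identity.brand"] = {"in": extracted_attrs["brand"]}
    price_range = extracted_attrs.get("price_range")
    if price_range:
        low, high = price_range
        filters["variants.price.amount"] = {
            "gte": low or 0,
            "lte": high if high is not None else float("inf"),
        }
    return filters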
Personalization at the retrieval layer. User signals (prior queries, prior purchases, browsing history, declared preferences) influence which candidates surface. Implementation varies: some systems use user embeddings concatenated with query embeddings, some use post-retrieval reranking based on user history, and some use feature-store-backed filtering.
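The apply_personalization step in the retrieval code above can start as a post-retrieval boost from user history. A minimal sketch, assuming user_signals carries prior purchase brands and browsed categories and that candidates expose brand and subcategory; the weights are illustrative:

def apply_personalization(candidates: list[dict], user_signals: dict) -> list[dict]:
    """Post-retrieval boost from real user signals. Signals nudge the fused
    score; they never override hard filters derived from the query itself."""
    preferred_brands = set(user_signals.get("purchased_brands", []))
    recent_categories = set(user_signals.get("browsed_categories", []))
    for c in candidates:
        boost = 0.0
        if c.get("brand") in preferred_brands:
            boost += 0.02   # illustrative weight; tune against holdout conversions
        if c.get("canonical_subcategory") in recent_categories:
            boost += 0.01
        scores = c.setdefault("retrieval_scores", {})
        scores["personalization"] = boost
        scores["fused"] = scores.get("fused", 0.0) + boost
    candidates.sort(key=lambda c: c["retrieval_scores"]["fused"], reverse=True)
    return candidates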
Recency and seasonality. Recently launched products and seasonally relevant products get a boost. A query in October for "jackets" should preferentially surface fall and winter inventory.
Diversity and exploration. The top K results should not all be near-duplicates. Diversity-aware reranking ensures the surfaced set covers the relevant attribute space. Exploration occasionally surfaces products outside the user's typical pattern to learn new preferences.
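The diversify step referenced in the retrieval code can be a simple per-attribute cap. A sketch, assuming each candidate carries the canonical_subcategory field from the catalog graph and the cap value is a tuning knob:

from collections import defaultdict

def diversify(candidates: list[dict], attribute: str, k: int, max_per_value: int = 3) -> list[dict]:
    """Cap how many items sharing one attribute value reach the top-K,
    so five colorways of the same shoe do not crowd out the rest of the space."""
    seen: dict = defaultdict(int)
    surfaced, overflow = [], []
    for c in candidates:
        value = c.get(attribute)
        if seen[value] < max_per_value and len(surfaced) < k:
            seen[value] += 1
            surfaced.append(c)
        else:
            overflow.append(c)
    # Backfill from overflow if the cap left the surfaced list short of k.
    return (surfaced + overflow)[:k]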
The retrieval contract that downstream layers rely on:
retrieval_request:
  query: <user input>
  conversation_context: <prior turns>
  user_signals:
    user_id: <if known>
    session_history: [<list>]
    declared_preferences: [<list>]
  extracted_attributes:
    category: [<list>]
    price_range: [<min>, <max>]
    color: [<list>]
    size: [<list>]
    use_case: [<list>]
    brand: [<list>]
    other: [<map>]

retrieval_response:
  candidates:
    - product_id: <id>
      retrieval_scores:
        lexical: <float>
        semantic: <float>
        personalization: <float>
        recency: <float>
        combined: <float>
      retrieval_explanation:
        matched_attributes: [<list>]
        personalization_signals: [<list>]
  total_candidates_in_index: <int>
  filtered_to: <int>
  surfaced: <int>

What matters in this schema. Score breakdowns are non-optional. The combined score alone is not auditable, but lexical, semantic, personalization, and recency components together let you diagnose why a product surfaced and back-test changes when relevance drifts. The retrieval explanation is what supports the eval framework covered in Evaluating LLM-Powered Product Search.
Generation with grounding
The generation layer takes the user query, conversation context, and retrieved candidates, and produces the response. The structure of the response is what separates trusted from untrusted.
Structured output, not prose. A grounded response is structured: a list of recommendations with per-product rationale, possibly a comparison table, possibly a follow-up question. Prose summaries that float free of structure are harder to verify and easier to hallucinate.
Citations to specific products. Every claim the generation layer makes about a product references that product's ID and the specific catalog field the claim came from. "This jacket is waterproof" cites the product's structured_attributes for waterproofing. "These reviewers love it for hiking" cites the actual review excerpts. Without citations, the layer hallucinates with confidence.
Comparison logic. Multi-product queries often ask for comparisons. The generation layer produces structured comparison tables when comparison is the implied intent. The table fields come from the catalog graph, rows are products, and cells are real attribute values, not generated descriptions.
Personalization in the response. The same retrieved candidates can produce different responses based on user signals. A user with prior shin splint queries gets a response that highlights cushioning and stability features, and a user shopping for a marathon gets weight and energy return. The personalization signals that affect retrieval also affect generation.
Conversation continuity. Multi-turn shopping conversations carry context. "What about in blue?" only makes sense given the prior turn. The generation layer maintains state about what was previously discussed, which products were shown, and what the user expressed interest in.
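A minimal shape for that state, carried between turns. The dataclass and field names are illustrative; what matters is that "What about in blue?" can resolve against it:

from dataclasses import dataclass, field

@dataclass
class ConversationState:
    """Per-session shopping context the generation layer reads and updates
    every turn, so follow-ups like 'What about in blue?' resolve correctly."""
    shown_product_ids: list[str] = field(default_factory=list)
    active_constraints: dict = field(default_factory=dict)       # category, price_range, color...
    expressed_interest: list[str] = field(default_factory=list)  # products the user reacted to
    last_intent: str | None = None

    def refine(self, new_attrs: dict) -> dict:
        """Merge a follow-up's extracted attributes onto the prior constraints.
        'What about in blue?' only carries color; everything else persists."""
        self.active_constraints = {**self.active_constraints, **new_attrs}
        return self.active_constraints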
The structured output contract for a shopping query:
response:
  intent: product_recommendation
  conversational_text: <text wrapping the recommendations>
  recommendations:
    - rank: 1
      product_id: <id>
      headline: <text, e.g., "Best for your hiking criteria">
      key_points:
        - claim: <text>
          source: structured_attribute | review | catalog_description
          source_reference: <field path>
        - claim: <text>
          source: <as above>
          source_reference: <as above>
      personalization_rationale: <if applicable>
      cta: <e.g., "View product", "Add to cart", "Compare">
    - rank: 2
      [...]
  comparison_table:
    columns: [<product_ids>]
    rows:
      - attribute: <name>
        values: [<from catalog graph>]
        source_reference: <path>
  followup_question: <text or null>
  metadata:
    retrieval_id: <for audit>
    generation_model: <id>
    generation_timestamp: <ISO>

What matters in this schema. Every claim carries a source and source_reference. That single requirement is what the verifier uses to reject hallucinated attributes before the response renders. If a claim lacks a source path, the response fails closed.
Verification pass
After generation, every claim and every product reference goes through verification before being shown to the user. A minimal per-claim citation enforcement function:
def verify_response(response: dict, catalog: "Catalog") -> dict:
    """Reject any claim that cannot be traced to a live catalog field.
    Returns the response with offending recommendations removed and a
    verification report attached for tracing."""
    report = {"checked": 0, "dropped": 0, "reasons": []}
    kept = []
    for rec in response["recommendations"]:
        product = catalog.get(rec["product_id"])
        # _in_stock handles both integer counts and "in_stock"/"out_of_stock" flags.
        if product is None or not _in_stock(product):
            report["dropped"] += 1
            report["reasons"].append((rec["product_id"], "missing_or_oos"))
            continue
        ok = True
        for kp in rec["key_points"]:
            report["checked"] += 1
            ref = kp["source_reference"]
            actual = catalog.resolve(product, ref)
            if actual is None or not _claim_matches(kp["claim"], actual):
                report["dropped"] += 1
                report["reasons"].append((rec["product_id"], f"unsourced:{ref}"))
                ok = False
                break
        if ok and _price_fresh(product, max_age_seconds=60):
            kept.append(rec)
    response["recommendations"] = kept
    response["metadata"]["verification"] = report
    return response

The function enforces four checks: product existence with live availability, attribute trace-back through the source reference, claim-to-value match, and price freshness. Any failure drops the recommendation rather than letting a confident hallucination reach the user.
Product existence check. Each cited product_id must exist in the live catalog with current availability and pricing.
Attribute correctness. Each claim about a product must trace to a real catalog field. If "waterproof" is cited but the product's structured_attributes do not list waterproofing, the claim is hallucinated.
Price and inventory currency. The cited price matches the live price, and the cited availability matches the live inventory. A 30-second-old retrieval result with stale price does not get surfaced; the response is regenerated against fresh data.
Review accuracy. If the response cites reviews ("customers love this for X"), the reviews actually exist and actually say that. Fabricated reviews are direct customer harm.
Failed verification triggers either regeneration with refreshed context or graceful degradation, where the response falls back to recommendations the verifier passed. Hallucinated products are filtered before they ever reach the user.
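A sketch of that control flow, built on verify_response above. The regenerate() call and the minimum-recommendation threshold are illustrative placeholders for re-running retrieval and generation against refreshed data:

MIN_RECOMMENDATIONS = 2  # below this, the verified response is too thin to ship

def verified_or_degraded(response: dict, catalog: "Catalog", max_retries: int = 1) -> dict:
    """Regenerate against fresh context when verification guts the response,
    then fall back gracefully to whatever the verifier passed."""
    for attempt in range(max_retries + 1):
        checked = verify_response(response, catalog)
        if len(checked["recommendations"]) >= MIN_RECOMMENDATIONS:
            return checked
        if attempt < max_retries:
            # regenerate() stands in for re-running retrieval + generation with
            # refreshed price/inventory data and the verification report as context.
            response = regenerate(checked)
    # Graceful degradation: ship only what survived, never hallucinated filler.
    return checked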
This is the layer most often skipped or under-built. Without it, the assistant degrades silently as the LLM's confidence varies day to day.
Block hallucinated citations in CI
Respan ships ten built-in evaluators including faithfulness, citation accuracy, and refusal correctness. Wire them into CI-aware experiments so a model swap or prompt edit that increases unsourced claims blocks the deploy before it ships. See the eval workflow at platform.respan.ai.
Conversion tracking in a no-click world
In agentic commerce, traditional click-based attribution fails. The user might see a recommendation in chat, never click, and buy directly. They might add to cart through an agent and the merchant has no idea which assistant interaction drove it. Klaviyo's deployments report that more than half of AI-assisted purchases close without a measurable click on the recommendation surface.
The signals to track, end to end, look like this:
| Stage | Signal | Where it fires | What it tells you |
|---|---|---|---|
| 1 | Query received | Intent classifier span | Demand shape and gap coverage |
| 2 | Recommendation rendered | Server-side surface event | Impression by surface (site, ACP, UCP) |
| 3 | Click or expand | Client event tied to recommendation_id | Surface-level relevance |
| 4 | Add to cart | Cart service webhook | Mid-funnel conversion |
| 5 | Checkout started | Checkout service event | Intent strength |
| 6 | Purchase | Order webhook with attribution match | Realized conversion and AOV |
| 7 | Return or refund | Returns service webhook | Quality signal that updates personalization |
The pattern that works pairs that signal table with five practices:
Server-side tracking. Every recommendation surfaced to a user generates an attributed event with the user, query, and product. When a purchase happens, the conversion is matched against recent recommendations to that user.
Per-surface tagging. Recommendations sourced from different surfaces (your own site assistant, ChatGPT through ACP, Google AI Mode through UCP, browser agents) carry surface tags. Conversion analysis is stratified by surface.
Attribution windows. Recommendation-to-purchase windows for AI shopping are longer than click-to-purchase windows. A user might see a recommendation and buy 24 to 72 hours later. Attribution should accommodate this.
Outcome categorization. Beyond purchase, track add-to-cart, save-for-later, return, and re-purchase. The outcome distribution over a recommendation tells you whether the assistant is genuinely helping or just appearing helpful.
Multi-touch context. A user's path might involve several assistant interactions, agent traffic, direct browsing, and email. Attribution should be multi-touch where possible, not last-touch.
The conversion event model:
recommendation_event:
  recommendation_id: <uuid>
  user_id: <or anonymous_id>
  surface: ai_assistant | acp | ucp | browser_agent | direct
  timestamp: <ISO>
  query: <text>
  recommended_products: [<list>]
  context_signals: [<list>]

conversion_event:
  conversion_id: <uuid>
  user_id: <or anonymous_id>
  product_id: <id>
  conversion_type: cart | purchase | save | return
  timestamp: <ISO>
  attribution:
    matched_recommendations: [<list of recommendation_ids>]
    attribution_window: <hours>
    surface_path: [<list>]

What matters in this schema. surface and surface_path are the fields most teams forget. Without them, conversion is collapsed across surfaces and the team cannot tell whether ACP traffic converts at half the rate of on-site traffic or twice the rate. That answer changes the entire investment thesis on agentic surfaces.
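A sketch of how a purchase gets matched back to recent recommendations under those event shapes. It assumes an event store that can be queried by user and time window; the recommendations() helper is illustrative:

from datetime import datetime, timedelta

ATTRIBUTION_WINDOW_HOURS = 72  # recommendation-to-purchase lags 24-72 hours for AI shopping

def attribute_conversion(conversion: dict, event_store) -> dict:
    """Multi-touch match: every recommendation of this product to this user
    inside the window gets credit, tagged by surface so analysis stays stratified."""
    since = datetime.fromisoformat(conversion["timestamp"]) - timedelta(
        hours=ATTRIBUTION_WINDOW_HOURS
    )
    candidates = event_store.recommendations(user_id=conversion["user_id"], since=since)
    matched = [
        r for r in candidates
        if conversion["product_id"] in r["recommended_products"]
    ]
    matched.sort(key=lambda r: r["timestamp"])
    conversion["attribution"] = {
        "matched_recommendations": [r["recommendation_id"] for r in matched],
        "attribution_window": ATTRIBUTION_WINDOW_HOURS,
        "surface_path": [r["surface"] for r in matched],
    }
    return conversion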
Personalization signal management
Personalization that works is grounded in real signals. Several signal types matter:
| Signal | Source | Use |
|---|---|---|
| Declared preferences | User profile, onboarding | Hard filters and ranking boosts |
| Purchase history | Order management | Style, brand, category preference inference |
| Browse history | Site analytics | Recent interest signals |
| Conversational context | Current and prior chat sessions | Real-time intent refinement |
| Cart and wishlist | Site state | Active consideration signals |
| Reviews and feedback | Product feedback | Quality preferences |
| Returns | Return system | Negative signals (avoid future similar items) |
| Engagement patterns | Email, push, ad clicks | Channel preferences and inferred interests |
The risk: signals that proxy for protected demographics (zip code as wealth proxy, browsing patterns as race proxy) introduce disparate impact. The mitigation is the same as covered in the HR cluster: continuous monitoring of recommendation distributions across demographic dimensions, with alerts on disparate impact thresholds.
The opposite risk: ignoring signals and producing generic recommendations that do not move conversion. The balance comes from explicit user-controllable personalization (the user can see which signals are influencing their experience and can disable them) plus careful disparate impact monitoring.
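A minimal sketch of the monitoring side, assuming recommendation logs have already been joined to the product's price tier and to a demographic dimension you are permitted to measure against. The four-fifths threshold is a common convention, not a legal determination, and every name here is illustrative:

from collections import Counter

DISPARATE_IMPACT_THRESHOLD = 0.8  # four-fifths convention; set real thresholds with counsel

def exposure_rate_by_group(events: list[dict], group_key: str, price_tier: str) -> dict:
    """Share of recommendation events in each group that surfaced the given price tier."""
    shown = Counter(e[group_key] for e in events if e["price_tier"] == price_tier)
    total = Counter(e[group_key] for e in events)
    return {g: shown[g] / total[g] for g in total if total[g] > 0}

def disparate_impact_alerts(events: list[dict], group_key: str, price_tier: str) -> list[str]:
    """Alert when any group's exposure rate falls below the threshold
    relative to the most-exposed group."""
    rates = exposure_rate_by_group(events, group_key, price_tier)
    if not rates:
        return []
    top = max(rates.values())
    return [
        f"{group}: exposure {rate:.2f} vs max {top:.2f}"
        for group, rate in rates.items()
        if top > 0 and rate / top < DISPARATE_IMPACT_THRESHOLD
    ]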
Cache the expensive calls
Catalog enrichment and intent extraction calls fire on nearly every conversation. The Respan gateway provides 500+ models behind an OpenAI-compatible interface with semantic caching, fallback chains, and per-customer spending caps so per-conversation cost stays predictable as ACP and UCP traffic compounds. Switch your base URL to start at platform.respan.ai.
Where engineering teams typically miss
Patterns that show up across deployed shopping assistants in 2025 and 2026.
Catalog graph is shallow. Products from different merchants have inconsistent attributes; the LLM compensates by generating attributes that may not be true. Retrieval and generation both suffer.
No verification pass. Hallucinated product references reach the user. Some are caught, many are not. Trust erodes silently.
Conversion tracking is click-based. Server-side tracking is missing or per-surface tagging is missing. The team thinks the assistant has X% conversion, but the real number is much higher or much lower.
Personalization without disparate impact monitoring. Recommendations skew along demographic lines, the team does not know because they are not measuring, and the regulatory exposure is real (Colorado AI Act covers consumer-facing AI personalization in some interpretations).
Top-K filtering hides the long tail. The assistant always shows the same handful of products because the retrieval scores cluster around a few "popular" items. Diversity-aware reranking is missing, and the long tail of legitimately relevant products is invisible.
No conversation memory. Each turn is independent and the assistant forgets what was just shown. Multi-turn shopping conversations break.
Stale catalog data. The catalog graph syncs nightly from the warehouse system. Recommendations cite prices that are wrong. Cart shows different prices than chat.
No evaluation discipline. The team launched the assistant and is iterating on user feedback, not on a structured eval set. Drift goes undetected.
Build order
Shopping assistants fail when teams build the LLM call before the catalog graph. The dependencies are real: each layer below assumes the one above is solid, and the eval gate decides when you move on.
| Order | What you build | Eval gate before moving on |
|---|---|---|
| 1 | Catalog graph: canonical schema, taxonomy mapping, LLM-based attribute extraction, sub-minute price and inventory sync | 1,000-product sample with at least 95% attribute completeness, less than 1% stale-price rate, 100% canonical category coverage |
| 2 | Hybrid retrieval: lexical plus semantic, attribute filters from intent classification, diversity-aware reranking | 100-query gold set with recall at 20 above 0.85, top-3 attribute-match precision above 0.9, no near-duplicate clusters in surfaced set |
| 3 | Structured generation with citations: per-product rationale, comparison tables sourced from catalog fields, followup logic | Faithfulness above 0.95 and citation-accuracy above 0.95 on the 100-query set, zero unsourced attribute claims in 50 sampled responses |
| 4 | Verification pass: product-existence check, attribute trace-back, price and inventory currency, review-quote validation | Hallucinated-product rate below 0.1% on shadow traffic, 100% of failed verifications either regenerate or degrade gracefully |
| 5 | Multi-turn state and conversion tracking: conversation memory, server-side recommendation events, per-surface tags, multi-touch attribution windows | Multi-turn coherence above 0.9 on a 50-conversation eval set, conversion attribution reconciles within 2% of merchant order data across surfaces |
| 6 | Signal-grounded personalization with fairness monitoring: declared preferences, purchase and browse history, disparate impact dashboards | Recommendation distributions within fairness thresholds across protected dimensions, recommendation-to-cart lift above baseline holdout |
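Most of those gates are scriptable. A sketch of the stage-2 retrieval gate over a gold set, wired to fail CI when recall regresses; the gold-set format and threshold mirror the table, while the harness itself is illustrative:

import json
import sys

RECALL_AT_20_THRESHOLD = 0.85  # stage-2 gate from the table above

def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 20) -> float:
    if not relevant_ids:
        return 1.0
    return len(set(retrieved_ids[:k]) & relevant_ids) / len(relevant_ids)

def run_gate(gold_path: str) -> None:
    """Each gold record: {"query": ..., "extracted_attrs": ..., "relevant_product_ids": [...]}."""
    with open(gold_path) as f:
        gold = [json.loads(line) for line in f]
    scores = []
    for case in gold:
        candidates = hybrid_retrieve(case["query"], case["extracted_attrs"], user_signals={})
        scores.append(recall_at_k([c["product_id"] for c in candidates],
                                  set(case["relevant_product_ids"])))
    mean_recall = sum(scores) / len(scores)
    print(f"recall@20 = {mean_recall:.3f} over {len(scores)} queries")
    if mean_recall < RECALL_AT_20_THRESHOLD:
        sys.exit(1)  # block the deploy; the retrieval change regressed the gold set

if __name__ == "__main__":
    run_gate(sys.argv[1])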
After step 6, A/B testing infrastructure, adversarial robustness suites, and ACP/UCP cross-surface integration become tractable. Skipping ahead, especially shipping generation before the catalog graph and verification, produces an assistant that hallucinates with confidence and erodes trust silently while the team chases user feedback instead of structural fixes.
What separates serious shopping assistants from chatbots
After watching the category through 2025 and 2026:
The catalog graph is the moat. Standardized, attribute-rich, real-time, scaled. Vendors that built this win the relevance battle.
Verification is non-negotiable. Hallucinated products and prices are existential trust failures. Verification has to run on every response, not just sampled.
Conversion tracking is server-side and per-surface. Click-based attribution is wrong, and surface attribution captures where the value is being created.
Personalization is signal-grounded and auditable. Real signals, user-controllable, monitored for fairness.
Multi-turn conversations work. State persists, followups make sense, and the assistant feels like one entity across turns.
Eval discipline is continuous. The eval set evolves, production failures feed back, and regressions catch before deployment.
These are the engineering practices that produce assistants users return to. Without them, the assistant becomes a clever demo that gets bypassed in favor of search.
How Respan fits
Building an AI shopping assistant means orchestrating catalog retrieval, structured generation, verification, and conversion tracking across millions of products and conversations. Respan is the substrate underneath: tracing, evals, gateway, prompts, and monitoring for the entire shopping pipeline.
- Tracing: every shopping conversation captured as one connected trace, from intent classification through hybrid retrieval, attribute filtering, generation, and verification. Auto-instrumented for LangChain, LlamaIndex, Vercel AI SDK, CrewAI, AutoGen, OpenAI Agents SDK. When a recommendation hallucinates a product or cites a stale price, you can replay the exact retrieval candidates, personalization signals, and verification path that produced it.
- Evals: ten built-in evaluators (faithfulness, citation accuracy, refusal correctness, harmfulness) plus LLM-as-judge and custom Python evaluators. Production traffic flows directly into datasets. CI-aware experiments block regressions on hallucinated product references, mismatched attributes, stale prices, and lost conversation continuity before deploys ship.
- Gateway: 500+ models behind an OpenAI-compatible interface, semantic caching, fallback chains, per-customer spending caps. Caching repeated catalog enrichment and intent extraction calls keeps per-conversation cost predictable as Sidekick-scale traffic compounds across surfaces (your site, ACP, UCP, browser agents).
- Prompt management: versioned registry, dev/staging/prod environments with approval workflows, A/B testing in production with one-click rollback. Intent classifiers, attribute extractors, generation templates with citation requirements, comparison-table builders, and followup-question prompts all belong in the registry so merchandising and engineering can iterate without redeploys.
- Monitors and alerts: hallucinated-product rate, citation-accuracy rate, verification-failure rate, recommendation-to-cart conversion by surface, disparate impact on recommendation distributions. Slack, email, PagerDuty, webhook. Stale catalog data and silent personalization drift are caught before they erode trust.
A reasonable starter loop for AI shopping assistant builders:
- Instrument every LLM call with Respan tracing including intent classification, retrieval, generation, and verification spans.
- Pull 200 to 500 production shopping conversations into a dataset and label them for catalog accuracy, citation correctness, and conversion outcome.
- Wire two or three evaluators that catch the failure modes you most fear (hallucinated product references, stale price or inventory citations, recommendation distributions that skew along demographic lines).
- Put your intent classifier, generation template, and comparison-builder prompts behind the registry so you can version, A/B, and roll back without a deploy.
- Route through the gateway so catalog enrichment and intent extraction calls cache cleanly and fallback chains absorb model outages without breaking the shopping flow.
Without this loop, the assistant degrades silently as the LLM's confidence varies day to day, hallucinated products reach users, and conversion attribution stays guesswork in a no-click world.
To wire any of the patterns above on Respan, start tracing for free, read the docs, or talk to us.
Related reading
- Building for the Agentic Commerce Era: protocol layer and external surfaces
- Evaluating LLM-Powered Product Search: the relevance and personalization layer
- LLM Customer Service in E-commerce: adjacent application with similar failure modes
- How E-commerce Teams Build LLM Apps in 2026: pillar overview
