LLM monitoring is the part of running an LLM application that nobody documents until something breaks. Engineers ship a feature, watch it work in a few demos, and then have no idea whether quality is drifting until a customer support ticket lands. The fix is monitoring built around the right metrics, not the metrics that are easy to compute. This piece is the operational checklist we recommend after watching teams ship LLM features to production through Respan.
If you have read our LLM observability pillar, monitoring is the layer on top. Observability captures the data. Monitoring decides what to look at and when to wake someone up.
TL;DR
- Seven metrics matter: per-call latency, per-call cost, per-session cost, cache hit rate, faithfulness (or your domain quality metric), error rate, traffic volume.
- Five alerts worth setting: latency p99 spike, cost-per-session 3x outlier, error rate over 1%, faithfulness 7-day rolling drop, traffic anomaly.
- Ignore: vanity averages (mean latency hides p99), unsegmented dashboards (treat your high-stakes traffic separately), and alerts on absolute thresholds without rolling-window context.
- The biggest single change: histograms instead of averages. p50 plus p99 tells you what a normal user sees and what the long tail does. The average tells you neither.
- Monitoring becomes useful the moment you can link an alert to a specific failing trace in under 30 seconds.
What to monitor (the 7 metrics that matter)
1. Per-call latency
Track p50, p95, p99. The mean is a lie. A model that takes 800ms 99% of the time and 30 seconds the rest is not a "1.1 second average" application. It is an application that times out for 1 in 100 users.
Break the latency budget into stages and track each one separately: time-to-first-token, total response time, and tool-call latency. Each bottleneck has a different fix.
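To make the arithmetic concrete, here is a minimal sketch that compares the mean with nearest-rank percentiles over synthetic latencies. The `percentile` helper and the sample data are illustrative assumptions, not part of any particular monitoring library. The slow tail is set to just over 1% of calls so that p99 lands unambiguously inside it.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Synthetic data (assumption): 989 calls at 0.8 s and 11 calls at 30 s.
latencies_s = [0.8] * 989 + [30.0] * 11

mean_s = sum(latencies_s) / len(latencies_s)
summary = {p: percentile(latencies_s, p) for p in (50, 95, 99)}

print(f"mean={mean_s:.2f}s p50={summary[50]}s p95={summary[95]}s p99={summary[99]}s")
```

In this sample the mean lands near 1.1 seconds, a latency that no individual call actually experienced, while p50 and p99 report the two behaviors users really see.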
2. Per-call cost
Cost per LLM call. Track in dollars (not just tokens) because pricing changes silently. With provider prompt caching enabled, the cache_read vs cache_write split is the second number worth tracking. See LLM cache layers for what each cache layer should be doing to this number.
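A minimal sketch of converting token usage into dollars, with the cache read and cache write portions kept separate. The `Pricing` and `Usage` types and every price below are hypothetical placeholders; substitute the current rates from your provider, and note that providers report cached tokens under different field names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Pricing:
    """USD per one million tokens. The values used below are placeholders, not real prices."""
    input_per_m: float
    cache_write_per_m: float
    cache_read_per_m: float
    output_per_m: float

@dataclass(frozen=True)
class Usage:
    input_tokens: int        # uncached input tokens
    cache_write_tokens: int  # input tokens written to the provider cache
    cache_read_tokens: int   # input tokens served from the provider cache
    output_tokens: int

def call_cost_usd(usage: Usage, pricing: Pricing) -> float:
    """Dollar cost of a single call, pricing each token class at its own rate."""
    return (
        usage.input_tokens * pricing.input_per_m
        + usage.cache_write_tokens * pricing.cache_write_per_m
        + usage.cache_read_tokens * pricing.cache_read_per_m
        + usage.output_tokens * pricing.output_per_m
    ) / 1_000_000

def cache_read_share(usage: Usage) -> float:
    """Fraction of all input tokens that were served from the cache."""
    total_input = usage.input_tokens + usage.cache_write_tokens + usage.cache_read_tokens
    return usage.cache_read_tokens / total_input if total_input else 0.0

# Hypothetical prices and a hypothetical call with a mostly cached prompt.
pricing = Pricing(input_per_m=1.00, cache_write_per_m=1.25, cache_read_per_m=0.10, output_per_m=5.00)
usage = Usage(input_tokens=2_000, cache_write_tokens=0, cache_read_tokens=8_000, output_tokens=500)

cost = call_cost_usd(usage, pricing)
share = cache_read_share(usage)
print(f"cost=${cost:.4f} cache_read_share={share:.0%}")
```

Keeping the price table as data rather than constants scattered through the code makes a provider price change a one-line update, and the dollar series stays comparable over time.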
3. Per-session cost
This is the one nobody computes early enough. A single user request that produces one LLM call is cheap. A 14-step agent loop that calls the model on every iteration is not. Per-session cost tells you when an agent has gone recursive or when context windows are bloating across turns.
Set up a per-session cost histogram (p50, p95, p99) per feature. The day the p99 jumps from $0.20 to $4.00, something just changed in your agent loop and you want to know before the bill arrives.
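One way to build that histogram is to sum call costs per session and then take percentiles per feature. This is a sketch under assumptions: the `(feature, session_id, cost_usd)` record shape, the feature name, and the synthetic data are all hypothetical, and the percentile uses the nearest-rank method.

```python
import math
from collections import defaultdict

def percentile(values, pct):
    """Nearest-rank percentile over a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def session_cost_percentiles(calls, pcts=(50, 95, 99)):
    """calls: iterable of (feature, session_id, cost_usd) tuples, one per LLM call."""
    per_session = defaultdict(float)
    for feature, session_id, cost_usd in calls:
        per_session[(feature, session_id)] += cost_usd

    per_feature = defaultdict(list)
    for (feature, _session_id), total in per_session.items():
        per_feature[feature].append(total)

    return {
        feature: {p: percentile(totals, p) for p in pcts}
        for feature, totals in per_feature.items()
    }

# Synthetic data (assumption): 98 normal sessions of 10 calls each,
# plus 2 sessions stuck in a 200-call loop, all at $0.02 per call.
calls = []
for i in range(98):
    calls += [("support_agent", f"s{i}", 0.02)] * 10
for name in ("loop-a", "loop-b"):
    calls += [("support_agent", name, 0.02)] * 200

stats = session_cost_percentiles(calls)["support_agent"]
print({p: round(v, 2) for p, v in stats.items()})
```

Per-call cost is identical for every call in this data, so a per-call chart would look flat. Only the per-session view shows the tail moving from about $0.20 to about $4.00.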
4. Cache hit rate
Per cache layer. Provider prompt cache, exact-match cache, semantic cache (if you have one). Hit rate below 30% means a layer is failing to do its job and is either misconfigured or wrong for the workload. We covered the operational diagnostics in LLM cache layers.
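A small sketch of computing hit rate per layer and flagging any layer under the 30% floor. The layer names and the `(layer, hit)` event shape are assumptions for illustration; in practice these would come from span attributes or counters in your own telemetry.

```python
from collections import defaultdict

MIN_HIT_RATE = 0.30  # the 30% floor discussed above

def hit_rates(events):
    """events: iterable of (layer, hit) pairs, one per cache lookup."""
    lookups = defaultdict(int)
    hits = defaultdict(int)
    for layer, hit in events:
        lookups[layer] += 1
        hits[layer] += int(bool(hit))
    return {layer: hits[layer] / lookups[layer] for layer in lookups}

def underperforming(rates, floor=MIN_HIT_RATE):
    """Layers whose hit rate is below the floor, sorted by name."""
    return sorted(layer for layer, rate in rates.items() if rate < floor)

# Synthetic lookups (assumption): 100 per layer.
events = (
    [("provider_prompt", True)] * 70 + [("provider_prompt", False)] * 30
    + [("exact_match", True)] * 10 + [("exact_match", False)] * 90
    + [("semantic", True)] * 40 + [("semantic", False)] * 60
)

rates = hit_rates(events)
flagged = underperforming(rates)
print(rates, flagged)
```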
5. Faithfulness (or your domain quality metric)
The single quality metric that matters most for your application. For RAG, faithfulness or citation accuracy. For classification, accuracy on a sampled gold set. For chat, helpfulness or refusal rate. Run it on a sampled fraction of production traffic via an LLM judge (online evaluation pattern, see RAG observability for the loop).
Without a quality metric, you cannot tell when the model gets worse. With one, drift becomes visible in a trend line before users complain.
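Sampling for the judge can be made deterministic by hashing the trace ID, so the same trace is always either in or out of the evaluated set regardless of which worker sees it. The sketch below assumes string trace IDs and a hypothetical `should_evaluate` helper; the judge call itself is out of scope here. Errored calls bypass the sampling rate, matching the guidance in the FAQ.

```python
import hashlib

BUCKETS = 10_000  # sampling resolution of 0.01%

def should_evaluate(trace_id: str, sample_rate: float, errored: bool = False) -> bool:
    """Deterministically select a stable fraction of traces for the LLM judge."""
    if errored:
        return True  # errored calls are always sent to the judge
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % BUCKETS
    return bucket < round(sample_rate * BUCKETS)

# Synthetic trace IDs (assumption) sampled at 5%.
trace_ids = [f"trace-{i}" for i in range(10_000)]
sampled = [t for t in trace_ids if should_evaluate(t, sample_rate=0.05)]
fraction = len(sampled) / len(trace_ids)
print(f"sampled {fraction:.1%} of traces")
```

Hash-based selection, rather than a random draw per request, also means a trace can be re-judged later with a different judge and compared like for like.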
6. Error rate
Provider errors (429, 500, 529), internal errors (timeouts, validation failures), and "logical" errors (the model returned malformed JSON when you asked for structured output). Each category needs a different fix, and each one is invisible if you bucket them all into "500".
Anything above 1% error rate, sustained, deserves attention. Anything above 5% is an incident.
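The three categories and the two thresholds can be sketched as a classifier over call records. The record fields (`status`, `timed_out`, `validation_failed`, `expects_json`, `body`) are hypothetical; map them to whatever your traces actually carry. This sketch looks at a single batch and does not model the "sustained" part, which needs a time window.

```python
import json
from collections import Counter

PROVIDER_STATUSES = {429, 500, 529}

def classify(call):
    """Return an error category for one call record (a dict), or None if it succeeded."""
    if call.get("status") in PROVIDER_STATUSES:
        return "provider"
    if call.get("timed_out") or call.get("validation_failed"):
        return "internal"
    if call.get("expects_json"):
        try:
            json.loads(call.get("body"))
        except (TypeError, ValueError):  # missing body or malformed JSON
            return "logical"
    return None

def error_report(calls):
    """Per-category error rate, overall rate, and a severity level for one batch."""
    calls = list(calls)
    if not calls:
        return {}, 0.0, "ok"
    counts = Counter(c for c in map(classify, calls) if c is not None)
    by_category = {category: n / len(calls) for category, n in counts.items()}
    overall = sum(counts.values()) / len(calls)
    if overall > 0.05:
        level = "incident"
    elif overall > 0.01:
        level = "attention"
    else:
        level = "ok"
    return by_category, overall, level

# Synthetic batch (assumption): 96 healthy calls and 4 failures of three kinds.
healthy = {"status": 200, "expects_json": True, "body": '{"answer": "ok"}'}
calls = (
    [healthy] * 96
    + [{"status": 429}] * 2
    + [{"status": 200, "timed_out": True}]
    + [{"status": 200, "expects_json": True, "body": '{"answer": '}]
)

by_category, overall, level = error_report(calls)
print(by_category, overall, level)
```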
7. Traffic volume
Calls per minute, broken down by feature or surface. The dullest metric but it is how you detect the new feature that suddenly tripled your LLM bill or the bot traffic that started hitting your public endpoint last night.
What to ignore
Three patterns we see teams burn weeks on for no usable signal.
Vanity averages. Mean latency, mean cost, mean tokens-per-call. They all hide the long tail. The user complaining about your product is sitting on the p99, not the mean. Compute p50 + p99 and look at both.
Single-dashboard everything. A dashboard that mixes your high-stakes regulated traffic with casual chat looks like a healthy average when one segment is on fire. Always segment by surface or intent class.
Absolute-threshold alerts without context. "Alert when latency goes above 2 seconds" sounds reasonable until 2 seconds is normal during peak hours and the alert fires every afternoon. What you want are rolling-window alerts that account for seasonality. "Alert when p99 is 3x the 7-day rolling average" is a useful alert. "Alert above 2 seconds" is not.
The five alerts worth setting up
Five alerts cover roughly 90% of the cases where you want to be woken up.
Alert 1: Latency p99 spike
Fire when latency p99 is 3x its 7-day rolling average for at least 5 minutes. Catches provider degradation, your own context-window bloat, and tool calls that started timing out upstream.
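The rule can be sketched as a pure function over two series. The function name, the per-minute granularity, and the sample values are assumptions; a real alerting backend would evaluate the same condition in its own query language.

```python
from statistics import fmean

def p99_spike(recent_p99_s, history_p99_s, ratio=3.0, sustain=5):
    """Return True when the alert should fire.

    recent_p99_s:  per-minute p99 latency values, newest last.
    history_p99_s: per-minute p99 values from the trailing 7 days (the baseline).
    """
    if len(recent_p99_s) < sustain or not history_p99_s:
        return False
    threshold = ratio * fmean(history_p99_s)
    # Every one of the last `sustain` minutes must be at or above the threshold.
    return all(value >= threshold for value in recent_p99_s[-sustain:])

# Synthetic baseline (assumption): p99 normally sits around 2 seconds.
history = [2.0] * 1000

sustained = p99_spike([2.1, 6.5, 7.0, 6.2, 6.8, 7.4], history)
single_blip = p99_spike([2.1, 2.3, 9.0, 2.2, 2.0], history)
print(sustained, single_blip)
```

Requiring every minute in the window to breach the threshold is what keeps a single slow minute from paging anyone. With this baseline, a fixed "above 2 seconds" rule would fire on ordinary traffic.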
Alert 2: Cost-per-session 3x outlier
Fire when per-session cost p99 is 3x its 7-day rolling average. This is the agent-loop-gone-recursive detector. Often the first signal that someone shipped a prompt or workflow change with a loop bug.
Alert 3: Error rate over 1%
Sustained 1% error rate over 10 minutes is the threshold most teams care about. Page someone. Break out by error category in the alert (provider 5xx vs internal timeout vs malformed-JSON).
Alert 4: Faithfulness 7-day rolling drop
Fire when the rolling 7-day average of your quality metric drops below your floor (we use 0.90 for general chat, 0.96 for any regulated workload). This is the silent-quality-regression detector. Without it, prompt changes ship and damage quality for weeks before someone notices.
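A minimal sketch of the rolling check, assuming one aggregated judge score per day. The class name is hypothetical, and averaging daily means (rather than all raw samples in the window) is a simplification that weights quiet and busy days equally.

```python
from collections import deque
from statistics import fmean

class RollingQualityFloor:
    """Tracks a rolling window of daily quality scores against a fixed floor."""

    def __init__(self, floor, window_days=7):
        self.floor = floor
        self.scores = deque(maxlen=window_days)

    def add_day(self, daily_score):
        """Record one day's mean judge score; return True if the alert should fire."""
        self.scores.append(daily_score)
        window_full = len(self.scores) == self.scores.maxlen
        return window_full and fmean(self.scores) < self.floor

# Synthetic scores (assumption): a week at 0.95, then a regression to 0.80.
monitor = RollingQualityFloor(floor=0.90)
fired = [monitor.add_day(score) for score in [0.95] * 7 + [0.80] * 3]
print(fired)
```

In this example the alert fires on the third bad day, not the first. A 7-day window trades detection speed for stability, which is worth knowing when choosing the floor.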
Alert 5: Traffic anomaly
Fire when calls per minute drop or spike by 50% or more relative to baseline. Drops catch outages and broken integrations. Spikes catch bot traffic and overnight viral moments.
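A sketch of the drop-or-spike check. How the baseline is built is an assumption here: the example takes the median of readings from the same hour in previous weeks, which is one simple way to respect daily and weekly seasonality.

```python
from statistics import median

def traffic_anomaly(current_cpm, baseline_samples, tolerance=0.5):
    """Return "drop", "spike", or None for the current calls-per-minute reading.

    baseline_samples: calls-per-minute readings from comparable periods,
    for example the same hour of the week in previous weeks (assumption).
    """
    if not baseline_samples:
        return None  # no baseline yet, so nothing to compare against
    baseline = median(baseline_samples)
    if baseline <= 0:
        return "spike" if current_cpm > 0 else None
    change = (current_cpm - baseline) / baseline
    if change <= -tolerance:
        return "drop"
    if change >= tolerance:
        return "spike"
    return None

# Synthetic baseline (assumption): about 100 calls per minute at this hour.
same_hour_previous_weeks = [100, 110, 90]

print(traffic_anomaly(40, same_hour_previous_weeks))
print(traffic_anomaly(160, same_hour_previous_weeks))
print(traffic_anomaly(120, same_hour_previous_weeks))
```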
For each alert, the rule of thumb: if the alert fires, can someone find the bad traces in under 30 seconds? If not, the alert is not connected to your observability and is useless. See agent debugging for the trace-tree pattern that makes drill-down work.
Dashboards we recommend
Three dashboards cover the operational picture for most teams.
Real-time ops. Calls per minute, p50/p99 latency, error rate, current spend rate. The "is anything on fire right now" view. Refresh every 15 seconds.
Cost. Per-feature spend, per-customer spend (if you bill or just want to know), cache hit rate per layer, token mix (cached vs uncached input). The "where is the money going" view.
Quality. Faithfulness or your domain quality metric over time, segmented by feature and customer tier. Eval score distribution. Sample-grade audit results vs LLM-judge results. The "are users getting good answers" view.
Build the quality dashboard first. It is the dashboard you check on Monday morning before doing anything else. The other two are diagnostic.
Where this fits in the broader stack
Monitoring sits on top of observability. Observability captures the spans, traces, attributes, and scores. Monitoring decides what is normal, what is not, and what wakes someone up. The two pieces share the same data model.
- LLM observability is the broader picture of what to capture.
- LLM tracing is the data model that feeds monitoring.
- RAG observability is the RAG-specific cut, including the dashboards.
- Agent debugging is the workflow once monitoring fires an alert.
- LLM cache layers is what to monitor on the cache side.
Respan ships all of this in one platform with the alerts integrated into the same data model. The monitors and notifications feature handles the alert rules. Self-built monitoring works fine too, especially on top of OpenTelemetry plus Grafana.
Common gotchas
Mistakes we see, ranked by how often they happen.
- No quality metric. You will not know when the model gets worse. Pick a metric, even an imperfect one, and start tracking.
- Alerts on absolute thresholds. Fire constantly during expected peaks, get muted, miss the real incidents.
- One mega-dashboard. Hides segment-specific problems.
- Monitoring without observability. Alert fires, nobody can find the bad traces, alert gets ignored.
- Per-call cost without per-session cost. Catches the cheap-call regressions, misses the agent-loop-gone-bad regressions.
- No alerting on prompt-version changes. Set a hard-threshold alert that fires the first day any new prompt version goes to production. It catches regressions before they age.
- Forgetting traffic volume. It is the metric most useful for catching the things that are supposed to be small but are not.
FAQ
What is the difference between LLM monitoring and LLM observability? Observability is the data you capture (traces, spans, metrics, eval scores). Monitoring is what you decide to look at and alert on. Same underlying data, different operational layers.
How often should I sample online evals? 1 to 5 percent of production traffic for high-volume apps. Up to 100 percent for errored calls and for the first 90 days of a new feature. The cost of judges is real if you run them on everything.
What is the cheapest judge to use for online evals? Claude Haiku 4.5 or GPT-5-mini are both reasonable for the qualitative metrics. Run a calibration where you compare 50 samples against a stronger judge, check correlation, and use the cheap one if it tracks closely.
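The calibration step can be sketched as a Pearson correlation between the two judges' scores on the same samples. The 0.8 cutoff for "tracks closely" is an assumption, not a recommendation from this piece, and the four-sample lists below are only there to show the call shape. Use the full 50 in practice.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("scores must vary for correlation to be defined")
    return cov / math.sqrt(var_x * var_y)

def cheap_judge_is_usable(cheap_scores, strong_scores, min_correlation=0.8):
    """True when the cheap judge tracks the strong judge closely enough (cutoff is an assumption)."""
    return pearson(cheap_scores, strong_scores) >= min_correlation

# Illustrative scores only (assumption); same samples scored by both judges.
strong = [0.95, 0.10, 0.80, 0.50]
agrees = [0.90, 0.20, 0.70, 0.40]
disagrees = [0.20, 0.90, 0.40, 0.70]

print(cheap_judge_is_usable(agrees, strong), cheap_judge_is_usable(disagrees, strong))
```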
Should I alert on average latency or p99? p99. The average will not catch the tail that hurts your users.
How do I monitor an agent loop specifically? Loop count per session, tokens per session, and the pattern-specific signals from agent workflow patterns. Histograms, not averages.
Do I need Prometheus and Grafana for this, or is OpenTelemetry enough? OpenTelemetry plus any backend that supports OTLP works. Grafana is a reasonable visualization layer. Respan or similar managed platforms collapse the stack into one product. Pick by what your team will actually use.
What's the single highest-ROI monitoring change for a team that has none? Per-session cost histograms. They surface agent loops gone bad, prompt bloat, and context-window inflation in one chart. Pair with a p99 latency chart and you have caught most of what hurts users and the bill.