Blog

How to Evaluate AI Agents in Production (Not Just Benchmarks)

Every 'AI agent evaluation' article is benchmark-focused. Production agent evaluation is different. Five eval criteria that catch the failures benchmarks miss, with the methodology to wire them into live traffic.

Frank Chen · May 21, 2026

Engineering

Prompt Versioning Without Evals Is Just Diff Tracking (2026)

Prompt versioning is solved. Closed-loop iteration is not. The 4 gaps in 2026 prompt management stacks, a side-by-side comparison of Respan, LangSmith, Langfuse, PromptLayer, Braintrust, Humanloop, Helicone, and Agenta, and what the full loop actually looks like.

Frank Chen · May 19, 2026

Product

Single Agent vs Multi-Agent: Why We Rebuilt Our AI Agent

Single agent vs multi-agent (router pattern): when each architecture wins, the regression net we used to measure the rebuild, and the production data.

Marcus Huang · May 5, 2026

Guide

Portkey was just acquired by Palo Alto Networks. Here's where to migrate.

Palo Alto Networks acquired Portkey on April 30, 2026. Portkey will become the AI Gateway for Prisma AIRS. Compare the best independent Portkey alternatives including Respan, LiteLLM, OpenRouter, Vercel AI Gateway, and Cloudflare AI Gateway.

Respan · April 30, 2026

Model

Claude 3.5 Haiku vs. Sonnet: speed or power? A comprehensive comparison

Hendrix Liu · February 18, 2026

Model

GPT-5 mini vs Gemini 3 Flash Preview vs Claude 4.5 Haiku

Frank Chen · February 17, 2026