An honest look at the state of OpenTelemetry semantic conventions for LLM applications, the specific challenges of tracing CLI-based AI tools, and practical workarounds that hold up in production.
Hendrix Liu · 3 days ago

A trace-level comparison of Claude Code and OpenAI Codex CLI: how they plan, call tools, recover from errors, and spend tokens, and what that means for choosing between them.
Frank Chen · 4 days ago
A deep dive into the systematic biases in LLM-as-a-judge evaluation, including self-preference, verbosity, authority, agreeableness, and position bias, and how to design experiments that produce conclusions you can actually trust.
Frank Chen · 5 days ago
A catalog of 25 LLM failure modes in production: loud failures you can catch, silent failures that hide in plain sight, and slow failures that only reveal themselves in hindsight.
Frank Chen · 6 days ago