2 posts tagged with "benchmarking"

The CI Host Whose CPU Governor Decided Your Agent Benchmark's Outcome

9 min read
Tian Pan
Software Engineer

A team I worked with spent three days hunting a 22% latency regression in their agent loop. They blamed a new tool router. They blamed a switched model version. They blamed the JSON schema validator they had quietly upgraded the week before. They eventually found the culprit two layers below their code: a runner image had rolled forward, the new image defaulted the cpufreq governor to schedutil instead of performance, and the burstiness of an agent's tool-call loop made schedutil's ramp-up latency visible in p95. The model was fine. The agent was fine. The kernel changed its mind about how to clock the CPU between micro-bursts of work, and the entire benchmark moved.

This is the failure mode most agent teams never see, because they never look. Your CI benchmark numbers are not measurements of the model or the agent. They are measurements of a stack that happens to include a model, a network, a shared VM, a hypervisor scheduler, a cache hierarchy with unknown neighbors, and — most quietly — a frequency-scaling policy that gets to decide whether a given millisecond of compute runs at 1.0 GHz or 3.6 GHz.
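One way to make that frequency-scaling policy visible is to record it next to every benchmark result. The sketch below is a minimal illustration, not the team's actual tooling: it assumes a Linux runner that exposes the standard cpufreq sysfs files (`/sys/devices/system/cpu/cpuN/cpufreq/scaling_governor`), and the helper names `read_governors` and `cpus_off_policy` are hypothetical. Many virtualized CI hosts expose no cpufreq directory at all, in which case the result is simply empty.

```python
from pathlib import Path

# Standard Linux sysfs location for per-CPU frequency-scaling state.
# Assumption: the runner is Linux and exposes cpufreq; many VMs do not.
CPUFREQ_ROOT = Path("/sys/devices/system/cpu")


def read_governors(root=CPUFREQ_ROOT):
    """Return {cpu_name: governor} for every CPU that exposes cpufreq.

    Returns an empty dict when no scaling_governor files are readable.
    """
    governors = {}
    for gov_file in sorted(root.glob("cpu[0-9]*/cpufreq/scaling_governor")):
        try:
            # gov_file is .../cpuN/cpufreq/scaling_governor, so two parents up is cpuN.
            governors[gov_file.parent.parent.name] = gov_file.read_text().strip()
        except OSError:
            # Unreadable entry (permissions, CPU went offline): skip it.
            continue
    return governors


def cpus_off_policy(governors, expected="performance"):
    """List the CPUs whose governor differs from the one the benchmark expects."""
    return sorted(cpu for cpu, gov in governors.items() if gov != expected)


if __name__ == "__main__":
    found = read_governors()
    if not found:
        print("no cpufreq governors visible on this host")
    else:
        print("governors:", found)
        print("not on expected governor:", cpus_off_policy(found))
```

Storing the output with each benchmark run means a change like `performance` to `schedutil` shows up as a difference in the run metadata rather than as an unexplained shift in p95.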

The Eval-Rig Latency Lie: Why Your p95 Doubles in Production

10 min read
Tian Pan
Software Engineer

The eval team puts a number on the deck: "p95 latency is 1.2s." The launch ships. A week later, oncall posts a graph: production p95 is 4.8s and climbing through the dinner-time peak. Engineers spend the next five days arguing about whether something regressed, instrumenting model versions, opening tickets with the provider — and eventually discover that nothing changed except where the number was measured. The eval rig was reporting the latency of a quiet machine running serial calls against a warm cache. Production is a different system. The p95 was never wrong; it was answering a different question.

This is the eval-rig latency lie. It is not about bad benchmarks — most teams use reasonable tools and report the numbers honestly. It is about the gap between "the latency of the model" and "the latency a user experiences," and the fact that the rig you build for development almost always measures the first while implying the second. Once you internalize this, latency SLOs derived from a benchmark stop looking like product commitments and start looking like claims about a private testing environment that nobody else can reproduce.
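The gap between a quiet machine running serial calls and a loaded system can be illustrated with a toy queueing model. This is a sketch under stated assumptions, not a model of any real deployment: one worker, a fixed 100 ms service time, and made-up arrival patterns. The figures it produces are illustrative and are not the 1.2s and 4.8s numbers from the story above.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def simulate_latencies(arrival_times_ms, service_ms):
    """Single-worker FIFO queue; latency is finish time minus arrival time."""
    latencies = []
    free_at = 0  # time at which the worker next becomes idle
    for arrival in arrival_times_ms:
        start = max(arrival, free_at)  # wait if the worker is still busy
        free_at = start + service_ms
        latencies.append(free_at - arrival)
    return latencies


SERVICE_MS = 100  # assumed fixed cost of one call

# Eval rig: the next call is issued only after the previous one has finished.
serial_arrivals = [i * 200 for i in range(100)]
# Loaded system: calls arrive every 90 ms, slightly faster than they are served.
loaded_arrivals = [i * 90 for i in range(100)]

if __name__ == "__main__":
    serial_p95 = percentile(simulate_latencies(serial_arrivals, SERVICE_MS), 95)
    loaded_p95 = percentile(simulate_latencies(loaded_arrivals, SERVICE_MS), 95)
    print(f"serial p95: {serial_p95} ms")  # prints "serial p95: 100 ms"
    print(f"loaded p95: {loaded_p95} ms")  # prints "loaded p95: 1040 ms"
```

The service time is identical in both runs. Only the arrival pattern changes, and the p95 moves by an order of magnitude because queueing delay is part of what the second run measures.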