The CI Host Whose CPU Governor Decided Your Agent Benchmark's Outcome
A team I worked with spent three days hunting a 22% latency regression in their agent loop. They blamed a new tool router. They blamed a model version bump. They blamed the JSON schema validator they had quietly upgraded the week before. They eventually found the culprit two layers below their code: a runner image had rolled forward, the new image defaulted the cpufreq governor to schedutil instead of performance, and the burstiness of the agent's tool-call loop made schedutil's ramp-up latency visible in p95. The model was fine. The agent was fine. The kernel changed its mind about how to clock the CPU between micro-bursts of work, and the entire benchmark moved.
This is the failure mode most agent teams never see, because they never look. Your CI benchmark numbers are not measurements of the model or the agent. They are measurements of a stack that happens to include a model, a network, a shared VM, a hypervisor scheduler, a cache hierarchy with unknown neighbors, and, most quietly, a frequency-scaling policy that gets to decide whether a given millisecond of compute runs at 1.0 GHz or 3.6 GHz.
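
Looking is cheap. As a minimal sketch, assuming a Linux runner that exposes the standard cpufreq sysfs tree under /sys/devices/system/cpu, the snippet below reads each CPU's scaling_governor file and summarizes what it finds so the result can be stored next to a benchmark run. The function names read_governors and governor_fingerprint are hypothetical, chosen for illustration; they are not the team's tooling.

```python
from collections import Counter
from pathlib import Path

# Standard Linux sysfs location for per-CPU cpufreq settings.
SYSFS_CPU_ROOT = Path("/sys/devices/system/cpu")


def read_governors(cpu_root=SYSFS_CPU_ROOT):
    """Return {cpu_name: governor} for every CPU that exposes cpufreq.

    Hypothetical helper. An empty dict means no cpufreq files were found,
    for example on a non-Linux host or a guest without cpufreq exposed.
    """
    governors = {}
    pattern = "cpu[0-9]*/cpufreq/scaling_governor"
    for path in sorted(Path(cpu_root).glob(pattern)):
        try:
            # path is .../cpuN/cpufreq/scaling_governor, so two parents up is cpuN.
            governors[path.parent.parent.name] = path.read_text().strip()
        except OSError:
            # Unreadable entry (permissions, CPU went offline): skip it.
            continue
    return governors


def governor_fingerprint(cpu_root=SYSFS_CPU_ROOT):
    """Summarize governors into a small dict suitable for run metadata."""
    governors = read_governors(cpu_root)
    return {
        "cpufreq_visible": bool(governors),
        "governor_counts": dict(Counter(governors.values())),
    }


if __name__ == "__main__":
    import json

    # Print the fingerprint so it can be captured alongside benchmark output.
    print(json.dumps(governor_fingerprint(), indent=2))
```

If the fingerprint comes back with cpufreq_visible set to false, that is worth recording too: it suggests the frequency policy is being set somewhere the guest cannot see. Comparing fingerprints across two runs is a quick way to tell whether the host changed before blaming the agent.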
