The CI Host Whose CPU Governor Decided Your Agent Benchmark's Outcome
A team I worked with spent three days hunting a 22% latency regression in their agent loop. They blamed a new tool router. They blamed a switched model version. They blamed the JSON schema validator they had quietly upgraded the week before. They eventually found the culprit two layers below their code: a runner image had rolled forward, the new image defaulted the cpufreq governor to schedutil instead of performance, and the burstiness of an agent's tool-call loop made schedutil's ramp-up latency visible in p95. The model was fine. The agent was fine. The kernel changed its mind about how to clock the CPU between micro-bursts of work, and the entire benchmark moved.
This is the failure mode most agent teams never see, because they never look. Your CI benchmark numbers are not measurements of the model or the agent. They are measurements of a stack that happens to include a model, a network, a shared VM, a hypervisor scheduler, a cache hierarchy with unknown neighbors, and — most quietly — a frequency-scaling policy that gets to decide whether a given millisecond of compute runs at 1.0 GHz or 3.6 GHz.
The reason this hides so well is that agent latency looks like a network-bound story. The model is remote. The tools are remote. The cumulative latency is dominated by token generation. You wouldn't expect the host's CPU clock to matter at all. But the agent loop itself (the JSON parsing, the schema validation, the tool dispatch, the trace serialization, the embedding lookups, the local cache hits) runs on the runner. And it runs in short, bursty intervals between remote calls, which is exactly the workload pattern that defeats dynamic frequency governors.
What a Governor Actually Does to a Bursty Workload
A CPU frequency governor is a kernel policy that decides what clock speed a core should run at, given recent utilization. Defaults on Linux have shifted over the past few years: many server images used performance (pinned at max) for benchmarking workloads, while general-purpose images moved to schedutil, which integrates with the kernel scheduler's per-entity load tracking. Ubuntu 24.04 LTS on ARM defaults to schedutil. Several cloud images do too. The intent is sensible (match clock speed to load so idle hosts don't burn power), but the practical effect on bursty work is a measurable ramp-up delay.
Here's the mechanism. When a core has been mostly idle, schedutil clocks it down. When work arrives, the governor needs to observe the load before deciding to ramp the frequency back up. That observation window is small, on the order of milliseconds, but it is not zero. For a workload with intermittent CPU usage, the governor sees a process that is mostly sleeping and keeps the core at a low frequency, then has to scale up in the middle of each intensive stretch, and that shows up as fluctuation in recorded benchmarks. Agent loops fit this profile almost exactly: a few milliseconds of local work, then hundreds or thousands of milliseconds waiting on a remote token stream, then another small burst of local work.
The first few milliseconds of each burst run at a lower frequency. For a tool-call-heavy agent, that slow start adds up across dozens of tool turns. A single benchmark run absorbs the cost. A regression test that runs the same scenario 50 times absorbs it 50 times, and now your p50 is several percent slower than it was on the previous runner image, without any code change at all.
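To make the accumulation concrete, here is a deliberately crude Python model of one burst. It assumes the core sits at a low clock for a fixed ramp window and then jumps straight to the high clock. Real governors step through intermediate frequencies, so treat this as a sketch. The 1.0 GHz and 3.6 GHz figures echo the ones used in this article; the burst size, the 2 ms ramp window, and the 40 tool turns are made-up inputs, not measurements.

```python
def burst_time_ms(work_mcycles: float, f_low_ghz: float,
                  f_high_ghz: float, ramp_ms: float) -> float:
    """Wall time for one CPU burst under a step-shaped ramp (toy model).

    1 GHz is one million cycles per millisecond, so a frequency in GHz
    is numerically the number of megacycles completed per millisecond.
    """
    done_during_ramp = f_low_ghz * ramp_ms  # megacycles finished at the low clock
    if work_mcycles <= done_during_ramp:
        # The burst ends before the governor ever ramps up.
        return work_mcycles / f_low_ghz
    return ramp_ms + (work_mcycles - done_during_ramp) / f_high_ghz


def ramp_penalty_ms(work_mcycles: float, f_low_ghz: float,
                    f_high_ghz: float, ramp_ms: float, bursts: int) -> float:
    """Extra wall time across `bursts` bursts versus a clock pinned at f_high."""
    pinned = work_mcycles / f_high_ghz
    scaled = burst_time_ms(work_mcycles, f_low_ghz, f_high_ghz, ramp_ms)
    return bursts * (scaled - pinned)


if __name__ == "__main__":
    # Hypothetical scenario: 18 Mcycles per burst (5 ms at 3.6 GHz),
    # a 2 ms ramp window, and 40 tool turns per benchmark scenario.
    extra = ramp_penalty_ms(18.0, 1.0, 3.6, 2.0, bursts=40)
    print(f"extra local time per scenario: {extra:.1f} ms")
```

With these invented inputs, each burst takes roughly 6.4 ms instead of 5.0 ms. The point is not the exact figure but that the penalty scales with the number of bursts rather than with the total amount of work.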
Cloud CI Makes This Worse, Not Better
A common reflex is "we use cloud CI, so this can't matter — we're not running on hardware we can configure." The reflex is wrong, and the situation is actually worse in cloud.
Cloud CI runners do not guarantee consistent hardware. GitHub Actions standard runners historically used Standard_DS2_v2 Azure instances, which span multiple Intel CPU generations with different L2 and L3 cache sizes. The runner you get for one job may be physically different from the runner you got an hour ago. CPU-bound benchmarks tend to be more stable than memory- or disk-bound ones, but reported variability in cloud CI environments commonly reaches 50%, with some studies citing average performance differences of up to 3x between runs of the same job.
On top of CPU model variance you get noisy neighbors. Cloud VMs share physical hardware. Cache and memory bandwidth are shared resources. Empirical measurements of L3 cache behavior in cloud VMs show noisy patterns over time. Hypervisor scheduling and resource overcommitment compound this. One survey of virtualized performance degradation reports degradation factors up to 16x relative to bare metal in worst cases.
Layer the governor effect on top of all of this and you have a benchmark whose result is determined by: (1) which physical CPU you happen to land on, (2) what your noisy neighbor is doing this hour, (3) what governor the runner image happens to ship with this week, (4) whether boost is enabled at the hypervisor level, and (5) somewhere down the list, your code change. The signal-to-noise ratio for catching a real agent-loop regression in this environment is genuinely terrible.
The Recent Examples That Broke Real Teams
This isn't theoretical. A few patterns have surfaced in the last couple of years that map directly onto agent benchmarking.
The glibc-on-Actions case: CodSpeed documented a benchmark regression that turned out to be GLIBC 2.33+ loading different optimized library variants depending on the underlying CPU. Same code, same dependency versions, different runner CPU model, different hot path in the C library. Teams that run their agent in a container assume the container abstracts hardware; it does not abstract glibc's CPU-feature-dispatching IFUNC selectors.
The GCP boost-clock toggle: Google Cloud added an explicit option to disable boost clock on Emerald Rapids instances specifically to give customers "consistent performance." The boost feature itself is the source of inconsistency that customers asked to be able to turn off. Teams running benchmarks on default-configured cloud VMs are measuring with boost enabled, which means runs vary depending on neighbor load and thermal headroom.
The runner-image rollover: GitHub-hosted runner images update on a known cadence. Image updates have, in documented cases, changed default kernel parameters, governor defaults, or even microcode revisions. A team's benchmark numbers can shift by a measurable percentage on the morning an image rolls — with no commit, no PR, and no model change. The blame falls on the most recent code change, and the team spends days reverting innocent commits.
What an Honest Agent Benchmark Looks Like
If you are serious about catching real agent-loop regressions, the host-config problem forces a few discipline shifts.
- Pin the governor explicitly. If you have any control over the runner (self-hosted, dedicated VM, bare-metal pool), set performance and document it. If you don't, at least log /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor at the start of every benchmark run so you can correlate result shifts with governor changes after the fact.
- Log the host CPU model. lscpu | grep "Model name" at the top of the benchmark output. When a regression appears, the first question is "did the CPU change," not "what did we commit." For cloud CI specifically, also log /proc/cpuinfo cache sizes, since these vary across the same instance type.
- Stop comparing single-run numbers. Run each benchmark scenario long enough to amortize ramp-up effects: 10 to 30 minutes is the range practitioners report works for surfacing boost decay and noisy-neighbor patterns. Report median, p95, and a dispersion metric like coefficient of variation across repeated runs. A 5% median shift inside a 20% CV envelope is noise. A 5% median shift inside a 1% CV envelope is signal.
- Separate the agent-loop benchmark from the model-latency benchmark. Most of the host-sensitive work is the local loop: JSON, schema, dispatch, trace serialization, embedding cache. Most of the model latency is provider-side and host-insensitive. Bench them separately so you can attribute movement correctly. A regression in the local-loop benchmark with no regression in the model-call benchmark is a host-config story until proven otherwise.
- Run cross-day, cross-time. A benchmark that only runs once after merge is a benchmark that gets one sample from one runner with one set of neighbors. Schedule the same benchmark on a cron across several days and look at the distribution. Real regressions show up as distribution shifts; host noise shows up as time-of-day patterns.
- Use deterministic measurement tools for the local-loop piece. Cachegrind-style instrumentation that counts instructions instead of timing them removes host variance entirely for the deterministic portions of your code. You won't get wall-clock numbers, but you will get a stable signal for whether your local loop got more expensive in terms of work performed.
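The reporting advice in the list above (median, p95, and coefficient of variation across repeated runs) can be sketched in a few lines of Python. The function names are illustrative, and the decision rule is an assumption of this sketch: it treats a median shift as signal only when it exceeds a chosen multiple of the larger CV. That is a rough heuristic that agrees with the 5%-in-20% and 5%-in-1% examples, not a formal statistical test.

```python
import math
import statistics


def summarize(samples_ms):
    """Median, nearest-rank p95, and coefficient of variation of run times."""
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(0.95 * len(ordered)))  # nearest-rank percentile
    mean = statistics.fmean(ordered)
    return {
        "median": statistics.median(ordered),
        "p95": ordered[rank - 1],
        "cv": statistics.stdev(ordered) / mean,  # needs at least two samples
    }


def median_shift_is_signal(baseline_ms, candidate_ms, noise_factor=2.0):
    """Heuristic: call a median shift real only if it clears the noise envelope.

    `noise_factor` is an arbitrary choice for this sketch; tune it against
    your own history of known-good runs.
    """
    base = summarize(baseline_ms)
    cand = summarize(candidate_ms)
    shift = abs(cand["median"] - base["median"]) / base["median"]
    envelope = max(base["cv"], cand["cv"])
    return shift > noise_factor * envelope


if __name__ == "__main__":
    quiet = [100, 101, 99, 100, 100.5, 99.5, 100, 101, 99, 100]
    noisy = [80, 120, 100, 70, 130, 100, 90, 110, 100, 100]
    # The same +5 ms shift applied to a quiet and a noisy baseline.
    print(median_shift_is_signal(quiet, [x + 5 for x in quiet]))
    print(median_shift_is_signal(noisy, [x + 5 for x in noisy]))
```

The two sample lists are invented. They share a median of 100 ms, so the same 5% shift is flagged against the quiet baseline and ignored against the noisy one.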
The Conceptual Shift
The underlying mistake is treating "latency" as a property of your code. In a CI environment, it isn't. Latency is a property of the entire stack at the moment of measurement, and on cloud CI, most of that stack is outside your repo. Your benchmark dashboard shows you a number, and the number has many parents: the model's tokenization speed, the network's tail behavior, the runner's CPU model, the noisy neighbor's cache pressure, and, quietly, the governor's decision about whether the millisecond in which your tool-dispatch function ran deserved 1.0 GHz or 3.6 GHz.
When a regression shows up, the disciplined first move is to ask which parents changed since the last green run. The model? Probably not. The agent loop code? Maybe. The runner image? Often. The governor default in the runner image? Almost never asked, sometimes the answer.
The teams that ship reliable agents are the ones that have separated their benchmark stack into pieces with known sensitivity to host config, and that have made the host config a logged, version-controlled artifact instead of a silent input. The teams that haven't are still reverting commits trying to chase a regression two layers below their code.
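As a minimal sketch of what a logged host-config artifact might contain, the Python below reads the governor from sysfs and the CPU model and cache size from /proc/cpuinfo, then emits one JSON line to store next to the benchmark results. The function and key names are illustrative. It assumes a Linux host; the "model name" and "cache size" fields are what x86 kernels typically expose and may be missing on other architectures. Many cloud VMs expose no cpufreq directory at all, so a missing file is recorded as null rather than raising.

```python
import json
import platform
from pathlib import Path

GOVERNOR_PATH = Path("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor")
CPUINFO_PATH = Path("/proc/cpuinfo")


def read_text_or_none(path):
    """Return the stripped file contents, or None if the file is unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def cpuinfo_field(cpuinfo_text, key):
    """First value for `key` in /proc/cpuinfo-style 'name : value' lines."""
    for line in cpuinfo_text.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip() == key:
            return value.strip()
    return None


def host_fingerprint():
    """Collect the host facts worth correlating against benchmark shifts."""
    cpuinfo = read_text_or_none(CPUINFO_PATH) or ""
    return {
        "governor": read_text_or_none(GOVERNOR_PATH),
        "cpu_model": cpuinfo_field(cpuinfo, "model name"),
        "cache_size": cpuinfo_field(cpuinfo, "cache size"),
        "kernel": platform.release(),
        "machine": platform.machine(),
    }


if __name__ == "__main__":
    # One JSON line per run, emitted before any timing starts.
    print(json.dumps(host_fingerprint(), sort_keys=True))
```

Storing this line alongside each result makes "did the host change" a lookup instead of an investigation.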
If your agent benchmark cannot tell you which CPU it ran on, which governor was active, and how the result distribution compared to the last week of runs, the benchmark is not measuring your agent. It is measuring the runner. And the runner has its own opinions about how fast your code should be.
- https://docs.kernel.org/admin-guide/pm/cpufreq.html
- https://karthikkaranth.me/blog/performance-benchmarking-beware-frequency-scaling/
- https://www.phoronix.com/review/amd-2990wx-cpufreq/4
- https://pythonspeed.com/articles/consistent-benchmarking-in-ci/
- https://aakinshin.net/posts/github-actions-perf-stability/
- https://codspeed.io/blog/unrelated-benchmark-regression
- https://runs-on.com/benchmarks/github-actions-cpu-performance/
- https://huggingface.co/blog/daya-shankar/cloud-vm-performance-benchmarking
- https://pythonspeed.com/articles/cpu-limits-to-speed/
- https://www.phoronix.com/news/AmpereOne-CPPC-CPUFreq
- https://wiki.archlinux.org/title/CPU_frequency_scaling
