
The kWh Column Missing From Your Inference Span: Carbon Attribution Per Request

· 10 min read
Tian Pan
Software Engineer

Your inference flame graph has a cost axis. It does not have an energy axis. That gap is fine right up until the morning a customer's procurement team sends you a spreadsheet with twenty-three columns of vendor sustainability disclosures, and one of them is kgCO2e per 1,000 inferences. You have no way to fill that cell, your provider's answer is a methodology paper, and the deal closes in nine days. The token-cost dashboard your platform team has been polishing for two years suddenly looks like it was solving the wrong problem.

The shift here is not abstract. Sustainability disclosure is moving from corporate aggregate to product-level granularity. The first wave of that movement landed inside CSRD and ESRS in 2025, and the second wave is landing in B2B procurement contracts right now. Engineering organizations that built observability for cost are about to discover they need observability for carbon, and the two are not the same column on the same span.

The reason it matters at the span level — not just the monthly invoice level — is that aggregate footprints are useless for engineering decisions. A CFO reporting line tells you how big the problem is; it does not tell you which retrieval call, which reranker, which speculative draft, which reasoning effort tier is the energy-expensive subtree of your agent. If your only carbon data lives in a quarterly PDF, the optimization conversation never reaches the team that can actually move the number.

Why the cost column does not stand in for the energy column

It is tempting to say tokens are tokens, dollars track tokens, and so dollars track energy. They do not, and the gap is going to widen.

The first reason is hardware heterogeneity. An H100 draws roughly 700W at the chip and closer to 1,275W per GPU once you account for the host server inside an 8-GPU box. An A100 draws about 400W at the chip. The same model served on different hardware generations therefore consumes very different amounts of energy for the same token output. Inference workloads are also often memory-bound, so chips tend to run at around 70% of peak rather than full TDP. None of that variation shows up in your provider's per-token price, which is set by competitive pressure and margin strategy, not by the physics of which silicon ran your batch.
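To make the hardware point concrete, here is a minimal sketch of the arithmetic. The chip power figures and the 70% utilization come from the paragraph above; the throughput numbers are invented placeholders, and `wh_per_1k_tokens` is a hypothetical helper, not part of any library.

```python
def wh_per_1k_tokens(power_w: float, utilization: float, tokens_per_second: float) -> float:
    """Energy in watt-hours to generate 1,000 tokens at a steady throughput.

    power_w:           rated power of the device (chip-level or per-GPU server share)
    utilization:       fraction of rated power actually drawn (e.g. 0.70)
    tokens_per_second: sustained throughput; ASSUMED here, measure your own
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    average_power_w = power_w * utilization
    seconds_for_1k = 1000 / tokens_per_second
    joules = average_power_w * seconds_for_1k
    return joules / 3600  # 1 Wh = 3600 J


# Chip power from the text; throughputs are hypothetical placeholders.
h100_wh = wh_per_1k_tokens(power_w=700, utilization=0.70, tokens_per_second=2000)
a100_wh = wh_per_1k_tokens(power_w=400, utilization=0.70, tokens_per_second=800)

print(f"H100: {h100_wh:.4f} Wh per 1k tokens")
print(f"A100: {a100_wh:.4f} Wh per 1k tokens")
```

The point of the sketch is that energy per token depends on both power draw and throughput, and neither appears in a per-token price.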

The second reason is reasoning models. A visible 500-token answer from o3 or DeepSeek-R1 can hide thousands of internal reasoning tokens that the model burned to get there. Recent benchmarking found long prompts on o3 and DeepSeek-R1 consumed over 33 watt-hours per request — more than seventy times a single GPT-4.1-nano call. Some of that reasoning cost is billed as output tokens, some of it is not, and the mapping from billed dollars to consumed kilowatt-hours is no longer linear within a single provider, let alone across them.

The third reason is the grid. A request served from a region running on overnight wind has a fraction of the carbon intensity of the same request served from a midday gas peaker. Your dollars-per-token contract has no idea where the GPUs were physically located when your batch ran, and your provider's sustainability page reports an annual average that papers over the hourly variation that actually drives marginal emissions.

Cost is one observable. Energy is another. Carbon is a third, derived from energy plus place plus time. Treating any of them as a proxy for the others gets you through the next quarter and breaks the moment a customer asks a precise question.

The estimate you can actually emit today

You will not get exact kilowatt-hours per request from a closed-API provider. You can get a defensible estimate, and a defensible estimate is what auditors, customers, and internal product teams actually need.

The shape of the estimate is straightforward. Per-request energy is a function of the model class (which sets a watt-hours-per-token coefficient), the token volume (input, output, and, where you can detect it, reasoning), and the data center efficiency overhead (PUE, typically 1.10 to 1.20 for hyperscalers). Per-request carbon is that energy multiplied by the regional grid carbon intensity at the hour the request ran. None of those factors is precise. All of them are bounded enough that the resulting estimate has a useful uncertainty band rather than being a wild guess.
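The estimate described above can be sketched in a few lines. This is an illustration under stated assumptions, not a reference methodology: `EnergyCoefficient` and `estimate_request` are hypothetical names, the coefficient values in the usage example are placeholders, and reasoning tokens are charged at the output-token rate for lack of better information.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnergyCoefficient:
    """Hypothetical per-model-class coefficient, in Wh per 1,000 tokens."""
    wh_per_1k_input: float
    wh_per_1k_output: float  # assumption: also applied to reasoning tokens


def estimate_request(
    coeff: EnergyCoefficient,
    input_tokens: int,
    output_tokens: int,
    reasoning_tokens: int = 0,
    pue: float = 1.15,              # midpoint of the 1.10-1.20 range in the text
    grid_g_per_kwh: float = 400.0,  # placeholder; use the hourly regional value
) -> dict[str, float]:
    """Return estimated facility energy (kWh) and carbon (gCO2e) for one request."""
    it_energy_wh = (
        input_tokens / 1000 * coeff.wh_per_1k_input
        + (output_tokens + reasoning_tokens) / 1000 * coeff.wh_per_1k_output
    )
    facility_kwh = it_energy_wh * pue / 1000  # apply PUE, convert Wh to kWh
    return {
        "energy_kwh": facility_kwh,
        "carbon_g": facility_kwh * grid_g_per_kwh,
    }


# Placeholder coefficient; replace with a documented, sourced value.
coeff = EnergyCoefficient(wh_per_1k_input=0.1, wh_per_1k_output=1.0)
print(estimate_request(coeff, input_tokens=2000, output_tokens=500, reasoning_tokens=1500))
```

Calling the function once with the low end of each input and once with the high end gives a simple uncertainty band to publish alongside the point estimate.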

For model-class coefficients, public benchmarks now exist for thirty-plus production models, and the orders of magnitude are clear: GPT-4.1-nano at roughly 0.0005 kWh per long prompt, mid-tier chat models in the 0.001-to-0.003 kWh range, reasoning models like o3 above 0.03 kWh on hard problems. Pick a coefficient, document the source, and revisit it quarterly. Do not pretend the number is more accurate than it is, and do not refuse to publish it because it is approximate. The procurement spreadsheet does not have a "we are still working on the methodology" cell.
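One way to keep coefficients honest is to store each one with its source and review date, then derive the procurement figure from it. The sketch below uses the orders of magnitude quoted above; the class names, the `source` and `reviewed` placeholders, and the 0.4 kgCO2e/kWh grid intensity are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelClassCoefficient:
    """Hypothetical record: per-request energy range plus provenance."""
    kwh_low: float
    kwh_high: float
    source: str    # where the number came from; fill in your citation
    reviewed: str  # when it was last revisited, e.g. a quarter label


# Ranges follow the figures quoted in the text. The reasoning entry is a
# floor only ("above 0.03 kWh"), so low and high are both set to that floor.
COEFFICIENTS = {
    "small": ModelClassCoefficient(0.0005, 0.0005, "TODO: cite benchmark", "TODO"),
    "mid_tier_chat": ModelClassCoefficient(0.001, 0.003, "TODO: cite benchmark", "TODO"),
    "reasoning": ModelClassCoefficient(0.03, 0.03, "TODO: cite benchmark", "TODO"),
}


def kg_co2e_per_1000_inferences(
    coeff: ModelClassCoefficient, grid_kg_per_kwh: float
) -> tuple[float, float]:
    """Return the (low, high) kgCO2e per 1,000 inferences for a model class."""
    return (
        coeff.kwh_low * 1000 * grid_kg_per_kwh,
        coeff.kwh_high * 1000 * grid_kg_per_kwh,
    )


# Assumed grid intensity of 0.4 kgCO2e/kWh, purely for illustration.
for name, coeff in COEFFICIENTS.items():
    low, high = kg_co2e_per_1000_inferences(coeff, grid_kg_per_kwh=0.4)
    print(f"{name}: {low:.2f} to {high:.2f} kgCO2e per 1,000 inferences")
```

Reporting the range rather than a single number is what lets you publish an approximate figure without overstating its accuracy.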

For grid carbon intensity, two APIs do most of the work. Electricity Maps provides hourly average carbon intensity for grid regions worldwide. WattTime provides marginal operating emissions, which answer a different and arguably more relevant question: what would have been avoided if this request had not run? Average intensity is the conservative choice for compliance disclosure. Marginal intensity is the right signal for carbon-aware scheduling, where you might shift a non-urgent batch to a cleaner hour. Pick one, document which, and do not silently switch.
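The scheduling use case can be sketched without committing to either provider's API. The readings below are made-up values standing in for whatever your chosen source returns; `IntensityReading` and `cleanest_hour` are hypothetical names. The guard against mixed signals is one way to enforce "do not silently switch".

```python
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class IntensityReading:
    """Hypothetical forecast point for one grid region."""
    hour_offset: int                       # hours from now
    g_co2e_per_kwh: float
    signal: Literal["average", "marginal"]  # which methodology produced it


def cleanest_hour(readings: list[IntensityReading], deadline_hours: int) -> IntensityReading:
    """Pick the lowest-intensity hour at or before the deadline.

    Refuses to compare readings that mix average and marginal signals.
    """
    candidates = [r for r in readings if 0 <= r.hour_offset <= deadline_hours]
    if not candidates:
        raise ValueError("no readings inside the scheduling window")
    if len({r.signal for r in candidates}) != 1:
        raise ValueError("readings mix average and marginal signals")
    return min(candidates, key=lambda r: r.g_co2e_per_kwh)


# Invented forecast values, for illustration only.
forecast = [
    IntensityReading(0, 450.0, "marginal"),
    IntensityReading(1, 300.0, "marginal"),
    IntensityReading(2, 120.0, "marginal"),
    IntensityReading(3, 90.0, "marginal"),
]
best = cleanest_hour(forecast, deadline_hours=2)
print(f"Run the batch at +{best.hour_offset}h ({best.g_co2e_per_kwh} gCO2e/kWh, {best.signal})")
```

Storing the signal type next to every reading also means the same field can be written onto the span, so a disclosure report can state which methodology produced each number.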
