The kWh Column Missing From Your Inference Span: Carbon Attribution Per Request
Your inference flame graph has a cost axis. It does not have an energy axis. That gap is fine right up until the morning a customer's procurement team sends you a spreadsheet with twenty-three columns of vendor sustainability disclosures, and one of them is kgCO2e per 1,000 inferences. You have no way to fill that cell, your provider's answer is a methodology paper, and the deal closes in nine days. The token-cost dashboard your platform team has been polishing for two years suddenly looks like it was solving the wrong problem.
The shift here is not abstract. Sustainability disclosure is moving from corporate aggregate to product-level granularity. The first wave of that movement landed inside CSRD and ESRS in 2025, and the second wave is landing in B2B procurement contracts right now. Engineering organizations that built observability for cost are about to discover they need observability for carbon, and the two are not the same column on the same span.
The reason it matters at the span level — not just the monthly invoice level — is that aggregate footprints are useless for engineering decisions. A CFO reporting line tells you how big the problem is; it does not tell you which retrieval call, which reranker, which speculative draft, which reasoning effort tier is the energy-expensive subtree of your agent. If your only carbon data lives in a quarterly PDF, the optimization conversation never reaches the team that can actually move the number.
Why the cost column does not stand in for the energy column
It is tempting to say tokens are tokens, dollars track tokens, and so dollars track energy. They do not, and the gap is going to widen.
The first reason is hardware heterogeneity. An H100 draws roughly 700W at the chip and closer to 1,275W per GPU once you account for the host server inside an 8-GPU box. An A100 draws 400W. The same model served on different generations consumes very different power for the same token output. Inference workloads are also frequently memory-bound, so chips often draw around 70% of their rated TDP rather than running flat out. None of that variation shows up in your provider's per-token price, which is set by competitive pressure and margin strategy, not by the physics of which silicon ran your batch.
The second reason is reasoning models. A visible 500-token answer from o3 or DeepSeek-R1 can hide thousands of internal reasoning tokens that the model burned to get there. Recent benchmarking found long prompts on o3 and DeepSeek-R1 consumed over 33 watt-hours per request — more than seventy times a single GPT-4.1-nano call. Some of that reasoning cost is billed as output tokens, some of it is not, and the mapping from billed dollars to consumed kilowatt-hours is no longer linear within a single provider, let alone across them.
The third reason is the grid. A request served from a region running on overnight wind has a fraction of the carbon intensity of the same request served from a midday gas peaker. Your dollars-per-token contract has no idea where the GPUs were physically located when your batch ran, and your provider's sustainability page reports an annual average that papers over the hourly variation that actually drives marginal emissions.
Cost is one observable. Energy is another. Carbon is a third, derived from energy plus place plus time. Treating any of them as a proxy for the others gets you through the next quarter and breaks the moment a customer asks a precise question.
The estimate you can actually emit today
You will not get exact kilowatt-hours per request from a closed-API provider. You can get a defensible estimate, and a defensible estimate is what auditors, customers, and internal product teams actually need.
The shape of the estimate is straightforward. Per-request energy is a function of the model class (which sets a watts-per-token coefficient), the token volume (input, output, and where you can detect it, reasoning), the data center efficiency overhead (PUE, typically 1.10 to 1.20 for hyperscalers), and the regional grid carbon intensity at the hour the request ran. None of those factors is precise. All of them are bounded enough that the resulting estimate has a useful uncertainty band rather than being a wild guess.
For model-class coefficients, public benchmarks now exist for thirty-plus production models, and the orders of magnitude are clear: GPT-4.1-nano at roughly 0.0005 kWh per long prompt, mid-tier chat models in the 0.001-to-0.003 kWh range, reasoning models like o3 above 0.03 kWh on hard problems. Pick a coefficient, document the source, and revisit it quarterly. Do not pretend the number is more accurate than it is, and do not refuse to publish it because it is approximate. The procurement spreadsheet does not have a "we are still working on the methodology" cell.
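The shape of that estimate is small enough to sketch. The function and coefficient table below are illustrative, not vendor-published figures: the per-1,000-token values are rough assumptions chosen so the outputs land in the benchmark ranges cited above, and `estimate_request_kwh` is a hypothetical name.

```python
# Sketch of the per-request energy estimate described above.
# The coefficients are illustrative assumptions, not vendor figures --
# pick your own from a documented source and revisit them quarterly.

# kWh per 1,000 tokens (input + output + detected reasoning), by model class.
KWH_PER_1K_TOKENS = {
    "nano": 0.0001,
    "mid-tier-chat": 0.0006,
    "reasoning": 0.004,
}

PUE = 1.15  # data center efficiency overhead; hyperscalers report ~1.10-1.20


def estimate_request_kwh(model_class: str, total_tokens: int) -> float:
    """Bounded estimate: token volume x model coefficient x facility overhead."""
    coeff = KWH_PER_1K_TOKENS[model_class]
    return (total_tokens / 1000.0) * coeff * PUE
```

With these placeholder coefficients, `estimate_request_kwh("reasoning", 8000)` comes out around 0.037 kWh, the same order of magnitude as the o3 figures above; the point is not the exact value but that the calculation is trivial once the coefficient is documented.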
For grid carbon intensity, two APIs do most of the work. Electricity Maps provides hourly average carbon intensity for grid regions worldwide. WattTime provides marginal operating emissions, which answer a different and arguably more relevant question: what would have been avoided if this request had not run? Average intensity is the conservative choice for compliance disclosure. Marginal intensity is the right signal for carbon-aware scheduling, where you might shift a non-urgent batch to a cleaner hour. Pick one, document which, and do not silently switch.
For provider region carbon data, Google Cloud publishes hourly carbon-free energy percentages per region and exports footprint data to BigQuery. Microsoft Azure shipped Carbon Optimization in the portal during 2025 with both CSV export and a REST API. AWS still does not provide a programmatic carbon API and reports only market-based figures, which means your AWS-hosted inference will need a third-party intensity feed to produce a grid-based number at all.
The span attribute that should sit next to cost
The point of putting energy on the span and not on the monthly rollup is that flame graphs are how engineers find waste. A trace that shows gen_ai.usage.input_tokens and gen_ai.usage.output_tokens but no energy attribute leaves the optimization conversation half-formed. You can see which subtree spent dollars; you cannot see which subtree spent watt-hours.
OpenTelemetry's GenAI semantic conventions stabilized in 2025 around token-usage and model-identification attributes. Sustainability attributes are still under active proposal — there is an open issue (#835 in the semantic-conventions repo) tracking hardware-level energy and carbon metrics, and the GenAI spec has not yet adopted a recommended energy attribute. That is exactly the gap to fill in your own collector before the standard catches up: emit gen_ai.usage.energy_kwh and gen_ai.usage.co2e_g as span attributes alongside the token counts, document that they are estimates and which methodology produced them, and ship.
The collector pipeline does the math. You already know the model from gen_ai.request.model. You already know the region from your routing layer. You already know the token counts from the provider response. Multiply by the model coefficient, multiply by the regional intensity at the request's wall-clock hour, attach to the span. The estimate is bounded, the methodology is auditable, and the flame graph now has the third axis a sustainability question requires.
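That join can be expressed as one small function. The sketch below uses the attribute names proposed above; the coefficient and intensity arguments are whatever your methodology doc specifies, and the example values in the usage line are placeholders.

```python
# Collector-side sketch: token counts + documented model coefficient
# + hourly regional grid intensity -> two estimated span attributes.
# All numeric inputs are illustrative assumptions.

def carbon_attributes(
    model_kwh_per_1k_tokens: float,  # versioned coefficient for gen_ai.request.model
    input_tokens: int,
    output_tokens: int,
    grid_g_per_kwh: float,           # intensity at the request's wall-clock hour
    pue: float = 1.15,
) -> dict:
    kwh = ((input_tokens + output_tokens) / 1000.0) * model_kwh_per_1k_tokens * pue
    return {
        "gen_ai.usage.energy_kwh": kwh,
        "gen_ai.usage.co2e_g": kwh * grid_g_per_kwh,
    }

# With the OpenTelemetry SDK, the resulting dict would be attached via
# span.set_attributes(...) next to the existing token-count attributes.
```

Because the function is pure, it is trivial to unit-test and to re-run against historical spans when a coefficient revision lands, which is what makes the methodology versionable rather than a one-off.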
What this unlocks practically: a query that retrieves twelve documents, runs them through a reranker, and synthesizes with a reasoning model is now visibly the energy-expensive subtree of the agent's run. A 500ms latency optimization that saves 30% of CPU time is also a measurable carbon improvement, not a footnote. A prompt revision that drops average reasoning tokens by 40% has a number associated with it that the sustainability team can put in the next disclosure draft.
The procurement conversation that is not optional
The B2B side of this is the part most engineering teams underestimate. CSRD's first reporting cycle covers FY 2024 with reports filed in 2025, and the directive's value-chain provisions push Scope 3 disclosure obligations down to suppliers — including your AI providers and, transitively, your AI-powered features. ESRS revisions in mid-2025 reduced mandatory data points by 57% but kept the Scope 3 climate disclosures intact for in-scope companies. The signal to vendors is unambiguous: customers in regulated markets are about to ask "what is the per-call footprint of using your service?" and "we'll get back to you" is not an acceptable answer.
Provider responses range from "here is an API" to "here is a methodology paper" to silence. As of early 2026, neither OpenAI nor Anthropic publishes per-query energy figures. OpenAI's CEO stated a 0.34 Wh average for ChatGPT in mid-2025; Google followed with a 0.24 Wh median for Gemini text prompts and 0.03 grams CO₂e per request. Those are sound bites, not engineering inputs — they are unconditioned averages that do not vary by model size, prompt length, or region. If your customer's auditor asks for the per-call footprint of a specific feature, "we believe it is approximately 0.34 watt-hours based on a CEO tweet" is not the artifact they are expecting.
The contract conversation that follows has two failure modes. The first is engineering teams committing to disclosure granularity that the platform cannot produce, because the procurement team did not ask whether the data exists before signing the contract. The second is the platform team waiting for a provider to publish official per-call data, which may never happen because the providers consider model architecture and infrastructure utilization to be trade secrets. The way out of both is to ship your own estimate with documented methodology, treat it as a published interface that the procurement team can quote, and version it so that improvements in the methodology do not retroactively invalidate prior disclosures.
What "we'll add it later" actually costs
The argument against doing this now is the obvious one: the estimate is approximate, the standards are immature, the regulations are still being refined, and there are higher-ROI engineering investments. All of that is true. None of it changes the calendar.
What "we'll add it later" looks like in practice is: a customer-facing sustainability questionnaire arrives, the platform team has a week to estimate per-feature footprints across forty product surfaces, and the resulting numbers are extracted by hand from a billing CSV and a back-of-envelope token estimate done in a hurry by someone who has never read the cloud-carbon-footprint methodology. Those numbers go on a customer's vendor scorecard. They do not get revised. They become the baseline against which next year's claimed improvements are measured.
The alternative is to put a kWh column on the inference span now, document that it is an estimate within ±30%, and revise the coefficients quarterly. The number will be wrong in detail and right in shape. That is enough to defend a procurement disclosure, enough to drive an internal optimization conversation, and enough to demonstrate to an auditor that the company has a measurement program rather than a press release. The right answer in this domain is approximate today. The wrong answer is silence followed by improvisation, and improvisation under deadline is how teams ship the numbers they later wish they had not.
The kWh column is not exotic infrastructure. It is one attribute on a span you are already emitting, computed from data you already have, joined to a public API that costs less per month than a small cloud VM. The reason it is not in production yet is that no one has had to put it there. That is changing, and the teams that move first will spend the next year refining a methodology while the teams that wait will spend the next year explaining why their answer is a paragraph instead of a number.
- https://arxiv.org/html/2507.11417v1
- https://arxiv.org/html/2505.09598v1
- https://epoch.ai/gradient-updates/how-much-energy-does-chatgpt-use
- https://www.technologyreview.com/2025/05/20/1116327/ai-energy-usage-climate-footprint-big-tech/
- https://opentelemetry.io/docs/specs/semconv/gen-ai/
- https://github.com/open-telemetry/semantic-conventions/issues/835
- https://app.electricitymaps.com/docs
- https://docs.watttime.org/
- https://cloud.google.com/sustainability/region-carbon
- https://www.cloudcarbonfootprint.org/docs/methodology/
- https://arxiv.org/html/2502.05043v1
- https://www.senken.io/academy/csrd-reporting-requirements
