The Carbon Math of Agent Workflows: A Token Budget Is Now an ESG Disclosure
A stateless chat completion sips electricity. A median Gemini text prompt clocks in at about 0.24 Wh; a short GPT-4o query is around 0.3–0.4 Wh. These numbers are small enough that nobody puts them on a board deck.
An agent task is not a chat completion. A typical "go research this customer and draft a reply" workflow can fan out to 30+ tool calls, 10–15 model invocations, and a context window that grows with every step. The energy cost compounds with the call graph. By the time the agent returns, you have not consumed one unit of inference — you have consumed fifty to two hundred. Suddenly the per-task footprint is in the same order of magnitude as a video stream.
That arithmetic is about to matter outside the engineering org. The EU's CSRD makes Scope 3 emissions disclosure mandatory for in-scope companies, with machine-readable iXBRL reporting required from 2026. The SEC dropped Scope 3 from its final rule, but any multinational with EU operations still has to answer the question. Procurement teams have started adding "what is the carbon footprint per user task of your AI feature?" to vendor questionnaires. Most engineering teams cannot answer it, because nobody instrumented it.
Why Agent Workflows Break the Per-Query Mental Model
The energy estimates that get cited in press coverage — 0.24 Wh, 0.3 Wh, 0.34 Wh — are per-query numbers from public chat assistants. They describe a single prefill plus a short decode, with a small system prompt and no tool use. They do not describe what your agent does.
Three multipliers stack on top of those baselines:
Fan-out. Parallel agents that dispatch subtasks each accumulate cost independently. A planner that spawns six researchers, each making four tool calls and a final summary, has just done seven model invocations and twenty-four retrievals on what used to be one user request.
Context growth. Every tool result gets appended to the context for the next call. By step ten, you are paying prefill on a prompt that may be 50–100x the size it started at. Prefill scales roughly linearly with input tokens, so the energy cost of "the same model, the same task" climbs throughout the trajectory.
Reasoning models. The cost gap between a reasoning model and a small model on long prompts is brutal. Recent benchmarks measured o3 and DeepSeek-R1 at over 33 Wh per long prompt — more than 70x what GPT-4.1 nano consumes on the same input. If your agent uses an extended-thinking model for every step rather than for the steps that actually need it, you have already lost the carbon argument before tool selection.
Combine these and a single completed agent task can plausibly land at 15–60 Wh. That is still small in absolute terms, but it is the kind of number that moves with traffic. At a million tasks a month, you are talking about the annual electricity consumption of mid-size office buildings, sourced from grids whose carbon intensity ranges from under 50 gCO₂e/kWh in Quebec or France to over 600 gCO₂e/kWh in parts of the US South.
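As a sanity check on those ranges, here is a rough back-of-envelope estimator. Every coefficient in it is an illustrative assumption pulled from the figures above, not a measurement of any specific model, provider, or fleet.

```python
# Back-of-envelope estimate of per-task energy and fleet-level carbon.
# All coefficients are illustrative assumptions, not measurements.

JOULES_PER_WH = 3600.0

def task_energy_wh(
    model_calls: int = 15,            # invocations across the whole agent trajectory
    avg_input_tokens: int = 10_000,   # context grows as tool results accumulate
    avg_output_tokens: int = 800,
    j_per_output_token: float = 0.8,  # assumed decode cost for a mid/frontier tier
    prefill_fraction: float = 0.4,    # prefill assumed ~30-50% of decode cost per token
) -> float:
    joules_per_call = (
        avg_input_tokens * j_per_output_token * prefill_fraction
        + avg_output_tokens * j_per_output_token
    )
    return model_calls * joules_per_call / JOULES_PER_WH

def fleet_tco2e_per_year(tasks_per_month: int, wh_per_task: float,
                         grid_g_per_kwh: float) -> float:
    kwh_per_year = tasks_per_month * 12 * wh_per_task / 1000.0
    return kwh_per_year * grid_g_per_kwh / 1_000_000.0   # grams -> tonnes

if __name__ == "__main__":
    wh = task_energy_wh()          # ~16 Wh per task with the defaults above
    for grid in (50, 600):         # hydro-heavy grid vs. carbon-heavy grid
        tonnes = fleet_tco2e_per_year(1_000_000, wh, grid)
        print(f"{wh:.0f} Wh/task at {grid} gCO2e/kWh -> {tonnes:.0f} tCO2e/yr")
```

With those assumed defaults, a million tasks a month works out to roughly 190 MWh a year, and the grid you run on determines whether that is about 10 or about 115 tonnes of CO₂e.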
The Reporting Pressure Is Already Here
The political conversation about SEC Scope 3 disclosure obscured what actually shipped. The SEC's final rule does not require Scope 3 reporting, but the EU's CSRD does, and the EU's Omnibus revisions in late 2025 did not remove the Scope 3 requirement — they narrowed which companies are in scope and pushed deadlines for some filers to 2028, while keeping the substance intact.
For digital products, AI inference falls into Scope 3 Category 1 (Purchased Goods and Services) for the customer, and Scope 2 for the model vendor. That means two things in practice. First, your buyer's compliance team is going to ask you for an emissions estimate per user task, even if your own legal team has not flagged it yet. Second, the model vendors are not going to do that work for you at the per-task granularity you need. Anthropic, for example, has publicly committed to net-zero offsets and to absorbing electricity-price increases for ratepayers, but as of the most recent vendor-risk disclosures has not published per-call Scope 1/2/3 breakdowns. OpenAI has been similarly vague.
The pattern is familiar to anyone who has run a Kubernetes cluster on a shared cloud account: the only entity who can attribute the cost to a specific user task is you. The bill comes from the platform; the attribution is yours to compute.
Per-Call Attribution: The Telemetry You Already Have
The good news is that the instrumentation needed for carbon attribution overlaps almost completely with the instrumentation needed for cost attribution. If you have already adopted the OpenTelemetry GenAI semantic conventions — gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.system — you have most of what you need. The carbon math is one more attribute on the same span.
The arithmetic is straightforward. For each model call, multiply tokens by an energy-per-token coefficient, multiply that energy by the marginal grid carbon intensity of the region the call ran in, and attach the result to the trace. The hard part is picking defensible coefficients.
For energy per token, the public benchmarks place modern frontier models in the range of 0.3–1.0 J per output token at production batch sizes, and roughly 30–50% of that for input tokens (prefill is more efficient than decode). Smaller models (Haiku-class, Nano-class) sit at the bottom of that range; reasoning models land one to two orders of magnitude higher per prompt when extended thinking engages, largely because they emit far more output tokens.
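In code, the per-call attribution reduces to a few lines. A minimal sketch, assuming per-token energy coefficients in joules (discussed next) and a regional intensity in gCO₂e/kWh; the gen_ai.energy.* and gen_ai.carbon.* keys are our own extension names, not part of any standard:

```python
def call_emissions(
    input_tokens: int,
    output_tokens: int,
    j_per_input_token: float,     # prefill coefficient, joules per token
    j_per_output_token: float,    # decode coefficient, joules per token
    grid_g_per_kwh: float,        # marginal intensity of the serving region
) -> dict[str, float]:
    """Energy and carbon for one model call, ready to attach as span attributes."""
    joules = input_tokens * j_per_input_token + output_tokens * j_per_output_token
    wh = joules / 3600.0
    return {
        "gen_ai.energy.wh": wh,                               # our extension attribute
        "gen_ai.carbon.gco2e": wh / 1000.0 * grid_g_per_kwh,  # our extension attribute
    }
```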
For grid carbon, the right input is locational marginal emissions for the region your inference actually ran in, sampled hourly. Cloud vendor sustainability dashboards expose this; ElectricityMaps and the UK Carbon Intensity API offer the same signal at finer granularity. Avoid a single annual average — it hides the variation that makes routing decisions defensible.
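The UK Carbon Intensity API, for example, is public and needs no key. A minimal sketch of a half-hourly lookup, assuming the response shape documented at carbonintensity.org.uk still holds:

```python
import requests

def gb_grid_intensity_g_per_kwh() -> float:
    """Current national GB grid intensity in gCO2e/kWh (half-hourly window)."""
    resp = requests.get("https://api.carbonintensity.org.uk/intensity", timeout=5)
    resp.raise_for_status()
    intensity = resp.json()["data"][0]["intensity"]
    # 'actual' lags by a settlement period or two; fall back to the forecast
    return intensity["actual"] if intensity["actual"] is not None else intensity["forecast"]
```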
The output you want from the pipeline is not a corporate footprint number. It is a per-trace, per-user-task energy and carbon estimate, queryable by feature flag, customer cohort, and model tier. That is what lets you answer the questionnaire honestly, and more importantly, it is what lets you see which features are running hot.
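Wiring this into the trace is mostly plumbing. A sketch of a wrapper that reuses the call_emissions helper above and tags the span with the slicing dimensions just mentioned; client.create and response.usage stand in for whatever SDK your stack actually calls, and the app.* / gen_ai.energy.* / gen_ai.carbon.* attribute names are our own, not part of the OTel GenAI semantic conventions:

```python
from opentelemetry import trace

tracer = trace.get_tracer("agent.carbon")

def attributed_llm_call(client, *, feature: str, cohort: str, model: str,
                        grid_g_per_kwh: float, **request):
    """One model call with token, energy, and carbon attribution on the same span."""
    with tracer.start_as_current_span("gen_ai.request") as span:
        span.set_attribute("gen_ai.request.model", model)
        span.set_attribute("app.feature", feature)          # slice by feature flag
        span.set_attribute("app.customer_cohort", cohort)   # slice by customer cohort
        response = client.create(model=model, **request)    # stand-in for your SDK call
        usage = response.usage
        span.set_attribute("gen_ai.usage.input_tokens", usage.input_tokens)
        span.set_attribute("gen_ai.usage.output_tokens", usage.output_tokens)
        emissions = call_emissions(usage.input_tokens, usage.output_tokens,
                                   j_per_input_token=0.25, j_per_output_token=0.6,
                                   grid_g_per_kwh=grid_g_per_kwh)
        for key, value in emissions.items():
            span.set_attribute(key, value)
        return response
```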
Routing Decisions a Carbon Budget Forces
Once the per-task carbon number is in front of you, the optimization vocabulary changes. The same architectural moves that get pitched as cost optimizations are also carbon optimizations, and the carbon framing tends to expose the ones that are quietly underused.
Tier the model by stakes, not by default. A tiered architecture (Haiku-class for routing and classification, mid-tier for the bulk of generation, frontier reasoning only for the cases that genuinely need it) typically cuts total token costs by 60–70% compared to running everything through a frontier model. The carbon math is steeper than the cost math, because the smallest models are also the most energy-efficient on a per-token basis. Tier routing is the single highest-leverage decision in most agent pipelines, and the one most teams skip because "Sonnet for everything" is what their first prototype shipped with. (A minimal routing sketch follows these recommendations.)
Gate extended thinking explicitly. Reasoning-mode model calls dominate the energy profile of any agent that uses them. Treat the decision to enable extended thinking the same way you treat the decision to call a paid third-party API: explicit, logged, and tied to a confidence or stakes signal. "Always think" is the carbon equivalent of a busy loop.
Batch tool fan-out where latency permits. Batched requests amortize fixed per-call overhead and keep the accelerator busy in a way serial calls do not. If your agent has a step that fires off N independent retrievals, a single batched embedding or rerank request is dramatically more efficient than N separate ones, in both cost and joules. The architectural shift is toward a "plan, then dispatch" pattern rather than a "loop and call" pattern.
Pick regions by grid intensity within your latency SLO. This is the most controversial recommendation, because it sounds like premature optimization for environmental theater. The data does not support that read. Even modest spatial shifting of inference workloads, within the latency budget you already accept, yields meaningful emissions reductions. It works because grid carbon intensity varies between regions by an order of magnitude, while the latency cost of routing to a neighboring region typically does not. The pattern that survives in production is region pinning by user geography (for compliance) combined with intensity-aware selection within the eligible set.
Cache aggressively, especially across sessions. Cached prompt prefixes and cached embeddings are not just a latency win; they eliminate the recomputation outright. The marginal energy cost of a cache hit is the lookup itself, which is rounding error next to the prefill or embedding pass it replaces. Most agent codebases have one or two prompts that are sent on every session and rarely change. Those are pure carbon savings, and they are usually invisible because nobody tracks the avoided cost.
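To make the tiering, thinking-gate, and region decisions concrete, here is a minimal sketch: pick the cheapest model class the task's stakes allow, gate extended thinking explicitly, then pick the lowest-intensity region that still meets the latency SLO. Tier names, thresholds, and the intensity feed are all placeholder assumptions.

```python
from typing import Callable

TIERS = ("small", "mid", "frontier-reasoning")   # ordered by energy per token

def pick_tier(stakes: float, needs_reasoning: bool) -> str:
    """Route by stakes; extended thinking is an explicit, gated choice, not a default."""
    if needs_reasoning and stakes > 0.8:
        return "frontier-reasoning"
    return "mid" if stakes > 0.4 else "small"

def pick_region(intensity_by_region: dict[str, float],
                meets_latency_slo: Callable[[str], bool]) -> str:
    """Lowest-carbon region among those already filtered for compliance and the SLO."""
    candidates = [r for r in intensity_by_region if meets_latency_slo(r)]
    return min(candidates, key=intensity_by_region.get)
```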
The Architectural Argument
The reason this matters now is not that regulation will force it tomorrow. It is that sustainability constraints will shape AI infrastructure decisions before regulation does, and the teams that instrument first will own the language that the conversation gets framed in.
Three forces converge in that direction. Procurement is the loudest, because Fortune 500 buyers have started embedding carbon questions in AI vendor questionnaires alongside the security and privacy ones, and "we don't measure it" is a closing-stage objection. Investors are next, because ESG-flagged funds need attributable Scope 3 numbers from their portfolio companies, and "AI is in scope but unmeasured" reads as model risk. Internal product strategy is the quietest but the most consequential: when you can show a carbon number per feature, conversations about which features are worth keeping change shape.
The instrumentation cost is small. The attribution requires per-call telemetry that most teams are already collecting for cost tracking. The energy coefficients are public and getting better as benchmarks like TokenPowerBench mature. The grid carbon signal is already exposed by every major cloud. What is missing in most agent codebases is the connecting code — the part that joins token counts to energy coefficients to regional carbon intensity and writes the result back to the trace.
The team that builds that connector first does not win a sustainability award. They win the ability to answer the question when it gets asked, which is increasingly often, by people who control whether the contract closes. They also gain a forcing function for token efficiency that converges with cost efficiency, which means the work earns its keep even before the first ESG report.
A token budget used to be a finance question. It is becoming a disclosure. The engineering response is the same as it was for cost, latency, and reliability before it: instrument the call graph, attribute the consumption to the user task, and put the number where the people making product decisions can see it. Everything else follows from there.
Sources
- https://arxiv.org/html/2505.09598v1
- https://epoch.ai/gradient-updates/how-much-energy-does-chatgpt-use
- https://hannahritchie.substack.com/p/ai-footprint-august-2025
- https://cloud.google.com/blog/products/infrastructure/measuring-the-environmental-impact-of-ai-inference/
- https://www.mdpi.com/2071-1050/17/23/10473
- https://opentelemetry.io/blog/2024/llm-observability/
- https://www.datadoghq.com/blog/llm-otel-semantic-convention/
- https://carbonintelligence.green/blog/csrd-advertising-your-2026-compliance-roadmap/
- https://www.persefoni.com/blog/sec-climate-disclosure-rule-ghg-emissions
- https://sustainabilitymag.com/news/why-anthropic-pledging-offset-ai-energy-costs
- https://caylent.com/blog/claude-haiku-4-5-deep-dive-cost-capabilities-and-the-multi-agent-opportunity
- https://www.mindstudio.ai/blog/anthropic-advisor-strategy-cut-ai-agent-costs
- https://arxiv.org/html/2509.07218v4
