The Carbon Math of Agent Workflows: A Token Budget Is Now an ESG Disclosure
A stateless chat completion sips electricity. A median Gemini text prompt clocks in at about 0.24 Wh; a short GPT-4o query is around 0.3–0.4 Wh. These numbers are small enough that nobody puts them on a board deck.
An agent task is not a chat completion. A typical "go research this customer and draft a reply" workflow can fan out to 30+ tool calls, 10–15 model invocations, and a context window that grows with every step. The energy cost compounds with the call graph. By the time the agent returns, you have not consumed one unit of inference — you have consumed fifty to two hundred. Suddenly the per-task footprint is in the same order of magnitude as a video stream.
That arithmetic is about to matter outside the engineering org. The EU's CSRD makes Scope 3 emissions disclosure mandatory for in-scope companies, with machine-readable iXBRL reporting required from 2026. The SEC dropped Scope 3 from its final rule, but any multinational with EU operations still has to answer the question. Procurement teams have started adding "what is the carbon footprint per user task of your AI feature?" to vendor questionnaires. Most engineering teams cannot answer it, because nobody instrumented it.
Why Agent Workflows Break the Per-Query Mental Model
The energy estimates that get cited in press coverage — 0.24 Wh, 0.3 Wh, 0.34 Wh — are per-query numbers from public chat assistants. They describe a single prefill plus a short decode, with a small system prompt and no tool use. They do not describe what your agent does.
Three multipliers stack on top of those baselines:
Fan-out. When a planner dispatches subtasks to parallel agents, each one accumulates cost independently. A planner that spawns six researchers, each making four tool calls and a final summary, has made at least seven model invocations (one planning call plus six summaries) and twenty-four retrievals for what used to be one user request.
Context growth. Every tool result gets appended to the context for the next call. By step ten, you are paying prefill on a prompt that may be 50–100x the size it started at. Prefill scales roughly linearly with input tokens, so the energy cost of "the same model, the same task" climbs throughout the trajectory.
Reasoning models. The cost gap between a reasoning model and a small model on long prompts is brutal. Recent benchmarks measured o3 and DeepSeek-R1 at over 33 Wh per long prompt — more than 70x what GPT-4.1 nano consumes on the same input. If your agent uses an extended-thinking model for every step rather than for the steps that actually need it, you have already lost the carbon argument before tool selection.
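To make the first two multipliers concrete, here is a minimal Python sketch. It counts invocations for the planner-plus-researchers example and totals the prefill tokens processed when context grows at every step. The function names, the 2,000-token starting prompt, and the 1,500 tokens per tool result are illustrative assumptions, not measurements.

```python
def fan_out_counts(workers: int, tool_calls_per_worker: int) -> tuple[int, int]:
    """Return (model_invocations, tool_calls) for one planner plus N workers.

    Counts one planning call and one final summary per worker, as in the
    example above. Real agents usually also call the model to choose each
    tool, so treat the invocation count as a lower bound.
    """
    model_invocations = 1 + workers
    tool_calls = workers * tool_calls_per_worker
    return model_invocations, tool_calls


def cumulative_prefill_tokens(initial_prompt: int, tokens_per_result: int, steps: int) -> int:
    """Total input tokens processed across `steps` sequential model calls.

    Assumes no prompt caching: every call re-reads the full context, and each
    tool result appends `tokens_per_result` tokens before the next call.
    """
    total = 0
    context = initial_prompt
    for _ in range(steps):
        total += context
        context += tokens_per_result
    return total


# Illustrative inputs only (assumed values).
invocations, retrievals = fan_out_counts(workers=6, tool_calls_per_worker=4)
prefill = cumulative_prefill_tokens(initial_prompt=2_000, tokens_per_result=1_500, steps=10)
print(invocations, retrievals, prefill)
```

With these assumed inputs the ten-step trajectory processes 87,500 input tokens, against 20,000 if the prompt had stayed flat. Because the per-call context grows linearly, the cumulative total grows roughly with the square of the step count, which is why long trajectories dominate the footprint.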
Combine these and a single completed agent task can plausibly land at 15–60 Wh. That is still small in absolute terms, but it is the kind of number that moves with traffic. At a million tasks a month, that is 15–60 MWh a month, or 180–720 MWh a year: on the order of the annual electricity consumption of a mid-size office building, sourced from grids whose carbon intensity ranges from under 50 gCO₂e/kWh in Quebec or France to over 600 gCO₂e/kWh in parts of the US South.
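The scaling arithmetic is simple enough to keep in a helper. The sketch below converts a per-task energy figure into monthly energy and location-based emissions. The function names are hypothetical, and the inputs are the ranges quoted above rather than measured values. It also assumes the per-task figure already includes any data-center overhead you want counted.

```python
def monthly_energy_mwh(wh_per_task: float, tasks_per_month: int) -> float:
    """Monthly energy in MWh from per-task energy in Wh."""
    return wh_per_task * tasks_per_month / 1_000_000  # Wh -> MWh


def emissions_tonnes(energy_mwh: float, grid_g_per_kwh: float) -> float:
    """Location-based emissions in tonnes CO2e for a given grid intensity."""
    grams = energy_mwh * 1_000 * grid_g_per_kwh  # MWh -> kWh, then g CO2e
    return grams / 1_000_000  # g -> tonnes


TASKS_PER_MONTH = 1_000_000

# Best case: low per-task energy on a clean grid (figures from the text).
low = emissions_tonnes(monthly_energy_mwh(15, TASKS_PER_MONTH), 50)
# Worst case: high per-task energy on a carbon-intensive grid.
high = emissions_tonnes(monthly_energy_mwh(60, TASKS_PER_MONTH), 600)
print(f"{low:.2f} to {high:.2f} tCO2e per month")
```

Under these inputs the same product lands anywhere from 0.75 to 36 tonnes of CO₂e a month. Per-task energy spans a factor of four, but where the inference runs spans a factor of twelve, so the grid matters more than the token budget.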
The Reporting Pressure Is Already Here
The political conversation about SEC Scope 3 disclosure obscured what actually shipped. The SEC's final rule does not require Scope 3 reporting, but the EU's CSRD does, and the EU's Omnibus revisions in late 2025 did not remove the Scope 3 requirement — they narrowed which companies are in scope and pushed deadlines for some filers to 2028, while keeping the substance intact.
For digital products, AI inference falls into Scope 3 Category 1 (Purchased Goods and Services) for the customer, and Scope 2 for the model vendor (or for its cloud provider, when the vendor rents its compute). That means two things in practice. First, your buyer's compliance team is going to ask you for an emissions estimate per user task, even if your own legal team has not flagged it yet. Second, the model vendors are not going to do that work for you at the per-task granularity you need. Anthropic, for example, has publicly committed to net-zero offsets and to absorbing electricity-price increases for ratepayers, but as of the most recent vendor-risk disclosures has not published per-call Scope 1/2/3 breakdowns. OpenAI has been similarly vague.
The pattern is familiar to anyone who has run a Kubernetes cluster on a shared cloud account: the only entity that can attribute the cost to a specific user task is you. The bill comes from the platform; the attribution is yours to compute.
Per-Call Attribution: The Telemetry You Already Have
The good news is that the instrumentation needed for carbon attribution overlaps almost completely with the instrumentation needed for cost attribution. If you have already adopted the OpenTelemetry GenAI semantic conventions (gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, and gen_ai.system, which newer versions of the conventions rename to gen_ai.provider.name), you have most of what you need. The carbon math is one more attribute on the same span.
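A minimal sketch of that extra attribute, in plain Python so it runs without an OpenTelemetry install. The energy coefficients, the model names, and the app.* attribute keys are all hypothetical placeholders; they are not part of the OpenTelemetry conventions and not vendor-published figures. Only the gen_ai.* keys come from the conventions named above.

```python
# ASSUMPTION: placeholder energy coefficients in Wh per 1,000 tokens.
# Replace with your own measured or vendor-published figures.
WH_PER_1K_TOKENS = {
    "example-small": {"input": 0.01, "output": 0.1},
    "example-reasoning": {"input": 0.1, "output": 1.0},
}


def carbon_attributes(span_attrs: dict, grid_g_per_kwh: float) -> dict:
    """Estimate energy and emissions for one model-call span.

    Reads the GenAI token-usage attributes already on the span and returns
    two extra attributes (hypothetical names) to attach alongside them.
    """
    coeff = WH_PER_1K_TOKENS[span_attrs["gen_ai.request.model"]]
    wh = (
        span_attrs["gen_ai.usage.input_tokens"] * coeff["input"]
        + span_attrs["gen_ai.usage.output_tokens"] * coeff["output"]
    ) / 1_000
    return {
        "app.energy.estimated_wh": wh,
        "app.emissions.estimated_gco2e": wh / 1_000 * grid_g_per_kwh,
    }


def task_total_wh(spans: list[dict], grid_g_per_kwh: float) -> float:
    """Sum estimated energy over every model-call span in one user task."""
    return sum(
        carbon_attributes(s, grid_g_per_kwh)["app.energy.estimated_wh"] for s in spans
    )


span = {
    "gen_ai.request.model": "example-small",
    "gen_ai.usage.input_tokens": 12_000,
    "gen_ai.usage.output_tokens": 800,
}
print(carbon_attributes(span, grid_g_per_kwh=400.0))
```

In real instrumentation the returned values would be attached with the span's set_attribute method, and the per-task total would come from grouping spans by trace ID. Keeping the coefficients in one table makes it easy to swap in better numbers without touching the instrumentation.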
