Token-Per-Watt: The AI Sustainability Metric Your Dashboard Cannot Compute
Your sustainability dashboard reports "AI energy: 2.3 GWh this quarter, down 4% YoY" and the slide gets a polite nod in the ESG review. The CFO walks out of an analyst call six months later and asks the head of platform a question that sounds simple: "What is our token-per-watt, and how does it compare to our competitors?" The dashboard cannot answer. Not because the data is missing — the dashboard is full of data — but because it treats inference as a single line item and tasks as a product concept, and the only honest unit of AI sustainability lives at the intersection.
The mismatch is not a reporting bug. It is a category error that the existing carbon-accounting playbook, perfected for cloud workloads on CPU-hours and kWh per VM, cannot fix on its own. Inference is not a workload with a stable energy profile. The watts per token shift by 30× depending on which model tier served the request, by 4× depending on batch size at the moment of the call, and by another order of magnitude depending on whether the prefix cache hit or missed. Aggregating those into a single GWh number is like reporting "average car fuel economy" across a fleet that includes scooters, sedans, and 18-wheelers — accurate in the most useless sense.
The teams that figure this out first will look like they invented a new discipline. They did not. They translated the FinOps playbook into a unit that maps to the product, and they put it on a dashboard before the regulator or the customer asked for it.
Token-Per-Watt Is the Wrong Unit, but the Right Question
When a board member asks for token-per-watt, they are asking for a normalized efficiency metric — something they can compare across vendors, across quarters, across product lines. Token-per-watt sounds like the AI equivalent of miles-per-gallon. It is not. A token is an accounting fiction at the model layer; the user does not consume tokens, the user consumes outcomes. A summarization task that emits 200 tokens is not "twice as expensive" as one that emits 100 tokens if the second one made the user retry three times.
The real unit is task-watts: the energy spent per completed user-visible action. A task is "the user clicked the summarize button and got back a summary they kept." Task-watts requires joining inference logs (which know about tokens, model version, GPU type, and batch context) with user-action telemetry (which knows whether the action was completed, retried, abandoned, or escalated to a human). Most observability stacks have both halves. Almost none join them.
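A minimal sketch of that join, assuming hypothetical log schemas — `task_id`, `output_tokens`, and `watt_hours_per_token` are stand-ins for whatever your pipelines actually carry, and the energy coefficient would come from your own power measurements or a vendor disclosure:

```python
from collections import defaultdict

# Hypothetical inference-log records: one per model call.
inference_log = [
    {"task_id": "t1", "output_tokens": 180, "watt_hours_per_token": 0.004},
    {"task_id": "t1", "output_tokens": 220, "watt_hours_per_token": 0.004},  # user retried
    {"task_id": "t2", "output_tokens": 150, "watt_hours_per_token": 0.0004},
]

# Hypothetical product-telemetry records: one per user-visible task.
task_events = [
    {"task_id": "t1", "feature": "summarize", "outcome": "completed"},
    {"task_id": "t2", "feature": "summarize", "outcome": "abandoned"},
]

# The join: attribute every inference call, retries included, to its task.
energy_by_task = defaultdict(float)
for call in inference_log:
    energy_by_task[call["task_id"]] += call["output_tokens"] * call["watt_hours_per_token"]

# Task-watts for a feature: energy across ALL its calls over COMPLETED tasks only,
# so retries and abandonment surface as worse efficiency rather than free energy.
total_wh = sum(energy_by_task[t["task_id"]] for t in task_events if t["feature"] == "summarize")
completed = sum(1 for t in task_events if t["feature"] == "summarize" and t["outcome"] == "completed")
print(f"summarize: {total_wh / completed:.2f} Wh per completed task")
```

The denominator is the design choice that matters: only completed tasks count, so a retry or an abandonment makes the feature look worse, which is exactly the signal token-per-watt hides.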
Token-per-watt is still a useful intermediate metric — it is the right number to put on a model card, the right number for procurement to compare two vendors at the same model tier. But it is the wrong number for the roadmap. A team that cuts watts-per-token by 30% by switching to a smaller model while dropping task completion rate by 12 points has not made the product more sustainable. It has shifted the energy from the inference call to the user's third attempt.
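To see that failure mode in numbers — every figure below is assumed for illustration — price the retries into the denominator:

```python
# Illustrative only: all numbers assumed.
wh_before, p_before = 1.0, 0.95  # frontier model: Wh per call, completion rate
wh_after, p_after = 0.7, 0.83    # smaller model: 30% less energy, 12 points worse

# Simple retry model: users retry until the task completes, so expected
# attempts per completed task is roughly 1 / completion_rate.
print(f"{wh_before / p_before:.2f} Wh per completed task")  # ~1.05
print(f"{wh_after / p_after:.2f} Wh per completed task")    # ~0.84: a 20% win, not 30%
# If failures escalate to the frontier tier (or to a human) instead of retrying
# on the small model, the remaining win can disappear entirely.
```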
The Variance You Are Hiding by Aggregating
The reason a single AI-energy line item is misleading is that it averages over three sources of variance, each of which is a roadmap lever the average team is not pulling.
Model tier. The 2026 hardware envelope means a single text completion can range from roughly 0.05 watt-hours on a small quantized model running on a Blackwell-class GPU to several watt-hours on a frontier model serving the same prompt. That is not noise — that is a routing decision. Most production traffic is amenable to the cheap end of that range, and most teams cannot tell what fraction of their traffic is being routed to the expensive end out of caution rather than necessity. Quantization to 4-bit or 8-bit with negligible quality loss has become standard practice and shifts the curve again.
Batch size. Inference engines are dramatically more energy-efficient at high batch sizes. A request served at batch-of-one consumes several times more energy than the same request served when the engine has 16 concurrent requests to pack together. Latency-sensitive endpoints that pin batch size to one for tail-latency reasons are paying that energy multiple on every call, and the dashboard does not show it because batch size is a serving-engine internal, not a logged field.
Prefix-cache hit rate. A prefix-cache hit can use roughly 90% less energy than a cold inference for the same prompt. Real workloads with stable system prompts and conversational prefixes routinely hit 80–90% cache hit rates with the right scheduler; workloads with poor prefix discipline hit 20–30% and pay full freight on the rest. Cache hit rate is the single largest energy lever in most production stacks, and it lives in the serving layer below the metric the sustainability team is reporting.
A dashboard that reports total GWh hides all three. A dashboard that reports task-watts per feature, with a break-out by model tier, average batch size, and cache hit rate, makes them visible — which is the precondition for treating them as roadmap items rather than operational accidents.
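A sketch of what that break-out computes, with assumed coefficients; the linear batch scaling and the flat 90% cache discount are simplifications of the dynamics above, and the point is the spread the aggregate hides, not the constants:

```python
def expected_wh_per_request(wh_per_token_batch1: float, tokens: int,
                            batch_size: float, cache_hit_rate: float) -> float:
    # Assume per-token energy falls roughly linearly with batch size and
    # saturates past the engine's sweet spot (a deliberate simplification).
    batch_factor = 1.0 / min(batch_size, 16)
    cold_wh = tokens * wh_per_token_batch1 * batch_factor
    # Prefix-cache hits taken to cost ~10% of a cold pass, per the figure above.
    return cache_hit_rate * 0.1 * cold_wh + (1 - cache_hit_rate) * cold_wh

# Two features with identical token volume, very different energy profiles.
print(expected_wh_per_request(0.01, 500, batch_size=1, cache_hit_rate=0.2))    # ~4.1 Wh
print(expected_wh_per_request(0.01, 500, batch_size=12, cache_hit_rate=0.85))  # ~0.1 Wh
```

Same tokens, roughly a 40× spread in expected energy; a single GWh line item averages that spread into nothing.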
The Discipline That Has to Land
Per-feature task-watt instrumentation is the foundation. Every inference call needs a tag for the user-visible task it serves; every task event needs a join key back to the inference calls that produced it. This is not a feature flag or a one-off batch job — it is a contract between the inference path and the product analytics path that has to hold across both. The teams that have done this for cost (the FinOps story) already have most of the plumbing; they just have not multiplied the per-task token count by the watts-per-token coefficient for the model that served it. Adding carbon intensity for the data center and time of day produces task-grams of CO₂, which is the unit a regulator will eventually ask for.
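The multiplication itself is trivial once the join exists; a sketch with assumed coefficients and illustrative grid intensities:

```python
def task_grams_co2(task_tokens: int, wh_per_token: float,
                   grid_gco2_per_kwh: float) -> float:
    # Energy for the task in kWh, times the carbon intensity of the grid
    # (region and hour) that served it.
    return task_tokens * wh_per_token / 1000.0 * grid_gco2_per_kwh

# The same task, served in two regions/hours, differs several-fold in carbon.
print(task_grams_co2(400, 0.004, grid_gco2_per_kwh=650))  # coal-heavy hour: ~1.04 g
print(task_grams_co2(400, 0.004, grid_gco2_per_kwh=120))  # cleaner mix: ~0.19 g
```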
Model-mix routing is the largest controllable lever, and treating it as a carbon decision changes how it gets governed. Routing cheap-model-eligible tasks downward is already a cost decision; framing it as a sustainability decision raises the bar on the eval that justifies the routing rule (because shipping a routing change that quietly degrades a cohort's experience is now a fairness story too) and creates a defensible reason to invest in the router itself. Carbon-aware routing research like Green-Aware Routing formalizes this as a constrained optimization — minimize emissions subject to accuracy floors and latency SLOs — and the framing is more useful than the algorithm: it forces the team to write down the floors.
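A minimal sketch of that framing — not the Green-Aware Routing algorithm itself; tiers, evals, and numbers are invented for illustration. The router is a filter-then-minimize, and the floors are explicit inputs someone had to write down:

```python
from dataclasses import dataclass

@dataclass
class Tier:
    name: str
    gco2_per_task: float   # from the task-grams pipeline
    eval_accuracy: float   # measured on the task class's eval set
    p99_latency_ms: float

TIERS = [
    Tier("small-4bit", 0.2, 0.91, 300),
    Tier("mid", 0.9, 0.94, 550),
    Tier("frontier", 4.0, 0.97, 900),
]

def route(accuracy_floor: float, latency_slo_ms: float) -> Tier:
    # Minimize emissions subject to the accuracy floor and latency SLO.
    eligible = [t for t in TIERS
                if t.eval_accuracy >= accuracy_floor and t.p99_latency_ms <= latency_slo_ms]
    if not eligible:
        raise ValueError("no tier clears the floors; the floors are the product decision")
    return min(eligible, key=lambda t: t.gco2_per_task)

print(route(accuracy_floor=0.90, latency_slo_ms=600).name)   # small-4bit
print(route(accuracy_floor=0.95, latency_slo_ms=1000).name)  # frontier
```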
A quarterly carbon-vs-quality curve is the artifact that makes the trade-off explicit at the roadmap level. Plot task-watts on one axis and the user-facing quality metric on the other; the team's portfolio of features sits as points on that plane. The conversation shifts from "are we sustainable" (unanswerable) to "which features are off the efficient frontier and what would it cost to move them." That is the conversation product leadership knows how to have, and it does not require a sustainability specialist to mediate it.
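The frontier check itself is a few lines once the per-feature points exist; a sketch with invented data:

```python
features = {
    "summarize": (1.2, 0.94),      # (task-watts, quality metric)
    "autocomplete": (0.3, 0.88),
    "deep-research": (9.5, 0.92),  # dominated: costs more, scores lower than summarize
}

def off_frontier(points: dict[str, tuple[float, float]]) -> list[str]:
    # A feature is off the frontier if some other feature is strictly better
    # on both axes: lower energy and higher quality.
    return [name for name, (watts, quality) in points.items()
            if any(w < watts and q > quality
                   for other, (w, q) in points.items() if other != name)]

print(off_frontier(features))  # ['deep-research']: the roadmap conversation starts here
```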
Procurement contracts have to start surfacing the vendor's data-center carbon intensity as a comparable input. A model API priced in dollars per million tokens with no carbon-intensity disclosure is a black box for a Scope 3 calculation. Vendors that publish per-region carbon intensity, time-of-day signals, and per-model energy disclosures will become preferable not because they are cheaper, but because they are auditable. The Green Web Foundation's work on AI model cards in carbon.txt is an early version of what this disclosure looks like; the carbon-aware SDK from the Green Software Foundation is an early version of what consuming it looks like.
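What procurement needs from that disclosure is comparability. A hypothetical record shape follows; this is not the carbon.txt format or any vendor's actual schema, just the minimum fields that let two offerings be compared on carbon rather than price alone:

```python
from dataclasses import dataclass

@dataclass
class ModelEnergyDisclosure:
    vendor: str
    model_tier: str
    region: str
    wh_per_1k_output_tokens: float  # vendor-measured, at a stated batch profile
    grid_gco2_per_kwh: float        # regional average carbon intensity
    publishes_hourly_signal: bool   # usable by a carbon-aware scheduler?

a = ModelEnergyDisclosure("vendor-a", "mid", "eu-north", 3.5, 40.0, True)
b = ModelEnergyDisclosure("vendor-b", "mid", "us-east", 2.8, 450.0, False)

for d in (a, b):
    gco2 = d.wh_per_1k_output_tokens / 1000.0 * d.grid_gco2_per_kwh
    print(f"{d.vendor}: {gco2:.2f} gCO2 per 1k output tokens")
```

The vendor with the higher watt-hours per token is the lower-carbon choice here, which is invisible without the grid-intensity field.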
The Org Failure Mode That Is Already Visible
The pattern that breaks this is structural, not technical. Sustainability lives in a separate org from AI engineering — usually under operations, real estate, or ESG reporting — and neither org has the data the other needs to act. The sustainability team has the carbon intensity numbers from the data center but no way to attribute them to a feature. The AI engineering team has the per-feature inference logs but no carbon coefficient to multiply them by. The dashboard that gets shipped is the join the two orgs can both agree on without having to coordinate, which is the trivial one: total quarterly GWh.
The failure mode is quiet. The ESG report ships. The AI roadmap ships. A year later a customer with a serious procurement process or a regulator with a new disclosure rule asks for a number neither org can produce on a deadline, and the scramble exposes that nobody owned the join. The teams that get ahead of this either embed a sustainability data engineer in the AI platform team or — more commonly, and more sustainably as an org pattern — put the AI engineering lead on the hook for sustainability metrics directly, with the sustainability team as a partner rather than a reporting destination.
The cultural shift is the hard part. Sustainability has historically been a reporting function: instrument what already happened, attribute it to a category, file it. The AI engineering version is closer to performance engineering: instrument continuously, attribute to a feature, optimize as a roadmap item. That is a different muscle, and the team that does not build it ends up with a dashboard that grows green numbers while the underlying watts-per-task gets worse.
Why This Is the Next FinOps
FinOps spent its first two years as a backwater discipline that finance teams asked about politely and engineering teams considered a distraction from real work. The inflection happened when cloud bills crossed the threshold where a CFO could no longer treat them as overhead, and the engineering teams that had quietly built per-service cost attribution were suddenly the indispensable ones. The same trajectory is going to play out for AI sustainability, and the inflection point is closer than most teams are planning for.
The forcing functions are stacking up. AI's share of data-center energy is growing fast, with projections suggesting AI workloads could push global data-center electricity demand to roughly 1,050 TWh by 2026 from 460 TWh in 2022. Disclosure regimes are tightening: the U.S. executive order from January 2025 directing DOE to draft AI data center reporting requirements, the EU's AI Act sustainability provisions, and procurement rules at large enterprise customers are all converging on per-task carbon disclosures within the next two years. The vendors that publish first will become the default; the vendors that cannot publish will lose enterprise deals to the ones that can.
The team that stands up per-feature task-watt instrumentation this quarter is not doing optional ESG hygiene. They are building the metric that their CFO is going to ask for, the disclosure their largest customer is going to require, and the routing primitive that their cost-optimization roadmap is going to converge on independently. The work is the same regardless of which forcing function arrives first.
The inflection that matters is when token-per-watt — or whatever normalized version of task-watts the industry standardizes on — moves from a slide in the sustainability deck to a row in the architectural review. The teams that ship a static GWh dashboard are shipping a snapshot of efficiency in a stream that has already drifted. The teams that ship a per-task carbon metric, joined to the product, with model-mix and cache discipline as visible levers, are shipping the only version of sustainability that survives contact with an inference workload that changes shape every quarter.
- https://www.technologyreview.com/2025/05/20/1116327/ai-energy-usage-climate-footprint-big-tech/
- https://arxiv.org/html/2512.03024v1
- https://euromlsys.eu/pdf/euromlsys25-27.pdf
- https://antarctica.io/research/one-token-model
- https://www.brookings.edu/articles/global-energy-demands-within-the-ai-regulatory-landscape/
- https://fas.org/publication/measuring-and-standardizing-ais-energy-footprint/
- https://spectrum.ieee.org/data-center-sustainability-metrics
- https://www.carbonbrief.org/ai-five-charts-that-put-data-centre-energy-use-and-emissions-into-context/
- https://bentoml.com/llm/inference-optimization/prefix-caching
- https://docs.vllm.ai/en/stable/design/prefix_caching/
- https://llm-d.ai/blog/kvcache-wins-you-can-see
- https://openreview.net/forum?id=wVd99lgt4j
- https://link.springer.com/article/10.1557/s43581-025-00146-1
- https://www.thegreenwebfoundation.org/news/ai-model-cards-in-carbon-txt/
- https://github.com/Green-Software-Foundation/carbon-aware-sdk
