Token-Per-Watt: The AI Sustainability Metric Your Dashboard Cannot Compute

April 28, 2026 · 11 min read

Software Engineer

Your sustainability dashboard reports "AI energy: 2.3 GWh this quarter, down 4% YoY" and the slide gets a polite nod in the ESG review. The CFO walks out of an analyst call six months later and asks the head of platform a question that sounds simple: "What is our token-per-watt, and how does it compare to our competitors?" The dashboard cannot answer. Not because the data is missing — the dashboard is full of data — but because it treats inference as a single line item and tasks as a product concept, and the only honest unit of AI sustainability lives at the intersection.

The mismatch is not a reporting bug. It is a category error that the existing carbon-accounting playbook, perfected for cloud workloads on CPU-hours and kWh per VM, cannot fix on its own. Inference is not a workload with a stable energy profile. Energy per token shifts by roughly 30× depending on which model tier served the request, by about 4× depending on batch size at the moment of the call, and by another order of magnitude depending on whether the prefix cache hit or missed. Aggregating those into a single GWh number is like reporting "average car fuel economy" across a fleet that includes scooters, sedans, and 18-wheelers: accurate in the most useless sense.

The teams that figure this out first will look like they invented a new discipline. They did not. They translated the FinOps playbook into a unit that maps to the product, and they put it on a dashboard before the regulator or the customer asked for it.

Token-Per-Watt Is the Wrong Unit, but the Right Question

When a board member asks for token-per-watt, they are asking for a normalized efficiency metric — something they can compare across vendors, across quarters, across product lines. Token-per-watt sounds like the AI equivalent of miles-per-gallon. It is not. A token is an accounting fiction at the model layer; the user does not consume tokens, the user consumes outcomes. A summarization task that emits 200 tokens is not "twice as expensive" as one that emits 100 tokens if the second one made the user retry three times.

The real unit is task-watts: the energy spent per completed user-visible action. A task is "the user clicked the summarize button and got back a summary they kept." Task-watts requires joining inference logs (which know about tokens, model version, GPU type, and batch context) with user-action telemetry (which knows whether the action was completed, retried, abandoned, or escalated to a human). Most observability stacks have both halves. Almost none of them join them.
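A minimal sketch of that join, assuming both sides carry a shared `task_id` and that the inference log already has a per-call energy estimate in watt-hours. The record types, field names, and sample numbers below are hypothetical, not taken from any particular observability stack. Energy spent on abandoned or retried attempts is charged to the feature, so the result is energy per completed task rather than energy per call.

```python
from dataclasses import dataclass


@dataclass
class InferenceCall:
    task_id: str      # hypothetical join key shared with product telemetry
    energy_wh: float  # estimated energy for this call, in watt-hours


@dataclass
class TaskOutcome:
    task_id: str
    feature: str
    completed: bool   # True if the user kept the result


def energy_per_completed_task(calls, outcomes):
    """Return {feature: Wh per completed task}, counting energy from failed attempts."""
    # Sum energy over every inference call that belongs to the same task.
    energy_by_task = {}
    for call in calls:
        energy_by_task[call.task_id] = energy_by_task.get(call.task_id, 0.0) + call.energy_wh

    total_wh = {}
    completed = {}
    for outcome in outcomes:
        spent = energy_by_task.get(outcome.task_id, 0.0)
        total_wh[outcome.feature] = total_wh.get(outcome.feature, 0.0) + spent
        completed[outcome.feature] = completed.get(outcome.feature, 0) + int(outcome.completed)

    # Features with no completed tasks are omitted rather than divided by zero.
    return {f: total_wh[f] / completed[f] for f in total_wh if completed[f] > 0}


# Illustrative data: task t1 needed a retry, task t3 was abandoned.
calls = [
    InferenceCall("t1", 0.5),
    InferenceCall("t1", 0.5),
    InferenceCall("t2", 0.5),
    InferenceCall("t3", 0.5),
]
outcomes = [
    TaskOutcome("t1", "summarize", True),
    TaskOutcome("t2", "summarize", True),
    TaskOutcome("t3", "summarize", False),
]
per_task = energy_per_completed_task(calls, outcomes)
```

With this sample data every call costs 0.5 Wh, but the feature costs 1.0 Wh per completed task, because the retry and the abandoned attempt are both paid for by the two summaries the users kept.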

Token-per-watt is still a useful intermediate metric: it is the right number to put on a model card and the right number for procurement to compare two vendors at the same model tier. But it is the wrong number for the roadmap. A team that improves token-per-watt by 30% by switching to a smaller model, and cuts task completion rate by 12% in the process, has not made the product more sustainable. It has shifted the energy from the inference call to the user's third attempt.

The Variance You Are Hiding by Aggregating

The reason a single AI-energy line item is misleading is that it averages over three sources of variance, each of which is a roadmap lever the average team is not pulling.

Model tier. The 2026 hardware envelope means a single text completion can range from roughly 0.05 watt-hours on a small quantized model running on a Blackwell-class GPU to several watt-hours on a frontier model serving the same prompt. That is not noise — that is a routing decision. Most production traffic is amenable to the cheap end of that range, and most teams cannot tell what fraction of their traffic is being routed to the expensive end out of caution rather than necessity. Quantization to 4-bit or 8-bit with negligible quality loss has become standard practice and shifts the curve again.

Batch size. Inference engines are dramatically more energy-efficient at high batch sizes. The same request served at batch-of-one consumes several times more energy than the same request served when the engine has 16 concurrent requests it can pack together. Latency-sensitive endpoints that pin batch size to one for tail-latency reasons are paying that energy multiple every call, and the dashboard does not show it because batch size is a serving-engine internal, not a logged field.
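One way to see why batching matters is a toy amortization model: each decoding step has a roughly fixed cost (for example, reading the model weights) that is shared by every request in the batch, plus a smaller per-request cost. The sketch below assumes that shape. The function name and both coefficients are hypothetical and chosen only for illustration, not measured on any real engine.

```python
def energy_per_request_wh(batch_size, fixed_step_wh=1.0, marginal_wh=0.2):
    """Toy model: a fixed per-step cost split across the batch, plus a per-request cost.

    Both coefficients are illustrative assumptions, not measurements.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    return fixed_step_wh / batch_size + marginal_wh


solo = energy_per_request_wh(1)     # the request pays the whole fixed cost
packed = energy_per_request_wh(16)  # the fixed cost is shared 16 ways
ratio = solo / packed
print(f"batch=1 costs {ratio:.1f}x the energy of batch=16 under these assumptions")
```

Under these made-up coefficients the batch-of-one request costs a little over four times as much as the packed one. The real multiple depends on the model, the hardware, and the engine, which is why batch size needs to be a logged field before anyone can reason about it.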

Prefix-cache hit rate. A prefix-cache hit can use roughly 90% less energy than a cold inference for the same prompt. Real workloads with stable system prompts and conversational prefixes routinely reach cache hit rates of 80–90% with the right scheduler; workloads with poor prefix discipline sit at 20–30% and pay full freight on the rest. Cache hit rate is the single largest energy lever in most production stacks, and it lives in the serving layer below the metric the sustainability team is reporting.
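The arithmetic behind that claim can be sketched as a blended average. This assumes the roughly 90% saving quoted above applies uniformly to every hit and normalizes a cold inference to 1.0 energy unit; the function name and the two example hit rates (midpoints of the ranges above) are illustrative choices.

```python
def blended_energy(cold_energy, hit_rate, hit_saving=0.9):
    """Expected energy per request when a fraction `hit_rate` of requests hit the prefix cache.

    `hit_saving` is the assumed fractional energy saving on a hit (0.9 = 90% less).
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    hit_energy = cold_energy * (1.0 - hit_saving)
    return hit_rate * hit_energy + (1.0 - hit_rate) * cold_energy


good = blended_energy(1.0, 0.85)  # disciplined prefixes: 0.85 * 0.1 + 0.15 * 1.0
poor = blended_energy(1.0, 0.25)  # poor prefix discipline: 0.25 * 0.1 + 0.75 * 1.0
print(f"good={good:.3f} poor={poor:.3f} ratio={poor / good:.2f}")
```

Under these assumptions the poorly cached workload spends about 3.3 times as much energy per request as the well-cached one, with the model, hardware, and prompts unchanged.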

A dashboard that reports total GWh hides all three. A dashboard that reports task-watts per feature, with a break-out by model tier, average batch size, and cache hit rate, makes them visible — which is the precondition for treating them as roadmap items rather than operational accidents.
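A rough sketch of that break-out, assuming the inference log has already been joined with product telemetry and that batch size and cache status are logged per call. The `JoinedCall` record, its fields, and the sample rows are hypothetical. As a simplification, each row carries a flag for whether that call ended in a completed task.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class JoinedCall:
    feature: str
    model_tier: str
    batch_size: int       # batch size at the moment of the call (assumed to be logged)
    cache_hit: bool       # whether the prefix cache hit (assumed to be logged)
    energy_wh: float      # estimated energy for this call, in watt-hours
    task_completed: bool  # whether this call ended in a completed task


def breakout(rows):
    """Group calls by (feature, model tier) and report the three hidden levers."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row.feature, row.model_tier)].append(row)

    report = {}
    for key, calls in groups.items():
        completed = sum(c.task_completed for c in calls)
        total_wh = sum(c.energy_wh for c in calls)
        report[key] = {
            # None when nothing completed, so the gap is visible instead of hidden.
            "wh_per_completed_task": total_wh / completed if completed else None,
            "avg_batch_size": sum(c.batch_size for c in calls) / len(calls),
            "cache_hit_rate": sum(c.cache_hit for c in calls) / len(calls),
        }
    return report


rows = [
    JoinedCall("summarize", "small", 16, True, 0.25, True),
    JoinedCall("summarize", "small", 8, True, 0.25, False),
    JoinedCall("summarize", "frontier", 1, False, 2.0, True),
]
report = breakout(rows)
```

In this made-up sample the small tier costs 0.5 Wh per completed task at an average batch size of 12 with every call hitting the cache, while the frontier tier costs 2.0 Wh at batch size 1 with no cache hits. A single total would average those two rows together.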

About Tian Pan

Token-Per-Watt Is the Wrong Unit, but the Right Question​

The Variance You Are Hiding by Aggregating​

Recommended Reading

About Tian Pan

Token-Per-Watt Is the Wrong Unit, but the Right Question

The Variance You Are Hiding by Aggregating