The Carbon Line Item Nobody Puts in the AI Feature Spec
Open any AI feature review and you will hear the same three numbers debated: latency, token cost, and accuracy. Someone pulls up the p95 chart, someone else does the math on cost-per-thousand-requests, and a third person argues the eval score is good enough to ship. Nobody mentions energy. Nobody mentions carbon. And because nobody mentions it, the environmental footprint of the feature still gets decided — implicitly, by whoever wins the argument about the dollar figure.
That is the quiet problem with AI sustainability. It is not that teams choose a high-carbon design on purpose. It is that they never choose at all. The footprint is a side effect of a cost decision, and cost only loosely tracks carbon. A routing rule that looks like a clean win on the spend dashboard can quietly double emissions, and no one in the room would know, because the number that would have told them was never on a dashboard.
This post treats energy and carbon as what they actually are: a measurable, ownable property of an AI system, on the same footing as latency and cost. Not a corporate-values footnote. A line item.
Why cost stopped being a good proxy
For most of software history, you could ignore carbon and trust that optimizing for cost would drag the footprint down with it. Cheaper usually meant less compute, and less compute meant less energy. AI inference broke that assumption in three specific ways.
Region. The carbon intensity of electricity — grams of CO2 per kilowatt-hour — varies by more than an order of magnitude between grids. A data center running on hydro or nuclear power can be cleaner than a coal-heavy grid by a factor of ten or more. But cloud pricing barely reflects this. Two regions can bill nearly the same price per GPU-hour while one emits ten times the carbon. Pick the cheaper region and you have made a carbon decision without knowing it.
Hardware generation. Newer accelerators do far more work per watt. A workload pinned to last-generation hardware because that is what had spare quota can burn meaningfully more energy per token than the same workload on current silicon — at a list price that looks similar or even lower. Cost-per-token rewards you for using up cheap idle capacity. Carbon-per-token does not.
Idle capacity. This is the one most measurements miss entirely. Production inference fleets keep machines provisioned and warm to absorb traffic spikes and survive failover. Those idle machines draw power whether or not your request lands on them. When Google published its inference measurement methodology, the comprehensive figure of 0.24 watt-hours for a median Gemini text prompt was more than double the naive estimate of 0.10 Wh that counted only active accelerator time. The gap was idle capacity, host CPU and RAM, and data center cooling overhead. By those figures, an estimate that counts only the chip while it is busy misses about 58 percent of the real footprint (0.14 of 0.24 Wh).
None of these three show up cleanly on a cost dashboard. That is why "we optimized for cost" is no longer the same sentence as "we optimized for carbon."
What a per-request carbon number actually contains
If you want carbon to be a number a team can be held to, you need a number — a real one, computed per request, that you can put next to cost and p95. The Software Carbon Intensity specification, now standardized as ISO/IEC 21031:2024, gives you the shape of it. SCI expresses emissions as a rate: carbon per functional unit, where the functional unit is whatever your software actually delivers — a request, a generated response, a completed task.
The calculation has two halves.
Operational emissions are the straightforward part: the electricity your hardware consumed, multiplied by the carbon intensity of the grid it ran on. The trap is the word "consumed." A defensible per-request estimate has to include the active accelerator draw, the host CPU and RAM, a share of the idle fleet kept warm for your traffic, and the data center overhead captured by Power Usage Effectiveness — the multiplier for cooling and power distribution. A hyperscale facility might run a PUE near 1.1; an older enterprise data center can sit at 1.5 or worse, meaning half again as much energy spent on everything that is not computation.
Embodied carbon is the part teams forget exists. Manufacturing and eventually disposing of a GPU emits carbon before it ever runs a single token. SCI allocates a fraction of that one-time cost to the software, amortized across the hardware's useful life. For inference on expensive, fast-cycling accelerators, embodied carbon is not a rounding error — it is a real share of the total, and it is the reason "just buy more GPUs" is never carbon-free even if the grid is clean.
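The two halves can be sketched in a few lines of Python. This is a minimal illustration of the SCI shape, (E × I + M) per functional unit, not a reference implementation: the class and function names are hypothetical, and every number in the example is a made-up placeholder rather than a measurement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestEnergy:
    """Energy attributed to one request, in watt-hours. All values are estimates."""
    accelerator_wh: float  # active accelerator draw
    host_wh: float         # host CPU and RAM
    idle_share_wh: float   # this request's share of the warm, idle fleet


def operational_g(energy: RequestEnergy, pue: float, grid_g_per_kwh: float) -> float:
    """Operational gCO2e: IT energy, scaled by PUE, times grid carbon intensity."""
    it_wh = energy.accelerator_wh + energy.host_wh + energy.idle_share_wh
    return it_wh * pue / 1000.0 * grid_g_per_kwh


def embodied_g(total_embodied_g: float, lifespan_hours: float,
               reserved_hours: float, resource_share: float) -> float:
    """Embodied gCO2e for a usage window, after SCI's M = TE * (TiR / EL) * (RR / ToR).

    resource_share stands in for RR / ToR, the fraction of the device reserved.
    """
    return total_embodied_g * (reserved_hours / lifespan_hours) * resource_share


def sci_per_request(energy: RequestEnergy, pue: float, grid_g_per_kwh: float,
                    embodied_window_g: float, requests_in_window: int) -> float:
    """gCO2e per request: operational emissions plus amortized embodied emissions."""
    if requests_in_window <= 0:
        raise ValueError("requests_in_window must be positive")
    return (operational_g(energy, pue, grid_g_per_kwh)
            + embodied_window_g / requests_in_window)


# Illustrative placeholders only: not measurements of any real system.
energy = RequestEnergy(accelerator_wh=0.10, host_wh=0.05, idle_share_wh=0.05)
window_embodied = embodied_g(total_embodied_g=150_000.0,  # assumed 150 kg per device
                             lifespan_hours=35_040.0,     # assumed four-year life
                             reserved_hours=1.0,
                             resource_share=1.0)
print(sci_per_request(energy, pue=1.1, grid_g_per_kwh=400.0,
                      embodied_window_g=window_embodied,
                      requests_in_window=10_000))
```

The absolute output matters less than keeping the inputs consistent from one release to the next, so that a change in the result reflects a change in the design.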
You will not get this exact. Nobody does. The point of a per-request estimate is not precision to three decimal places; it is to be consistently wrong in the same direction, so that a relative change — this routing rule versus that one — is trustworthy even when the absolute number is fuzzy. A footprint estimate that is roughly right and always computed beats a precise one that exists only in a sustainability report once a year.
Where cost optimization and carbon optimization overlap — and where they split
Here is the genuinely useful news: the two goals are not opposed. They overlap heavily. The danger is assuming the overlap is total.
The shared wins come from doing less compute for the same output. Caching is the cleanest example. A prompt cache hit skips real computation: Anthropic discounts cached input heavily, and agent workloads routinely see 60 to 80 percent hit rates. The tokens you do not recompute cost a fraction of the full price and emit a fraction of the carbon. Right-sizing is the same story: routing a simple classification to a small model instead of a frontier one cuts both the bill and the joules. Output control, meaning not generating 800 tokens when 200 will do, trims both. Anything that reduces the actual work done is a true joint optimization, and you should do all of it.
The split shows up the moment you stop reducing work and start relocating it. Raw token price is the worst offender. Provider list prices are set by competition, margin, and capacity strategy. They are not a measurement of energy, and they are certainly not a measurement of grid carbon. A model that is cheaper per token can run in a dirtier region, on older hardware, or at a lower achieved utilization, and emit more. When your routing logic chases the lowest sticker price across providers and regions, it is optimizing a number that was never tied to physics in the first place.
So the rule of thumb: optimizations that shrink the work help cost and carbon together — chase those freely. Optimizations that move the work somewhere cheaper need a separate carbon check, because cheaper and cleaner are no longer the same place.
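A toy comparison makes the split concrete. The sketch below assumes two hypothetical routes with invented prices, energy figures, and grid intensities; the only point is that the minimum of one column need not be the minimum of the other.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    name: str
    usd_per_1k_tokens: float   # list price (hypothetical)
    wh_per_1k_tokens: float    # estimated energy per 1k tokens (hypothetical)
    grid_g_per_kwh: float      # grid carbon intensity where the route runs

    def grams_per_1k_tokens(self) -> float:
        """Operational gCO2e per 1k tokens: energy in kWh times grid intensity."""
        return self.wh_per_1k_tokens / 1000.0 * self.grid_g_per_kwh


# Invented numbers for illustration only.
routes = [
    Route("provider-a/coal-heavy", usd_per_1k_tokens=0.40,
          wh_per_1k_tokens=3.0, grid_g_per_kwh=700.0),
    Route("provider-b/hydro-heavy", usd_per_1k_tokens=0.45,
          wh_per_1k_tokens=2.5, grid_g_per_kwh=50.0),
]

cheapest = min(routes, key=lambda r: r.usd_per_1k_tokens)
cleanest = min(routes, key=Route.grams_per_1k_tokens)

print("cheapest:", cheapest.name)
print("cleanest:", cleanest.name)
```

With these placeholder inputs the cheaper route is also the dirtier one, which is exactly the case a cost-only router cannot see.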
Routing is where the leverage is
There is a reason carbon-aware design focuses on inference rather than training. Training is a rigid, high-power workload that is hard to move. Inference is the opposite. It accounts for a large and growing share of total AI energy (by recent estimates, the majority of it), and individual requests are small, latency-tolerant within limits, and geographically routable. That routability is the lever.
Two kinds of shifting are available.
Spatial shifting sends requests to regions running on cleaner power right now. A query that does not care whether it executes in a coal-heavy region or a hydro-heavy one (and many do not) should prefer the clean grid. This is the highest-leverage move available to most teams, and as long as the extra network hop stays inside the latency budget, it costs nothing in user-visible quality.
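One way to sketch spatial shifting is as a constrained choice: among healthy regions that fit the latency budget, pick the one with the lowest current grid intensity. The `RegionSnapshot` type, the `pick_region` function, and the region names below are hypothetical, and the snapshot is assumed to be refreshed from whatever intensity and latency feeds you already have.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class RegionSnapshot:
    region: str
    grid_g_per_kwh: float   # current grid carbon intensity
    est_latency_ms: float   # estimated end-to-end latency from the caller
    healthy: bool = True


def pick_region(snapshots: Sequence[RegionSnapshot],
                latency_budget_ms: float) -> Optional[RegionSnapshot]:
    """Return the cleanest healthy region within the latency budget, or None."""
    eligible = [s for s in snapshots
                if s.healthy and s.est_latency_ms <= latency_budget_ms]
    if not eligible:
        return None
    # Lowest carbon intensity wins; latency breaks ties.
    return min(eligible, key=lambda s: (s.grid_g_per_kwh, s.est_latency_ms))


# Invented snapshot for illustration.
snapshots = [
    RegionSnapshot("coal-heavy", grid_g_per_kwh=700.0, est_latency_ms=40.0),
    RegionSnapshot("hydro-heavy", grid_g_per_kwh=30.0, est_latency_ms=95.0),
    RegionSnapshot("far-hydro", grid_g_per_kwh=25.0, est_latency_ms=260.0),
]
choice = pick_region(snapshots, latency_budget_ms=150.0)
print(choice.region if choice else "no eligible region")
```

Treating latency as a hard constraint rather than a weighted term keeps the user-visible guarantee intact; a weighted score across cost, latency, and carbon is an equally reasonable design if your router already works that way.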
Temporal shifting applies to anything that does not need to happen this second: batch evals, embedding regeneration, offline summarization, synthetic data generation. Grid carbon intensity swings hour to hour with solar and wind. Deferrable jobs can wait for the cleaner window. The user-facing request cannot wait; the nightly batch absolutely can, and almost no one schedules it that way.
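For deferrable jobs, the scheduling question reduces to finding the cleanest contiguous window before a deadline. The sketch below assumes you have an hourly forecast of grid intensity as a plain list; `cleanest_start` is a hypothetical helper and the forecast values are invented.

```python
from typing import Sequence


def cleanest_start(forecast_g_per_kwh: Sequence[float],
                   duration_hours: int, deadline_hour: int) -> int:
    """Return the start hour (offset from now) of the lowest-carbon window.

    The job must run for duration_hours contiguous hours and finish no later
    than deadline_hour, within the forecast horizon. Ties go to the earliest start.
    """
    if duration_hours <= 0:
        raise ValueError("duration_hours must be positive")
    horizon = min(deadline_hour, len(forecast_g_per_kwh))
    last_start = horizon - duration_hours
    if last_start < 0:
        raise ValueError("job cannot finish before the deadline within the forecast")

    # Sliding-window sum over the forecast.
    window = sum(forecast_g_per_kwh[:duration_hours])
    best_start, best_sum = 0, window
    for start in range(1, last_start + 1):
        window += forecast_g_per_kwh[start + duration_hours - 1]
        window -= forecast_g_per_kwh[start - 1]
        if window < best_sum:
            best_start, best_sum = start, window
    return best_start


# Invented hourly forecast, gCO2e per kWh, starting from the current hour.
forecast = [500, 450, 300, 200, 220, 400, 480]
print(cleanest_start(forecast, duration_hours=2, deadline_hour=7))
```

The same function works for a nightly batch or a weekly embedding refresh; only the forecast horizon and the deadline change.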
The catch is the same as with the per-request number: you cannot route on carbon if you are not measuring it. Carbon-aware routing needs real-time grid intensity data wired into the same layer that already makes cost and latency decisions. The infrastructure for cost-and-latency routing already exists in most production AI systems. Adding carbon as a third input is an extension, not a rebuild.
Making it a number a team is held to
The deepest reason carbon gets ignored is not technical. It is that nobody owns it. Latency has an owner because it has an SLO and a pager. Cost has an owner because someone gets asked about the bill. Carbon has neither, so it has no owner, so it is no one's job, so it drifts.
Closing that gap is a process change more than an engineering one.
Put the number on the existing dashboard. Not a separate sustainability portal that gets opened once a quarter. The same Grafana board that shows p95 and cost-per-request should show grams of CO2 per request. A metric that lives where engineers already look gets defended. A metric in a PDF does not.
Make it relative and per-feature. Because SCI normalizes by functional unit, the per-request footprint is independent of traffic volume. That is what makes it fair to hold a team to: they cannot be blamed for the product getting popular, only for the design getting wasteful. A feature whose carbon-per-request climbs release over release is doing something wrong, and now you can see it in the same review where you already see a latency regression.
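The normalization is simple enough to show directly. In this sketch, `grams_per_request` and `regressed` are hypothetical helpers, the 5 percent tolerance is an arbitrary assumption, and the totals are invented; the point is that doubling traffic leaves the per-request figure unchanged, while a heavier design moves it.

```python
def grams_per_request(total_g: float, requests: int) -> float:
    """Normalize total emissions by the functional unit (here, one request)."""
    if requests <= 0:
        raise ValueError("requests must be positive")
    return total_g / requests


def regressed(previous_g: float, current_g: float, tolerance: float = 0.05) -> bool:
    """Flag a release whose per-request footprint rose by more than the tolerance."""
    return current_g > previous_g * (1.0 + tolerance)


# Invented totals for illustration.
release_1 = grams_per_request(total_g=300.0, requests=10_000)
release_2 = grams_per_request(total_g=600.0, requests=20_000)  # traffic doubled, same design
release_3 = grams_per_request(total_g=900.0, requests=20_000)  # same traffic, heavier design

print(regressed(release_1, release_2))  # popularity alone
print(regressed(release_2, release_3))  # a design change
```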
Build the regression test before you trust the metric. Before you believe a carbon number can catch a real problem, deliberately ship a worse design in staging — a heavier model, a dirtier region — and confirm the dashboard moves. A footprint metric that does not visibly react to a design you know is worse is not a metric you can hold anyone to. This is the same discipline you would apply to any eval: validate that it can detect a planted failure before you trust it on a real one.
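A planted-failure check can be as small as comparing medians. The sketch below assumes you can export per-request gCO2e samples from staging for a baseline run and for the deliberately worse design; `detects_planted_regression` and the 1.2 ratio threshold are hypothetical choices, and the samples are invented.

```python
from statistics import median
from typing import Sequence


def detects_planted_regression(baseline_g: Sequence[float],
                               planted_g: Sequence[float],
                               min_ratio: float = 1.2) -> bool:
    """Return True if the planted design's median footprint is visibly higher.

    min_ratio is the smallest planted/baseline ratio that counts as "the
    dashboard moved"; pick it to sit above your normal run-to-run noise.
    """
    if not baseline_g or not planted_g:
        raise ValueError("both sample sets must be non-empty")
    base = median(baseline_g)
    if base <= 0:
        raise ValueError("baseline median must be positive")
    return median(planted_g) / base >= min_ratio


# Invented per-request samples in gCO2e.
baseline = [0.030, 0.031, 0.029, 0.030]
heavier_model = [0.060, 0.058, 0.063, 0.061]

print(detects_planted_regression(baseline, heavier_model))
```

If this returns False for a design you know is worse, the problem is in the metric, and that is worth finding out in staging rather than in a review.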
One honest caveat. Per-unit efficiency has been improving fast — Google reported a 44x drop in carbon per median prompt over a single year. But total AI demand is growing faster, and a cheaper, cleaner query simply invites far more queries. Efficiency alone does not bend the total curve down; it is necessary, not sufficient. Treating carbon as a real engineering metric does not solve that. It just makes sure that when your team makes the trade, it is a decision someone made on purpose — not a side effect of an argument about a dollar sign.
The fix is not heroic. It is one more column on a dashboard you already have, one more input to a router you already built, and one named owner. Carbon stops being a values statement and starts being what it should have been all along: a line item in the spec.
- https://cloud.google.com/blog/products/infrastructure/measuring-the-environmental-impact-of-ai-inference/
- https://www.technologyreview.com/2025/05/20/1116327/ai-energy-usage-climate-footprint-big-tech/
- https://sci.greensoftware.foundation/
- https://www.iso.org/standard/86612.html
- https://arxiv.org/html/2505.09598v5
- https://docs.cloud.google.com/architecture/framework/sustainability/low-carbon-regions
- https://www.carbonbrief.org/ai-five-charts-that-put-data-centre-energy-use-and-emissions-into-context/
- https://aws.amazon.com/blogs/database/optimize-llm-response-costs-and-latency-with-effective-caching/
