
AI Infrastructure Carbon Accounting: The Sustainability Cost Your Team Hasn't Measured Yet

· 9 min read
Tian Pan
Software Engineer

Every engineering team building on LLMs right now is making infrastructure decisions with a hidden cost they're not measuring. You track tokens. You track latency. You track API spend. But almost nobody tracks the carbon output of the inference workload they're running — and that gap is closing fast, from both the regulatory side and the market side.

AI systems now account for an estimated 2.5–3.7% of global greenhouse gas emissions, surpassing aviation's roughly 2% share, and growing at about 15% annually. US data centers running AI-specific servers consumed 53–76 TWh in 2024 alone — enough to power 7.2 million homes for a year. The scale is not hypothetical anymore, and the expectation that engineering teams will have visibility into their contribution is becoming a real organizational pressure.

The good news is that the measurement tooling exists, the methodology has been standardized, and the highest-leverage interventions are software decisions your team can make today. This post covers how to measure carbon per model call, what drives emissions at the infrastructure level, and how the regulatory timeline should calibrate your urgency.

Why Inference Is Your Actual Problem, Not Training

The mental model most engineers carry — that training runs are the environmental story for AI — is wrong by a wide margin. Research from Meta, AWS, and Google consistently finds that 60–90% of an LLM's total lifecycle emissions come from inference, not training. This makes sense once you think about it: a model trains once, but it handles billions of queries over its operational lifetime.

The per-query numbers are small enough to feel negligible until you multiply them out:

  • A single GPT-4o query costs approximately 0.42 Wh — about 40% more than a Google search
  • Claude 3 Opus runs about 4.05 Wh per request (roughly 1.80 grams of CO₂ on a typical grid)
  • Claude 3 Haiku runs 0.22 Wh per request — 94% less than Opus for the same underlying provider infrastructure
  • Efficient small models (GPT-4.1 nano, LLaMA-3.2 1B/3B) come in below 0.3 grams CO₂ per query

The gap between the most and least efficient models for equivalent tasks is 10–40×. At a million queries per day, that gap becomes a metric ton of CO₂ differential per week. Model selection is the highest-leverage environmental decision on your architecture diagram, and most teams are not making it with emissions data anywhere in the picture.
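The multiplication is worth doing explicitly. Here is the back-of-envelope arithmetic as a minimal sketch, assuming a grid intensity of 0.4 kg CO₂e/kWh (a typical but illustrative value — real intensity varies by region) and using the per-query energy figures above:

```python
# Back-of-envelope: weekly CO2e differential between two model tiers.
# Assumes a grid intensity of ~0.4 kg CO2e per kWh (illustrative; varies by region).

GRID_INTENSITY_G_PER_WH = 0.4  # g CO2e per Wh

def grams_per_query(wh_per_query: float) -> float:
    """Convert per-query energy (Wh) to grams of CO2e."""
    return wh_per_query * GRID_INTENSITY_G_PER_WH

def weekly_differential_tonnes(wh_a: float, wh_b: float, queries_per_day: int) -> float:
    """CO2e gap, in metric tons per week, between model A and model B."""
    gap_g = abs(grams_per_query(wh_a) - grams_per_query(wh_b))
    return gap_g * queries_per_day * 7 / 1_000_000  # grams -> metric tons

# GPT-4o-class (~0.42 Wh/query) vs. a small efficient model (0.05 Wh assumed)
print(round(weekly_differential_tonnes(0.42, 0.05, 1_000_000), 2))  # ~1.04
```

At a million queries a day, the gap between a frontier-class model and a small one lands at roughly a metric ton of CO₂e per week, consistent with the claim above.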

The Measurement Stack

Measuring AI inference carbon is now genuinely tractable. There are three layers to get right.

Hardware-level measurement is what tools like CodeCarbon provide. It monitors GPU, CPU, and RAM power consumption at configurable intervals (default: every 15 seconds), retrieves the carbon intensity of the local electricity grid for your hardware's geographic location, and outputs a CO₂ equivalent figure for the code that ran. If you're self-hosting models, this is the most direct approach. CodeCarbon integrates as a Python context manager — you wrap your inference calls and get a measurement without restructuring your code.

API-call-level attribution is what the One-Token Model (OTM) methodology addresses for teams calling hosted providers. The core insight: the token is a universal measurement unit that works across text, audio, image, and multimodal inputs. The conversion chain is: token count → hardware power profile for that model → energy consumed per token → grid carbon intensity for the provider's data center region → CO₂e. Tools like EcoLogits implement this for OpenAI, Anthropic, and other major provider APIs without requiring any internal provider data. You pass in your API call and get an emissions estimate back.
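The conversion chain reads naturally as a small function. A sketch follows — the per-token energy figures and grid intensities below are illustrative placeholders, not calibrated provider data (EcoLogits maintains the real profiles):

```python
# Sketch of the One-Token Model chain: tokens -> energy -> grid intensity -> CO2e.
# All numbers here are illustrative placeholders, NOT real provider profiles.

WH_PER_TOKEN = {            # assumed hardware power profile per model
    "small-model": 0.0002,
    "frontier-model": 0.003,
}
GRID_G_CO2E_PER_WH = {      # assumed regional grid carbon intensity
    "us-east": 0.38,
    "eu-west": 0.23,
}

def estimate_co2e_grams(model: str, tokens: int, region: str) -> float:
    """Estimate grams of CO2e for one API call via the OTM chain."""
    energy_wh = tokens * WH_PER_TOKEN[model]        # token count -> energy
    return energy_wh * GRID_G_CO2E_PER_WH[region]   # energy -> CO2e

print(estimate_co2e_grams("frontier-model", 1500, "us-east"))
```

The same chain works for any modality, since the token is the functional unit throughout.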

Infrastructure-level attribution is where Cloud Carbon Footprint comes in. It pulls usage data from AWS, GCP, and Azure accounts, reconstructs energy consumption per service, and applies grid intensity data for each region. This is the right layer for attribution reporting — mapping which team, product feature, or customer segment is responsible for what share of your AI carbon spend.

The emerging consensus standard is the Software Carbon Intensity (SCI) for AI specification from the Green Software Foundation, which extends the existing ISO/IEC 21031:2024 standard. It defines a Provider Score (covering training and deployment) and a Consumer Score (covering inference usage), with token as the functional unit for language models. The practical upshot: if your organization is asked by an ESG team or an enterprise customer to report your AI emissions, SCI for AI is the framework you'll be using.

What Actually Moves the Needle

Once you have measurement in place, the interventions worth prioritizing break into three categories.

Model routing is the highest-leverage lever. The 10–40× carbon gap between model tiers means that routing a query to the right model tier — not the capable-for-anything frontier model — is worth more than most optimization work. The pattern is similar to cost-aware routing: classify query complexity first, then dispatch to the cheapest model that can handle it reliably. Haiku for summaries and classification, Sonnet for reasoning tasks, Opus or equivalent for the work that genuinely needs it. The emissions reduction tracks the cost reduction directly.
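The routing pattern can be sketched in a few lines. The tier names and the classifier heuristic below are illustrative assumptions — a production router would use a trained classifier or a cheap model as the judge:

```python
# Sketch of carbon-aware model routing: classify query complexity first, then
# dispatch to the cheapest tier that can handle it. Heuristic is illustrative.

TIERS = ["haiku", "sonnet", "opus"]  # ordered cheapest / lowest-carbon first

def classify_complexity(prompt: str) -> int:
    """Toy heuristic: long prompts and reasoning keywords score higher."""
    score = 0
    if len(prompt) > 500:
        score += 1
    if any(kw in prompt.lower() for kw in ("prove", "derive", "multi-step")):
        score += 1
    return score  # 0, 1, or 2

def route(prompt: str) -> str:
    """Return the lowest tier whose capability matches the query."""
    return TIERS[classify_complexity(prompt)]

print(route("Classify this ticket as bug or feature."))
print(route("Prove that the closed form holds for all n."))
```

Because the emissions gap tracks the cost gap, the same router serves both budgets at once.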

Batching is underrated. Research shows that moving from batch size 4 to batch size 8 delivers approximately 45% reduction in energy per prompt. Moving from 8 to 16 yields another 43% reduction. For any workload that isn't truly interactive — document processing, classification pipelines, nightly enrichment jobs — request batching is the single easiest infrastructure change with the largest environmental payoff. The latency tradeoff is real for synchronous workloads, but for async pipelines it's often free.
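Assuming the two cited figures compound multiplicatively, moving from batch size 4 all the way to 16 leaves roughly 31% of the original energy per prompt — about a 69% total reduction. A quick arithmetic sketch:

```python
# Sketch: compounding the cited per-prompt energy reductions from batching.
# Figures are the ones quoted above (45% for 4->8, 43% for 8->16); assuming
# they compound multiplicatively.

def remaining_fraction(reductions: list[float]) -> float:
    """Energy remaining after applying successive fractional reductions."""
    frac = 1.0
    for r in reductions:
        frac *= (1.0 - r)
    return frac

frac = remaining_fraction([0.45, 0.43])  # batch 4 -> 8 -> 16
print(f"energy per prompt at batch 16 vs batch 4: {frac:.2%}")
```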

Quantization has more headroom than most teams use. INT8 quantization reduces model size by 75% with less than 1% accuracy loss on most benchmarks. q4 variants (4-bit quantization) can achieve up to 79% energy reduction versus FP16. Advanced techniques like AWQ (Activation-aware Weight Quantization) protect the critical weight channels that cause accuracy degradation, making aggressive quantization viable for production without the quality cliff that earlier methods hit. If you're self-hosting, the quantization decision is one of the highest-leverage configurations you haven't optimized.
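The size arithmetic is simple enough to sanity-check directly. This sketch ignores the small overhead real quantized checkpoints carry (scales, zero-points), but it shows where the 75% figure comes from — INT8 stores a quarter of the bytes FP32 does:

```python
# Sketch: weight storage for a 7B-parameter model at different precisions.
# Pure arithmetic; real quantized checkpoints add minor overhead (scales,
# zero-points) that this ignores.

def model_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (decimal GB)."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

for label, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("q4", 4)]:
    print(f"{label}: {model_gb(7, bits):.1f} GB")
```

A 7B model drops from 28 GB at FP32 to 7 GB at INT8 and 3.5 GB at 4-bit — which is what puts larger models on smaller, lower-power hardware in the first place.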

One data point that reframes how to think about all of this: production systems achieved a 33× reduction in energy per prompt between May 2024 and May 2025. The breakdown was roughly 23× from model architecture improvements and 1.4× from better hardware utilization. The lesson is that software-level optimization has an order of magnitude more headroom than hardware purchasing decisions. The carbon optimization work is mostly engineering work, not infrastructure spend.

The Attribution Problem Inside Organizations

Measurement at the API boundary is straightforward. The harder problem is attribution — answering which team, feature, or customer is responsible for which portion of the emissions.

The organizational version of this problem looks like: you have a shared inference API gateway. Fifteen product teams route traffic through it. At the end of the quarter, your sustainability team asks for an AI emissions breakdown. Without attribution at the request level, you have a total and no decomposition.

The right instrumentation layer treats carbon like cost. Just as you attribute token spend to teams via request tagging, you attribute emissions the same way. Every inference request should carry team, feature, and environment tags. The carbon calculation runs at collection time against the token count and model ID. Cloud Carbon Footprint or similar tooling aggregates at the account level; request-level tagging provides the decomposition below that.
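The tagging pattern can be sketched as a small in-memory ledger. The tag names and per-token factors below are illustrative assumptions; in practice the factor comes from an OTM-style lookup per model, and the sink is your metrics pipeline rather than a dict:

```python
# Sketch: request-level carbon attribution, mirroring cost attribution.
# Per-token factors are illustrative placeholders, not calibrated data.

from collections import defaultdict

G_CO2E_PER_TOKEN = {"haiku": 0.00004, "opus": 0.0008}  # assumed factors

ledger: dict[tuple[str, str], float] = defaultdict(float)

def record_request(team: str, feature: str, model: str, tokens: int) -> None:
    """Attribute this call's estimated emissions to a (team, feature) pair."""
    ledger[(team, feature)] += tokens * G_CO2E_PER_TOKEN[model]

record_request("search", "autocomplete", "haiku", 1200)
record_request("search", "autocomplete", "haiku", 800)
record_request("support", "triage", "opus", 2000)

for key, grams in sorted(ledger.items()):
    print(key, round(grams, 3))
```

With tags flowing on every request, the quarterly decomposition is a group-by, not a forensic exercise.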

Google's internal approach to this is instructive: teams have explicit compute and storage quotas, which forces prioritization and makes the cost-of-compute signal visible in the developer workflow. Carbon is a lagging signal when compared to cost, but the quota mechanism is the same. Organizations that instrument this early have it as a queryable metric before they need it for compliance. Organizations that don't have a scramble ahead of them.

The Regulatory Timeline, Honestly

The current regulatory situation is messier than the sustainability community presents it, and cleaner than the "it's all being rolled back" narrative suggests.

In the US, the SEC's climate disclosure rule (issued March 2024) would have required Scope 1 and 2 GHG emissions disclosure for large accelerated filers starting with FY 2025 filings. The rule is currently under a voluntary stay following legal challenges, and the SEC announced in early 2025 that it would not defend the rule in court. The short version: mandatory SEC disclosure for US public companies is genuinely uncertain right now.

In the EU, the AI Act includes requirements for voluntary codes of conduct on energy efficiency and environmental sustainability, with a first progress report due in August 2028. The actual enforcement standards don't exist yet and will take years to develop. This is not imminent regulatory pressure.

The stronger near-term signal is industry standard adoption. The SCI for AI specification is the first consensus standard for AI carbon measurement, extending an existing ISO standard. When enterprise customers start asking AI vendors about their emissions footprint — which is already happening in procurement questionnaires — this is the framework teams will be asked to report against. The SBTi methodology update in 2027 will tighten what companies with validated science-based targets have to show. Investor-driven ESG pressure is moving faster than regulatory mandates.

The practical framing: implement carbon measurement as infrastructure observability, not as compliance theater. The tooling is mature enough, the standards are clear enough, and the organizational usefulness of having the data (for capacity planning, cost optimization, and vendor negotiations) is high enough that it's worth doing on its own merits. The compliance requirement will follow the practice, not the other way around.

Where to Start Next Week

The minimum viable measurement setup requires three things: a token counter that fires per inference call, a model-to-carbon-intensity lookup table (EcoLogits or DitchCarbon maintain these for major providers), and a metrics sink that accepts a carbon_gCO2e field alongside your existing latency and cost metrics.
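Those three pieces fit in one function. A sketch, with an assumed placeholder lookup table standing in for the EcoLogits/DitchCarbon data:

```python
# Sketch of the minimum viable setup: per-call token count, a model-to-
# intensity lookup, and a metrics record carrying carbon_gCO2e alongside
# latency and cost. Lookup values are illustrative placeholders.

import time

G_CO2E_PER_1K_TOKENS = {"small": 0.04, "frontier": 0.8}  # assumed table

def emit_metric(model: str, tokens: int, latency_ms: float, cost_usd: float) -> dict:
    """Build the metrics record your sink ingests for one inference call."""
    return {
        "ts": time.time(),
        "model": model,
        "tokens": tokens,
        "latency_ms": latency_ms,
        "cost_usd": cost_usd,
        "carbon_gCO2e": tokens / 1000 * G_CO2E_PER_1K_TOKENS[model],
    }

m = emit_metric("frontier", 1500, 420.0, 0.012)
print(m["carbon_gCO2e"])
```

Because the carbon field rides on the record you already emit, no new pipeline is required — only a new column.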

From there, the most informative first query to run is: what is the carbon breakdown by model tier across your last 30 days of inference traffic? The answer almost always reveals that a large fraction of frontier-model calls are handling tasks that a cheaper, lower-emissions model would handle adequately. That analysis pays for itself in API cost savings before the sustainability report ever gets written.

The teams that will be ahead of this shift are not the ones that wait for a regulation to force the measurement — they're the ones that instrument it early, make it part of normal capacity planning, and build the organizational vocabulary for discussing AI efficiency as an environmental as well as a financial metric.
