The sustainability conversation around AI has been almost entirely about training. GPT-4’s training run. The carbon cost of a frontier model. The energy consumed at a hyperscale cluster over a six-week training cycle. That framing made sense when AI was predominantly a research activity and when the number of training runs was large relative to the number of users. It no longer reflects where the emissions are going.
Inference is now the dominant AI workload. Billions of queries hit production language models every day, each generating a small carbon event. Those small events accumulate into something much larger than any single training run. Research from Accenture Labs published in March 2026 was direct: while carbon estimation tools focus on training, the carbon cost of training is quickly eclipsed by emissions generated during inference because inference is so widespread and so often repeated. AWS and Nvidia have each stated that inference accounts for as much as 90% of the cost of large-scale AI workloads. The AI inference carbon footprint is not an emerging problem. It is the current problem, and the industry is still discussing it as if training were the primary concern.
Training Is a One-Time Event. Inference Never Stops.
A training run has a beginning and an end. The carbon emitted during that run is real and significant, but it is bounded. A model trained once can serve millions of users across months or years. Each of those service interactions is an inference event, and the cumulative carbon of those interactions compounds continuously from the moment the model goes into production.
The arithmetic becomes visible quickly. Training GPT-3, the model family underlying the original ChatGPT, reportedly emitted the equivalent of roughly 502 tonnes of CO2. Within weeks of deployment, the inference carbon from serving that model’s user base surpassed the training carbon. As user bases scale into the hundreds of millions and query volumes reach billions per day, inference carbon accumulates at a rate that no comparison with a single training run captures. Recent estimates suggest inference can account for up to 90% of a model’s total lifecycle energy use. The model that took months and millions of dollars to train generates most of its lifetime carbon in the months of production service that follow.
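The break-even point can be sketched with a few lines of arithmetic. The training figure is the one quoted above and the 0.42 Wh per query is the short-prompt figure used elsewhere in this article. The query volume and grid carbon intensity are illustrative assumptions, not measured values, and the function name is hypothetical.

```python
def breakeven_days(training_tonnes, queries_per_day, wh_per_query, grid_g_per_kwh):
    """Days of serving until cumulative inference CO2 equals training CO2."""
    # Energy drawn by inference each day, in kWh.
    kwh_per_day = queries_per_day * wh_per_query / 1000.0
    # Convert grams of CO2 per kWh into tonnes of CO2 per day.
    tonnes_per_day = kwh_per_day * grid_g_per_kwh / 1_000_000.0
    return training_tonnes / tonnes_per_day


days = breakeven_days(
    training_tonnes=502.0,        # training estimate quoted in the text
    queries_per_day=100_000_000,  # ASSUMPTION: 100 million queries per day
    wh_per_query=0.42,            # short-prompt figure quoted in this article
    grid_g_per_kwh=400.0,         # ASSUMPTION: illustrative grid intensity
)
print(f"Inference matches training after about {days:.0f} days")
```

Under these assumptions the crossover arrives in about a month. Higher query volumes or a dirtier grid pull it earlier, and a cleaner grid pushes it later, but the shape of the result does not change: a bounded quantity is being compared with one that keeps growing.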
The Carbon Per Query Problem the Industry Is Not Tracking
The per-query carbon cost of inference varies dramatically across model architectures, serving infrastructure, and query complexity. Processing a short prompt using a large production language model consumes roughly 0.42 watt-hours of energy. At hundreds of millions of queries per day across a platform like ChatGPT, that figure translates into energy consumption that rivals the output of a mid-sized power plant on an annualised basis. The trajectory only worsens as AI capabilities expand and as enterprises deploy models for more complex, multi-step agentic tasks that chain multiple model calls per user action.
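To make the scaling concrete, here is a rough sketch that turns a per-query energy figure into annual energy and average continuous power. The 0.42 Wh value comes from the text. The daily query volume and the number of model calls per agentic action are assumptions chosen only for illustration, and the function names are hypothetical.

```python
def annual_energy_gwh(queries_per_day, wh_per_query, calls_per_query=1):
    """Annual inference energy in GWh for a given daily query volume."""
    wh_per_day = queries_per_day * calls_per_query * wh_per_query
    return wh_per_day * 365 / 1e9


def average_power_mw(queries_per_day, wh_per_query, calls_per_query=1):
    """Average continuous power draw in MW implied by the same workload."""
    wh_per_day = queries_per_day * calls_per_query * wh_per_query
    return wh_per_day / 24 / 1e6


QUERIES_PER_DAY = 500_000_000  # ASSUMPTION: illustrative platform volume
WH_PER_QUERY = 0.42            # short-prompt figure quoted in the text

single_call = annual_energy_gwh(QUERIES_PER_DAY, WH_PER_QUERY)
# ASSUMPTION: an agentic task chains 10 model calls per user action.
agentic = annual_energy_gwh(QUERIES_PER_DAY, WH_PER_QUERY, calls_per_query=10)

print(f"Single-call workload: {single_call:.1f} GWh per year")
print(f"Agentic workload:     {agentic:.1f} GWh per year")
print(f"Average draw:         {average_power_mw(QUERIES_PER_DAY, WH_PER_QUERY):.2f} MW")
```

The absolute numbers depend entirely on the assumed volume. The point of the sketch is the multiplier: chaining calls scales energy linearly with the number of calls, so a shift to agentic workflows can outweigh a substantial per-call efficiency gain.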
The problem is not just the absolute carbon. It is the invisibility of it. Training carbon is relatively straightforward to measure and attribute because it happens at specific times on specific hardware in specific facilities. Inference carbon is distributed across global serving infrastructure, varies with query volume and complexity, and is largely absent from the sustainability disclosures that regulators are starting to require. The EU Energy Efficiency Directive now requires operators of data centres with an installed IT power demand of at least 500 kW to disclose energy and sustainability data. Most of those disclosures will capture PUE and renewable energy percentage. Neither captures the per-query carbon intensity of the workloads those facilities serve.
The Efficiency Gains Are Real but Not Keeping Pace
The AI industry’s defence against the inference carbon concern is efficiency improvement. Smaller, more efficient models are delivering comparable performance on specific tasks at a fraction of the compute cost. Quantisation, distillation, and hardware-level optimisation are all reducing the energy cost per inference token. These improvements are genuine and commercially important.
They are not keeping pace with volume growth. The inference efficiency gains from model optimisation are being offset by the expansion of AI into new use cases, new markets, and new layers of the enterprise stack. Every productivity tool that embeds AI inference, every customer service deployment, every code completion system, and every agentic workflow running in the background adds to the global inference compute load. Volume is growing faster than efficiency is improving, and the gap between the two is where the carbon accumulates.
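The race between volume and efficiency can be written as a simple compounding model. This is a toy sketch: the growth and efficiency rates below are illustrative assumptions, not industry measurements, and the function names are hypothetical.

```python
def relative_emissions(years, volume_growth, efficiency_gain):
    """Emissions relative to year 0.

    volume_growth:   fractional yearly growth in query volume (1.0 = doubling)
    efficiency_gain: fractional yearly drop in energy per query (0.30 = 30%)
    """
    yearly_factor = (1 + volume_growth) * (1 - efficiency_gain)
    return [yearly_factor ** year for year in range(years + 1)]


def efficiency_needed_for_flat(volume_growth):
    """Yearly efficiency gain required just to hold emissions constant."""
    return 1 - 1 / (1 + volume_growth)


# ASSUMPTION: volume doubles each year while energy per query falls 30%.
trajectory = relative_emissions(years=3, volume_growth=1.0, efficiency_gain=0.30)
print([round(x, 3) for x in trajectory])
print(f"Gain needed to stay flat: {efficiency_needed_for_flat(1.0):.0%}")
```

With these assumed rates, emissions still grow by 40% a year despite a 30% annual efficiency improvement. Holding emissions flat while volume doubles would require halving the energy per query every year.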
The Measurement Problem That Needs to Be Solved First
Before the industry can manage inference carbon, it needs to measure it. That is harder than it sounds. Inference happens across thousands of servers in dozens of facilities, responding to query volumes that fluctuate continuously. The carbon intensity of each inference event depends on the hardware serving it, the grid it draws from, the time of day, and the complexity of the query. No standardised prompt-level carbon measurement framework currently exists that can capture all of those variables in real time at production scale.
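A minimal sketch of what a prompt-level estimate has to combine, assuming the operator can attribute accelerator time to a query and knows the grid intensity at the moment of serving. The class, field names, and sample values are all hypothetical. This is not an existing framework, and a production version would also need to account for CPU, memory, networking, idle capacity, and batching.

```python
from dataclasses import dataclass


@dataclass
class InferenceEvent:
    accelerator_seconds: float  # accelerator time attributed to the query
    accelerator_watts: float    # average draw of the serving hardware
    pue: float                  # facility power usage effectiveness
    grid_g_per_kwh: float       # grid carbon intensity when the query ran


def grams_co2(event: InferenceEvent) -> float:
    """Estimated grams of CO2 for a single inference event."""
    # Watt-seconds to kWh: divide by 3,600 s/h and 1,000 W/kW.
    it_kwh = event.accelerator_seconds * event.accelerator_watts / 3_600_000.0
    # Scale IT energy up to facility energy using PUE.
    facility_kwh = it_kwh * event.pue
    return facility_kwh * event.grid_g_per_kwh


# ASSUMPTION: the same query served on identical hardware in two locations
# that differ only in the carbon intensity of the local grid.
low_carbon = InferenceEvent(2.0, 700.0, 1.2, grid_g_per_kwh=100.0)
high_carbon = InferenceEvent(2.0, 700.0, 1.2, grid_g_per_kwh=600.0)

print(f"Low-carbon grid:  {grams_co2(low_carbon):.3f} g CO2")
print(f"High-carbon grid: {grams_co2(high_carbon):.3f} g CO2")
```

Both events report an identical PUE, yet under these assumed values the second emits six times as much carbon. That is why a facility-level disclosure alone cannot stand in for per-query measurement.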
Accenture Labs, in its March 2026 framework paper, described this as the core gap: existing tools either cannot benchmark proprietary models, cannot provide the real-time granularity required for deployment-specific prompt-level benchmarking, or force users to operate in local environments that fail to capture the infrastructure complexity of production-scale inference. The point that the sustainability conversation around AI is missing land as a critical metric applies equally here. The industry is tracking the inputs it can measure rather than the outputs that matter. Inference carbon is the output that matters most in 2026. Operators who build the measurement infrastructure to track it now will be ahead of the regulatory requirements that are clearly coming, rather than scrambling to comply when they arrive.
