# Performance per Watt in CPUs, GPUs & Data Centres: A 10‑Year Overview
## Introduction
As AI demand drives more power‑hungry data centres, it’s worth asking: how much more work do we get per watt today than a decade ago? This post surveys performance‑per‑watt (perf/W) trends from roughly 2015 to 2025 across three layers:
- Server CPUs
- GPUs/accelerators
- Data‑centre infrastructure (power, cooling, interconnect)
Throughout, “perf/W” means useful throughput per watt of power consumed, whether that’s FLOPS/W, inferences/W, or a composite benchmark score per watt.
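To make the unit concrete, here is a minimal sketch (with hypothetical numbers) of how perf/W is computed:

```python
def perf_per_watt(throughput: float, power_watts: float) -> float:
    """Useful work per watt; throughput can be FLOPS, inferences/s,
    or a composite benchmark score."""
    return throughput / power_watts

# Hypothetical accelerator: 1 PFLOPS sustained at 700 W of wall power
gflops_per_watt = perf_per_watt(1e15, 700) / 1e9
print(f"{gflops_per_watt:.0f} GFLOPS/W")  # → 1429 GFLOPS/W
```

The same ratio works for any throughput metric, which is why the sections below can compare CPUs, GPUs, and whole facilities on one axis.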
## TL;DR
| Domain | ~10‑year improvement (perf/W) | Main drivers |
|---|---|---|
| Server CPUs | ~1.5×–3× | Process nodes, µarch, DVFS, core counts, Arm entry |
| GPUs/Accelerators | ~3×–4×+ (esp. ML/inference) | Tensor/AI cores, low‑precision math, memory BW, specialised engines |
| Data centre (system) | ~5×–10× at the very top; infra gains slower | Chip gains + cooling/PSU/interconnect optimisations |
Bottom line: perf/W improves each year, but demand for compute is rising faster.
## CPU efficiency: the slow, steady climb
### What we mean by perf/W
For servers, think throughput per watt under realistic workloads. System‑level benchmarks like SPECpower_ssj2008 and SERT 2 measure a whole server under load, giving efficiency numbers that account for power supply, memory, and idle overhead, not just the CPU die.
### What the data say
- SPECpower results show roughly 2–3× improvement in server‑side Java throughput per watt between 2015‑ and 2024‑vintage systems, driven by process shrinks (22 nm → 5 nm), wider cores, and better power management. (SPEC)
- In desktop/consumer benchmarks, CPU power draw has risen sharply while single‑thread efficiency gains are incremental; performance does not scale linearly with power. (GamersNexus; consumer parts, but the thermal/voltage trend applies to server silicon too.)
- The arrival of Arm‑based server CPUs (AWS Graviton from 2018, Ampere Altra from 2020, NVIDIA Grace in 2023) added a new efficiency curve: Graviton3 delivers throughput comparable to contemporary x86 parts at roughly 60 % of the power for many cloud workloads. (AWS)
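A quick sanity check on what "comparable throughput at roughly 60 % of the power" implies for perf/W (illustrative arithmetic, not a benchmark result):

```python
# Normalised comparison: same work done, different power draw
x86_power = 1.00   # baseline power (normalised)
arm_power = 0.60   # ~60 % of the power, per the Graviton3 claim

throughput = 1.0   # identical throughput in both cases
ratio = (throughput / arm_power) / (throughput / x86_power)
print(f"~{ratio:.2f}x perf/W advantage")  # → ~1.67x perf/W advantage
```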
### Takeaways
- Expect a roughly 1.5×–3× CPU perf/W improvement over ~10 years, depending on workload.
- Arm server chips have been the biggest single disruption to the x86 efficiency plateau, offering a step‑change rather than an incremental gain.
- Limits remain: IPC/clock headroom, process shrinks yielding diminishing voltage returns, and thermal ceilings on air‑cooled racks.
- Practical wins beyond the chip: improved power delivery (48 V rack distribution), DVFS, better idle states, and system‑level tuning.
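Why DVFS is such a lever: dynamic CMOS power scales roughly with C·V²·f, so a modest voltage‑plus‑frequency reduction yields a super‑linear power saving. A sketch with hypothetical core parameters:

```python
def dynamic_power(c_eff: float, v: float, f_hz: float) -> float:
    """Classic first-order CMOS model: P_dyn = C_eff * V^2 * f.
    Static/leakage power is ignored in this sketch."""
    return c_eff * v ** 2 * f_hz

# Hypothetical core: drop voltage and frequency by 10 % each
base = dynamic_power(1e-9, 1.00, 3.0e9)   # 1.00 V @ 3.0 GHz
dvfs = dynamic_power(1e-9, 0.90, 2.7e9)   # 0.90 V @ 2.7 GHz
print(f"power saved: {1 - dvfs / base:.1%}")  # → power saved: 27.1%
```

A ~10 % performance sacrifice buying ~27 % of dynamic power is exactly the trade DVFS governors exploit.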
## GPU efficiency: large, discrete jumps
### Why GPUs leap ahead
Massive parallelism, dedicated tensor/AI cores, lower‑precision math (FP16, BF16, INT8, FP4), and memory/interconnect advances combine to produce large generational gains, especially for workloads that can exploit reduced precision.
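The precision effect can be sketched with relative energy costs per multiply‑accumulate. The figures below are purely illustrative placeholders (real numbers vary by process node and design), but the shape of the argument holds:

```python
# Hypothetical relative energy per MAC, normalised to FP32 = 1.0
energy_per_mac = {"FP32": 1.00, "FP16": 0.40, "INT8": 0.20, "FP4": 0.10}

# At a fixed power budget, ops per watt scale inversely with energy per op
for fmt, energy in energy_per_mac.items():
    print(f"{fmt}: ~{1.0 / energy:.1f}x the FP32 ops per watt")
```

This is why inference, which tolerates reduced precision, shows the biggest perf/W jumps.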
### What the data say
- A survey of ML accelerator efficiency finds that leading GPUs/TPUs roughly double perf/W every ~2 years, outpacing Moore’s Law–era CPU gains. (Epoch AI)
- Concrete example: NVIDIA’s H100 (Hopper, 2022) delivered ~3.5× the inference perf/W of the A100 (Ampere, 2020) on large‑language‑model workloads, largely thanks to the FP8 Transformer Engine. The H100‑based “Henri” system reached ~65 GFLOPS/W, topping the Green500. (HPCwire)
- Google’s TPU v5p (2023) and AMD’s Instinct MI300X (2023) show similar generational jumps for inference, each claiming 2–3× perf/W gains over their predecessors in target workloads.
- GPU energy efficiency roughly doubles every 3–4 years across a broad set of workloads, while CPU efficiency improves more slowly. (arXiv)
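Doubling times compound dramatically over a decade, which is worth keeping in mind when comparing the two rates above (a simple extrapolation, ignoring the usual gap between leading‑edge parts and the broad market):

```python
def decade_gain(doubling_years: float, horizon_years: float = 10.0) -> float:
    """Compound a fixed doubling time into a gain over the horizon."""
    return 2.0 ** (horizon_years / doubling_years)

print(f"2-year doubling over a decade:   ~{decade_gain(2.0):.0f}x")  # → ~32x
print(f"3.5-year doubling over a decade: ~{decade_gain(3.5):.1f}x")  # → ~7.2x
```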
### Takeaways
- Over a decade, ~3×–4×+ improvements are common, especially for inference and low‑precision (FP16/INT8) workloads.
- Real gains depend on workload mix (FP64 scientific vs. FP16/INT8 inference), memory subsystem, and utilisation. A GPU running at 30 % occupancy wastes much of its efficiency advantage.
- Match accelerator to workload; mind memory bandwidth, interconnect topology (NVLink vs. PCIe), and the precision formats your model actually uses.
## Data‑centre/system level: beyond the chip
### Why it matters
Racks add power delivery, cooling (CRAC/CRAH, chillers, or increasingly liquid cooling), network/storage, and other overheads. The standard metric is PUE (Power Usage Effectiveness) = total facility power ÷ IT equipment power. Lower is better; 1.0 is the theoretical ideal where every watt goes to compute.
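In code, PUE is just a ratio; a hypothetical 1 MW IT load with 550 kW of cooling and power‑conversion overhead looks like this:

```python
def pue(total_facility_kw: float, it_kw: float) -> float:
    """Power Usage Effectiveness: total facility power / IT power."""
    return total_facility_kw / it_kw

# Hypothetical facility: 1 000 kW of IT load plus 550 kW of overhead
print(f"PUE = {pue(1000 + 550, 1000):.2f}")  # → PUE = 1.55
```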
### What the data say
- Top Green500 systems have climbed from ~7 GFLOPS/W in 2015 to ~70+ GFLOPS/W by late 2024 (e.g., JEDI at 72.7 GFLOPS/W). That’s roughly a 10× system‑level gain, driven primarily by GPU/accelerator improvements. (TOP500, NVIDIA)
- Industry‑average PUE has plateaued at around 1.55–1.60 according to the Uptime Institute’s annual survey, though hyperscalers (Google, Microsoft, Meta) report figures closer to 1.10–1.12. (Uptime Institute)
- Liquid cooling (direct‑to‑chip and rear‑door heat exchangers) is becoming essential for racks exceeding 40 kW, which high‑density GPU clusters now routinely hit. Liquid cooling can improve PUE by 0.1–0.2 points and eliminate the need for raised‑floor CRAH units.
- Even if IT perf/W doubles, facility‑level gains are muted unless the overhead infrastructure improves in step.
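That last point falls straight out of the arithmetic: work delivered per watt at the meter is IT perf/W divided by PUE, so a flat PUE passes chip gains through unimproved. A sketch with hypothetical numbers:

```python
def facility_gflops_per_watt(it_gflops_per_watt: float, pue: float) -> float:
    """Useful work per watt of total facility power."""
    return it_gflops_per_watt / pue

baseline = facility_gflops_per_watt(35.0, 1.55)  # hypothetical starting point
it_only = facility_gflops_per_watt(70.0, 1.55)   # IT perf/W doubles, PUE flat
both = facility_gflops_per_watt(70.0, 1.12)      # plus hyperscaler-class PUE
print(f"{baseline:.1f} -> {it_only:.1f} -> {both:.1f} GFLOPS/W at the meter")
```

Doubling IT efficiency doubles meter‑level efficiency only if PUE holds steady; cutting PUE from 1.55 to 1.12 adds a further ~38 % on top.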
### Takeaways
- Chip‑level perf/W is necessary but insufficient at rack scale; infrastructure design (cooling, power distribution, rack density) is decisive.
- Most of the Green500’s 10× gain comes from the chips, not the building. The remaining 20–30 % of facility overhead is increasingly hard to shave.
- Plan with GFLOPS/W (or OPS/W) plus PUE and rack‑level metrics (power density per rack, cooling headroom, modularity) to get a true picture of efficiency.
## Closing thoughts
Efficiency is improving at every layer, but unevenly. CPUs deliver steady, incremental gains, accelerated recently by Arm’s entry into the server market. GPUs post large generational jumps, especially for ML inference. At data‑centre scale, the bottleneck is increasingly the infrastructure around the chips: cooling, power delivery, and physical density.
The net effect: we get more work per watt each year, but aggregate compute demand, driven by AI training and inference, is rising faster still. Efficiency gains buy time; they don’t solve the energy problem on their own.