Specialized Chips Won’t Save Us From Impending ‘Accelerator Wall’
As CPU performance improvements have slowed down, we’ve seen the semiconductor industry move towards accelerator cards to provide dramatically better results. Nvidia has been a major beneficiary of this shift, but it’s part of the same trend driving research into neural network accelerators, FPGAs, and products like Google’s TPU. These accelerators have delivered tremendous performance boosts in recent years, raising hopes that they present a path forward, even as Moore’s law scaling runs out. A new paper suggests this may be less true than many would like.
Specialized architectures like GPUs, TPUs, FPGAs, and ASICs may work very differently from a general-purpose CPU, but they’re still built on the same process nodes as an x86, ARM, or POWER processor. That means the performance gains in these accelerators have also depended, at least in part, on the improvements delivered by transistor scaling. But how much of those gains came from manufacturing advances and the density increases delivered by Moore’s law, as opposed to underlying improvements in targeted domain performance? What degree of improvement has occurred independently of transistor budget?
Princeton University associate professor of electrical engineering David Wentzlaff and his doctoral student Adi Fuchs have created a model that allows them to measure this rate of improvement. The pair built a model using the characteristics of 1,612 CPUs and 1,001 GPUs implemented across a range of process nodes and power ranges to quantify the gains attributable to process node improvements. Wentzlaff and Fuchs created a metric for all of the performance improvements delivered by CMOS advances (CMOS-Driven Return) versus those gains linked to more effective execution of the workload (Chip Specialization Return). More data on the tool they developed to aid in quantifying CMOS Potential, dubbed Rankine, is available here.
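The decomposition is easiest to see with a toy calculation (an illustration of the idea only, not the paper’s actual model; the function name and all numbers here are hypothetical): if a new accelerator is 8x faster overall and the newer process node accounts for a 4x “CMOS potential” gain, the residual 2x is the chip specialization return.

```python
# Toy illustration of separating CMOS-Driven Return from Chip
# Specialization Return (CSR). The numbers are hypothetical; the
# paper's model derives "CMOS potential" from transistor counts,
# speeds, and power/area/energy data across process nodes.

def chip_specialization_return(total_gain, cmos_potential_gain):
    """Residual speedup once CMOS-driven gains are factored out."""
    return total_gain / cmos_potential_gain

total_gain = 8.0           # measured speedup of new chip vs. old chip
cmos_potential_gain = 4.0  # assumed gain attributable to the newer node

csr = chip_specialization_return(total_gain, cmos_potential_gain)
print(csr)  # 2.0 -> only 2x is credited to better specialization
```

In this framing, a chip can be much faster than its predecessor while its CSR stays flat, because most of the gain rode in on the new process node.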
What the team found was sobering. Performance gains in specialized silicon are fundamentally linked to the number of transistors available per square millimeter of silicon over the long term, as well as the improvements to those transistors introduced with each new process node. Worse, there are fundamental limits to how much performance we can extract from improved accelerator design without simultaneous CMOS scaling improvements.
The phrase “over the long term” is important. Wentzlaff and Fuchs’ research shows that it’s not unusual for workload performance to improve dramatically when accelerators are initially deployed. Over time, as methods for accelerating a given workload are explored and best practices are established, researchers converge on the best approaches available. The problems that tend to respond well to accelerators are those that are well-defined, parallelizable (think GPU workloads), and exist within a mature, well-studied domain. But this also means that the same traits that make a problem amenable to acceleration limit the total long-term advantage of accelerating it. The team dubs this the “accelerator wall.”
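The intuition behind such a wall resembles a classic Amdahl’s-law bound (our analogy for illustration, not the paper’s model): once the accelerable fraction of a workload is fixed, making the accelerator faster yields rapidly diminishing overall returns.

```python
# Amdahl's-law-style analogy for the "accelerator wall" (not the
# paper's model). If only a fraction f of a workload benefits from
# the accelerator, total speedup is capped at 1 / (1 - f) no matter
# how fast the accelerated portion becomes.

def overall_speedup(f, accel_speedup):
    """Whole-workload speedup when fraction f runs accel_speedup-times faster."""
    return 1.0 / ((1.0 - f) + f / accel_speedup)

f = 0.9  # hypothetical: 90% of the workload is accelerable
for s in (10, 100, 1000):
    print(s, round(overall_speedup(f, s), 2))
# Gains flatten as they approach the 1/(1-f) = 10x ceiling.
```

A 100x-faster accelerator here buys barely more than a 10x-faster one, which is why converging on near-optimal designs in a mature domain leaves so little headroom.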
The HPC market may have had a sense of this for quite some time. Back in 2013, we wrote a story about the difficult road to exascale for mainstream supercomputers. Even then, the TOP500 was predicting that accelerators would deliver a one-time jump in the performance rankings, not a faster ongoing rate of improvement.
But the implications of these findings go beyond the HPC market. Examining GPUs, for example, Wentzlaff and Fuchs found that the improvements attributable specifically to non-CMOS factors were quite small.
Figure 5 shows the gains in absolute GPU performance (with CMOS advances included) and those improvements attributable strictly to advances in CSR. CSR can be loosely thought of as the improvements that are left when advances in underlying CMOS technology are stripped out of a GPU’s design.
Figure 6 makes the relationship a bit more clear:
A decrease in CSR doesn’t mean that a later GPU is slower, in absolute terms, than an earlier model. According to Adi Fuchs:
CSR normalizes gains “per CMOS potential”, and that “potential” takes into account transistor counts, as well as different speeds, power/area/energy efficiencies, etc. (across CMOS generations). In figure 6, we approximated an apples-to-apples comparison of “Architecture+CMOS node” combinations by triangulating all benchmarked applications shared between combinations, and applying transitive relations across combos that do not share enough applications (i.e., fewer than five).
An intuitive way to approach this analysis is to view figure 6(a) as “what the engineers and managers see” and figure 6(b) as “what we see, when weeding out CMOS potential.” I can speculate and say that you care more about whether your chip outperforms its predecessor than about whether that is due to ‘better transistors’ or ‘better X’ (where X is the set of specialization-stack improvements that form CSR).
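The “transitive relations” Fuchs describes can be sketched roughly as follows (a simplified illustration with hypothetical numbers and helper names, not the paper’s actual method): when two chip+node combinations share too few benchmarked applications for a direct comparison, their relative performance is estimated by chaining through an intermediate combination they both share benchmarks with.

```python
from math import prod

# Hypothetical per-benchmark speedup ratios. Combos A and C are
# assumed to share too few applications (fewer than five) for a
# direct comparison, so we chain through intermediate combo B.

def geomean_speedup(ratios):
    """Geometric mean of per-benchmark speedup ratios."""
    return prod(ratios) ** (1.0 / len(ratios))

b_over_a = [1.8, 2.2, 2.0]  # shared benchmarks between A and B
c_over_b = [1.4, 1.6, 1.5]  # shared benchmarks between B and C

# Transitive estimate: (C over A) = (C over B) * (B over A)
c_over_a = geomean_speedup(c_over_b) * geomean_speedup(b_over_a)
print(c_over_a)
```

Chaining like this lets every architecture+node combination be placed on a single comparison scale, at the cost of some added noise per hop.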
GPUs are a well-established, mature, and specialized market, and both AMD and Nvidia have every reason to one-up each other with improved designs. Despite this, the majority of performance improvements have come from CMOS-related factors, not from CSR.
The FPGAs and hardware video decoder blocks the researchers examined fit the same pattern, even if the relative expected gains over time were larger or smaller depending on market maturity. The same characteristics that make a field respond well to acceleration ultimately constrain accelerators’ ability to keep improving performance. Of GPUs, Fuchs and Wentzlaff write: “While GPU graphics frame rate improved by a rate of 16x, we project further performance and energy efficiency improvements of 1.4 − 2.5x and 1.4 − 1.7x, respectively.” If that projection holds, there may not be much headroom left for AMD and Nvidia to ramp performance through specialization (CSR) improvements alone.
The implications of this work are significant. It predicts domain-specific architectures will not continue to deliver significant improvements in performance once Moore’s law scaling has broken down. Even if chip designers are able to focus more tightly on improving performance in fixed transistor budgets, such gains are intrinsically limited by diminishing marginal returns for well-understood problems.
Wentzlaff and Fuchs’ work points to a need for a fundamentally new approach to computing. Intel’s MESO architecture is one potential alternative. Fuchs and Wentzlaff have also suggested using non-CMOS materials and other types of beyond-CMOS specialization, including exploring non-volatile emerging memory arrays as a type of workload accelerator. You can read more about the team’s effort in that domain here.
- Intel’s Fundamentally New MESO Architecture Could Arrive in a Few Years
- Facebook is Working on Its Own Custom AI Silicon
- Google Announces 8x Faster TPU 3.0 For AI, Machine Learning