Chipmaking has never been easy. But for much of the last 50 years, the recipe was relatively straightforward: shrink the transistors at a steady pace, and use the increased density to either boost performance (and reduce power) or add new features. Scaling hasn't stopped (yet), but it is taking significantly longer to get to each new node, costs are increasing, and the payback in performance and power is diminishing.
As a result, chip companies are looking for new ways to deliver more bang for the buck through increased parallelization (more cores), better power management, more memory capacity and bandwidth, faster interconnects and the use of off-chip accelerators. (I recently wrote about accelerators designed specifically for computer vision.) All of these trends have been on display in the recent processor announcements.
AMD gets back in the game with Zen
For example, at Hot Chips last month, the headliner was AMD’s presentation on its new “Zen” microarchitecture, a ground-up redesign that promises to make its processors competitive with Intel’s fastest silicon once again. AMD says the goal was 40 percent more instructions per clock at the same power. Zen can also scale all the way from thin-and-light laptops to high-performance servers, eliminating the need for the separate low-power platform based on the Puma cores, an important step for a company with limited resources.
In the current Bulldozer architecture, two integer cores share a front-end to fetch and decode instructions, as well as a floating-point unit. The idea was to create room to pack more integer cores on a single die, but the design also created performance bottlenecks. With Zen, the cores no longer share resources. Each core has dual schedulers (one for integer and the other for floating-point) and its own floating-point unit.
Significantly, Zen is the first AMD processor to support true simultaneous multi-threading (SMT) with two threads per core. The new design has a number of other features designed to improve performance including a micro-op cache and better branch prediction. Finally, it supports a number of new instructions–some that are already included in Intel Core processors plus a few AMD exclusives–though it isn’t releasing details on all of them yet.
One of the most important changes from a performance perspective is larger and faster caches to feed these improved cores. Each Zen core has a 64KB L1 instruction cache and a 32KB L1 data cache, 512KB of L2 cache, and 2MB of L3 cache. The L1 and L2 caches have twice the bandwidth and the L3 cache has up to five times the bandwidth of the previous generation, according to AMD.
As with all new chip designs, Zen also enhances power management. The more advanced process with less leaky FinFET transistors no doubt helps out a lot here, but the new design also includes much finer control of the frequency and power across the die.
At a higher level, the Zen processors are organized into modules, each of which has four cores and a total of 8MB of L3 cache. These compute modules communicate with each other through a proprietary fabric that connects the CPUs, memory controller and IO, though AMD has so far said very little about this (perhaps it is a derivative of the SeaMicro fabric from the $334 million acquisition back in 2012).
The first product, the “Summit Ridge” processor, will have two of these four-core modules for a total of 8 physical cores (16 threads) and 16MB of L3 cache. It will replace the current FX Series CPUs targeted at enthusiast desktops. This will be followed by “Naples,” a server processor with up to 32 cores (64 threads) designed to return AMD to the mainstream, two-socket server market.
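For those keeping score, the per-core figures above add up neatly. Here is a quick back-of-the-envelope sketch in Python, using only the numbers AMD has disclosed:

```python
# Back-of-the-envelope totals for Zen, using the per-core figures AMD has quoted.
KB, MB = 1024, 1024 * 1024

l2_per_core = 512 * KB        # 512KB of private L2 per core
l3_per_core = 2 * MB          # 2MB of shared L3 per core
threads_per_core = 2          # SMT: two threads per core
cores_per_module = 4          # one four-core module

modules = 2                                    # Summit Ridge: two modules
cores = modules * cores_per_module             # 8 physical cores
threads = cores * threads_per_core             # 16 threads
l3_total = cores * l3_per_core                 # 16MB of L3 (8MB per module)
l2_total = cores * l2_per_core                 # 4MB of L2 across the chip

print(f"{cores} cores / {threads} threads, {l3_total // MB}MB L3, {l2_total // MB}MB L2")
```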
To be sure, Zen will also get a big boost from the shift to GlobalFoundries' 14nm process with FinFET transistors, but it's important to note how long this transition has taken. AMD's current FX desktop processors are based on the Vishera platform with Piledriver cores, which is four years old, and the 32nm manufacturing process is even older, dating from the original Zambezi design with Bulldozer CPU cores. (Intel's first 32nm Sandy Bridge processors were released in January 2011.)
Zen is a significant accomplishment, but the Hot Chips announcement leaves several big question marks.
The first is how quickly it will ramp. Though AMD's official roadmap continues to show Summit Ridge as a 2016 product, the reality is that the first Zen processors won't be available until the first quarter of 2017. And while FX desktop CPUs are good for bragging rights, other products will almost certainly be more important to AMD's future. The timing and performance of Naples will determine whether AMD can be relevant once again in the enterprise, where its server market share has dwindled to close to zero. Moreover, the Zen versions of its APUs, which combine CPU cores and on-die Radeon graphics, aren't even on the official roadmap yet; instead, AMD is shipping the stopgap "Bristol Ridge," with up to four Excavator cores, under the A10, A12 and FX brands for the foreseeable future. These APUs go into the high-volume desktop and laptop segments, so a timely shift to 14nm and the Zen microarchitecture will be important.
The second question is how well it will really perform. Based on what we know about the process and architecture, it is almost certainly more competitive. But AMD's hand-picked Blender rendering test–which essentially showed a tie with a Core i7-6900K Broadwell clocked at 3GHz–tells us little about how it will really stack up on mainstream client and server workloads. It is worth noting that AMD has been talking about a three- to five-year strategy, with several generations of Zen+ processors, to be successful in the datacenter, which suggests that while Zen may narrow the gap with Intel, it isn't likely to take the performance crown in 2017.
Intel fine-tunes 14nm processors
Intel did not break any news at Hot Chips, but a week later it announced its first Kaby Lake seventh-generation Core processors. Kaby Lake does not use a new process node–Intel has delayed 10nm to late 2017–but it does use an enhanced version of the existing 14nm process that the company says delivers 12 percent better transistor performance. The overall design is the same, but Intel has added hardware decoding of high-resolution HEVC in 10-bit color (Netflix) as well as Google VP9 (YouTube), which it says significantly increases battery life when playing 4K video.
Intel had previously stated that Kaby Lake processors had started shipping to customers, but we now have the details. The first wave includes three 4.5-watt Y series processors and three 15-watt U series chips–all based on the same die with two CPU cores (four threads) and Intel HD on-die graphics with 24 execution units–for 2-in-1s and thin laptops. Intel said the first systems will be available this month, with more than 100 designs in the market by the holidays. Kaby Lake processors for larger notebooks and desktops, business PCs and workstations will follow in 2017. The first 10nm Cannon Lake processors are due in late 2017, though whether we'll see desktop parts at that node anytime soon is unclear.
IBM’s “post-Moore’s Law” Power9 processor
Like AMD, IBM is hoping that a new version of its high-performance Power processor, announced at Hot Chips, will be more competitive in the mainstream server market. This will be the first Power processor manufactured not on IBM's own process on SOI (silicon-on-insulator) wafers, but rather by GlobalFoundries using standard silicon wafers and Samsung's 14nm process with FinFET transistors. The CPU core has been redesigned with a new instruction set (Power ISA 3.0), better branch prediction and a more efficient pipeline, delivering 1.5 to 2.5 times better throughput than Power8 at the same frequency. Power9 will be available with either 12 SMT8 cores with eight threads each or 24 SMT4 cores with four threads each–for a total of 96 threads in both versions. The SMT4 core has a 32KB L1 instruction cache and a 32KB L1 data cache; the SMT8 core consists of two of these slices put together. Both versions also have 6MB of total L2 cache and 120MB of L3 cache using IBM's embedded DRAM.
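The two configurations sound quite different, but the thread math works out the same. A quick sketch, using only IBM's stated figures:

```python
# Power9 ships in two core configurations, but both land at 96 threads per chip.
for name, cores, threads_per_core in [("SMT8", 12, 8), ("SMT4", 24, 4)]:
    print(f"{name}: {cores} cores x {threads_per_core} threads = {cores * threads_per_core} threads")
# Both versions also share 6MB of total L2 and 120MB of eDRAM L3, per IBM.
```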
Both the SMT8 and SMT4 are available in either scale-up or scale-out versions. The scale-up version, like the Power8, uses up to eight memory buffer chips to increase the capacity and bandwidth. Each buffer chip has 16MB of eDRAM (for a total of 128MB of L4 cache) and connects to four DDR4-1600 modules for a total of 8TB per socket with sustained bandwidth of 230GBps. This version will compete primarily with Intel’s Xeon E7 and Oracle’s SPARC processors in multi-socket servers for workloads that require the maximum performance per core and memory bandwidth. The scale-out version has eight standard DDR4 memory channels for a total of up to 4TB of memory per socket with sustained throughput of 120GBps. It will compete primarily with the Xeon E5 in two-socket systems that can scale to hundreds or thousands of nodes in a large cluster to handle parallel workloads such as web servers.
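The capacity and bandwidth claims follow from the topology IBM described. A rough sketch of the arithmetic (the 256GB-per-module figure is simply what IBM's 8TB total implies, and the DDR4-2400 peak rate for the scale-out version is my assumption, not an IBM number):

```python
TB_IN_GB = 1024

# Scale-up (buffered) memory: 8 buffer chips per socket, each fanning out to 4 DDR4 modules.
buffer_chips = 8
modules_per_buffer = 4
total_modules = buffer_chips * modules_per_buffer            # 32 modules per socket
per_module_gb = 8 * TB_IN_GB / total_modules                 # 8TB total implies 256GB modules
l4_cache_mb = buffer_chips * 16                              # 16MB of eDRAM per buffer = 128MB L4

# Scale-out (direct-attach) memory: 8 standard DDR4 channels per socket.
# Assumes DDR4-2400 (19.2GB/s peak per channel); IBM quotes 120GB/s sustained.
channels = 8
peak_total = channels * 19.2                                 # ~154GB/s theoretical peak

print(f"scale-up:  {total_modules} modules x {per_module_gb:.0f}GB, {l4_cache_mb}MB of L4 cache")
print(f"scale-out: {peak_total:.0f}GB/s peak vs 120GB/s sustained (IBM's figure)")
```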
IBM has been trying to broaden the market through its OpenPower Foundation, which allows other companies to license the technology and develop their own servers. Since the scale-out Power9 is compatible with commodity hardware (no buffer chips), it should be easier to use in third-party designs and less expensive, which should make it more attractive to OpenPower partners.
At Hot Chips, IBM's Brian Thompto described the Power9 as a "platform for accelerated computing in the post-Moore's Law era" because it is designed for heterogeneous computing with GPUs or FPGAs to offload tasks from the CPU and boost performance. The Power9 is likely to be one of the first chips to support PCI-Express 4.0, which at 16Gbps has twice the bandwidth per lane of the current specification. With 48 lanes, the Power9 has a total of 192GBps of bidirectional bandwidth. This can be used with any PCIe accelerator, but IBM's cache-coherent CAPI 2.0 interface also runs over these lanes for connecting ASICs and FPGAs. Power9 will also introduce IBM's BlueLink physical interface, which runs at 25Gbps per lane. The scale-out version has 48 lanes, for a total of 300GBps of bidirectional bandwidth, and the scale-up version reportedly has 96 lanes, yielding 600GBps in both directions. BlueLink can be used with Nvidia's NVLink 2.0 to attach Tesla GPUs or with a version of CAPI for ASICs and FPGAs, but it could eventually also be used to attach large pools of persistent memory, or what IBM calls "storage-class memory," directly to the CPUs.
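The aggregate numbers fall straight out of the per-lane rates. The sketch below uses raw line rates and doubles the one-way figure, which appears to be how IBM is quoting its bidirectional totals; encoding overhead is ignored:

```python
def bidirectional_gbs(lanes: int, gbps_per_lane: float) -> float:
    """Aggregate bidirectional bandwidth in GB/s from lane count and raw per-lane rate."""
    one_way = lanes * gbps_per_lane / 8   # Gbit/s -> GB/s, one direction
    return 2 * one_way                    # count both directions

print(bidirectional_gbs(48, 16))   # PCIe 4.0:           48 x 16Gbps -> 192.0 GB/s
print(bidirectional_gbs(48, 25))   # BlueLink scale-out: 48 x 25Gbps -> 300.0 GB/s
print(bidirectional_gbs(96, 25))   # BlueLink scale-up:  96 x 25Gbps -> 600.0 GB/s
```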
In comparison to a standard Xeon or Opteron server using PCIe 3.0 to connect with accelerators (including Nvidia's own DGX-1), a Power8 server with NVLink delivers five times the performance, according to IBM. (The company just announced a new version of its S822LC 1U server that uses this setup.) The move to NVLink 2.0 over BlueLink will make the Power9 servers seven to 10 times faster, IBM said, though by the time Power9 is available in the second half of 2017, Intel is likely to be shipping new Xeon E5 and E7 processors.
IBM has already notched two big supercomputer wins for Power9. Oak Ridge's Summit system will have about 3,400 nodes–each with multiple Power9 processors, multiple next-generation Tesla GPUs connected over NVLink, 500GB of high-bandwidth DDR4 memory and 800GB of non-volatile memory. It is expected to reach peak performance of around 200 petaflops. Lawrence Livermore hasn't said as much about its Sierra system, but we know it will also use Power9 and a future Nvidia Tesla GPU, and have a total of 2.0 to 2.4 petabytes of memory to deliver up to 150 petaflops. Both are expected to go online in 2017-2018.
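Divide those totals out and the per-node target becomes clear. A quick back-of-the-envelope calculation (my arithmetic from the publicly quoted figures, not an Oak Ridge specification):

```python
# Rough per-node math for Summit, using the publicly quoted totals.
nodes = 3400
peak_petaflops = 200

per_node_tflops = peak_petaflops * 1000 / nodes   # ~59 TFLOPS per node
print(f"~{per_node_tflops:.0f} TFLOPS per node, most of it from the node's GPUs")
```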
Google, which is a member of OpenPower and already uses some Power8 servers in its data centers, is working with Rackspace to develop a “Zaius” rack server with two of the 24-core (SMT4) scale-out Power9 processors that will fit in a standard 19-inch rack. Other companies are likely to build Power9 servers based on this open design, and chip companies, such as China’s PowerCore, are designing their own 10nm and 7nm Power processors, which could be available starting around 2018-2019.
Solving the memory bottleneck
One of the biggest bottlenecks to increased system performance is memory, which is why there is so much innovation in this area. Desktops and servers have shifted to DDR4 system memory, but the roadmap beyond that is unclear. The industry has not settled on standards for DDR5, and it isn't likely to arrive until 2018-2019.
One alternative is to create stacks of DRAM chips on top of a logic controller packaged alongside the processor on a silicon interposer. SK Hynix and Samsung are both manufacturing these, and at Hot Chips the two companies talked about the roadmap for High-Bandwidth Memory. The current HBM2 devices use 8Gb (1GB) chips in stacks of up to eight with a maximum bandwidth of 256GBps. In its presentation, Samsung said HBM3 stacks will use more than eight 16Gb or denser chips with more than twice the bandwidth. AMD's Radeon R9 Fury series and Nvidia's Tesla P100 GPUs already use HBM, but many graphics cards continue to use graphics (GDDR) memory because HBM is so expensive. Samsung talked about creating a low-cost version of HBM by using fewer through-silicon vias (at the expense of some bandwidth), eliminating the separate controller chip, and shifting from silicon to cheaper organic substrates. SK Hynix also talked about making HBM accessible to more applications, without providing details.
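For reference, the per-stack figures quoted for HBM2 work out as follows (the 1,024-bit stack interface width is a standard HBM2 parameter I am assuming here, not something from the presentations):

```python
# HBM2 per-stack capacity and bandwidth, from the figures quoted above.
dies_per_stack = 8
gbit_per_die = 8
capacity_gb = dies_per_stack * gbit_per_die / 8        # 8GB per stack

bus_width_bits = 1024     # standard HBM2 stack interface (assumption)
gbps_per_pin = 2.0        # pin rate needed to hit the quoted 256GB/s
bandwidth_gbs = bus_width_bits * gbps_per_pin / 8      # 256GB/s per stack

print(f"{capacity_gb:.0f}GB per stack at {bandwidth_gbs:.0f}GB/s")
```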
Micron has taken a different approach, boosting the speed of graphics memory with GDDR5X (used in Nvidia’s GeForce GTX 1070 and 1080) while developing a more specialized Hybrid Memory Cube (HMC) for networking and storage applications that need maximum throughput and RAS (reliability, availability, and serviceability) features. For servers, Micron is focused on 3D XPoint memory, and at Hot Chips the company talked about how this new type of non-volatile memory will serve as a “far memory” in the hierarchy between PCIe-based SSDs using 3D NAND flash and DRAM system memory. Eventually Intel and Micron will put 3D XPoint on memory modules, or DIMMs, providing DRAM-like performance with the density (and cost) of storage for applications such as in-memory databases. This persistent memory or storage-class memory will require changes to operating systems and software, but it could significantly boost system performance.
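To give a flavor of the software changes involved: rather than issuing block I/O, applications would map storage-class memory directly into their address space and simply load and store into it. A minimal Python sketch of that access pattern, assuming a hypothetical DAX-style file at /mnt/pmem/data.bin (the path is illustrative, not a real product interface):

```python
# Minimal sketch: treating a (hypothetical) persistent-memory-backed file as
# byte-addressable memory via mmap, rather than going through block I/O.
import mmap
import os

PATH = "/mnt/pmem/data.bin"   # hypothetical DAX-mounted persistent memory file
SIZE = 4096

fd = os.open(PATH, os.O_RDWR | os.O_CREAT, 0o600)
os.ftruncate(fd, SIZE)

with mmap.mmap(fd, SIZE) as pm:
    pm[0:5] = b"hello"        # ordinary loads/stores, no read()/write() calls
    pm.flush()                # ask the OS to make the update durable
os.close(fd)
```

Real storage-class-memory programming involves more than this sketch shows (cache flushing and ordering guarantees, for instance), which is exactly the kind of operating system and library work the industry is now doing.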
Moore’s Law may be coming to an end, but chipmakers and system designers continue to find ways to deliver better performance and integrate more features. In fact, as the industry approaches the fundamental limits of silicon CMOS technology, it seems to be getting more creative, not less. Over the next few years we are likely to see some revolutionary changes to system architecture.