The idea of using GPUs for more than just fun and games is nothing new. It started with niche high-performance computing applications such as seismic data processing for oil and gas, fluid dynamics simulations and options pricing. But now Nvidia thinks it has found its killer app in the form of deep learning.
“I think we are going to realize looking back that one of the biggest things that ever happened is AI,” CEO Jen-Hsun Huang said in his opening keynote at this year’s GPU Technology Conference. “We think this is a new computing model, a fundamentally different approach to developing software.”
The combination of lots of data, better algorithms and powerful GPUs has led to a big bang in modern AI. In many cases, deep learning is now surpassing the capabilities of humans. Examples of this progress in the past year include Microsoft’s work on image recognition with the ImageNet database, Berkeley’s work on robotics, Baidu’s speech recognition services, and most recently Google DeepMind’s AlphaGo.
This is why Nvidia has gone “all in” on deep learning, as Huang said repeatedly. And no product is more indicative of that than the company’s new Tesla P100 GPU, a big bet that took three years, thousands of engineers and some $3 billion in investment.
The Tesla P100 isn’t the first Nvidia GPU to use an advanced 16nm manufacturing process with 3D FinFET transistors and the new Pascal architecture–the company announced its Drive PX 2 for self-driving cars at CES earlier this year–but it is by far the most complex with 15.3 billion transistors on a chip measuring 610 square millimeters. To make things a bit more challenging, it also includes four stacks of high-bandwidth memory–16GB in all–in the same package using foundry TSMC’s CoWoS (Chip-On-Wafer-On-Substrate) technology. “The odds of this working at all is approximately zero,” Huang joked.
The Tesla P100 is based on what is known internally as the GP100, a GPU with 60 streaming multiprocessors, of which 56 are enabled. Each SM has 64 FP32 (32-bit) CUDA cores and 32 FP64 (64-bit) CUDA cores clocked at 1.3GHz, though the GPU can also boost a bit higher. The result is peak performance of 10.6 teraflops single-precision and 5.3 teraflops double-precision. The 3,584 FP32 CUDA cores can also operate in FP16 half-precision mode, which is sufficient for most deep-learning tasks and pushes peak performance to 21.2 teraflops. That’s nearly 2.5 times the performance of the current Tesla K80, which is manufactured on a 28nm process (Nvidia skipped 20nm) and uses two Kepler GPUs. The GP100 also has more cache and 14MB of register files with significantly more aggregate bandwidth (80TB per second), which means it can handle larger jobs more efficiently.
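Those peak numbers fall straight out of the core counts and clock. The sketch below simply checks the arithmetic; the ~1.48GHz boost clock is an assumption inferred from the published peak figures (the article only quotes the 1.3GHz base clock), and a fused multiply-add is counted as two operations per cycle.

```cpp
// Back-of-the-envelope check of the quoted peak figures (a sketch, not an
// official formula). The ~1.48GHz boost clock is assumed from the published
// peaks; the article only quotes the 1.3GHz base clock.
#include <cstdio>

int main() {
    const int sms            = 56;     // enabled SMs on the Tesla P100
    const int fp32_per_sm    = 64;     // FP32 CUDA cores per SM
    const int fp64_per_sm    = 32;     // FP64 CUDA cores per SM
    const double boost_ghz   = 1.48;   // assumed boost clock
    const double ops_per_fma = 2.0;    // a fused multiply-add counts as two flops

    double fp32_tf = sms * fp32_per_sm * boost_ghz * ops_per_fma / 1e3;
    double fp64_tf = sms * fp64_per_sm * boost_ghz * ops_per_fma / 1e3;
    double fp16_tf = 2.0 * fp32_tf;    // packed half precision doubles the FP32 rate

    std::printf("FP32 %.1f TF, FP64 %.1f TF, FP16 %.1f TF\n",
                fp32_tf, fp64_tf, fp16_tf);   // ~10.6, ~5.3, ~21.2
    return 0;
}
```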
Rival AMD was the first to introduce High Bandwidth Memory (HBM) in its Radeon R9 Fury X consumer cards based on the 28nm Fiji GPU. But Nvidia is the first to use second-generation HBM, which delivers higher capacity and greater bandwidth (AMD had once planned to release a Polaris part with HBM2 this year, but an updated roadmap from the recent Game Developers Conference shows this has been pushed back to Vega in 2017). Each stack consists of four 8Gb (1GB) memory chips each with 5,000 TSVs (through-silicon vias) to connect them to each other and to the rest of the system. The Tesla P100 has four of these stacks for a total of 16GB with 720GB per second of peak bandwidth. HBM2 also supports error correction, a key requirement for many HPC applications.
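The memory figures follow a similar pattern. In the quick check below, the 1,024-bit interface per stack and the ~1.4Gb/s effective pin rate are assumptions chosen to be consistent with the quoted 720GB per second; only the stack count and die capacity come from the article.

```cpp
// Rough check of the HBM2 capacity and bandwidth figures (a sketch; the
// interface width and pin rate are assumptions, not numbers from the article).
#include <cstdio>

int main() {
    const int stacks          = 4;      // HBM2 stacks on the P100 package
    const int dies_per_stack  = 4;      // each die is 8Gb = 1GB
    const int bits_per_stack  = 1024;   // assumed HBM2 interface width per stack
    const double gbps_per_pin = 1.4;    // assumed effective data rate per pin

    int capacity_gb     = stacks * dies_per_stack;                    // 16GB
    double bandwidth_gb = stacks * bits_per_stack * gbps_per_pin / 8; // ~717GB/s

    std::printf("%d GB of HBM2, ~%.0f GB/s peak bandwidth\n",
                capacity_gb, bandwidth_gb);
    return 0;
}
```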
The Tesla P100 also uses a new interconnect called NVLink, which provides better performance than PCI-Express 3.0 in workstations or HPC clusters that use multiple GPUs (most deep-learning algorithms use four or eight GPUs for training). The Tesla P100 has four 40GB-per-second links for a total of 160GB per second of bidirectional bandwidth between GPUs. NVLink can also be used to connect the GPUs to IBM Power CPUs in servers (Nvidia is part of the OpenPower consortium). The GP100 also improves on the unified memory model introduced in CUDA 6 by allowing programs to access all of the CPU and GPU memory in the system as a single virtual address space, while maintaining coherency without a big performance hit.
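Unified memory is exposed through the existing CUDA runtime rather than a new API; the minimal sketch below shows the basic pattern of one managed allocation touched by both the CPU and the GPU. The code is generic CUDA, not P100-specific; what Pascal adds is that this model works across the full virtual address space with hardware page faulting, so treat this as a sketch of the programming model rather than of the new hardware path.

```cpp
// Minimal unified-memory sketch: a single pointer from cudaMallocManaged()
// is valid on both the CPU and the GPU, with migration and coherency handled
// behind the scenes. Generic CUDA; nothing here is P100-specific.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;                // GPU writes through the shared pointer
}

int main() {
    const int n = 1 << 20;
    float *data = nullptr;
    cudaMallocManaged(&data, n * sizeof(float)); // one allocation, visible to CPU and GPU

    for (int i = 0; i < n; ++i) data[i] = 1.0f;  // CPU initializes it in place

    scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f);
    cudaDeviceSynchronize();                     // wait for the GPU before reading back

    std::printf("data[0] = %.1f\n", data[0]);    // prints 2.0
    cudaFree(data);
    return 0;
}
```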
The Tesla P100 is already in volume production, and Nvidia has started delivering it to key hyperscale customers that build their own servers. Cray, Dell, Hewlett-Packard Enterprise and IBM are also building Tesla P100-based servers, which will be announced later this year and start shipping in the first quarter of 2017.
In the meantime, to get the Tesla P100 into the hands of the researchers who do much of the foundational work on deep learning, Nvidia has built its own server, which it is billing as the “world’s first deep learning supercomputer.” The DGX-1 is a 3U server with two 16-core Xeon CPUs, 512GB of memory, eight Tesla P100 GPUs, 7TB of solid-state storage, and dual 10Gbps Ethernet and 100Gbps InfiniBand ports. Nvidia says the DGX-1 is capable of 170 teraflops and that a full rack will deliver up to two petaflops, though it’s worth noting that Nvidia is talking about half precision (FP16) for deep learning here, which makes some of the comparisons with Xeon-only or older GPU servers a bit misleading. The bottom line is that Tesla P100-based servers like the DGX-1 can train models much faster. Nvidia is already taking orders, and the DGX-1 will start shipping in June for $129,000. The company also announced a partnership with Massachusetts General Hospital, which it said will use the DGX-1 to process 10 billion images to advance the hospital’s work in radiology, pathology and genomics.
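Those headline figures are easy to reconstruct from the GPU specs: eight P100s at 21.2 FP16 teraflops each works out to roughly 170 teraflops per DGX-1, while the same node would be about 85 teraflops at FP32, and the two-petaflop rack figure implies on the order of a dozen such nodes per rack.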
The Tesla P100 and DGX-1 join a product family that already includes the Tesla M40 for training and the more power-efficient M4 for execution (or inference), a business that Nvidia said has rapidly become its fastest-growing thanks to adoption by cloud service providers all over the world. During the keynote, the company also announced the Nvidia SDK, a single toolkit that combines all of its libraries: GameWorks; DesignWorks (ray-traced photorealistic images); VRWorks for using popular game engines to develop content for Oculus Rift and HTC Vive headsets; ComputeWorks; the IndeX plug-in for data visualization; and DriveWorks, a suite of libraries and algorithms for self-driving cars that is still under development. For deep learning, the key announcements were a new version of its training library, cuDNN 5, due in April, and GIE, short for GPU Inference Engine, which makes execution more energy efficient. Using GIE, a Kepler-based Jetson TK1 embedded board improved its performance from 4 images per second per watt to 24 images per second per watt. “This makes CUDA both the highest performance and most energy efficient approach to doing GPU computing,” Huang said.
Nvidia has clearly made inroads with the hyperscale guys. At the keynote, Baidu’s Bryan Catanzaro talked about the company’s work on end-to-end deep learning for speech recognition and said the Tesla P100 will enable it to handle models that are 30 times larger. Google’s Rajat Monga talked about TensorFlow, which the company already uses for some 1,200 applications and which is now available as open source on GitHub. But just how big a business all of this will be is the subject of debate. There are some 1,000 AI start-ups with $5 billion in funding. Amazon, IBM, Google and Microsoft are all rolling out AI as a service (and Salesforce just acquired MetaMind). And AI is now starting to make its way into industries such as retail, life sciences and automotive. Huang argues that AI won’t just be a big business, it will be part of every business. Whether that really happens over the next couple of years will depend in part on whether Nvidia can successfully deliver a hardware and software platform that makes deep learning faster, more efficient and easier.