The rebirth of artificial intelligence over the past five years has led to rapid progress in challenging areas such as computer vision and speech recognition. As computers begin to learn about the world around them, this is in turn opening up new possibilities in fields such as healthcare, transportation and robotics.
“Machine learning is one of the most important computer revolutions ever,” Nvidia CEO Jensen Huang said at last week’s annual GPU Technology Conference. “Computers are learning by themselves.”
The emergence of deep learning is the result of three factors: smarter algorithms, lots of data, and the use of GPUs to speed up training. The world’s largest cloud companies are increasingly relying on GPUs to develop their own models and making this infrastructure available to their customers. It’s little surprise that Nvidia’s sales to datacenters for GPU computing nearly tripled year-over-year in its most recent quarter.
But the company’s ambitions go beyond selling GPUs. Nvidia’s strategy, as Huang put it, is to provide the most productive hardware and software platform for deep learning. And at the GPU Technology Conference, Nvidia made a compelling case.
The most significant announcement was the first GPU based on Nvidia’s new Volta architecture. Like the current Pascal GP100, the GV100 is designed for both high-performance computing and deep learning workloads. But the similarities end there.
The GV100 is manufactured on a more advanced process, foundry TSMC’s 12nm recipe, and is a much larger chip with an astounding 21.1 billion transistors on a die measuring 815 square millimeters. By comparison, the 16nm GP100 has 15.3 billion transistors on a die that measures 610 square millimeters. The Tesla V100 is so big and complex that Huang said it is perhaps the most expensive chip ever built. “If anyone would like to buy this, it’s approximately $3 billion,” he joked while holding up what he said was the first one back from the foundry.
It also has a new architecture that consists of up to 84 Streaming Multiprocessors (SMs)–each with 64 single-precision (FP32) floating-point units, 64 integer units, and 32 double-precision (FP64) floating-point units–for a total of 5,376 FP32 CUDA cores and 2,688 FP64 CUDA cores. The GV100 also includes a new type of core, called a Tensor Core, that is designed specifically to speed up the kind of matrix math used heavily in deep learning. Each SM has eight of these for a total of 672 Tensor Cores.
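Nvidia’s Volta materials describe each Tensor Core as performing a small 4×4 matrix multiply-accumulate every clock, with half-precision (FP16) inputs and single-precision (FP32) accumulation. A rough NumPy sketch of that primitive (an illustration of the math only, not of how the hardware is programmed; the 4×4 shapes and precisions come from Nvidia’s architecture documentation rather than the keynote itself):

```python
import numpy as np

# One Tensor Core operation per clock: D = A x B + C, where A and B are
# half-precision 4x4 matrices and accumulation happens in single precision.
A = np.random.rand(4, 4).astype(np.float16)
B = np.random.rand(4, 4).astype(np.float16)
C = np.random.rand(4, 4).astype(np.float32)

# FP16 inputs are promoted and the products are summed in FP32.
D = A.astype(np.float32) @ B.astype(np.float32) + C

print(D.shape, D.dtype)  # (4, 4) float32
```

Deep learning frameworks tile the large matrix multiplications inside convolutional and fully connected layers into many such small operations, which is why a single primitive like this can cover so much of training time.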
The first product based on this GPU is the Tesla V100, which has 80 active SMs, for a total of 5,120 FP32 CUDA cores and 640 Tensor Cores. (It’s common for large chips such as GPUs to use most–but not all–of the resources on the die to improve yields.) It also has enhanced shared memory, 16GB of Samsung’s HBM2 stacked memory with 900GBps of bandwidth (50 percent more throughput than the P100 when combined with other efficiencies), and an updated version of the NVLink interconnect with six links, each delivering 25GBps in each direction.
Despite the much larger size and greater number of CUDA cores, the V100 runs at around the same speed, with a peak frequency of 1,455MHz, and in the same 300-watt envelope. The result is a GPU with significantly higher peak performance. The Tesla V100 is capable of 7.5 teraflops (trillions of floating-point operations per second) at double precision, 15 teraflops at single precision and 120 mixed-precision “Tensor teraflops” for deep learning.
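Those peak numbers follow directly from the core counts and the clock. Assuming the standard accounting of one fused multiply-add (two floating-point operations) per CUDA core per clock, and 64 multiply-accumulates (128 operations) per Tensor Core per clock, a quick back-of-the-envelope check reproduces Nvidia’s figures (the 2,560 FP64 core count is simply 80 SMs times 32):

```python
# Back-of-the-envelope check of the quoted peak rates. The ops-per-clock
# assumptions below are the standard way peak flops are counted, not
# figures from the announcement itself.
clock_hz = 1.455e9            # peak boost clock

fp32_cores   = 5_120          # 80 SMs x 64
fp64_cores   = 2_560          # 80 SMs x 32
tensor_cores = 640            # 80 SMs x 8

fp32_tflops   = fp32_cores   * 2   * clock_hz / 1e12  # ~14.9, quoted as 15
fp64_tflops   = fp64_cores   * 2   * clock_hz / 1e12  # ~7.4,  quoted as 7.5
tensor_tflops = tensor_cores * 128 * clock_hz / 1e12  # ~119,  quoted as 120

print(f"FP32 {fp32_tflops:.1f}, FP64 {fp64_tflops:.1f}, Tensor {tensor_tflops:.0f} teraflops")
```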
To illustrate the power of the V100, Nvidia showed demonstrations of the graphics (a scene from Kingsglaive: Final Fantasy XV, a Japanese sci-fi film developed by Square Enix’s CGI studio using its Luminous gaming engine), high-performance computing (a simulation of one million stars in the Milky Way and Andromeda galaxies) and deep learning (Adobe and Cornell University’s Deep Photo Style Transfer) capabilities.
Nvidia said the Tesla V100 delivers up to a 12x speed-up on tensor operations for deep learning training. For example, training Microsoft’s ResNet-50 for object recognition in images and videos using the Caffe 2 framework with eight GPUs can be cut from two days on the Tesla K80, or 20 hours on the current P100, to less than a single workday with the V100. The V100 can train an LSTM recurrent neural network for speech recognition using Apache MXNet, the open-source framework of choice for Amazon Web Services, in a few hours.
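For a sense of what that MXNet workload looks like in code, here is a minimal sketch of an LSTM acoustic model in MXNet’s Gluon API; the layer sizes, optimizer and data shapes below are placeholders, since Nvidia did not disclose the benchmark’s actual model or hyperparameters:

```python
import mxnet as mx
from mxnet import gluon, autograd, nd

# Placeholder dimensions: batch of 64 utterances, 100 frames each,
# 40 filterbank features per frame, 29 output symbols.
batch, seq_len, n_feats, n_classes = 64, 100, 40, 29

ctx = mx.gpu(0) if mx.context.num_gpus() > 0 else mx.cpu()

# Two-layer LSTM over the feature frames, with a per-timestep classifier.
net = gluon.nn.Sequential()
net.add(gluon.rnn.LSTM(hidden_size=512, num_layers=2, layout='NTC'))
net.add(gluon.nn.Dense(n_classes, flatten=False))
net.initialize(mx.init.Xavier(), ctx=ctx)

loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()
trainer = gluon.Trainer(net.collect_params(), 'adam', {'learning_rate': 1e-3})

# One training step on random stand-in data.
x = nd.random.normal(shape=(batch, seq_len, n_feats), ctx=ctx)
y = nd.random.uniform(0, n_classes, shape=(batch, seq_len), ctx=ctx).floor()
with autograd.record():
    logits = net(x)            # (batch, seq_len, n_classes)
    loss = loss_fn(logits, y)
loss.backward()
trainer.step(batch)
print(loss.mean().asscalar())
```

The model definition is the same whether it runs on a K80, a P100 or a V100; the claimed speed-ups come from the hardware and the underlying libraries rather than from the network code itself.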
The first systems to offer the Tesla V100, starting sometime in the third quarter, will be Nvidia’s own deep learning appliances. The first is an update to the existing DGX-1, dubbed the DGX-1V, which is identical to the current model–including the dual 20-core 2.2GHz Xeon E5-2698 v4 processors–but swaps the eight Pascal P100 GPUs for eight V100s. This boosts the peak performance for deep learning training from 170 teraflops to 960 teraflops at half-precision in the same power envelope (3,200 watts), bringing the $149,000 DGX-1V tantalizingly close to a petaflop in a box (albeit not for HPC applications, which require a higher level of precision).
Nvidia also announced a new personal DGX Station with a single Xeon E5-2698 v4, four Tesla V100s and 256GB of DDR4 memory, capable of 480 teraflops at half-precision. The $69,000 DGX Station is water-cooled and draws about half the power of the DGX-1V. These will be followed by Tesla V100 GPU servers from the usual suspects–Cisco, Dell, Hewlett Packard Enterprise, IBM and Lenovo–starting in the fourth quarter.
At this point every single major cloud provider in the world offers servers with Nvidia GPUs. In a sign of just how important cloud service providers have become–both as customers using GPU servers for internal applications and as infrastructure providers–Matt Wood, GM of Deep Learning and AI for AWS, and Jason Zander, Corporate Vice President of Microsoft Azure, appeared on stage with Huang. Amazon announced it will offer AWS instances with Volta GPUs to deliver a 3x speed-up on LSTM training, and it has partnered with Nvidia’s Deep Learning Institute to train more than 100,000 developers on AWS infrastructure. Microsoft has been working with Nvidia on its HGX-1 JBOG (Just a Bunch Of GPUs) with eight Tesla P100s, announced at this year’s Open Compute Summit, which will also be upgraded to the V100.
To make it easier to develop models on the DGX-1 or DGX Station and then “burst into the cloud,” Nvidia has containerized the entire software stack (GPU drivers and CUDA, deep learning frameworks and libraries) and announced the Nvidia GPU Cloud (NGC), a hybrid cloud platform that makes it easier to run workloads on premises or in the cloud. The NGC Software Stack runs on a PC (with a Titan X or GeForce GTX 1080 Ti card), on a DGX system or directly from the cloud through a browser. The NGC will enter public beta in July; Nvidia has not yet announced pricing.
While Nvidia has amassed a significant lead in training, most of the inferencing, or running the models, is still done on Xeon CPUs. Training is more compute intensive–requiring massive parallelism–but inferencing requires far more chips to support millions of users, so it’s not surprising that Nvidia is now going after it in a more serious way.
The company announced a version of the Tesla V100 on a 150-watt PCI-Express card for commodity servers, along with TensorRT, a runtime that optimizes models built with Caffe and TensorFlow for Volta GPUs. Nvidia says its solution will deliver 15 to 25 times better inferencing performance than Intel’s upcoming Skylake-based Xeon Scalable processors. It showed data demonstrating that on ResNet-50 using the Caffe framework, the V100 would provide significantly higher throughput and lower latency than both its current GPUs and the Broadwell and Skylake Xeons. Huang said that with the V100, a customer that requires 300,000 inferences per second can replace an entire row of 500 servers (with 1,000 Xeon processors) with 33 servers equipped with Tesla V100 GPUs and get the same throughput.
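The server math behind that claim is easy to sanity-check. Using only the figures quoted on stage (the per-node throughputs below are derived from them, not numbers Nvidia gave):

```python
# Sanity check on the consolidation claim: 300,000 inferences per second
# served by either 500 dual-Xeon servers or 33 Tesla V100 servers.
target_ips = 300_000

cpu_servers, gpu_servers = 500, 33

cpu_ips_per_server = target_ips / cpu_servers  # ~600 inferences/sec per dual-Xeon node
gpu_ips_per_server = target_ips / gpu_servers  # ~9,100 inferences/sec per V100 node

print(f"per server: CPU ~{cpu_ips_per_server:,.0f}, GPU ~{gpu_ips_per_server:,.0f} inferences/sec")
print(f"implied speed-up: ~{gpu_ips_per_server / cpu_ips_per_server:.0f}x")
```

The implied per-server speed-up of roughly 15x sits at the low end of the 15-to-25x range Nvidia quotes against Skylake.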
In making the case for the “accelerated datacenter,” Nvidia is also positioning heterogeneous systems using GPUs, FPGAs and other accelerators as part of broader industry trends. Moore’s Law has had an incredible run, but in the past few years it has started to slow. “We’re now up against the laws of semiconductor physics, there’s only so far that we can push,” Huang said. “We need to find a path forward to life after Moore’s Law.”
At the same time, the complexity of algorithms is “exploding.” At the extreme, Google’s Neural Machine Translation system has 8.7 billion parameters and requires 105 exaflops of total compute to complete training (an exaflop is 1,000 petaflops, more than ten times the peak performance of the world’s fastest supercomputer). Huang said it would take one CPU server two years to run through this network once. “The amount of computation necessary is just incredible,” he said. “That’s one of the reasons we need to continue to push the living daylights out of computing.”
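Huang’s two-year figure is consistent with the quoted total, working backward and reading “105 exaflops of compute” as 105×10^18 floating-point operations:

```python
# Working backward from the keynote numbers. The V100 line is a purely
# theoretical floor at peak tensor throughput; real training never runs
# at peak, so actual times would be considerably longer.
total_flops   = 105e18                 # "105 exaflops" of total compute
two_years_sec = 2 * 365 * 24 * 3600

implied_cpu_rate = total_flops / two_years_sec      # ~1.7e12 flop/s sustained
hours_at_v100_peak = total_flops / 120e12 / 3600    # at 120 tensor teraflops

print(f"implied CPU rate: {implied_cpu_rate / 1e12:.1f} teraflops sustained")
print(f"theoretical floor on one V100 at peak: ~{hours_at_v100_peak:.0f} hours")
```

A sustained rate of a couple of teraflops is plausible for a dual-socket Xeon server, which is why a single node needs years for a model of this size.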
Today these models are used mostly by the large cloud service providers for specific workloads, but Nvidia believes that in the future most software will take advantage of AI in some way. To illustrate this, it showed how an autoencoder can improve the performance of ray tracing by teaching a network to automatically fill in the spots that haven’t been rendered yet by looking at the objects around them.
As these kinds of models become larger and more prevalent, the industry will need to come up with new ways to go faster. This was the impetus behind GPU computing and the CUDA architecture, Huang said. “It is very clear that as Moore’s Law comes to an end, accelerated computing is a wonderful path forward.”
All of this is clearly resonating with developers and users. More than $5 billion was invested in AI start-ups last year, and Nvidia counts more than 500,000 GPU developers and one million CUDA downloads in the past year. Conference attendance has grown five-fold over the past three years and this year included the 15 largest tech companies in the world, all of the top 10 automakers, and a long list of healthcare and pharmaceutical companies. Of the 600 sessions at this year’s conference, 60 percent were related to artificial intelligence.
The rapid growth and investment is attracting a lot of new competition. Intel is attacking AI on multiple fronts including new Xeon processors with wider AVX-512 instructions, the many-core Xeon Phi processors, the $16.7 billion acquisition of Altera (whose FPGAs are widely used by Microsoft to accelerate its server workloads), and smaller deals for Nervana and Movidius. AMD just announced a Radeon Vega Frontier Edition graphics card that should rival the Tesla P100 on AI workloads. Google is making its own chips, and just announced a new version of its TPU (Tensor Processing Unit) with the chops to handle training, as well as inferencing. And any number of smaller companies including Graphcore, KnuEdge and Wave Computing are also vying for a piece of this market. But Nvidia has amassed a sizable lead, and is doing an impressive job of evolving from selling GPUs to developing the systems and software to power emerging workloads.