Nvidia and Google are still kings of AI hardware
Written by James Orme Thu 11 Jul 2019
Nvidia and Google are producing some of the world’s fastest hardware for training machine learning models
The latest AI hardware from Nvidia and Google has set new records for the time taken to train machine learning models, according to new AI benchmarking results from the MLPerf consortium.
MLPerf is a benchmarking suite that measures the speed of machine learning software and hardware. The project was launched in May 2018 by a group of leading industry and academic figures, including Google, Baidu, Intel, AMD, and Harvard.
The latest training results come from version 0.6 of the suite, which measures hardware against six “active” benchmarks, or use categories. The first training-time benchmarks were released in December last year and showed that Nvidia and Google were producing the speediest AI chips. The latest results show both companies are still top of the pile: Intel, Google, and Nvidia were the companies that submitted to the latest round of testing.
Nvidia broke eight records in total, with the company’s DGX SuperPOD a standout performer. The on-premises AI supercomputer, loaded with 1,536 Nvidia V100 Tensor Core GPUs, trained the image recognition model ResNet-50 in just 80 seconds, slashing the eight hours the Nvidia DGX-1 system took to complete the same task in spring 2017.
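For scale, those two headline figures imply roughly a 360-fold reduction in wall-clock training time. A quick back-of-the-envelope check (the 8-hour figure is the approximate time reported for the 2017 DGX-1 run):

```python
# Sanity check of the ResNet-50 speedup implied by the MLPerf 0.6 results:
# ~8 hours on a DGX-1 (spring 2017) vs. 80 seconds on a 1,536-GPU DGX SuperPOD.
dgx1_seconds = 8 * 60 * 60      # ~8 hours, as reported for the DGX-1
superpod_seconds = 80           # MLPerf 0.6 DGX SuperPOD result

speedup = dgx1_seconds / superpod_seconds
print(f"Speedup: {speedup:.0f}x")  # -> Speedup: 360x
```

Note this compares systems of very different sizes (8 GPUs in the original DGX-1 versus 1,536 in the SuperPOD), so it reflects scale-out plus two years of hardware and software improvements, not a per-chip gain.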
The company also set a new 13.5-minute record for reinforcement learning with MiniGo, the open source implementation of the AlphaGo Zero model, using eight V100 GPUs.
“Supercomputers are now the essential instruments of AI, and AI leadership requires strong AI computing infrastructure,” Nvidia said.
“Our latest MLPerf results bring all these strands together, demonstrating the benefits of weaving our Nvidia V100 Tensor Core GPUs into supercomputing-class infrastructure.”
Google’s TPU v3 Pods (packing 1,000 TPU chips) broke three records themselves. Unlike Nvidia’s DGX, the Alphabet subsidiary’s supercomputing beast is available (in public beta, at least) as a service on Google Cloud.
The results from MLPerf 0.6 reveal that customers can get superior machine learning training performance on Google Cloud compared with on-premises equivalents when training the Transformer, Single Shot Detector (SSD) and ResNet-50 models.
In the Transformer and SSD categories, Cloud TPU v3 Pods trained models over 84 percent faster than the fastest on-premises systems in the MLPerf Closed Division. The TPU v3 Pods trained the Transformer model in 51 seconds, and the ResNet-50 model with ImageNet data in 1 minute and 12 seconds.
Chirag Dekate, senior director and analyst of AI infrastructure at Gartner, said the results are of enormous significance for end users actively involved in developing production AI pipelines.
“These results mean that for some use cases, assuming that the kernels and dataset accuracy rates are transferable (not a given), training times can be drastically reduced by utilizing purpose-specific acceleration technologies like Google TPUs,” Dekate said.
“For Data Scientists this means they can iterate over more models in the same quantum of time and deliver higher quality business value oriented results, and for IT infrastructure leaders, these performance advances directly translate to optimization of CAPEX and OPEX, due to higher productivity enabled by compute environments.”
Dekate cautioned, however, that users should treat these performance measures as guidelines and not extrapolate benchmarking results into real-world application performance.
“[MLPerf] provides one data point to compare competing solutions in the industry,” he said.
“Real-world application performance will vary, as there are more variable factors to consider: for instance, variance in custom data sets with varying labeling and data accuracies, scale of ML models, compute density, and specific business-relevant use cases that are usually more nuanced than what is characterized by the benchmarks.”