Can a CPU Compute Faster Than a GPU? Rice University's Latest Research Overcomes a Hardware Obstacle


Author | JADE BOYD

Translator | Yang Zhiang    Planner | Yu Ying

Computer scientists at Rice University in the United States have invented SLIDE, an algorithm that trains deep neural networks faster on CPUs than on GPUs. It overcomes a major obstacle to the rapid growth of the artificial intelligence industry by proving that deep learning can be accelerated without relying on specialized acceleration hardware such as graphics processing units (GPUs).

According to media reports, computer scientists at Rice University have overcome a major obstacle to the rapid growth of the artificial intelligence industry: they have shown that deep learning can be accelerated without relying on specialized acceleration hardware such as graphics processing units (GPUs). Their algorithm, called SLIDE, is the first to train deep neural networks faster on a CPU than on a GPU.

With support from collaborators at Intel, the Rice computer scientists presented their results on March 2 at the Austin Convention Center during MLSys 2020, the Conference on Machine Learning and Systems.

Today, many companies are investing heavily in graphics processing units (GPUs) and other specialized hardware to implement deep learning, a powerful form of artificial intelligence that underlies smart assistants such as Amazon's Alexa and Apple's Siri, facial recognition, product recommendation systems, and many other technologies. As one measure of deep learning's popularity, Nvidia, maker of the industry's flagship Tesla V100 Tensor Core GPU, recently reported that its fourth-quarter 2019 revenue grew 41% year over year.

The Rice researchers created a cost-saving alternative to GPUs: an algorithm called the sub-linear deep learning engine (SLIDE), which needs only a general-purpose central processing unit (CPU) and no specialized acceleration hardware.

"Our tests show that SLIDE is the first deep learning intelligent algorithm based on CPU, and its performance can surpass those methods that use large-scale fully connected architecture and GPU hardware acceleration to implement data sets according to industry-scale recommendations.", Anshumali Shrivastava Say so. The assistant professor of Rice University's Brown School of Engineering, along with graduate students Beidi Chen and Tharun Medini, developed the SLIDE algorithm.

SLIDE does not need GPUs because it takes a fundamentally different approach to deep learning. The standard "backpropagation" technique for training deep neural networks requires matrix multiplication, a heavy computational workload that is exactly where GPUs shine. With SLIDE, however, Shrivastava, Chen, and Medini turned neural network training into a search problem that can be solved with hash tables.
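To make the contrast concrete, here is a minimal C++ sketch (an illustration, not the SLIDE source code) of a single fully connected layer computed two ways: the standard way, as a matrix-vector multiplication over every neuron, and the search-based way, where a hash-table lookup has already narrowed the work to a short list of "active" neurons. The function names and the `active` parameter are assumptions made for this example.

```cpp
// Minimal sketch (not the authors' code) contrasting the two approaches for a
// single fully connected layer with n_out neurons and n_in inputs.
#include <cstddef>
#include <utility>
#include <vector>

// Standard forward pass: every neuron takes a dot product with the input,
// i.e. a matrix-vector multiplication costing O(n_out * n_in).
std::vector<float> dense_forward(const std::vector<std::vector<float>>& weights,
                                 const std::vector<float>& input) {
    std::vector<float> out(weights.size(), 0.0f);
    for (std::size_t j = 0; j < weights.size(); ++j)
        for (std::size_t i = 0; i < input.size(); ++i)
            out[j] += weights[j][i] * input[i];
    return out;
}

// Search-based forward pass: `active` holds the ids of the few neurons a
// hash-table lookup judged relevant to this input, and only those dot
// products are computed. The cost scales with the number of active neurons
// rather than with n_out.
std::vector<std::pair<std::size_t, float>> sparse_forward(
        const std::vector<std::vector<float>>& weights,
        const std::vector<float>& input,
        const std::vector<std::size_t>& active) {
    std::vector<std::pair<std::size_t, float>> out;
    for (std::size_t j : active) {
        float acc = 0.0f;
        for (std::size_t i = 0; i < input.size(); ++i)
            acc += weights[j][i] * input[i];
        out.emplace_back(j, acc);
    }
    return out;
}
```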

Compared with backpropagation training, SLIDE radically reduces the computational overhead. Shrivastava noted, for example, that the top-of-the-line GPU platforms that Amazon, Google, and other companies use for cloud-based deep learning services typically contain eight Tesla V100 chips and cost about $100,000.

Beidi Chen and Tharun Medini, graduate students in computer science at Rice University, helped develop SLIDE, an algorithm for training deep neural networks without graphics processing units. (Photo credit: Jeff Fitlow/Rice University)

"We have a test case running in the laboratory. It can fully carry the workload of a V100 chip, that is, it is suitable for GPU memory and runs on a large fully connected network with more than 100 million parameter calculations." , Shrivastava said, "We used the most advanced Google TensorFlow software package to train the algorithm, and it only took 3 and a half hours to complete the training."

"We later proved that our new algorithm can even complete the training within an hour, and it is not running on a GPU, but on a 44-core xeon-class CPU," Shrivastava said.

Deep learning networks are inspired by biology. Their central feature, artificial neurons, are small pieces of computer code that can learn to perform a specific task. A deep learning network may contain millions or even billions of such neurons, and by working together and learning from massive amounts of data, they can learn to make expert decisions at a human level. For example, if a deep neural network is trained to identify objects in photos, it will use different neurons to recognize a photo of a cat than it will to recognize a school bus.

"You don't need to train all the neurons in each use case," Medini explained. "We think like this,'If we only pick out the relevant neurons, it becomes a search problem. 'Therefore, from an algorithmic point of view, our idea is to use a locality sensitive hash algorithm to avoid the complexity of matrix multiplication."

Hashing is a data-indexing method invented for internet search in the 1990s. It uses numerical methods to encode large amounts of information, such as every page of a website or every chapter of a book, as a string of digits called a hash. Hash tables are lists of those hashes that can be searched very quickly.
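To illustrate what such a lookup can look like, the sketch below implements one of the simplest locality-sensitive hash families, signed random projections (often called SimHash): weight vectors that point in similar directions tend to receive the same code, so the bucket an input hashes to cheaply yields candidate neurons. This only shows the general idea under that assumption; the actual SLIDE system uses multiple tables and a more elaborate scheme, and all class names and parameters here are illustrative.

```cpp
// Minimal sketch of locality-sensitive hashing with signed random projections
// (SimHash). Similar vectors tend to get the same code, so a bucket lookup
// for an input retrieves neurons whose weight vectors are likely similar to it.
#include <cstdint>
#include <iostream>
#include <random>
#include <unordered_map>
#include <vector>

struct SimHashTable {
    std::vector<std::vector<float>> planes;               // K random hyperplanes
    std::unordered_map<uint32_t, std::vector<int>> table;  // hash code -> neuron ids

    SimHashTable(int K, int dim, unsigned seed = 42) : planes(K, std::vector<float>(dim)) {
        std::mt19937 gen(seed);
        std::normal_distribution<float> gauss(0.0f, 1.0f);
        for (auto& p : planes)
            for (auto& x : p) x = gauss(gen);
    }

    // One bit per hyperplane: the sign of the projection onto it.
    uint32_t hash(const std::vector<float>& v) const {
        uint32_t code = 0;
        for (std::size_t k = 0; k < planes.size(); ++k) {
            float dot = 0.0f;
            for (std::size_t i = 0; i < v.size(); ++i) dot += planes[k][i] * v[i];
            if (dot > 0.0f) code |= (1u << k);
        }
        return code;
    }

    void insert(int neuron_id, const std::vector<float>& weight) {
        table[hash(weight)].push_back(neuron_id);
    }

    // Neurons whose weight vectors landed in the same bucket as the input.
    std::vector<int> query(const std::vector<float>& input) const {
        auto it = table.find(hash(input));
        return it == table.end() ? std::vector<int>{} : it->second;
    }
};

int main() {
    SimHashTable lsh(/*K=*/8, /*dim=*/3);
    lsh.insert(0, {1.0f, 0.1f, 0.0f});   // points roughly along +x
    lsh.insert(1, {-1.0f, 0.0f, 0.2f});  // points roughly along -x
    for (int id : lsh.query({0.9f, 0.2f, 0.1f}))
        std::cout << "candidate neuron " << id << "\n";  // likely prints 0
    return 0;
}
```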

"It is meaningless to implement our algorithm on TensorFlow or PyTorch, because the first thing these software perform is to convert what you are doing into a matrix multiplication problem, regardless of the three seven twenty one," Chen said. "And this is what our algorithm wants to avoid. So we wrote our own C++ code from scratch."

Shrivastava said that the biggest advantage of SLIDE over backpropagation is that it uses a data parallel approach.

"I mean, through data parallelism, if I want to train two data instances, for example, one is the image of a cat and the other is a bus, they may activate different neurons. The SLIDE algorithm can These two instances are updated or trained independently," he said, "this greatly improves the utilization of CPU parallelism."

"On the other hand, compared with GPU, we need more storage space," he said, "There is a cache hierarchy in the main memory. If you are not careful when using it, you may encounter a memory thrashing (cache thrashing), then a large number of page faults will occur."

Shrivastava said that the first time his team experimented with SLIDE, it suffered severe cache thrashing, yet their training times were still comparable to, or even faster than, GPU training times. So he, Chen, and Medini posted their initial results on arXiv in March 2019 and uploaded their code to GitHub. A few weeks later, Intel reached out to them.

"Our collaborators from Intel noticed the caching problem in our experiments," he said. "They told us they could work with us to make the algorithm train even faster, and they turned out to be right. With their help, our performance improved by about another 50%."

Shrivastava said SLIDE is still far from reaching its full potential.

"We have only scratched the surface," he said. "There is still a lot we can do to optimize this algorithm. For example, we have not yet used vectorization, nor the accelerators built into CPUs, such as Intel Deep Learning Boost. There are many other tricks we could still use to make it even faster."

Shrivastava said SLIDE matters because it proves there are other ways to implement deep learning.

"The whole message we want to convey is, 'Let's not be bottlenecked by matrix multiplication and GPU memory,'" Chen said. "Ours may be the first algorithm to beat the GPU, but I hope it is not the last. The field needs new ideas, and that is a big part of what the MLSys conference on machine learning systems is about."

The paper's other co-authors include James Farwell, Sameh Gobriel, and Charlie Tai, all of Intel Labs.

The research was also supported by the National Science Foundation (NSF-1652131, NSF-BIGDATA 1838177), the Air Force Office of Scientific Research (FA9550-18-1-0152), Amazon, and the Office of Naval Research.

Related resources:

MLSys conference paper:

https://www.cs.rice.edu/~as143/Papers/SLIDE_MLSys.pdf

About the author:

Jade Boyd is the science editor and deputy director of news and media relations at Rice University's Office of Public Affairs.



Source: blog.51cto.com/15060462/2675594