Choosing a GPU for deep learning

Since deep learning took off in 2012, GPU computing has drawn wide attention, making large-scale neural network computation possible. Through CUDA (Compute Unified Device Architecture), introduced in 2007, programmers can control the GPU for parallel computing from code. This article first recommends which GPU card to choose based on the cards' main parameters, and then discusses the hardware architecture relevant to CUDA programming.

#### 1. What kind of GPU model to choose

In recent years, graphics cards have come mainly from AMD and NVIDIA. NVIDIA alone has launched hundreds of GeForce-series cards so far [1]. Although many of them have been discontinued, choosing a suitable card for accelerating algorithms is still a question worth thinking about. Tim Dettmers' article [2] gives many useful suggestions, and I add a few of my own based on my understanding and experience (in fact, I have only used a GTX 970...).

Figure 1: GPU selection

The above does not consider laptop graphics cards; a desktop card is a better choice for accelerating algorithms. The most cost-effective option, in my view, is the GTX 980 Ti: judging from its parameters and some user reviews, its performance is not far behind the TITAN X, yet it is much cheaper. As Figure 1 shows, cards at similar prices each have their own strengths, and you can choose according to your needs. If the amount of data to be processed is relatively small, choose a card with a high clock frequency; if the amount of data is large, choose one with many cores and a large video memory. If double precision is required, a Kepler-architecture card is the best choice. Tesla's M40 is built specifically for deep learning: if all you do is train deep networks, this card, although expensive, is well suited to enterprises or institutions (Baidu's Institute of Deep Learning uses it [3]); compared with the K40's single-precision floating-point performance of 4.29 Tflops, the M40 reaches 7 Tflops. The Quadro series is rarely mentioned; its M6000 is more expensive than the K80, and its performance parameters are not much better.

The main parameters to look at when choosing are the number of processor cores, the clock frequency, the video memory size and bus width, and whether it is a single- or dual-chip card. Some people think the bus width matters most, others the number of cores. For deep learning computation, I think the number of cores and the amount of video memory matter more. The higher these parameters, the better, but the program must also be written to match: if you cannot keep all the cores working, resources are wasted. Also note that if one host is fitted with several graphics cards, the power supply must be chosen accordingly.
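If you already have a card at hand, most of the parameters above can be queried from code. Below is a minimal sketch using the CUDA runtime's cudaGetDeviceProperties; the fields printed are just the ones discussed above, and error checking is omitted for brevity.

```cpp
// Sketch: list each card's SM count, clock, memory size, and bus width.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("Device %d: %s\n", dev, prop.name);
        printf("  SMs: %d, core clock: %d kHz\n",
               prop.multiProcessorCount, prop.clockRate);
        printf("  Global memory: %zu MB, bus width: %d bits\n",
               prop.totalGlobalMem >> 20, prop.memoryBusWidth);
        printf("  Compute capability: %d.%d\n", prop.major, prop.minor);
    }
    return 0;
}
```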

#### 2. Common terms and their meanings

Having discussed which GPU to choose, this part introduces some common terms. As graphics card performance has been updated generation after generation, there have been many changes in hardware design and naming conventions; the following are the more common ones.

  • GPU architecture: Tesla, Fermi, Kepler, Maxwell, Pascal
  • Chip model: GT200, GK210, GM204, GF104, etc.
  • Graphics card series: GeForce, Quadro, Tesla
  • GeForce graphics card models: G/GS, GT, GTS, GTX

The GPU architecture refers to the way the hardware is designed: how many cores each streaming multiprocessor has, whether there are L1 or L2 caches, whether there are double-precision compute units, and so on. Each generation's architecture embodies an idea of how to do parallelism better, and the chip is the realization of that idea. The second letter in a chip model (for example, the T in GT200) indicates which generation of architecture it belongs to, and within a generation there are sometimes both 100- and 200-numbered chips; their basic design follows that generation's architecture, with some changes in the details. For example, the GK210 has twice as many registers as the GK110. A graphics card may also carry two chips: the Tesla K80 uses two GK210 chips. Note that the first-generation GPU architecture was also named Tesla, although there are basically no cards of that design left; where needed below, "Tesla architecture" and "Tesla series" are used to distinguish the two.

The graphics card series, on the other hand, do not differ in essence; NVIDIA simply wants to split them into three product lines: GeForce for home entertainment, Quadro for workstations, and Tesla for servers. Tesla's K-model cards are designed for high-performance scientific computing; their notable strengths are high double-precision floating-point performance and ECC memory support. But good double-precision capability is of little use for deep learning training, so the Tesla series added the M models as cards dedicated to training deep networks. Note that Tesla-series cards have no display outputs; they focus on computation rather than graphics display.

Finally, the GeForce model suffixes denote different hardware configurations: the further along the list, the better the performance, with higher clock frequencies and more video memory, i.e. G/GS < GT < GTS < GTX.

#### 3. Some of the GPU's hardware

This part uses the GM204 hardware diagram below as an example to introduce the main hardware components of a GPU [4]. This chip appeared together with the GTX 980 and 970. In general, differences between GPU architectures show up in the different designs of the streaming multiprocessors (L1 and L2 cache hardware was added starting with the Fermi architecture), while the rest of the structure is broadly similar. The main components include the host interface, the copy engine, the streaming multiprocessors (SMs), the graphics processing clusters (GPCs), memory, and so on.

Figure 2: GM204 chip structure

The host interface connects the GPU card to PCI Express. Its main function is to read program instructions and dispatch them to the corresponding hardware units; for example, if part of a program is performing a memory copy, the host interface assigns that task to the copy engine.

The copy engine (not shown in the figure) carries out copies between GPU memory and CPU memory. When the GPU has a copy engine, copying can proceed concurrently with kernel computation. As GPU cards have become more powerful, the bottleneck in deep learning is no longer compute speed but data loading, so how to use the copy engine sensibly is a question worth thinking about.
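Below is a minimal sketch of this overlap, assuming a toy kernel and pinned host memory; the chunk size and stream count are illustrative, not tuned. Each stream's copy can be handled by the copy engine while the SMs run the kernel on a chunk that has already arrived.

```cpp
// Sketch: overlap host-to-device copies with kernel execution using streams.
#include <cuda_runtime.h>

__global__ void process(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;   // placeholder computation
}

int main() {
    const int n = 1 << 20, nStreams = 4, chunk = n / nStreams;
    float *h, *d;
    cudaMallocHost(&h, n * sizeof(float));   // pinned memory, needed for async copies
    cudaMalloc(&d, n * sizeof(float));

    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);

    // While the copy engine transfers one chunk, the SMs can already run
    // the kernel on a chunk that was copied earlier.
    for (int s = 0; s < nStreams; ++s) {
        int offset = s * chunk;
        cudaMemcpyAsync(d + offset, h + offset, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        process<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d + offset, chunk);
        cudaMemcpyAsync(h + offset, d + offset, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();

    for (int s = 0; s < nStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFreeHost(h); cudaFree(d);
    return 0;
}
```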

The streaming multiprocessor (SM) is the most central part of the GPU (the term used here follows the GPU programming guide). An SM consists of a set of hardware units including warp schedulers, registers, cores, shared memory, and so on. Its design and count determine the GPU's compute capability. One SM contains multiple cores, and each core executes a thread; the core is the processor that performs the actual computation. More cores means more threads can run at once, but more cores does not automatically mean faster computation: what matters most is keeping all the cores working rather than idle. Different architectures name the SM differently, SMX in Kepler and SMM in Maxwell, but they are all SMs. A GPC simply groups several SMs together and is involved in scheduling for graphics display; when writing GPU programs you generally do not need to think about it, as long as you understand the SM's structure and distribute work across the SMs sensibly.
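One way to "keep all the cores working" is to let the runtime suggest a launch configuration. The sketch below uses the occupancy API in the CUDA runtime; myKernel is a placeholder of my own, not anything from the article.

```cpp
// Sketch: ask the runtime for a block size that keeps the SMs occupied.
#include <cuda_runtime.h>

__global__ void myKernel(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

void launch(float* d_x, int n) {
    int minGridSize = 0, blockSize = 0;
    // Returns the block size that maximizes occupancy and the minimum grid
    // size needed to keep every SM busy for this kernel.
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, myKernel, 0, 0);
    int gridSize = (n + blockSize - 1) / blockSize;
    myKernel<<<gridSize, blockSize>>>(d_x, n);
}
```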

The memory controllers in the figure manage the L2 cache, each slice being 512 KB in size.

#### 4. Structure of the streaming multiprocessor

The previous section covered the GPU's overall hardware structure; this section looks specifically at the internal structure of the streaming multiprocessor (SM). The first thing to understand is that the GPU is designed to execute a huge number of simple tasks, unlike the CPU, which handles complex ones. The problems a GPU faces can be decomposed into many parts that can be solved independently and simultaneously; at the code level, this means many threads executing the same code at the same time. The GPU is therefore built out of a large number of simple processors, the stream processors, which carry out integer and floating-point operations. The figure below shows the SM structure of the GK110. It belongs to the Kepler architecture; the major difference from earlier architectures is the addition of double-precision floating-point units, the DP Units in the figure, which is why Kepler-architecture cards are a good choice for double-precision computation.

Figure 3: GK110 SMX structure

As mentioned above, one SM contains multiple cores, also called stream processors; they are the GPU's arithmetic units, performing integer and floating-point computation. You can think of one core executing one thread at a time: a GK110 SM has 192 cores, so 192 threads can execute simultaneously. The internal structure of a core is described in [5]; when implementing algorithms you rarely need to go down to that level. The SFUs are special function units that compute log/exp/sin/cos and the like. LD/ST refers to the Load/Store units, which read and write the global memory, local memory, and so on that a thread needs.
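As a small illustration of this model, the sketch below lets each thread handle one element and uses the __expf intrinsic, which is served by the SFUs mentioned above; the sigmoid computation is an example of my own choosing, not something from the article.

```cpp
// Sketch: every thread runs the same code on a different element (SIMT),
// and __expf is a fast hardware approximation handled by the SFUs.
#include <cuda_runtime.h>

__global__ void activate(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // each thread picks one element
    if (i < n) {
        // __expf trades a little accuracy for speed compared with expf.
        out[i] = 1.0f / (1.0f + __expf(-in[i]));
    }
}

// Launch with enough threads to cover all n elements, e.g.
// activate<<<(n + 191) / 192, 192>>>(d_in, d_out, n);
```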

One SM has 192 cores, so eight SMs have 1536 cores, and running this many threads in parallel requires unified management. If the GPU issued the same instruction across all 1536 cores at once, but fewer than 1536 threads actually needed that instruction, some cores would sit idle and resources would be wasted. The cores therefore cannot all be scheduled as one unit, which is why the warp scheduler was designed. A group of 32 threads is called a warp; the 32 threads in a warp execute the same instruction, and each thread in it is called a lane. A warp receives one instruction and its 32 threads execute it together, while different warps can execute different instructions, so large numbers of threads no longer sit idle. There are still problems in warp scheduling, however. Suppose some code contains an if...else...: when a whole warp of 32 threads is scheduled, it is not possible to hand threads 0-15 the instruction for branch 1 and threads 16-31 the instruction for branch 2 (in fact the GPU handles branches by letting all threads that take branch 1 finish before the threads that take branch 2 run); they all receive the same instruction. So if threads 16-31 belong to branch 2, they must wait for threads 0-15 to finish branch 1 before receiving branch 2's instruction, and during that time threads 0-15 in turn wait for threads 16-31, leaving threads idle and resources wasted. For this reason, in the actual scheduling it is half a warp, 16 threads, that executes the same instruction: threads 0-15 can be given branch 1's instruction and threads 16-31 branch 2's, so a warp can execute two branches at once. This is why there are two dispatch units under each warp scheduler in the figure.
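The toy kernels below sketch the divergence problem described above; the kernel names and the arithmetic are illustrative. In the first kernel, even and odd lanes of the same warp take different branches, so the two paths are serialized within the warp; in the second, the branch falls on warp boundaries, so every warp stays uniform.

```cpp
// Sketch: warp divergence versus warp-uniform branching.
#include <cuda_runtime.h>

__global__ void divergent(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (threadIdx.x % 2 == 0)          // even and odd lanes of one warp diverge
        x[i] = x[i] * 2.0f;
    else
        x[i] = x[i] + 1.0f;
}

__global__ void warpUniform(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if ((threadIdx.x / 32) % 2 == 0)   // whole warps take the same branch
        x[i] = x[i] * 2.0f;
    else
        x[i] = x[i] + 1.0f;
}
```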

Another important structure is the shared memory. Its contents are shared within a block (for now, think of a block as a set of threads somewhat larger than a 32-thread warp); all the threads of a block can access it, and its read/write speed is faster than global memory, so data that threads need to exchange or access repeatedly is usually placed there. In the Kepler architecture, a total of 64 KB is split between shared memory and the L1 cache; shared memory can in fact be viewed as an L1 cache that the user controls. For example, if shared memory takes 48 KB, the L1 cache gets 16 KB, and so on. In the Maxwell architecture, shared memory and the L1 cache are separated, and shared memory is 96 KB. Registers, in turn, are even faster to read and write than shared memory, and there are a great many of them; the GK110, for instance, has 65536 per SM.
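A common use of shared memory is exactly this kind of repeated access within a block. The sketch below sums 256 elements per block with a tree reduction held in shared memory; the kernel name and sizes are illustrative.

```cpp
// Sketch: per-block sum using shared memory instead of repeated global reads.
// Launch with 256 threads per block.
#include <cuda_runtime.h>

__global__ void blockSum(const float* in, float* out, int n) {
    __shared__ float tile[256];                 // visible to all threads of the block
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    tile[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                            // make sure the tile is fully loaded

    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) tile[tid] += tile[tid + stride];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = tile[0];    // one partial sum per block
}

// On Kepler the 64 KB split between L1 and shared memory can be biased, e.g.
// cudaDeviceSetCacheConfig(cudaFuncCachePreferShared);
```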

In addition, each SM has its own buses for accessing global memory and constant memory. Constant memory is not a separate piece of memory hardware but a virtual form of global memory; what distinguishes it from ordinary global memory is that it can be cached and broadcast to the threads of a warp, so each SM contains a constant cache dedicated to it.
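A minimal sketch of constant memory use follows: the coefficients are uploaded once with cudaMemcpyToSymbol, and every lane of a warp reading the same coefficient is served by the constant cache and its broadcast. The polynomial evaluation is just an illustrative workload of my own.

```cpp
// Sketch: read-only coefficients kept in constant memory.
#include <cuda_runtime.h>

__constant__ float coeff[16];                   // lives in the constant address space

__global__ void polyEval(const float* x, float* y, int n, int order) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float acc = 0.0f;
    for (int k = order; k >= 0; --k)
        acc = acc * x[i] + coeff[k];            // all lanes read the same coeff[k]
    y[i] = acc;
}

void setCoefficients(const float* host_coeff, int count) {
    // Copy from host memory into the constant-memory symbol.
    cudaMemcpyToSymbol(coeff, host_coeff, count * sizeof(float));
}
```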

#### 5. Summary

This article has discussed some of the important hardware components of a GPU. For deep learning, I think the demand for memory is fairly large, and having many cores does not mean they can all be used. The open-source libraries available today are remarkably complete: for convolutions there is cuDNN, for convolutional neural networks Caffe and Torch, for RNNs MXNet, TensorFlow, and so on. These libraries handle the GPU calls internally very well, so the user does not need to worry about them, but understanding some of the GPU's internal structure is still interesting.

Also, when I first came across GPUs I did not know they were made for graphics rendering... so some of my understanding may be wrong; the discussion of GPU structure here is mainly from the computing point of view.

References:

[1] List of Nvidia graphics processing units

[2] Which GPU(s) to Get for Deep Learning: My Experience and Advice for Using GPUs in Deep Learning

[3] Inside the GPU Clusters that Power Baidu’s Neural Networks

[4] Whitepaper NVIDIA GeForce GTX 980

[5] Life of a triangle - NVIDIA’s logical pipeline
