CUDA (4), Zhou Bin: Overview of GPU Architecture

FLOPS - FLoating-point OPerations per Second

GFLOPS - one billion ($10^{9}$) FLOPS

TFLOPS - 1,000 GFLOPS (one trillion, $10^{12}$ FLOPS); the prefixes continue T -> P -> E (tera, peta, exa)

GPU (Graphics Processing Unit) block diagram

A GPU is a heterogeneous chip multi-processor (highly tuned for graphics).

Shader Core (shader, ALU compute unit)	Shader Core	Tex	Input Assembly
Shader Core	Shader Core	Tex	Rasterizer
Shader Core	Shader Core	Tex	Output Blend
Shader Core	Shader Core	Tex	Video Decode
			Work Distributor

Execution unit: executes the shader

Fetch/Decode (instruction fetch and decode)

ALU (Execute): the processing core

Execution Context (execution state, pipeline)

"CPU-style" cores

Fetch/Decode (instruction fetch and decode)

Data cache (a big one)

ALU (Execute): the processing core

Out-of-order control logic

Execution Context (execution state, pipeline)

Fancy branch predictor

Memory pre-fetcher

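Idea #1: Slimming down

Fetch/Decode (instruction fetch and decode)

Remove components that help a single instruction stream run fast, i.e. the parts not directly involved in computation.

ALU (Execute): the processing core

Execution Context (execution state, pipeline)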

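Two cores (two fragments in parallel): each core runs an independent program segment / instruction stream.

Fetch/Decode (instruction fetch and decode)

ALU (Execute): the processing core

Execution Context (execution state, pipeline)

Fetch/Decode (instruction fetch and decode)

ALU (Execute): the processing core

Execution Context (execution state, pipeline)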

Four cores (four fragments in parallel)

Sixteen cores (sixteen fragments in parallel)

16 cores = 16 simultaneous instruction streams

Instruction stream sharing: but... many fragments should be able to share an instruction stream!

Idea #2: add ALUs (SIMD processing)

Fetch/Decode (instruction fetch and decode)

Amortize the cost/complexity of managing an instruction stream across many ALUs.

SIMD processing: single instruction, multiple data

One core processes more data.

ALU1	ALU2	ALU3	ALU4
ALU5	ALU6	ALU7	ALU8

Ctx	Ctx	Ctx	Ctx
Ctx	Ctx	Ctx	Ctx

Shared Ctx Data

Modifying the shader (executing on multiple data)

Fetch/Decode (instruction fetch and decode)

Original compiled shader:

Processes one fragment using scalar ops on scalar registers

New compiled shader:

Processes eight fragments using vector ops on vector registers (vector operations)

ALU1	ALU2	ALU3	ALU4
ALU5	ALU6	ALU7	ALU8

Ctx	Ctx	Ctx	Ctx
Ctx	Ctx	Ctx	Ctx

Shared Ctx Data

128 fragments execute simultaneously, with 16 concurrent instruction streams (which may be the same or different).

16 cores = 128 (16 × 8) ALUs, 16 simultaneous instruction streams
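To make the scalar-versus-vector distinction concrete, here is a minimal Python sketch. The functions `shade_scalar` and `shade_vector8` and the toy shading formula are hypothetical illustrations, not code from the lecture; a real shader compiler emits hardware vector instructions, not Python loops.

```python
SIMD_WIDTH = 8   # ALUs per core in the example design
CORES = 16       # cores in the example design

def shade_scalar(x):
    """'Original' shader: one fragment, scalar ops on scalar values."""
    return 2.0 * x + 1.0

def shade_vector8(xs):
    """'New' shader: eight fragments per call. Each statement stands in
    for one vector instruction applied to all eight lanes at once."""
    assert len(xs) == SIMD_WIDTH
    scaled = [2.0 * x for x in xs]       # one vector multiply
    return [s + 1.0 for s in scaled]     # one vector add

fragments = [float(i) for i in range(SIMD_WIDTH)]

# Eight separate scalar invocations versus one vector invocation.
scalar_results = [shade_scalar(x) for x in fragments]
vector_results = shade_vector8(fragments)

total_alus = CORES * SIMD_WIDTH  # fragments processed at once across the chip
```

Both versions compute the same values; the vector version needs only one instruction stream (one fetch/decode unit) for every eight fragments.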

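Branches

Some fragments execute the code after the if; others execute the code after the else.

A group of ALUs has only one fetch/decode unit. The ALUs inside a core execute in lockstep, so at any moment they can process either the if path or the else path, not both. There is only one instruction scheduler.

Not all ALUs do useful work! Worst case: 1/8 of peak performance.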
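The 1/8 figure can be reproduced with a small model of lockstep execution. This is only a sketch under simplifying assumptions: the function `run_branch` and its cost parameters are hypothetical, and each path is assumed to take a fixed number of cycles during which lanes on the other path are masked off.

```python
def run_branch(conditions, if_cost, else_cost):
    """Model one SIMD core executing an if/else over its lanes.

    conditions: per-lane booleans (True means the lane takes the if path).
    if_cost / else_cost: cycles needed to execute each path.
    Returns (cycles, utilization), where utilization is the fraction of
    lane-cycles that did useful work.
    """
    lanes = len(conditions)
    takes_if = sum(conditions)
    takes_else = lanes - takes_if
    cycles = 0
    useful = 0
    if takes_if:                  # the if path is issued if any lane needs it
        cycles += if_cost
        useful += takes_if * if_cost
    if takes_else:                # likewise for the else path
        cycles += else_cost
        useful += takes_else * else_cost
    if cycles == 0:               # nothing to execute at all
        return 0, 1.0
    return cycles, useful / (lanes * cycles)

uniform = run_branch([True] * 8, 10, 10)               # no divergence
split = run_branch([True] * 4 + [False] * 4, 10, 10)   # half and half
worst = run_branch([True] + [False] * 7, 10, 0)        # one lane does all the work
```

In the worst case a single lane takes an expensive path while the other seven have nothing to do, so only one of eight ALUs is busy.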

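SIMD processing does not always require explicit SIMD instructions:

1. Explicit vector instructions: SSE

2. Scalar instructions, with the hardware performing the vectorization: many fragments share one instruction stream, and the sharing is managed by hardware. This is NVIDIA's approach.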

Stalls (dependencies)

Stalls occur when a core cannot run the next instruction because of a dependency on a previous operation.

Texture access latency = hundreds to thousands of cycles

We've removed the fancy caches and logic that helps avoid stalls

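Idea #3: do other, unrelated work while waiting

Switch among a large number of independent fragments.

Hide latency by switching between fragments.

Hiding shader stalls (latency hiding)

With a queue of mutually independent tasks, there is always other work to do when one of them stalls.

Throughput: increase the run time of one group to increase the throughput of many groups.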
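The effect of interleaving can be seen in a small cycle-counting model. This is a sketch under assumed behavior, not a description of any real scheduler: the function `simulate`, the run and stall lengths, and the "run until the next stall, then switch" policy are all hypothetical.

```python
def simulate(num_contexts, run_cycles, stall_cycles, rounds):
    """Interleave several contexts on one core.

    Each context repeats `rounds` times: compute for `run_cycles`, then
    wait `stall_cycles` (for example, on a texture fetch). The core runs a
    ready context until its next stall, then switches to another one.
    Returns (total_cycles, utilization).
    """
    ready_at = [0] * num_contexts       # cycle at which each context may run again
    remaining = [rounds] * num_contexts
    clock = 0
    busy = 0
    while any(remaining):
        runnable = [i for i in range(num_contexts)
                    if remaining[i] and ready_at[i] <= clock]
        if not runnable:
            # Every unfinished context is stalled: the core sits idle.
            clock = min(ready_at[i] for i in range(num_contexts) if remaining[i])
            continue
        i = runnable[0]
        clock += run_cycles
        busy += run_cycles
        remaining[i] -= 1
        ready_at[i] = clock + stall_cycles
    return clock, busy / clock

one = simulate(1, 10, 30, 4)    # a single context: the core mostly waits
two = simulate(2, 10, 30, 4)
four = simulate(4, 10, 30, 4)   # enough contexts to cover the whole stall
eight = simulate(8, 10, 30, 4)  # still fully busy, but each context finishes later
```

With these numbers, four contexts are enough to keep the ALUs busy on every cycle. Adding more contexts does not raise utilization further; it only lengthens the time any single context takes to finish, which is the throughput trade-off described above.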

Storing contexts

Pool of context storage: it holds the data (state) of each context.

Eighteen small contexts: maximal latency hiding. With 18 pieces of work available, whenever one is waiting there is immediately another to run, but each piece of work has little storage.

Twelve medium contexts

Four large contexts: low latency-hiding ability, but each piece of work can store more.
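The trade-off between context size and latency hiding can be sketched with a fixed storage pool. All numbers here are assumptions for illustration: `POOL_UNITS` is a hypothetical pool size chosen only because 18, 12 and 4 divide it evenly, and `run_cycles` is a hypothetical amount of work each context does between stalls.

```python
POOL_UNITS = 144  # hypothetical size of the on-chip context pool

def partition(num_contexts, run_cycles=10):
    """Split the pool evenly among contexts.

    Returns (storage_per_context, hideable_stall_cycles). While one context
    waits, the other (num_contexts - 1) contexts can each run for
    `run_cycles`, so that is roughly how long a stall can be covered.
    """
    per_context = POOL_UNITS // num_contexts
    hideable = (num_contexts - 1) * run_cycles
    return per_context, hideable

small = partition(18)   # many small contexts
medium = partition(12)
large = partition(4)    # few large contexts
```

More contexts cover longer stalls but leave less storage for each one; fewer contexts give each piece of work more room but cover much shorter stalls.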

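Clarification

Interleaving between contexts can be managed by hardware or software (or both).

Different architectures adopt different strategies. GPUs use hardware management, with many contexts whose state is kept on chip; CPUs have contexts of varying size and manage them with a combination of software and hardware.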

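Example design: 16 cores × 8 ALUs = 128 compute units, processing 128 data elements at once. The chip carries 16 instruction streams × 4 context slots each (pending work) = 64 interleaved instruction streams, so 16 × 4 × 8 = 512 independent fragments are in flight. At a 1 GHz clock, the peak is 256 GFLOPS (each ALU performs a multiply-add, counted as two floating-point operations per cycle).

Idealized ("enthusiast") design

32 cores, 16 ALUs per core (512 total) = 1 TFLOPS (at 1 GHz)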
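The peak figures follow from simple arithmetic, sketched below. The function name `peak_gflops` is hypothetical, and the factor of 2 assumes that each ALU issues one multiply-add per cycle, counted as two floating-point operations; without that assumption the quoted numbers do not come out.

```python
def peak_gflops(cores, alus_per_core, clock_ghz, flops_per_alu_per_cycle=2):
    """Peak throughput in GFLOPS.

    flops_per_alu_per_cycle=2 assumes one multiply-add per ALU per cycle,
    which counts as two floating-point operations.
    """
    return cores * alus_per_core * flops_per_alu_per_cycle * clock_ghz

# Example design: 16 cores, 8 ALUs per core, 4 contexts per core, 1 GHz.
example = peak_gflops(16, 8, 1.0)
fragments_in_flight = 16 * 4 * 8   # cores x contexts per core x ALUs per core

# "Enthusiast" design: 32 cores, 16 ALUs per core, 1 GHz.
enthusiast = peak_gflops(32, 16, 1.0)
```

The enthusiast design comes to 1024 GFLOPS, which the notes round to 1 TFLOPS.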

Summary: three key ideas

1. Use many "slimmed down" cores to run in parallel, keeping only the parts needed for computation.

2. Pack cores full of ALUs (by sharing an instruction stream across groups of fragments), which increases the processing power of each core.

- Option 1: Explicit SIMD vector instructions

- Option 2: Implicit sharing managed by hardware

3. Avoid latency stalls by interleaving execution of many groups of fragments, so the processor never sits idle.

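Fermi architecture details

Memory and data access

CPUs have multiple levels of cache.

GPU-style throughput cores keep their memory off chip, in video memory, so memory bandwidth must be very high.

Bandwidth thought experiment

Bandwidth limited

Reducing bandwidth requirements

Request data less often (instead, do more math): increase arithmetic intensity.

Fetch data from memory less often (share/reuse data across fragments) by using on-chip communication or storage; fetch several values per access so that memory accesses are combined.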
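A rough way to see when a kernel is bandwidth limited is to compare what the ALUs could compute with what memory can feed them. The sketch below is an illustration under stated assumptions: the function `attainable_gflops`, the 150 GB/s bandwidth, and the two example kernels are hypothetical and are not figures from the lecture.

```python
def attainable_gflops(peak_gflops, bandwidth_gb_s, flops_per_byte):
    """Throughput is capped either by the ALUs or by memory traffic."""
    return min(peak_gflops, bandwidth_gb_s * flops_per_byte)

PEAK = 256        # GFLOPS, the example design at 1 GHz
BANDWIDTH = 150   # GB/s, a hypothetical memory bandwidth

# Kernel A: out = a * b + c on 4-byte floats.
# 2 flops per element; 3 loads + 1 store = 16 bytes per element.
low_intensity = 2 / 16
kernel_a = attainable_gflops(PEAK, BANDWIDTH, low_intensity)

# Kernel B: reuses data on chip and performs 4 flops per byte fetched.
high_intensity = 4.0
kernel_b = attainable_gflops(PEAK, BANDWIDTH, high_intensity)
```

Under these assumptions kernel A reaches under 10% of peak because it waits on memory, while kernel B is limited only by the ALUs. Doing more math per byte fetched, or reusing data on chip, moves a kernel from the first case toward the second.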

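The memory hierarchy of modern GPUs adds caches.

GPUs are heterogeneous (unlike CPUs) many-core processors optimized for throughput.

Characteristics of an efficient GPU workload

Has thousands of independent pieces of work, so the hardware is never idle

Uses as many of the ALUs as possible

Switches among many fragments to hide latency

Can share instruction streams, i.e. the work is regular

Suitable for SIMD processing

Ideally compute-intensive

Has a reasonable ratio of communication cost to computation cost

Is not limited by memory bandwidth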
