Intel GEN9 GPU

Figure: An Intel® Core™ i7 processor 6700K SoC and its ring interconnect architecture.

RING INTERCONNECT
The on-die bus between CPU cores, caches, and Intel processor graphics is a ring-based topology with dedicated local interfaces for each connected “agent”. This SoC ring interconnect is a bi-directional ring that has a 32-byte wide data bus, with separate lines for request, snoop, and acknowledge. Every on-die CPU core is regarded as a unique agent.
Similarly, Intel processor graphics is treated as a unique agent on the interconnect ring. A system agent is also connected to the ring; it bundles the DRAM memory management unit, display controller, and other off-chip I/O controllers such as PCI Express*. Importantly, all off-chip system memory transactions to/from CPU cores and to/from Intel processor graphics are facilitated by this interconnect, through the system agent and the unified DRAM memory controller.


Gen9 Memory Hierarchy Refinements:
Coherent SVM write performance is significantly improved via new LLC cache management policies (a host-side SVM sketch follows this list).
The available L3 cache capacity has been increased to 768 Kbytes per slice (512 Kbytes for application data).
The sizes of both the L3 and LLC request queues have been increased. This improves latency hiding, achieving better effective bandwidth relative to the architecture's theoretical peak.
In Gen9, EDRAM now acts as a memory-side cache between LLC and DRAM. Also, the EDRAM memory controller has moved into the system agent, adjacent to the display controller, to support power-efficient, low-latency display refresh.
Texture samplers now natively support the NV12 YUV format for improved surface sharing between compute APIs and media fixed-function units.
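As a concrete illustration of coherent SVM (not part of the original text), below is a minimal host-side sketch of allocating a fine-grained SVM buffer that the CPU and GPU can address through the same pointer. It assumes an OpenCL 2.0 capable driver for the GPU; the kernel launch and most error handling are omitted.

```cpp
// Minimal sketch of coherent (fine-grained) shared virtual memory from the host side.
// Assumes an OpenCL 2.0 capable driver; error checks are mostly omitted for brevity.
#define CL_TARGET_OPENCL_VERSION 200
#include <CL/cl.h>
#include <cstdio>

int main() {
    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, nullptr);

    cl_device_id device;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);

    cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, nullptr);

    // Allocate a buffer that both CPU and GPU address through the same pointer.
    // CL_MEM_SVM_FINE_GRAIN_BUFFER requests coherent access, so no map/unmap is needed.
    const size_t n = 1024;
    float* data = static_cast<float*>(
        clSVMAlloc(ctx, CL_MEM_READ_WRITE | CL_MEM_SVM_FINE_GRAIN_BUFFER,
                   n * sizeof(float), 0));
    if (!data) {
        fprintf(stderr, "fine-grained SVM not available on this device\n");
        clReleaseContext(ctx);
        return 1;
    }

    // The CPU writes directly; a kernel given this pointer via
    // clSetKernelArgSVMPointer() would see the same data coherently through the LLC.
    for (size_t i = 0; i < n; ++i) data[i] = static_cast<float>(i);
    printf("SVM buffer ready, first element = %f\n", data[0]);

    clSVMFree(ctx, data);
    clReleaseContext(ctx);
    return 0;
}
```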


Gen9 Compute Capability Refinements:
Preemption of compute applications is now supported at a thread level, meaning that compute threads can be preempted (and later resumed) midway through their execution.
Round-robin scheduling of threads within an execution unit.
Gen9 adds new native support for the 32-bit float atomic operations min, max, and compare/exchange (a software compare/exchange fallback is sketched below for contrast). Also, the performance of all 32-bit atomics is improved for kernel scenarios that issue multiple atomics back to back.
16-bit floating-point capability is improved with native support for denormals and gradual underflow.
Gen9 Product Configuration Flexibility:
Gen9 has been designed to enable products with 1, 2, or 3 slices.
Gen9 adds new power gating and clock domains for more efficient dynamic power management. This can particularly improve low-power media playback modes.
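To make the compare/exchange point concrete, here is a small CPU-side C++ sketch (not from the original text) of the classic CAS-loop pattern for an atomic float min. This is the kind of software emulation that native 32-bit float atomic min/max support makes unnecessary.

```cpp
// Illustration of an atomic float "min" built from compare/exchange (CAS).
// Gen9 provides float min/max natively; the loop below is the software pattern
// such native support replaces. Standard C++ only, runs on the CPU.
#include <atomic>
#include <cstdio>

void atomic_min_float(std::atomic<float>& target, float value) {
    float current = target.load(std::memory_order_relaxed);
    // Keep trying to swap in `value` until either it is no longer smaller
    // or the compare/exchange succeeds; on failure, `current` is refreshed.
    while (value < current &&
           !target.compare_exchange_weak(current, value,
                                         std::memory_order_relaxed)) {
    }
}

int main() {
    std::atomic<float> minimum{1000.0f};
    atomic_min_float(minimum, 3.5f);
    atomic_min_float(minimum, 7.0f);   // no effect, 3.5 is already smaller
    printf("minimum = %f\n", minimum.load());
    return 0;
}
```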


5.2 MODULAR DESIGN FOR PRODUCT SCALABILITY

The gen9 compute architecture is designed for scalability across a wide range of target products. The architecture's modularity enables exact product targeting to a particular market segment or product power envelope. The architecture begins with compute components called execution units. Execution units are clustered into groups called subslices. Subslices are further clustered into slices. Together, execution units, subslices, and slices are the modular building blocks that are composed to create many product variants based upon Intel processor graphics gen9 compute architecture. Some example variants are shown in Figure 6, Figure 7, and Figure 8. The following sections describe the architecture components in detail and show holistically how they may be composed into full products.
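As a rough illustration of this modularity, the sketch below models the EU/subslice/slice hierarchy and totals the EUs for 1-, 2-, and 3-slice products. The per-subslice and per-slice counts used (8 EUs per subslice, 3 subslices per slice) are typical gen9 values assumed for the example, not figures taken from this section.

```cpp
// A small model of the modular building blocks: EUs cluster into subslices,
// subslices into slices, slices into a product. The per-cluster counts are
// assumed typical gen9 values, used only to show how the totals scale.
#include <cstdio>

struct SubsliceCfg { int eus_per_subslice; };
struct SliceCfg    { int subslices; SubsliceCfg subslice; };
struct ProductCfg  { int slices; SliceCfg slice; };

int total_eus(const ProductCfg& p) {
    return p.slices * p.slice.subslices * p.slice.subslice.eus_per_subslice;
}

int main() {
    SubsliceCfg ss{8};       // assumed: 8 EUs per subslice
    SliceCfg    sl{3, ss};   // assumed: 3 subslices per slice
    for (int slices = 1; slices <= 3; ++slices) {
        ProductCfg p{slices, sl};
        printf("%d-slice product: %d EUs\n", slices, total_eus(p));
    }
    return 0;
}
```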


5.3 EXECUTION UNITS (EUS) ARCHITECTURE
The foundational building block of gen9 compute architecture is the execution unit, commonly abbreviated as EU. The architecture of an EU is a combination of simultaneous multi-threading (SMT) and fine-grained interleaved multi-threading (IMT). These are compute processors that drive multiple-issue, single instruction, multiple data arithmetic logic units (SIMD ALUs) pipelined across multiple threads, for high-throughput floating-point and integer compute. The fine-grained threaded nature of the EUs ensures continuous streams of ready-to-execute instructions, while also enabling latency hiding of longer operations such as memory scatter/gather, sampler requests, or other system communication.


 

 

 

Figure: The Execution Unit (EU). Each gen9 EU has seven threads. Each thread has 128 SIMD-8 32-bit registers (GRF) and supporting architecture-specific registers (ARF). The EU can co-issue to four instruction processing units, including two FPUs, a branch unit, and a message send unit.

Product architects may fine-tune the number of threads and number of registers per EU to match scalability and specific product design requirements. For gen9-based products, each EU thread has 128 general purpose registers. Each register stores 32 bytes, accessible as a SIMD 8-element vector of 32-bit data elements. Thus each gen9 thread has 4 Kbytes of general purpose register file (GRF). In the gen9 architecture, each EU has seven threads for a total of 28 Kbytes of GRF per EU. Flexible addressing modes permit registers to be addressed together to build effectively wider registers, or even to represent strided rectangular block data structures. Per-thread architectural state is maintained in a separate dedicated architecture register file (ARF).
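The register-file sizing above is simple arithmetic; the short program below just restates it.

```cpp
// Restating the GRF arithmetic from the paragraph above.
#include <cstdio>

int main() {
    const int regs_per_thread = 128;   // general purpose registers per EU thread
    const int bytes_per_reg   = 32;    // SIMD-8 x 32-bit elements
    const int threads_per_eu  = 7;

    const int grf_per_thread = regs_per_thread * bytes_per_reg;   // 4096 bytes = 4 KB
    const int grf_per_eu     = grf_per_thread * threads_per_eu;   // 28672 bytes = 28 KB

    printf("GRF per thread: %d bytes (%d KB)\n", grf_per_thread, grf_per_thread / 1024);
    printf("GRF per EU:     %d bytes (%d KB)\n", grf_per_eu, grf_per_eu / 1024);
    return 0;
}
```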


5.3.1 Simultaneous Multi-Threading and Multiple Issue Execution
Depending on the software workload, the hardware threads within an EU may all be executing the same compute kernel code, or each EU thread could be executing code from a completely different compute kernel. The execution state of each thread, including its own instruction pointers, is held in thread-specific ARF registers.
On every cycle, an EU can co-issue up to four different instructions, which must be sourced from four different threads. The EU's thread arbiter dispatches these instructions to one of four functional units for execution. Although the issue slots for the functional units pose some instruction co-issue constraints, the four instructions are independent, since they are dispatched from four different threads. It is theoretically possible for just two non-stalling threads to fully saturate the floating-point compute throughput of the machine. More typically, all seven threads are loaded to deliver more ready-to-run instructions from which the thread arbiter may choose, thereby promoting the EU's instruction-level parallelism.
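The following is a deliberately simplified, assumed model of the idea (not the actual arbiter logic): each cycle, up to four ready threads are picked in round-robin order, and each picked thread supplies one instruction to a distinct functional unit.

```cpp
// A toy model of per-cycle co-issue: up to four instructions per cycle, each taken
// from a different ready thread and sent to a distinct functional unit. This is a
// conceptual sketch only; the real arbiter and its readiness rules are not shown here.
#include <array>
#include <cstdio>

int main() {
    const int kThreads = 7;      // hardware threads per EU
    const int kIssueSlots = 4;   // FPU0, FPU1, branch, send
    std::array<bool, kThreads> ready = {true, true, false, true, true, true, false};

    int next = 0;                // round-robin starting point
    for (int cycle = 0; cycle < 3; ++cycle) {
        printf("cycle %d issues from threads:", cycle);
        int issued = 0;
        for (int i = 0; i < kThreads && issued < kIssueSlots; ++i) {
            int t = (next + i) % kThreads;
            if (ready[t]) {
                printf(" %d", t);    // each picked thread feeds a different unit
                ++issued;
            }
        }
        next = (next + 1) % kThreads;  // rotate priority so all threads make progress
        printf("\n");
    }
    return 0;
}
```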


5.3.2 SIMD FPUs
In each EU, the primary computation units are a pair of SIMD floating-point units (FPUs). Although called FPUs, they support both floating-point and integer computation. These units can SIMD-execute up to four 32-bit floating-point (or integer) operations, or SIMD-execute up to eight 16-bit integer or 16-bit floating-point operations. The 16-bit float (half-float) support is new for gen9 compute architecture. Each SIMD FPU can complete simultaneous add and multiply (MAD) floating-point instructions every cycle. Thus each EU is capable of 16 32-bit floating-point operations per cycle: (add + mul) x 2 FPUs x SIMD-4. In gen9, both FPUs support native 32-bit integer operations. Finally, one of the FPUs provides extended math capability to support high-throughput transcendental math functions and double-precision 64-bit floating-point.
In each EU, gen9 compute architecture offers significant local bandwidth between GRF registers and the FPUs. For example, MAD instructions with three source operands and one destination operand are capable of driving 96 bytes/cycle of read bandwidth and 32 bytes/cycle of write bandwidth locally within every EU. Aggregated across the whole architecture, this bandwidth can scale linearly with the number of EUs. For gen9 products with multiple slices of EUs and higher clock rates, the aggregated theoretical peak bandwidth that is local between FPUs and GRF can approach multiple terabytes per second of read bandwidth.
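Working through that arithmetic, the sketch below computes the per-EU rate and an aggregate peak for a hypothetical product configuration; the EU count and clock used for the aggregate are illustrative assumptions, not values from this section.

```cpp
// Working the per-EU throughput arithmetic from the text: (add + mul) x 2 FPUs x SIMD-4
// gives 16 single-precision flops per EU per clock. The EU count and clock used for the
// aggregate figures are illustrative assumptions.
#include <cstdio>

int main() {
    const int flops_per_eu_per_clock = 2 /* add+mul */ * 2 /* FPUs */ * 4 /* SIMD-4 */;

    const int    eus       = 24;     // assumed: a 1-slice gen9 product
    const double clock_ghz = 1.15;   // assumed example clock

    const double peak_gflops = flops_per_eu_per_clock * eus * clock_ghz;
    printf("per-EU: %d flops/clock, product peak: %.1f GFLOPS\n",
           flops_per_eu_per_clock, peak_gflops);

    // GRF read bandwidth local to each EU for MAD: 96 bytes/clock (three 32-byte operands).
    const double grf_read_gbs = 96.0 * eus * clock_ghz;   // bytes/clock x GHz -> GB/s
    printf("aggregate local GRF read bandwidth: %.0f GB/s\n", grf_read_gbs);
    return 0;
}
```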


5.3.3 Branch and Send Units
Within the EUs, branch instructions are dispatched to a dedicated branch unit to facilitate SIMD divergence and eventual convergence. Finally, memory operations, sampler operations, and other longer-latency system communications are all dispatched via “send” instructions that are executed by the message passing send unit.


5.3.4 EU ISA and Flexible Width SIMD
The EU Instruction Set Architecture (ISA) and associated general purpose register file are all designed to support a flexible SIMD width. Thus for 32-bit data types, the gen9 FPUs can be viewed as physically 4-wide. But the FPUs may be targeted with SIMD instructions and registers that are logically 1-wide, 2-wide, 4-wide, 8-wide, 16-wide, or 32-wide.

For example, a single operand to a SIMD-16 wide instruction pairs two adjacent SIMD-8 wide registers, logically addressing the pair as a single SIMD-16 wide register containing a contiguous 64 bytes. This logically SIMD-16 wide instruction is transparently broken down by the microarchitecture into physically SIMD-4 wide FPU operations, which are iteratively executed. From the viewpoint of a single thread, wider SIMD instructions do take more cycles to complete execution. But because the EUs and EU functional units are fully pipelined across multiple threads, SIMD-8, SIMD-16, and SIMD-32 instructions are all capable of maximizing compute throughput in a fully loaded system.
The instruction SIMD-width choice is left to the compiler or low-level programmer. Differing SIMD-width instructions can be issued back to back with no performance penalty. This flexible design allows compiler heuristics and programmers to choose specific SIMD widths that precisely optimize the register allocation footprint for individual programs, balanced against the amount of work assigned to each thread.
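The sketch below illustrates this execution model in plain C++: a logically SIMD-16 multiply-add is carried out as four back-to-back passes of a 4-wide routine standing in for one physical FPU operation. It is a conceptual illustration, not real EU ISA.

```cpp
// Sketch of the idea that a logically SIMD-16 instruction is executed as iterated,
// physically SIMD-4 operations. The 4-wide "pass" is a plain loop standing in for
// one FPU issue; this models the execution pattern only.
#include <array>
#include <cstdio>

constexpr int kLogicalWidth  = 16;   // SIMD width seen by the instruction
constexpr int kPhysicalWidth = 4;    // width of one FPU pass

// One physical pass: multiply-add on a 4-element chunk.
void simd4_mad(float* dst, const float* a, const float* b, const float* c) {
    for (int lane = 0; lane < kPhysicalWidth; ++lane)
        dst[lane] = a[lane] * b[lane] + c[lane];
}

int main() {
    std::array<float, kLogicalWidth> a{}, b{}, c{}, d{};
    for (int i = 0; i < kLogicalWidth; ++i) { a[i] = i; b[i] = 2.0f; c[i] = 1.0f; }

    // The SIMD-16 MAD is broken into four back-to-back SIMD-4 passes.
    for (int base = 0; base < kLogicalWidth; base += kPhysicalWidth)
        simd4_mad(&d[base], &a[base], &b[base], &c[base]);

    printf("d[15] = %f\n", d[15]);   // 15*2 + 1 = 31
    return 0;
}
```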



5.3.5 SIMD Code Generation for SPMD Programming Models
Compilers for single program multiple data (SPMD) programming models, such as RenderScript, OpenCL™, Microsoft DirectX* Compute Shader, OpenGL* Compute, and C++ AMP, generate SIMD code to map multiple kernel instances to be executed simultaneously within a given hardware thread. The exact number of kernel instances per thread is a heuristic-driven compiler choice. We refer to this compiler choice as the dominant SIMD-width of the kernel. In OpenCL and DirectX Compute Shader, SIMD-8, SIMD-16, and SIMD-32 are the most common SIMD-width targets.
On gen9 compute architecture, most SPMD programming models employ this style of code generation and EU processor execution. Effectively, each SPMD kernel instance appears to execute serially and independently within its own SIMD lane. In actuality, each thread executes a SIMD-width number of kernel instances concurrently. Thus for a SIMD-16 compile of a compute kernel, it is possible for SIMD-16 x 7 threads = 112 kernel instances to be executing concurrently on a single EU. Similarly, for a SIMD-32 compile of a compute kernel, 32 x 7 threads = 224 kernel instances could be executing concurrently on a single EU.
For a given SIMD-width, if all kernel instances within a thread are executing the same instruction, then the SIMD lanes can be maximally utilized. If one or more of the kernel instances chooses a divergent branch, then the thread will execute the two paths of the branch separately in serial. The EU branch unit keeps track of such branch divergence and branch nesting. The branch unit also generates a “live-ness” mask to indicate which kernel instances in the current SIMD-width need to execute (or not execute) the branch.
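The following scalar C++ sketch emulates that behavior for one SIMD-8 thread: a per-lane live-ness mask is built from the branch condition, and the two sides of the branch are then executed serially under the mask. The kernel body is made up purely for illustration.

```cpp
// A scalar emulation of branch divergence under a per-lane "live-ness" mask:
// both sides of a divergent branch execute serially, with only the lanes whose
// mask bit is set taking effect.
#include <array>
#include <cstdint>
#include <cstdio>

constexpr int kWidth = 8;   // one SIMD-8 thread's worth of kernel instances

int main() {
    std::array<int, kWidth> x{ -3, 5, -1, 8, 0, -7, 2, 4 };
    std::array<int, kWidth> out{};

    // Evaluate the branch condition per lane to build the live-ness mask.
    uint32_t take_if = 0;
    for (int lane = 0; lane < kWidth; ++lane)
        if (x[lane] < 0) take_if |= (1u << lane);
    uint32_t take_else = ~take_if & ((1u << kWidth) - 1);

    // "if" path: executed only for the lanes set in take_if.
    for (int lane = 0; lane < kWidth; ++lane)
        if (take_if & (1u << lane)) out[lane] = -x[lane];

    // "else" path: executed afterwards for the remaining lanes.
    for (int lane = 0; lane < kWidth; ++lane)
        if (take_else & (1u << lane)) out[lane] = x[lane] * 2;

    for (int lane = 0; lane < kWidth; ++lane) printf("%d ", out[lane]);
    printf("\n");
    return 0;
}
```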


 

Origin www.cnblogs.com/shaohef/p/12113774.html