[Reprint] What does AVX-512 really mean?

What does AVX-512 mean?

https://baijiahao.baidu.com/s?id=1653677566154792925&wfr=spider&for=pc

 

Heart travel technology

Published: 19-12-23 10:57

Near the end of the year, Intel and AMD both chose to release new products for their high-end desktop (HEDT) platforms: the 10th-generation Core X series and the third-generation Ryzen Threadripper (TR). As old rivals who know each other inside out, the two have given their HEDT products very similar characteristics. Compared with their respective mainstream desktop (MSDT) products, both offer more cores, 4-channel memory support for higher data bandwidth, and additional PCIe lanes for more expansion devices. The only difference lies in the instruction set: the 10th-generation Core X is already the third generation of products to support the AVX-512 instruction set, while AMD's new TR supports only AVX2. The three shared characteristics are all physical specifications and are easy to understand. The AVX-512 instruction set, however, is something most people cannot see or touch. Today we will talk about what performance gains the AVX-512 instruction set can bring.

What is an instruction set? What is AVX-512?

A CPU is a logic circuit built from transistors. The NAND gate, as we know, is one of the simplest digital logic circuits; there are also NOR gates, XOR gates, XNOR gates, Schmitt trigger gates, and so on. A simple logic circuit can complete only a single, simple operation. A general-purpose computing unit is formed by combining a large number of these simple logic circuits and feeding them instructions and data, so that the CPU can carry out complex calculations and decisions. As the operations grew more and more complex, people regrouped and categorized the instructions to improve computational efficiency, forming a relatively standardized collection: the instruction set.
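To make the idea of "combining simple logic circuits" concrete, here is a minimal Python sketch that builds a 4-bit adder out of nothing but a NAND function. It is a software model for illustration only; the function names are hypothetical and it does not describe how any particular CPU is actually laid out.

```python
def nand(a, b):
    """NAND gate on single bits (0 or 1)."""
    return 1 - (a & b)

# Every other gate below is built only from NAND.
def not_(a):
    return nand(a, a)

def and_(a, b):
    return not_(nand(a, b))

def or_(a, b):
    return nand(not_(a), not_(b))

def xor(a, b):
    n = nand(a, b)
    return nand(nand(a, n), nand(b, n))

def half_adder(a, b):
    """Return (sum_bit, carry_bit) for two input bits."""
    return xor(a, b), and_(a, b)

def full_adder(a, b, carry_in):
    """Return (sum_bit, carry_out) for two input bits plus a carry."""
    s1, c1 = half_adder(a, b)
    s2, c2 = half_adder(s1, carry_in)
    return s2, or_(c1, c2)

def add_4bit(x, y):
    """Add two 4-bit integers with a ripple-carry chain of full adders."""
    result, carry = 0, 0
    for i in range(4):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result | (carry << 4)  # the final carry becomes bit 4

print(add_4bit(9, 7))
```

An "add" instruction is, in this simplified picture, just a name for routing two operands through a circuit like `add_4bit`.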

The Core i9-10980XE supports more than a dozen extended instruction sets

The most widely used instruction set is Intel's x86, and processor systems based on it are called the x86 architecture. The processors in the PCs we use, whether from Intel or AMD, are all derived from this instruction set. Over the decades, as CPU computing power has grown, new instructions have been added continually. Support for 64-bit computing, for example, gave rise to the x86-64 instruction set, which is the standard for today's PC processors. Meanwhile, there are also extended instruction sets designed for particular kinds of computation, which raise CPU performance in specific areas: the familiar SSE (Streaming SIMD Extensions) family enhances multimedia capabilities, VT-x (Intel Virtualization Technology) targets virtualization performance, AES-NI targets encryption and decryption algorithms, and then there is the AVX (Advanced Vector Extensions) instruction set.

The evolution of SIMD data width on Intel processors: 2011 is the dividing line, with SSE before it and AVX after

The AVX and SSE instruction sets come from the same lineage: both are SIMD (Single Instruction, Multiple Data) instruction sets. Intel introduced AVX in March 2008, and it was first supported by the Sandy Bridge family of processors released in January 2011. In June of the same year, Intel announced the AVX2 instruction set (now commonly called AVX256), which extended integer operations from 128 bits to 256 bits and introduced FMA (fused multiply-add) instructions as a supplement. Two years later, the Haswell family became the first CPU products to support AVX2.
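As a rough illustration of how software can find out which of these extensions a processor offers, the Python sketch below parses the `flags` line that Linux exposes in `/proc/cpuinfo`. The helper names and the sample text are made up for this example; the flag spellings (`sse4_2`, `avx2`, `avx512f`, `avx512_vnni`, `aes`, `vmx`) are the ones the Linux kernel reports, and other operating systems expose this information differently.

```python
# Map a human-readable extension name to the flag Linux reports for it.
EXTENSIONS = {
    "SSE4.2": "sse4_2",
    "AVX": "avx",
    "AVX2": "avx2",
    "AVX-512 Foundation": "avx512f",
    "AVX-512 VNNI": "avx512_vnni",
    "AES-NI": "aes",
    "VT-x": "vmx",
}

def parse_cpu_flags(cpuinfo_text):
    """Return the set of flags from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()

def supported_extensions(cpuinfo_text):
    """Return {extension name: True/False} for the extensions listed above."""
    flags = parse_cpu_flags(cpuinfo_text)
    return {name: flag in flags for name, flag in EXTENSIONS.items()}

# Hypothetical sample resembling a CPU with AVX2 but without AVX-512.
SAMPLE = "processor\t: 0\nflags\t\t: fpu sse sse2 sse4_2 aes avx avx2 fma vmx\n"

for name, present in supported_extensions(SAMPLE).items():
    print(f"{name:20s} {'yes' if present else 'no'}")
```

On a Linux machine, passing `open("/proc/cpuinfo").read()` instead of `SAMPLE` reports the extensions of the actual CPU.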

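In 2013, Intel officially announced the AVX-512 instruction set, widening the instructions further to 512 bits. Compared with AVX2, the width of the data registers, the number of registers, and the width of the FMA units have all doubled. As a result, each clock cycle can pack 32 double-precision or 64 single-precision floating-point operations, or eight 64-bit or sixteen 32-bit integers. In compute-intensive scenarios such as image, audio, and video processing, data analysis, scientific computing, data encryption and compression, and artificial intelligence/deep learning, this delivers unprecedented performance: in theory, floating-point performance doubles and integer performance rises by about 33%.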
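The per-cycle figures above can be reproduced with simple arithmetic. The Python sketch below assumes two FMA units per core and counts each fused multiply-add as two floating-point operations; both are assumptions made for this illustration, not statements taken from the article.

```python
def lanes(register_bits, element_bits):
    """Number of elements of a given size that fit in one vector register."""
    return register_bits // element_bits

def flops_per_cycle(register_bits, element_bits, fma_units=2):
    """Peak floating-point operations per cycle per core.

    Assumes `fma_units` FMA units per core (hypothetical default: 2) and
    counts one fused multiply-add as 2 operations (one multiply, one add).
    """
    return lanes(register_bits, element_bits) * fma_units * 2

for name, width in [("AVX2", 256), ("AVX-512", 512)]:
    print(
        f"{name}: {lanes(width, 64)} x 64-bit lanes, "
        f"{lanes(width, 32)} x 32-bit lanes, "
        f"{flops_per_cycle(width, 64)} FP64 and "
        f"{flops_per_cycle(width, 32)} FP32 operations per cycle"
    )
```

Under these assumptions a 512-bit register holds 8 doubles or 16 floats, which gives 32 double-precision and 64 single-precision operations per cycle, twice the AVX2 figures.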

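Intel also keeps expanding the range of AVX-512 instructions. For example, the 10th-generation Core X, as the third generation of consumer processors to support AVX-512, adds VNNI (Vector Neural Network Instructions), which accelerates the integer matrix operations commonly used in deep learning.

How much of an improvement can AVX-512 bring?

Unlike improvements in physical specifications (such as core/thread counts or clock frequency), which change performance immediately, an instruction set's contribution to performance usually shows up only gradually, after some delay. Once the hardware is in place, software still has to make full use of the instruction set and be optimized for it. Three years after CPUs began supporting AVX-512, its application ecosystem has finally started to spread. At the very least, a large number of benchmarks can now demonstrate the power of AVX-512.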
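To show what kind of integer operation VNNI accelerates, here is a small Python emulation of the arithmetic performed by the VNNI instruction `VPDPBUSD`: in each 32-bit lane, four unsigned 8-bit values are multiplied by four signed 8-bit values, and the sum of the products is added to a 32-bit accumulator. This is a software model for illustration, not how one would call the instruction in practice; the Python function names are hypothetical.

```python
def wrap_int32(x):
    """Wrap an integer to signed 32-bit, as a non-saturating add would."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x & 0x80000000 else x

def vpdpbusd(acc, a_u8, b_s8):
    """Emulate the per-lane arithmetic of VPDPBUSD.

    acc  : list of signed 32-bit accumulators (16 for a 512-bit register)
    a_u8 : unsigned bytes (0..255), four per accumulator lane
    b_s8 : signed bytes (-128..127), four per accumulator lane
    """
    assert len(a_u8) == len(b_s8) == 4 * len(acc)
    out = []
    for lane, total in enumerate(acc):
        for j in range(4):
            total += a_u8[4 * lane + j] * b_s8[4 * lane + j]
        out.append(wrap_int32(total))
    return out

# Two lanes shown for brevity; a full 512-bit register would hold 16.
acc = [10, 0]
a = [1, 2, 3, 4, 255, 255, 255, 255]
b = [1, -1, 2, -2, 127, 127, 127, 127]
print(vpdpbusd(acc, a, b))
```

A quantized matrix multiplication is essentially many such dot products, which is why folding the multiply, the horizontal add, and the accumulate into one instruction helps inference workloads.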

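Sandra 2020 already supports the AVX-512 instruction set. Because we have not yet obtained a Zen 2-based TR processor, we could only use AMD's current strongest part, the Ryzen 9 3950X, for comparison. In all four sub-tests of the processor multimedia benchmark, the Core i9-10980XE leads the Ryzen 9 3950X by a wide margin, with the largest advantage reaching 48.7%. Although the former has 2 more cores, at default settings the Core i9-10980XE actually runs all-core AVX-512 workloads at 2.8GHz, while the 3950X runs at 3.5GHz. The Core i9-10980XE posts far better results than its rival at a lower frequency, which shows how powerful AVX-512 is.

In addition, Sandra 2020 provides full AVX-512 support in three other tests (image processing, encryption/decryption, and scientific computing), so the Core i9-10980XE's lead in these three tests is equally striking.

AIDA64 introduced AVX-512 support in its benchmark modules starting with v5.97

AIDA64 has also supported the AVX-512 instruction set since v5.97. Among its benchmarks, CPU PhotoWorxx (image processing performance), FPU Julia (single-precision floating-point performance), FPU Mandel (double-precision floating-point performance), and FP32/FP64 Ray-Trace (single- and double-precision floating-point performance in ray-tracing calculations) have been optimized to use AVX-512 instructions, so the Core i9-10980XE has a very clear advantage in these tests. It should be pointed out, though, that the lead in the three tests CPU PhotoWorxx and FP32/FP64 Ray-Trace is so large partly because they all place very high demands on memory bandwidth, and the Core i9-10980XE's 4-channel memory contributes a good deal there.

Despite the huge leads in multimedia processing, scientific computing, and similar workloads, these application scenarios still sound somewhat abstract to ordinary PC users. In fact, games have also gradually begun to adopt the AVX family of instruction sets to accelerate coordinate transformations or encryption (the much-hated Denuvo protection uses AVX instructions for its encryption). The most typical example is Ubisoft's Assassin's Creed Odyssey: at launch, many players ran into crashes and sudden exits, and the problem was resolved only after Ubisoft modified part of the AVX code so that CPUs without support for the instruction set could also run the game.

The custom run screen of 3DMark Time Spy Extreme lets you manually specify the instruction set the CPU uses

3DMark supports AVX-512 in its latest Time Spy Extreme test. On the custom run screen, you can specify which instruction set to run with. The physics test results show that with the AVX2 instruction set, the Core i9-10980XE and the Ryzen 9 3950X score essentially the same. With AVX-512 selected, the latter's score barely changes because it does not support the instruction set, while the Core i9-10980XE's performance rises sharply, and its lead jumps to nearly 20%.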
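The point about frequency can be illustrated with a back-of-the-envelope calculation of theoretical peak throughput: cores multiplied by clock multiplied by operations per cycle. The core counts and clocks below follow the paragraph above (18 versus 16 cores, 2.8GHz versus 3.5GHz); the per-cycle figures of 32 and 16 double-precision operations are assumptions (two 512-bit FMA units per core versus two 256-bit FMA units per core). These are theoretical ceilings, not measured results.

```python
def peak_gflops(cores, ghz, flops_per_cycle):
    """Theoretical peak in GFLOPS = cores x clock (GHz) x operations per cycle."""
    return cores * ghz * flops_per_cycle

# Assumed FP64 operations per cycle per core (see the lead-in above).
AVX512_FP64_PER_CYCLE = 32
AVX2_FP64_PER_CYCLE = 16

intel = peak_gflops(cores=18, ghz=2.8, flops_per_cycle=AVX512_FP64_PER_CYCLE)
amd = peak_gflops(cores=16, ghz=3.5, flops_per_cycle=AVX2_FP64_PER_CYCLE)

print(f"Core i9-10980XE (AVX-512): {intel:.1f} GFLOPS")
print(f"Ryzen 9 3950X (AVX2):      {amd:.1f} GFLOPS")
print(f"Ratio: {intel / amd:.2f}x")
```

Even with a 20% lower clock, the doubled vector width leaves the wider design with a higher theoretical ceiling. Real benchmarks land well below such a ratio because memory bandwidth, instruction mix, and thermal limits all intervene.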

The future of AVX-512

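Intel has always treated AVX-512 as a trump card for its Xeon and HEDT platforms, so it has been put to effective use only in some commercial software and scientific computing/simulation software. This strengthens the products' technical advantage, but it also limits the spread of AVX-512 to some extent. After all, AMD is very much the follower when it comes to instruction sets and can only trail along behind Intel step by step (for example, full-width 256-bit AVX2 execution arrived only with this year's Zen 2 architecture, six years behind Intel).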

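Choose Core X, choose AVX-512, and be ready for the future

But this situation is about to change. The mobile 10nm Ice Lake processors already support the AVX-512 instruction set, and the Comet Lake processors for the MSDT platform, due in the first quarter of next year, are expected to become the first mainstream desktop processors to support AVX-512. This means that, at least on Intel's side, AVX-512 is about to become universal, which should lead a large number of applications, especially games and everyday software, to start using AVX-512 and optimizing for it. Given AMD's lag in this area, Intel processors may see a sudden, large jump in performance across major benchmarks and applications in the near future, and the CPU performance ladder charts and rankings published by the major media outlets would have to be substantially revised once again.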

 

 

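Use case: SHA256 (SHA256 hash computation is an important part of the payload processing pipeline)

Advantages:

1. Register changes (compared with AVX2, not only does the register width grow from 256 bits to 512 bits, but the number of registers also doubles, to 32)

2. Up to 8x the performance of AVX2, because 16 messages are processed in parallel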

 

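How to make the best use of it

To get the best performance out of the AVX512 implementation, here are some tips:

Have many routines (concurrent workers) performing SHA256 computations in parallel.
Try to Write() messages in multiples of 64 bytes.
Try to keep the overall length of the messages roughly similar in size, so that all 16 "lanes" in the AVX512 computation contribute as much as possible.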
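The three tips can be sketched as a calling pattern. The example below uses Python's standard `hashlib` and a thread pool purely as stand-ins: `hashlib` is not the 16-lane AVX512 implementation the tips refer to, and `update()` here only plays the role that `Write()` plays in the tips. The chunk size, worker count, and helper names are assumptions chosen for illustration. What the sketch does show is the shape of the workload: many hashes in flight at once, input fed in multiples of 64 bytes (the SHA-256 block size), and messages of similar length.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

BLOCK = 64          # SHA-256 consumes its input in 64-byte blocks
CHUNK = 64 * 1024   # feed size: a multiple of 64 bytes (assumed value)

def sha256_chunked(message):
    """Hash one message, feeding it in CHUNK-sized pieces."""
    h = hashlib.sha256()
    for start in range(0, len(message), CHUNK):
        h.update(message[start:start + CHUNK])
    return h.hexdigest()

def hash_many(messages, workers=16):
    """Hash many messages concurrently, one worker per in-flight message."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(sha256_chunked, messages))

# Sixteen messages of identical length, mirroring the 16 lanes in the tips.
messages = [bytes([i]) * (256 * 1024) for i in range(16)]
digests = hash_many(messages)
print(len(digests), "digests computed")
```

Keeping the messages the same length matters for a lane-parallel implementation because a lane whose message ends early has nothing left to contribute while the longer messages finish.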
————————————————
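Copyright notice: This is an original article by CSDN blogger "hi我是大嘴巴", licensed under the CC 4.0 BY-SA agreement. When reposting, please include a link to the original source and this notice.
Original link: https://blog.csdn.net/weixin_38740463/article/details/93395476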

Origin www.cnblogs.com/jinanxiaolaohu/p/12424748.html