Why does changing 0.1f to 0 slow down performance by 10x?

Translated from: Why does changing 0.1f to 0 slow down performance by 10x?

Why does this bit of code,

const float x[16] = {  1.1,   1.2,   1.3,     1.4,   1.5,   1.6,   1.7,   1.8,
                       1.9,   2.0,   2.1,     2.2,   2.3,   2.4,   2.5,   2.6};
const float z[16] = {1.123, 1.234, 1.345, 156.467, 1.578, 1.689, 1.790, 1.812,
                     1.923, 2.034, 2.145,   2.256, 2.367, 2.478, 2.589, 2.690};
float y[16];
for (int i = 0; i < 16; i++)
{
    y[i] = x[i];
}

for (int j = 0; j < 9000000; j++)
{
    for (int i = 0; i < 16; i++)
    {
        y[i] *= x[i];
        y[i] /= z[i];
        y[i] = y[i] + 0.1f; // <--
        y[i] = y[i] - 0.1f; // <--
    }
}

run more than 10 times faster than the following bit (identical except where noted)?

const float x[16] = {  1.1,   1.2,   1.3,     1.4,   1.5,   1.6,   1.7,   1.8,
                       1.9,   2.0,   2.1,     2.2,   2.3,   2.4,   2.5,   2.6};
const float z[16] = {1.123, 1.234, 1.345, 156.467, 1.578, 1.689, 1.790, 1.812,
                     1.923, 2.034, 2.145,   2.256, 2.367, 2.478, 2.589, 2.690};
float y[16];
for (int i = 0; i < 16; i++)
{
    y[i] = x[i];
}

for (int j = 0; j < 9000000; j++)
{
    for (int i = 0; i < 16; i++)
    {
        y[i] *= x[i];
        y[i] /= z[i];
        y[i] = y[i] + 0; // <--
        y[i] = y[i] - 0; // <--
    }
}

when compiling with Visual Studio 2010 SP1? The optimization level was /O2 with SSE2 enabled. I haven't tested with other compilers.


Reply #1

Reference: https://stackoom.com/question/d58Q/为什么将-f更改为-会使性能降低-倍


Reply #2

In gcc you can enable FTZ (flush-to-zero) and DAZ (denormals-are-zero) with this:

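#include <xmmintrin.h>

#define FTZ 1
#define DAZ 1

void enableFtzDaz()
{
    unsigned int mxcsr = _mm_getcsr();

    if (FTZ) {
        // Bit 15: flush-to-zero; bit 11: mask the underflow exception.
        mxcsr |= (1<<15) | (1<<11);
    }

    if (DAZ) {
        // Bit 6: denormals-are-zero.
        mxcsr |= (1<<6);
    }

    _mm_setcsr(mxcsr);
}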

Also use the gcc switches: -msse -mfpmath=sse

(Credit to Carl Hetherington [1].)

[1] http://carlh.net/plugins/denormals.php


Reply #3

It's due to denormalized floating-point use. How do you get rid of both it and the performance penalty? Having scoured the Internet for ways of killing denormal numbers, it seems there is no "best" way to do this yet. I have found the following methods, each of which may work best in a different environment:

  • Might not work in some GCC environments:

     // Requires #include <fenv.h>
     fesetenv(FE_DFL_DISABLE_SSE_DENORMS_ENV);

  • Might not work in some Visual Studio environments:

     // Requires #include <xmmintrin.h>
     _mm_setcsr( _mm_getcsr() | (1<<15) | (1<<6) );
     // Does both FTZ and DAZ bits. You can also use just hex value 0x8040 to do both.
     // You might also want to use the underflow mask (1<<11)

  • Appears to work in both GCC and Visual Studio:

     // Requires #include <xmmintrin.h>
     // Requires #include <pmmintrin.h>
     _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
     _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);

  • The Intel compiler has options to disable denormals by default on modern Intel CPUs.

  • Compiler switches. -ffast-math, -msse or -mfpmath=sse will disable denormals and make a few other things faster, but unfortunately also do lots of other approximations that might break your code. Test carefully! The equivalent of fast-math for the Visual Studio compiler is /fp:fast, but I haven't been able to confirm whether this also disables denormals.


Reply #4

Dan Neely's comment ought to be expanded into an answer:

It is not the zero constant 0.0f that is denormalized or causes a slowdown; it is the values that approach zero on each iteration of the loop. As they come closer and closer to zero, they need more precision to represent and they become denormalized. These are the y[i] values. (They approach zero because x[i]/z[i] is less than 1.0 for all i.)

The crucial difference between the slow and fast versions of the code is the statement y[i] = y[i] + 0.1f;. As soon as this line is executed on each iteration of the loop, the extra precision in the float is lost, and the denormalization needed to represent that precision is no longer needed. Afterwards, floating-point operations on y[i] remain fast because the values aren't denormalized.

Why is the extra precision lost when you add 0.1f? Because floating-point numbers only have so many significant digits. Say you have enough storage for three significant digits; then 0.00001 = 1e-5, and 0.00001 + 0.1 = 0.1, at least for this example float format, because it doesn't have room to store the least significant digit in 0.10001.

In short, y[i]=y[i]+0.1f; y[i]=y[i]-0.1f; isn't the no-op you might think it is.

Mysticial said this as well: the content of the floats matters, not just the assembly code.


Reply #5

Using gcc and applying a diff to the generated assembly yields only this difference:

73c68,69
<   movss   LCPI1_0(%rip), %xmm1
---
>   movabsq $0, %rcx
>   cvtsi2ssq   %rcx, %xmm1
81d76
<   subss   %xmm1, %xmm0

The cvtsi2ssq version is indeed 10 times slower.

Apparently, the float version uses an XMM register loaded from memory, while the int version converts a real int value 0 to float using the cvtsi2ssq instruction, taking a lot of time. Passing -O3 to gcc doesn't help. (gcc version 4.2.1.)

(Using double instead of float doesn't matter, except that it changes the cvtsi2ssq into a cvtsi2sdq.)

Update

Some extra tests show that it is not necessarily the cvtsi2ssq instruction. Once it is eliminated (using int ai=0;float a=ai; and using a instead of 0), the speed difference remains. So @Mysticial is right, the denormalized floats make the difference. This can be seen by testing values between 0 and 0.1f. The turning point in the above code is approximately at 0.00000000000000000000000000000001, when the loop suddenly takes 10 times as long.

Update<<1

A small visualisation of this interesting phenomenon:

  • Column 1: a float, divided by 2 for every iteration
  • Column 2: the binary representation of this float
  • Column 3: the time taken to sum this float 1e7 times

You can clearly see the exponent (the last 9 bits; the bits are printed least significant first, so these are the 8 exponent bits followed by the sign bit) change to its lowest value when denormalization sets in. At that point, simple addition becomes 20 times slower.

0.000000000000000000000000000000000100000004670110: 10111100001101110010000011100000 45 ms
0.000000000000000000000000000000000050000002335055: 10111100001101110010000101100000 43 ms
0.000000000000000000000000000000000025000001167528: 10111100001101110010000001100000 43 ms
0.000000000000000000000000000000000012500000583764: 10111100001101110010000110100000 42 ms
0.000000000000000000000000000000000006250000291882: 10111100001101110010000010100000 48 ms
0.000000000000000000000000000000000003125000145941: 10111100001101110010000100100000 43 ms
0.000000000000000000000000000000000001562500072970: 10111100001101110010000000100000 42 ms
0.000000000000000000000000000000000000781250036485: 10111100001101110010000111000000 42 ms
0.000000000000000000000000000000000000390625018243: 10111100001101110010000011000000 42 ms
0.000000000000000000000000000000000000195312509121: 10111100001101110010000101000000 43 ms
0.000000000000000000000000000000000000097656254561: 10111100001101110010000001000000 42 ms
0.000000000000000000000000000000000000048828127280: 10111100001101110010000110000000 44 ms
0.000000000000000000000000000000000000024414063640: 10111100001101110010000010000000 42 ms
0.000000000000000000000000000000000000012207031820: 10111100001101110010000100000000 42 ms
0.000000000000000000000000000000000000006103515209: 01111000011011100100001000000000 789 ms
0.000000000000000000000000000000000000003051757605: 11110000110111001000010000000000 788 ms
0.000000000000000000000000000000000000001525879503: 00010001101110010000100000000000 788 ms
0.000000000000000000000000000000000000000762939751: 00100011011100100001000000000000 795 ms
0.000000000000000000000000000000000000000381469876: 01000110111001000010000000000000 896 ms
0.000000000000000000000000000000000000000190734938: 10001101110010000100000000000000 813 ms
0.000000000000000000000000000000000000000095366768: 00011011100100001000000000000000 798 ms
0.000000000000000000000000000000000000000047683384: 00110111001000010000000000000000 791 ms
0.000000000000000000000000000000000000000023841692: 01101110010000100000000000000000 802 ms
0.000000000000000000000000000000000000000011920846: 11011100100001000000000000000000 809 ms
0.000000000000000000000000000000000000000005961124: 01111001000010000000000000000000 795 ms
0.000000000000000000000000000000000000000002980562: 11110010000100000000000000000000 835 ms
0.000000000000000000000000000000000000000001490982: 00010100001000000000000000000000 864 ms
0.000000000000000000000000000000000000000000745491: 00101000010000000000000000000000 915 ms
0.000000000000000000000000000000000000000000372745: 01010000100000000000000000000000 918 ms
0.000000000000000000000000000000000000000000186373: 10100001000000000000000000000000 881 ms
0.000000000000000000000000000000000000000000092486: 01000010000000000000000000000000 857 ms
0.000000000000000000000000000000000000000000046243: 10000100000000000000000000000000 861 ms
0.000000000000000000000000000000000000000000022421: 00001000000000000000000000000000 855 ms
0.000000000000000000000000000000000000000000011210: 00010000000000000000000000000000 887 ms
0.000000000000000000000000000000000000000000005605: 00100000000000000000000000000000 799 ms
0.000000000000000000000000000000000000000000002803: 01000000000000000000000000000000 828 ms
0.000000000000000000000000000000000000000000001401: 10000000000000000000000000000000 815 ms
0.000000000000000000000000000000000000000000000000: 00000000000000000000000000000000 42 ms
0.000000000000000000000000000000000000000000000000: 00000000000000000000000000000000 42 ms
0.000000000000000000000000000000000000000000000000: 00000000000000000000000000000000 44 ms

An equivalent discussion about ARM can be found in the Stack Overflow question "Denormalized floating point in Objective-C?".


Reply #6

Welcome to the world of denormalized floating-point! They can wreak havoc on performance!!!

Denormal (or subnormal) numbers are kind of a hack to get some extra values very close to zero out of the floating-point representation. Operations on denormalized floating-point can be tens to hundreds of times slower than on normalized floating-point. This is because many processors can't handle them directly and must trap and resolve them using microcode.

If you print out the numbers after 10,000 iterations, you will see that they have converged to different values depending on whether 0 or 0.1 is used.

Here's the test code compiled on x64:

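#include <cstdlib>   // system
#include <iostream>  // cout, endl
#include <omp.h>     // omp_get_wtime (build with OpenMP enabled)
using namespace std;

int main() {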

    double start = omp_get_wtime();

    const float x[16]={1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9,2.0,2.1,2.2,2.3,2.4,2.5,2.6};
    const float z[16]={1.123,1.234,1.345,156.467,1.578,1.689,1.790,1.812,1.923,2.034,2.145,2.256,2.367,2.478,2.589,2.690};
    float y[16];
    for(int i=0;i<16;i++)
    {
        y[i]=x[i];
    }
    for(int j=0;j<9000000;j++)
    {
        for(int i=0;i<16;i++)
        {
            y[i]*=x[i];
            y[i]/=z[i];
#ifdef FLOATING
            y[i]=y[i]+0.1f;
            y[i]=y[i]-0.1f;
#else
            y[i]=y[i]+0;
            y[i]=y[i]-0;
#endif

            if (j > 10000)
                cout << y[i] << "  ";
        }
        if (j > 10000)
            cout << endl;
    }

    double end = omp_get_wtime();
    cout << end - start << endl;

    system("pause");
    return 0;
}

Output:

#define FLOATING
1.78814e-007  1.3411e-007  1.04308e-007  0  7.45058e-008  6.70552e-008  6.70552e-008  5.58794e-007  3.05474e-007  2.16067e-007  1.71363e-007  1.49012e-007  1.2666e-007  1.11759e-007  1.04308e-007  1.04308e-007
1.78814e-007  1.3411e-007  1.04308e-007  0  7.45058e-008  6.70552e-008  6.70552e-008  5.58794e-007  3.05474e-007  2.16067e-007  1.71363e-007  1.49012e-007  1.2666e-007  1.11759e-007  1.04308e-007  1.04308e-007

//#define FLOATING
6.30584e-044  3.92364e-044  3.08286e-044  0  1.82169e-044  1.54143e-044  2.10195e-044  2.46842e-029  7.56701e-044  4.06377e-044  3.92364e-044  3.22299e-044  3.08286e-044  2.66247e-044  2.66247e-044  2.24208e-044
6.30584e-044  3.92364e-044  3.08286e-044  0  1.82169e-044  1.54143e-044  2.10195e-044  2.45208e-029  7.56701e-044  4.06377e-044  3.92364e-044  3.22299e-044  3.08286e-044  2.66247e-044  2.66247e-044  2.24208e-044

Note how in the second run the numbers are very close to zero.

Denormalized numbers are generally rare and thus most processors don't try to handle them efficiently.


To demonstrate that this has everything to do with denormalized numbers, we can flush denormals to zero by adding this to the start of the code:

_MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);

Then the version with 0 is no longer 10x slower and actually becomes faster. (This requires that the code be compiled with SSE enabled.)

This means that rather than using these weird lower-precision almost-zero values, we just round to zero instead.

Timings: Core i7 920 @ 3.5 GHz:

//  Don't flush denormals to zero.
0.1f: 0.564067
0   : 26.7669

//  Flush denormals to zero.
0.1f: 0.587117
0   : 0.341406

In the end, this really has nothing to do with whether it's an integer or floating-point. The 0 or 0.1f is converted/stored into a register outside of both loops. So that has no effect on performance.

Reposted from blog.csdn.net/xfxf996/article/details/105219192