Translated from: Why does changing 0.1f to 0 slow down performance by 10x?
Why does this bit of code,
const float x[16] = { 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8,
                      1.9, 2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6};
const float z[16] = {1.123, 1.234, 1.345, 156.467, 1.578, 1.689, 1.790, 1.812,
                     1.923, 2.034, 2.145, 2.256, 2.367, 2.478, 2.589, 2.690};
float y[16];
for (int i = 0; i < 16; i++)
{
    y[i] = x[i];
}

for (int j = 0; j < 9000000; j++)
{
    for (int i = 0; i < 16; i++)
    {
        y[i] *= x[i];
        y[i] /= z[i];
        y[i] = y[i] + 0.1f; // <--
        y[i] = y[i] - 0.1f; // <--
    }
}
run more than 10 times faster than the following bit (identical except where noted)?
const float x[16] = { 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8,
                      1.9, 2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6};
const float z[16] = {1.123, 1.234, 1.345, 156.467, 1.578, 1.689, 1.790, 1.812,
                     1.923, 2.034, 2.145, 2.256, 2.367, 2.478, 2.589, 2.690};
float y[16];
for (int i = 0; i < 16; i++)
{
    y[i] = x[i];
}

for (int j = 0; j < 9000000; j++)
{
    for (int i = 0; i < 16; i++)
    {
        y[i] *= x[i];
        y[i] /= z[i];
        y[i] = y[i] + 0; // <--
        y[i] = y[i] - 0; // <--
    }
}
when compiling with Visual Studio 2010 SP1? The optimization level was -O2 with SSE2 enabled. I have not tested with other compilers.
#1st Floor
Reference: Why does changing 0.1f to 0 slow down performance by 10x? https://stackoom.com/question/d58Q/
#2nd Floor
In gcc you can enable FTZ and DAZ with this:
#include <xmmintrin.h>

#define FTZ 1
#define DAZ 1

void enableFtzDaz()
{
    int mxcsr = _mm_getcsr();
    if (FTZ) {
        mxcsr |= (1<<15) | (1<<11);
    }
    if (DAZ) {
        mxcsr |= (1<<6);
    }
    _mm_setcsr(mxcsr);
}
also use gcc switches: -msse -mfpmath=sse
(Credits to Carl Hetherington [1])
[1] http://carlh.net/plugins/denormals.php
#3rd Floor
It's due to denormalized floating-point use. How to get rid of both it and the performance penalty? Having scoured the Internet for ways of killing denormal numbers, it seems there is no "best" way to do this yet. I have found these three methods that may work best in different environments:
Might not work in some GCC environments:
// Requires #include <fenv.h>
fesetenv(FE_DFL_DISABLE_SSE_DENORMS_ENV);
Might not work in some Visual Studio environments:
// Requires #include <xmmintrin.h>
_mm_setcsr( _mm_getcsr() | (1<<15) | (1<<6) ); // Does both FTZ and DAZ bits. You can also use just hex value 0x8040 to do both.
// You might also want to use the underflow mask (1<<11)
Appears to work in both GCC and Visual Studio:
// Requires #include <xmmintrin.h>
// Requires #include <pmmintrin.h>
_MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
_MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
The Intel compiler has options to disable denormals by default on modern Intel CPUs. More details here.
Compiler switches. -ffast-math, -msse or -mfpmath=sse will disable denormals and make a few other things faster, but unfortunately also do lots of other approximations that might break your code. Test carefully! The equivalent of fast-math for the Visual Studio compiler is /fp:fast, but I haven't been able to confirm whether this also disables denormals.
#4th Floor
Dan Neely's comment ought to be expanded into an answer: it is not the zero constant 0.0f that is denormalized or causes a slowdown, it is the values that approach zero each iteration of the loop. As they come closer and closer to zero, they need more precision to represent and they become denormalized. These are the y[i] values. (They approach zero because x[i]/z[i] is less than 1.0 for all i.)
The crucial difference between the slow and fast versions of the code is the statement y[i] = y[i] + 0.1f;. As soon as this line is executed each iteration of the loop, the extra precision in the float is lost, and the denormalization needed to represent that precision is no longer needed. Afterwards, floating point operations on y[i] remain fast because they aren't denormalized.
Why is the extra precision lost when you add 0.1f? Because floating point numbers only have so many significant digits. Say you have enough storage for three significant digits, then 0.00001 = 1e-5, and 0.00001 + 0.1 = 0.1, at least for this example float format, because it doesn't have room to store the least significant bit in 0.10001.
In short, y[i] = y[i] + 0.1f; y[i] = y[i] - 0.1f; isn't the no-op you might think it is.
Mystical said this as well: the content of the floats matters, not just the assembly code.
#5th Floor
Using gcc and applying a diff to the generated assembly yields only this difference:
73c68,69
< movss LCPI1_0(%rip), %xmm1
---
> movabsq $0, %rcx
> cvtsi2ssq %rcx, %xmm1
81d76
< subss %xmm1, %xmm0
The cvtsi2ssq one being 10 times slower indeed.
Apparently, the float version uses an XMM register loaded from memory, while the int version converts a real int value 0 to float using the cvtsi2ssq instruction, taking a lot of time. Passing -O3 to gcc doesn't help. (gcc version 4.2.1.)
(Using double instead of float doesn't matter, except that it changes the cvtsi2ssq into a cvtsi2sdq.)
Update
Some extra tests show that it is not necessarily the cvtsi2ssq instruction. Once it is eliminated (using an int ai = 0; float a = ai; and using a instead of 0), the speed difference remains. So @Mysticial is right, the denormalized floats make the difference. This can be seen by testing values between 0 and 0.1f. The turning point in the above code is at approximately 0.00000000000000000000000000000001, when the loops suddenly take 10 times as long.
Update << 1
A small visualisation of this interesting phenomenon:
- Column 1: a float, divided by 2 for every iteration
- Column 2: the binary representation of this float
- Column 3: the time taken to sum this float 1e7 times
You can clearly see the exponent (the last 9 bits) change to its lowest value, when denormalization sets in. At that point, simple addition becomes 20 times slower.
0.000000000000000000000000000000000100000004670110: 10111100001101110010000011100000 45 ms
0.000000000000000000000000000000000050000002335055: 10111100001101110010000101100000 43 ms
0.000000000000000000000000000000000025000001167528: 10111100001101110010000001100000 43 ms
0.000000000000000000000000000000000012500000583764: 10111100001101110010000110100000 42 ms
0.000000000000000000000000000000000006250000291882: 10111100001101110010000010100000 48 ms
0.000000000000000000000000000000000003125000145941: 10111100001101110010000100100000 43 ms
0.000000000000000000000000000000000001562500072970: 10111100001101110010000000100000 42 ms
0.000000000000000000000000000000000000781250036485: 10111100001101110010000111000000 42 ms
0.000000000000000000000000000000000000390625018243: 10111100001101110010000011000000 42 ms
0.000000000000000000000000000000000000195312509121: 10111100001101110010000101000000 43 ms
0.000000000000000000000000000000000000097656254561: 10111100001101110010000001000000 42 ms
0.000000000000000000000000000000000000048828127280: 10111100001101110010000110000000 44 ms
0.000000000000000000000000000000000000024414063640: 10111100001101110010000010000000 42 ms
0.000000000000000000000000000000000000012207031820: 10111100001101110010000100000000 42 ms
0.000000000000000000000000000000000000006103515209: 01111000011011100100001000000000 789 ms
0.000000000000000000000000000000000000003051757605: 11110000110111001000010000000000 788 ms
0.000000000000000000000000000000000000001525879503: 00010001101110010000100000000000 788 ms
0.000000000000000000000000000000000000000762939751: 00100011011100100001000000000000 795 ms
0.000000000000000000000000000000000000000381469876: 01000110111001000010000000000000 896 ms
0.000000000000000000000000000000000000000190734938: 10001101110010000100000000000000 813 ms
0.000000000000000000000000000000000000000095366768: 00011011100100001000000000000000 798 ms
0.000000000000000000000000000000000000000047683384: 00110111001000010000000000000000 791 ms
0.000000000000000000000000000000000000000023841692: 01101110010000100000000000000000 802 ms
0.000000000000000000000000000000000000000011920846: 11011100100001000000000000000000 809 ms
0.000000000000000000000000000000000000000005961124: 01111001000010000000000000000000 795 ms
0.000000000000000000000000000000000000000002980562: 11110010000100000000000000000000 835 ms
0.000000000000000000000000000000000000000001490982: 00010100001000000000000000000000 864 ms
0.000000000000000000000000000000000000000000745491: 00101000010000000000000000000000 915 ms
0.000000000000000000000000000000000000000000372745: 01010000100000000000000000000000 918 ms
0.000000000000000000000000000000000000000000186373: 10100001000000000000000000000000 881 ms
0.000000000000000000000000000000000000000000092486: 01000010000000000000000000000000 857 ms
0.000000000000000000000000000000000000000000046243: 10000100000000000000000000000000 861 ms
0.000000000000000000000000000000000000000000022421: 00001000000000000000000000000000 855 ms
0.000000000000000000000000000000000000000000011210: 00010000000000000000000000000000 887 ms
0.000000000000000000000000000000000000000000005605: 00100000000000000000000000000000 799 ms
0.000000000000000000000000000000000000000000002803: 01000000000000000000000000000000 828 ms
0.000000000000000000000000000000000000000000001401: 10000000000000000000000000000000 815 ms
0.000000000000000000000000000000000000000000000000: 00000000000000000000000000000000 42 ms
0.000000000000000000000000000000000000000000000000: 00000000000000000000000000000000 42 ms
0.000000000000000000000000000000000000000000000000: 00000000000000000000000000000000 44 ms
An equivalent discussion about ARM can be found in the Stack Overflow question Denormalized floating point in Objective-C?.
#6th Floor
Welcome to the world of denormalized floating-point! They can wreak havoc on performance!!!
Denormal (or subnormal) numbers are kind of a hack to get some extra values very close to zero out of the floating point representation. Operations on denormalized floating-point can be tens to hundreds of times slower than on normalized floating-point. This is because many processors can't handle them directly and must trap and resolve them using microcode.
If you print out the numbers after 10,000 iterations, you will see that they have converged to different values depending on whether 0 or 0.1 is used.
Here's the test code compiled on x64:
#include <iostream>
#include <cstdlib>
#include <omp.h>

using namespace std;

int main() {
    double start = omp_get_wtime();

    const float x[16] = {1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8,
                         1.9, 2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6};
    const float z[16] = {1.123, 1.234, 1.345, 156.467, 1.578, 1.689, 1.790, 1.812,
                         1.923, 2.034, 2.145, 2.256, 2.367, 2.478, 2.589, 2.690};
    float y[16];
    for (int i = 0; i < 16; i++)
    {
        y[i] = x[i];
    }
    for (int j = 0; j < 9000000; j++)
    {
        for (int i = 0; i < 16; i++)
        {
            y[i] *= x[i];
            y[i] /= z[i];
#ifdef FLOATING
            y[i] = y[i] + 0.1f;
            y[i] = y[i] - 0.1f;
#else
            y[i] = y[i] + 0;
            y[i] = y[i] - 0;
#endif
            if (j > 10000)
                cout << y[i] << " ";
        }
        if (j > 10000)
            cout << endl;
    }

    double end = omp_get_wtime();
    cout << end - start << endl;

    system("pause");
    return 0;
}
Output:
#define FLOATING
1.78814e-007 1.3411e-007 1.04308e-007 0 7.45058e-008 6.70552e-008 6.70552e-008 5.58794e-007 3.05474e-007 2.16067e-007 1.71363e-007 1.49012e-007 1.2666e-007 1.11759e-007 1.04308e-007 1.04308e-007
1.78814e-007 1.3411e-007 1.04308e-007 0 7.45058e-008 6.70552e-008 6.70552e-008 5.58794e-007 3.05474e-007 2.16067e-007 1.71363e-007 1.49012e-007 1.2666e-007 1.11759e-007 1.04308e-007 1.04308e-007
//#define FLOATING
6.30584e-044 3.92364e-044 3.08286e-044 0 1.82169e-044 1.54143e-044 2.10195e-044 2.46842e-029 7.56701e-044 4.06377e-044 3.92364e-044 3.22299e-044 3.08286e-044 2.66247e-044 2.66247e-044 2.24208e-044
6.30584e-044 3.92364e-044 3.08286e-044 0 1.82169e-044 1.54143e-044 2.10195e-044 2.45208e-029 7.56701e-044 4.06377e-044 3.92364e-044 3.22299e-044 3.08286e-044 2.66247e-044 2.66247e-044 2.24208e-044
Note how in the second run the numbers are very close to zero.
Denormalized numbers are generally rare and thus most processors don't try to handle them efficiently.
To demonstrate that this has everything to do with denormalized numbers, we can flush denormals to zero by adding this to the start of the code:
_MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
Then the version with 0 is no longer 10x slower and actually becomes faster. (This requires that the code be compiled with SSE enabled.)
This means that rather than using these weird lower-precision almost-zero values, we just round to zero instead.
Timings: Core i7 920 @ 3.5 GHz:
// Don't flush denormals to zero.
0.1f: 0.564067
0 : 26.7669
// Flush denormals to zero.
0.1f: 0.587117
0 : 0.341406
In the end, this really has nothing to do with whether it's an integer or floating-point. The 0 or 0.1f is converted/stored into a register outside of both loops. So that has no effect on performance.