Translated from: Why does changing 0.1f to 0 slow down performance by 10x?
Why does this bit of code,
const float x[16] = { 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8,
                      1.9, 2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6};
const float z[16] = {1.123, 1.234, 1.345, 156.467, 1.578, 1.689, 1.790, 1.812,
                     1.923, 2.034, 2.145, 2.256, 2.367, 2.478, 2.589, 2.690};
float y[16];
for (int i = 0; i < 16; i++)
{
    y[i] = x[i];
}

for (int j = 0; j < 9000000; j++)
{
    for (int i = 0; i < 16; i++)
    {
        y[i] *= x[i];
        y[i] /= z[i];
        y[i] = y[i] + 0.1f; // <--
        y[i] = y[i] - 0.1f; // <--
    }
}
run more than 10 times faster than the following bit (identical except where noted)?
const float x[16] = { 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8,
                      1.9, 2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6};
const float z[16] = {1.123, 1.234, 1.345, 156.467, 1.578, 1.689, 1.790, 1.812,
                     1.923, 2.034, 2.145, 2.256, 2.367, 2.478, 2.589, 2.690};
float y[16];
for (int i = 0; i < 16; i++)
{
    y[i] = x[i];
}

for (int j = 0; j < 9000000; j++)
{
    for (int i = 0; i < 16; i++)
    {
        y[i] *= x[i];
        y[i] /= z[i];
        y[i] = y[i] + 0; // <--
        y[i] = y[i] - 0; // <--
    }
}
when compiling with Visual Studio 2010 SP1? The optimization level was -O2 with SSE2 enabled. I have not tested with other compilers.
#1st Floor
Reference: Why does changing 0.1f to 0 slow down performance by 10x? https://stackoom.com/question/d58Q/
#2nd Floor
In gcc you can enable FTZ and DAZ with this:
#include <xmmintrin.h>

#define FTZ 1
#define DAZ 1

void enableFtzDaz()
{
    int mxcsr = _mm_getcsr();
    if (FTZ) {
        mxcsr |= (1<<15) | (1<<11);
    }
    if (DAZ) {
        mxcsr |= (1<<6);
    }
    _mm_setcsr(mxcsr);
}
also use gcc switches: -msse -mfpmath=sse
(Credits to Carl Hetherington [1])
[1] http://carlh.net/plugins/denormals.php
#3rd Floor
It's due to denormalized floating-point use. How to get rid of both it and the performance penalty? Having scoured the Internet for ways of killing denormal numbers, it seems there is no "best" way to do this yet. I have found these three methods that may work best in different environments:
Might not work in some GCC environments:
// Requires #include <fenv.h>
fesetenv(FE_DFL_DISABLE_SSE_DENORMS_ENV);
Might not work in some Visual Studio environments:
// Requires #include <xmmintrin.h>
_mm_setcsr( _mm_getcsr() | (1<<15) | (1<<6) ); // Does both FTZ and DAZ bits. You can also use just hex value 0x8040 to do both.
// You might also want to use the underflow mask (1<<11)
Appears to work in both GCC and Visual Studio:
// Requires #include <xmmintrin.h>
// Requires #include <pmmintrin.h>
_MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
_MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
The Intel compiler has options to disable denormals by default on modern Intel CPUs. More details here.
Compiler switches. -ffast-math, -msse or -mfpmath=sse will disable denormals and make a few other things faster, but unfortunately also do lots of other approximations that might break your code. Test carefully! The equivalent of fast-math for the Visual Studio compiler is /fp:fast, but I haven't been able to confirm whether this also disables denormals.
#4th Floor
Dan Neely's comment ought to be expanded into an answer: it is not the zero constant 0.0f that is denormalized or causes a slowdown, it is the values that approach zero each iteration of the loop. As they come closer and closer to zero, they need more precision to represent and they become denormalized. These are the y[i] values. (They approach zero because x[i]/z[i] is less than 1.0 for all i.)
The crucial difference between the slow and fast versions of the code is the statement y[i] = y[i] + 0.1f;. As soon as this line is executed each iteration of the loop, the extra precision in the float is lost, and the denormalization needed to represent that precision is no longer needed. Afterwards, floating point operations on y[i] remain fast because they aren't denormalized.
Why is the extra precision lost when you add 0.1f? Because floating point numbers only have so many significant digits. Say you have enough storage for three significant digits, then 0.00001 = 1e-5, and 0.00001 + 0.1 = 0.1, at least for this example float format, because it doesn't have room to store the least significant bit in 0.10001.
In short, y[i] = y[i] + 0.1f; y[i] = y[i] - 0.1f; isn't the no-op you might think it is.
Mystical said this as well: the content of the floats matters, not just the assembly code.
#5th Floor
Using gcc and applying a diff to the generated assembly yields only this difference:
73c68,69
< movss LCPI1_0(%rip), %xmm1
---
> movabsq $0, %rcx
> cvtsi2ssq %rcx, %xmm1
81d76
< subss %xmm1, %xmm0
The cvtsi2ssq one being 10 times slower indeed.
Apparently, the float version uses an XMM register loaded from memory, while the int version converts a real int value 0 to float using the cvtsi2ssq instruction, taking a lot of time. Passing -O3 to gcc doesn't help. (gcc version 4.2.1.)
(Using double instead of float doesn't matter, except that it changes the cvtsi2ssq into a cvtsi2sdq.)
Update
Some extra tests show that it is not necessarily the cvtsi2ssq instruction. Once it is eliminated (using an int ai = 0; float a = ai; and using a instead of 0), the speed difference remains. So @Mysticial is right, the denormalized floats make the difference. This can be seen by testing values between 0 and 0.1f. The turning point in the above code is at approximately 0.00000000000000000000000000000001, when the loops suddenly take 10 times as long.
Update << 1
A small visualisation of this interesting phenomenon:
- Column 1: a float, divided by 2 for every iteration
- Column 2: the binary representation of this float
- Column 3: the time taken to sum this float 1e7 times
You can clearly see the exponent (the last 9 bits) change to its lowest value, when denormalization sets in. At that point, simple addition becomes 20 times slower.
0.000000000000000000000000000000000100000004670110: 10111100001101110010000011100000 45 ms
0.000000000000000000000000000000000050000002335055: 10111100001101110010000101100000 43 ms
0.000000000000000000000000000000000025000001167528: 10111100001101110010000001100000 43 ms
0.000000000000000000000000000000000012500000583764: 10111100001101110010000110100000 42 ms
0.000000000000000000000000000000000006250000291882: 10111100001101110010000010100000 48 ms
0.000000000000000000000000000000000003125000145941: 10111100001101110010000100100000 43 ms
0.000000000000000000000000000000000001562500072970: 10111100001101110010000000100000 42 ms
0.000000000000000000000000000000000000781250036485: 10111100001101110010000111000000 42 ms
0.000000000000000000000000000000000000390625018243: 10111100001101110010000011000000 42 ms
0.000000000000000000000000000000000000195312509121: 10111100001101110010000101000000 43 ms
0.000000000000000000000000000000000000097656254561: 10111100001101110010000001000000 42 ms
0.000000000000000000000000000000000000048828127280: 10111100001101110010000110000000 44 ms
0.000000000000000000000000000000000000024414063640: 10111100001101110010000010000000 42 ms
0.000000000000000000000000000000000000012207031820: 10111100001101110010000100000000 42 ms
0.000000000000000000000000000000000000006103515209: 01111000011011100100001000000000 789 ms
0.000000000000000000000000000000000000003051757605: 11110000110111001000010000000000 788 ms
0.000000000000000000000000000000000000001525879503: 00010001101110010000100000000000 788 ms
0.000000000000000000000000000000000000000762939751: 00100011011100100001000000000000 795 ms
0.000000000000000000000000000000000000000381469876: 01000110111001000010000000000000 896 ms
0.000000000000000000000000000000000000000190734938: 10001101110010000100000000000000 813 ms
0.000000000000000000000000000000000000000095366768: 00011011100100001000000000000000 798 ms
0.000000000000000000000000000000000000000047683384: 00110111001000010000000000000000 791 ms
0.000000000000000000000000000000000000000023841692: 01101110010000100000000000000000 802 ms
0.000000000000000000000000000000000000000011920846: 11011100100001000000000000000000 809 ms
0.000000000000000000000000000000000000000005961124: 01111001000010000000000000000000 795 ms
0.000000000000000000000000000000000000000002980562: 11110010000100000000000000000000 835 ms
0.000000000000000000000000000000000000000001490982: 00010100001000000000000000000000 864 ms
0.000000000000000000000000000000000000000000745491: 00101000010000000000000000000000 915 ms
0.000000000000000000000000000000000000000000372745: 01010000100000000000000000000000 918 ms
0.000000000000000000000000000000000000000000186373: 10100001000000000000000000000000 881 ms
0.000000000000000000000000000000000000000000092486: 01000010000000000000000000000000 857 ms
0.000000000000000000000000000000000000000000046243: 10000100000000000000000000000000 861 ms
0.000000000000000000000000000000000000000000022421: 00001000000000000000000000000000 855 ms
0.000000000000000000000000000000000000000000011210: 00010000000000000000000000000000 887 ms
0.000000000000000000000000000000000000000000005605: 00100000000000000000000000000000 799 ms
0.000000000000000000000000000000000000000000002803: 01000000000000000000000000000000 828 ms
0.000000000000000000000000000000000000000000001401: 10000000000000000000000000000000 815 ms
0.000000000000000000000000000000000000000000000000: 00000000000000000000000000000000 42 ms
0.000000000000000000000000000000000000000000000000: 00000000000000000000000000000000 42 ms
0.000000000000000000000000000000000000000000000000: 00000000000000000000000000000000 44 ms
An equivalent discussion about ARM can be found in the Stack Overflow question Denormalized floating point in Objective-C?.
#6th Floor
Welcome to the world of denormalized floating-point! They can wreak havoc on performance!!!
Denormal (or subnormal) numbers are kind of a hack to get some extra values very close to zero out of the floating point representation. Operations on denormalized floating-point can be tens to hundreds of times slower than on normalized floating-point. This is because many processors can't handle them directly and must trap and resolve them using microcode.
If you print out the numbers after 10,000 iterations, you will see that they have converged to different values depending on whether 0 or 0.1 is used.
Here's the test code compiled on x64:
#include <iostream>
#include <cstdlib>
#include <omp.h>

using namespace std;

int main() {
    double start = omp_get_wtime();

    const float x[16] = {1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8,
                         1.9, 2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6};
    const float z[16] = {1.123, 1.234, 1.345, 156.467, 1.578, 1.689, 1.790, 1.812,
                         1.923, 2.034, 2.145, 2.256, 2.367, 2.478, 2.589, 2.690};
    float y[16];
    for (int i = 0; i < 16; i++)
    {
        y[i] = x[i];
    }
    for (int j = 0; j < 9000000; j++)
    {
        for (int i = 0; i < 16; i++)
        {
            y[i] *= x[i];
            y[i] /= z[i];
#ifdef FLOATING
            y[i] = y[i] + 0.1f;
            y[i] = y[i] - 0.1f;
#else
            y[i] = y[i] + 0;
            y[i] = y[i] - 0;
#endif
            if (j > 10000)
                cout << y[i] << " ";
        }
        if (j > 10000)
            cout << endl;
    }

    double end = omp_get_wtime();
    cout << end - start << endl;

    system("pause");
    return 0;
}
Output:
#define FLOATING
1.78814e-007 1.3411e-007 1.04308e-007 0 7.45058e-008 6.70552e-008 6.70552e-008 5.58794e-007 3.05474e-007 2.16067e-007 1.71363e-007 1.49012e-007 1.2666e-007 1.11759e-007 1.04308e-007 1.04308e-007
1.78814e-007 1.3411e-007 1.04308e-007 0 7.45058e-008 6.70552e-008 6.70552e-008 5.58794e-007 3.05474e-007 2.16067e-007 1.71363e-007 1.49012e-007 1.2666e-007 1.11759e-007 1.04308e-007 1.04308e-007
//#define FLOATING
6.30584e-044 3.92364e-044 3.08286e-044 0 1.82169e-044 1.54143e-044 2.10195e-044 2.46842e-029 7.56701e-044 4.06377e-044 3.92364e-044 3.22299e-044 3.08286e-044 2.66247e-044 2.66247e-044 2.24208e-044
6.30584e-044 3.92364e-044 3.08286e-044 0 1.82169e-044 1.54143e-044 2.10195e-044 2.45208e-029 7.56701e-044 4.06377e-044 3.92364e-044 3.22299e-044 3.08286e-044 2.66247e-044 2.66247e-044 2.24208e-044
Note how in the second run the numbers are very close to zero.
Denormalized numbers are generally rare and thus most processors don't try to handle them efficiently.
To demonstrate that this has everything to do with denormalized numbers, we can flush denormals to zero by adding this to the start of the code:
_MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
Then the version with 0 is no longer 10x slower and actually becomes faster. (This requires that the code be compiled with SSE enabled.)
This means that rather than using these weird lower-precision almost-zero values, we just round to zero instead.
Timings: Core i7 920 @ 3.5 GHz:
// Don't flush denormals to zero.
0.1f: 0.564067
0 : 26.7669
// Flush denormals to zero.
0.1f: 0.587117
0 : 0.341406
In the end, this really has nothing to do with whether it's an integer or floating-point. The 0 or 0.1f is converted/stored into a register outside of both loops. So that has no effect on performance.