Damn IEEE-754 floating-point numbers: you round whenever you please, so where is your bottom line? Checking you out in the name of JS

IEEE 754 says: curse all you like, but if you can completely avoid me, I'll admit defeat.

1. The tricks IEEE-754 floating-point numbers play

First, a few quick questions. If you can explain the details of every one, feel free to skip ahead; if all you can say is "because of the IEEE-754 floating-point precision problem", then what follows is worth your time.

The first question is the famous 0.1 + 0.2 != 0.3. Why? A rookie will tell you "because of the IEEE 754 floating-point representation standard". An old hand will add "0.1 and 0.2 cannot be represented exactly as binary floating-point numbers, so the addition loses precision". A true expert can walk you through the entire process and point out exactly which steps of a decimal addition can lose precision. Can you answer at that level of detail?

The second question: since decimal 0.1 cannot be stored exactly as a binary floating-point number, why does console.log(0.1) print exactly 0.1?

The third question: can you explain these comparison results?

// Why is one of these equal and the other not?
0.100000000000000002 == 0.100000000000000010 // true
0.100000000000000002 == 0.100000000000000020 // false

// The values below clearly don't exceed Number.MAX_SAFE_INTEGER, so why this?
Math.pow(10, 10) + Math.pow(10, -7) === Math.pow(10, 10) //  true
Math.pow(10, 10) + Math.pow(10, -6) === Math.pow(10, 10) //  false

A follow-up question: take a number, add an increment to it, then compare the sum with the original number. For the comparison to stay true, that is, equal, roughly how large can the order of magnitude of the increment be? Can you estimate it?

The fourth question: have you seen the widely circulated code below (it is meant to make decimal addition in the everyday range behave sensibly, e.g. to compute 0.1 + 0.2 as exactly 0.3)? Do you understand the idea behind it? More to the point, do you know what's wrong with it? Try 268.34 + 0.83, for example, and the calculation goes wrong.

// Note: the function takes its two numbers as strings
function numAdd(num1/*:String*/, num2/*:String*/) { 
    var baseNum, baseNum1, baseNum2; 
    try { 
        baseNum1 = num1.split(".")[1].length; 
    } catch (e) { 
        baseNum1 = 0; 
    } 
    try { 
        baseNum2 = num2.split(".")[1].length; 
    } catch (e) { 
        baseNum2 = 0;
    } 
    baseNum = Math.pow(10, Math.max(baseNum1, baseNum2)); 
    return (num1 * baseNum + num2 * baseNum) / baseNum; 
};

// Looks like it solves 0.1 + 0.2
numAdd("0.1", "0.2"); // returns exactly 0.3

// But try this one
numAdd("268.34", "0.83"); // returns 269.16999999999996

So many problems; it really is that damn IEEE-754. All of it stems from the IEEE-754 floating-point format itself and from its rule of rounding (saying "about" whenever a value doesn't fit), which loses precision both in storage and in calculation. As front-end developers, let's examine it from the perspective of JS.

2. A closer look at the anatomy of IEEE-754 double precision

As the saying goes, "know yourself and know your enemy, and you will never be imperiled in a hundred battles." To take the enemy apart from the inside, you must first understand it. Why only double precision? Because once you understand double precision, single precision follows, and in JavaScript all Numbers are stored as 64-bit double-precision floating-point numbers. So let's review how they are stored, and how that storage maps to concrete values.

The IEEE-754 floating-point format

Binary storage uses binary "scientific notation". Recall decimal scientific notation with, say, 54846.3: in standard scientific notation it is written 5.48463e4, which has three parts. First the sign (positive here; the plus sign is customarily omitted), then the significand, 5.48463, and finally the exponent, 4. That is scientific notation in decimal, and binary works exactly the same way, except the base is 2 instead of 10.

A double-precision floating-point number divides its 64 bits into 3 fields, and these 3 fields determine the value. The split follows the "1-11-52" pattern (a console sketch for inspecting the fields follows the list):

  • The highest bit (the leftmost) is the sign bit: 0 means positive, 1 means negative
  • The next 11 bits are the exponent field
  • The last 52 bits are the mantissa field, i.e. the significand
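
Before dissecting each field, here is a minimal console sketch for inspecting them yourself. The helper name dumpFloat64 is made up for illustration; it uses only the standard ArrayBuffer/DataView APIs and assumes an engine with String.prototype.padStart:

function dumpFloat64(x) {
    // Write the number into an 8-byte buffer big-endian, then read back the raw bits
    var view = new DataView(new ArrayBuffer(8));
    view.setFloat64(0, x);
    var bits = "";
    for (var i = 0; i < 8; i++) {
        bits += view.getUint8(i).toString(2).padStart(8, "0");
    }
    return {
        sign: bits.slice(0, 1),       // 1 bit
        exponent: bits.slice(1, 12),  // 11 bits
        mantissa: bits.slice(12)      // 52 bits
    };
}

dumpFloat64(0.1);
// => { sign: "0", exponent: "01111111011",
//      mantissa: "1001100110011001100110011001100110011001100110011010" }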

There are quite a few wrinkles here. First, middle school teaches that "every real number has an opposite", and flipping the sign bit produces that opposite. But 0's opposite is itself, while the sign bit treats every value determined by the exponent and mantissa fields alike: each one is either positive or negative. Hence the notions of positive zero and negative zero. +0 and -0 compare as equal, yet they differ in the sign bit, and the interesting quirks around them won't be covered here.

Next, the exponent need not be positive; it can be negative. One option is to give the exponent field its own sign bit. IEEE 754 instead adopts a bias: the stored exponent field always behaves as a non-negative number, and subtracting a fixed offset yields the real exponent. One advantage is that certain extreme values can be represented, as we will see later. For 64-bit floats the bias is 1023. Since the exponent field is an 11-bit non-negative number, 0 <= e <= 2^11 - 1, and the actual exponent E = e - 1023 satisfies -1023 <= E <= 1024. The two extreme values at either end, combined with different mantissa fields, carry special meanings.

Finally, the mantissa field, i.e. the significand. Why "significant"? Picture 52 slots, but a number made of 60 binary 1s. There is no way to fit them all; only 52 can go in. What about the remaining 8? They are rounded or discarded; in short, lost. The mantissa field therefore determines the precision of the number.

In binary scientific notation, if we require a non-zero digit before the binary point, the significand necessarily has the form 1.XXXX, and such a number is called normalized. When a normalized number is stored, the 1 before the point is implicit: it exists by default but occupies no slot, and the mantissa field stores only the part after the point.

So what happens when the binary fraction is extremely small? For a fraction close to 0, insisting on the 1.xxx form would drive the exponent toward negative infinity, but the real exponent has a floor. Once the exponent field is pinned at its minimum and the number still cannot take the 1.xxx form, the significand is instead represented as 0.xxx; such binary floating-point numbers are called denormalized.

Putting it together, the value represented by our 64-bit float is determined by the sign bit s, the exponent field e, and the mantissa field f as follows; from this mapping you can read off positive and negative zero, normalized and denormalized binary floats, and positive and negative infinity:

Mapping from floating-point form to numeric value:

  • e = 0, f = 0: ±0 (the sign bit distinguishes +0 from -0)
  • e = 0, f ≠ 0: (-1)<sup>s</sup> × (0.f) × 2<sup>-1022</sup> (denormalized numbers)
  • 0 < e < 2047: (-1)<sup>s</sup> × (1.f) × 2<sup>e-1023</sup> (normalized numbers)
  • e = 2047, f = 0: ±Infinity
  • e = 2047, f ≠ 0: NaN

The (0.f) and (1.f) here are binary significands; convert them to decimal and then evaluate, and you get the final value.

Having reviewed IEEE-754's 64-bit floating-point numbers, keep 3 points in mind:

  1. The exponent and mantissa fields are limited: 11 bits and 52 bits respectively
  2. The sign bit determines the sign, the exponent field determines the magnitude, and the mantissa field determines the precision
  3. All numeric calculation and comparison is carried out on these 64 bits; set aside the decimal notation you take for granted

3. Where exactly is precision lost?

When you compute 0.1 + 0.2 directly, understand that "your aunt is no longer your aunt, and your uncle is no longer your uncle, so it's no surprise their child (the result) has problems." The 0.1 and 0.2 here are decimal; converted to binary, each becomes an infinitely repeating binary fraction.

This is the first place precision can be lost: the conversion from decimal to binary. Most decimal fractions cannot be represented exactly as binary fractions within a 52-bit mantissa. The 0.1 and 0.2 that look trivially simple to us repeat infinitely once converted to binary; others may not repeat forever, yet their binary fractional part still runs past 52 bits and cannot fit.

And since there are only 52 significant bits, anything beyond them suffers a supernatural event: amputation, politely called "rounding". IEEE 754 specifies several rounding modes; the default is round-to-nearest, and when the two candidates are equally near, the tie goes to the one whose last bit is even (round half to even).

So in 0.1 + 0.2 above, what gets stored for 0.1 and 0.2 is not exactly 0.1 and 0.2 but values that have already lost some precision. And the losing isn't over: when the two values are added, precision may be lost again. Note, though, that stacking several losses does not necessarily push the result further and further off; the errors can also partly cancel.

The second place precision can be lost is in floating-point arithmetic itself. Floating-point operations involve a step called exponent alignment. Take addition: the operand with the smaller exponent must be rewritten with the larger exponent, which shifts its binary point to the left. Once the point shifts left, bits are inevitably squeezed off the right end of the 52-bit significand, and the squeezed-off part is "rounded" once more. That is another loss of precision.

So in this 0.1 + 0.2 example, precision was lost while converting the two numbers to binary and lost again during the addition, so it is no surprise the result is off and exactness cannot be achieved. If you want to trace through the whole computation step by step, the link in the appendix at the end of the article can help you.
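
You can watch both losses pile up right in the console; .toPrecision() merely reveals more digits of what is actually stored (the values below are what any IEEE-754-conformant engine produces):

(0.1).toPrecision(21);        // "0.100000000000000005551", not exactly 0.1
(0.2).toPrecision(21);        // "0.200000000000000011102", not exactly 0.2
(0.1 + 0.2).toPrecision(21);  // "0.300000000000000044409", the addition drifted further
(0.3).toPrecision(21);        // "0.299999999999999988898", and 0.3 itself is a different float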

4. A puzzle: 0.1 cannot be represented exactly, yet printing 0.1 shows 0.1

Right: logically 0.1 cannot be represented exactly, and what is stored is an approximation of it. Yet when I print 0.1, say console.log(0.1), it prints exactly 0.1.

What actually happens when you print is this: the binary is converted to decimal, the decimal to a string, and the string is output. The decimal-to-binary conversion approximated the value, and the binary-to-decimal conversion approximates again: JS engines print the shortest decimal string that converts back to the same stored float. The printed value is thus itself an approximation, not a faithful dump of what the float actually stores.

On this question, there is a StackOverflow answer worth consulting; it also points to a paper on the subject. If you're interested:

How does javascript print 0.1 with such accuracy?
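
A quick console check of this shortest-round-trip behavior (assuming a spec-conformant engine):

// Two different-looking decimal strings parse to the very same stored float...
Number("0.1") === Number("0.10000000000000000555"); // true
// ...and printing chooses the shortest decimal string that maps back to that float
String(0.1); // "0.1"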

5. Equal or not? Just look at the 64 bits

To emphasize it once more: all numeric calculation and comparison is performed on these 64 bits. Whenever something won't fit into them, approximation occurs, and once approximation occurs, surprises follow.

There are online decimal-to-IEEE-754 converters that are handy for checking results. You can use this IEEE-754 Floating-Point Conversion tool to verify how your decimals convert to IEEE-754 floating-point numbers.

Let's revisit the two simple comparison puzzles raised in part 1:

// Why are some of these equal and others not?
0.100000000000000002 == 0.1 // true

0.100000000000000002 == 0.100000000000000010 // true

0.100000000000000002 == 0.100000000000000020 // false

Convert 0.1, 0.100000000000000002, 0.100000000000000010, and 0.100000000000000020 to floating point with the tool above and look at their mantissa fields (watch the lowest 4 bits; the remaining bits are identical across all four). The first three come out the same, with lowest 4 bits 1010, while the last one converts to a mantissa whose lowest 4 bits are 1011.

This is because, when these numbers are converted to binary, the digits beyond the 52nd bit differ, so the rounding can go different ways, yielding different mantissas. Comparing two numbers is essentially comparing their 64 bits: any difference means unequal, with one exception, +0 == -0.
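
If you kept the dumpFloat64 sketch from section 2, you can confirm this without leaving the console:

dumpFloat64(0.100000000000000002).mantissa.slice(-4); // "1010", same float as 0.1
dumpFloat64(0.100000000000000020).mantissa.slice(-4); // "1011", rounded up to the next float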

Let's look at the second equality problem mentioned:

Math.pow(10, 10) + Math.pow(10, -7) === Math.pow(10, 10) //  true
Math.pow(10, 10) + Math.pow(10, -6) === Math.pow(10, 10) //  false

Why is the first comparison equal while the second is not? First, let's convert the operands:

Math.pow(10, 10) =>
exponent field e = 1056, i.e. E = 33
mantissa field (1.)0010101000000101111100100000000000000000000000000000

Math.pow(10, -7) =>
exponent field e = 999, i.e. E = -24

Math.pow(10, -6) =>
exponent field e = 1003, i.e. E = -20
mantissa field (1.)0000110001101111011110100000101101011110110110001101

As you can see, the exponent of 1e10 is 33 and the exponent of Math.pow(10, -7) is -24; the difference is 57, far more than 52. So when the addition happens and the exponents are aligned, Math.pow(10, -7) is approximated all the way down to 0.

The exponent of Math.pow(10, -6) is -20, a difference of 53, which also looks like more than 52, but don't forget the implicit leading 1. When alignment shifts this significand's binary point 53 places to the left, the 52nd bit of the digit string (leading 1 included) is squeezed out just past the edge, and "rounding" occurs: the round-up carries into the lowest surviving bit, so bit 0 becomes 1. Adding this to Math.pow(10, 10) yields a result whose lowest mantissa bit is 1, which is naturally not equal to Math.pow(10, 10).

You can use this IEEE754 calculator to verify the results.
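
You can also verify the cutoff numerically: the spacing (ulp) between adjacent floats near 1e10 is 2<sup>33-52</sup> = 2<sup>-19</sup> ≈ 1.9 × 10<sup>-6</sup>, and an increment below half an ulp is rounded away during alignment. A console sketch:

var ulp = Math.pow(2, 33 - 52); // ≈ 0.0000019073486328125
Math.pow(10, 10) + 0.49 * ulp === Math.pow(10, 10); // true:  under half an ulp, rounded away
Math.pow(10, 10) + 0.51 * ulp === Math.pow(10, 10); // false: over half an ulp, rounds up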

6. How a value's order of magnitude relates to its precision

Following on from that result: when a value is of order 10 to the 10th, adding a number of order -7 has no effect on it, while adding a number of order -6 does. We now also know the essence of this:

The exponents must be aligned during the calculation. If, during alignment, a small increment is shifted right past the 52 available bits (because its binary point is shifted left), the increment is simply ignored; its aligned mantissa is approximated to 0.

In other words, for values of order 10<sup>10</sup>, the precision is roughly of order 10<sup>-6</sup>. Then what is the precision for values of order 10<sup>9</sup>, 10<sup>8</sup>, ..., down to 10<sup>0</sup>?

There is a graph that illustrates this correspondence well:

Correspondence between numerical order of magnitude and precision order of magnitude

In this figure the abscissa is the order of magnitude of the floating-point value and the ordinate is the order of magnitude of the achievable precision, both expressed in decimal notation.

For example, test in the console (.toFixed(n) takes an integer n, historically capped at 20, and displays n digits after the decimal point):

0.1.toFixed(20) ==> 0.10000000000000000555 (you can see from this, too, that 0.1 is not stored exactly). From the figure, 0.1 is of order 10<sup>-1</sup>, so its precision should be about 10<sup>-17</sup>. Let's verify:

// Changing digits at order 10^-18 and beyond changes nothing; still judged equal
0.10000000000000000555 == 0.10000000000000000999  // true
// Change a digit at order 10^-17 and the result immediately differs
0.10000000000000000555 == 0.10000000000000001555  // false

The earlier example also fits the figure: values of order 10<sup>10</sup> have precision of order 10<sup>-6</sup>.

That is to say, under IEEE-754's 64-bit floating-point representation, if a number's magnitude is 10<sup>X</sup> and its precision is 10<sup>Y</sup>, then X and Y roughly satisfy (since the relative precision of a 52-bit mantissa is 2<sup>-52</sup> ≈ 2.2 × 10<sup>-16</sup>):

X - 16 = Y

Knowing this, let's go back and look at ECMA's definition of Number.EPSILON (if you didn't know it existed, print it in the console). It is a number of about order 10<sup>-16</sup>, defined as "the difference between 1 and the smallest value greater than 1 that is representable as a Number". What is this number for?

0.1 + 0.2 - 0.3 < Number.EPSILON returns true: ECMA presets a tolerance for developers to use. But we now know this predefined value is really the precision corresponding to magnitude 10<sup>0</sup>. If you want to compare two numbers of a smaller magnitude, the predefined Number.EPSILON is not precise enough as-is; you can mathematically scale this predefined value down to the magnitude at hand.
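
For instance, a magnitude-aware comparison might look like the sketch below. nearlyEqual is a made-up name, and the right tolerance is ultimately application-specific:

function nearlyEqual(a, b) {
    // Scale the magnitude-10^0 tolerance (Number.EPSILON) up or down
    // to match the operands' own magnitude
    var scale = Math.max(Math.abs(a), Math.abs(b));
    return Math.abs(a - b) < Number.EPSILON * scale;
}

nearlyEqual(0.1 + 0.2, 0.3); // true: one ulp apart at magnitude 10^0
nearlyEqual(Math.pow(10, 10) + Math.pow(10, -6), Math.pow(10, 10)); // true: one ulp apart at 10^10
nearlyEqual(1, 2); // false, obviously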

7. Decimals are trouble; integers offer a solution

So how do we make decimal arithmetic in the computer look normal and natural, e.g. have 0.1 + 0.2 output 0.3? One idea, currently sufficient for most scenarios, is to convert the decimals to integers, compute the result in the integer range, and then convert the result back to a decimal. Within a certain range, integers are represented exactly by the IEEE-754 floating-point format; in other words, integer arithmetic within that range is exact, and that range is big enough for most scenarios, so the idea is workable.

1. The "range" and "precision" of numbers in JS

The reason I say a range, rather than all integers, is that integers have precision problems too. It is essential to understand the difference between the concepts of "representable range" and "precision", just like a ruler's measuring range versus the fineness of its markings.

The range of numbers JS can represent, and the range of safe integers it can represent (safe meaning no loss of precision), are defined by the following values:

// Print these in your console and see
Number.MAX_VALUE => the largest representable positive number, of order 10^308
Number.MIN_VALUE => the smallest representable positive number (note: not the smallest number; the smallest number is the negation of Number.MAX_VALUE), of order 10^-324

Number.MAX_SAFE_INTEGER => the largest safe integer, a 16-digit number starting with 9
Number.MIN_SAFE_INTEGER => the smallest safe integer, the negation of the one above

Why are integers above the maximum safe integer inexact? Back to IEEE-754's slots: the mantissa has only 52 of them, and once a number needs more significant bits than that, rounding occurs.
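
A quick console illustration of where integer exactness ends (2<sup>53</sup> is the first integer forced to share a float with its neighbor):

Number.MAX_SAFE_INTEGER;                  // 9007199254740991, i.e. 2^53 - 1
Math.pow(2, 53) === Math.pow(2, 53) + 1;  // true: 2^53 + 1 rounds back to 2^53
Number.isSafeInteger(Math.pow(2, 53));    // false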

2. A flawed piece of code that fixes floating-point calculation anomalies

So, back to precise floating-point calculation in JS: convert the decimals to be calculated into integers, compute the result within the safe-integer range, then convert it back to a decimal.

Hence the following code (but it has a problem):

// Note: pass the two decimals as strings, otherwise precision is already lost
// when the decimal literals are converted to binary floating point
function numAdd(num1/*:String*/, num2/*:String*/) { 
    var baseNum, baseNum1, baseNum2; 
    try { 
        // How many digits the first operand has after the decimal point
        // (num1 is a string here)
        baseNum1 = num1.split(".")[1].length; 
    } catch (e) {
        // No decimal point: zero fractional digits
        baseNum1 = 0; 
    } 
    try { 
        // How many digits the second operand has after the decimal point
        baseNum2 = num2.split(".")[1].length; 
    } catch (e) { 
        baseNum2 = 0;
    }
    // The power of 10 needed to turn both decimals into integers
    baseNum = Math.pow(10, Math.max(baseNum1, baseNum2)); 
    // Scale both operands up to integers, add, then scale the result back down
    return (num1 * baseNum + num2 * baseNum) / baseNum; 
};

The idea is fine, and it seems to solve 0.1 + 0.2: calling numAdd("0.1", "0.2") does output 0.3. But try a few more: numAdd("268.34", "0.83") outputs 269.16999999999996. It blows up instantly, and you don't want to look at a single line of this code anymore.

Actually, on careful analysis the problem is easy to fix. The culprit is an implicit type conversion: num1 and num2 are strings, but in the final return expression they take part in arithmetic directly, so they are implicitly converted from String to Number, and Number is stored as an IEEE-754 float. Precision is lost right there, in the decimal-to-binary conversion.

We can add these two lines just above the return statement in the code above to see what they print:

console.log(num1 * baseNum);
console.log(num2 * baseNum);

You'll find that for the numAdd("268.34", "0.83") example, those two lines print 26833.999999999996 and 83. Clearly the dream of converting to integers was not quite realized.

This, too, is easy to fix: explicitly round the scaled values to integers. We know the operands scaled by that power of 10 must mathematically be integers; they only come out as decimals because scaling amplified the stored error. So why not snap the result to its integer part?

That is, change the last line

return (num1 * baseNum + num2 * baseNum) / baseNum;

to

return (num1 * baseNum + num2 * baseNum).toFixed(0) / baseNum;

The .toFixed(0) on the numerator rounds it to a whole number, based on our explicit knowledge that the numerator must mathematically be an integer.
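
Putting it together, a corrected sketch (same caveats: operands arrive as decimal strings, and the scaled values are assumed to stay within the safe-integer range):

function numAdd(num1/*:String*/, num2/*:String*/) {
    var baseNum, baseNum1, baseNum2;
    try {
        baseNum1 = num1.split(".")[1].length;
    } catch (e) {
        baseNum1 = 0;
    }
    try {
        baseNum2 = num2.split(".")[1].length;
    } catch (e) {
        baseNum2 = 0;
    }
    baseNum = Math.pow(10, Math.max(baseNum1, baseNum2));
    // Snap the scaled sum back to the integer it mathematically must be
    return (num1 * baseNum + num2 * baseNum).toFixed(0) / baseNum;
}

numAdd("268.34", "0.83"); // 269.17
numAdd("0.1", "0.2");     // 0.3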

3. Limitations and other possible approaches

The limitation of this method: it multiplies by a power of 10 to turn decimals into integers. If the fractional parts are long, the resulting integers can exceed the safe-integer range, and the calculation is no longer safe.

Still, as always, choose according to the usage scenario: if the limitation never arises, or arises harmlessly, the method is perfectly usable.

Another idea is to represent decimals as strings and simulate the arithmetic on the strings digit by digit. That applies much more broadly, but the implementation is tedious.
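
Here is a minimal sketch of that idea. strAdd is a made-up name; it handles only non-negative decimal strings, does no validation, and leans on BigInt (so it assumes a modern engine) for the exact carry arithmetic:

function strAdd(a, b) {
    var ap = a.split("."), bp = b.split(".");
    var af = ap[1] || "", bf = bp[1] || "";
    var frac = Math.max(af.length, bf.length);
    // Align the fractional digits, then add exactly as scaled-up integers
    var x = BigInt(ap[0] + af.padEnd(frac, "0"));
    var y = BigInt(bp[0] + bf.padEnd(frac, "0"));
    var sum = (x + y).toString().padStart(frac + 1, "0");
    var intPart = sum.slice(0, sum.length - frac);
    var fracPart = sum.slice(sum.length - frac).replace(/0+$/, "");
    return fracPart ? intPart + "." + fracPart : intPart;
}

strAdd("0.1", "0.2");     // "0.3"
strAdd("268.34", "0.83"); // "269.17"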

If your project runs into such calculations often and you don't want to roll your own, there are ready-made libraries you can use, for example math.js. Thanks for this wonderful world.

8. Summary

As a JS programmer, IEEE-754 floating point probably won't trouble you often, but knowing this material keeps you calm and collected when the related surprises do strike. Having read this far, you should understand IEEE-754's 64-bit floating-point representation and the values it maps to, understand the difference between precision and range, and see that precision loss and unexpected comparison results all stem from the limited number of bits. And the next time you hit such a problem, you won't need to post yet another recurring thread, knowing only the skin of the phrase "IEEE 754" without being able to give a complete explanation. Most importantly, you'll be able to calmly curse "you damn IEEE-754" and then keep on coding...

If there are any mistakes, please leave a comment to point them out. Thank you.

P.S. Thanks to the following reference:

Accurate calculation of JS floating-point addition, subtraction, multiplication and division
