Principles of floating-point numbers, and an analysis of the causes of errors in cross-platform calculations

This article first introduces the composition and precision of floating-point numbers, and then explains the causes of the errors.

float composition

A float is composed of three parts: the sign bit S, which occupies 1 bit; the exponent E, which occupies 8 bits; and the mantissa M, which occupies 23 bits.

float storage method

First of all, we know that ordinary scientific notation converts every number into the form (±)a.b × 10^c, where a is an integer from 1 to 9 (9 possibilities in total), b is the string of digits after the decimal point, and c is the exponent of 10. A computer stores everything in binary, so a number stored in a float must first be converted into the form (±)a.b × 2^c. Since the largest binary digit is 1, the leading digit is always 1 and the notation can be written as (±)1.b × 2^c; therefore, to store a number, a float only needs to record (±), b and c. Float storage divides the 4 bytes (32 bits) into 3 parts, holding the sign, the fraction part and the exponent part respectively:

  1. Sign (1 bit): indicates whether the floating-point number is positive or negative; 0 means positive, 1 means negative.
  2. Exponent (8 bits): the exponent part, i.e. the number c mentioned above. c is not stored directly: in order to represent both positive and negative exponents and their magnitudes, the value actually stored is c + 127 (the biased exponent).
  3. Mantissa (23 bits): the fraction part, i.e. the number b mentioned above (the leading 1 is implicit and is not stored).
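As a quick illustration, here is a minimal C sketch of my own (not from the original article) that pulls the three fields out of a float's bit pattern; the field widths and the bias of 127 are exactly as described above:

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    float f = -6.5f;
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);            /* reinterpret the 32 bits */

    uint32_t sign     = bits >> 31;            /* 1 bit                     */
    uint32_t exponent = (bits >> 23) & 0xFF;   /* 8 bits, biased by 127     */
    uint32_t mantissa = bits & 0x7FFFFF;       /* 23 bits, implicit leading 1 */

    printf("sign=%u exponent=%u (c=%d) mantissa=0x%06X\n",
           sign, exponent, (int)exponent - 127, mantissa);
    /* for -6.5f this prints: sign=1 exponent=129 (c=2) mantissa=0x500000 */
    return 0;
}
```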

float storage example

Taking the number 6.5 as an example, let's see how this number is stored in a float variable:

  1. Look at the integer part first: repeatedly dividing 6 by 2 and collecting the remainders gives the binary representation 110.
  2. Then look at the decimal part: repeatedly multiplying 0.5 by 2 and taking the integer parts gives the binary representation .1 (if you don't know how to convert a decimal fraction to binary, please look it up).
  3. Stitching the two together gives 110.1; written in the style of scientific notation, this is 1.101 × 2^2.
  4. From this form we can read off that the sign is positive, the mantissa is 101, and the exponent is 2.
  5. The sign is positive, so the first bit is 0; the exponent 2 plus the bias of 127 equals 129, which is 10000001 in binary and fills bits 2-9; the mantissa 101 is placed at the start of the remaining 23 bits (padded with trailing zeros).
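Assembling the three fields gives the bit pattern 0 10000001 10100000000000000000000, i.e. 0x40D00000. A quick C check of this (my own sketch, reusing the memcpy trick shown above):

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    float f = 6.5f;
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    /* expect 0 | 10000001 | 10100000000000000000000 = 0x40D00000 */
    printf("6.5f is stored as 0x%08X\n", bits);   /* prints 0x40D00000 */
    return 0;
}
```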

float range

After understanding the above principles, you can work out the range of the float type: find the maximum value that can be represented, then set the sign bit to 1 to turn it into a negative number, which gives the minimum value. To represent the largest value, you need both the largest mantissa and the largest exponent.
That gives a mantissa of all ones, 11111111111111111111111 (23 bits), and an exponent field of 11111111; but an all-ones exponent field is reserved for special purposes (infinity and NaN), so the largest usable exponent field is 11111110, i.e. 254, and subtracting the bias of 127 gives an exponent of 127. The largest number is therefore 1.11111111111111111111111 × 2^127 (23 ones after the point).
This value is 340282346638528859811704183484516925440, usually written as 3.4028235E38, so the range of float is [-3.4028235E38, 3.4028235E38].
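This matches the FLT_MAX constant in C's <float.h>; a quick check of my own:

```c
#include <stdio.h>
#include <float.h>

int main(void) {
    /* FLT_MAX is (2 - 2^-23) * 2^127, exactly the value derived above */
    printf("FLT_MAX = %.7e\n", FLT_MAX);   /* prints 3.4028235e+38 */
    return 0;
}
```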

float precision

The precision of float data depends on the mantissa. I believe everyone knows this, but how to actually calculate the precision confused me for a long time, and I only gradually understood it through repeated experiments. First, set the exponent aside and consider only the 23-bit mantissa: it can represent the range [0, 2^23 − 1]. In fact a "1" is implied in front of the mantissa, so there are effectively 24 bits, and the representable range becomes [0, 2^24 − 1]. (Because the implicit bit defaults to "1", the smallest number represented this way is 1 rather than 0; 0 is a special case that will be introduced separately later, and here we only consider ordinary values.) The largest number these 24 bits can represent is 2^24 − 1, which is 16777215 in decimal, so every integer in [0, 16777215] can be represented exactly: each of them can be written in the form 1.b × 2^c just by adjusting the exponent c accordingly.

The number 16777215 can be written as 1.11111111111111111111111 × 2^23 (23 ones after the point), so it can be represented exactly. Next consider the larger number 16777216: it is exactly an integer power of 2 and can be written as 1.00000000000000000000000 × 2^24, so it can also be represented exactly. Now consider the still larger number 16777217. Written in the same style it would be 1.000000000000000000000001 × 2^24, but at this point there are already 24 digits after the binary point, and the 23-bit mantissa field cannot store it exactly.
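The boundary is easy to check in code; here is a minimal C sketch of my own (the assignment to g forces the sum to be rounded back to float):

```c
#include <stdio.h>

int main(void) {
    float f = 16777216.0f;   /* 2^24, exactly representable */
    float g = f + 1.0f;      /* 16777217 has no exact float form */
    printf("%.1f\n", g);     /* prints 16777216.0 */
    printf("%d\n", g == f);  /* prints 1: adding 1 changed nothing */
    return 0;
}
```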

Seeing this, it looks like 16777216 is a boundary, and numbers beyond it cannot be represented exactly. Does that mean no number greater than 16777216 can be represented exactly? Actually no: for example, the number 33554432 is exactly 1.00000000000000000000000 × 2^25. Combining this with the memory layout of float described above, a number greater than 16777216 (and within the upper limit) can be represented exactly as long as it can be written as a sum of fewer than 24 powers of 2 whose exponents n differ from each other by less than 24. In other words, every exactly representable number greater than 16777216 is an exact number in the range [0, 16777215] multiplied by 2^n; similarly, every exactly representable positive number less than 1 is an exact number in [0, 16777215] multiplied by 2^n, with n negative.
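A small sketch of my own that makes the "jumping" spacing visible, using the standard nextafterf from <math.h> (link with -lm on Linux):

```c
#include <stdio.h>
#include <math.h>

int main(void) {
    /* above 2^24, consecutive floats are 2 apart, so only even
       integers in [2^24, 2^25] are exactly representable */
    float f = 16777216.0f;
    printf("next float after %.1f is %.1f\n", f, nextafterf(f, INFINITY));
    /* prints: next float after 16777216.0 is 16777218.0 */
    return 0;
}
```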

16777216 has thus been shown to be a boundary: every integer smaller than it can be represented exactly. In scientific notation it is 1.6777216 × 10^7, which has 8 significant digits in total. But since the leading digit is at most 1, all 8 digits cannot be guaranteed in every case, so only 7 significant digits are guaranteed to be exact (there are nominally 8 digits, but the leading digit can only be 1; once a number like 26777217 appears, it can no longer be stored exactly, so only 7 digits can be guaranteed). This is what is usually referred to as the precision of float data.
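For instance (my own check), the 8-digit number 26777217 falls between two representable floats and silently loses its last digit:

```c
#include <stdio.h>

int main(void) {
    float f = 26777217.0f;   /* odd and above 2^24: no exact float exists */
    printf("%.1f\n", f);     /* prints 26777216.0: the 8th digit was lost */
    return 0;
}
```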

float decimals

From the above analysis we already know that the numbers float can represent beyond 16777216 are "jumping" (spaced apart), and likewise the decimals float can represent are jumping too. A representable decimal must be expressible as a sum of powers of 2, e.g. 0.5, 0.25, 0.125... and sums of these numbers. A number like 5.2 cannot be stored exactly using the float type: the binary representation of 5.2 is 101.0011001100110011001100110011... with the trailing 0011 repeating forever, but float can store at most 23 mantissa bits, so what the computer actually stores for 5.2 is 101.001100110011001100110, which is the number 5.19999980926513671875; the computer uses this number, the closest one to 5.2, to represent 5.2. The precision for decimals is consistent with the analysis above: once the 8th significant digit changes, float may not be able to detect the change.
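You can print the value that is actually stored (my own sketch; %.20f shows the exact decimal expansion of the nearest float):

```c
#include <stdio.h>

int main(void) {
    float f = 5.2f;
    /* the float nearest to 5.2 is an exact dyadic rational */
    printf("%.20f\n", f);    /* prints 5.19999980926513671875 */
    return 0;
}
```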

Causes of floating-point calculation errors on different platforms

The precision of floating-point numbers is limited. Taking 32-bit floats as an example: when performing floating-point operations such as multiplication, some platforms use an FPU (floating-point processing unit), which may make the intermediate results more precise. For example, on a 32-bit x86 system the x87 FPU performs calculations in 80-bit registers, whereas the non-FPU path uses SSE: although SSE registers are 128 bits wide, only 32 of those bits are used for a single-precision operation, so intermediate results are rounded to 32 bits during the computation, which produces different errors.

Therefore, different platforms use different floating-point evaluation logic, which means the same input can produce different output. For example, in C# the expression 80838.0f * -2499.0f evaluates to -202014162 on 32-bit Linux but to -202014160 on 32-bit/64-bit Windows (the Linux result is reported for Ubuntu 12.04 + gcc 4.6.3 and is pending verification).
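A C analogue of this effect, as a sketch of my own; whether the two results differ depends on the compiler, the hardware, and flags such as gcc's -mfpmath=387 / -mfpmath=sse:

```c
#include <stdio.h>

int main(void) {
    float a = 80838.0f, b = -2499.0f;

    /* The exact product is -202014162, which needs 28 significant bits
       and cannot fit in a 24-bit float significand; correctly rounding
       it to float gives -202014160.  An x87 build may keep the full
       80-bit intermediate in a register and convert it to int directly,
       yielding -202014162 instead. */
    int direct = (int)(a * b);

    float stored = a * b;    /* storing to a float variable forces rounding */
    int rounded = (int)stored;

    printf("direct=%d  stored=%d\n", direct, rounded);
    return 0;
}
```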

Summary:

The floating-point computing standard IEEE 754 recommends that implementers provide extended-precision formats. Intel x86 processors have an FPU (floating-point unit) that supports such an 80-bit extension, but other processors do not necessarily support it. As a result, on 32-bit systems the same float computation can produce different results on different platforms, so float cannot be relied upon when bit-identical cross-platform results are required.

Reference articles:

The precision and value range of float
A floating-point bug produced by cross-platform differences | an unexpected result

Origin: blog.csdn.net/qq_41841073/article/details/127057494