Why the precision of single-precision floating-point numbers is 7 digits

Cause

I ran into a problem today: a character got stuck against the side of a model, even though the model looked normal in PVD. The root cause turned out to be that two vertices of one of the model's triangles are extremely close together, and this tiny difference is discarded during floating-point arithmetic. PhysX therefore computes the moved distance as 0, so the character stays stuck in place and cannot jump. Here are the two vertices:

	[0] = {x = -4.10000086, y = -0.200000167, z = -3.56512594}
	[1] = {x = -4.10000086, y = -0.200000077, z = -3.56512594}

There is only a slight difference in the y value, which at least is still within the valid range of floating-point numbers. A few questions will help frame the investigation:

  • 1. Why do the distinct floating-point numbers -0.200000167 and -0.200000077 produce the same result when 4.76143503 (a coordinate used in the computation) is added to each? What is the reason?
  • 2. What is the logic of floating-point arithmetic?
  • 3. Why is the precision of floating-point numbers 7 decimal digits?

Floating point precision

IEEE754 representation

The IEEE754 standard encodes a 32-bit float as 1 sign bit, 8 exponent bits (biased by 127), and 23 mantissa bits with an implicit leading 1.

Test

Extract the problem data above and run the following test:

	// Requires <bitset>, <cstdint>, <cstring>, <iostream>.
	float y  = 4.76143503f;
	float y1 = -0.200000167f;
	float y2 = -0.200000077f;
	
	float y1Add = y + y1;
	float y2Add = y + y2;
	
	// Reinterpret the float's bits via memcpy; the original
	// *(_ULonglong*)&y cast reads 8 bytes from a 4-byte float,
	// which is undefined behavior and MSVC-specific.
	auto bits = [](float f) {
		std::uint32_t u;
		std::memcpy(&u, &f, sizeof u);
		return std::bitset<32>(u);
	};
	
	std::cout << bits(y) << std::endl;
	std::cout << bits(y1) << std::endl;
	std::cout << bits(y1Add) << std::endl;
	
	std::cout << bits(y2) << std::endl;
	std::cout << bits(y2Add) << std::endl;
	
	std::cout << "fTest = " << y1Add << std::endl;
	std::cout << "fTest1 = " << y2Add << std::endl;

The output:

	01000000100110000101110110101101
	10111110010011001100110011011000
	01000000100100011111011101000110
	10111110010011001100110011010010
	01000000100100011111011101000110
	fTest = 4.56143
	fTest1 = 4.56143

Summary:

  • The floating-point representations of y1 and y2 are indeed different, i.e. the 23 mantissa bits of IEEE754 are enough to distinguish them
  • After the addition, the two results become identical

Next, let's analyze the cause.

Floating point arithmetic

Referring to [2] and [3], the main steps are:

  • 1. Normalized representation
  • 2. Exponent alignment
  • 3. Mantissa addition/subtraction
  • 4. Normalization
  • 5. Rounding
	IEEE 754 standard floating-point addition algorithm
	Floating-point addition is more complex than multiplication; a brief overview of the algorithm follows.
	X3 = X1 + X2
	X3 = (M1 x 2^E1) +/- (M2 x 2^E2)
	1) X1 and X2 can only be added if the exponents are the same, i.e. E1 = E2.
	2) We assume that X1 has the larger absolute value of the two numbers; if not, swap the values so that Abs(X1) > Abs(X2).
	3) The initial exponent of the result is the larger of the two; since we know X1's exponent is bigger, the initial exponent result is E3 = E1.
	4) Calculate the exponent difference, Exp_diff = E1 - E2.
	5) Shift the mantissa M2 right by Exp_diff (i.e. move its decimal point left by that many places). Now the exponents of X1 and X2 are the same.
	6) Compute the sum/difference of the mantissas depending on the sign bits S1 and S2:
	   if S1 == S2, add the mantissas; if S1 != S2, subtract them.
	7) Normalize the resultant mantissa M3 if needed (1.m3 format), and adjust the initial exponent result E3 = E1 according to the normalization.
	8) If either operand is infinity or E3 > Emax, overflow has occurred and the output is set to infinity. If E3 < Emin, underflow has occurred and the output is set to zero.
	9) NaNs are not supported.

Then I worked through the calculation by hand:

	01000000100110000101110110101101 = 4.76143503
	exponent: 10000001 = 129 - 127 = 2
	mantissa: 1.00110000101110110101101

	--------------------------------------------------
	10111110010011001100110011011000 = -0.200000167
	exponent: 01111100 = 124 - 127 = -3
	mantissa: 1.10011001100110011011000

	01000000100100011111011101000110 = 4.56143475 = 4.76143503 + -0.200000167
	exponent: 10000001 = 129 - 127 = 2
	mantissa: 1.00100011111011101000110

	Addition:
	1.1001 1001 1001 1001 1011 000
	Align: move the decimal point left by 2 - (-3) = 5 bits
	0.0000 1100 1100 1100 1100 110  11 000
	Subtract:
	 	1.0011 0000 1011 1011 0101 101
	-	0.0000 1100 1100 1100 1100 110  11 000
	= 	1.0010 0011 1110 1110 1000 111
	machine:	1.0010 0011 1110 1110 1000 110

	--------------------------------------------------
	10111110010011001100110011010010 = -0.200000077
	exponent: 01111100 = 124 - 127 = -3
	mantissa: 1.10011001100110011010010

	01000000100100011111011101000110 = 4.56143475 = 4.76143503 + -0.200000077
	exponent: 10000001 = 129 - 127 = 2
	mantissa: 1.00100011111011101000110
	
	Addition:
	1.10011001100110011010010
	Align: move the decimal point left by 2 - (-3) = 5 bits
	0.0000110011001100110011010010
	Subtract (on a calculator it is quicker to drop the decimal point and the leading 1):
	 	1.0011 0000 1011 1011 0101 101
	-	0.0000 1100 1100 1100 1100 110	10010
	=	1.0010 0011 1110 1110 1000 111
	machine:	1.0010 0011 1110 1110 1000 110

Summary:

  • The two hand-computed results are indeed identical, but both differ from the machine's final result in the last bit, which means my manual calculation still deviates slightly from what the hardware does during rounding and normalization. A question is left open here:
	// 4.76143503 + -0.200000167
	1.0010 0011 1110 1110 1000 111
	// 4.76143503 + -0.200000077
	1.0010 0011 1110 1110 1000 111
	// binary of the machine's final result
	1.0010 0011 1110 1110 1000 110
  • This answers the first two questions: the key point is that during floating-point addition, aligning the exponents shifts out the low-order mantissa bits of the smaller operand (here 5 bits), so the difference between -0.200000167 and -0.200000077 is lost.

Now let's look at why IEEE754, with 23 mantissa bits, yields a precision of 7 decimal digits.

Floating point precision

First, pin down the question:

  • Precision here refers to the digits after the decimal point; the 23 bits of the IEEE754 standard are binary significant digits
  • What I want to understand is how 23 binary significant digits become 7 decimal significant digits
  • After searching online, what can be confirmed is that the decimal fraction is accurate to 6-7 digits
  • How is my final question proved mathematically?

The most common explanation, per [5] and [6], is as follows:

Because the value of a float is determined by its 23 mantissa bits, and the largest number those 23 bits can express is 2^23 = 8388608, any integer exactly representable in 23 binary bits has at most 7 decimal digits. The specific value does not matter; what matters is that this number has 7 decimal digits, so the decimal precision of float is said to be 7 digits. Put bluntly, the largest number that binary can express exactly here is 7 digits long.

Simply put, 8388608 is a 7-digit number, but 23 bits cannot cover every 7-digit number, so the precision is 6-7 significant digits. I still had a doubt here, though: this argument counts digits before the decimal point, so how does it apply to digits after the decimal point?

My understanding

After reading [7], [8], and [12], I prefer the explanations in [8] and [12], but what I finally arrived at is my own version.

  • 1. The conversion between binary and decimal after the decimal point takes the following form, where each term is multiplied by a 0 or 1 indicating whether that significant bit is set

$0.x_1x_2 \ldots x_{23} = x_1 \cdot 2^{-1} + x_2 \cdot 2^{-2} + \ldots + x_{23} \cdot 2^{-23}, \quad x_i \in \{0, 1\}$
The smallest unit after the decimal point is therefore $2^{-23} = 0.00000011920928955078125$,
which already shows that some decimal values can only be approximated by floating-point numbers.

  • 2. The mathematical derivation from [8]

$-\log_{10} 2^{-23} = 23 \log_{10} 2 \approx 6.924$

Own understanding

  • (1) With 23 fraction bits, $2^{-23}$ is the smallest representable decimal fraction; every other representable fraction is an integer multiple of this minimum unit [12]

$2^{-23} = 0.00000011920928955078125$

At first I thought: there are obviously many more decimal digits here, so why is the precision only 7 digits?

  • (2) How does this minimum unit determine the precision?
    If the minimum unit is 0.1, the precision is 1 digit after the decimal point, because no integer multiple of 0.1 can ever produce a smaller decimal such as 0.01 or 0.02.
    If the minimum unit is 0.000001, the precision is 6 digits after the decimal point; likewise, no multiple can ever produce anything smaller than 0.000001, such as 0.0000001 or 0.0000002.
  • (3) Strictly speaking, the precision of 23 binary significant digits is 6-7 digits after the decimal point.
    From the above, any combination of the 23 bits is an integer multiple of the minimum unit 0.00000011920928955078125 and can only produce values larger than it. So you can never obtain a decimal like 0.00000001 (8 digits after the decimal point), but you can always reach the granularity of 0.000001 (6 digits). A precision of 7 digits after the decimal point, however, cannot be fully covered. Ignoring rounding and keeping only 7 digits after the decimal point:
0.00000011920928955078125 * 1 = 0.0000001
0.00000011920928955078125 * 2 = 0.0000002
0.00000011920928955078125 * 3 = 0.0000003
0.00000011920928955078125 * 4 = 0.0000004
0.00000011920928955078125 * 5 = 0.0000005
0.00000011920928955078125 * 6 = 0.0000007
0.00000011920928955078125 * 7 = 0.0000008
0.00000011920928955078125 * 8 = 0.0000009
0.00000011920928955078125 * 9 = 0.000001

As you can see, 0.0000006 cannot be produced. If rounding is taken into account, other values may become the unreachable ones instead; either way, not all decimals with 7 digits after the decimal point can be represented.

Summary

  • Floating point: love and hate

reference

[1] In-depth understanding of the significant digits of floating-point numbers

[2] Addition and subtraction of binary floating point numbers

[3] Floating Point Tutorial

[4] Detailed explanation of the principle of floating-point arithmetic that programmers must know

[5] Why the precision of float is 7 digits

[6] The main difference between java floating point type float and double, what is the size of their decimal precision range?

[7] Reasons for loss of precision in floating point calculation

[8] Why are the significant digits of single-precision floating-point numbers 7 digits, and my count is obviously 6 digits, do you think I counted it right?

[9] Why IEEE754 single-precision float has only 7 digit precision?

[10] In-depth: IEEE 754 Multiplication And Addition

[11] Dialysis of floating-point precision: Inaccurate decimal calculation + floating-point precision lost

[12] Regarding the float type as single-precision, the effective number of digits is 7 digits. Why are these 8 digits accurate in the following example?

[13] Binary calculator web version



Origin blog.csdn.net/pkxpp/article/details/103059502