Float type memory storage and loss of precision

Binary representation of decimal numbers

Converting a decimal integer to binary

For example, to express 11 as a binary number:
11/2=5 remainder 1
5/2=2 remainder 1
2/2=1 remainder 0
1/2=0 remainder 1
The quotient is now 0, so the process stops. Reading the remainders from bottom to top, 11 in binary is 1011.

The process ends as soon as the quotient reaches 0, and repeatedly dividing any integer by 2 must eventually reach 0. In other words, the algorithm that converts an integer to binary never loops forever: an integer can always be represented exactly in binary. A decimal fraction, however, cannot always be.
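The method above is easy to turn into code. Below is a minimal sketch in Java of the repeated-division procedure (the class and method names are my own; Integer.toBinaryString(11) would give the same answer):

    public class IntToBinary {
        // Convert a non-negative integer to its binary string by repeated division by 2.
        static String toBinary(int n) {
            if (n == 0) return "0";
            StringBuilder bits = new StringBuilder();
            while (n > 0) {            // the quotient always reaches 0, so the loop terminates
                bits.append(n % 2);    // record the remainder
                n /= 2;
            }
            return bits.reverse().toString();   // read the remainders from bottom to top
        }

        public static void main(String[] args) {
            System.out.println(toBinary(11));   // prints 1011
        }
    }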

Converting a decimal fraction to binary

0.9*2=1.8, take the integer part 1
0.8 (the fractional part of 1.8)*2=1.6, take the integer part 1
0.6*2=1.2, take the integer part 1
0.2*2=0.4, take the integer part 0
0.4*2=0.8, take the integer part 0
0.8*2=1.6, take the integer part 1
0.6*2=1.2, take the integer part 1
......... Reading the integer parts from top to bottom, 0.9 in binary is 0.1110011001100110011... (the group 0011 repeats forever).

The calculation above loops: multiplying by 2 never eliminates the fractional part, so the algorithm never terminates. Clearly, some decimal fractions cannot be represented exactly in binary. The reason is simple: just as 1/3 cannot be written exactly in decimal, 1/10 (and therefore 0.9) cannot be written exactly in binary. This also explains why floating-point subtraction suffers from precision loss: the operands are already inexact before the subtraction is even performed.
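The repeated-multiplication procedure can also be sketched in Java (again, the names are my own; the loop is cut off after a fixed number of bits precisely because, as explained above, it would otherwise never stop):

    public class FractionToBinary {
        // Convert a fraction in [0, 1) to binary by repeatedly multiplying by 2
        // and taking the integer part, up to maxBits digits.
        static String toBinary(double fraction, int maxBits) {
            StringBuilder bits = new StringBuilder("0.");
            for (int i = 0; i < maxBits && fraction != 0; i++) {
                fraction *= 2;
                if (fraction >= 1) {     // integer part is 1
                    bits.append('1');
                    fraction -= 1;
                } else {                 // integer part is 0
                    bits.append('0');
                }
            }
            return bits.toString();
        }

        public static void main(String[] args) {
            // Prints 0.11100110011001100110 (the pattern derived above; note that 0.9
            // itself is already stored only approximately as a double).
            System.out.println(toBinary(0.9, 20));
        }
    }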

Storage of the float type in memory

Java's float type occupies 4 bytes (32 bits) in memory. The structure of those 32 bits is as follows:

Float memory storage structure (4 bytes):

    Bit(s)     31                     30                  29-23           22-0
    Meaning    Real number sign bit   Exponent sign bit   Exponent bits   Significant digits

A sign bit of '0' means positive and '1' means negative (see step (4) below). There are 24 significant digits in total, of which only 23 are stored; the leading '1' is implied and is added back when decoding.
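The raw 32 bits can be inspected directly from Java. The sketch below (my own illustration, not part of the original text) uses Float.floatToIntBits to pull out the three fields in the table; the printed exponent field combines the exponent sign bit (bit 30) with the seven exponent bits (bits 29-23):

    public class FloatBits {
        public static void main(String[] args) {
            int bits = Float.floatToIntBits(12.0f);
            int sign     = (bits >>> 31) & 0x1;        // bit 31
            int exponent = (bits >>> 23) & 0xFF;       // bits 30-23
            int mantissa = bits & 0x7FFFFF;            // bits 22-0
            System.out.println("sign     = " + sign);
            System.out.println("exponent = "
                    + String.format("%8s", Integer.toBinaryString(exponent)).replace(' ', '0'));
            System.out.println("mantissa = "
                    + String.format("%23s", Integer.toBinaryString(mantissa)).replace(' ', '0'));
            // For 12.0f this prints 0, 10000010 and 10000000000000000000000,
            // which matches the layout used in the examples further below.
        }
    }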

The steps to convert a float value into its memory storage format are:
(1) Convert the absolute value of the real number to binary, handling the integer part and the fractional part with the methods discussed above.
(2) Move the decimal point of this binary number left or right by n places, until it sits immediately to the right of the first significant digit.
(3) Take twenty-three digits starting from the first digit to the right of the decimal point and place them in bits 22 to 0.
(4) If the real number is positive, put '0' in bit 31; otherwise put '1'.
(5) If n was obtained by moving the point to the left, the exponent is positive: put '1' in bit 30. If n was obtained by moving to the right, or n = 0, put '0' in bit 30.
(6) If n was obtained by moving to the left, subtract 1 from n, convert it to binary, pad it on the left with '0' to seven bits, and place it in bits 29 to 23. If n was obtained by moving to the right, or n = 0, convert n to binary, pad it on the left with '0' to seven bits, invert every bit, and place the result in bits 29 to 23.

For example, following these steps, 0.2356 is stored as: 0 0 1111100 11100010100000100100000
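As a cross-check (my own, not from the original text), the JVM can be asked for the bits of 0.2356f; printed in the same four groups, they should reproduce the pattern above:

    public class VerifyEncoding {
        public static void main(String[] args) {
            int bits = Float.floatToIntBits(0.2356f);
            String s = String.format("%32s", Integer.toBinaryString(bits)).replace(' ', '0');
            // sign bit | exponent sign bit | exponent bits | significant digits
            System.out.println(s.substring(0, 1) + " " + s.substring(1, 2) + " "
                    + s.substring(2, 9) + " " + s.substring(9));
            // Expected: 0 0 1111100 11100010100000100100000
        }
    }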

The steps to convert a float stored in memory back to decimal are:
(1) Write down the binary digits in bits 22 to 0, and add a '1' on the left to obtain the twenty-four significant digits. Place the decimal point immediately to the right of that leftmost '1'.
(2) Take the value n represented by bits 29 to 23. If bit 30 is '0', invert those seven bits to get n; if bit 30 is '1', add 1 to n.
(3) Move the decimal point n places to the left (if bit 30 is '0') or to the right (if bit 30 is '1') to obtain the binary representation of the real number.
(4) Convert this binary number to decimal, and give it a positive or negative sign according to whether bit 31 is '0' or '1'.
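The sketch below (again my own illustration) goes in this reverse direction. For the exponent it uses the standard IEEE 754 bias of 127, which for normalized values yields the same numbers as the exponent-sign scheme described above:

    public class DecodeBits {
        public static void main(String[] args) {
            int bits = Float.floatToIntBits(0.2356f);
            int sign     = (bits >>> 31) & 0x1;
            int exponent = ((bits >>> 23) & 0xFF) - 127;    // undo the exponent encoding
            int mantissa = (bits & 0x7FFFFF) | 0x800000;    // add the implied leading '1'
            double value = mantissa * Math.pow(2, exponent - 23);
            if (sign == 1) value = -value;
            System.out.println(value);                      // very close to 0.2356, but not exact
            System.out.println(Float.intBitsToFloat(bits)); // the library decode prints 0.2356
        }
    }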

Floating point subtraction

The floating-point addition and subtraction process is more complicated than fixed-point arithmetic. It is carried out in roughly four steps:
(1) Check for a zero operand.
If either of the two floating-point numbers to be added or subtracted is 0, the result is known immediately and the remaining steps can be skipped.

(2) Compare the exponents (order codes) and align them.
To add or subtract two floating-point numbers, first check whether their exponents are the same, that is, whether their decimal points are aligned.
If the exponents are equal, the decimal points are aligned and the mantissas can be added or subtracted directly.
If the exponents differ, the decimal points are not aligned, and the two exponents must first be made equal. This process is called exponent alignment.

How alignment is done (let the exponents of the two floating-point numbers be Ex and Ey):
Ex or Ey is changed by shifting the corresponding mantissa until the two exponents are equal. Since most floating-point numbers are normalized, shifting a mantissa to the left would discard its most significant bits and cause a large error, whereas shifting it to the right only discards its least significant bits and causes a much smaller error. Alignment therefore always shifts a mantissa to the right, and the exponent is increased accordingly so that the value represented stays the same.
Since the exponent that is increased must end up equal to the other one, it can only be the smaller of the two. Alignment therefore always adjusts the smaller exponent toward the larger one: the mantissa of the number with the smaller exponent is shifted right (equivalent to moving its decimal point left), and its exponent is increased by 1 for each shift, until the two exponents are equal. The number of right shifts equals the exponent difference ΔE.

(3) Add or subtract the mantissas (significant digits).
Once the exponents are aligned, the significant digits can be added or subtracted. Whether the operation is an addition or a subtraction, it is carried out as an addition, in exactly the same way as fixed-point addition and subtraction.

(4) The result is normalized and rounded.
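Before looking at the 12.0f and 11.9f example below, here is a rough sketch (my own illustration, with made-up operands) of steps (2) and (3) using plain integer arithmetic on 24-bit significands; signs, rounding and normalization are omitted:

    public class AlignDemo {
        public static void main(String[] args) {
            // 24-bit significands with the implied leading 1, plus unbiased exponents.
            // Illustrative operands: 6.5 = 1.101b * 2^2 and 0.75 = 1.1b * 2^(-1).
            long mx = 0b110100000000000000000000L; int ex = 2;
            long my = 0b110000000000000000000000L; int ey = -1;

            // Step (2): shift the significand of the smaller-exponent operand right;
            // its lowest bits fall off, which is where precision can be lost.
            while (ex > ey) { my >>= 1; ey++; }
            while (ey > ex) { mx >>= 1; ex++; }

            // Step (3): subtract the aligned significands as integers.
            long diff = mx - my;
            System.out.println(diff * Math.pow(2, ex - 23));   // prints 5.75 (= 6.5 - 0.75)
        }
    }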

The memory storage format of 12.0f is: 0 1 0000010 10000000000000000000000     
The memory storage format of 11.9f is: 0 1 0000010 01111100110011001100110
It can be seen that the exponents of the two numbers are exactly the same, so only the significant digits need to be subtracted.

The result of 12.0f - 11.9f (before normalization) is: 0 1 0000010 00000011001100110011010
Converted back to decimal, this is: 0.00011001100110011010 = 0.10000038, rather than the exact 0.1 one might expect.
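The same result can be observed directly from Java (a minimal check, not part of the original calculation):

    public class SubtractionDemo {
        public static void main(String[] args) {
            float result = 12.0f - 11.9f;
            System.out.println(result);   // prints 0.10000038, matching the hand calculation above
        }
    }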

Origin blog.csdn.net/sinat_37138973/article/details/84786561