How floating point data is stored in memory


Storage rules for floating point variables:

According to the international standard IEEE 754 (IEEE stands for Institute of Electrical and Electronics Engineers), any binary floating point number V can be expressed in the following form: V = (-1)^S * M * 2^E
(-1)^S represents the sign: when S = 0, V is positive; when S = 1, V is negative. M represents the significand (the significant digits), which is greater than or equal to 1 and less than 2. 2^E represents the exponent part.

For a 32-bit floating point number, the highest bit is the sign bit S, the next 8 bits are the exponent E, and the remaining 23 bits are the significand M. For a 64-bit floating point number, the highest bit is the sign bit S, the next 11 bits are the exponent E, and the remaining 52 bits are the significand M. The layout is shown in the figure below:
[Figure: sign, exponent, and significand fields of the 32-bit and 64-bit formats]
(Always remember that the significand is greater than or equal to 1 and less than 2.)
IEEE 754 has some special provisions for the significand M and the exponent E.

As mentioned above, 1 ≤ M < 2, which means M can be written in the form 1.xxxxxx, where xxxxxx is the fractional part.
IEEE 754 stipulates that, because the first digit of M is always 1, it is not stored; only the xxxxxx part is saved. For example, when saving 1.01, only 01 is saved, and the leading 1 is added back when the value is read. The purpose is to gain one bit of precision. Taking a 32-bit floating point number as an example, only 23 bits are available for M, but with the leading 1 left implicit, 24 significant bits can be represented.

As for the exponent E, the situation is more complicated.
First of all, E is stored as an unsigned integer (unsigned int). This means that if E is 8 bits, its range is 0 to 255; if E is 11 bits, its range is 0 to 2047. However, the exponent in scientific notation can be negative, and an unsigned field cannot hold a negative value. IEEE 754 therefore stipulates that a fixed offset (the bias) is added to the real value of E before it is stored, so that the stored value is never negative. For an 8-bit E the bias is 127; for an 11-bit E it is 1023. For example, the E of 2^-1 is -1, so in a 32-bit floating point number it is stored as -1 + 127 = 126, that is, 01111110. Note that this is the stored value, not the real value.
That covers the storage rules. Next, let's take a look at how floating point numbers are converted into binary.

How to convert floating point numbers to binary

First let's take a simple example:

How do we convert the decimal number 5.25 into a binary fraction?
We divide it into the following steps:
1. Split the number at the decimal point: 5 and 0.25;
2. Convert the integer part to binary: 5 is 101;
3. Convert the fractional part with the "multiply by 2" method: multiply by 2 repeatedly, taking the integer part of each product as the next binary digit, and stop when the fractional part becomes 0. Here 0.25 * 2 = 0.5 (digit 0) and 0.5 * 2 = 1.0 (digit 1), so 0.25 is 0.01;
4. Combine the results: integer part + fractional part gives the binary result 101.01.

Let's check: 101.01 = 4 + 1 + 0.25 = 5.25, which is indeed the value we started with.

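The above are the steps to convert floating point numbers into binary. Let's take a look at a more complicated example: converting decimal 3.14 into binary.
Here the fractional part 0.14 never reaches 0 under repeated multiplication by 2, so the binary expansion does not terminate (it begins 11.001000111101...) and the value can only be stored approximately.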

Storage of 17.625 in memory

First, convert 17.625 into binary: 10001.101

Then move the binary point to the left until only one digit remains before it:
1.0001101 * 2^4, because the point was moved four places.

Significand: because the digit before the point must be 1, IEEE 754 stipulates that only the digits after the point are recorded. So the stored significand is 0001101, padded with zeros on the right to 23 bits.
Exponent: the real value is 4; adding 127 gives 131, that is, 10000011.
Sign: the number is positive, so it is 0.

To sum up, the storage format of 17.625 in memory is: 01000001 10001101 00000000 00000000

Storage of floating point data in memory

Let's look at this example first. Think about how floating-point data is stored, and how that differs from integer data.

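#include <stdio.h>

int main(void)
{
    /* The f suffix makes the literal a float rather than a double. */
    float f = 5.5f;

    /* Inspect the four bytes at &f in the debugger's memory window. */
    return 0;
}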

Let’s analyze it:

The integer part 5 is 101 in binary, and the fractional part 0.5 is 1 * 2^-1, that is, 0.1, so 5.5 is 101.1 in binary.
In the standard form: V = (-1)^0 * 1.011 * 2^2, so S = 0, M = 1.011, E = 2.
Stored exponent: E + 127 = 129 -> 10000001
Stored significand: 011 followed by twenty zeros -> 01100000000000000000000
Finally, 0100 0000 1011 0000 0000 0000 0000 0000 is stored, which is 0x40b00000 in hexadecimal.

Let's check whether the result matches the calculation:
[Figure: memory view of f in VS]
The memory window shows the same four bytes that we calculated. Little-endian byte order is used under VS, so the bytes appear in reverse order: 00 00 b0 40.
Why was the situation of E said to be more complicated? When a floating point number is read back from memory, E falls into one of three cases:

E is neither all 0s nor all 1s

In this case the floating point number is decoded by the following rule: subtract 127 (or 1023) from the stored value of the exponent E to obtain the real value, then put the leading 1 back in front of the significand M. For example, the binary form of 0.5 (1/2) is 0.1. Since the integer part must be 1, the binary point is moved one place to the right, giving 1.0 * 2^(-1). The stored exponent is -1 + 127 = 126, that is, 01111110. The significand 1.0 with its integer part removed is 0, which is padded to 23 bits: 00000000000000000000000. The binary representation is therefore 0 01111110 00000000000000000000000. This is the normal case.

E is all 0s

In this case the real exponent of the floating point number is 1 - 127 = -126 (or 1 - 1023 = -1022). The significand M no longer has the leading 1 added; it is read as a fraction of the form 0.xxxxxx. This is done so that ±0 and very small numbers close to 0 can be represented.

Why this rule? If E is all 0s, the stored value is 0, and applying the normal rule would give a real exponent of -127. Restoring V = (-1)^S * 1.xxxxxx * 2^-127 gives a number of very small magnitude, one that tends to ±0 but can never be exactly 0 because of the leading 1. That is why the rule above applies instead.

E is all 1s

In this case, if the stored significand bits M are all 0, the value is ±infinity (the sign depends on the sign bit S); if M is not all 0, the value is NaN (not a number).

If E is all 1s, the stored value is 255 (or 2047). Subtracting 127 (or 1023) would give a real exponent of 128 (or 1024). Restoring V = (-1)^S * 1.xxxxxx * 2^128 (or 2^1024) gives a number of enormous magnitude, which is why this pattern is reserved for ±infinity.
————————————————
Reference article: https://blog.csdn.net/stephen_999/article/details/127475793