An in-depth analysis of how floating-point numbers are stored in memory

Preface: When writing code, we often store a value as an integer and print it as a floating-point number, or store it as a floating-point number and print it as an integer. The results are often surprising. Why does the value change, and what causes the change? Let's start from the way values are laid out in memory and work through it in depth.

Contents

Representation of floating-point numbers

Floating-point storage model

Significand M

Exponent E

Example explanation

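Let's start with an example:
#include <stdio.h>

int main(void)
{
    int n = 9;
    /* View the bytes of n through a float pointer. This cast breaks C's
       strict-aliasing rule; it is used here only to expose the raw bits. */
    float *pFloat = (float *)&n;
    printf("The value of n is: %d\n", n);
    printf("The value of *pFloat is: %f\n", *pFloat);
    *pFloat = 9.0f; /* overwrite the same bytes with the float 9.0 */
    printf("The value of n is: %d\n", n);
    printf("The value of *pFloat is: %f\n", *pFloat);
    return 0;
}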
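In this example, n is defined as an integer and pFloat is a float pointer that holds the address of n. The same memory is printed once as an integer and once as a floating-point number, first with the integer 9 stored in it and then again after 9.0 has been written through the pointer.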

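The output is as follows (on a typical platform where int is 32 bits and float is IEEE 754 single precision):

The value of n is: 9
The value of *pFloat is: 0.000000
The value of n is: 1091567616
The value of *pFloat is: 9.000000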

n and *pFloat occupy the same bytes in memory, so why do the integer and floating-point interpretations give such different results?

To understand this result, you must understand how floating-point numbers are represented internally in the computer.

Representation of floating-point numbers

According to the international standard IEEE (Institute of Electrical and Electronics Engineers) 754, any binary floating-point number V can be represented in the following form:

(-1)^S*M*2^E

(-1)^S is the sign, where S is the sign bit. When S=0, V is positive; when S=1, V is negative.

M is the significand, which is greater than or equal to 1 and less than 2.

2^E is the exponent part, where E is the exponent.

For example:

Decimal 5.0, written in binary is 101.0, in scientific notation: 1.01*2^2 (analogous to decimal scientific notation: 20000=2*10^4);

Since 5.0 is positive, S=0; from the above, M=1.01 and E=2.

So, written in the floating-point form, it is: (-1)^0*1.01*2^2.

Floating-point storage model

IEEE 754 stipulates:

For a 32-bit floating-point number, the highest bit is the sign bit S, the next 8 bits are the exponent E, and the remaining 23 bits are the significand M.

For a 64-bit floating-point number, the highest bit is the sign bit S, the next 11 bits are the exponent E, and the remaining 52 bits are the significand M.

Significand M

Earlier we gave the range of M as 1≤M<2. Why this range? The leading nonzero digit of any nonzero binary number is 1, so once the number is written in scientific notation, M always has the form 1.xxxxx, where xxxxx is the fractional part.

IEEE 754 stipulates that when M is stored, the leading digit is assumed to be 1, so it can be dropped and only the xxxxx part is saved. For example, when saving 1.01, only 01 is saved, and the leading 1 is put back when the number is read. The purpose is to gain one extra significant bit. Taking a 32-bit floating-point number as an example, only 23 bits are available for M, but because the leading 1 is not stored, 24 significant bits are effectively kept.

Exponent E

The exponent E is more complicated.

First, E is stored as an unsigned integer (unsigned int).

This means that if E is 8 bits, its range is 0~255; if E is 11 bits, its range is 0~2047. But the exponent in scientific notation can be negative, so IEEE 754 stipulates that a bias (a middle value) is added to the true value of E before it is stored in memory. For an 8-bit E the bias is 127; for an 11-bit E the bias is 1023.

For example, the E of 2^10 is 10, so in a 32-bit floating-point number it is stored as 10+127=137, which is 10001001 in binary.

When the exponent E is read back from memory, there are three cases:

(1) E is neither all 0s nor all 1s

In this case the floating-point number follows the normal rule: subtract 127 (or 1023) from the stored value of E to get the true exponent, and put the leading 1 back in front of the significand M.

For example:

The binary form of 0.5 (1/2) is 0.1. Since the integer part must be 1, the binary point is moved one place to the right, giving 1.0*2^(-1). The stored exponent is -1+127=126, which is 01111110 in binary. The significand 1.0 without its integer part is 0, which is padded with zeros to 23 bits: 00000000000000000000000. The binary representation is therefore:

0 01111110 00000000000000000000000

(2) E is all 0s

In this case the true exponent is 1-127 (or 1-1023), and the leading 1 is no longer put back in front of the significand M; M is read as a fraction of the form 0.xxxxx. This makes it possible to represent ±0 and very small numbers close to 0.

(3) E is all 1s

In this case, if the bits of the significand M are all 0, the value is ±infinity (positive or negative depending on the sign bit S); if they are not all 0, the value is NaN (not a number).

Example explanation

(1) printf("The value of n is: %d\n",n);

n is stored as an integer, and its 32 bits in memory are:

00000000 00000000 00000000 00001001

It is printed as an integer, so the output is still 9.

(2) printf("The value of *pFloat is: %f\n",*pFloat);

*pFloat refers to the same bytes as n. The value was stored as an integer, but it is now read as a floating-point number. That is, the stored bits are split as 0 00000000 00000000000000000001001: the first bit is S, so S=0; E is all 0s; and the rest is M, that is, M=00000000000000000001001. Because E is all 0s, this falls under case (2): the true exponent is 1-127=-126 and M is read as 0.00000000000000000001001, so V=(-1)^0*0.00000000000000000001001*2^(-126)=1.001*2^(-146). This is a positive number extremely close to 0, so with %f, which shows 6 decimal places, it prints as 0.000000.

(3) printf("The value of n is: %d\n",n);

At this point *pFloat=9.0 has been executed, that is, 9.0 has been stored as a floating-point number and is now read back as an integer.

Following the rules above, we can write out how 9.0 is stored:

Since it is positive, S=0;

9 in binary is 1001, which in scientific notation is 1.001*2^3;

∴E=3 (this is the first case, neither all 0s nor all 1s). 127 is added when storing, so the stored value of E is 3+127=130, which is 1000 0010 in binary.

The fractional part of M, 001, is padded with zeros to 23 bits: 001 0000 0000 0000 0000 0000

So the floating-point number is stored in memory as:

0 10000010 00100000000000000000000

It is printed as an integer, so 0 10000010 00100000000000000000000 is treated as the two's complement of the number to print. Because the highest bit is 0, the number is positive, and for a positive number the sign-magnitude, ones' complement, and two's complement forms are identical. The bits can therefore be converted to decimal directly, which gives 1091567616.

(4) printf("The value of *pFloat is: %f\n",*pFloat);

The value was stored as a floating-point number and is printed as a floating-point number, so the output is 9.000000; %f shows 6 decimal places by default.