Preface: When writing code, we often store a value as an integer and print it as a floating-point number, or store it as a floating-point number and print it as an integer. The results are often unexpected. Why does the value change, and what causes the change? Let us start from how these values are stored in memory and work through the details.
representation of floating point numbers
Let's illustrate everything with an example
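```c
#include <stdio.h>

int main()
{
    int n = 9;
    float *pFloat = (float *)&n;   /* view the bytes of n through a float pointer */
    printf("The value of n is: %d\n", n);
    printf("The value of *pFloat is: %f\n", *pFloat);

    *pFloat = 9.0;                 /* store 9.0 as a float in the same bytes */
    printf("The value of n is: %d\n", n);
    printf("The value of *pFloat is: %f\n", *pFloat);
    return 0;
}
```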
In this example, n is defined as an integer with the value 9, and a float pointer pFloat is set to the address of n. The program prints n as an integer and *pFloat as a floating-point number, then stores 9.0 through the pointer and prints both again.
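The output is as follows (on a typical platform where int and float are both 32 bits and float follows IEEE 754):
The value of n is: 9
The value of *pFloat is: 0.000000
The value of n is: 1091567616
The value of *pFloat is: 9.000000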
n and *pFloat occupy the same bytes in memory, so why are the results so different when those bytes are interpreted as a floating-point number and as an integer?
To understand this result, you must understand how floating-point numbers are represented internally in the computer.
representation of floating point numbers
According to the international standard IEEE 754 (IEEE: Institute of Electrical and Electronics Engineers), any binary floating-point number V can be written in the following form:
- V = (-1)^S*M*2^E
- (-1)^S is the sign. When S=0, V is a positive number; when S=1, V is a negative number.
- M is the significand, greater than or equal to 1 and less than 2.
- 2^E is the exponent part; E is the exponent.
for example:
Decimal 5.0, written in binary is 101.0, in scientific notation: 1.01*2^2 (analogous to decimal scientific notation: 20000=2*10^4);
Since 5.0 is a positive number, S=0; from the above, M=1.01 and E=2.
So 5.0 written in this form is: (-1)^0*1.01*2^2.
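The normalization used in this example can be sketched in a few lines of C. This is only an illustration: the `struct parts` type and the `decompose` function are hypothetical names introduced here, not part of the original program, and the sketch assumes the input is finite.

```c
/* Hypothetical helper (not from the original example): splits a finite value v
   into S, M and E so that v = (-1)^S * M * 2^E with 1 <= M < 2. */
struct parts {
    int s;      /* sign: 0 for positive, 1 for negative */
    double m;   /* significand M                         */
    int e;      /* exponent E                            */
};

struct parts decompose(double v)
{
    struct parts p = { 0, v, 0 };
    if (v == 0.0)
        return p;                                /* zero cannot be normalized */
    if (p.m < 0) { p.s = 1; p.m = -p.m; }        /* record the sign, keep |v| */
    while (p.m >= 2.0) { p.m /= 2.0; p.e++; }    /* shift right until M < 2   */
    while (p.m < 1.0)  { p.m *= 2.0; p.e--; }    /* shift left until M >= 1   */
    return p;
}
```

Calling `decompose(5.0)` gives S = 0, M = 1.25 (binary 1.01) and E = 2, matching the hand calculation.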
floating point storage model
IEEE754 stipulates:
For a 32-bit floating-point number, the highest bit is the sign bit S, the next 8 bits are the exponent E, and the remaining 23 bits are the significand M.
For a 64-bit floating-point number, the highest bit is the sign bit S, the next 11 bits are the exponent E, and the remaining 52 bits are the significand M.
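To look at the three fields of a 32-bit value directly, one option is to copy the bytes of a float into a `uint32_t` and mask them. The function names below are made up for this sketch, and it assumes that `float` is a 32-bit IEEE 754 type on the target platform.

```c
#include <stdint.h>
#include <string.h>

/* Copy the bytes of a float into a 32-bit integer.
   Assumption: float is 32 bits wide and uses the IEEE 754 layout. */
uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Highest bit: sign S. */
uint32_t sign_field(float f)     { return float_bits(f) >> 31; }

/* Next 8 bits: exponent field, as stored in memory. */
uint32_t exponent_field(float f) { return (float_bits(f) >> 23) & 0xFFu; }

/* Lowest 23 bits: stored part of the significand M. */
uint32_t fraction_field(float f) { return float_bits(f) & 0x7FFFFFu; }
```

For 5.0f these return 0, 129 and 0x200000. The exponent field is 129 rather than 2 because of the stored offset that this article explains in the section on E.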
Significand M
Earlier we gave the range of M as 1≤M<2. Why this range? The leading non-zero digit of a binary number is always 1, so after the number is written in scientific notation, M always has the form 1.xxxxx, where xxxxx is the fractional part.
IEEE 754 stipulates that when M is stored inside the computer, the first digit is always 1 by default, so it can be dropped and only the xxxxx part is saved. For example, when saving 1.01, only 01 is saved, and the leading 1 is added back when the value is read. The purpose is to gain 1 significant bit. Taking a 32-bit floating-point number as an example, only 23 bits are available for M, but by dropping the leading 1 they effectively hold 24 significant bits.
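Adding the leading 1 back can be written as a one-line calculation. The helper name `significand_from_fraction` is hypothetical, and the sketch only applies to the 32-bit format when the exponent field is neither all 0s nor all 1s.

```c
#include <stdint.h>

/* Rebuild M from the 23 stored fraction bits of a 32-bit float.
   The leading 1 is not stored, so it is added back here. */
double significand_from_fraction(uint32_t fraction)
{
    /* 8388608 = 2^23: the stored bits are the digits after the binary point */
    return 1.0 + (double)(fraction & 0x7FFFFFu) / 8388608.0;
}
```

With the stored bits of 1.01 (01 followed by 21 zeros, i.e. 0x200000) this returns 1.25, which is 1.01 in binary.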
Exponent E
As for the exponent E, the situation is more complicated.
First, E is an unsigned integer (unsigned int).
This means that if E has 8 bits, its range is 0~255; if E has 11 bits, its range is 0~2047. But the exponent in scientific notation can be negative, so IEEE 754 stipulates that a middle value (the bias) must be added to the real value of E before it is stored in memory. For 8-bit E, this middle value is 127; for 11-bit E, it is 1023.
For example: E of 2^10 is 10, so when it is stored as a 32-bit floating-point number, it must be stored as 10+127=137, which is 10001001 in binary.
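The offset arithmetic is small enough to write down directly. The names `BIAS32`, `BIAS64`, `stored_exponent32` and `real_exponent32` are illustrative only, and the helpers assume an ordinary (normalized) 32-bit value whose real exponent lies between -126 and 127.

```c
/* Middle values added to the real exponent before it is stored. */
enum { BIAS32 = 127, BIAS64 = 1023 };

/* Real exponent -> value stored in the 8-bit exponent field. */
unsigned stored_exponent32(int e)
{
    return (unsigned)(e + BIAS32);
}

/* Value read from the 8-bit exponent field -> real exponent. */
int real_exponent32(unsigned stored)
{
    return (int)stored - BIAS32;
}
```

`stored_exponent32(10)` returns 137, and `real_exponent32(137)` returns 10 again.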
When the exponent E is read back from memory, there are three cases:
(1) E is neither all 0s nor all 1s
In this case, the floating-point number is decoded by the following rule: 127 (or 1023) is subtracted from the stored value of E to obtain the real exponent, and the leading 1 is put back in front of the significand M.
for example:
The binary form of 0.5 (1/2) is 0.1. Since the integer part must be 1, the binary point is shifted one place to the right, giving 1.0*2^(-1). The stored exponent is -1+127=126, represented as 01111110. The significand 1.0 with its integer part dropped is 0, padded with 0s to 23 bits: 00000000000000000000000. The binary representation is therefore:
0 01111110 00000000000000000000000
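This bit pattern can be checked by assembling the three fields and reading the result back as a float. `pack_float` and `float_from_bits` are hypothetical helpers written for this sketch, which assumes a 32-bit IEEE 754 `float`.

```c
#include <stdint.h>
#include <string.h>

/* Assemble sign (1 bit), exponent field (8 bits) and fraction (23 bits). */
uint32_t pack_float(uint32_t s, uint32_t exp_field, uint32_t fraction)
{
    return ((s & 1u) << 31) | ((exp_field & 0xFFu) << 23) | (fraction & 0x7FFFFFu);
}

/* Reinterpret a 32-bit pattern as a float by copying its bytes.
   Assumption: float is 32 bits wide and uses the IEEE 754 layout. */
float float_from_bits(uint32_t u)
{
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}
```

`pack_float(0, 126, 0)` yields 0x3F000000, and `float_from_bits` turns that pattern back into 0.5.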
(2) E is all 0
In this case, the real exponent of the floating-point number is 1-127 (or 1-1023), and the leading 1 is no longer added to the significand M; M is read as a fraction of the form 0.xxxxx. This makes it possible to represent ±0 and very small numbers close to 0.
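For the 32-bit format, the rule for an all-zero exponent field means the value is 0.xxxxx * 2^(-126), which equals the 23 stored bits taken as an integer times 2^(-149). The sketch below computes that and compares it with what the hardware reports; both function names are invented for the illustration, and a 32-bit IEEE 754 `float` is assumed.

```c
#include <stdint.h>
#include <string.h>

/* Value of a 32-bit pattern whose exponent field is all 0:
   0.fraction * 2^(1-127) = fraction * 2^(-149). */
double subnormal_value(uint32_t fraction)
{
    double v = (double)(fraction & 0x7FFFFFu);
    for (int i = 0; i < 149; i++)
        v /= 2.0;              /* 149 halvings; each one is exact in double */
    return v;
}

/* Reinterpret a 32-bit pattern as a float by copying its bytes. */
float float_from_bits(uint32_t u)
{
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}
```

For the pattern with fraction bits equal to 9, both functions give the same tiny positive number, on the order of 1e-44.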
(3) E is all 1
In this case, if the bits of the significand M are all 0, the value is ± infinity (positive or negative depending on the sign bit S); if the bits of M are not all 0, the value is not a number (NaN).
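The three cases can be summarized as a small classifier over a 32-bit pattern. `classify_bits` is a hypothetical function written for this sketch. It also labels the all-1s exponent with a non-zero M as NaN, which is what IEEE 754 specifies for that combination.

```c
#include <stdint.h>

/* Classify a 32-bit floating-point bit pattern by its exponent field. */
const char *classify_bits(uint32_t u)
{
    uint32_t e = (u >> 23) & 0xFFu;   /* 8-bit exponent field  */
    uint32_t m = u & 0x7FFFFFu;       /* 23 stored bits of M   */

    if (e == 0xFFu)                   /* case (3): E is all 1s */
        return m == 0 ? "infinity" : "NaN";
    if (e == 0)                       /* case (2): E is all 0s */
        return m == 0 ? "zero" : "subnormal";
    return "normal";                  /* case (1)              */
}
```

For example, 0x7F800000 is classified as infinity, 0x00000009 as subnormal, and 0x3F000000 (the pattern for 0.5) as normal.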
Example explanation
(1) printf("The value of n is: %d\n",n);
n is stored as an integer, and its 32 bits in memory are: 0000 0000 0000 0000 0000 0000 0000 1001.
It is printed as an integer, so the output is still 9.
(2) printf("The value of *pFloat is: %f\n",*pFloat);
*pFloat refers to the same bytes as n. They were stored as an integer, but they are now read as a floating-point number, so the stored pattern is split as 0 00000000 00000000000000000001001: the first bit is S, that is, S=0; the next 8 bits are E, all 0; and the remaining 23 bits are M, that is, M=00000000000000000001001.
Since E is all 0, this is case (2): V=(-1)^0*0.00000000000000000001001*2^(-126)=1.001*2^(-146). This is a very small positive number close to 0, so printed with %f, which keeps 6 decimal places, the output is 0.000000.
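The same reading can be reproduced without a pointer cast by copying the bytes with `memcpy`. This is a sketch, not the original program: `int_bits_as_float` and `format_percent_f` are made-up names, and it assumes that `int32_t` and `float` are both 32 bits wide with an IEEE 754 `float`.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Read the bytes of a 32-bit integer as if they were a float. */
float int_bits_as_float(int32_t n)
{
    float f;
    memcpy(&f, &n, sizeof f);
    return f;
}

/* Format the reinterpreted value the way printf("%f") would. */
void format_percent_f(int32_t n, char *buf, size_t len)
{
    snprintf(buf, len, "%f", int_bits_as_float(n));
}
```

For n = 9 the reinterpreted value is greater than 0 but far smaller than 0.0000005, so the text produced by `%f` is 0.000000. Printing it with `%e` instead would show the non-zero value.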
(3) printf("The value of n is: %d\n",n);
At this point, *pFloat = 9.0 has been executed, so 9.0 is stored in memory as a floating-point number and then read out as an integer.
According to the previous, we write the storage of 9.0:
Since it is a positive number, S=0;
9 is converted to binary as 1001, scientific notation: 1.001*2^3;
∴E=3 (the first case, neither all 0s nor all 1s). 127 is added when storing, so the stored value of E is 3+127=130, which is 1000 0010 in binary.
M drops the leading 1 and is padded with 0s to 23 bits, namely: 001 0000 0000 0000 0000 0000
So the floating-point number is stored in memory as: 0 10000010 00100000000000000000000
The value is then printed as an integer, so 0 10000010 00100000000000000000000 is treated as the two's complement representation of an int. The highest bit is 0, so the number is positive, and its sign-magnitude, ones' complement and two's complement forms are the same. Converted to decimal, it is 1091567616.
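The integer obtained from the stored bits of 9.0 can be checked by copying the float's bytes into an integer. `float_bits_as_int` is a hypothetical helper for this sketch, which assumes that `int32_t` and `float` are both 32 bits wide and that `float` is an IEEE 754 type.

```c
#include <stdint.h>
#include <string.h>

/* Read the bytes of a float as if they were a 32-bit signed integer. */
int32_t float_bits_as_int(float f)
{
    int32_t n;
    memcpy(&n, &f, sizeof n);
    return n;
}
```

`float_bits_as_int(9.0f)` returns 0x41100000, the pattern 0 10000010 00100000000000000000000, which is 1091567616 in decimal.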
(4) printf("The value of *pFloat is: %f\n",*pFloat);
The value is stored as a floating-point number and printed as a floating-point number, so the output is 9.000000; %f prints 6 decimal places by default.