[C Language] Floating point storage in memory

We know that integers are stored in memory in two's complement form. So how are floating-point numbers stored in memory?

Let's study the question with an example:

#include <stdio.h>

int main()
{
	int n = 9;
	float* pFloat = (float*)&n;
	printf("Value of n: %d\n", n);
	printf("Value of *pFloat: %f\n", *pFloat);
	*pFloat = 9.0;
	printf("Value of n: %d\n", n);
	printf("Value of *pFloat: %f\n", *pFloat);
	return 0;
}

Try to work it out yourself first: what does the program print?

Result of running it (the four printed values, in order):

9
0.000000
1091567616
9.000000

Storage rules for floating point numbers

n and *pFloat clearly occupy the same bytes in memory, so why are the printed results so different?

According to the international standard IEEE (Institute of Electrical and Electronics Engineers) 754, any binary floating-point number V can be expressed in the following form:
  • V = (-1)^S * M * 2^E
  • (-1)^S is the sign: when S=0, V is positive; when S=1, V is negative.
  • M is the significand (the significant digits), greater than or equal to 1 and less than 2.
  • 2^E is the exponent part.

This may look confusing at first, but it is actually simple: (-1)^S * M * 2^E is just a formula in which ^ means "to the power of", so 2^E is 2 raised to the power E. For example:

Decimal 5.0 is 101.0 in binary. Moving the binary point two places to the left gives 1.01*2^2, so in the form above it is (-1)^0 * 1.01 * 2^2, that is, S=0, M=1.01, E=2.
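As a rough check of this decomposition, the sketch below uses the standard library function frexp to split a value into S, M, and E. The Decomposed struct and the decompose function are names made up for this illustration, and the code assumes a finite, nonzero input.

```c
#include <assert.h>
#include <math.h>

/* Hypothetical helper type: V = (-1)^S * M * 2^E with 1 <= M < 2. */
typedef struct {
    int S;
    double M;
    int E;
} Decomposed;

Decomposed decompose(double v)
{
    Decomposed d;
    int e;
    double m;

    assert(v != 0.0 && isfinite(v));  /* assumption: finite, nonzero input */

    m = frexp(fabs(v), &e);  /* fabs(v) == m * 2^e with 0.5 <= m < 1 */
    d.S = signbit(v) ? 1 : 0;
    d.M = m * 2.0;           /* rescale so that 1 <= M < 2 */
    d.E = e - 1;             /* compensate for the rescaling */
    return d;
}
```

For 5.0 this gives S=0, M=1.25 (which is 1.01 in binary), and E=2, matching the hand calculation.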

IEEE 754 states:

For 32-bit floating-point numbers, the highest bit is the sign bit S, the next 8 bits are the exponent E, and the remaining 23 bits are the significand M.

For 64-bit floating-point numbers, the highest bit is the sign bit S, the next 11 bits are the exponent E, and the remaining 52 bits are the significand M.
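The 32-bit layout can be inspected with a few shifts and masks. This is only a sketch: FloatFields and split_float are hypothetical names, and the code assumes that float is the 32-bit IEEE 754 format on the platform in use.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper type: the three raw fields of a 32-bit float. */
typedef struct {
    uint32_t sign;      /* S: the highest bit */
    uint32_t exponent;  /* the next 8 bits, exactly as stored */
    uint32_t fraction;  /* the lowest 23 bits, exactly as stored */
} FloatFields;

FloatFields split_float(float f)
{
    uint32_t bits;
    FloatFields out;

    /* Assumption: float is the 32-bit IEEE 754 format on this platform. */
    assert(sizeof f == sizeof bits);
    memcpy(&bits, &f, sizeof bits);  /* copy the bytes instead of casting pointers */

    out.sign     = bits >> 31;
    out.exponent = (bits >> 23) & 0xFFu;
    out.fraction = bits & 0x7FFFFFu;
    return out;
}
```

For 5.0f the sign field is 0 and the fraction field is 0x200000 (01 followed by 21 zeros). The exponent field comes back as 129 rather than 2, and the leading 1 of M is missing; the special rules that follow account for both.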

IEEE 754 also has some special rules for the significand M and the exponent E.

As mentioned earlier, 1≤M<2, so M can always be written in the form 1.xxxxxx, where xxxxxx is the fractional part.

IEEE 754 stipulates that, because the first digit of M is always 1, it is not stored; only the xxxxxx part is saved. For example, when saving 1.01, only 01 is stored, and the leading 1 is put back when the number is read. This gains one bit of precision: in a 32-bit floating-point number only 23 bits are available for M, but with the leading 1 left implicit they effectively hold 24 significant bits.
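Putting the leading 1 back can be written in one line. The function name restore_significand is made up for this sketch, which covers only the ordinary case where the leading digit really is 1.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: rebuild M (1 <= M < 2) from the 23 stored bits
   by putting back the leading 1 that was dropped when saving. */
double restore_significand(uint32_t fraction)
{
    assert(fraction <= 0x7FFFFFu);              /* only 23 bits are stored */
    return 1.0 + (double)fraction / 8388608.0;  /* 8388608 = 2^23 */
}
```

For the stored bits 01 followed by 21 zeros (0x200000), the result is 1.25, which is 1.01 in binary.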

The exponent E is more complicated.

First, E is stored as an unsigned integer (unsigned int).

This means that if E has 8 bits, its range is 0~255; if E has 11 bits, its range is 0~2047. However, the exponent in scientific notation can be negative, so IEEE 754 stipulates that a bias (an intermediate number) must be added to the real value of E before it is stored. For an 8-bit E the bias is 127; for an 11-bit E it is 1023. For example, the E of 2^10 is 10, so in a 32-bit floating-point number it is stored as 10+127=137, which is 10001001 in binary.
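The 2^10 example can be checked against a real float. In this sketch, stored_exponent and exponent_field are hypothetical helper names, and float is assumed to be the 32-bit IEEE 754 format.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* The two intermediate numbers named in the text. */
enum { FLOAT_BIAS = 127, DOUBLE_BIAS = 1023 };

/* Real exponent -> value stored in the 8-bit field of a 32-bit float. */
uint32_t stored_exponent(int real_e)
{
    /* Ordinary floats only; stored values 0 and 255 are reserved for special cases. */
    assert(real_e >= -126 && real_e <= 127);
    return (uint32_t)(real_e + FLOAT_BIAS);
}

/* Read the 8-bit exponent field straight out of a float, for comparison. */
uint32_t exponent_field(float f)
{
    uint32_t bits;
    assert(sizeof f == sizeof bits);  /* assumes a 32-bit IEEE 754 float */
    memcpy(&bits, &f, sizeof bits);
    return (bits >> 23) & 0xFFu;
}
```

Both stored_exponent(10) and exponent_field(1024.0f) give 137, since 1024 is 2^10.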

When E is read back from memory, there are three cases:

E is neither all 0s nor all 1s

In this case, 127 (or 1023) is subtracted from the stored value of E to obtain the real exponent, and the leading 1 is put back in front of the stored significand M.

For example:

The binary form of 0.5 is 0.1. Since the integer part must be 1, the binary point is moved one place to the right, giving 1.0*2^(-1). The stored exponent is -1+127=126, which is 01111110 in binary. The significand 1.0 with its integer part removed is 0, padded to 23 bits as 00000000000000000000000. The binary representation is therefore:

0 01111110 00000000000000000000000
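To confirm this bit pattern, the following sketch assembles a float from the three fields and also decodes the fields by hand using the rule for this case. The names make_float and normal_value are hypothetical, and a 32-bit IEEE 754 float is assumed.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Assemble a 32-bit float from its three stored fields. */
float make_float(uint32_t sign, uint32_t exponent, uint32_t fraction)
{
    uint32_t bits = ((sign & 1u) << 31)
                  | ((exponent & 0xFFu) << 23)
                  | (fraction & 0x7FFFFFu);
    float f;

    assert(sizeof f == sizeof bits);  /* assumes a 32-bit IEEE 754 float */
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Decode the same fields by hand, for the case where the exponent
   field is neither all 0s nor all 1s. */
double normal_value(uint32_t sign, uint32_t exponent, uint32_t fraction)
{
    double m, v;

    assert(exponent >= 1 && exponent <= 254);
    m = 1.0 + (double)(fraction & 0x7FFFFFu) / 8388608.0;  /* restore the leading 1 */
    v = ldexp(m, (int)exponent - 127);                     /* m * 2^(stored E - 127) */
    return (sign & 1u) ? -v : v;
}
```

With sign 0, exponent field 126 (01111110), and fraction 0, both functions yield 0.5.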

E is all 0s

In this case the real exponent is 1-127 (or 1-1023), and the leading 1 is no longer added to the significand M; M is read as a fraction of the form 0.xxxxxx. This makes it possible to represent ±0 and very small numbers close to 0.
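A small sketch of this rule for 32-bit floats, using the standard ldexp function. The name all_zero_exponent_value is made up for the illustration.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>

/* Hypothetical helper: value of a 32-bit float whose exponent field is all 0s. */
double all_zero_exponent_value(uint32_t sign, uint32_t fraction)
{
    double magnitude;

    assert(fraction <= 0x7FFFFFu);  /* only 23 bits are stored */

    /* 0.xxxxxx * 2^(1-127) = (fraction / 2^23) * 2^-126 = fraction * 2^-149 */
    magnitude = ldexp((double)fraction, -149);
    return (sign & 1u) ? -magnitude : magnitude;
}
```

A fraction of 0 gives +0 or -0 depending on the sign bit, and even the largest fraction (all 23 bits set) stays below 2^-126, the smallest value of the previous case.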

E is all 1s

In this case, if the stored bits of M are all 0, the value is ± infinity (the sign is given by the sign bit S); if they are not all 0, the value is not a number (NaN).
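This case can be checked with the standard isinf and isnan macros. In the sketch below, float_from_bits and is_infinity_pattern are hypothetical helper names, and float is assumed to be the 32-bit IEEE 754 format. In IEEE 754, an all-1s exponent with a nonzero fraction is NaN rather than infinity.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a 32-bit pattern as a float. */
float float_from_bits(uint32_t bits)
{
    float f;
    assert(sizeof f == sizeof bits);  /* assumes a 32-bit IEEE 754 float */
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Returns 1 if the exponent field is all 1s and the fraction is all 0s. */
int is_infinity_pattern(uint32_t bits)
{
    return ((bits >> 23) & 0xFFu) == 0xFFu && (bits & 0x7FFFFFu) == 0;
}
```

The pattern 0x7F800000 (S=0, E all 1s, M all 0s) is +infinity, and 0xFF800000 is -infinity.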

Well, that's all for the representation rules of floating point numbers.

Explaining the previous example

#include <stdio.h>

int main()
{
	int n = 9;
	float* pFloat = (float*)&n;
	printf("Value of n: %d\n", n);	// 9
	// The value of n has not been changed, so it is still 9.

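	printf("Value of *pFloat: %f\n", *pFloat);	// 0.000000
	// Two's complement of the integer 9 in memory: 00000000 00000000 00000000 00001001
	// *pFloat is a single-precision float, so the same bits are read as a float:
	// 0 00000000 00000000000000000001001
	// S=0, E=00000000, M=00000000000000000001001
	// E is all 0s, so by the rules above the real exponent is 1-127=-126
	// and the real value of M is 0.00000000000000000001001
	// Applying the formula: (-1)^0 * 0.00000000000000000001001 * 2^-126
	// This is clearly a very small number, and %f prints only 6 digits after
	// the decimal point, so the output is 0.000000.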

	*pFloat = 9.0;
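	printf("Value of n: %d\n", n);	// 1091567616
	// Through the pointer, the memory of n has been overwritten with the float 9.0.
	// 9.0 in binary is 1001.0; moving the binary point 3 places to the left gives 1.001 * 2^3.
	// 9.0 is positive, so it becomes (-1)^0 * 1.001 * 2^3
	// S=0, stored E=3+127=130, M=1.001
	// So the sign bit S is 0, the stored exponent is 130, which is 10000010 in binary,
	// and the stored part of M is 001 followed by 20 zeros to fill 23 bits.
	// Written in binary as S+E+M, that is:
	// 0 10000010 00100000000000000000000
	// Read as an integer, this bit pattern is exactly 1091567616.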

	printf("Value of *pFloat: %f\n", *pFloat);	// 9.000000
	// Read back as a float with %f, the value is simply 9.000000.
	return 0;
}
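The pointer cast in the example is a common way to demonstrate this, but reading an int object through a float* is undefined behavior under C's aliasing rules, so a compiler is allowed to produce different results. A minimal sketch of the same experiment using memcpy is given below; int_bits_as_float and float_bits_as_int are hypothetical names, and the code assumes that int and float are both 32 bits wide.

```c
#include <assert.h>
#include <string.h>

/* Read the bytes of an int as a float (what *pFloat does after the cast). */
float int_bits_as_float(int n)
{
    float f;
    assert(sizeof f == sizeof n);  /* assumes int and float are both 32 bits */
    memcpy(&f, &n, sizeof f);
    return f;
}

/* Read the bytes of a float as an int (what n shows after *pFloat = 9.0). */
int float_bits_as_int(float f)
{
    int n;
    assert(sizeof n == sizeof f);
    memcpy(&n, &f, sizeof n);
    return n;
}
```

On a platform that matches these assumptions, int_bits_as_float(9) is a tiny positive number that %f prints as 0.000000, and float_bits_as_int(9.0f) is 1091567616.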

Origin blog.csdn.net/m0_73156359/article/details/131014063