Storage of floating point numbers in memory (C language)

1. Introduction to floating point numbers

Floating-point numbers are a data type used to store numbers with fractional parts. In the C language, the main floating-point types are float and double (there is also long double). On almost all current platforms, float uses 4 bytes (32 bits) of storage and double uses 8 bytes (64 bits).

2. Storage of floating point numbers in memory

The storage of floating-point numbers in memory is implemented in accordance with the IEEE 754 standard. This standard specifies how to represent floating-point numbers as binary numbers and how to store them in memory.

1. IEEE 754 standard

The IEEE 754 standard defines several formats; the two that matter for float and double are:

Single-precision floating-point numbers (32 bits): 1 sign bit (S), 8 exponent bits (E), and 23 fraction bits (M).
Double-precision floating-point numbers (64 bits): 1 sign bit (S), 11 exponent bits (E), and 52 fraction bits (M).

The sign bit S indicates whether the number is positive (0) or negative (1). The exponent bits E store the power of two from the binary scientific notation of the number. The fraction bits M store the fractional part of the significand, that is, the digits after the binary point.

In the IEEE 754 standard, the exponent E is stored in biased (offset) form. In other words, a fixed offset is added to the real exponent before it is stored, so the offset must be subtracted from the stored value of E to recover the real exponent.

2. Memory storage of single-precision floating-point numbers

For a 32-bit floating-point number, the highest bit is the sign bit S, the next 8 bits are the exponent E, and the remaining 23 bits are the fraction M:

S (1 bit) | E (8 bits) | M (23 bits)

For example, the single-precision floating-point number -3.75 is written in binary as:

-11.11 = -1.111 × 2^1

The sign bit S is 1. The real exponent is 1, so the stored exponent E is 1 + 127 = 128, which is 10000000. The fraction M is 11100000000000000000000 (the leading 1 before the binary point is not stored). Going the other way, subtracting the offset of 127 from the stored exponent gives back the real exponent:

10000000 (128) - 127 = 1

Thus, -3.75 is stored in memory as follows (0xC0700000 in hexadecimal):

1 10000000 11100000000000000000000

3. Memory storage of double-precision floating-point numbers

For a 64-bit floating-point number, the highest bit is the sign bit S, the next 11 bits are the exponent E, and the remaining 52 bits are the fraction M:

S (1 bit) | E (11 bits) | M (52 bits)

For example, the double-precision floating-point number -3.75 is written in binary as:

-11.11 = -1.111 × 2^1

The sign bit S is 1. The real exponent is 1, so the stored exponent E is 1 + 1023 = 1024, which is 10000000000. The fraction M is 111 followed by 49 zeros (the leading 1 before the binary point is not stored). Going the other way, subtracting the offset of 1023 from the stored exponent gives back the real exponent:

10000000000 (1024) - 1023 = 1

Thus, -3.75 is stored in memory as follows (0xC00E000000000000 in hexadecimal):

1 10000000000 1110000000000000000000000000000000000000000000000000

4. IEEE 754 special rules for the significand M and the exponent E

The significand is normalized to the form 1.xxxxxx, so that 1 ≤ M < 2. Because the leading 1 is always present, it is not stored; only the fractional part xxxxxx is saved, which gains one extra bit of precision.

E is stored as an unsigned integer, which means that if E is 8 bits, its value range is 0 ~ 255; if E is 11 bits, its value range is 0 ~ 2047.
However, the exponent in scientific notation can be negative, so IEEE 754 stipulates that a fixed offset (the bias) must be added to the real exponent when it is stored in memory. For an 8-bit E, the offset is 127; for an 11-bit E, the offset is 1023.
For example, the exponent of 2^10 is 10, so when it is saved as a 32-bit floating-point number, the exponent must be stored as 10 + 127 = 137, which is 10001001.

When the exponent E is read back from memory, there are three cases:

E is neither all 0 nor all 1

In this case the floating-point number is a normal number: subtract 127 (or 1023) from the stored value of E to obtain the real exponent, and put the implicit leading 1 back in front of the fraction M.

For example:
The binary form of 0.5 (1/2) is 0.1. Since the integer part must be 1, the binary point is shifted right by one position, giving 1.0 × 2^(-1). The stored exponent is -1 + 127 = 126, which is 01111110. Removing the integer part of the significand 1.0 leaves 0, which is padded to 23 bits as 00000000000000000000000. The binary representation is therefore:

0 01111110 00000000000000000000000

E is all 0

In this case the real exponent is fixed at 1-127 (or 1-1023), and the implicit leading 1 is no longer added to the fraction M; the significand is read as 0.xxxxxx instead. This makes it possible to represent ±0 and very small numbers close to 0.

E is all 1

In this case, if the fraction M is all 0, the value is ± infinity (the sign depends on the sign bit S); if M is not all 0, the value is NaN (not a number).

3. Summary

This article describes how floating-point numbers are stored in memory according to the IEEE 754 standard. Single-precision and double-precision floating-point numbers occupy 4 bytes (32 bits) and 8 bytes (64 bits) respectively, and each is divided into a sign bit S, exponent bits E, and fraction bits M. S gives the sign, E stores the exponent in biased (offset) form, and M stores the fractional part of the significand. Understanding this layout helps C programmers reason about the range and precision of float and double.

Origin blog.csdn.net/ikun10001/article/details/130997722