1 Fixed-point numbers
The position of the decimal point in fixed-point numbers is constant.
1.1 Fixed-point decimals
Numbers in the interval $(0,1)$ are called fixed-point decimals, and the position of the decimal point is implied to the left of the most significant bit of the value. Fixed-point decimals can only represent pure fractions such as $0.5$ or $0.78$; they cannot represent $1.5$.
1.2 Fixed-point integers
Fixed-point integers can only represent pure integers. The decimal point is implied to the right of the least significant bit of the value.
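The two conventions can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and the bit string is assumed to be unsigned (no sign bit).

```python
def as_fixed_point_fraction(bits: str) -> float:
    """Read an unsigned bit string with the point implied before the first bit."""
    return int(bits, 2) / 2 ** len(bits)


def as_fixed_point_integer(bits: str) -> int:
    """Read the same bit string with the point implied after the last bit."""
    return int(bits, 2)


print(as_fixed_point_fraction("10000000"))  # 0.5
print(as_fixed_point_integer("10000000"))   # 128
```

The same eight bits give two different values; only the implied position of the point changes.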
2 Floating-point numbers
How can a value such as $28.625$ be represented in the machine?
Take the following format as an example:
So what do the terms in the figure correspond to?
2.1 Normalization
First, convert $28.625$ to binary.
One trick for binary conversion is the table method: list the powers of 2 and mark the ones that add up to the value.
16 | 8 | 4 | 2 | 1 | 0.5 | 0.25 | 0.125 |
---|---|---|---|---|---|---|---|
1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 |
So $(28.625)_{10}=(11100.101)_2$.
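The conversion can also be done programmatically. The sketch below is one possible approach, not the table method itself: it converts the integer part with `bin` and the fractional part by repeated doubling. The function name and the `max_fraction_bits` limit are assumptions made for this example.

```python
def to_binary(value: float, max_fraction_bits: int = 16) -> str:
    """Convert a non-negative decimal value to a binary string."""
    integer_part = int(value)
    fraction = value - integer_part
    digits = []
    # Each doubling shifts the next fractional bit into the integer position.
    while fraction > 0 and len(digits) < max_fraction_bits:
        fraction *= 2
        bit = int(fraction)
        digits.append(str(bit))
        fraction -= bit
    integer_bits = bin(integer_part)[2:]
    return integer_bits + ("." + "".join(digits) if digits else "")


print(to_binary(28.625))  # 11100.101
```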
Normalization may sound unfamiliar, but everyone knows scientific notation in decimal.
Binary normalization is similar to scientific notation, except that the base is 2:
$$N=M\times2^E$$
So $(11100.101)_2=0.11100101\times2^{101}$, where the exponent $101$ is written in binary (decimal 5).
Now look back at the previous figure, where:
- Number sign: the sign of the whole number; 1 means negative, 0 means positive. In this example it is 0.
- Order symbol: the sign of the exponent. In the example above, the 101 in $2^{101}$ is the exponent, which is positive, so the order symbol is 0.
- Order code: the magnitude of the exponent, which is 101.
- Mantissa: the digits after `0.` in the normalized form; in this example, 11100101.
Assume a 16-bit machine in which the number sign occupies 1 bit, the order symbol 1 bit, the order code 5 bits, and the mantissa 9 bits (padded with trailing 0s to fill the field). Then $(28.625)_{10}$ is written as follows:
Number sign | Order symbol | Order code | Mantissa |
---|---|---|---|
0 | 0 | 00101 | 111001010 |
i.e. `0000101111001010`.
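The packing into this toy 16-bit layout can be sketched as follows. It is a minimal illustration under stated assumptions: the function name is hypothetical, only positive values of at least 1 are handled (so both sign bits are 0), and mantissa bits beyond the 9-bit field are silently dropped.

```python
def encode_toy16(binary: str) -> str:
    """Encode a positive binary string such as '11100.101' in the toy 16-bit layout."""
    integer_bits, _, fraction_bits = binary.partition(".")
    integer_bits = integer_bits.lstrip("0")
    # Normalizing to 0.M x 2^E: E equals the number of integer digits.
    exponent = len(integer_bits)
    # Pad the mantissa with trailing 0s; extra bits are truncated in this sketch.
    mantissa = (integer_bits + fraction_bits).ljust(9, "0")[:9]
    number_sign = "0"  # positive values only
    order_symbol = "0"  # exponent is positive because the value is >= 1
    order_code = format(exponent, "05b")
    return number_sign + order_symbol + order_code + mantissa


print(encode_toy16("11100.101"))  # 0000101111001010
```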
3 Floating-point number representation under the IEEE 754 standard
After the process above, you should have a basic understanding of how a number is converted to floating point.
But under different rules, the resulting floating-point encodings are different.
To unify them, the IEEE 754 standard is now generally adopted. The following figure takes float32 as an example.
3.1 IEEE 754 Standardization
The difference from Section 2.1 is that IEEE 754 normalizes $M$ to the form `1.x`. Taking $28.625$ as the example again:
$$(28.625)_{10}=(11100.101)_2=1.1100101\times2^{100}$$
Here the exponent $100$ is written in binary (decimal 4). The leading `1.` of $1.1100101$ is also hidden, since it is always present.
The remaining bits are placed in the mantissa and padded with trailing 0s, so the mantissa is `11001010000000000000000`.
3.2 Order Code
In IEEE 754, the original code (sign-magnitude form) of the exponent is not what gets stored in the second field.
Instead, the stored value is the exponent plus $2^{k-1}-1$, where $k$ is the number of bits occupied by the order. For float32, $k=8$, so this is $2^{8-1}-1=127$, which is `01111111` in binary. The number added here is called the offset.
For the exponent `100`, adding the order symbol `0` and padding to 8 bits gives the original code `00000100`.
- Original code: `00000100`
- Order code: original code + `01111111` = `10000011`
3.3 Conversion
Number sign | Order symbol + order code | Mantissa |
---|---|---|
0 | 10000011 | 11001010000000000000000 |
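The encoding of $28.625$ can be checked against a real float32. The sketch below uses Python's standard `struct` module to obtain the 32 stored bits and split them into the three fields; the helper name `float32_fields` is an assumption for this example.

```python
import struct


def float32_fields(value: float) -> tuple[str, str, str]:
    """Return the sign, order code and mantissa bits of value stored as float32."""
    (raw,) = struct.unpack(">I", struct.pack(">f", value))
    bits = format(raw, "032b")
    return bits[0], bits[1:9], bits[9:]


sign, order_code, mantissa = float32_fields(28.625)
print(sign, order_code, mantissa)
# 0 10000011 11001010000000000000000

# The order code is the true exponent (4) plus the offset (127).
print(format(4 + 127, "08b"))  # 10000011
```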
4 float32 maximum and minimum values
Following the IEEE 754 standard, we now calculate the maximum and minimum values of float32.
Many articles discuss how to compute the value range of float32, but they either state the answer without derivation or get it wrong, hence the following walkthrough.
4.1 Special floating-point numbers
Before exploring the maximum value of float32, we introduce a few special floating-point forms.
4.1.1 Zero
Under the IEEE standard $M$ has the form `1.x`, so no matter what the mantissa $x$ is, a normalized number cannot represent 0. A value of 0 therefore needs a special encoding:
- +0 = 0 00000000 00000000000000000000000
- -0 = 1 00000000 00000000000000000000000
As shown above, a value of 0 is represented by an order of all 0s and a mantissa of all 0s.
4.1.2 Infinity
If the mantissa is all 0 and the order is all 1, it means infinity.
- +INFINITY = 0 11111111 00000000000000000000000
- -INFINITY = 1 11111111 00000000000000000000000
4.1.3 NaN values
If the order is all 1s and the mantissa is not all 0s, the value is NaN.
All bit patterns in the following ranges are NaN:
0 11111111 00000000000000000000001 ~ 0 11111111 11111111111111111111111
1 11111111 00000000000000000000001 ~ 1 11111111 11111111111111111111111
4.1.4 Subnormal numbers
Take the following value as an example:
$$0.00110001101001\times2^{-126}$$
Under the normalization rule of IEEE 754, it would have to be converted into the following form:
$$1.10001101001\times2^{-129}$$
But `-129` is outside the range that the 8-bit order can represent, so the normalized IEEE 754 form cannot represent such a value.
To be able to represent such extremely small values, the standard defines subnormal values (denormalized values).
A subnormal value has an order of all 0s and a mantissa that is not all 0s, and the value it represents is defined as
$$0.x\times2^{-126}$$
where $x$ is the bit string stored in the mantissa.
4.1.5 Summary
The order here refers to the order symbol + order code.
0 | Infinity | NaN | Subnormal |
---|---|---|---|
The mantissa is all 0, and the order is also all 0 | The mantissa is all 0, and the order is all 1 | The mantissa is not all 0, and the order is all 1 | The mantissa is not all 0, and the order is all 0 |
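The rules for the special forms can be turned into a small classifier. This is a sketch only: the function name and the string-based input format (sign, order, mantissa, optionally separated by spaces) are assumptions chosen to match the notation used in this section.

```python
def classify_float32(bits: str) -> str:
    """Classify a 32-bit pattern written as a string of 0s and 1s (spaces allowed)."""
    bits = bits.replace(" ", "")
    order, mantissa = bits[1:9], bits[9:]
    if order == "0" * 8:
        return "zero" if mantissa == "0" * 23 else "subnormal"
    if order == "1" * 8:
        return "infinity" if mantissa == "0" * 23 else "NaN"
    return "normal"


print(classify_float32("0 " + "0" * 8 + " " + "0" * 23))        # zero
print(classify_float32("1 " + "1" * 8 + " " + "0" * 23))        # infinity
print(classify_float32("0 " + "1" * 8 + " " + "0" * 22 + "1"))  # NaN
print(classify_float32("0 " + "0" * 8 + " " + "0" * 22 + "1"))  # subnormal
```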
4.1 float32 maximum positive value
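The maximum value is not hard to guess: fill the mantissa and the order code with 1s as far as possible.
As shown in the figure below, the leading 0 is the order symbol. Because one bit of the order is taken by the sign, the exponent can reach at most 127. (An order code of all 1s is reserved for infinity and NaN, so the largest stored order code is `11111110`.)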
Before that, here is an interesting observation. Take `12` as an example. After normalization it is
$$(12)_{10} = (1.100\times2^{11})_2$$
At the same time:
$$(1.1)_2=(1.5)_{10}$$
$$(11)_2=(3)_{10}$$
And $(1.5\times2^3)_{10}=12$.
So after normalization, if we convert both $M$ and the exponent from binary to decimal and evaluate the product, the result equals the original decimal value.
Using this property, let's work out the decimal value of the float32 maximum in the figure above.
The mantissa is all 1s, so $M=(1.11\ldots1)_2=2-2^{-23}$, and the exponent `01111111` converts to decimal `127`,
so the maximum value of float32 in decimal is
$$max = (2-2^{-23})\times2^{127}$$
A calculator gives $max \approx 3.4028234663852\times10^{38}$.
The definition of `Float.MAX_VALUE` in the Java documentation confirms that this result is correct.
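The same check can be done without the Java documentation. The sketch below evaluates the formula in Python and compares it with the float32 whose bit pattern is `0 11111110 11111111111111111111111` (hexadecimal `0x7F7FFFFF`), decoded with the standard `struct` module. The variable names are chosen for this example.

```python
import struct

# Value from the derivation: M = 2 - 2^-23, exponent = 127.
max_by_formula = (2 - 2 ** -23) * 2.0 ** 127

# Value decoded from the bit pattern 0 11111110 111...1.
(max_by_bits,) = struct.unpack(">f", struct.pack(">I", 0x7F7FFFFF))

print(max_by_formula)                 # 3.4028234663852886e+38
print(max_by_formula == max_by_bits)  # True
```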
4.2 float32 minimum positive value
4.2.1 float32 minimum normal positive value
The minimum normal positive value of float32 is found by the same reasoning. Here are some examples in binary:
- $0.1\times2^{-1}=0.01$
- $0.1\times2^{2}=10.0$
- $0.01\times2^{-1}=0.001$
To make the positive value $M\times2^{E}$ as small as possible, $M$ must be positive and as small as possible, and $E$ must be negative with as large an absolute value as possible.
In the normal range $M$ has the form `1.x`, so $M$ is smallest when the mantissa $x$ is all 0s.
For the order, the original code of -127, `11111111`, plus the offset 127 gives `00000000`. But `00000000` is reserved for zero and subnormal numbers, so the exponent can only go down to `-126`.
So the minimum positive value of float32 in the normal range is
$$1.00000\ldots000\times2^{-126}=2^{-126}\approx1.175494350822\times10^{-38}$$
This result agrees with the description of `Float.MIN_NORMAL` in the Java documentation.
4.2.2 float32 minimum subnormal positive value
When the order is all 0s and the mantissa is not all 0s, the number is subnormal. The smallest case is a mantissa with only its last bit set:
`0 00000000 00000000000000000000001`
Its value is $0.00\ldots01\times2^{-126}=2^{-23}\times2^{-126}=2^{-149}\approx1.401298464324\times10^{-45}$, which also agrees with `Float.MIN_VALUE` in the Java documentation.
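Both minimum positive values can be verified by decoding their bit patterns. This is a small sketch using Python's `struct` module; the helper name `float32_from_bits` is an assumption for this example.

```python
import struct


def float32_from_bits(raw: int) -> float:
    """Decode a 32-bit integer as an IEEE 754 float32."""
    (value,) = struct.unpack(">f", struct.pack(">I", raw))
    return value


min_normal = float32_from_bits(0x00800000)     # 0 00000001 000...0
min_subnormal = float32_from_bits(0x00000001)  # 0 00000000 000...1

print(min_normal == 2.0 ** -126)     # True
print(min_subnormal == 2.0 ** -149)  # True
```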
4.3 float32 minimum negative value
4.4 float32 maximum negative value
To be continued. You can try to deduce the minimum negative value and the maximum negative value yourself by following the reasoning above.
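One way to check your own answer to this exercise is sketched below. It assumes that "minimum negative value" means the most negative finite float32 and "maximum negative value" means the negative value closest to zero, and that the negative range mirrors the positive one with only the number sign flipped. The helper name is hypothetical.

```python
import struct


def float32_from_bits(raw: int) -> float:
    """Decode a 32-bit integer as an IEEE 754 float32."""
    (value,) = struct.unpack(">f", struct.pack(">I", raw))
    return value


# Same patterns as the positive extremes, with the number sign set to 1.
most_negative = float32_from_bits(0xFF7FFFFF)    # 1 11111110 111...1
closest_to_zero = float32_from_bits(0x80000001)  # 1 00000000 000...1

print(most_negative == -(2 - 2 ** -23) * 2.0 ** 127)  # True
print(closest_to_zero == -(2.0 ** -149))              # True
```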