Fixed-point and floating-point numbers, and finding the extreme values of float32

1 Fixed-point numbers

The position of the decimal point in fixed-point numbers is constant.

1.1 Fixed-point decimals

Numbers in the range $(0,1)$ are called fixed-point decimals. The decimal point is implied to the left of the most significant bit of the value. Fixed-point decimals can only represent pure fractions such as $0.5$ or $0.78$; they cannot represent a value like $1.5$.

1.2 Fixed-point integers

Fixed-point integers can only be used to represent pure integers. The decimal point is implied to the right of the lowest digit of the value.

2 Floating-point numbers

How can a value such as $28.625$ be represented in a machine?
Take the following layout as an example:
[Figure: a floating-point word divided into four fields: number sign, order sign, order code, mantissa]
So what do the field names in the figure correspond to?

2.1 Normalization

First, convert $28.625$ to binary.
A handy trick for the conversion is to list the place values and mark the ones that add up to the number:

16 8 4 2 1 0.5 0.25 0.125
1 1 1 0 0 1 0 1

So $(28.625)_{10}=(11100.101)_2$.
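The list method above can be sketched in a few lines of Python. This is only an illustration: the function name `place_value_digits` is made up here, and the list of weights is the one from the table.

```python
def place_value_digits(value, weights):
    """Walk the place values from largest to smallest and mark each one that fits."""
    digits = []
    remaining = value
    for weight in weights:
        if remaining >= weight:
            digits.append(1)
            remaining -= weight
        else:
            digits.append(0)
    return digits, remaining

weights = [16, 8, 4, 2, 1, 0.5, 0.25, 0.125]
digits, remaining = place_value_digits(28.625, weights)
print(digits)     # [1, 1, 1, 0, 0, 1, 0, 1]
print(remaining)  # 0.0, so the listed place values represent 28.625 exactly
```

A nonzero remainder would mean that more fractional place values are needed.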

Normalization may sound unfamiliar, but it is simply the binary counterpart of decimal scientific notation, with the base replaced by 2:

$N = M \times 2^E$

So $(11100.101)_2 = 0.11100101 \times 2^{101}$, where the exponent $101$ is written in binary (5 in decimal).

Now look back at the previous figure, where

  • Number sign: the sign of the whole number; 1 means negative and 0 means positive. In this example it is 0.
  • Order sign: the sign of the exponent (the order). In the example above the exponent is the 101 in $2^{101}$, which is positive, so the order sign is 0.
  • Order code: the magnitude of the exponent, here 101.
  • Mantissa: the digits after the "0." in the normalized form, here 11100101.

Suppose there is a 16-bit machine in which the number sign occupies 1 bit, the order sign 1 bit, the order code 5 bits, and the mantissa 9 bits (padded with trailing 0s when the value has fewer digits). Then $(28.625)_{10}$ is written as follows:

Number sign | Order sign | Order code | Mantissa
0 | 0 | 00101 | 111001010

i.e. 0000101111001010.
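As a rough sketch, the toy 16-bit layout can be reproduced in Python. The function name `encode_toy16` is hypothetical, and the use of `math.frexp` is a choice made here because it happens to return exactly the $0.M \times 2^E$ form used in this section.

```python
import math

def encode_toy16(value):
    """Encode value as: number sign (1) + order sign (1) + order code (5) + mantissa (9)."""
    number_sign = "1" if value < 0 else "0"
    # frexp returns m, e with abs(value) == m * 2**e and 0.5 <= m < 1 (m is 0.0 for zero)
    m, e = math.frexp(abs(value))
    if abs(e) > 31:
        raise ValueError("exponent does not fit in a 5-bit order code")
    order_sign = "1" if e < 0 else "0"
    order_code = format(abs(e), "05b")
    # Keep the first 9 fraction bits of 0.M; any further bits are truncated
    mantissa = format(int(m * 2**9), "09b")
    return number_sign + order_sign + order_code + mantissa

print(encode_toy16(28.625))  # 0000101111001010
```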

3 Floating-point number representation under the IEEE 754 standard

The process above shows how a number is converted to floating point.

However, different layout rules produce different bit patterns for the same number.

To unify them, the formats defined by the IEEE 754 standard are now used. The following figure takes float32 as an example.
[Figure: float32 layout: 1 number-sign bit, 8 order bits, 23 mantissa bits]

3.1 IEEE 754 normalization

The difference from Section 2.1 is that IEEE 754 normalizes M to the form 1.x. Taking 28.625 as the example again:
$(28.625)_{10}=(11100.101)_2=1.1100101 \times 2^{100}$

The leading "1." of $1.1100101$ is hidden, that is, it is not stored.
The remaining digits go into the mantissa, padded with trailing 0s to 23 bits, so the mantissa = 11001010000000000000000
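The normalization step can be checked with a short Python sketch. The helper name `normalize_ieee` is made up for illustration; it rescales the result of `math.frexp` from the $0.M$ form to the $1.x$ form.

```python
import math

def normalize_ieee(value):
    """Return (M, E) with value == M * 2**E and 1 <= M < 2, for value > 0."""
    m, e = math.frexp(value)  # value == m * 2**e with 0.5 <= m < 1
    return m * 2, e - 1

M, E = normalize_ieee(28.625)
# Drop the hidden leading 1 and keep the 23 fraction bits
fraction_bits = format(int((M - 1) * 2**23), "023b")
print(M, E)           # 1.7890625 4
print(fraction_bits)  # 11001010000000000000000
```

Here `E` is printed in decimal; 4 is the binary exponent 100 from the text.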

3.2 Order Code

IEEE 754 does not store the original code (sign and magnitude) of the exponent in the second field.

Instead it stores the exponent plus $2^{k-1}-1$, where $k$ is the number of bits occupied by the order field. For float32, $k=8$, so this number is $2^{8-1}-1=127$, which is 01111111 in binary. The number added here is called the offset.
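A small sketch of the offset rule, with hypothetical helper names `offset` and `order_field`, applied to an exponent of 4 (the exponent of the 28.625 example):

```python
def offset(k):
    """Offset for an order field that is k bits wide."""
    return 2 ** (k - 1) - 1

def order_field(exponent, k=8):
    """Stored order field: the exponent plus the offset, written in k bits."""
    return format(exponent + offset(k), f"0{k}b")

print(offset(8))                # 127
print(format(offset(8), "08b")) # 01111111
print(order_field(4))           # 10000011
```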

In the example the exponent is 100 in binary (4 in decimal). Prefixing the order sign 0 and padding to 8 bits gives the original code 00000100.

  • Original code: 00000100
  • Order code: original code + 01111111 = 10000011

3.3 Conversion

Number sign | Order sign + order code | Mantissa
0 | 10000011 | 11001010000000000000000
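The result can be cross-checked against a real float32 encoder. The sketch below uses Python's standard `struct` module; the function name `float32_fields` is an assumption made for this example.

```python
import struct

def float32_fields(value):
    """Pack value as a big-endian float32 and split its 32 bits into the three fields."""
    (raw,) = struct.unpack(">I", struct.pack(">f", value))
    bits = format(raw, "032b")
    return bits[0], bits[1:9], bits[9:]

sign, order, mantissa = float32_fields(28.625)
print(sign, order, mantissa)  # 0 10000011 11001010000000000000000
print(int(order, 2) - 127)    # 4, the exponent before the offset was added
```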

4 float32 extreme values

Using the IEEE 754 standard, we now calculate the maximum and minimum values of float32.
Many articles discuss how to calculate the value range of float32, but they either state the answer without a derivation or give a wrong answer. Hence the derivation below.

4.1 Special floating-point numbers

Before exploring the extreme values of float32, we introduce a few special floating-point forms.

4.1.1 Zero value

In the IEEE 754 format, $M = 1.x$, so no matter what the mantissa $x$ is, the value 0 cannot be represented. Zero therefore needs a special encoding:

  • +0 = 0 00000000 00000000000000000000000
  • -0 = 1 00000000 00000000000000000000000

As shown above, a value of 0 is represented by an order of all 0s and a mantissa of all 0s.

4.1.2 Infinity

If the order is all 1s and the mantissa is all 0s, the value is infinity.

  • +INFINITY = 0 11111111 00000000000000000000000
  • -INFINITY = 1 11111111 00000000000000000000000

4.1.3 NaN values

If the order is all 1s and the mantissa is not all 0s, the value is NaN.
Every bit pattern in the following ranges is a NaN:

0 11111111 00000000000000000000001 ~ 0 11111111 11111111111111111111111

1 11111111 00000000000000000000001 ~ 1 11111111 11111111111111111111111
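These special patterns can be decoded with Python's standard `struct` module to confirm how they are interpreted. This is a sketch; the helper name `bits_to_float32` is hypothetical.

```python
import math
import struct

def bits_to_float32(sign, order, mantissa):
    """Interpret sign + order + mantissa (bit strings, 32 bits in total) as a float32."""
    raw = int(sign + order + mantissa, 2)
    return struct.unpack(">f", struct.pack(">I", raw))[0]

pos_zero = bits_to_float32("0", "0" * 8, "0" * 23)
neg_zero = bits_to_float32("1", "0" * 8, "0" * 23)
pos_inf = bits_to_float32("0", "1" * 8, "0" * 23)
neg_inf = bits_to_float32("1", "1" * 8, "0" * 23)
nan_first = bits_to_float32("0", "1" * 8, "0" * 22 + "1")
nan_last = bits_to_float32("1", "1" * 8, "1" * 23)

print(pos_zero == neg_zero)                          # True: +0 and -0 compare equal
print(math.copysign(1.0, neg_zero))                  # -1.0: but the sign bit is kept
print(pos_inf, neg_inf)                              # inf -inf
print(math.isnan(nan_first), math.isnan(nan_last))   # True True
```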

4.1.4 Subnormal numbers

Take the following value as an example:
$0.00110001101001 \times 2^{-126}$

Normalized according to IEEE 754, it would have to be written as:
$1.10001101001 \times 2^{-129}$

But the exponent $-129$ is below $-126$, the smallest exponent available to normalized float32 values, so the normalized form cannot represent such a value.

To represent such extremely small values, IEEE 754 defines subnormal values (denormalized values).

A subnormal value has an order of all 0s and a mantissa that is not all 0s, and the value it represents is defined as
$0.x \times 2^{-126}$
where $x$ is the bit string stored in the mantissa.
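The subnormal rule can be checked against the example value. The sketch below assumes Python's standard `struct` module for decoding; the function name `subnormal_value` is made up for illustration.

```python
import struct

def subnormal_value(mantissa_bits):
    """Evaluate 0.x * 2**-126, where x is the 23-bit mantissa string."""
    return int(mantissa_bits, 2) / 2**23 * 2.0**-126

# 0.00110001101001 * 2**-126, with the mantissa padded to 23 bits
mantissa = "00110001101001" + "0" * 9
by_rule = subnormal_value(mantissa)

# Decode the same pattern (sign 0, order 00000000) as a real float32
raw = int("0" + "0" * 8 + mantissa, 2)
decoded = struct.unpack(">f", struct.pack(">I", raw))[0]

# The normalized form 1.10001101001 * 2**-129 names the same number
normalized = int("110001101001", 2) / 2**11 * 2.0**-129

print(by_rule == decoded == normalized)  # True
```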

4.1.5 Summary

Here "order" refers to the whole 8-bit field (order sign + order code).

Value | Order | Mantissa
0 | all 0 | all 0
Infinity | all 1 | all 0
NaN | all 1 | not all 0
Subnormal | all 0 | not all 0
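The summary can be written as a small classifier. This is a sketch with a hypothetical function name `classify`; it takes the order and mantissa as bit strings and labels everything else "normal".

```python
def classify(order, mantissa):
    """Name the kind of float32 value from its 8-bit order and 23-bit mantissa strings."""
    order_all_zero = "1" not in order
    order_all_one = "0" not in order
    mantissa_all_zero = "1" not in mantissa
    if order_all_zero:
        return "zero" if mantissa_all_zero else "subnormal"
    if order_all_one:
        return "infinity" if mantissa_all_zero else "NaN"
    return "normal"

print(classify("00000000", "0" * 23))         # zero
print(classify("11111111", "0" * 23))         # infinity
print(classify("11111111", "0" * 22 + "1"))   # NaN
print(classify("00000000", "0" * 22 + "1"))   # subnormal
print(classify("10000011", "1100101" + "0" * 16))  # normal
```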


4.1 float32 maximum positive value

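The maximum value is not hard to find: fill the mantissa with 1s and make the exponent as large as possible.

In original-code terms, the order sign is 0 and the order code is all 1s, i.e. 0 1111111, so the largest exponent is 127. Adding the offset 127 gives the stored order 11111110; the order 11111111 is not available because it is reserved for infinity and NaN. The maximum is therefore
0 11111110 11111111111111111111111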

Along the way, here is a useful observation. Take 12 as an example. After normalization it is
$(12)_{10} = (1.100 \times 2^{11})_2$

At the same time:

$(1.1)_2=(1.5)_{10}$

$(11)_2=(3)_{10}$

and $(1.5 \times 2^3)_{10}=12$.

So after normalization, we can convert M and the exponent from binary to decimal separately, and evaluating $M \times 2^E$ with the decimal values gives back the original decimal value.
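The observation can be tried out with a few lines of Python. The helper `binary_to_decimal` is a made-up name for this sketch; it converts a binary literal such as "1.100" digit by digit.

```python
def binary_to_decimal(text):
    """Convert a binary literal such as '1.100' or '11' to a decimal number."""
    whole, _, frac = text.partition(".")
    value = int(whole, 2) if whole else 0
    for position, digit in enumerate(frac, start=1):
        value += int(digit) * 2.0 ** -position
    return value

M = binary_to_decimal("1.100")  # 1.5
E = binary_to_decimal("11")     # 3
print(M * 2 ** E)               # 12.0
```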

Using this property, let's work out the decimal value of the float32 maximum.

The mantissa is 23 ones, so $M = (1.11\ldots1)_2 = 2-2^{-23}$.

The exponent (original code 01111111) converts to decimal 127,
so the maximum value of float32 in decimal is
$max = (2-2^{-23}) \times 2^{127}$

A calculator gives $max = 3.4028234663852 \times 10^{38}$.
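The formula can be checked numerically. The sketch below evaluates it in Python (whose floats are 64-bit, so the arithmetic is exact here) and then packs the result as a float32 with the standard `struct` module to look at the stored bits.

```python
import struct

max_value = (2 - 2**-23) * 2.0**127
print(max_value)  # about 3.4028234663852e+38

# Pack as float32 and show the stored fields
(raw,) = struct.unpack(">I", struct.pack(">f", max_value))
bits = format(raw, "032b")
print(bits[0], bits[1:9], bits[9:])  # 0 11111110 11111111111111111111111
```

The stored order is 11111110 (254), which is the exponent 127 plus the offset 127.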

The definition of Float.MAX_VALUE in the Java documentation confirms that this result is correct.

4.2 float32 minimum positive value

4.2.1 float32 minimum normal positive value

The same reasoning gives the minimum normal positive value of float32. First, a few examples in binary:

  • $0.1 \times 2^{-1}=0.01$
  • $0.1 \times 2^{2}=10.0$
  • $0.01 \times 2^{-1}=0.001$

To make $M \times 2^{E}$ as small as possible while staying positive, $M$ must be positive and as small as possible, and $E$ must be negative with as large an absolute value as possible.

In the normal range $M = 1.x$, so $M$ is smallest when the mantissa $x$ is all 0s.

For the exponent, $-127$ (original code 11111111) plus the offset 127 gives 00000000, but an order of 00000000 is reserved for zero and subnormal numbers, so the exponent can only go down to $-126$.

So the minimum positive value of float32 in the normal range is $1.00000\ldots000 \times 2^{-126}=2^{-126}=1.175494350822 \times 10^{-38}$.
This result is consistent with the description of Float.MIN_NORMAL in the Java documentation.

4.2.2 float32 minimum subnormal positive value

When the order is all 0s and the mantissa is not all 0s, the value is subnormal, and the smallest case is a mantissa whose only 1 is the last bit:
0 00000000 00000000000000000000001
Its value is $0.00\ldots01 \times 2^{-126}=2^{-23} \times 2^{-126}=2^{-149}=1.401298464324 \times 10^{-45}$, which is also consistent with Float.MIN_VALUE in the Java documentation.
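Both minimum positive values (normal and subnormal) can be checked by decoding their bit patterns. This is a sketch using Python's standard `struct` module; the helper name `bits_to_float32` is hypothetical.

```python
import struct

def bits_to_float32(bits):
    """Interpret a 32-character bit string (spaces allowed) as a float32."""
    raw = int(bits.replace(" ", ""), 2)
    return struct.unpack(">f", struct.pack(">I", raw))[0]

# Smallest normal value: order 00000001, mantissa all 0s
min_normal = bits_to_float32("0 00000001 " + "0" * 23)
# Smallest subnormal value: order all 0s, only the last mantissa bit set
min_subnormal = bits_to_float32("0 00000000 " + "0" * 22 + "1")

print(min_normal == 2.0 ** -126)     # True
print(min_subnormal == 2.0 ** -149)  # True
```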

4.3 float32 minimum negative value

4.4 float32 maximum negative value

To be continued. You can try to derive the minimum negative value and the maximum negative value yourself, following the reasoning above.

Origin blog.csdn.net/weixin_43490422/article/details/126782442