IEEE754 standard: 3. Why is it said that the precision of 32-bit floating point numbers is "7 significant digits"

This chapter contains some of my own understanding; please point out any inaccuracies. It is also the most important chapter in the entire series, so please read it patiently.

Regarding how the precision of floating point numbers in the IEEE754 standard is calculated, there are different opinions on the Internet, some of which contradict each other, and they have confused me too... Here I present only the two explanations that I personally find most reliable.

1. Let’s talk about the conclusion first

Open the IEEE754 Wikipedia, and you can see that the precision of single-precision floating point numbers is "Approximately 7 decimal digits"

Some people translate this sentence as "about 7 decimal places", reading "decimal" as referring to the fractional part.

But my personal understanding is that "decimal" here means base ten; that is, the precision of a 32-bit floating point number is "approximately 7 significant decimal digits". I will explain why later.

2. Before discussing...

Let’s first think about this: Can today’s computers store all decimals between [1,2]?

After thinking about it for a moment, you will know the answer: no. The computer's memory and hard disk capacity are finite, while the number of decimals between 1 and 2 is infinite.

Taken to the extreme, the computer cannot even store certain individual decimals between 1 and 2: for a number like 1.00000....(a trillion zeros)....00001, I am afraid it would be difficult to store it in a computer at all...

However, the computer can store all integers in [1, 10000], because integers are "discrete": there are only 10,000 integers in [1, 10000]. 10,000 states can easily be stored in a computer, and calculations can be performed on them; for example, computing 10,000 + 10,000 only requires that your computer can represent 20,000 states...

Looked at this way, computers can perform integer arithmetic in the mathematical sense, but decimal arithmetic in the mathematical sense is hard: it is difficult for today's computers to deal with something as "continuous" as the decimals...

In fact, in order to perform decimal arithmetic, the computer has to treat decimals as "discrete" values, one by one, just like integers:

↑ Integers in mathematics are discrete, one by one; imagine that the green pointer must move one notch at a time

↑ Decimals in mathematics are continuous; imagine that the green pointer can be adjusted infinitely finely and can point anywhere

↑ Decimals stored in the computer are discrete, one by one; the green pointer must move one notch at a time, just like with integers

This raises the issue of precision. For example, in the picture above, we cannot store 0.3 in the computer, because the green pointer can only move one notch at a time: it is either at 0.234 or at 0.468...

Of course, we can also increase the accuracy of the computer in storing decimals, or reduce the distance between points:

The same is generally true for single-precision floating-point numbers and double-precision floating-point numbers in IEEE754: the blue dots in double-precision floating-point numbers are denser...

3. Understanding angle 1: Understanding from the perspective of "interval"

1. Groundwork

Understanding "accuracy" from the perspective of "interval" is actually this way of thinking:

Imagine a circular dial like the one in the picture above. There are blue dots on the dial serving as tick marks, and a green pointer points at them. The green pointer can only move one notch at a time: it can only move from the current blue dot to the next blue dot, and it cannot point to a position between two blue dots.

Suppose the blue dot used to indicate the scale on the dial is as follows:

0.0000

0.0012

0.0024

0.0036

0.0048

0.0060

0.0072

0.0084

0.0096

0.0108

0.0120 (note: the previous number ends in 108 while this one ends in 120; remember this for later)

0.0132

0.0144

...

That is, this is a set of decimal numbers that increases in steps of 0.0012... Assume that this dial contains all the decimals your computer can represent.

Dial diagram


Question: Can we say that the accuracy of this dial, or this set of numbers, reaches 4 decimal digits (for example, can it be accurate to 1 integer + 3 decimal places)?

Analysis: If we can accurately point to 1 integer + 3 decimal places, then we should be able to say the following:

We can say that the current pointer is located at 0.001x: and the pointer can indeed be located at 0.0012, which belongs to 0.001x (x means that this digit can be anything; we place no precision requirement on it)

We can say that the current pointer is at 0.002x: and the pointer could indeed be at 0.0024, belonging to 0.002x

We can say that the current pointer is at 0.003x: and the pointer could indeed be at 0.0036, belonging to 0.003x

...

We can say that the current pointer is at 0.009x: and the pointer could indeed be at 0.0096, belonging to 0.009x

We can say that the current pointer is at 0.010x: and the pointer could indeed be at 0.0108, belonging to 0.010x

We would like to say that the current pointer is at 0.011x... But note that the pointer can never point to 0.011x. On our dial, the pointer can point to 0.0108 or to 0.0120, but never to anything of the form 0.011x.

...

This means: For the current dial (or for this set of numbers), 4-digit precision is too high... Some of the states that can be described by 4-digit precision cannot be represented by this dial.

Then, lower the accuracy.

Can we say that the accuracy of this dial, or this set of numbers, reaches 3 decimal digits (for example, it can be accurate to 1 digit integer + 2 decimal digits)?

Let’s analyze it again: If we can accurately point to 1 integer + 2 decimal places, then we should be able to say the following:

We can say that the current pointer is at 0.00xx: and the pointer can indeed be at 0.0012, 0.0024, 0.0036...0.0096, which all belong to 0.00xx

We can say that the current pointer is at 0.01xx: and the pointer can indeed be at 0.0108, 0.0120... these all belong to 0.01xx

...

It can be seen that the current dial (or this set of numbers) can fully "support" 3-digit precision. In other words, every state that can be described with 3-digit precision can be expressed on this dial.

If our machine uses this dial as the value dial for floating point numbers, then we can say:

The floating point precision of our machine (or the floating point precision of this dial) can be accurate to 3 decimal digits (it cannot be accurate to 4 decimal digits).

And this precision is essentially determined by the dial's interval. In this case the interval is 0.0012; if the interval were reduced to 0.00000012, the precision of the corresponding dial would improve (it could reach 7 decimal digits, but not 8).

Through this example, I hope you can see intuitively that there is a close relationship between "the dial's interval" and "the dial's precision". This is the basis for the discussion below.

In fact, the 32-bit floating point numbers of the ieee754 standard can also be imagined as a "floating point dial with very dense blue dots". If we can work out the interval between the blue dots on this dial, then we can work out its precision.

Note: the example in this section can be summed up in one sentence: for the floating point dial to provide 4-digit precision (for example, 1 integer digit + 3 decimal places), it must be able to resolve a granularity of 0.001; but 0.001 is smaller than the dial's actual interval of 0.0012... so the dial cannot provide 4-digit precision...

2. Interval of 32-bit floating point numbers

So how do we analyze the interval and precision of 32-bit floating point numbers? There is a very brute-force method: list all the decimals that a 32-bit floating point number can represent, work out the intervals, and then analyze the precision...

Uh... I'm really planning to use this stupid method... Let's get started...

Note: only normal (normalized) numbers are analyzed here, and negative numbers are not considered for now; that is, the case where the sign bit is 1 is ignored.

The smallest normal number that a 32-bit floating point number can represent is:

0 00000001 00000000000000000000000 (binary)

(Note that for normal numbers the minimum exponent field is 00000001; it cannot be 00000000. This was discussed in Chapter 2 of this series, so I won't go into details here.)

The very next number is:

0 00000001 00000000000000000000001 (binary)

The very next number is:

0 00000001 00000000000000000000010 (binary)

The very next number is:

0 00000001 00000000000000000000011 (binary)

...

Going down step by step in this way, after 2^23 - 1 steps, we will point to this number:

0 00000001 11111111111111111111111 (binary)

One more step, that is, after 2^23 steps, we will point to this number:

0 00000010 00000000000000000000000 (binary)

To summarize: after 2^23 moves:

Our starting point: 0 00000001 00000000000000000000000,

End of movement: 0 00000010 00000000000000000000000

Now you can find the interval: interval = difference / number of moves = (value of the end point - value of the starting point) / 2^23.

However, don't rush to calculate; take a closer look first. Compared with the starting point, the sign bit and the mantissa bits of the end point have not changed; only the exponent bits have changed: starting exponent 00000001 → ending exponent 00000010. The exponent of the end point is larger than that of the starting point by 1.

The value formula for floating point numbers in ieee754 is:

value = (-1)^sign × 1.mantissa × 2^(exponent - 127)

(Ignore the sign bit for now.)

In this case, suppose the value corresponding to the starting point is 1.M × 2^e (for some exponent e).

Then the value corresponding to the end point should be 1.M × 2^(e+1).

That is, only the exponent has become larger by 1.

Writing the exponent out makes it clearer:

Suppose the value corresponding to the starting point is 0.0000 0001 (8 decimal places)

Then the value corresponding to the end point should be 0.0000 001 (7 decimal places)

The difference between the end point and the starting point is (0.0000 001 - 0.0000 0001), a very small number.

The interval is: difference / 2^23

Note: we did not actually calculate the real interval above; we merely assumed that the values of the starting point and the end point are 0.0000 0001 and 0.0000 001 respectively, and then calculated a hypothetical interval. But this assumption is particularly important, and we will keep using it in the analysis below.

Without further ado, let’s move on.

The starting point now becomes: 0 00000010 00000000000000000000000

Move another 2^23 steps and we arrive at: 0 00000011 00000000000000000000000

Similarly: the sign bit and the mantissa bit have not changed, and the exponent bit has become larger by 1.

Using the above assumption, the value corresponding to the starting point this time is 0.0000 001 (7 decimal places), and the value corresponding to the end point should be 0.0000 01 (6 decimal places); that is, the exponent has again become larger by 1.

Calculate the difference again: 0.0000 01 (6 decimal places) - 0.0000 001 (7 decimal places)

Calculate the interval again: equal to difference / 2^23 (number of moves)

I don't know whether you have noticed that something is off. If not, let's keep moving forward:

The starting point now becomes: 0 00000011 00000000000000000000000

Move another 2^23 steps and we arrive at: 0 00000100 00000000000000000000000

Similarly, relative to the starting point, only the exponent of the end point has become larger by 1.

Calculate the difference again: (0.00001 (5 decimal places) - 0.000001 (6 decimal places))...

Calculate the interval again: equal to difference / 2^23 (number of moves)

Feel something is wrong? Keep moving forward...

The starting point now becomes: 0 00000100 00000000000000000000000

Move another 2^23 steps and we arrive at: 0 00000101 00000000000000000000000

Calculate the difference again: (0.0001 (4 decimal places) - 0.00001 (5 decimal places))...

Calculate the interval again: equal to difference / 2^23 (number of moves)

...Having walked all the way here, do you feel something is wrong?

What's wrong is the difference between the end point and the starting point: it keeps getting bigger! And consequently the interval keeps getting bigger too!

If you don’t believe it, let’s list the previous differences:

The first time: 0.0000 001 (7 decimal places) - 0.0000 0001 (8 decimal places), a difference of 0.0000 0009

The second time: 0.000001 (6 decimal places) - 0.0000001 (7 decimal places), a difference of 0.0000 009

The third time: 0.00001 (5 decimal places) - 0.000001 (6 decimal places), a difference of 0.0000 09

The fourth time: 0.0001 (4 decimal places) - 0.00001 (5 decimal places), a difference of 0.0000 9

The decimal point of the difference keeps moving to the right. If this continues, the difference will eventually grow to 9, then 90, then 900, and so on...

The number of moves is always = 2^23, and the interval is always = difference/2^23.... As the difference becomes larger and larger, the interval will also become larger and larger...

At this point, you have discovered an important feature of the ieee754 standard: if you imagine the floating point numbers representable by ieee754 as a dial, the blue dots on the dial are not evenly distributed; they become increasingly widely spaced and sparse:

Probably something like this:

You can verify this feature directly in C language:
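Here is a minimal sketch of such a check (the values are my own; they match the 16777216 example discussed below):

#include <stdio.h>

int main(void) {
    float a = 16777216.0f;   /* 2^24, a "blue dot" on the float dial */
    float b = a + 1.0f;      /* 16777217 is not a blue dot...        */
    float c = a + 2.0f;      /* ...but 16777218 is the next one      */

    printf("%.1f\n", a);     /* prints 16777216.0                    */
    printf("%.1f\n", b);     /* prints 16777216.0: the +1 was lost   */
    printf("%.1f\n", c);     /* prints 16777218.0                    */
    return 0;
}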


The blue point next to 16777216 is 16777218. The difference between the two numbers is 2. 32-bit floating point numbers cannot represent 16777217.

3. Interval table of 32-bit floating point numbers

We said at the beginning: once we know the dial's interval, we can work out the dial's precision.

The complication is that on the ieee754 dial, the intervals are not fixed, but are getting larger and larger.

Fortunately, the wiki has already summarized the interval data for us:

In this table we only need the three columns on the right, which tell us the interval used within each [minimum value, maximum value] range.

For example, the following row tells us that for numbers in the range 8388608 ~ 16777215, the interval is 1.

So a 32-bit floating point number can store 8388608 and it can store 8388609 (because the interval is 1), but it cannot store 8388608.5.

And the second row says: for numbers in the range 1 ~ 1.999999880791, the interval is 1.19209e-7.

If you look through the source code of float.h in C language, you will find this sentence:

#define FLT_EPSILON 1.192092896e-07F // smallest such that 1.0+FLT_EPSILON != 1.0

↑ This defines the constant FLT_EPSILON, whose value is 1.192092896e-07F.

This 1.192092896e-07F is exactly the interval we saw in the table: 1.19209e-7.

The comment says: to the 32-bit floating point number 1.0, you must add at least FLT_EPSILON for the result to be unequal to 1.0.

In other words, if 1.0 is incremented by a number N smaller than FLT_EPSILON, the "weird situation" 1.0 + N == 1.0 occurs.

Because for 32-bit floating point numbers in the range 1 ~ 1.999999880791, at least FLT_EPSILON (that is, at least the interval corresponding to this range) must be added before the pointer can move from the current blue dot to the next blue dot.

Note: if the number is not in the range 1 ~ 1.999999880791, the amount to add is not 1.19209e-7. To be precise: a number in a given range must be incremented by at least the interval corresponding to that range in order to move from the current blue dot to the next blue dot.
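A quick sketch of this "weird situation" (1e-8 is just an arbitrary number smaller than FLT_EPSILON):

#include <stdio.h>
#include <float.h>

int main(void) {
    float a = 1.0f + 1e-8f;          /* 1e-8 is smaller than FLT_EPSILON     */
    float b = 1.0f + FLT_EPSILON;    /* add one whole interval               */

    printf("%d\n", a == 1.0f);       /* prints 1: the addition was swallowed */
    printf("%d\n", b == 1.0f);       /* prints 0: b is the next blue dot     */
    return 0;
}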

Take a closer look at the interval table above, and you will be more confident about floating point number operations in C language.

Note: there is actually a moderate problem with the explanation above. We will set it aside for now and come back to it once we understand things more deeply.

Note: for the interval table of 64-bit floating point numbers, see the IEEE754 WIKI.

4. Precision of 32-bit floating point numbers

Then, why is it said that the precision of 32-bit floating point numbers is 7 decimal digits?

The first thing to note is that the precision of 32-bit floating point numbers is "Approximately 7 decimal digits"; the key word is approximately.

In fact, some 8-digit decimal numbers can also be stored exactly by the 32-bit floating point container; for example, the two numbers below can be saved exactly.
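A sketch with example values of my own choosing (not necessarily the two numbers from the original figure); both 8-digit integers below are less than 2^24 = 16777216 and are therefore stored exactly:

#include <stdio.h>

int main(void) {
    float a = 12345678.0f;   /* 8 decimal digits, below 2^24: stored exactly */
    float b = 16777215.0f;   /* 8 decimal digits, below 2^24: stored exactly */

    printf("%.1f\n", a);     /* prints 12345678.0 */
    printf("%.1f\n", b);     /* prints 16777215.0 */
    return 0;
}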

So what does the so-called precision of 7 decimal digits mean? Before discussing this, we need to understand some more essential things first

I. Floating point numbers can only store the value corresponding to the position of the blue dot.

As mentioned above, 32-bit floating point numbers form a dial whose blue dots gradually become sparser. The green pointer can only point at a blue dot, not at a position between two blue dots. In other words, a 32-bit floating point number can only store values that correspond to blue dots.

If the value you want to save is not the value corresponding to the blue dot, it will be automatically rounded to the value corresponding to the blue dot closest to the number. Example:

In the range 0.5 ~ 1, the interval is about 5.96046e-8, that is, about 0.0000000596046.

In other words: there is a blue dot on the dial which is 0.5

The next blue dot should be: current blue dot + interval ≈ 0.5 + 0.0000000596046 ≈ 0.5000000596046

Now suppose the value we want to save is 0.50000006, which is slightly larger than this next blue dot:

Because the green pointer must point at a blue dot and cannot point between blue dots, the pointer will be "calibrated" to 0.5000000596046; in other words, the 0.50000006 we wanted to save is rounded to 0.5000000596046.

Try it out:
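A minimal sketch (13 decimal places of output are just enough to show the stored blue dot value):

#include <stdio.h>

int main(void) {
    float f = 0.50000006f;    /* not a blue dot value                         */
    printf("%.13f\n", f);     /* prints 0.5000000596046: rounded to the next
                                 blue dot after 0.5                           */
    return 0;
}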

In fact, each 32-bit floating point number container must store a blue point value

To verify, first find the blue point value starting from 0.5:

First blue dot: 0.5

Second blue dot: ≈ 0.5 + 0.0000000596046 ≈ 0.5000000596046

The third blue dot: the second blue dot + 0.0000000596046 ≈ 0.5000001192092

The fourth blue dot: the third blue dot + 0.0000000596046 ≈ 0.5000001788138

Then look at the following code and find that what is actually stored in the variable is actually the blue dot value:
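A sketch along those lines (the values 0.50000001, 0.50000007 and 0.50000013 are arbitrary choices of mine; none of them are blue dot values):

#include <stdio.h>

int main(void) {
    float a = 0.50000001f;
    float b = 0.50000007f;
    float c = 0.50000013f;

    printf("%.13f\n", a);   /* prints 0.5000000000000: snapped to the 1st blue dot */
    printf("%.13f\n", b);   /* prints 0.5000000596046: snapped to the 2nd blue dot */
    printf("%.13f\n", c);   /* prints 0.5000001192093: snapped to the 3rd blue dot */
    return 0;
}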


Looking at the printed content, you can see that all the blue dot values ​​are actually stored.

This is the key point we need to understand.

As a digression: with what we have learned so far, we can roughly answer a classic programming question, "why is 0.99999999 stored as 1.0 in a 32-bit floating point number?" Because 0.99999999 is not a blue dot value, and the nearest blue dot value is 1.0, the green pointer is automatically "calibrated" to the nearest blue dot, 1.0.

II. Understand that the precision of a 32-bit floating point number is 7 decimal digits

This is how I understand it:

Example 1:

Looking up the table, we found that the interval in the range of 1024 ~ 2048 is approximately 0.000122070

As shown below: I want to store it accurately to 4 digits after the decimal point, but find that I can't. In fact, I can only store it accurately to 3 digits after the decimal point:

I tried to store it accurately to 4 decimal places, but found that 1024.0005 actually cannot be stored, because there is no number of the form 1024.0005xxxx on the dial.
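A small sketch of this with 1024.0005:

#include <stdio.h>

int main(void) {
    float f = 1024.0005f;    /* we would like 4 digits after the point        */
    printf("%.7f\n", f);     /* prints 1024.0004883: only the first 3 decimal
                                places can be trusted                         */
    return 0;
}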

In the range of 1024 ~2048, the number that can be accurately saved is 4-digit decimal integer + 3-digit decimal fraction = 7-digit decimal number

Example 2:

Look up the table and find that the interval in the range 8388608 ~ 16777215 is 1

As shown below: I want to store it accurately to one decimal place, but I find that I can't. In fact, I can only store it accurately to zero places after the decimal point, or I can only store it accurately to single digits (because the minimum interval is 1):

One digit after the decimal point cannot be stored accurately, it can only be stored accurately to the one digit.
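A sketch (12345678.9 is an example value I picked from this range):

#include <stdio.h>

int main(void) {
    float f = 12345678.9f;   /* in the range 8388608 ~ 16777215 the interval is 1 */
    printf("%.1f\n", f);     /* prints 12345679.0: the .9 cannot be kept          */
    return 0;
}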

In the range of 8388608 ~ 16777215, the number that can be accurately saved is 7 or 8-digit decimal integer + 0 decimal digit = 7 or 8-digit decimal number

Yes, a 32-bit floating point number can also exactly store 8-digit decimal integers up to 16777215, which is why its precision is described as approximately 7 decimal digits.

Example 3:

Look up the table and find that the interval in the range 1 ~ 2 is 1.19209e-7

As shown below: I want to store it accurately to 7 digits after the decimal point, but find that I can't. In fact, I can only store it accurately to 6 digits after the decimal point:

I want to store it accurately to 7 decimal places, but find that 1.0000012 actually cannot be stored, because there is no number of the form 1.0000012xxxx on the dial.
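A sketch with 1.0000012:

#include <stdio.h>

int main(void) {
    float f = 1.0000012f;    /* we would like 7 digits after the point        */
    printf("%.10f\n", f);    /* prints 1.0000011921: only the first 6 decimal
                                places can be trusted                         */
    return 0;
}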

In the range of 1 ~ 2, the number that can be accurately saved is 1-digit decimal integer + 6-digit decimal fraction = 7-digit decimal number

This is roughly how "the precision of 32-bit floating point numbers is 7 decimal digits" is arrived at: integer digits + decimal digits can total at most about 7; anything beyond that cannot be guaranteed to be accurate. (Note that this is not the calculation method given by the wiki; see below for the wiki's formula.)

If you don’t like this way of understanding, you might as well take a step back and just remember the following three points:

1. 32-bit floating point numbers can actually only store the blue dot value on the corresponding dial.

They cannot store values that lie between two blue dots.

2. The blue points are not evenly distributed, but become increasingly sparse. In other words, the distance between blue points is getting larger and larger, or the accuracy is getting lower and lower.

This is why numbers of the form 1.xxxxxxx can still be accurate to 6 decimal places, numbers of the form 1024.xxx can only be accurate to 3 decimal places, and around 8388608 we can only be accurate to whole numbers. The blue dots keep getting sparser, and beyond that point they are not even accurate to whole numbers...

5. Caveats

I. Distinguish between the storage precision and the printed output of 32-bit floating point numbers

In C, printing with %f outputs 6 decimal places by default.

A 32-bit floating point number has about 7 significant digits.

There is no conflict between the two, for example:

Original value: 1234.87889

Printed output: 1234.878906

It can be seen that only the first 7 digits of the printed output are consistent with the original value.

In fact, "original value" vs "printed value" is actually "the value you want to store" vs "the actual stored value"

You want to store 1234.87889, but what is actually stored is 1234.878906..., because 1234.878906... is a "blue dot value": the green pointer can genuinely point at it, and it can genuinely be stored in the 32-bit floating point container.

Although it cannot store exactly the value you wanted, the 32-bit floating point number does guarantee that the first 7 digits of that value are stored accurately. That is why the first 7 digits of the printed output match the original value.

%f prints 6 decimal places by default, and what gets printed is the actually stored blue dot value. That blue dot value does not necessarily have 6 decimal places; it may have a dozen or more, but %f silently rounds it to 6 decimal places when printing.
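A sketch of the 1234.87889 example above:

#include <stdio.h>

int main(void) {
    float f = 1234.87889f;    /* the value we want to store                    */
    printf("%f\n", f);        /* prints 1234.878906 (%f: 6 decimal places)     */
    printf("%.10f\n", f);     /* prints 1234.8789062500 (the actual blue dot)  */
    return 0;
}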

II. Sometimes the precision is not 7 digits

There are many possible reasons, such as:

1. When printing, %f is rounded:

In this case, you can specify more decimal places when printing.
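A sketch (0.1 is my own example value; %f hides the rounding, while a wider format reveals the stored blue dot):

#include <stdio.h>

int main(void) {
    float f = 0.1f;
    printf("%f\n", f);        /* prints 0.100000: looks exact, because %f rounds */
    printf("%.10f\n", f);     /* prints 0.1000000015: the stored blue dot value  */
    return 0;
}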

2. It seems that the accuracy is more than 7 digits

Note that all that can be stored in floating point numbers are actually blue point values.

So, if the value you want to store is exactly equal to a blue dot value, it can be stored with complete accuracy.

And if the value you want to store is very, very close to a blue dot value, the result can appear extraordinarily accurate. See the following example:
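A sketch (the value below is one I chose to lie almost exactly on the blue dot 1 + 2^-23):

#include <stdio.h>

int main(void) {
    /* the value we want to store is almost exactly the blue dot 1 + 2^-23 */
    float f = 1.00000011920929f;
    printf("%.16f\n", f);     /* prints 1.0000001192092896: far more than
                                 7 digits agree with what we asked for      */
    return 0;
}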

3. It seems that the accuracy is less than 7 digits

Example:
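A sketch with the two values discussed below, 1024.001 and 1024.0011:

#include <stdio.h>

int main(void) {
    float a = 1024.001f;
    float b = 1024.0011f;

    printf("%.10f\n", a);   /* prints 1024.0009765625: looks like fewer than 7 digits are kept */
    printf("%.10f\n", b);   /* prints 1024.0010986328: 7 digits match again                    */
    return 0;
}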

My personal understanding of this is:

For numbers between 1024 ~ 2048, 32-bit floating point numbers are indeed capable of being accurate to 7 significant digits

When the value you want to store is not a blue point value, rounding will occur and will be automatically rounded to the nearest blue point value.

So 1024.001 is rounded to the nearest blue dot, 1024.000976..., and the precision appears to be less than 7 digits.

And 1024.0011 is rounded to the nearest blue dot, 1024.00109..., and the precision appears to be 7 digits again...

Just saying: 32-bit floating point numbers do have the ability to be accurate to 7 significant digits, but rounding rules make it sometimes seem impossible to be accurate to 7 digits...

If understood from this perspective, a topic we discussed earlier is a bit untenable:

We said before:

There is this line of code in float.h in C:
#define FLT_EPSILON 1.192092896e-07F // smallest such that 1.0+FLT_EPSILON != 1.0
↑ This defines the constant FLT_EPSILON, whose value is 1.192092896e-07F.
This 1.192092896e-07F is exactly the interval for our 1 ~ 2 range: 1.19209e-7.
The comment says: to the 32-bit floating point number 1.0, at least FLT_EPSILON must be added for the result to be unequal to 1.0.
In other words, if 1.0 is incremented by a number N smaller than FLT_EPSILON, the "weird situation" 1.0 + N == 1.0 occurs.

Wait: this seems to ignore the rounding rule. In the range 1 ~ 2, the interval between two blue dots is 1.19209e-7, but that does not mean that getting from the current blue dot to the next one requires covering an entire interval. Because of rounding, you actually only need to go a little more than half of the interval, and the rounding rule will automatically snap you to the next blue dot...

Verify it in C language:

Test environment: win10 1909, gcc version 6.3.0 (MinGW.org GCC-6.3.0-1)
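A sketch of such a check (the factors 0.4 and 0.6 are my own choice of "a bit less" and "a bit more" than half):

#include <stdio.h>
#include <float.h>

int main(void) {
    float a = 1.0f + FLT_EPSILON * 0.4f;   /* less than half the interval       */
    float b = 1.0f + FLT_EPSILON * 0.6f;   /* a bit more than half the interval */

    printf("%d\n", a != 1.0f);   /* prints 0: rounded back down to 1.0          */
    printf("%d\n", b != 1.0f);   /* prints 1: rounded up to the next blue dot   */
    return 0;
}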


It can be seen that, because of the rounding mechanism, to move from one blue dot to the next it is generally enough to move a little more than half of the interval.

And this line of comments in C language:

#define FLT_EPSILON 1.192092896e-07F // smallest such that 1.0+FLT_EPSILON != 1.0

In fact, it is not quite accurate: 1.0 does not need to have the entire FLT_EPSILON interval added to it in order to become != 1.0 (i.e., to reach the next blue dot); generally, adding a little more than half of FLT_EPSILON is already enough to make it != 1.0 (i.e., to land on the next blue dot).

But this is just my personal understanding...

III. How many decimal places can 0.xxxxxxx be accurate to?

In other words, a 32-bit floating point number records about 7 significant digits. So for a number of the form 0.xxxxxxx, can it be accurate to 7 digits after the decimal point, or only to 6? Put differently, does the leading 0 of the integer part count as a significant digit here?

My personal understanding is that for decimals like 0.xxxxxxx, it can actually be accurate to 7 digits after the decimal point, that is, 0 does not count as one significant digit.

Taking the range of 0.5 ~ 1 as an example, the interval at this time is 5.96046e-8, which is approximately equal to 0.0000 0006

I tried below to be accurate to 8 decimal places, but found that it didn't work.

But accurate to 7 decimal places is more than enough:
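A sketch (0.56789012 and 0.5678901 are arbitrary values I picked from the range 0.5 ~ 1):

#include <stdio.h>

int main(void) {
    float a = 0.56789012f;   /* 8 decimal places: the last one is not kept      */
    float b = 0.5678901f;    /* 7 decimal places: all of them are kept          */

    printf("%.10f\n", a);    /* prints 0.5678901076: the 8th place changed      */
    printf("%.10f\n", b);    /* prints 0.5678901076: the first 7 places survive */
    return 0;
}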

Between 0 and 0.5, the intervals are even smaller and the precision even higher (the blue dots on the floating point dial become sparser as values grow, so precision worsens; if the range 0.5 ~ 1 can be accurate to 7 digits after the decimal point, then the earlier range 0 ~ 0.5 can only be more accurate, i.e., its blue dots are denser and its intervals smaller).

Example:
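A sketch using nextafterf from math.h, which returns the next representable float (i.e., the next blue dot); the gap keeps shrinking as we approach 0 (link with -lm if needed):

#include <stdio.h>
#include <math.h>

int main(void) {
    /* distance from a blue dot to the next blue dot in each range */
    printf("%g\n", nextafterf(1.0f,  2.0f) - 1.0f);    /* ~1.19209e-07 */
    printf("%g\n", nextafterf(0.5f,  1.0f) - 0.5f);    /* ~5.96046e-08 */
    printf("%g\n", nextafterf(0.25f, 1.0f) - 0.25f);   /* ~2.98023e-08 */
    return 0;
}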

In short, the same thing is true: Roughly speaking, the precision of a 32-bit floating point number is 7 significant digits.

In fact, only blue points can be stored in floating point numbers. The closer the blue points are to 0, the denser they are and the higher the accuracy. ←7 significant digits is a summary and general description of this phenomenon

One final point: Some students may mistakenly believe that the precision of the 32-bit floating point number type is: it can always be accurate to 6 digits after the decimal point. For example, it can accurately store 999999.123456, but it cannot accurately store 999999.1234567.

I believe that after reading this far, everyone can spot the error in that understanding: the precision of a 32-bit floating point number is 7 significant digits, and generally speaking those 7 significant digits are the integer digits and the decimal digits combined; it does not mean the number is always accurate to six decimal places...

IV. In-depth understanding of interval tables

Let's look back at the interval table on this wiki. In fact, it mainly tells us: what is the interval between two blue dots in a certain range.

For example, in the range of 1 ~ 2, the distance between two blue dots is approximately 1.19209e-7

In the range 8388608 ~ 16777215, the interval between two blue points is 1

There are actually a few things to note here:

1. In each range there are 2^23 blue dots; in other words, each range is divided equally into 2^23 intervals.

For example, the range 1~2 will be equally divided into 2^23 intervals, and the range 8388608 ~ 16777215 will also be equally divided into 2^23 intervals.

2. The division into ranges is determined by the exponent

The so-called range 1 ~ 2 being divided into 2^23 intervals is, to be precise, the range 2^0 ~ 2^1 being divided into 2^23 intervals.

The so-called range 8388608 ~ 16777215 being divided into 2^23 intervals is, to be precise, the range 2^23 ~ 2^24 being divided into 2^23 intervals.

Every time the exponent changes, a new range begins. This is actually easy to see:

For example, suppose the current starting point is: 0 00000010 00000000000000000000000

Moving forward 2^23 - 1 steps, i.e., 2^23 - 1 intervals, corresponds to stepping the mantissa from 00000000000000000000000 up to 11111111111111111111111 one step at a time.

Take another step forward, that is, move forward a total of 2^23 intervals, and we have reached the end point: 0 00000011 00000000000000000000000

It can be seen that, relative to the starting point, only the exponent of the end point has increased by 1.

The starting point and the end point determine a range, and that range is divided into 2^23 equal intervals; (end point - starting point) / 2^23 is the length of each interval.

Go forward another 2^23 intervals and you arrive at 0 00000100 00000000000000000000000; the exponent again becomes larger by 1...

It is not difficult to see: The change of the exponent bit is used to divide the range, and the change of the mantissa bit is used to move forward step by step.

The number of mantissa digits determines how many intervals can be divided into each range. For example, there are 23 mantissa digits, which means that 2^23 intervals can be divided into each range.

How many exponent bits there are determines how much range we can include. For example, if there are 8 exponent bits (the representable exponent range is [-127, 128]), then our range division is like this:

2^-127 ~ 2^-126 is a range

2^-126 ~ 2^-125 is a range

...

2^0 ~ 2^1 is a range

2^1 ~ 2^2 is a range

...

2^127 ~ 2^128 is a range

Each range above will be equally divided into 2^23 intervals by the mantissa digits.

Increasing the number of exponent bits will not increase the precision. For example, if the exponent field were widened to 16 bits (representable exponent range [-32767, 32768]), then the division into ranges would look like this:

2^-32767 ~ 2^-32766 is a range

2^-32766 ~ 2^-32765 is a range

...

2^-127 ~ 2^-126 is a range

2^-126 ~ 2^-125 is a range

...

2^0 ~ 2^1 is a range

2^1 ~ 2^2 is a range

...

2^32767 ~ 2^32768 is a range

Each of the above ranges will still be equally divided into 2^23 intervals by the mantissa digits.

Note: 2^0 ~ 2^1, this range is still equally divided into 2^23 intervals, 2^-126 ~ 2^-125, this range is still equally divided into 2^23 intervals...

There is no improvement in accuracy for each range.

Increasing the number of mantissa bits will increase the precision. For example, if the mantissa were widened to 48 bits, each range would be divided equally into 2^48 intervals; only then would the intervals in each range become smaller, the blue dots denser, and the precision higher.

Summary: The number of exponent bits controls how many ranges can be included, and the number of mantissa bits controls the accuracy of each range, or in other words, controls the size of the interval in each range and the density of blue dots.

I hope this will give you a better understanding of the specific functions of the exponent bit and the mantissa bit in the ieee754 standard, and what they control.

4. Understanding angle 2: the calculation method in the WIKI

Understanding angle 2 is quite simple.

We said that a 32-bit floating point number is represented in memory like this: 1 sign bit, 8 exponent bits, 23 mantissa bits

In fact, the mantissa is effectively 24 bits, because a hidden integer part of 1. or 0. precedes the stored mantissa bits (see the first article in this series).

Think about it carefully: a floating point number in memory has three parts:

Sign bit: used to control the sign

Exponent bit: Controls the exponent, which actually controls the movement of the decimal point:

Just like in decimal:

1.2345e2 = 123.45

1.2345e3 = 1234.5; incrementing the exponent simply moves the decimal point one place to the right. It is the same in binary: the exponent bits only control where the point sits. For example, 0.01 → 0.001 (the point moves one place to the left).

Mantissa bits: only the mantissa bits actually control the precision, i.e., actually record the state.

Within the 24-bit mantissa,

From: 0.0000 0000 0000 0000 0000 000

To: 0.0000 0000 0000 0000 0000 001

...

Until: 1.1111 1111 1111 1111 1111 111

It contains a total of 2^24 states, or can accurately record 2^24 different states:

0.0000 0000 0000 0000 0000 000 is one state,

0.0000 0000 0000 0000 0000 001 is another state,

1.0010 1100 0100 1000 0000 000 is yet another state,

...

If you plan to record 2^24 + 1 states, then the mantissa is not enough. In other words, it cannot meet your demand for accuracy.

From this perspective, precision is essentially equivalent to the number of representable states.

To sum up: 32-bit floating point numbers can record a total of 2^24 states (the sign bit is used to control positive and negative, the exponent bit is used to control the position of the decimal point. Only the mantissa bit is used to accurately record the state)

For float f = xxx;, where xxx is a numeric value: no matter what base you write xxx in, as long as you use a 32-bit floating point number as the container, you can accurately record at most 2^24 distinct states, just as if the 32-bit floating point "building" had only 2^24 rooms.

In practice, we usually write xxx in decimal,

And 2^24 = 16 777 216 (decimal), that is, the 32-bit floating point number container can only store up to 16 777 216 (decimal) states.

16 777 216 is an 8-digit number

Therefore, the precision of a 32-bit floating point number is at most 7 decimal digits (0 - 9 999 999), with a total of 10 000 000 states.

If the precision were 8 decimal digits (0 - 99 999 999), that would require 100 000 000 states, which exceeds the 16 777 216-state upper limit that a 32-bit floating point number can represent... so the precision cannot reach 8 decimal digits.

The analysis is complete here.

If you prefer mathematical expressions, then "the precision of a 32-bit floating point number is at most N decimal digits", N is calculated like this:

The following is a description of the algorithm from the wiki:

The number of decimal digits precision is calculated via number_of_mantissa_bits * Log10(2). Thus ~7.2 and ~15.9 for single and double precision respectively.

As stated in the wiki, the precision of a 32-bit floating point number is approximately 7 decimal digits, and that of a 64-bit floating point number is approximately 16 decimal digits.
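A quick sketch of this formula in C, using FLT_MANT_DIG and DBL_MANT_DIG from float.h (24 and 53 respectively):

#include <stdio.h>
#include <math.h>
#include <float.h>

int main(void) {
    /* decimal digits of precision ~ number_of_mantissa_bits * log10(2) */
    printf("float : %.2f\n", FLT_MANT_DIG * log10(2.0));   /* 24 * log10(2) ~ 7.22  */
    printf("double: %.2f\n", DBL_MANT_DIG * log10(2.0));   /* 53 * log10(2) ~ 15.95 */
    return 0;
}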

A note on the two angles: understanding angle 2 is simpler, and the precision can be computed directly with a formula; understanding angle 1 (the interval perspective) is more explanatory, richer in detail, and accounts for more phenomena.

5. Summary

This chapter summarized two ways of arriving at "the precision of a 32-bit floating point number is 7 decimal digits". Information on this topic online is rather confusing, so I have added some of my own understanding; please point out any mistakes.

Origin blog.csdn.net/weixin_42056745/article/details/131699180