What is a floating point number?

Search WeChat for the "Water Drop and Silver Bullet" official account and follow it to get high-quality technical articles as soon as they are published. The author has 7 years of back-end development experience and explains technology in plain, simple terms.

In the last article, we looked at how computers represent numbers using fixed-point numbers.

To recap briefly: with fixed-point representation, the position of the decimal point is fixed by convention, and the integer part and the fractional part are each converted to binary to form the result.

However, fixed-point numbers suffer from a limited range and limited precision when representing decimals. For this reason, computers generally use "floating-point numbers" to represent decimals instead.

In this article, let's take a closer look at how floating-point numbers represent decimals, and how large their range and precision are.

What is a floating point number?

First of all, what exactly is a floating-point number?

We learned about fixed-point numbers earlier: "fixed point" refers to the convention that the position of the decimal point is fixed. The "floating point" in floating-point numbers means the position of the decimal point can float.

How to understand this?

In fact, floating-point numbers are based on scientific notation. For example, the decimal number 8.345 can be written in scientific notation in several ways:

8.345 = 8.345 * 10^0
8.345 = 83.45 * 10^-1
8.345 = 834.5 * 10^-2
...

See that? When scientific notation is used, the position of the decimal point "floats". That is where the name floating-point comes from, in contrast to fixed-point.

By the same rule, binary numbers can also be written in scientific notation; we simply replace the base 10 with 2.

How do floating point numbers represent numbers?

We already know that floating-point numbers use scientific notation to represent a number, and its format can be written as follows:

V = (-1)^S * M * R^E

The meaning of each variable is as follows:

  • S: sign bit, 0 or 1, determining the sign of the number; 0 means positive, 1 means negative
  • M: mantissa, a decimal value; in 8.345 * 10^0 above, 8.345 is the mantissa
  • R: base (radix); for decimal numbers R is 10, for binary numbers R is 2
  • E: exponent, an integer; in 10^-1 above, -1 is the exponent

To represent a number as a floating-point number in a computer, we only need to pin down these variables.
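To make the formula concrete, here is a minimal Python sketch (the function name `float_value` is ours, purely for illustration) that evaluates V directly from S, M, R, and E:

```python
def float_value(S, M, R, E):
    """Evaluate V = (-1)^S * M * R^E directly."""
    return (-1) ** S * M * R ** E

# 8.345 written as 8.345 * 10^0, with the sign bit set to 1 (negative):
print(float_value(1, 8.345, 10, 0))  # -8.345
# A binary example: mantissa 1.5, base 2, exponent 3 gives 1.5 * 8:
print(float_value(0, 1.5, 2, 3))     # 12.0
```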

Suppose we now use 32 bits to represent a floating-point number, filling the variables above into those bits according to some rule.

Suppose we define the following rules to fill these bits:

  • The sign bit S occupies 1 bit
  • The exponent E occupies 10 bits
  • The mantissa M occupies 21 bits

Using this rule, let's convert the decimal number 25.125 to a floating-point number (D stands for decimal, B stands for binary):

  1. Integer part: 25(D) = 11001(B)
  2. Decimal part: 0.125(D) = 0.001(B)
  3. Expressed in binary scientific notation: 25.125(D) = 11001.001(B) = 1.1001001 * 2^4(B)

So the sign bit S = 0, the mantissa M = 1.1001001(B), and the exponent E = 4(D) = 100(B).
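The conversion steps above can be sketched in Python; `frac_to_bin` is a hypothetical helper of ours implementing the usual multiply-by-2 method for the fractional part:

```python
def frac_to_bin(x, max_bits=16):
    """Binary digits of a fraction 0 <= x < 1, by repeated doubling."""
    bits = []
    while x and len(bits) < max_bits:
        x *= 2
        bits.append(int(x))  # the integer part is the next binary digit
        x -= int(x)
    return "".join(map(str, bits))

print(bin(25)[2:])         # 11001 -> integer part of 25.125
print(frac_to_bin(0.125))  # 001   -> fractional part, so 25.125 = 11001.001(B)
```

Shifting the point four places to the left then gives 1.1001001 * 2^4, matching the result above.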

Fill these into the 32 bits according to the rules defined above, and the floating-point result comes out. Isn't that simple?

But there is a problem here: the rule we just defined (sign bit S 1 bit, exponent E 10 bits, mantissa M 21 bits) was something we made up ourselves.

If you prefer a different rule, say the sign bit S occupies 1 bit, the exponent E occupies 5 bits, and the mantissa M occupies 26 bits, is that allowed? Of course it is.

Under this new rule, the same number gets a different floating-point representation. Comparing the two rules, we can see that allocating different numbers of bits to the exponent and the mantissa leads to the following:

  1. The more bits the exponent gets, the fewer remain for the mantissa: the representable range grows, but precision gets worse. Conversely, fewer exponent bits and more mantissa bits mean a smaller range but better precision.
  2. The floating-point format of a number differs depending on the rule chosen, so the same number can have different representations, different ranges, and different precisions.

This is exactly what happened in the early days of floating-point numbers. There were many computer manufacturers at the time, such as IBM and Microsoft, and each defined its own floating-point rules, so different manufacturers represented the same number differently.

As a result, a program performing floating-point arithmetic on computers from different vendors first had to convert numbers into that vendor's floating-point format before calculating, which inevitably increased the cost of computation.

How to solve this problem? The industry urgently needed a unified floating-point standard.

Floating point standard

In 1985, the IEEE introduced a floating-point standard: the IEEE 754 standard we often hear about. It unified the representation of floating-point numbers and defined two formats:

  • Single-precision floating-point number (float): 32 bits; the sign bit S occupies 1 bit, the exponent E occupies 8 bits, and the mantissa M occupies 23 bits
  • Double-precision floating-point number (double): 64 bits; the sign bit S occupies 1 bit, the exponent E occupies 11 bits, and the mantissa M occupies 52 bits

To maximize the range and precision of the numbers it can represent, the standard also makes the following stipulations about the mantissa and exponent:

  1. The first bit of the mantissa M is always 1 (since 1 <= M < 2), so this 1 can be omitted; it is a hidden bit. This way the 23-bit single-precision mantissa effectively represents 24 significant bits, and the 52-bit double-precision mantissa represents 53.
  2. The exponent E is stored as an unsigned integer. For float it occupies 8 bits, so the stored value ranges from 0 to 255. But since the actual exponent can be negative, the standard stipulates adding a bias of 127 to the actual value when storing E, which gives E a range of -127 to 128 (and since the all-0 and all-1 patterns are reserved for special cases, normalized exponents actually run from -126 to 127). For double, E occupies 11 bits and the bias is 1023, giving a range of -1023 to 1024.
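The bias rule can be sketched in a couple of lines of Python (the helper name `store_exponent` is ours; the bias values come from the standard):

```python
FLOAT_BIAS, DOUBLE_BIAS = 127, 1023  # biases fixed by IEEE 754

def store_exponent(e, bias=FLOAT_BIAS):
    """Biased encoding: the stored field is the actual exponent plus the bias."""
    return e + bias

print(store_exponent(4))               # 131, stored as 10000011(B)
print(store_exponent(-126))            # 1, the smallest normalized float exponent
print(store_exponent(4, DOUBLE_BIAS))  # 1027, the same exponent in a double
```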

Besides the mantissa and exponent fields, the standard also makes the following provisions:

  • Exponent E not all 0s and not all 1s: a normalized number, computed according to the rules above
  • Exponent E all 0s, mantissa not 0: a denormalized number; the hidden bit of the mantissa is 0 rather than 1 (M = 0.xxxxx), which makes it possible to represent 0 and numbers very close to 0
  • Exponent E all 1s, mantissa all 0s: positive or negative infinity (the sign depends on the sign bit S)
  • Exponent E all 1s, mantissa not 0: NaN (Not a Number)
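These cases can be observed directly in Python with the standard `struct` module; `f32_fields` below is a helper of our own that splits a float's 32-bit pattern into its three fields:

```python
import math
import struct

def f32_fields(x):
    """Return the (sign, exponent, fraction) fields of x's 32-bit float encoding."""
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

print(f32_fields(math.inf))   # (0, 255, 0): exponent all 1s, mantissa all 0s
print(f32_fields(-math.inf))  # (1, 255, 0): same, but with the sign bit set
s, e, f = f32_fields(math.nan)
print(e == 255 and f != 0)    # True: NaN has exponent all 1s, mantissa not 0
s, e, f = f32_fields(1e-45)
print(e == 0 and f != 0)      # True: a denormalized number, exponent all 0s
```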

Standard floating point number representation

With this unified floating-point standard in place, let's convert 25.125 into a standard float:

  1. Integer part: 25(D) = 11001(B)
  2. Decimal part: 0.125(D) = 0.001(B)
  3. Expressed in binary scientific notation: 25.125(D) = 11001.001(B) = 1.1001001 * 2^4(B)

So S = 0, the mantissa field M = 1001001 (the leading 1 is dropped as the hidden bit), and the exponent E = 4 + 127 (the bias) = 131(D) = 10000011(B). Filled into 32 bits, the layout is: 0 10000011 10010010000000000000000.

This is the standard 32-bit floating-point result.
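We can verify this encoding with Python's standard `struct` module:

```python
import struct

# Pack 25.125 as a single-precision float and inspect the raw 32 bits.
bits = struct.unpack('>I', struct.pack('>f', 25.125))[0]
print(f"{bits:032b}")            # 01000001110010010000000000000000

sign     = bits >> 31            # 0
exponent = (bits >> 23) & 0xFF   # 131 = 4 + 127
fraction = bits & 0x7FFFFF
print(sign, exponent, bin(fraction))  # 0 131 0b10010010000000000000000
```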

To represent it as a double, the rule is the same except that the exponent E is filled into 11 bits and the mantissa M into 52 bits.

Why do floating-point numbers have a loss of precision?

Next, let's look at the precision loss that floating-point numbers are so often said to suffer from.

If we now want to use a floating point number to represent 0.2, what will its result be?

To convert 0.2 to binary, we keep multiplying the fractional part by 2 until the fractional part becomes 0. The integer parts produced along the way, read from top to bottom, form the binary digits:

0.2 * 2 = 0.4 -> 0
0.4 * 2 = 0.8 -> 0
0.8 * 2 = 1.6 -> 1
0.6 * 2 = 1.2 -> 1
0.2 * 2 = 0.4 -> 0 (the cycle repeats)
...

So 0.2(D) = 0.00110011...(B), with the pattern "0011" repeating forever.

Because the decimal 0.2 cannot be converted exactly to a binary fraction, and a computer stores numbers in a fixed number of bits, the infinitely repeating fraction must be truncated when stored. This truncation is what causes the loss of precision.
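A short Python experiment makes both points visible: the repeated-doubling loop for 0.2 never terminates, and the truncation shows up in ordinary arithmetic:

```python
from decimal import Decimal

# Repeated doubling of 0.2: the digit pattern 0011 cycles forever.
x, digits = 0.2, []
for _ in range(12):
    x *= 2
    digits.append(int(x))
    x -= int(x)
print("".join(map(str, digits)))  # 001100110011

# The truncated value actually stored for 0.2 is slightly too large:
print(Decimal(0.2))      # 0.200000000000000011102230246251565404236316680908203125
print(0.1 + 0.2 == 0.3)  # False, a classic symptom of the truncation
```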

What is the range and precision of floating point numbers?

Finally, how large a range and how much precision can a floating-point number provide?

Take single-precision float as an example. The largest binary number it can represent is 1.11111...1 * 2^127 (23 1s after the point). Since 1.11111...1 ≈ 2, the maximum value float can represent is approximately 2^128 ≈ 3.4 * 10^38, so float's range is about -3.4 * 10^38 ~ 3.4 * 10^38.

How small is the precision it can represent?

The finest step float's mantissa can express is 0.0000...1 (22 0s followed by a 1), which is 1/2^23 in decimal, about 1.19 * 10^-7.

By the same reasoning, the largest binary number double can represent is 1.111...1 * 2^1023 (52 1s after the point) ≈ 2^1024 ≈ 1.79 * 10^308, so double's range is about -1.79 * 10^308 ~ +1.79 * 10^308.

The finest step of double's mantissa is 0.0000...1 (51 0s followed by a 1), which is 1/2^52 in decimal.
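Python's `sys.float_info` exposes these limits for double (Python's float is a C double), so the numbers above can be checked directly:

```python
import sys

print(sys.float_info.max)      # 1.7976931348623157e+308, i.e. ~1.79 * 10^308
print(sys.float_info.min)      # 2.2250738585072014e-308, smallest normalized double
print(sys.float_info.epsilon)  # 2.220446049250313e-16, exactly 2^-52

# The maximum equals (2 - 2^-52) * 2^1023, i.e. 1.111...1(B) * 2^1023:
print(sys.float_info.max == (2 - 2 ** -52) * 2.0 ** 1023)  # True
```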

So although the range and precision of floating-point numbers are still finite, they are already very large, which is why computers usually store decimals as floating-point numbers.

Summary

In this article, we mainly discussed the floating-point representation of numbers, summarized as follows:

  1. Floating-point numbers are based on scientific notation
  2. Filling the variables of scientific notation into fixed bit fields yields the floating-point representation
  3. In the early days, each computer manufacturer devised its own floating-point rules, so different manufacturers represented the same number differently, and numbers had to be converted before calculations could be performed
  4. Later, the IEEE proposed a floating-point standard that unified the format and defined single-precision float and double-precision double; manufacturers have followed this unified format ever since
  5. Because decimal fractions often cannot be converted exactly to binary, and must be truncated to fit a fixed number of bits, floating-point numbers may lose precision when representing decimals
  6. Floating-point numbers offer a very large range and precision, so the decimals we use are usually stored in computers as floating-point numbers

Origin: blog.csdn.net/ynxts/article/details/112342463