IEEE 754 Standard - Wikipedia

The IEEE Standard for Binary Floating-Point Arithmetic (IEEE 754) has been the standard for floating-point computation since the 1980s and is the one most widely implemented by CPUs with a floating-point unit. The standard defines the representation of floating-point numbers (including negative zero, -0) and denormal numbers, the special values infinity (Inf) and Not-a-Number (NaN), and the floating-point operations on these values; it also specifies four rounding rules and five kinds of exceptions (including when the exceptions occur and how they are handled).

IEEE 754 specifies four formats for representing floating-point values: single precision (32-bit), double precision (64-bit), single-extended precision (43 bits or more, rarely used), and double-extended precision (79 bits or more, typically implemented with 80 bits). Only 32-bit single precision is mandatory; the others are optional. Most programming languages provide IEEE floating-point formats and arithmetic, though some treat them as non-essential. For example, the C language predates IEEE 754; C now includes IEEE arithmetic but does not mandate it (in C, float generally refers to IEEE single precision and double to double precision).

The standard's official name is IEEE Standard for Binary Floating-Point Arithmetic (ANSI/IEEE Std 754-1985), also known as IEC 60559:1989, Binary floating-point arithmetic for microprocessor systems (originally numbered IEC 559:1989) [1]. It was later joined by the radix-independent IEEE 854-1987, which covers both radix 2 and radix 10. The latest version is ISO/IEC/IEEE FDIS 60559:2010.

In the 1960s and 1970s, different computer models from different manufacturers used vastly different floating-point representations, with no industry-wide standard. This caused great inconvenience for data exchange and for making computers work together. An IEEE floating-point professional group began drafting a standard in the late 1970s. In 1980, Intel introduced the 8087 single-chip floating-point coprocessor, whose floating-point representation and defined operations were sufficiently reasonable and advanced that they were adopted as the basis of the IEEE floating-point standard, published in 1985. Even before publication, the standard's content was widely used by computer manufacturers in the early 1980s and had become the de facto industry standard. William Kahan, professor of computer science at the University of California, Berkeley, who led the numerical work, is known as the "Father of Floating Point."

Floating-point representation

The value of a floating-point number can be expressed as:

$V = (-1)^{s} \times 2^{E-\text{bias}} \times 1.f$

That is, the actual value of a floating-point number equals the sign (determined by the sign bit $s$), multiplied by two raised to the unbiased exponent (the exponent field value $E$ minus the exponent bias), multiplied by the significand (1 plus the fraction $f$).

The following is a textual description of the IEEE 754 floating-point format.

Bit conventions used in this article

A datum of W bits is numbered from 0 to W-1, from the low-order end to the high-order end. Bit 0, conventionally written at the right, is called the least significant bit (LSB): the bit whose change has the smallest effect on the overall value. This description matches the little-endian data storage of the x86 architecture.

For an integer N, we write N2 or N10 where necessary to distinguish the binary representation from the decimal one.

For a number written in binary scientific notation, the exponent of that notation is hereinafter called the actual exponent; the value encoded in the exponent field according to the IEEE 754 standard is called the exponent field value (the biased exponent) of the floating-point representation.

Overall layout

The three fields of an IEEE 754 floating-point number
A binary floating-point number is stored in sign-magnitude form: the most significant bit is designated the sign bit; the next most significant e bits store the exponent field; and the remaining f bits store the fractional part of the significand (in normalized form the implicit integer bit defaults to 1, in all other cases it defaults to 0).

Exponent bias


Representing the exponent as the actual exponent plus a fixed bias has the advantage that an unsigned integer of e bits can represent the full range of exponent values, which makes comparing the exponents of two floating-point numbers easier; in fact, the magnitudes of two floating-point numbers can essentially be compared in lexicographic order of their representations.

This biased representation of the exponent field is known as the biased exponent.

Normalized floating-point numbers

If the exponent field of a floating-point encoding satisfies $0 < \text{exponent} \leqslant 2^{e}-2$ and, when the number is written in binary scientific notation, the most significant bit of the significand (the integer bit) is 1, then the number is called a normalized floating-point number. "Normalized" means that each representable value has only one floating-point representation.

Because this representation carries an implicit leading 1 bit, IEEE 754 calls it the significand, to distinguish it from the mantissa of ordinary binary scientific notation.

For example, for a normalized double-precision (64-bit) floating-point number, the exponent field ranges from 00000000001 to 11111111110 (11 bits), and the fraction part ranges from 000.....000 to 111.....111 (52 bits).

Denormalized floating-point numbers

If the exponent field of a floating-point encoding is zero and the fraction part is nonzero, the number is called a denormalized (subnormal) floating-point number. Denormalized form is generally used to represent numbers very close to zero. The IEEE 754 standard specifies that the exponent bias of denormalized form is 1 less than that of normalized form. For example, the minimum encoded exponent of a normalized single-precision number is 1, corresponding to an actual exponent of -126; for a denormalized single-precision number, the encoded exponent field is 0, and the corresponding actual exponent is likewise -126 rather than -127. Denormalized floating-point numbers are still usable, but their absolute values are smaller than those of all normalized floating-point numbers; that is, every denormalized number is closer to zero than every normalized number. The significand of a normalized number is greater than or equal to 1 and less than 2, while the significand of a denormalized number is greater than 0 and less than 1.

Special values

There are three special values that should be pointed out:

1. If the exponent field is 0 and the fraction is 0, the value is ±0 (depending on the sign bit).
2. If the exponent field is all 1s and the fraction is 0, the value is ±∞ (depending on the sign bit).
3. If the exponent field is all 1s and the fraction is nonzero, the value is NaN (Not a Number).

The rules above can be summarized as follows:

exponent field = 0, fraction = 0: zero (±0)
exponent field = 0, fraction ≠ 0: denormalized number
0 < exponent field < all 1s: normalized number
exponent field all 1s, fraction = 0: ±∞
exponent field all 1s, fraction ≠ 0: NaN

32-bit single precision

Single precision stores a binary floating-point number in 32 bits.
S is the sign bit, Exp is the exponent field, and Fraction is the significand field. The exponent field stores the actual exponent plus a fixed bias (127 in the 32-bit case), in so-called biased form. The purpose of this representation is to simplify comparison. Because the exponent can be either positive or negative, a two's-complement representation would mean that the sign bit S and the exponent's own sign bit together prevent a simple comparison; for this reason, the exponent field is stored as an unsigned value. The single-precision actual exponent ranges from -126 to +127; adding the bias of 127 gives encoded exponent values from 1 to 254 (0 and 255 are special values). During floating-point computation, the actual exponent is obtained by subtracting the bias from the encoded exponent value.

The extreme values of single-precision floating point are: the largest finite value, $(2-2^{-23})\times 2^{127}\approx 3.4\times 10^{38}$; the smallest positive normalized value, $2^{-126}\approx 1.18\times 10^{-38}$; and the smallest positive denormalized value, $2^{-149}\approx 1.4\times 10^{-45}$.

64-bit double precision

Double precision stores a binary floating-point number in 64 bits.
S is the sign bit, Exp is the exponent field, and Fraction is the significand field. The exponent field likewise stores the actual exponent plus a fixed bias (1023 in the 64-bit case), in biased form, again to simplify comparison. Because the exponent can be either positive or negative, a two's-complement representation would mean that the sign bit S and the exponent's own sign bit together prevent a simple comparison; for this reason, the exponent field is stored as an unsigned value. The double-precision actual exponent ranges from -1022 to +1023; adding the bias of 1023 gives encoded exponent values from 1 to 2046 (0, all zeros in binary, and 2047, all ones in binary, are special values). During floating-point computation, the actual exponent is obtained by subtracting the bias from the encoded exponent value.

Comparing floating-point numbers

Floating-point numbers can essentially be compared in lexicographic order by sign bit, then exponent field, then significand field. Obviously, any positive number is greater than any negative number; for numbers of the same sign, the one with the larger exponent in binary scientific notation has the larger magnitude.

Floating point rounding

During computation on significands, the result is usually held in a longer register; when the result is converted back to a floating-point format, the extra bits must be discarded. There are many ways to perform this rounding; the IEEE standard lists four different modes:

Round to nearest, ties to even (the default rounding mode): round to the nearest representable value; when two values are equally close, take the one that is even (i.e. whose binary representation ends in 0).
Round toward +∞: round the result toward positive infinity.
Round toward -∞: round the result toward negative infinity.
Round toward zero: round the result toward zero (truncate).

Floating-point operations and functions
Standard operations

The following operations must be provided: addition, subtraction, multiplication, division, square root, remainder, rounding to the nearest integer in floating-point format, comparison, and conversions (between floating-point formats, between floating-point and integer formats, and between binary and decimal).

Recommended functions and predicates

The standard also recommends functions and predicates such as copysign(x, y), -x, scalb(y, N), logb(x), nextafter(x, y), finite(x), isnan(x), unordered comparison, and class(x).

Accuracy

In binary, the first significant digit of a normalized number is necessarily "1", so this "1" is not stored.

Discussion I

Single-precision and double-precision floating-point numbers store 23 and 52 significand bits respectively; counting the leftmost implicit bit that is not stored, they effectively have 24 and 53 significand bits.

$\log_{10} 2^{24} = 7.22$
$\log_{10} 2^{53} = 15.95$
By the above calculation, single-precision and double-precision floating-point numbers can guarantee about 7 and 15 significant decimal digits respectively.

Discussion II

The C++ language standard defines the decimal precision of a floating-point type as the number of decimal digits that can be represented by the type without change [3]. The C language standard defines it as: the number of decimal digits q such that any floating-point number with q decimal digits can be rounded into a floating-point number with p radix-b digits and back again without change to the q decimal digits [4].

However, because the relative approximation error is non-uniform, some 7-digit decimal numbers do not survive a round trip: converted to a 32-bit floating-point number and back to 7 decimal digits, the value changes. For example, 8.589973e9 becomes 8.589974e9. This approximation error never exceeds one bit of capacity, so (24-1)*std::log10(2), which equals 6.92, rounded down to 6, is the value of std::numeric_limits<float>::digits10 and FLT_DIG. std::numeric_limits<float>::max_digits10 is 9, meaning that 9 decimal digits are enough to distinguish all distinct float values; it represents the maximum resolution of float.

Similarly, std::numeric_limits<double>::digits10 (or DBL_DIG) is 15, and std::numeric_limits<double>::max_digits10 is 17.

Example

The following C++ program illustrates the precision of single-precision and double-precision floating-point numbers.

#include <iostream>

int main() {
    std::cout.precision(20);
    float a = 123.45678901234567890;
    double b = 123.45678901234567890;
    std::cout << a << std::endl;
    std::cout << b << std::endl;
    return 0;
}

// Xcode 5.1
// Output:
// 123.456787109375
// 123.45678901234568059
// Program ended with exit code: 0

Translated in full from Wikipedia, the free encyclopedia.



Origin blog.csdn.net/weixin_43627118/article/details/104851030