Fixed-point notation:
The symbol '.' denotes the binary point: digits to its right are weighted by negative powers of 2, and digits to its left by nonnegative powers of 2.
Binary fractions can exactly represent only numbers that can be written as $x \times 2^{y}$; all other numbers can only be approximated.
This notation also cannot efficiently represent very large numbers. (Increasing the length of the binary representation increases precision.)
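As a rough illustration of the exact-versus-approximate point above, the sketch below truncates a value to a fixed number of bits after the binary point. The helper name `to_binary_fraction` is made up for this note, and truncation (rather than rounding) is an assumption chosen to keep the example short.

```python
from fractions import Fraction

def to_binary_fraction(value, frac_bits):
    """Truncate value (0 <= value < 1) to frac_bits bits after the binary point."""
    scaled = int(value * 2 ** frac_bits)          # integer made of the kept bits
    bits = format(scaled, f"0{frac_bits}b")       # zero-padded bit string
    approx = Fraction(scaled, 2 ** frac_bits)     # exact value of those bits
    return "0." + bits, approx

# 5/8 = 5 * 2^-3, so three fractional bits represent it exactly.
print(to_binary_fraction(Fraction(5, 8), 3))

# 1/10 is not of the form x * 2^y, so every finite length leaves an error.
for n in (4, 8, 16):
    bits, approx = to_binary_fraction(Fraction(1, 10), n)
    print(n, bits, float(Fraction(1, 10) - approx))
```

The error for 1/10 shrinks as more bits are used but never reaches zero, which is the sense in which a longer representation increases precision.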
IEEE floating-point representation:
The IEEE standard represents a number in the form $V = (-1)^{s} \times M \times 2^{E}$:
- Sign: s determines whether the number is negative (s=1) or positive (s=0)
- Mantissa (significand): M is a fractional binary number
- Exponent: E weights the value by a power of 2, namely $2^{E}$
Bit fields of a floating-point number:
- The single sign bit s directly encodes the sign
- The k-bit exponent field exp encodes the exponent E
- The n-bit fraction field frac encodes the mantissa M (M is a binary fraction)
Single-precision floating-point format: float (s=1, exp=8, frac=23) 32 bits
Double-precision floating-point format: double (s=1, exp=11, frac=52) 64 bits
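To make the two layouts concrete, here is a minimal sketch that splits a number into its s, exp, and frac fields using Python's standard `struct` module. The function name `float_fields` is hypothetical, and the sketch assumes the platform's `float`/`double` packing follows the IEEE layouts listed above.

```python
import struct

def float_fields(x, double=False):
    """Split x into (s, exp, frac) using the single- or double-precision layout."""
    if double:
        k, n, fmt_f, fmt_i = 11, 52, ">d", ">Q"   # 1 + 11 + 52 = 64 bits
    else:
        k, n, fmt_f, fmt_i = 8, 23, ">f", ">I"    # 1 + 8 + 23 = 32 bits
    # Reinterpret the packed floating-point bytes as one unsigned integer.
    (bits,) = struct.unpack(fmt_i, struct.pack(fmt_f, x))
    s = bits >> (k + n)                  # top bit
    exp = (bits >> n) & ((1 << k) - 1)   # next k bits
    frac = bits & ((1 << n) - 1)         # low n bits
    return s, exp, frac

print(float_fields(1.0))                 # single-precision fields of 1.0
print(float_fields(-0.5))                # sign bit set
print(float_fields(1.0, double=True))    # double-precision fields of 1.0
```

The fields are returned as raw unsigned integers; how they map to a value depends on exp.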
Depending on the value of exp, the encoded value falls into one of three cases:
- Normalized value:
The bit pattern of exp is neither all 0s nor all 1s.
Exponent field:
interpreted as a signed integer in biased form; the exponent value is $E = e - Bias$ (single precision: -126 to 127; double precision: -1022 to 1023).
① $e$: the unsigned number with bit representation $e_{k-1} \cdots e_1 e_0$
② $Bias$: the bias value $2^{k-1} - 1$ (127 for single precision, 1023 for double precision)
Fraction field:
the mantissa is defined as $M = 1 + f$ (implied leading 1)
- Denormalized value:
The exponent field is all 0s.
Exponent field:
the exponent value is $E = 1 - Bias$.
Fraction field:
the mantissa is $M = f$ (no implied leading 1).
① Denormalized numbers provide a way to represent 0: +0.0 (all bits 0) and -0.0 (sign bit 1, all other bits 0)
② They represent numbers very close to 0.0 through gradual underflow, in which the representable values are evenly spaced near 0.0
- Special value:
When the exponent field is all 1s and the fraction field is all 0s, the value represents infinity: s=0 gives positive infinity, s=1 gives negative infinity.
When the exponent field is all 1s and the fraction field is nonzero, the result is "NaN" (not a number).
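The behaviour of denormalized and special values can be observed directly. The sketch below assumes that Python's `float` is an IEEE double (true on common platforms, but not guaranteed by the language), so the constants $2^{-1022}$ and $2^{-1074}$ are specific to that format.

```python
import math
import sys

smallest_normal = sys.float_info.min        # 2**-1022, smallest positive normalized double
smallest_denorm = math.ldexp(1.0, -1074)    # 2**-1074, smallest positive denormalized double

# Gradual underflow: just below the normalized range, values do not jump to 0
# but continue in evenly spaced steps of 2**-1074.
below = smallest_normal - smallest_denorm   # largest denormalized value
print(0.0 < below < smallest_normal)
print(smallest_denorm * 3 - smallest_denorm * 2 == smallest_denorm)

# Both zeros exist; they compare equal but carry different sign bits.
print(0.0 == -0.0, math.copysign(1.0, -0.0))

# Special values: overflow produces infinity, and NaN is not equal to itself.
inf = sys.float_info.max * 2
nan = inf - inf
print(inf == math.inf, math.isnan(nan), nan == nan)
```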
IEEE representation example 1:
Assume an 8-bit floating-point format with k=4 exponent bits and n=3 fraction bits, so the bias is $Bias = 2^{4-1} - 1 = 7$.
$e = e_{k-1} \cdots e_1 e_0$ (unsigned number)
E:
- Normalized: E=e-Bias
- Denormalized: E=1-Bias
$f = 0.f_{n-1} \cdots f_1 f_0$ (binary fraction)
M:
- Normalized: M=1+f
- Denormalized: M=f
$V = (-1)^{s} \times M \times 2^{E}$
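The formulas for this 8-bit format can be checked by decoding a few bit patterns. This is a minimal sketch, not a reference implementation: the name `decode8` and the choice to return `None` for infinity/NaN patterns are assumptions made for this note, and `Fraction` is used so the values print exactly.

```python
from fractions import Fraction

K, N = 4, 3                      # exponent bits, fraction bits
BIAS = 2 ** (K - 1) - 1          # 2^(4-1) - 1 = 7

def decode8(bits):
    """Return (e, E, f, M, V) for an 8-bit pattern; E, M, V are None for infinity/NaN."""
    s = bits >> (K + N)
    e = (bits >> N) & (2 ** K - 1)
    f = Fraction(bits & (2 ** N - 1), 2 ** N)   # f = 0.f2 f1 f0 in binary
    if e == 2 ** K - 1:                         # exponent all 1s: special value
        return e, None, f, None, None
    if e == 0:                                  # denormalized
        E, M = 1 - BIAS, f
    else:                                       # normalized
        E, M = e - BIAS, 1 + f
    V = (-1) ** s * M * Fraction(2) ** E
    return e, E, f, M, V

# Smallest and largest denormalized, smallest normalized, one, largest normalized.
for bits in (0b0_0000_001, 0b0_0000_111, 0b0_0001_000, 0b0_0111_000, 0b0_1110_111):
    print(format(bits, "08b"), decode8(bits))
```

Because the denormalized case uses E=1-Bias rather than E=-Bias, the largest denormalized value (7/512) and the smallest normalized value (8/512) are adjacent, with no gap between the two ranges.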
IEEE representation example two:
Conversion methods between integers and floating point numbers: