FP32, FP16 and INT8
In deep learning and other computing tasks, FP32, FP16, INT8, and INT4 are commonly used data types, each offering a different trade-off between numerical precision and storage/compute requirements.
1. FP32
Single-precision floating point: provides high precision and a wide dynamic range, suitable for most scientific computing and general-purpose computing tasks.
Bit description (32 bits)
Sign bit (sign): 1 bit
Exponent: 8 bits
Mantissa (fraction): 24 bits (23 explicitly stored)
Calculation method: refer to Wikipedia - Single-precision floating-point format
$$\mathrm{value}=(-1)^{\mathrm{sign}}\times 2^{(E-127)}\times\left(1+\sum_{i=1}^{23} b_{23-i}\,2^{-i}\right)$$

Worked example, for the bit pattern `0 01111100 01000000000000000000000`:

- $\mathrm{sign} = b_{31} = 0$
- $E = (b_{30}b_{29}\ldots b_{23})_2 = \sum\limits_{i=0}^{7} b_{23+i}\,2^{i} = 124$
- $1.b_{22}b_{21}\ldots b_{0} = 1 + \sum\limits_{i=1}^{23} b_{23-i}\,2^{-i} = 1 + 1\cdot 2^{-2} = 1.25$

Result:

$$\mathrm{value} = (+1)\times 2^{(124-127)}\times 1.25 = +0.15625$$
The decoding can also be performed automatically with the IEEE-754 Floating Point Converter.
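The decoding formula above can be sketched in Python. This is a minimal illustration, not a library API; `decode_fp32` is a hypothetical helper name, and the bit pattern is the worked example from the text:

```python
import struct

def decode_fp32(bits: int):
    """Decode a 32-bit pattern into (sign, exponent, value) per IEEE 754."""
    sign = (bits >> 31) & 0x1
    exponent = (bits >> 23) & 0xFF       # 8 exponent bits
    fraction = bits & 0x7FFFFF           # 23 stored fraction bits
    # Normalized value: (-1)^sign * 2^(E-127) * (1 + fraction / 2^23)
    value = (-1) ** sign * 2.0 ** (exponent - 127) * (1 + fraction / 2 ** 23)
    return sign, exponent, value

# Worked example from the text: 0x3E200000 encodes +0.15625
sign, exponent, value = decode_fp32(0x3E200000)
print(sign, exponent, value)   # 0 124 0.15625

# Cross-check against the platform's native IEEE-754 decoding
assert struct.unpack(">f", (0x3E200000).to_bytes(4, "big"))[0] == 0.15625
```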
2. FP16
Half-precision floating point: provides lower precision than FP32, but reduces storage space and computational overhead. It is mainly used in compute-intensive tasks such as deep learning and machine learning.
Bit description (16 bits)
Sign bit (sign): 1 bit
Exponent: 5 bits
Mantissa (fraction): 11 bits (10 explicitly stored)
Calculation method: refer to Wikipedia - Half-precision floating-point format
| Exponent | Significand = zero | Significand ≠ zero | Equation |
|---|---|---|---|
| $00000_2$ | zero, −0 | subnormal numbers | $(-1)^{\mathrm{sign}}\times 2^{-14}\times 0.\mathrm{fraction}_2$ |
| $00001_2,\ldots,11110_2$ | normalized value | normalized value | $(-1)^{\mathrm{sign}}\times 2^{E-15}\times 1.\mathrm{fraction}_2$ |
| $11111_2$ | ±infinity | NaN (quiet, signalling) | |
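The three cases in the table can be written as a small Python decoder. This is an illustrative sketch; `decode_fp16` is a hypothetical helper name, assuming the bit layout described above:

```python
import math

def decode_fp16(bits: int) -> float:
    """Decode a 16-bit pattern per the half-precision table above."""
    sign = (bits >> 15) & 0x1
    exponent = (bits >> 10) & 0x1F       # 5 exponent bits
    fraction = bits & 0x3FF              # 10 stored fraction bits
    if exponent == 0:                    # zero or subnormal: 0.fraction, bias -14
        return (-1) ** sign * 2.0 ** -14 * (fraction / 2 ** 10)
    if exponent == 0b11111:              # all-ones exponent: infinity or NaN
        return (-1) ** sign * math.inf if fraction == 0 else math.nan
    # Normalized: (-1)^sign * 2^(E-15) * 1.fraction
    return (-1) ** sign * 2.0 ** (exponent - 15) * (1 + fraction / 2 ** 10)

print(decode_fp16(0x3C00))   # 1.0  (exponent 15, fraction 0)
print(decode_fp16(0xC000))   # -2.0
print(decode_fp16(0x0001))   # 2**-24, the smallest positive subnormal
print(decode_fp16(0x7C00))   # inf
```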
3. INT8
8-bit integer: mainly used to quantize data such as images and audio to reduce computation and storage requirements.
Each value is stored in 8 bits (1 byte) of memory, representing integers from -128 to 127.
Bit description (8 bits)
- The highest bit is the sign bit (0 = positive, 1 = negative); negative values are stored in two's complement.
Maximum value:

$$0\;1\;1\;1\;1\;1\;1\;1$$

$$0\times 2^7 + 1\times 2^6 + 1\times 2^5 + 1\times 2^4 + 1\times 2^3 + 1\times 2^2 + 1\times 2^1 + 1\times 2^0 = 127$$
Minimum value:

$$1\;0\;0\;0\;0\;0\;0\;0$$

In two's complement the sign bit carries weight $-2^7$, so:

$$1\times(-2^7) + 0\times 2^6 + 0\times 2^5 + 0\times 2^4 + 0\times 2^3 + 0\times 2^2 + 0\times 2^1 + 0\times 2^0 = -128$$
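The two's-complement weighting above can be checked with a short Python sketch (`int8_value` is an illustrative helper, not a library function):

```python
def int8_value(bits: int) -> int:
    """Interpret an 8-bit pattern as a two's-complement INT8.

    The sign bit carries weight -2^7; bits 0..6 carry positive weights.
    """
    value = -((bits >> 7) & 1) * 2 ** 7      # sign bit: weight -128
    for i in range(7):                        # bits 0..6: weights 1, 2, ..., 64
        value += ((bits >> i) & 1) * 2 ** i
    return value

print(int8_value(0b01111111))   # 127, the maximum
print(int8_value(0b10000000))   # -128, the minimum
print(int8_value(0b11111111))   # -1 (all bits set: -128 + 127)
```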