FP32, FP16 and INT8

When it comes to deep learning and computing tasks, FP32, FP16, INT8, and INT4 are commonly used data types for representing different numerical precision and storage requirements.

1. FP32

Single-precision floating point: provides high precision and a wide dynamic range; suitable for most scientific and general-purpose computing tasks.

Bit description (32 bits)

  • Sign bit (sign): 1 bit

  • Exponent: 8 bits

  • Mantissa (fraction): 24 bits (23 explicitly stored)

Calculation method: refer to Wikipedia - Single-precision floating-point format

$\mathrm{value}=(-1)^{\mathrm{sign}}\times 2^{(E-127)}\times\left(1+\sum_{i=1}^{23} b_{23-i}\,2^{-i}\right)$

For example, for the bit pattern 0 01111100 01000000000000000000000:

  • $\mathrm{sign} = b_{31} = 0$

  • $E = (b_{30}b_{29}\ldots b_{23})_2 = \sum\limits_{i=0}^{7} b_{23+i}\,2^{i} = 124$

  • $1.b_{22}b_{21}\ldots b_0 = 1+\sum\limits_{i=1}^{23} b_{23-i}\,2^{-i} = 1 + 1\cdot 2^{-2} = 1.25$

Result:

$\mathrm{value} = (+1)\times 2^{-3}\times 1.25 = +0.15625$

The value can also be computed automatically with the IEEE-754 Floating Point Converter.
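As a quick sanity check, the worked example can be reproduced in Python (a sketch using only the standard `struct` module) by assembling the three fields into a 32-bit pattern and reinterpreting the bytes as a float:

```python
import struct

# Fields from the worked example:
sign = 0
exponent = 124                        # biased exponent: 2**(124 - 127) = 2**-3
fraction = 0b01000000000000000000000  # b21 = 1  ->  significand 1 + 2**-2 = 1.25

bits = (sign << 31) | (exponent << 23) | fraction
value = struct.unpack(">f", bits.to_bytes(4, "big"))[0]
print(hex(bits), value)  # 0x3e200000 0.15625
```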

2. FP16

Half-precision floating point: compared with FP32 it provides lower precision, but reduces storage space and computational overhead. It is mainly used in compute-intensive tasks such as deep learning and machine learning.

Bit description (16 bits)

  • Sign bit (sign): 1 bit

  • Exponent: 5 bits

  • Fraction: 11 bits (10 explicitly stored)

Calculation method: refer to Wikipedia - Half-precision floating-point format

| Exponent | Significand = zero | Significand ≠ zero | Equation |
|---|---|---|---|
| $00000_2$ | zero, −0 | subnormal numbers | $(-1)^{\mathrm{sign}}\times 2^{-14}\times 0.\mathrm{fraction}_2$ |
| $00001_2,\ldots,11110_2$ | normalized value | normalized value | $(-1)^{\mathrm{sign}}\times 2^{E-15}\times 1.\mathrm{fraction}_2$ |
| $11111_2$ | ±infinity | NaN (quiet, signalling) | |
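The three cases in the table above can be exercised directly in Python (a sketch; `struct`'s `e` format code for IEEE-754 half precision requires Python 3.6+):

```python
import math
import struct

def decode_fp16(bits: int) -> float:
    """Reinterpret a 16-bit pattern as an IEEE-754 half-precision float."""
    return struct.unpack(">e", bits.to_bytes(2, "big"))[0]

print(decode_fp16(0x3C00))              # normalized: (+1) * 2**(15-15) * 1.0 = 1.0
print(decode_fp16(0x0001))              # subnormal:  2**-14 * 2**-10 = 2**-24
print(math.isinf(decode_fp16(0x7C00)))  # exponent 11111, fraction = 0 -> infinity
print(math.isnan(decode_fp16(0x7E00)))  # exponent 11111, fraction != 0 -> NaN
```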

3. INT8

8-bit integer: mainly used to quantize data such as images and audio, reducing both computation and storage requirements.

Each value occupies 8 bits (1 byte) of memory and can represent integers from -128 to 127.
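As an illustration of such quantization, here is a minimal sketch of symmetric per-tensor INT8 quantization (one common scheme among several; the function names are made up for this example): each float is scaled so that the largest magnitude maps to 127, then rounded and clipped to the INT8 range.

```python
def quantize_int8(values):
    """Symmetric quantization sketch: floats -> (int8 codes, scale)."""
    scale = (max(abs(v) for v in values) / 127) or 1.0  # avoid scale 0 for all-zero input
    codes = [max(-128, min(127, round(v / scale))) for v in values]
    return codes, scale

def dequantize_int8(codes, scale):
    """Approximate reconstruction of the original floats."""
    return [c * scale for c in codes]

codes, scale = quantize_int8([127.0, -64.0, 1.0])
print(codes, scale)                   # [127, -64, 1] 1.0
print(dequantize_int8(codes, scale))  # values close to the originals
```

In practice the reconstruction is only approximate: every input is snapped to one of 256 levels, which is exactly the precision/storage trade-off INT8 makes.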

Bit description (8 bits)

  • The highest bit is the sign bit (0 - non-negative, 1 - negative); negative values are stored in two's complement

Maximum value:
$0\;1\;1\;1\;1\;1\;1\;1$

$0\times 2^7+1\times 2^6+1\times 2^5+1\times 2^4+1\times 2^3+1\times 2^2+1\times 2^1+1\times 2^0=127$

Minimum value:
$1\;0\;0\;0\;0\;0\;0\;0$

In two's complement the sign bit carries the negative weight $-2^7$:

$-1\times 2^7+0\times 2^6+0\times 2^5+0\times 2^4+0\times 2^3+0\times 2^2+0\times 2^1+0\times 2^0=-128$
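The two boundary patterns can be checked with a small two's-complement decoder (a sketch; `int.from_bytes` with `signed=True` applies exactly the weighting shown above):

```python
def int8_from_bits(bits: int) -> int:
    """Decode an 8-bit pattern as a two's-complement signed integer."""
    assert 0 <= bits <= 0xFF
    return int.from_bytes(bytes([bits]), byteorder="big", signed=True)

print(int8_from_bits(0b01111111))  # 127  (maximum)
print(int8_from_bits(0b10000000))  # -128 (minimum)
print(int8_from_bits(0b11111111))  # -1
```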

Origin blog.csdn.net/m0_70885101/article/details/131555760