Computer Systems: A Programmer's Perspective, Chapter 2, Section 2.4: Floating Point

Article link: https://blog.csdn.net/weixin_40199047/article/details/102292088

Floating-point representation:
$V \in \mathbb{R},\; V = x \times 2^y,\; \text{typically used when } |V| \gg 0 \text{ or } |V| \ll 1$

2.4.1 Fractional Binary Numbers

$d_i \in \{0, 1, \cdots, 9\},\Downarrow \\[2ex] d = d_{m}d_{m-1}\cdots d_{1}d_{0} \blue . d_{-1}d_{-2}\cdots d_{-n} = \sum_{i=-n}^{m}10^i \times d_i \\[2ex] \red{\tt {e.g.:}} \\[2ex] 12.34_{10} = 1 \times 10^1 + 2 \times 10^0 + 3 \times 10^{-1} + 4 \times 10^{-2} = 12 \frac{34}{100} \\[2ex] \implies b_i \in \{0,\;1 \},\Downarrow \\[2ex] b = b_{m}b_{m-1}\cdots b_{1}b_{0} \blue . b_{-1}b_{-2}\cdots b_{-n} = \sum_{i=-n}^{m}2^i \times b_i \qquad (2.19) \\[2ex] \red{\tt {e.g.:}} \\[2ex] 101.11_{2} = 1 \times 2^2 + 0 \times 2^1 + 1 \times 2^{0} + 1 \times 2^{-1} + 1 \times 2^{-2} = 5 \frac{3}{4} \\[2ex]$
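Equation (2.19) can be checked mechanically. The following is a minimal sketch, not taken from the book; the function name `binary_fraction_value` is a hypothetical helper, and exact rational arithmetic is used so that no rounding hides the result.

```python
from fractions import Fraction


def binary_fraction_value(bits: str) -> Fraction:
    """Evaluate a string such as '101.11' as the sum of b_i * 2**i."""
    integer_part, _, fraction_part = bits.partition(".")
    value = Fraction(0)
    # Integer digits b_m ... b_0 carry the weights 2**m ... 2**0.
    for i, digit in enumerate(reversed(integer_part)):
        value += int(digit) * Fraction(2) ** i
    # Fractional digits b_-1 ... b_-n carry the weights 2**-1 ... 2**-n.
    for i, digit in enumerate(fraction_part, start=1):
        value += int(digit) * Fraction(1, 2 ** i)
    return value


print(binary_fraction_value("101.11"))  # 23/4, i.e. 5 3/4
```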
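Limitations of the binary representation
Fractional binary notation can exactly represent only numbers of the form $x \times 2^y$. Other values, such as 0.20, can only be approximated, and the approximation improves as more bits are used:
[Figure: binary representation of 0.20]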
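To make the limitation concrete, here is a small sketch that truncates 1/5 to a growing number of fractional bits. The helper `truncate_to_bits` is a hypothetical name, and truncation is only one possible way to approximate; the point is that the error shrinks but never reaches zero.

```python
from fractions import Fraction


def truncate_to_bits(x: Fraction, n: int) -> Fraction:
    """Largest multiple of 2**-n that does not exceed x (assumes x >= 0)."""
    return Fraction(int(x * 2 ** n), 2 ** n)


target = Fraction(1, 5)
for n in (2, 4, 8, 16):
    approx = truncate_to_bits(target, n)
    # The error is printed as a float only for readability.
    print(n, approx, float(target - approx))

# The double closest to 0.2 is itself a binary fraction, so it differs from 1/5.
print(Fraction(0.2) == target)  # False
```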

2.4.2 IEEE Floating-Point Representation

The IEEE floating-point standard represents a number in the form
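$V=(-1)^s \times M \times 2^E \\[2ex] \left. \begin{array}{ll} \text{Sign} & s \text{ determines whether the number is negative } (s = 1) \text{ or positive } (s = 0); \text{ the value 0 is handled as a special case.} \\[2ex] \text{Significand} & M \text{ is a fractional binary number, } M \in [1,\; 2 - \epsilon] \text{ or } M \in [0,\; 1 - \epsilon]. \\[2ex] \text{Exponent} & E \text{ weights the value by a (possibly negative) power of 2.} \\[2ex] \end{array} \right. \\[2ex] \left. \begin{array}{l} \text{The single sign bit } s \text{ directly encodes the sign } s. \\[2ex] \text{The } k\text{-bit exponent field } exp = e_{k-1} \cdots e_1e_0 \text{ encodes the exponent } E. \\[2ex] \text{The } n\text{-bit fraction field } frac = f_{n-1} \cdots f_1f_0 \text{ encodes the significand } M. \\[2ex] \red {\tt {e.g.:}} \\[2ex] \text{Single precision (float): } s \text{ is 1 bit},\; exp \text{ is } k = 8 \text{ bits},\; frac \text{ is } n = 23 \text{ bits} \implies \text{32-bit representation} \\[2ex] \text{Double precision (double): } s \text{ is 1 bit},\; exp \text{ is } k = 11 \text{ bits},\; frac \text{ is } n = 52 \text{ bits} \implies \text{64-bit representation} \\[2ex] \end{array} \right. \\[2ex]$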
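The three fields can be pulled out of a real single-precision value with shifts and masks. This is a sketch using Python's standard `struct` module; `float32_fields` is a hypothetical helper name, and because Python floats are doubles, the value is first rounded to single precision by `struct.pack`.

```python
import struct


def float32_fields(x: float) -> tuple:
    """Return (s, exp, frac) of x after rounding it to single precision."""
    # Pack as a big-endian 32-bit float, then reread the bytes as an unsigned int.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    s = bits >> 31               # 1 sign bit
    exp = (bits >> 23) & 0xFF    # k = 8 exponent bits
    frac = bits & 0x7FFFFF       # n = 23 fraction bits
    return s, exp, frac


print(float32_fields(1.0))    # (0, 127, 0)
print(float32_fields(-0.75))  # (1, 126, 4194304)
```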
Depending on the value of exp, the encoded value falls into one of three cases.

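$\left. \begin{array}{l} \text{Case 1: normalized values} \\[2ex] \text{The bit pattern of } exp \text{ is neither all 0s nor all 1s: } E = e - Bias,\; \text{where } e \text{ is the unsigned number } e_{k-1} \cdots e_1e_0 \text{ and } Bias = 2^{k-1} - 1. \\[2ex] frac \text{ is interpreted as } f \in [0,1) \text{ with binary representation } 0.f_{n-1} \cdots f_1f_0,\; \text{and } M=1+f. \\[2ex] \text{Case 2: denormalized values} \\[2ex] \text{The exponent field is all 0s} \implies E = 1 - Bias,\; M = f. \\[2ex] \text{Case 3: special values} \\[2ex] \text{The exponent field is all 1s. Fraction field all 0s} \implies \begin{cases} s = 0, & +\infty \\[2ex] s = 1, & -\infty \end{cases} \quad \text{Fraction field nonzero} \implies NaN \\[2ex] \end{array} \right. \\[2ex]$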
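The three cases translate almost line for line into code. The sketch below is an illustration under stated assumptions, not a reference decoder: `decode_ieee` is a hypothetical name, finite results are returned as exact `Fraction` values (so the sign of zero is lost), and infinities and NaN are returned as ordinary Python floats.

```python
from fractions import Fraction


def decode_ieee(bits: int, k: int = 8, n: int = 23):
    """Decode a (1 + k + n)-bit pattern according to the three cases."""
    s = bits >> (k + n)
    e = (bits >> n) & ((1 << k) - 1)
    f = Fraction(bits & ((1 << n) - 1), 1 << n)  # f lies in [0, 1)
    bias = (1 << (k - 1)) - 1
    sign = -1 if s else 1
    if e == (1 << k) - 1:              # case 3: special values
        return sign * float("inf") if f == 0 else float("nan")
    if e == 0:                         # case 2: denormalized
        E, M = 1 - bias, f
    else:                              # case 1: normalized
        E, M = e - bias, 1 + f
    return sign * M * Fraction(2) ** E


print(decode_ieee(0x3F800000))                           # 1
print(decode_ieee(0x00000001) == Fraction(1, 2 ** 149))  # True
print(decode_ieee(0x7F800000))                           # inf
```

With the defaults `k = 8` and `n = 23` the function decodes single precision; passing `k = 11, n = 52` decodes a 64-bit pattern the same way.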


$\left. \begin{array}{l} \text{Example values:} \\[2ex] \text{1. The value } +0.0 \text{: every bit is 0} \implies V = 0 \\[2ex] \text{2. Smallest positive denormalized value: } M = f = 2^{-n},\; E = 1 - (2^{k-1} - 1) = -2^{k-1} + 2 \\[2ex] \implies V = 2^{-n-2^{k-1}+2} \\[2ex] \text{3. Largest denormalized value: } M = f = 1 - 2^{-n},\; E = 1 - (2^{k-1} - 1) = -2^{k-1} + 2 \\[2ex] \implies V = (1 - 2^{-n}) \times 2^{-2^{k-1}+2} = (1 - \epsilon ) \times 2^{-2^{k-1}+2} \\[2ex] \text{4. Smallest positive normalized value: } M = 1,\; E = 1 - (2^{k-1} - 1) = -2^{k-1} + 2 \\[2ex] \implies V = 2^{-2^{k-1}+2} \\[2ex] \text{5. The value } 1.0 \text{: } M = 1,\; E = 0,\; V = 2^0 = 1 \\[2ex] \text{6. Largest normalized value: } f = 1 - 2^{-n},\; M = 2 - 2^{-n},\; E = 2^{k-1} - 1 \\[2ex] \implies V = (2 - 2^{-n}) \times 2^{2^{k-1} - 1 } = (2 - \epsilon ) \times 2^{2^{k-1} - 1 }\\[2ex] = (1 - 2^{-n - 1}) \times 2^{2^{k-1} } \\[2ex] \red{\tt{e.g.:}} \text{ Converting an integer value to IEEE floating-point form} \\[2ex] \text{0x00359141} = [0000\; 0000\; 0011\; 0101\; 1001\; 0001\; 0100\; 0001] \\[2ex] \text{Shift the binary point 21 positions to the left: } \text{0x00359141} = 1.101011001000101000001_2 \times 2^{21} \\[2ex] \text{Drop the leading 1 and append two 0s to fill the 23-bit fraction field} \Downarrow \\[2ex] [101\; 0110\; 0100\; 0101\; 0000\; 0100] \\[2ex] 21 + Bias = 21 + 2^{8-1} - 1 = 148 = [1001\; 0100] \\[2ex] \text{Prepend the sign bit 0} \Downarrow \\[2ex] \implies [0100\; 1010\; 0101\; 0110\; 0100\; 0101\; 0000\; 0100] \\[2ex] = \text{0x4A564504} \\[2ex] \text{Compare the two bit patterns} \Downarrow \\[2ex] \end{array} \right.$

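0x00359141	0	0	0	0	0	0	0	0	0	0	1	`1`	0	1	0	1	1	0	0	1	0	0	0	1	0	1	0	0	0	0	0	`1`
0x4A564504			0	1	0	0	1	0	1	0	0	`1`	0	1	0	1	1	0	0	1	0	0	0	1	0	1	0	0	0	0	0	`1`	0	0
The bits between the backtick-marked positions are identical in both rows: the 21 low-order bits of the integer reappear as the 21 high-order bits of the fraction field.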
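The conversion steps can be replayed in code and cross-checked against the machine's own encoding. This is a sketch under a stated assumption: `int_to_float32_bits` (a hypothetical name) handles only positive integers that fit in 24 bits, so no rounding is ever needed.

```python
import struct


def int_to_float32_bits(value: int) -> int:
    """Encode a positive integer below 2**24 as a single-precision bit pattern."""
    if not 0 < value < (1 << 24):
        raise ValueError("this sketch covers only exactly representable integers")
    E = value.bit_length() - 1               # position of the leading 1
    frac = (value - (1 << E)) << (23 - E)    # drop the leading 1, pad to 23 bits
    exp = E + 127                            # add Bias = 2**(8-1) - 1
    return (exp << 23) | frac                # sign bit is 0


bits = int_to_float32_bits(0x00359141)
print(hex(bits))  # 0x4a564504

# Cross-check against the encoding produced by struct.
(expected,) = struct.unpack(">I", struct.pack(">f", float(0x00359141)))
print(bits == expected)  # True
```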

2.4.4 Rounding

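The IEEE floating-point format defines four rounding modes. Round-to-even (also called round-to-nearest) is the default.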

Mode	1.40	1.60	1.50	2.50	-1.50
Round-to-even	1	2	2	2	-2
Round-toward-zero	1	1	1	2	-1
Round-down	1	1	1	2	-2
Round-up	2	2	2	3	-1
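The table can be reproduced with Python's `decimal` module, whose rounding constants behave like the four modes when rounding to an integer. This is an analogy for illustration, not the hardware floating-point rounding itself; the dictionary `MODES` and the helper `round_all` are hypothetical names.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_DOWN, ROUND_FLOOR, ROUND_CEILING

# Assumed correspondence between the table's rows and decimal's rounding constants.
MODES = {
    "round-to-even": ROUND_HALF_EVEN,
    "round-toward-zero": ROUND_DOWN,
    "round-down": ROUND_FLOOR,
    "round-up": ROUND_CEILING,
}


def round_all(value: str) -> dict:
    """Round a decimal string to an integer under each of the four modes."""
    x = Decimal(value)
    return {
        name: int(x.quantize(Decimal("1"), rounding=mode))
        for name, mode in MODES.items()
    }


for v in ("1.40", "1.60", "1.50", "2.50", "-1.50"):
    print(v, round_all(v))
```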

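$\left. \begin{array}{ll} \text{Round-toward-zero } \rightarrow_0 & |\hat x| \leq |x| \\[2ex] \text{Round-down } \downarrow & x^- \leq x \\[2ex] \text{Round-up } \uparrow & x \leq x^+ \\[2ex] \end{array} \right. \\[2ex] \left. \begin{array}{l} \text{In particular, round-to-even rounds to the nearest value;} \\[2ex] \text{a value exactly halfway between two candidates is rounded so that the least significant digit of the result is even.} \\[2ex] \text{Rounding to the nearest hundredth:} \\[2ex] 1.2349999 \implies 1.23 \\[2ex] 1.2350001 \implies 1.24 \\[2ex] 1.2350000 \text{ and } 1.2450000 \implies 1.24 \text{ (both are halfway, and 4 is even)} \\[2ex] \text{Only a binary pattern of the form } \mathrm{XX \cdots X} \blue . \mathrm{YY \cdots Y100 \cdots} \text{ (X and Y arbitrary, rightmost Y the position rounded to) is exactly halfway.} \\[2ex] \text{Rounding to the nearest quarter (2 bits to the right of the binary point)} \Downarrow \\[2ex] 10.00011_2( 2 \frac {3}{32}) \downarrow \implies 10.00_2( 2 ) \text{ (not halfway)} \\[2ex] 10.00110_2( 2 \frac {3}{16}) \uparrow \implies 10.01_2( 2 \frac {1}{4}) \text{ (not halfway)} \\[2ex] 10.11100_2( 2 \frac {7}{8}) \uparrow \implies 11.00_2( 3 ) \text{ (halfway, rounds to the even } 11.00_2) \\[2ex] 10.10100_2( 2 \frac {5}{8}) \downarrow \implies 10.10_2( 2 \frac {1}{2}) \text{ (halfway, rounds to the even } 10.10_2) \\[2ex] \end{array} \right.$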
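The four binary examples can be verified with exact rationals. In this sketch `round_to_even` is a hypothetical helper; it relies on the documented behavior that `round()` applied to a `Fraction` returns the nearest integer and rounds halfway cases to even.

```python
from fractions import Fraction


def round_to_even(x: Fraction, frac_bits: int) -> Fraction:
    """Round x to a multiple of 2**-frac_bits, with ties going to an even last bit."""
    scale = 2 ** frac_bits
    # Scaling turns "last kept bit is even" into "integer is even".
    return Fraction(round(x * scale), scale)


examples = [
    Fraction(67, 32),  # 10.00011_2 = 2 3/32
    Fraction(35, 16),  # 10.00110_2 = 2 3/16
    Fraction(23, 8),   # 10.11100_2 = 2 7/8
    Fraction(21, 8),   # 10.10100_2 = 2 5/8
]
for x in examples:
    print(x, "->", round_to_even(x, 2))
```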