NVIDIA CUDA Highly Parallel Processor Programming (5): Floating Point Operations

Floating-point format

In the IEEE-754 floating-point standard, a value consists of three parts: the sign bit (S), the exponent (E), and the mantissa (M). With some exceptions, each (S, E, M) pattern corresponds to a unique value according to the following format:
$value = (-1)^S \times 1.M \times 2^{E-bias}$
S: S = 0 means the number is positive, and S = 1 means it is negative.
E: the exponent field; together with the bias it determines the magnitude of the number, i.e., the position of the binary point.
M: the mantissa (fraction) field; its value is between 0 and 1 and supplies the significant bits after the leading 1.

The normalized representation of M

The formula above requires every value to be put into the form 1.M, so that the mantissa of each floating-point number is unique. For example, the only mantissa allowed for 0.5D (decimal) is M = 0:
$0.5D = 1.0B \times 2^{-1}$
Other forms such as $0.1B \times 2^{0}$ and $10.0B \times 2^{-2}$ are not allowed. Numbers in the 1.M format are called normalized numbers. Since every normalized number starts with 1., the leading 1. is omitted when the floating-point number is stored.

The excess (biased) representation of E

If the exponent E of an IEEE floating-point number occupies e bits, its excess representation is formed by adding the bias $2^{e-1}-1$ to the two's-complement value. The advantage of excess codes is that signed exponents can be compared with an unsigned comparator. For example, the excess representation of a 3-bit exponent looks like this:
[Figure: excess-3 representation of a 3-bit exponent]
Excess-3 means that the bias $2^{3-1}-1 = 011B = 3$ is added to the two's-complement code. Comparing two excess-3 codes as unsigned numbers yields the same ordering as comparing the corresponding two's-complement values, and unsigned comparison is faster than signed comparison.
In a 6-bit format with a 3-bit exponent, 0.5D is represented as:
001000, where S = 0, E = 010, M = (1.)00
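To make the excess coding concrete, here is a small host-side sketch (my own illustration, not from the original post) that prints the excess-3 code of each 3-bit exponent value; because the bias makes every code non-negative, the codes sort correctly as unsigned integers:

```cpp
#include <cstdio>

// Prints the excess-3 code (bias 2^(3-1)-1 = 3) for 3-bit exponent values.
// In IEEE formats the all-zeros and all-ones codes are reserved for
// special values (zero/denormals and infinity/NaN).
int main() {
    for (int e = -3; e <= 4; ++e) {            // codes 000 through 111
        unsigned code = (unsigned)(e + 3);     // add the bias
        printf("exponent %+d -> excess-3 code %u%u%u\n",
               e, (code >> 2) & 1u, (code >> 1) & 1u, code & 1u);
    }
    return 0;
}
```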
In general, with a normalized mantissa and an excess-coded exponent, a number with an n-bit exponent has the value:
$(-1)^{S} \times 1.M \times 2^{E-(2^{n-1}-1)}$
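To connect this formula to a real IEEE number, the following host-side sketch (illustrative; decode_float is a hypothetical helper, not part of the original post) extracts S, E, and M from a single-precision float and rebuilds its value. It assumes a normalized input (E is neither all zeros nor all ones):

```cpp
#include <cstdio>
#include <cstdint>
#include <cstring>
#include <cmath>

// Decode a normalized single-precision float into S, E, M and rebuild
// its value from value = (-1)^S * 1.M * 2^(E - 127).
void decode_float(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);      // reinterpret the bit pattern
    uint32_t S = bits >> 31;                  // 1 sign bit
    uint32_t E = (bits >> 23) & 0xFF;         // 8 exponent bits (excess-127)
    uint32_t M = bits & 0x7FFFFF;             // 23 mantissa bits, implicit 1.
    double mantissa = 1.0 + M / 8388608.0;    // 1.M, where 8388608 = 2^23
    double value = (S ? -1.0 : 1.0) * mantissa * std::pow(2.0, (int)E - 127);
    printf("S=%u E=%u M=0x%06X -> %g\n", S, E, (unsigned)M, value);
}

int main() {
    decode_float(0.5f);    // S=0 E=126 M=0        -> 0.5
    decode_float(-2.5f);   // S=1 E=128 M=0x200000 -> -2.5
    return 0;
}
```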

Representable numbers

The following table lists the representable numbers of a 5-bit IEEE-like format (1 sign bit, 2 exponent bits, 2 mantissa bits). Using the formula above, the no-zero (No_Zero) column of the table can be computed.
[Table: representable numbers in the 5-bit format (no-zero, abrupt underflow, and denormalized representations)]

The representable positive numbers of the no-zero representation are drawn on the number line (the negative numbers are symmetric):
[Figure: representable positive numbers of the no-zero format on the number line]
Five conclusions can be drawn from the two figures above:

  1. The spacing between representable numbers depends on the exponent. There are three major intervals on each side of 0: the exponent has two bits and one pattern (11) is reserved, so the three remaining exponent values produce intervals beginning at $2^{-1} = 0.5D$, $2^{0} = 1.0D$, and $2^{1} = 2.0D$ on the positive side, with three symmetric intervals on the negative side.
  2. The number of representable values in each major interval depends on the number of mantissa bits: with N mantissa bits, each interval contains $2^N$ representable numbers (see the enumeration sketch after this list).
  3. 0 cannot be represented in this format.
  4. The closer to 0, the denser the representable numbers: moving toward 0, each interval is half the size of the previous one.
  5. Item 4 fails right next to 0: there is a gap around 0 in which no numbers are representable.
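The first two conclusions can be checked directly. The sketch below (my own illustration, assuming the 2-bit excess-1 exponent and 2-bit mantissa described above) enumerates all positive numbers of the no-zero format:

```cpp
#include <cstdio>
#include <cmath>

// Enumerates the positive numbers of the 5-bit no-zero format: 2 exponent
// bits with bias 2^(2-1)-1 = 1 (pattern 11 reserved) and 2 mantissa bits
// with an implicit leading 1.
int main() {
    const int bias = 1;
    for (int E = 0; E <= 2; ++E) {             // exponent code 11 is reserved
        for (int M = 0; M < 4; ++M) {          // 2 mantissa bits
            double value = (1.0 + M / 4.0) * std::ldexp(1.0, E - bias);
            printf("E=%d%d M=%d%d -> %.3f\n",
                   (E >> 1) & 1, E & 1, (M >> 1) & 1, M & 1, value);
        }
    }
    return 0;
}
```

The output shows three major positive intervals, $[0.5, 1)$, $[1, 2)$, and $[2, 4)$, each containing $2^2 = 4$ numbers, and nothing at all between 0 and 0.5.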

In normalized floating-point representation, one way to accommodate 0 is abrupt underflow: whenever the exponent field is 0, the value is treated as 0. This widens the region around 0 that collapses to a single value, so accuracy near 0 is poor.
[Figure: representable numbers under abrupt underflow]

In fact, the IEEE standard takes a denormalized approach: when E = 0, the mantissa is no longer of the form 1.XX but of the form 0.XX. If the n-bit exponent field is 0, the value is: $0.M \times 2^{-2^{(n-1)}+2}$

In short, each additional mantissa bit halves the maximum representation error, improving accuracy.
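In real single precision this denormalized (gradual) underflow can be observed directly; a quick sketch in standard C++, assuming the compiler does not flush denormals to zero (e.g., no -ffast-math):

```cpp
#include <cstdio>
#include <cmath>
#include <cfloat>

// Gradual underflow in IEEE single precision: values below the smallest
// normalized number FLT_MIN (2^-126) do not snap to 0 but become
// denormals, reaching down to 2^-149.
int main() {
    float smallest_denormal = std::nextafter(0.0f, 1.0f);  // 2^-149
    printf("smallest normal   : %g\n", FLT_MIN);            // ~1.17549e-38
    printf("smallest denormal : %g\n", smallest_denormal);  // ~1.4013e-45
    printf("FLT_MIN / 16      : %g (still nonzero)\n", FLT_MIN / 16.0f);
    return 0;
}
```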

Special bit patterns and precision in IEEE format

If all exponent bits are 1 and the mantissa is 0, the number represents infinity; if all exponent bits are 1 and the mantissa is nonzero, it represents NaN (Not a Number). All special bit patterns of the IEEE floating-point format are shown below.
[Table: special bit patterns in the IEEE format]
All other numbers are normalized floating-point numbers. A single-precision float has 1 sign bit, 8 exponent bits, and 23 mantissa bits; a double has 1 sign bit, 11 exponent bits, and 52 mantissa bits. Double precision has 29 more mantissa bits than single precision, so its maximum representation error is $1/2^{29}$ of the single-precision error. Its 3 extra exponent bits also greatly extend the range of representable numbers.
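The practical effect of these field widths can be read from the C library's <cfloat> constants; a small sketch for illustration:

```cpp
#include <cstdio>
#include <cfloat>

// Mantissa widths and resulting precision/range of the two IEEE formats.
// FLT_MANT_DIG counts the implicit leading 1, hence the "- 1" to get
// the number of stored mantissa bits.
int main() {
    printf("float : %2d stored mantissa bits, epsilon %g, max %g\n",
           FLT_MANT_DIG - 1, FLT_EPSILON, FLT_MAX);
    printf("double: %2d stored mantissa bits, epsilon %g, max %g\n",
           DBL_MANT_DIG - 1, DBL_EPSILON, DBL_MAX);
    return 0;
}
```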

Meaningless operations such as 0/0, 0 × ∞, ∞/∞, and ∞ − ∞ produce NaN. The IEEE standard defines two kinds of NaN: signaling (sNaN) and quiet (qNaN). A signaling NaN has the most significant mantissa bit cleared, while a quiet NaN has it set.
Using an sNaN as an operand of an arithmetic operation raises an exception, whereas using a qNaN simply produces a qNaN result.
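A short sketch of this NaN behavior (quiet NaNs only; these operations generate qNaNs on common hardware):

```cpp
#include <cstdio>
#include <cmath>

// 0/0 and inf - inf produce quiet NaNs, which propagate through further
// arithmetic; a NaN even compares unequal to itself.
int main() {
    float zero = 0.0f, inf = INFINITY;
    float a = zero / zero;                   // 0/0       -> qNaN
    float b = inf - inf;                     // inf - inf -> qNaN
    float c = a + 1.0f;                      // qNaN in, qNaN out
    printf("0/0=%f  inf-inf=%f  NaN+1=%f\n", a, b, c);
    printf("isnan(a)=%d  (a == a) is %s\n",
           (int)std::isnan(a), (a == a) ? "true" : "false");
    return 0;
}
```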

Algorithm optimization

In matrix multiplication, the dot-product step must sum the products of the input elements. In exact arithmetic, the associativity of addition means the order of these summations does not affect the result; with finite-precision floating point, however, the order does affect the accuracy of the final result. For example, consider summing four numbers in the 5-bit format: $1.00B \times 2^0 + 1.00B \times 2^0 + 1.00B \times 2^{-2} + 1.00B \times 2^{-2}$
If summed sequentially:
$1.00B \times 2^0 + 1.00B \times 2^0 = 1.00B \times 2^1$
$1.00B \times 2^1 + 1.00B \times 2^{-2} = 1.00B \times 2^1$
$1.00B \times 2^1 + 1.00B \times 2^{-2} = 1.00B \times 2^1 = 2.0D$
In the second and third steps, the smaller operand simply disappears: when added to the larger operand, it falls below the least significant bit of the larger operand.
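The same loss happens with real single-precision floats whenever a small addend falls below the last bit of a large partial sum; a minimal sketch:

```cpp
#include <cstdio>

// Order-dependent rounding in float: 2^24 is the last integer at which
// the 23-bit mantissa still has unit resolution, so 1.0f added to it
// one at a time is lost, while pairing the small values first survives.
int main() {
    float big = 16777216.0f;                    // 2^24
    float sequential = (big + 1.0f) + 1.0f;     // each 1.0f rounds away
    float paired = big + (1.0f + 1.0f);         // 2.0f is large enough
    printf("sequential: %.1f\n", sequential);   // 16777216.0
    printf("paired    : %.1f\n", paired);       // 16777218.0
    return 0;
}
```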

Using a parallel algorithm:
$(1.00B \times 2^0 + 1.00B \times 2^0) + (1.00B \times 2^{-2} + 1.00B \times 2^{-2})$
$= 1.00B \times 2^1 + 1.00B \times 2^{-1} = 1.01B \times 2^1 = 2.5D$

The result after changing the order differs from the sequential result: adding the third and fourth values to each other first yields a partial sum large enough that it is not lost when added to the larger partial sum.

Therefore, to maximize the accuracy of such operations, a common technique is to sort the data before the reduction, so that values of similar magnitude are added together.
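In CUDA, the pairwise order arises naturally from a tree-style reduction kernel. Here is a minimal sketch (the kernel name and launch assumptions are mine: one block, blockDim.x a power of two, input length at most twice the block size):

```cuda
// Pairwise (tree) reduction of n floats into out[0]. Each level adds
// partial sums of similar magnitude, matching the accuracy-friendlier
// order discussed above. Launch with shared memory of
// blockDim.x * sizeof(float).
__global__ void pairwiseSum(const float *in, float *out, int n) {
    extern __shared__ float partialSum[];
    unsigned int t = threadIdx.x;
    // Each thread loads up to two elements, padding with 0 past the end.
    float a = (t < n) ? in[t] : 0.0f;
    float b = (t + blockDim.x < n) ? in[t + blockDim.x] : 0.0f;
    partialSum[t] = a + b;
    __syncthreads();
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (t < stride)
            partialSum[t] += partialSum[t + stride];
        __syncthreads();
    }
    if (t == 0) out[0] = partialSum[0];
}
```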

Numerical stability

Let's look at an example of solving a system of equations:

[Figure: the example system of linear equations in matrix form]
Transform the coefficient matrix into the identity matrix, and the solution can be read off:
[Figure: Gaussian elimination steps reducing the system to the identity]
A Gaussian elimination kernel can be designed along the lines of the figure above, with each thread performing, in every iteration, the work required on one row of the matrix. After each division step, all threads synchronize with __syncthreads(); then the subtraction step begins. After the subtraction step, all threads must synchronize again so that the next iteration uses the updated values. A thread idles after completing its assigned work until the back-substitution phase begins.
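A sketch of such a kernel (my own reconstruction of the scheme described above, assuming one block, one thread per row, and an n × (n+1) augmented matrix A in global memory):

```cuda
// Gauss-Jordan elimination with one thread per row. Each iteration scales
// the pivot row (division step), synchronizes, then eliminates the pivot
// column from every other row (subtraction step), and synchronizes again.
// No pivoting: it fails if a pivot element is zero (see below).
__global__ void gaussianEliminate(float *A, int n) {
    int row = threadIdx.x;                        // this thread's row
    for (int p = 0; p < n; ++p) {
        if (row == p) {                           // division step
            float pivot = A[p * (n + 1) + p];
            for (int col = p; col <= n; ++col)
                A[row * (n + 1) + col] /= pivot;
        }
        __syncthreads();                          // wait for the scaled pivot row
        if (row != p) {                           // subtraction step
            float factor = A[row * (n + 1) + p];
            for (int col = p; col <= n; ++col)
                A[row * (n + 1) + col] -= factor * A[p * (n + 1) + col];
        }
        __syncthreads();                          // updated values for next iteration
    }
}
```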

However, simple Gaussian elimination runs into numerical stability problems. For example:
[Figure: a system whose first equation has a zero coefficient for X]
The coefficient of X in the first equation is 0, so Equation 1 cannot be divided by 0, and the algorithm above is numerically unstable for this system.

Therefore, we need an elementary row exchange, swapping Equation 1 with another equation:
[Figure: the system after exchanging rows]

Divide the new Equation 1 by 2, then subtract it from the new Equation 3; from there, the solution proceeds as in the earlier example:
[Figure: the remaining elimination steps]
In general, the row exchange picks the equation whose leading-variable coefficient has the largest absolute value and swaps it with the top equation (partial pivoting). Although pivoting is conceptually simple, it complicates the implementation and hurts performance: in the kernel, threads may have to be reassigned to rows after every exchange.
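For illustration, here is a host-side partial-pivoting step that could precede each iteration of the kernel sketch above (partialPivot is a hypothetical helper, not from the original post):

```cuda
#include <cmath>

// Before eliminating column p, swap row p with the row at or below it
// whose coefficient in column p has the largest absolute value.
void partialPivot(float *A, int n, int p) {
    int best = p;
    for (int r = p + 1; r < n; ++r)
        if (std::fabsf(A[r * (n + 1) + p]) > std::fabsf(A[best * (n + 1) + p]))
            best = r;
    if (best != p) {
        for (int col = 0; col <= n; ++col) {      // swap the full rows
            float tmp = A[p * (n + 1) + col];
            A[p * (n + 1) + col] = A[best * (n + 1) + col];
            A[best * (n + 1) + col] = tmp;
        }
    }
}
```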

Origin: blog.csdn.net/weixin_45773137/article/details/124957447