Technical information|How to use DolphinDB to solve data accuracy problems

1 Calculation characteristics of DECIMAL

1.1 Calculation method of DECIMAL

A DECIMAL value is stored in two parts: the raw data, which holds the digits as an integer, and the scale, which holds the number of decimal places. For example, the DECIMAL32(2) value 1.23 is stored as (1) raw data: the integer 123, and (2) scale: 2. The advantage of this layout is that the integer raw data can be used directly in calculations, which avoids loss of precision.
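
The raw data plus scale layout can be sketched in a few lines of Python. This is only an illustration of the idea, not DolphinDB's implementation; the class name ScaledDecimal and its methods are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScaledDecimal:
    """Hypothetical stand-in for a DECIMAL value: an integer plus a scale."""
    raw: int    # the digits as an integer, e.g. 123
    scale: int  # the number of decimal places, e.g. 2

    def __add__(self, other):
        # Same-scale addition is plain integer addition, so nothing is rounded.
        if self.scale != other.scale:
            raise ValueError("this sketch only adds values with equal scale")
        return ScaledDecimal(self.raw + other.raw, self.scale)

    def __str__(self):
        sign = "-" if self.raw < 0 else ""
        digits = str(abs(self.raw)).rjust(self.scale + 1, "0")
        if self.scale == 0:
            return sign + digits
        return f"{sign}{digits[:-self.scale]}.{digits[-self.scale:]}"


x = ScaledDecimal(123, 2)   # 1.23
y = ScaledDecimal(456, 2)   # 4.56
total = x + y
print(total)                # 5.79
```

Because the addition happens on 123 and 456 rather than on 1.23 and 4.56, the result 579 is exact, and the decimal point is only reinserted when the value is printed.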

For most calculation functions whose final result is a floating-point number, DECIMAL calculates on the raw data and delays the conversion to floating point for as long as possible to keep the result accurate. For example, to compute avg over the DECIMAL32(2) data [1.11, 2.22, 3.33], whose raw data is [111, 222, 333], first sum the raw data: 111 + 222 + 333 = 666, and then convert to floating point: double(666) / 10^2 / 3 = 2.22.
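
The avg walk-through above can be reproduced as a small Python sketch. The variable names are illustrative, and the Fraction value is only an exact reference for comparison, not part of how DolphinDB computes the result.

```python
from fractions import Fraction

scale = 2
raw = [111, 222, 333]   # raw data of [1.11, 2.22, 3.33] as DECIMAL32(2)

# Step 1: sum the raw integers; integer addition is exact.
raw_sum = sum(raw)      # 666

# Step 2: convert to floating point only once, at the very end.
avg_double = float(raw_sum) / 10**scale / len(raw)

# Exact rational reference value: 666 / (10^2 * 3) = 2.22.
avg_exact = Fraction(raw_sum, 10**scale * len(raw))
print(raw_sum, avg_exact)   # 666 111/50
```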

For the rules of DECIMAL arithmetic operations, see the DECIMAL usage tutorial (Zhihu).

1.2 Calculation output of DECIMAL

This section describes the output types of calculation functions when their input is of DECIMAL type.

Few DolphinDB calculation functions take DECIMAL input and still return DECIMAL output. They are: sum, max, min, first, last, firstNot, and lastNot; their cum, m, tm, and TopN series functions, such as cummax, mmin, tmsum, msumTopN, cumsumTopN, tmsumTopN, etc.; and cumPositiveStreak.

a = decimal64(rand(10,100),4)

typestr(sum(a))
>> DECIMAL128

typestr(cummax(a))
>> FAST DECIMAL64 VECTOR

typestr(mmin(a,5))
>> FAST DECIMAL64 VECTOR

T = 2023.03.23..2023.06.30
typestr(tmsum(T, a, 3))
>> FAST DECIMAL128 VECTOR

typestr(cumPositiveStreak(a))
>> FAST DECIMAL128 VECTOR

Note that since version 2.00.10, which introduced the DECIMAL128 type, the sum series functions and cumPositiveStreak may return a wider type than their input: in the examples above, DECIMAL64 input yields DECIMAL128 output from sum, tmsum, and cumPositiveStreak.

Apart from the functions above, the cum, m, tm, and TopN series functions and their corresponding base functions, such as avg, std, var, and skew, return DOUBLE results when given DECIMAL input.

a = decimal64(rand(10,100),4)

typestr(avg(a))
>> DOUBLE

typestr(cumstd(a))
>> FAST DOUBLE VECTOR

typestr(mvar(a,5))
>> FAST DOUBLE VECTOR

T = 2023.03.23..2023.06.30
typestr(tmskew(T, a, 3))
>> FAST DOUBLE VECTOR

The following uses DECIMAL64 as an example to list the output result types of common calculation functions.  

2 Advantages and Disadvantages of DECIMAL

2.1 Advantages of DECIMAL

There are two main reasons why a real number may not be represented exactly as a floating-point number inside the computer. First, a number such as 0.1 has a finite decimal representation but an infinitely repeating binary one, so only an approximation of 0.1 can be stored. Second, the value may exceed the range that the data type can represent, in which case the system has to adjust the data. Compared with floating-point numbers, the biggest advantage of the DECIMAL type is that it can represent and calculate such values exactly.

For example, to represent 123.0001:

a = 123.0001
print(a)
>> 123.000100000000003

b = decimal64(`123.0001,15)
print(b)
>> 123.000100000000000

It can be seen that floating point numbers cannot accurately represent 123.0001, but DECIMAL can.
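
The same effect can be observed outside DolphinDB. The following Python sketch uses the standard decimal module as a stand-in for an exact decimal type (it is not DolphinDB's DECIMAL) to show that a double stores only a nearby binary value.

```python
from decimal import Decimal

# Decimal(float) converts the stored binary value exactly, which exposes the approximation.
stored_tenth = Decimal(0.1)
print(stored_tenth > Decimal("0.1"))                      # True: the double is slightly above 0.1

# The literal 123.0001 is likewise stored as a nearby binary fraction.
print(Decimal(123.0001) == Decimal("123.0001"))           # False

# The approximation surfaces in ordinary arithmetic on doubles, but not on decimals.
print(0.1 + 0.2 == 0.3)                                   # False
print(Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))  # True
```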

When calculating avg:

a = array(DOUBLE,0)
for (i in 1..100){
	a.append!(123.0000+0.0003*i)
}
avg(a)
>> 123.015149999999
avg(a) == 123.01515
>> false
eqFloat(avg(a),123.01515)
>> true

b= array(DECIMAL64(4),0)
for (i in 1..100){
	b.append!(123.0000+0.0003*i)
}
avg(b)
>> 123.015150000000
typestr(avg(b))
>> DOUBLE
avg(b) == 123.01515
>> true

It can be seen that in the avg calculation the floating-point input does not yield an exact result, whereas the DECIMAL input does, even though its result is also of type DOUBLE.

2.2 Disadvantages of DECIMAL

2.2.1 Easy to overflow

The numerical range of DECIMAL32/DECIMAL64/DECIMAL128 types is as shown in the following table, where S in DECIMAL32(S), DECIMAL64(S) and DECIMAL128(S) represents the number of decimal places retained.

The DECIMAL type is prone to overflow because both its range of valid values and its maximum number of digits are limited. Since version 2.00.10, arithmetic operations handle overflow by widening: if a higher-precision type exists, the data type of the result is expanded automatically, which reduces the risk of overflow:

version 2.00.9.6:

a = decimal32(4.0000,4)
b = decimal32(8.0000,4)
c = a*b
>> Server response: 'c = a * b => Decimal math overflow'

version 2.00.10:
a = decimal32(4.0000,4)
b = decimal32(8.0000,4)
c = a*b
print(c)
>> 32.00000000
typestr(c)
>> DECIMAL64

But even so, DECIMAL128 still has the risk of overflow:

a = decimal128(36.00000000,8)
b = a*a*a*a
>> Server response: 'b = a * a * a * a => Decimal math overflow'

As shown above, because DolphinDB adds the scales of the operands in each DECIMAL multiplication, the expected result of b has a scale of 8 × 4 = 32: 1679616.00000000000000000000000000000000. With 7 integer digits and 32 decimal places, that is 39 digits in total, which exceeds the 38-digit maximum of DECIMAL128 and causes the overflow.
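
The digit count can be checked with a short Python sketch that models a DECIMAL value as a (raw, scale) pair. The helper multiply and the constant MAX_DIGITS are illustrative names, not DolphinDB APIs.

```python
MAX_DIGITS = 38   # maximum number of digits of DECIMAL128, as stated above


def multiply(raw_a, scale_a, raw_b, scale_b):
    """Multiply two (raw, scale) pairs: the raws multiply and the scales add."""
    return raw_a * raw_b, scale_a + scale_b


raw, scale = 36 * 10**8, 8        # 36.00000000 with scale 8
acc_raw, acc_scale = raw, scale
for _ in range(3):                # a * a * a * a
    acc_raw, acc_scale = multiply(acc_raw, acc_scale, raw, scale)

digits = len(str(abs(acc_raw)))
print(acc_scale, digits, digits > MAX_DIGITS)   # 32 39 True
```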

Version 2.00.10 adds decimalMultiply, a multiplication function for the DECIMAL type. Unlike the multiply function (the * operator), it lets you specify the precision of the result. When the accumulation of scales in DECIMAL multiplication risks overflow, you can use decimalMultiply to set the precision of the result as needed.

a = decimal64(36.00000000,8)
decimalMultiply(a, a, 8)
>> 1296.00000000
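
One way to picture a multiplication with a caller-chosen result scale is the Python sketch below. It is not the implementation of decimalMultiply: the function names are hypothetical, and rounding half away from zero is an assumption, since the text above does not say how excess digits are treated.

```python
def multiply_to_scale(raw_a, scale_a, raw_b, scale_b, target_scale):
    """Multiply two (raw, scale) pairs and rescale the product to target_scale.

    Assumption: excess digits are rounded half away from zero.
    """
    raw, scale = raw_a * raw_b, scale_a + scale_b
    if target_scale >= scale:
        return raw * 10 ** (target_scale - scale), target_scale
    divisor = 10 ** (scale - target_scale)
    quotient, remainder = divmod(abs(raw), divisor)
    if 2 * remainder >= divisor:
        quotient += 1
    return (quotient if raw >= 0 else -quotient), target_scale


def to_text(raw, scale):
    """Render a (raw, scale) pair as a decimal string."""
    sign = "-" if raw < 0 else ""
    digits = str(abs(raw)).rjust(scale + 1, "0")
    return sign + (digits if scale == 0 else f"{digits[:-scale]}.{digits[-scale:]}")


a_raw, a_scale = 36 * 10**8, 8    # 36.00000000 with scale 8
product = multiply_to_scale(a_raw, a_scale, a_raw, a_scale, 8)
print(to_text(*product))          # 1296.00000000
```

Keeping the result at scale 8 instead of 16 means repeated multiplication no longer adds 8 digits per step, which is what keeps the digit count within the type's limit.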

2.2.2 Conversion error

When a DECIMAL value is generated directly from a floating-point constant, the floating-point conversion may introduce an error:

a = 0.5599
decimal64(a,4)
>> 0.5598

To avoid this error, you can use a string to generate DECIMAL:

a = "0.5599"
decimal64(a,4)
>> 0.5599
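
The reason the string route is safe is that the digits never pass through a binary floating-point value. A minimal Python sketch of such a parser follows; parse_decimal is a hypothetical helper, and truncating digits beyond the scale is an assumption of the sketch rather than documented DolphinDB behavior.

```python
from decimal import Decimal


def parse_decimal(text, scale):
    """Parse a decimal string into a raw integer with the given scale, without floats.

    Assumption: digits beyond `scale` are truncated.
    """
    negative = text.startswith("-")
    body = text.lstrip("+-")
    integer_part, _, fraction_part = body.partition(".")
    fraction_part = fraction_part[:scale].ljust(scale, "0")
    raw = int((integer_part or "0") + fraction_part)
    return -raw if negative else raw


print(parse_decimal("0.5599", 4))                 # 5599

# The double literal 0.5599, by contrast, is not exactly 0.5599.
print(Decimal(0.5599) == Decimal("0.5599"))       # False
```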

2.2.3 Memory usage

The DECIMAL32, DECIMAL64 and DECIMAL128 types occupy 4 bytes, 8 bytes and 16 bytes respectively in memory, while the FLOAT and DOUBLE types occupy 4 bytes and 8 bytes in memory. Therefore, with the same amount of data, DECIMAL128 takes up twice as much memory as DOUBLE.

Moreover, because the result type is expanded automatically when an arithmetic operation overflows, the memory occupied by the same amount of data doubles with each expansion, which poses some memory risk.

2.2.4 Performance differences

Compared with the FLOAT and DOUBLE types, the DECIMAL type is slower to calculate. Section 3 compares them in detail.

2.2.5 Limitations

In DolphinDB, the DECIMAL type currently supports fewer functions and structures than the FLOAT/DOUBLE types.

  • In terms of function support, a few calculation functions still do not support the DECIMAL type.
  • In terms of calculation results, the cum, tm, m, and TopN series functions and their corresponding base functions (except sum, max, min, firstNot, lastNot, cumPositiveStreak, etc.) return a floating-point type even if the original data is of DECIMAL type.
  • In terms of data structures, DolphinDB currently does not support the DECIMAL type in matrices and sets.
  • In terms of data type conversion, DolphinDB currently does not support converting BOOL/CHAR/SYMBOL/UUID/IPADDR/INT128 and the temporal types to the DECIMAL type. STRING/BLOB data can be converted to the DECIMAL type only if it can be converted to a numeric type.

3 Comparison of computing performance differences between floating point numbers and DECIMAL

In order to compare the computing performance differences between floating-point types and DECIMAL types, we selected some commonly used computing functions to compare the computing time of each data type under the same data.

First, the simulation data script is as follows:

n = 1000000
data1 = rand(float(100.0),n)
data2 = double(data1)
data3 = decimal32(data1,4)
data4 = decimal64(data1,4)
data5 = decimal128(data1,4)

Then, use the timer statement to measure the time that common calculation functions take on each data type:

timer(100){sum(data1)}  // run 100 times to smooth out single-run noise and amplify the timing differences between types
timer(100){sum(data2)}
timer(100){sum(data3)}
timer(100){sum(data4)}
timer(100){sum(data5)}
... ...

The obtained calculation time consumption statistics are as follows (unit: ms):

According to the statistics above:

(1) For most calculation functions, the performance of DECIMAL is worse than FLOAT/DOUBLE;

(2) DECIMAL128 is converted to LONG DOUBLE during calculation (DECIMAL32/DECIMAL64 are converted to DOUBLE). The implementation of LONG DOUBLE depends on the compiler and CPU and may occupy 12 or 16 bytes. LONG DOUBLE multiplication is very time-consuming when the values are large. In the test in this section, the elements of the vector span a large range, so the products are very large and the calculation is slow. If the value range is smaller, such as [0.5, 1.0], DECIMAL128 performs much like DECIMAL32/DECIMAL64;

(3) For functions such as sum, avg, and std, the DECIMAL type does not perform loop unrolling during calculation, and the floating-point type does not perform loop unrolling for sum and avg either. In addition, because a DECIMAL calculation first works on the raw data and only then returns the DECIMAL type directly or converts to a floating-point number, it is essentially an integer calculation, which is more efficient than a floating-point calculation. Therefore, their performance differs little, and DECIMAL32/DECIMAL64 can even outperform FLOAT/DOUBLE.

4 DECIMAL best practice: avoid mavg calculation accuracy loss

This section will use specific scenarios to compare the accuracy differences between the DECIMAL type and the floating-point number type in actual calculations.

Although mavg and moving(avg,…) are identical in meaning, their implementations differ. The mavg algorithm works as follows: as the window moves, the value entering the window is added to the running sum, the value leaving the window is subtracted from it, and the average is then computed. This repeated addition and subtraction is where floating-point precision problems arise.
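
The difference between the two strategies can be sketched in Python. Both functions below are illustrative reimplementations, not DolphinDB's code; they assume a minimum period of 1, matching the mavg(…, 20, 1) calls used in this section, and the sample values are made up.

```python
def moving_avg_incremental(values, window):
    """Running-sum moving average: add the entering value, subtract the leaving one."""
    out, running = [], 0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]
        out.append(running / min(i + 1, window))
    return out


def moving_avg_recompute(values, window):
    """Moving average that re-sums every window from scratch."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


scale = 4
raw_values = [1230000 + 3 * i for i in range(1, 101)]   # made-up values stored as scale-4 integers
float_values = [r / 10**scale for r in raw_values]

# On integers both strategies are exact, so they always agree.
print(moving_avg_incremental(raw_values, 20) == moving_avg_recompute(raw_values, 20))   # True

# On doubles the two strategies round differently and may disagree in the last bits.
print(moving_avg_incremental(float_values, 20) == moving_avg_recompute(float_values, 20))
```

With integer inputs the running sum never accumulates rounding error, which is the property that a DECIMAL input provides through its raw data.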

First, we import the sample data tick.csv, in which all columns are of DOUBLE type, and calculate mavg and moving(avg,…):

data = loadText("<yourDirectory>/tick.csv")

t = select
	MDTime,
	((LastPx - prev(LastPx)) / (prev(LastPx) + 1E-10) * 1000) as val,
	mavg(((LastPx - prev(LastPx)) / (prev(LastPx) + 1E-10) * 1000), 20, 1),
	moving(avg, ((LastPx - prev(LastPx)) / (prev(LastPx) + 1E-10) * 1000), 20, 1)
from data

The result obtained is shown below:

It can be seen that the calculation results start to have errors from 09:31:53.000.

To avoid this calculation error, we first convert the intermediate result to the DECIMAL128 type and then calculate mavg and moving(avg,…):

t = select
	MDTime,
	((LastPx - prev(LastPx)) / (prev(LastPx) + 1E-10) * 1000) as val,
	mavg(decimal128(((LastPx - prev(LastPx)) / (prev(LastPx) + 1E-10) * 1000),12), 20, 1),
	moving(avg, decimal128(((LastPx - prev(LastPx)) / (prev(LastPx) + 1E-10) * 1000),12), 20, 1)
from data

The result obtained is shown below:

It can be seen that the calculation results of mavg and  moving(avg,…) are completely consistent.

Because of floating-point precision problems, the cum, tm, m, and TopN series functions and their corresponding base functions, such as avg and std, may produce precision errors. Where accuracy is the primary concern, we recommend calculating with DECIMAL. Note that only sum, max, min, firstNot, lastNot, cumPositiveStreak, and their corresponding series functions return DECIMAL results for DECIMAL input; other functions return floating-point types. Although converting the DECIMAL type to a floating-point type carries some risk of precision error, the loss of precision during the calculation itself is avoided. You can also apply the round function to the results, or convert them to the DECIMAL type again, to obtain relatively accurate results.

5 Summary

Due to the way they are implemented, floating-point numbers cannot accurately represent certain values inside the computer, so precision errors are prone to occur, causing storage or calculation results to be inconsistent with expectations. DECIMAL can accurately represent numerical values, but it has shortcomings such as easy overflow, large memory usage, poor performance, and certain limitations. Nonetheless, in some scenarios, choosing the DECIMAL type can still be a good way to avoid situations where the storage or calculation results are inconsistent with expectations.

To sum up, in practical applications, it is necessary to consider specific needs, select appropriate data types, and manage data with corresponding accuracy.

Origin blog.csdn.net/qq_41996852/article/details/132317825