Comparison of the Python + HDF5 factor calculation scheme and DolphinDB in-database factor calculation

In quantitative trading, computing high-frequency factors from L1/L2 quotes and high-frequency trade data is a very common research requirement. Ten years of L2 historical data for the entire domestic (Chinese) market currently amounts to roughly 20-50 TB, with about 10-20 GB of new data added every day. Traditional relational databases such as MS SQL Server or MySQL struggle to support data at this scale; even with database and table sharding, query performance falls far short of requirements. As a result, some users turn to distributed file systems, store the data in HDF5 files, and combine them with Python for quantitative finance calculations.

Although the HDF5 storage solution can handle massive amounts of high-frequency data, it still has pain points: data permission management is difficult, joining different datasets is inconvenient, retrieval and querying are cumbersome, and performance often has to be improved through data redundancy. In addition, reading the data into Python for calculation spends extra time on data transfer.

DolphinDB is an analytical distributed time-series database. More and more securities firms and private funds have begun to use DolphinDB to store high-frequency data and compute factors on it, and many customers who still use the Python + HDF5 solution for high-frequency factor calculation have shown strong interest in DolphinDB. We therefore wrote this article comparing the Python + HDF5 factor calculation scheme with DolphinDB's in-database factor calculation, for reference.

This article will introduce how to implement factor calculation based on Python + HDF5 and DolphinDB, and compare the calculation performance of the two.

1. Test environment and test data

1.1 Test environment

This test compares factor calculation implemented with Python + HDF5 against DolphinDB's in-database calculation. Specifically:

  • The Python + HDF5 scheme relies on libraries such as NumPy, Pandas, the DolphinDB Python API, and Joblib.
  • The DolphinDB in-database scheme uses DolphinDB Server as the computing platform, deployed as a single node for this test.

The hardware and software environment information required for testing is as follows:

  • Hardware environment

| Hardware | Configuration |
|----------|---------------|
| CPU | Intel(R) Xeon(R) Silver 4210R CPU @ 2.40GHz |
| Memory | 512 GB |
| Disk | SSD 500 GB |

  • Software environment

| Software | Version |
|----------|---------|
| Operating system | CentOS Linux release 7.9.2009 (Core) |
| DolphinDB | V2.00.8.7 |
| Python | V3.6.8 |
| NumPy | V1.19.3 |
| Pandas | V1.1.5 |

1.2 Test data

This test uses part of the Shenzhen Stock Exchange snapshot data for the three trading days from 2020.01.02 to 2020.01.06. The data covers about 1,992 stocks, with an average of about 7.6 million rows per day and roughly 22 million rows in total. The initial data is stored in DolphinDB, and the HDF5 data files are exported from DolphinDB.
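Such an export (presumably what the appendix script gentHdf5Files.py does) could look roughly like the sketch below, written against the DolphinDB Python API. The server address, credentials, and the list of trading days are assumptions; the table path, output directory, file-name pattern, and HDF5 key follow the code shown later in sections 3.1 and 3.2.

# A rough sketch (not the original gentHdf5Files.py) of exporting the snapshot data
# from DolphinDB into one HDF5 file per stock per day. Host, port, credentials and
# the date list are assumptions; the table path, directory layout, file-name pattern
# and the "Table" key follow the code in sections 3.1 and 3.2.
import os
import dolphindb as ddb
import pandas as pd

s = ddb.session()
s.connect("localhost", 8848, "admin", "123456")

outDir = "/data/stockdata"                          # directory used in section 3.2
dates = ["2020.01.02", "2020.01.03", "2020.01.06"]  # the three trading days (assumed)

for date in dates:
    # pull one day of snapshot data into a pandas DataFrame
    df = s.run(f'select * from loadTable("dfs://LEVEL2_SZ","Snap") where TradeDate = {date}')
    dayDir = os.path.join(outDir, date)
    os.makedirs(dayDir, exist_ok=True)
    for secID, secDf in df.groupby("SecurityID"):
        store = pd.HDFStore(os.path.join(dayDir, f"data_{date}_{secID}.h5"))
        store["Table"] = secDf                      # same key as loadData() in section 3.2
        store.close()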

The snapshot table test data has a total of 55 fields in DolphinDB, some of which are shown as follows:

| # | Field name | Data type | # | Field name | Data type |
|---|------------|-----------|---|------------|-----------|
| 1 | TradeDate | DATE | 29 | OfferPX1 | INT |
| 2 | OrigTime | TIMESTAMP | 30 | BidPX1 | DOUBLE |
| 3 | dbtime | TIMESTAMP | 31 | OfferSize1 | DOUBLE |
| 4 | SecurityID | SYMBOL | 32 | BidSize1 | INT |
| 5 | …… | …… | 33 | …… | …… |

Some data examples are as follows:

| record | TradeDate  | OrigTime                | SendTime                | Recvtime                | dbtime                  | ChannelNo | SecurityID | SecurityIDSource | MDStreamID | OfferPX1 | BidPX1 | OfferSize1 | BidSize1 | OfferPX2 | BidPX2 | OfferSize2 | BidSize2 | OfferPX3 | BidPX3 | OfferSize3 | BidSize3 | OfferPX4 | BidPX4 | OfferSize4 | BidSize4 | OfferPX5 | BidPX5 | OfferSize5 | BidSize5 | OfferPX6 | BidPX6 | OfferSize6 | BidSize6 | OfferPX7 | BidPX7 | OfferSize7 | BidSize7 | OfferPX8 | BidPX8 | OfferSize8 | BidSize8 | OfferPX9 | BidPX9 | OfferSize9 | BidSize9 | OfferPX10 | BidPX10 | OfferSize10 | BidSize10 | NUMORDERS_B1 | NOORDERS_B1 | ORDERQTY_B1 | NUMORDERS_S1 | NOORDERS_S1 | ORDERQTY_S1 |
|--------|------------|-------------------------|-------------------------|-------------------------|-------------------------|-----------|------------|------------------|------------|----------|--------|------------|----------|----------|--------|------------|----------|----------|--------|------------|----------|----------|--------|------------|----------|----------|--------|------------|----------|----------|--------|------------|----------|----------|--------|------------|----------|----------|--------|------------|----------|----------|--------|------------|----------|-----------|---------|-------------|-----------|--------------|-------------|-------------|--------------|-------------|-------------|
| 0      | 2020.01.02 | 2020.01.02 09:00:09.000 | 2020.01.02 09:00:09.264 | 2020.01.02 09:00:09.901 | 2020.01.02 09:00:09.902 | 1,014     | 14         | 102              | 10         | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0         | 0       | 0           | 0         | 0            | 0           |             | 0            | 0           |             |
| 1      | 2020.01.02 | 2020.01.02 09:01:09.000 | 2020.01.02 09:01:09.330 | 2020.01.02 09:01:09.900 | 2020.01.02 09:01:09.904 | 1,014     | 14         | 102              | 10         | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0         | 0       | 0           | 0         | 0            | 0           |             | 0            | 0           |             |
| 2      | 2020.01.02 | 2020.01.02 09:02:09.000 | 2020.01.02 09:02:09.324 | 2020.01.02 09:02:10.131 | 2020.01.02 09:02:10.139 | 1,014     | 14         | 102              | 10         | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0         | 0       | 0           | 0         | 0            | 0           |             | 0            | 0           |             |
| 3      | 2020.01.02 | 2020.01.02 09:03:09.000 | 2020.01.02 09:03:09.223 | 2020.01.02 09:03:10.183 | 2020.01.02 09:03:10.251 | 1,014     | 14         | 102              | 10         | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0         | 0       | 0           | 0         | 0            | 0           |             | 0            | 0           |             |
| 4      | 2020.01.02 | 2020.01.02 09:04:09.000 | 2020.01.02 09:04:09.227 | 2020.01.02 09:04:10.188 | 2020.01.02 09:04:10.215 | 1,014     | 14         | 102              | 10         | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0         | 0       | 0           | 0         | 0            | 0           |             | 0            | 0           |             |
| 5      | 2020.01.02 | 2020.01.02 09:05:09.000 | 2020.01.02 09:05:09.223 | 2020.01.02 09:05:09.930 | 2020.01.02 09:05:09.936 | 1,014     | 14         | 102              | 10         | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0         | 0       | 0           | 0         | 0            | 0           |             | 0            | 0           |             |
| 6      | 2020.01.02 | 2020.01.02 09:06:09.000 | 2020.01.02 09:06:09.218 | 2020.01.02 09:06:10.040 | 2020.01.02 09:06:10.044 | 1,014     | 14         | 102              | 10         | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0         | 0       | 0           | 0         | 0            | 0           |             | 0            | 0           |             |
| 7      | 2020.01.02 | 2020.01.02 09:07:09.000 | 2020.01.02 09:07:09.224 | 2020.01.02 09:07:09.922 | 2020.01.02 09:07:09.925 | 1,014     | 14         | 102              | 10         | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0         | 0       | 0           | 0         | 0            | 0           |             | 0            | 0           |             |
| 8      | 2020.01.02 | 2020.01.02 09:08:09.000 | 2020.01.02 09:08:09.220 | 2020.01.02 09:08:10.137 | 2020.01.02 09:08:10.154 | 1,014     | 14         | 102              | 10         | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0         | 0       | 0           | 0         | 0            | 0           |             | 0            | 0           |             |
| 9      | 2020.01.02 | 2020.01.02 09:09:09.000 | 2020.01.02 09:09:09.215 | 2020.01.02 09:09:10.175 | 2020.01.02 09:09:10.198 | 1,014     | 14         | 102              | 10         | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0        | 0      | 0          | 0        | 0         | 0       | 0           | 0         | 0            | 0           |             | 0            | 0           |             |

2. Comparison of high-frequency factors and code implementation

In this section, two high-frequency factors, the flow state factor (flow) and the weighted skewness factor (mathWghtSkew), are implemented in both systems and used for the comparison.

2.1 High-frequency factors

  • flow factor
    • Factor introduction: The flow state factor is calculated on historical data. It takes four fields as input: buy_vol, sell_vol, askPrice1, and bidPrice1, and is computed with functions such as mavg and iif. The result reflects the state of the buyer's capital flow.
    • Formula: flow factor calculation formula (reconstructed after this list)

  • mathWghtSkew factor
    • Factor introduction: Takes multiple levels of quote data (prices BidPX1, BidPX2, ..., BidPX10) and calculates the weighted skewness across these quote levels.
    • Formula: mathWghtSkew factor calculation formula (reconstructed after this list)
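Since the original formula images are not reproduced here, the formulas below are reconstructed from the DolphinDB code in section 2.2 (the rounding to five decimals and the degenerate 0.5 fallback branch in the code are omitted):

\[
\mathrm{buy\_prop}_t = \frac{\mathrm{mavg}_{60}(\mathrm{buy\_vol})_t}{\mathrm{mavg}_{60}(\mathrm{buy\_vol})_t + \mathrm{mavg}_{60}(\mathrm{sell\_vol})_t},
\qquad
\mathrm{spd}_t = \max(\mathrm{askPrice1}_t - \mathrm{bidPrice1}_t,\ 0)
\]
\[
\mathrm{flow}_t =
\begin{cases}
0, & \mathrm{mavg}_{60}(\mathrm{spd})_t = 0\\
\mathrm{buy\_prop}_t \,/\, \mathrm{mavg}_{60}(\mathrm{spd})_t, & \text{otherwise}
\end{cases}
\]

For mathWghtSkew, over the \(n = 10\) bid price levels \(x_1, \dots, x_n\) with weights \(w = (10, 9, \dots, 1)\):

\[
\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i},
\qquad
\sigma_w^2 = \frac{\sum_{i=1}^{n} w_i (x_i - \bar{x}_w)^2}{\sum_{i=1}^{n} w_i}
\]
\[
\mathrm{skew} = \frac{\sqrt{n(n-1)}}{n-2} \cdot \frac{\sum_{i=1}^{n} (x_i - \bar{x}_w)^3}{n\,\sigma_w^{3}}
\qquad (\mathrm{skew} = 0 \text{ if } \sigma_w = 0)
\]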

2.2 Factor implementation in DolphinDB

In this section, we implement the flow and mathWghtSkew factors in the DolphinDB scripting language. DolphinDB provides mavg (from the m-series of sliding-window functions) and rowWavg (from the row-series of row-wise functions), both with targeted performance optimizations. These two function families not only make the two factors quick and easy to develop, but also deliver excellent computational performance, as the following code shows:

  • The implementation code of the flow factor:
def flow(buy_vol, sell_vol, askPrice1, bidPrice1){
	buy_vol_ma = round(mavg(buy_vol, 60), 5)
	sell_vol_ma = round(mavg(sell_vol, 60), 5)
	buy_prop = iif(abs(buy_vol_ma+sell_vol_ma) < 0, 0.5 , buy_vol_ma/ (buy_vol_ma+sell_vol_ma))
	spd_tmp = askPrice1 - bidPrice1
	spd = iif(spd_tmp  < 0, 0, spd_tmp)
	spd_ma = round(mavg(spd, 60), 5)
	return iif(spd_ma == 0, 0, buy_prop / spd_ma)
}
  • Implementation code of mathWghtSkew factor:
def mathWghtCovar(x, y, w){
	v = (x - rowWavg(x, w)) * (y - rowWavg(y, w))
	return rowWavg(v, w)
}
def mathWghtSkew(x, w){
	x_var = mathWghtCovar(x, x, w)
	x_std = sqrt(x_var)
	x_1 = x - rowWavg(x, w)
	x_2 = x_1*x_1
	len = size(w)
	adj = sqrt((len - 1) * len) \ (len - 2)
	skew = rowWsum(x_2, x_1) \ (x_var * x_std) * adj \ len
	return iif(x_std==0, 0, skew)
}

2.3 Factor implementation in Python

In this subsection, we implement the flow factor and mathWghtSkew factor in Python.

  • The implementation code of the flow factor:
import numpy as np
import pandas as pd

def flow(df):
    buy_vol_ma = np.round(df['BidSize1'].rolling(60).mean(), decimals=5)
    sell_vol_ma = np.round((df['OfferSize1']).rolling(60).mean(), decimals=5)
    buy_prop = np.where(abs(buy_vol_ma + sell_vol_ma) < 0, 0.5, buy_vol_ma / (buy_vol_ma + sell_vol_ma))
    spd = df['OfferPX1'].values - df['BidPX1'].values
    spd = np.where(spd < 0, 0, spd)
    spd = pd.DataFrame(spd)
    spd_ma = np.round((spd).rolling(60).mean(), decimals=5)
    return np.where(spd_ma == 0, 0, pd.DataFrame(buy_prop) / spd_ma)
  • Implementation code of mathWghtSkew factor:
def rowWavg(x, w):
    rows = x.shape[0]
    res = [[0]*rows]
    for row in range(rows):
        res[0][row] = np.average(x[row], weights=w)
    res = np.array(res)
    return res

def rowWsum(x, y):
    rows = x.shape[0]
    res = [[0]*rows]
    for row in range(rows):
        res[0][row] = np.dot(x[row],y[row])
    res = np.array(res)
    return res

def mathWghtCovar(x, y, w):
    v = (x - rowWavg(x, w).T)*(y - rowWavg(y, w).T)
    return rowWavg(v, w)

def mathWghtSkew(x, w):
    x_var = mathWghtCovar(x, x, w)
    x_std = np.sqrt(x_var)
    x_1 = x - rowWavg(x, w).T
    x_2 = x_1*x_1
    len = np.size(w)
    adj = np.sqrt((len - 1) * len) / (len - 2)
    skew = rowWsum(x_2, x_1) / (x_var*x_std)*adj / len
    return np.where(x_std == 0, 0, skew)
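As a quick sanity check (not part of the original benchmark), the two Python implementations above can be exercised on a small synthetic DataFrame. The column names follow the snapshot schema used in this article, while the values are random and purely illustrative:

# Illustrative only: random data with the snapshot column names used in this article.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 300                                     # more than the 60-tick window
df = pd.DataFrame({
    "BidSize1": rng.integers(1, 1000, n).astype(float),
    "OfferSize1": rng.integers(1, 1000, n).astype(float),
    "OfferPX1": 10 + rng.random(n),
})
for i in range(1, 11):                      # ten bid price levels
    df["BidPX%d" % i] = 10 - 0.01 * i + 0.001 * rng.random(n)

flow_res = flow(df)                         # (n, 1) array; first 59 values are NaN
w = np.array([10, 9, 8, 7, 6, 5, 4, 3, 2, 1])
pxs = df[["BidPX%d" % i for i in range(1, 11)]].to_numpy()
np.seterr(divide='ignore', invalid='ignore')
skew_res = mathWghtSkew(pxs, w)             # (1, n) array of weighted skewness values
print(flow_res.shape, skew_res.shape)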

2.4 Summary

From the perspective of code implementation, the two factors are about equally difficult to develop in DolphinDB and in Python. In terms of code size, DolphinDB's row-series functions for row-wise calculation make development somewhat easier. For moving-window calculations, Python combines the rolling function with an aggregate such as mean; DolphinDB can similarly combine its higher-order moving function with aggregates such as mean or avg. With such a generic combination, however, the engine does not know the specific calculation being moved, so it is difficult to apply targeted incremental optimizations. To improve performance, DolphinDB additionally provides the m-series sliding-window functions (such as mavg), each optimized for its specific calculation. Compared with the generic combined approach, these dedicated moving functions can be up to 100 times faster depending on the scenario.

3. Factor calculation efficiency comparison

This section shows how the flow and mathWghtSkew factors are invoked in each scheme and how the calculation is parallelized, and then compares their performance.

3.1 Factor invocation and calculation in DolphinDB

  • Factor invocation and calculation

Among the factors in this comparison, the flow factor requires moving-window calculations, and the mathWghtSkew factor is computed across multiple columns within the same row; in both cases each stock is calculated independently. For such scenarios DolphinDB provides the context by clause, which simplifies time-series operations such as moving-window calculations on grouped data. By grouping on the stock column SecurityID with context by, the DolphinDB SQL engine can run the moving-window calculation for all stocks in one statement, and the row-wise mathWghtSkew factor can be computed in the same query. The calculation code is as follows:

m = 2020.01.02:2020.01.06
w = 10 9 8 7 6 5 4 3 2 1
res = select dbtime, SecurityID,flow(BidSize1,OfferSize1, OfferPX1, BidPX1) as Flow_val,mathWghtSkew(matrix(BidPX1,BidPX2,BidPX3,BidPX4,BidPX5,BidPX6,BidPX7,BidPX8,BidPX9,BidPX10),w) as Skew_val  from loadTable("dfs://LEVEL2_SZ","Snap") where TradeDate between  m context by SecurityID
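For reference, the same query can also be submitted from Python through the DolphinDB Python API, which returns the result table as a pandas DataFrame. This is not how the benchmark was run; host, port, and credentials below are assumptions, and the flow and mathWghtSkew functions from section 2.2 must already be defined in the session.

# Optional: submit the same calculation through the DolphinDB Python API.
# Assumes a single-node server at localhost:8848; the flow and mathWghtSkew
# functions from section 2.2 must already be defined in this session.
import dolphindb as ddb

s = ddb.session()
s.connect("localhost", 8848, "admin", "123456")
script = '''
m = 2020.01.02:2020.01.06
w = 10 9 8 7 6 5 4 3 2 1
select dbtime, SecurityID,flow(BidSize1,OfferSize1, OfferPX1, BidPX1) as Flow_val,mathWghtSkew(matrix(BidPX1,BidPX2,BidPX3,BidPX4,BidPX5,BidPX6,BidPX7,BidPX8,BidPX9,BidPX10),w) as Skew_val from loadTable("dfs://LEVEL2_SZ","Snap") where TradeDate between m context by SecurityID
'''
res = s.run(script)    # result returned as a pandas DataFrame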
  • Parallel calls

The DolphinDB computing framework decomposes a large computing task into multiple subtasks that run in parallel. When the SQL above is executed, DolphinDB automatically determines the data partitions involved and runs the calculation on multiple threads, up to the limit set by the configuration parameters workerNum and localExecutors. In this comparison, however, we need to run the calculation at a specified degree of parallelism, which we control precisely by setting workerNum and localExecutors in the DolphinDB configuration file. In DolphinDB's threading model the task-dispatching worker also takes part in the calculation, so the actual number of worker threads is localExecutors + 1. The relationship between the parameters and the actual number of worker threads is as follows:

| workerNum | localExecutors | Actual number of worker threads |
|-----------|----------------|---------------------------------|
| 1 | 0 | 1 |
| 2 | 1 | 2 |
| 20 | 19 | 20 |

An example of parameter modification is as follows:

workerNum=20
localExecutors=19

By adjusting these parameters, you can control how many CPU cores the DolphinDB SQL engine uses when the factor calculation is invoked.

The parallelism of DolphinDB's currently running tasks can be viewed in the web-based cluster manager; the original article shows a screenshot of a computing job configured with a parallelism of 4.

3.2 Factor invocation and calculation with Python + HDF5

  • HDF5 data file reading

Unlike DolphinDB's in-database calculation, the Python approach must first read the snapshot data from the HDF5 files before computing. Several Python packages can read HDF5; they store data with different internal structures, and the corresponding reading methods and performance differ accordingly.

Here we verify two commonly used methods:

  • pandas HDFStore()
  • pandas read_hdf()

The pandas.read_hdf() method supports reading only the required columns. In practice, however, our tests showed very little difference between the time to read a subset of columns and the time to read all columns with pandas.read_hdf(). Moreover, an HDF5 file saved with pandas.to_hdf() is about 30% larger than the same data saved with pandas.HDFStore(), and because of this larger storage footprint, reading the same data back with pandas.read_hdf() takes longer than with pandas.HDFStore(). This test therefore uses pandas.HDFStore() to store and read the HDF5 data for the calculation. The HDF5 files are stored in the most common way, without deduplication or other targeted storage optimizations, because such optimizations not only introduce data redundancy but also increase the complexity of data usage and management, and in massive-data scenarios that management complexity grows significantly.
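The size and read-time observations above can be checked with a small experiment along the following lines. This is a rough sketch on synthetic data: the original comparison presumably saved via to_hdf in the "table" format (which is what column selection in read_hdf requires), while HDFStore assignment uses the default "fixed" format, and the exact ratios depend on the data.

# Rough sketch on synthetic data; exact sizes and times depend on the data and
# storage options. to_hdf uses format="table" here (required for column selection),
# while store["Table"] = df uses the default "fixed" format.
import os
import time
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(1000000, 20),
                  columns=["col%d" % i for i in range(20)])

df.to_hdf("/tmp/demo_table.h5", key="Table", format="table")   # to_hdf / read_hdf path
store = pd.HDFStore("/tmp/demo_fixed.h5")                      # HDFStore path
store["Table"] = df
store.close()

print("to_hdf size  :", os.path.getsize("/tmp/demo_table.h5"))
print("HDFStore size:", os.path.getsize("/tmp/demo_fixed.h5"))

t0 = time.time()
part = pd.read_hdf("/tmp/demo_table.h5", "Table", columns=["col0", "col1"])  # partial read
t1 = time.time()
full = pd.read_hdf("/tmp/demo_table.h5", "Table")                            # full read
t2 = time.time()
s = pd.HDFStore("/tmp/demo_fixed.h5", mode="r")
whole = s["Table"]
s.close()
t3 = time.time()
print("read_hdf partial:", round(t1 - t0, 3), "s")
print("read_hdf full   :", round(t2 - t1, 3), "s")
print("HDFStore full   :", round(t3 - t2, 3), "s")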

The code for storing and fetching HDF5 using the HDFStore method is as follows:

# Save an HDF5 file
def saveByHDFStore(path,df):
    store = pd.HDFStore(path)
    store["Table"] = df
    store.close()

# Read a single HDF5 file
def loadData(path):
    store = pd.HDFStore(path, mode='r')
    data = store["Table"]
    store.close()
    return data
  • Factor invocation and calculation

For each stock, the daily data files are read in a loop, concatenated into a single DataFrame, and then the factors are calculated. The code is as follows:

import os

def ParallelBySymbol(SecurityIDs,dirPath):
    for SecurityID in SecurityIDs:
        sec_data_list=[]
        for date in os.listdir(dirPath):
            filepath = os.path.join(dirPath,str(date),"data_"+str(date) + "_" + str(SecurityID) + ".h5")
            sec_data_list.append(loadData(filepath))
        df=pd.concat(sec_data_list)
        # Calculate the flow factor
        df_flow_res = flow(df)
        # Calculate the mathWghtSkew factor
        w = np.array([10, 9, 8, 7, 6, 5, 4, 3, 2, 1])
        pxs = np.array(df[["BidPX1","BidPX2","BidPX3","BidPX4","BidPX5","BidPX6","BidPX7","BidPX8","BidPX9","BidPX10"]])
        np.seterr(divide='ignore', invalid='ignore')
        df_skew_res = mathWghtSkew(pxs,w)
  • Parallel calls

Here we use Python's Joblib library to implement multi-process parallel scheduling. We read the list of stocks to be calculated, split it into groups according to the degree of parallelism, and calculate the groups in parallel. The code is as follows:

...
from joblib import Parallel, delayed

# Directory containing the per-stock HDF5 files
pathDir = "/data/stockdata"
# Degree of parallelism
n = 1
# SecurityIDs is the full list of stocks to be calculated
SecurityIDs_splits = np.array_split(SecurityIDs, n)
Parallel(n_jobs=n)(delayed(ParallelBySymbol)(SecurityIDs, pathDir) for SecurityIDs in SecurityIDs_splits)
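For completeness, a fuller (hypothetical) driver could derive the stock list from the HDF5 directory itself and time the whole run, for example as follows. The file-name parsing and the timing wrapper are assumptions and are not part of the original script:

# Hypothetical driver: build SecurityIDs from the file names and time the run.
import os
import time
import numpy as np
from joblib import Parallel, delayed

pathDir = "/data/stockdata"
firstDay = sorted(os.listdir(pathDir))[0]
# file names follow data_<date>_<SecurityID>.h5 (see ParallelBySymbol above)
SecurityIDs = sorted(set(f.split("_")[2].split(".h5")[0]
                         for f in os.listdir(os.path.join(pathDir, firstDay))))

n = 4                                       # degree of parallelism
SecurityIDs_splits = np.array_split(SecurityIDs, n)
start = time.time()
Parallel(n_jobs=n)(delayed(ParallelBySymbol)(ids, pathDir) for ids in SecurityIDs_splits)
print("elapsed:", round(time.time() - start, 1), "s")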

3.3 Computing performance comparison

Building on the previous sections, this section compares the performance of the Python + HDF5 factor calculation at different degrees of parallelism with DolphinDB's in-database factor calculation. The data volume is 1,992 stocks over 3 days, about 22 million rows in total. We vary the degree of parallelism and measure the time each approach takes to compute the factors with different numbers of CPU cores. All tests were performed after clearing the operating system cache. The results are as follows:

| Number of CPU cores | Python + HDF5 (seconds) | DolphinDB (seconds) | (Python + HDF5) / DolphinDB |
|---------------------|-------------------------|---------------------|-----------------------------|
| 1 | 2000 | 21.0 | 95 |
| 2 | 993 | 11.5 | 86 |
| 4 | 540 | 6.8 | 79 |
| 8 | 255 | 5.8 | 44 |
| 16 | 133 | 4.9 | 27 |
| 24 | 114 | 4.3 | 26 |
| 40 | 106 | 4.2 | 25 |

As the comparison shows, with a single core the in-database calculation in DolphinDB is nearly 100 times faster than the Python + HDF5 calculation in this test. As the number of available CPU cores increases, the ratio of DolphinDB's time to Python + HDF5's time gradually approaches about 1:25. Given the characteristics of the two approaches, the main reasons are likely the following:

  • DolphinDB's own storage engine reads data far more efficiently than Python reads HDF5 files stored in the common way.
  • DolphinDB's dedicated m-series sliding-window functions are optimized for each kind of moving-window calculation and therefore provide better computational performance.

Although HDF5 file reading can in principle be improved through redundant storage or other targeted optimizations, these bring extra hardware costs and higher data usage and management costs. By contrast, DolphinDB's own storage system is more efficient and simpler to use. As a result, over the complete process of reading the factor data and computing the factors, DolphinDB's in-database calculation is far faster than the Python + HDF5 approach.

From the performance comparison, the following observations can also be made:

  • In terms of code implementation, DolphinDB's in-database SQL makes it easier to invoke the factor calculation and to run it in parallel.
  • In terms of parallel computing, DolphinDB automatically uses the currently available CPU resources, whereas the Python script has to implement parallel scheduling itself, although this does make the degree of parallelism easier to control.
  • In terms of computation speed, DolphinDB's in-database calculation is more than 25 times faster than the Python + HDF5 approach.

3.4 Comparison of calculation results

In the previous section we compared the performance of the two approaches: DolphinDB's in-database factor calculation is far faster than the Python + HDF5 factor calculation. Fast calculation is only meaningful, however, if the results are correct and consistent. We therefore stored the factor results of the Python + HDF5 run and of the DolphinDB run in separate DolphinDB distributed tables. A sample of the compared results (shown as a screenshot in the original article) is exactly the same, and a full comparison performed with a DolphinDB script also shows the two result sets to be identical.

The full data can also be verified with the following code; an output of int[0] indicates that the two tables have identical contents.

resTb = select * from loadTable("dfs://tempResultDB","result_cyc_test")
resTb2 = select * from loadTable("dfs://tempResultDB","result_cyc_test2")
resTb.eq(resTb2).matrix().rowAnd().not().at()

The verification confirms that the Python + HDF5 and DolphinDB calculations produce exactly the same results.
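If you prefer to double-check from the Python side, the two result tables can also be pulled through the DolphinDB Python API and compared with pandas. Host, port, credentials, and the sort keys are assumptions (the sort keys assume the result tables keep the dbtime and SecurityID columns from section 3.1):

# Optional Python-side check; connection details and sort keys are assumptions.
import dolphindb as ddb

s = ddb.session()
s.connect("localhost", 8848, "admin", "123456")
res1 = s.run('select * from loadTable("dfs://tempResultDB","result_cyc_test")')
res2 = s.run('select * from loadTable("dfs://tempResultDB","result_cyc_test2")')
keys = ["SecurityID", "dbtime"]
res1 = res1.sort_values(keys).reset_index(drop=True)
res2 = res2.sort_values(keys).reset_index(drop=True)
print(res1.equals(res2))    # True means the two result tables are identical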

4. Summary

The comparison in this article shows that, with the same number of CPU cores:

  • DolphinDB's in-database integrated calculation is roughly 25 times faster than the Python + HDF5 factor calculation scheme, and its results are exactly the same as those of the Python approach.
  • The factor development effort of the two schemes is similar.
  • In terms of data management and reading:
    • Reading or writing an HDF5 file is essentially single-threaded, which makes parallel operations on the data difficult. In addition, HDF5 read efficiency depends heavily on how the data is organized into groups and datasets, so the storage layout often has to be designed around the expected read patterns, and redundant copies of the data are needed to serve different scenarios efficiently. Once the data volume grows to the terabyte level, the complexity of managing and operating on the data rises sharply, adding substantial time and storage costs in practice.
    • By contrast, managing, querying, and using data in DolphinDB is much simpler and more convenient. Thanks to DolphinDB's storage engines and partitioning mechanism, users can easily manage and use petabyte-scale and larger datasets in the same way as an ordinary database.

Overall, in a production environment, using DolphinDB for factor calculation and storage is far more efficient than the Python + HDF5 approach.

Appendix

The comparison tests in this tutorial used the following scripts:

createDBAndTable.dos

ddbFactorCal.dos

gentHdf5Files.py

pythonFactorCal.py
