DolphinDB Historical Data Playback Tutorial

When a quantitative strategy runs in live trading, the program that processes real-time data is usually event-driven. During strategy development, however, backtesting against historical data is usually not event-driven, so the same strategy requires two sets of code, which is time-consuming and error-prone. In DolphinDB, users can replay historical data into a stream table in chronological order, as if it were arriving in real time, so that a single set of code serves both backtesting and live trading.

DolphinDB's streaming data framework uses a publish-subscribe-consume model. Data producers continuously publish real-time data as streams to all subscribers. After receiving a message, a subscriber can process it with a user-defined function or with DolphinDB's built-in aggregation engines. The DolphinDB streaming API supports multiple languages, including C++, C#, Java, and Python, so users can implement more complex processing logic and integrate with their production environments. For more details, search for the "DolphinDB Streaming Data Tutorial".

This article first introduces the replay and replayDS functions, then demonstrates the playback workflow and its application scenarios with financial data.

1. Function introduction

replay

replay(inputTables, outputTables, [dateColumn], [timeColumn], [replayRate], [parallelLevel=1])

The replay function plays back one or more tables or data sources into the corresponding output tables simultaneously. The user specifies the input tables or data sources, the output tables, the date column, the time column, the playback rate, and the degree of parallelism.

The parameters of the replay function are as follows:

  • inputTables: A single table, or a tuple of tables or data sources (see the introduction to replayDS).
  • outputTables: A single table or a tuple of tables, usually stream tables. Input tables and output tables correspond one-to-one, and each pair must have the same schema.
  • dateColumn, timeColumn: strings specifying the date column and time column of the input table. If neither is specified, the first column is treated as the date column by default. If a column of the input table contains both date and time, set dateColumn and timeColumn to that same column. During playback, the system determines the minimum time precision from dateColumn and timeColumn, and all data with the same time at that precision is output in one batch. For example, if a table has both a date column and a time column but only dateColumn is set in replay, all data of the same day is output in one batch (see the sketch after this list).
  • replayRate: an integer, the number of records replayed per second. Because all data with the same time is output in one batch, when replayRate is smaller than the number of rows in a batch, the actual output rate exceeds replayRate.
  • parallelLevel: an integer, the number of threads reading data. When the source data is larger than available memory, use the replayDS function to split it into several small data sources and read them from disk for playback; multiple reading threads increase the loading speed.
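
For example, when an input table has a single TIMESTAMP column carrying both date and time, passing that column as both dateColumn and timeColumn makes playback proceed at millisecond precision. A minimal sketch (the table and column names here are illustrative):

// Illustrative input: one TIMESTAMP column holds both date and time.
t = table(2007.08.17T09:30:00.000 + (1..5) * 1000 as ts, rand(100.0, 5) as price)
share streamTable(100:0, `ts`price, [TIMESTAMP, DOUBLE]) as tsOut
// Passing the same column as dateColumn and timeColumn replays at millisecond precision;
// passing only a date column would emit the whole day in one batch.
replay(t, tsOut, `ts, `ts, 10)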

replayDS

replayDS(sqlObj, [dateColumn], [timeColumn], [timeRepartitionSchema])

The replayDS function converts an input SQL query into data sources to be used with replay. It splits the original SQL query into several smaller SQL queries in chronological order, based on the partitions of the input table and on timeRepartitionSchema.

The parameters of the replayDS function are as follows:

  • sqlObj: SQL metacode representing the data to be replayed, such as <select * from sourceTable>.
  • dateColumn: a string, the date column. If not specified, the first column is treated as the date column by default. replayDS assumes that the date column is a partitioning column of the source table and splits the original SQL query into multiple queries based on the partition information.
  • timeColumn: a string, the time column, used together with timeRepartitionSchema.
  • timeRepartitionSchema: a vector of time values, such as 08:00:00 .. 18:00:00. If timeColumn is also specified, the SQL query is further split along the time dimension (see the sketch after this list).
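
The timeRepartitionSchema vector need not be written by hand; it can be generated with cutPoints, as the financial example later in this article does. A sketch, assuming a quotes table with date and time columns:

// 61 cut points between 09:30 and 18:00 split each date partition into time buckets.
trs = cutPoints(09:30:00.001..18:00:00.001, 60)
ds = replayDS(<select * from quotes>, `date, `time, trs)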

Playback of a single in-memory table

Replaying a single in-memory table only requires specifying the input table, the output table, the date column, the time column, and the playback rate.

replay(inputTable, outputTable, `date, `time, 10)
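
This call assumes that inputTable and outputTable already exist. A minimal setup might look like the following sketch (the schema and data are illustrative); the output is usually a shared stream table with the same schema as the input:

// Illustrative input: 100 ticks on one day, with separate date and time columns.
inputTable = table(take(2007.08.17, 100) as date, 09:30:00 + (1..100) as time, rand(100.0, 100) as price)
share streamTable(100:0, `date`time`price, [DATE, SECOND, DOUBLE]) as outputTable
// Replay 10 records per second.
replay(inputTable, outputTable, `date, `time, 10)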

Playback of a single table using a data source

When a single table contains too many rows, replayDS can be used for playback. First use replayDS to generate the data sources; in this example, the date column and timeRepartitionSchema are specified. The replay call is similar to that for a single in-memory table, except that the degree of parallelism can also be specified. Internally, replay uses a pipeline framework in which data loading and output run separately. When the input is a set of data sources, multiple blocks of data can be read in parallel so that the output thread does not have to wait. In this example, the parallel level is set to 2, meaning two threads fetch data simultaneously.

inputDS = replayDS(<select * from inputTable>, `date, `time, 08:00:00.000 + (1..10) * 3600000)
replay(inputDS, outputTable, `date, `time, 1000, 2)

Playback of multiple tables using data sources

replay also supports playing back multiple tables simultaneously: pass the input tables to replay as a tuple and specify the output tables correspondingly. The output tables and input tables must correspond one-to-one, and each pair must have the same schema. If a date column or time column is specified, all tables must contain that column.

ds1 = replayDS(<select * from input1>, `date, `time, 08:00:00.000 + (1..10) * 3600000)
ds2 = replayDS(<select * from input2>, `date, `time, 08:00:00.000 + (1..10) * 3600000)
ds3 = replayDS(<select * from input3>, `date, `time, 08:00:00.000 + (1..10) * 3600000)
replay([ds1, ds2, ds3], [out1, out2, out3], `date, `time, 1000, 2)
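
The output tables out1, out2, and out3 are assumed to be stream tables that match the schemas of input1, input2, and input3, for example (a sketch following the pattern used in section 3):

// Each output stream table mirrors the schema of its input table.
share streamTable(100:0, input1.schema().colDefs.name, input1.schema().colDefs.typeString) as out1
share streamTable(100:0, input2.schema().colDefs.name, input2.schema().colDefs.typeString) as out2
share streamTable(100:0, input3.schema().colDefs.name, input3.schema().colDefs.typeString) as out3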

Cancel playback

If the replay function is called via submitJob, use getRecentJobs to obtain the jobId and then cancel the playback with cancelJob:

getRecentJobs()
cancelJob(jobid)
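
Putting this together, a playback submitted as a background job can be cancelled as follows (a sketch: the job name is illustrative, and the data source rds and output table outQuotes follow the financial example below):

// submitJob returns the jobId directly, so it can be captured on submission.
jobId = submitJob("replayQuotes", "replay job demo", replay, [rds], [`outQuotes], `date, `time, 1000, 2)
// Alternatively, look the job up among the recent batch jobs, then cancel it.
getRecentJobs()
cancelJob(jobId)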

If replay is called directly in a session, open another GUI session, use getConsoleJobs to obtain the jobId, and then cancel the playback task with cancelConsoleJob:

getConsoleJobs()
cancelConsoleJob(jobId)

2. How to consume the replayed data

The replayed data is published as streaming data. It can be subscribed to and consumed in the following three ways:

  • Subscribe in DolphinDB and consume the streaming data with a user-defined callback function written in DolphinDB script (see the sketch after this list).
  • Subscribe in DolphinDB and process the streaming data with a built-in streaming engine, such as the time-series aggregation engine, the cross-sectional aggregation engine, or the anomaly detection engine. DolphinDB's built-in aggregation engines perform real-time aggregation over streaming data, are easy to use, and offer excellent performance. In section 3, we use the cross-sectional aggregation engine to process the replayed data and compute the intrinsic value of an ETF; for details on the cross-sectional aggregation engine, see the DolphinDB user manual.
  • Subscribe and consume the data from a third-party client through DolphinDB's streaming API.
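
As a sketch of the first approach (the handler and action name below are illustrative), subscribe to the replayed stream table and process each incoming batch with a user-defined function:

// User-defined handler; with msgAsTable=true, each message is delivered as a table.
def handleQuotes(msg) {
    // Strategy or bookkeeping logic goes here; this sketch just reports the batch size.
    print("received " + string(msg.size()) + " rows")
}
subscribeTable(, "outQuotes", "printQuotes", -1, handleQuotes, true)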

3. Financial example

Replaying one day of US stock market level 1 data and computing an ETF value

In this example, the level 1 quote data of the US stock market on August 17, 2007 is replayed with replayDS, and the ETF value is computed with DolphinDB's built-in cross-sectional aggregation engine. The data is stored in the quotes table of the distributed database dfs://TAQ. Below are the structure and a preview of the quotes table.

//Load the quotes table from the database and view its schema
quotes = database("dfs://TAQ").loadTable("quotes");
quotes.schema().colDefs;

name     typeString  typeInt
time     SECOND      10
symbol   SYMBOL      17
ofrsiz   INT         4
ofr      DOUBLE      16
mode     INT         4
mmid     SYMBOL      17
ex       CHAR        2
date     DATE        6
bidsiz   INT         4
bid      DOUBLE      16

//View the first ten rows of data in the quotes table
select top 10 * from quotes;

symbol  date        time      bid    ofr    bidsiz  ofrsiz  mode  ex  mmid
A       2007.08.17  04:15:06  0.01   0      10      0       12    80
A       2007.08.17  06:21:16  1      0      1       0       12    80
A       2007.08.17  06:21:44  0.01   0      10      0       12    80
A       2007.08.17  06:49:02  32.03  0      1       0       12    80
A       2007.08.17  06:49:02  32.03  32.78  1       1       12    80
A       2007.08.17  07:02:01  18.5   0      1       0       12    84
A       2007.08.17  07:02:01  18.5   45.25  1       1       12    84
A       2007.08.17  07:54:55  31.9   45.25  3       1       12    84
A       2007.08.17  08:00:00  31.9   40     3       2       12    84
A       2007.08.17  08:00:00  31.9   35.5   3       2       12    84

(1) Split the data to be replayed. One day contains 336,305,414 records, so loading it all into memory at once would cause a long delay before playback and could cause an out-of-memory error. Therefore, first use the replayDS function with the timeRepartitionSchema parameter to split the data into 62 parts by timestamp.

sch = select name,typeString as type from quotes.schema().colDefs  // column names and types, used to define the output table in step (2)
trs = cutPoints(09:30:00.001..18:00:00.001, 60)  // 61 cut points that divide the trading day into time buckets
rds = replayDS(<select * from quotes>, `date, `time, trs);

(2) Define the output table outQuotes, which is generally a stream table.

share streamTable(100:0, sch.name,sch.type) as outQuotes

(3) Define the dictionary of stock weights, weights, and the aggregate function etfVal for computing the ETF value. In this example, we only compute the ETF value of AAPL, IBM, MSFT, NTES, AMZN, and GOOG.

defg etfVal(weights,sym, price) {
    return wsum(price, weights[sym])
}
weights = dict(STRING, DOUBLE)
weights[`AAPL] = 0.1
weights[`IBM] = 0.1
weights[`MSFT] = 0.1
weights[`NTES] = 0.1
weights[`AMZN] = 0.1
weights[`GOOG] = 0.5
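
As a quick sanity check, etfVal can be called directly with made-up prices (an illustrative sketch):

// wsum(price, weights[sym]) = 0.1*100 + 0.1*50 = 15
etfVal(weights, `AAPL`IBM, [100.0, 50.0])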

(4) Create the streaming aggregation engine and subscribe to the playback output table outQuotes. When subscribing, we specify a filter condition on the publishing table: only data whose symbol is AAPL, IBM, MSFT, NTES, AMZN, or GOOG is published to the cross-sectional aggregation engine, reducing unnecessary network overhead and data transfer.

setStreamTableFilterColumn(outQuotes, `symbol)  // set the filter column of the publishing table
outputTable = table(1:0, `time`etf, [TIMESTAMP,DOUBLE])
// Cross-sectional engine: compute the weighted ETF value over the latest quotes, keyed by symbol, once per batch.
tradesCrossAggregator=createCrossSectionalAggregator("etfvalue", <[etfVal{weights}(symbol, ofr)]>, quotes, outputTable, `symbol, `perBatch)
// Subscribe from new data onward (-1); only the six filtered symbols are published to the engine.
subscribeTable(,"outQuotes","tradesCrossAggregator",-1,append!{tradesCrossAggregator},true,,,,,`AAPL`IBM`MSFT`NTES`AMZN`GOOG)

(5) Start the playback at a rate of 100,000 records per second; the aggregation engine consumes the replayed data in real time.

submitJob("replay_quotes", "replay_quotes_stream",  replay,  [rds],  [`outQuotes], `date, `time,100000,4)

(6) View the ETF values of the selected stocks at different points in time.

//View the first 15 rows of the outputTable; the time column records when each aggregation was computed
select top 15 * from outputTable;

time                     etf
2019.06.04T16:40:18.476  14.749
2019.06.04T16:40:19.476  14.749
2019.06.04T16:40:20.477  14.749
2019.06.04T16:40:21.477  22.059
2019.06.04T16:40:22.477  22.059
2019.06.04T16:40:23.477  34.049
2019.06.04T16:40:24.477  34.049
2019.06.04T16:40:25.477  284.214
2019.06.04T16:40:26.477  284.214
2019.06.04T16:40:27.477  285.68
2019.06.04T16:40:28.477  285.68
2019.06.04T16:40:29.478  285.51
2019.06.04T16:40:30.478  285.51
2019.06.04T16:40:31.478  285.51
2019.06.04T16:40:32.478  285.51

4. Performance test

We tested the performance of DolphinDB's data playback on a server with the following configuration:

Host: DELL PowerEdge R730xd

CPU: Intel Xeon(R) CPU E5-2650 v4 (24 cores 48 threads 2.20GHz)

Memory: 512 GB (32GB × 16, 2666 MHz)

Hard Disk: 17T HDD (1.7T × 10, read speed 222 MB/s, write speed 210 MB/s)

Network: 10 Gigabit Ethernet

The test script is as follows:

sch = select name,typeString as type from  quotes.schema().colDefs
trs = cutPoints(09:30:00.001..18:00:00.001,60)
rds = replayDS(<select * from quotes>, `date, `time,  trs);
share streamTable(100:0, sch.name,sch.type) as outQuotes
jobid = submitJob("replay_quotes","replay_quotes_stream",  replay,  [rds],  [`outQuotes], `date, `time, , 4)

With no playback rate set (i.e., replaying at maximum speed) and no subscribers on the output table, replaying all 336,305,414 records takes only 90 to 110 seconds.

