High-frequency data processing techniques: applications of data pivoting

Pivoting, also known as transposing data, is a common requirement when organizing data.

High-frequency data is usually stored in a long format: each row holds the information of one stock at one moment.

When processing this data, with subsequent vectorized operations in mind, we sometimes want to transpose the raw data or intermediate results so that each row represents a point in time and each column represents a stock. In DolphinDB, the raw data or the result of a grouped aggregation can be transposed with the pivot by clause. Combined with vectorized operations, this row-column conversion not only simplifies strategy code in high-frequency data processing, it also makes the code faster. The two examples below show this in detail, after a minimal sketch of what pivot by does.
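As a minimal sketch (with made-up data), pivot by reshapes a long table of (minute, symbol, price) rows into one row per minute and one column per stock:

// hypothetical two-stock, two-minute example
t = table(`A`B`A`B as sym, 09:30m 09:30m 09:31m 09:31m as minute, 10.1 20.2 10.2 20.3 as price)
select price from t pivot by minute, sym
// minute  A     B
// 09:30m  10.1  20.2
// 09:31m  10.2  20.3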

1. Calculate the pairwise correlation of stock returns

In pairs trading and risk hedging, we often need to compute the pairwise correlations within a given basket of stocks. Such calculations are impractical in traditional databases, while using general-purpose statistical software requires migrating the data and writing cumbersome code. Below we use DolphinDB to compute the pairwise correlations of stock returns.

First, load the high-frequency trading database of US stocks:

quotes = loadTable("dfs://TAQ", "quotes")

Next, select the 500 stocks with the most frequent price changes on August 4, 2009:

dateValue=2009.08.04
num=500
syms = (exec count(*) from quotes where date = dateValue, time between 09:30:00 : 15:59:59, 0<bid, bid<ofr, ofr<bid*1.1 group by Symbol order by count desc).Symbol[0:num]

Next, we use pivot by to downsample the high-frequency data to minute level and reshape it, generating a minute-level stock-price matrix: each column is a stock and each row is a minute.

priceMatrix = exec avg(bid + ofr)/2.0 as price from quotes where date = dateValue, Symbol in syms, 0<bid, bid<ofr, ofr<bid*1.1, time between 09:30:00 : 15:59:59 pivot by time.minute() as minute, Symbol

DolphinDB's language is very flexible: here, pivot by not only reshapes the data into a pivot table, it can also be combined with aggregate functions, serving as a group by at the same time.

Use the higher-order function each to convert the price matrix into a matrix of returns:

retMatrix = each(def(x):ratios(x)-1, priceMatrix)
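To see what the lambda computes, apply ratios to a small hypothetical price vector: each element is divided by its predecessor, so subtracting 1 yields simple returns.

p = 100.0 101.0 99.99
ratios(p) - 1    // => [NULL, 0.01, -0.01], up to floating-point rounding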

Use the higher-order function pcross to calculate the pairwise correlations of returns among these 500 stocks:

corrMatrix = pcross(corr, retMatrix)
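pcross is the parallel version of the higher-order function cross: given a single matrix, it applies the binary function, here corr, to every pair of columns. A toy sketch with hypothetical data:

m = matrix(1.0 2.0 3.0 4.0, 4.0 3.0 2.0 1.0)    // two perfectly anti-correlated columns
pcross(corr, m)    // 2x2 matrix: 1 on the diagonal, -1 off it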

For each stock, select the 10 other stocks with the highest correlation to it (the descending rank is 0-based, and rank 0 is the stock's correlation with itself, so ranks 1 through 10 are kept):

mostCorrelated = select * from table(corrMatrix).rename!(`sym`corrSym`corr) context by sym having rank(corr,false) between 1:10
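In this query, context by sym groups the rows by stock, and having with rank(corr, false) (descending order) keeps only the rows whose in-group rank falls between 1 and 10. A minimal sketch of the same context by / having pattern on hypothetical data:

t = table(`A`A`A`B`B`B as sym, 1.0 3.0 2.0 9.0 7.0 8.0 as score)
// keep, within each sym group, the two rows with the highest score (descending ranks 0 and 1)
select * from t context by sym having rank(score, false) between 0:1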

Select the 10 stocks with the highest correlation with SPY:

select * from mostCorrelated where sym='SPY' order by corr desc

The quotes table contains 269.3 billion rows in total, nearly 190 million of which are from August 4, 2009. The entire calculation above took only 1,952 milliseconds.

2. Calculate the value of the stock portfolio

When backtesting index-arbitrage strategies, we need to compute the value of a given stock portfolio. When the data volume is extremely large, running such a backtest on a general-purpose data analysis system places very high demands on memory and speed. DolphinDB is optimized from the ground up and does not require such high-end hardware.

In this example, for simplicity, we assume an index consisting of only two stocks, AAPL and FB, with nanosecond-precision timestamps; the constituent weights are stored in the weights dictionary.

Symbol=take(`AAPL, 6) join take(`FB, 5)
Time=2019.02.27T09:45:01.000000000+[146, 278, 412, 445, 496, 789, 212, 556, 598, 712, 989]
Price=173.27 173.26 173.24 173.25 173.26 173.27 161.51 161.50 161.49 161.50 161.51
quotes=table(Symbol, Time, Price)
weights=dict(`AAPL`FB, 0.6 0.4)
ETF = select Symbol, Time, Price*weights[Symbol] as weightedPrice from quotes
select last(weightedPrice) from ETF pivot by Time, Symbol;

The results are as follows:

Time                          AAPL    FB
----------------------------- ------- ------
2019.02.27T09:45:01.000000146 103.962
2019.02.27T09:45:01.000000212         64.604
2019.02.27T09:45:01.000000278 103.956
2019.02.27T09:45:01.000000412 103.944
2019.02.27T09:45:01.000000445 103.95
2019.02.27T09:45:01.000000496 103.956
2019.02.27T09:45:01.000000556         64.6
2019.02.27T09:45:01.000000598         64.596
2019.02.27T09:45:01.000000712         64.6
2019.02.27T09:45:01.000000789 103.962
2019.02.27T09:45:01.000000989         64.604

Since timestamps have nanosecond precision, virtually no two trades share the same timestamp. If the backtest involves an extremely large number of rows (hundreds of millions or billions) and the index has many constituents (such as the 500 constituents of the S&P 500), then to compute the index value at every moment, a traditional analysis system must reshape the 3-column raw table (time, ticker, price) into a table of the same length but as wide as the number of constituents plus one, forward-fill the NULLs, and then sum each row's constituent contributions to obtain the index price. This approach produces an intermediate table many times larger than the original, which easily exhausts system memory, and the computation is slow as well.
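To make these steps concrete, here is the pipeline written out explicitly in DolphinDB (a sketch reusing the ETF table defined above; exec with pivot by returns a matrix), with each step materializing its intermediate result:

// step 1: pivot into a wide matrix -- one row per timestamp, one column per constituent
priceMat = exec last(weightedPrice) from ETF pivot by Time, Symbol
// step 2: forward-fill the NULLs left by constituents that did not trade at that instant
filled = ffill(priceMat)
// step 3: sum the weighted prices in each row to get the index value at each timestamp
indexValue = rowSum(filled)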

With DolphinDB's pivot by statement, all the steps above collapse into a single line of code. The code is concise, and no intermediate tables need to be materialized, which avoids running out of memory and greatly speeds up the computation.

select rowSum(ffill(last(weightedPrice))) from ETF pivot by Time, Symbol;

The results are as follows:

Time                          rowSum
----------------------------- -------
2019.02.27T09:45:01.000000146 103.962
2019.02.27T09:45:01.000000212 168.566
2019.02.27T09:45:01.000000278 168.56
2019.02.27T09:45:01.000000412 168.548
2019.02.27T09:45:01.000000445 168.554
2019.02.27T09:45:01.000000496 168.56
2019.02.27T09:45:01.000000556 168.556
2019.02.27T09:45:01.000000598 168.552
2019.02.27T09:45:01.000000712 168.556
2019.02.27T09:45:01.000000789 168.562
2019.02.27T09:45:01.000000989 168.566

 
