Quant essentials丨How to use DolphinDB to calculate K-lines

DolphinDB provides an in-memory computing engine, built-in time series functions, distributed computing, and a streaming data processing engine, which together can calculate K-lines (candlesticks) efficiently in many scenarios. This tutorial introduces how to calculate K-lines in DolphinDB through batch processing and stream processing.

  • Batch calculation of K-lines from historical data

The start time of the K-line window can be specified; a day can contain multiple trading sessions, including overnight sessions; K-line windows can overlap; and trading volume can be used instead of time as the dimension for dividing windows. When a very large amount of data has to be read and the results written back to the database, you can use DolphinDB's built-in Map-Reduce function for parallel calculation.

  • Streaming calculation of K-lines

Use an API to receive market data in real time, and use DolphinDB's built-in time series aggregation engine for streaming data (TimeSeriesAggregator) to calculate K-line data in real time.

1. Historical data K-line calculation

To calculate K-lines from historical data, you can use DolphinDB's built-in functions bar, dailyAlignedBar, or wj.

1.1 Without specifying the start time of the K-line window: generate K-lines automatically from the data

bar(X, Y) returns X minus the remainder of X divided by Y, that is, X - (X mod Y). It is generally used to group data.

date = 09:32m 09:33m 09:45m 09:49m 09:56m 09:56m;
bar(date, 5);

The following results are returned:

[09:30m,09:30m,09:45m,09:45m,09:55m,09:55m]
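
The grouping rule can be checked outside DolphinDB. The following Python sketch only illustrates the arithmetic; the helper names and the representation of minute values as minutes since midnight are assumptions made for this example, not part of DolphinDB.

```python
def bar(x, y):
    """Round x down to a multiple of y: x - (x mod y)."""
    return x - x % y


def to_minutes(hhmm):
    """Convert an "HH:MM" string to minutes since midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def to_hhmm(total_minutes):
    """Convert minutes since midnight back to an "HH:MM" string."""
    return f"{total_minutes // 60:02d}:{total_minutes % 60:02d}"


times = ["09:32", "09:33", "09:45", "09:49", "09:56", "09:56"]
groups = [to_hhmm(bar(to_minutes(t), 5)) for t in times]
print(groups)
```

Every value is mapped to the start of the 5-minute group it falls in, which is what makes the result usable as a group-by key.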

Example 1: Use the following data to simulate the US stock market:

n = 1000000
date = take(2019.11.07 2019.11.08, n)
time = (09:30:00.000 + rand(int(6.5*60*60*1000), n)).sort!()
timestamp = concatDateTime(date, time)
price = 100+cumsum(rand(0.02, n)-0.01)
volume = rand(1000, n)
symbol = rand(`AAPL`FB`AMZN`MSFT, n)
trade = table(symbol, date, time, timestamp, price, volume).sortBy!(`symbol`timestamp)
undef(`date`time`timestamp`price`volume`symbol)

Calculate the 5-minute K-line:

barMinutes = 5
OHLC = select first(price) as open, max(price) as high, min(price) as low, last(price) as close, sum(volume) as volume from trade group by symbol, date, bar(time, barMinutes*60*1000) as barStart

Please note that in the above data, the precision of the time column is milliseconds. If the precision of the time column is not milliseconds, the number in barMinutes*60*1000 should be adjusted accordingly.
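
To make the grouping and the unit of the bar length concrete, here is a minimal pure-Python sketch of the same aggregation. It is an illustration under stated assumptions, not DolphinDB's implementation: the function name ohlc_bars, the tuple layout of a trade, and the sample rows are all hypothetical.

```python
def ohlc_bars(trades, bar_ms):
    """Group (symbol, time_ms, price, volume) rows into fixed-length bars.

    Rows are assumed to be sorted by time within each symbol, so the first
    and last price seen in a bar are its open and close.
    """
    bars = {}
    for symbol, time_ms, price, volume in trades:
        key = (symbol, time_ms - time_ms % bar_ms)  # same rule as bar()
        current = bars.get(key)
        if current is None:
            bars[key] = {"open": price, "high": price, "low": price,
                         "close": price, "volume": volume}
        else:
            current["high"] = max(current["high"], price)
            current["low"] = min(current["low"], price)
            current["close"] = price
            current["volume"] += volume
    return bars


def ms(hour, minute, second=0):
    """Milliseconds since midnight for a time of day."""
    return ((hour * 60 + minute) * 60 + second) * 1000


BAR_MINUTES = 5
BAR_MS = BAR_MINUTES * 60 * 1000  # times are in milliseconds, as in Example 1

sample = [
    ("AAPL", ms(9, 30, 5), 100.0, 200),
    ("AAPL", ms(9, 32, 0), 100.5, 100),
    ("AAPL", ms(9, 34, 59), 99.8, 300),
    ("AAPL", ms(9, 35, 0), 100.1, 50),
]
result = ohlc_bars(sample, BAR_MS)
```

If the times were stored in seconds instead, the bar length would be BAR_MINUTES * 60; the unit of the bar length always has to match the unit of the time column.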

1.2 Specifying the start time of the K-line window

To specify the start time of the K-line window, use the dailyAlignedBar function. It can handle multiple trading sessions per day as well as overnight sessions.

Please note that when using dailyAlignedBar, the time column must contain date information, so it must be of type DATETIME, TIMESTAMP, or NANOTIMESTAMP. The timeOffset parameter, which specifies the start time of each trading session, must use the corresponding type without the date part: SECOND, TIME, or NANOTIME.

Example 2 (one trading session per day): Calculate the 7-minute K-line of the US stock market. The data is the trade table from Example 1.

barMinutes = 7
OHLC = select first(price) as open, max(price) as high, min(price) as low, last(price) as close, sum(volume) as volume from trade group by symbol, dailyAlignedBar(timestamp, 09:30:00.000, barMinutes*60*1000) as barStart

Example 3 (two trading sessions per day): The Chinese stock market has two trading sessions each day: the morning session from 9:30 to 11:30 and the afternoon session from 13:00 to 15:00.

Use the following data to simulate:

n = 1000000
date = take(2019.11.07 2019.11.08, n)
time = (09:30:00.000 + rand(2*60*60*1000, n/2)).sort!() join (13:00:00.000 + rand(2*60*60*1000, n/2)).sort!()
timestamp = concatDateTime(date, time)
price = 100+cumsum(rand(0.02, n)-0.01)
volume = rand(1000, n)
symbol = rand(`600519`000001`600000`601766, n)
trade = table(symbol, timestamp, price, volume).sortBy!(`symbol`timestamp)
undef(`date`time`timestamp`price`volume`symbol)

Calculate the 7-minute K-line:

barMinutes = 7
sessionsStart=09:30:00.000 13:00:00.000
OHLC = select first(price) as open, max(price) as high, min(price) as low, last(price) as close, sum(volume) as volume from trade group by symbol, dailyAlignedBar(timestamp, sessionsStart, barMinutes*60*1000) as barStart

Example 4 (two trading sessions per day, including an overnight session): Some futures have multiple trading sessions each day, including an overnight session. In this example, the first session runs from 8:45 to 13:45, and the second is an overnight session from 15:00 to 05:00 the next day.

Use the following data to simulate:

daySession =  08:45:00.000 : 13:45:00.000
nightSession = 15:00:00.000 : 05:00:00.000
n = 1000000
timestamp = rand(concatDateTime(2019.11.06, daySession[0]) .. concatDateTime(2019.11.08, nightSession[1]), n).sort!()
price = 100+cumsum(rand(0.02, n)-0.01)
volume = rand(1000, n)
symbol = rand(`A120001`A120002`A120003`A120004, n)
trade = select * from table(symbol, timestamp, price, volume) where timestamp.time() between daySession or timestamp.time()>=nightSession[0] or timestamp.time()<nightSession[1] order by symbol, timestamp
undef(`timestamp`price`volume`symbol)

Calculate the 7-minute K-line:

barMinutes = 7
sessionsStart = [daySession[0], nightSession[0]]
OHLC = select first(price) as open, max(price) as high, min(price) as low, last(price) as close, sum(volume) as volume from trade group by symbol, dailyAlignedBar(timestamp, sessionsStart, barMinutes*60*1000) as barStart
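
The alignment used in Examples 2 to 4 can be approximated in a few lines of Python. This is a sketch of the idea as described in this section, not DolphinDB's implementation; the function name daily_aligned_bar and the rule that a timestamp earlier than the first session start belongs to the previous day's last session are assumptions made for the illustration.

```python
from datetime import datetime, time, timedelta


def daily_aligned_bar(ts, session_starts, bar_length):
    """Return the start of the bar that contains ts.

    session_starts: datetime.time values in ascending order.
    bar_length: timedelta giving the length of one bar.
    """
    started = [datetime.combine(ts.date(), s) for s in session_starts]
    started = [s for s in started if s <= ts]
    if started:
        session_start = started[-1]
    else:
        # Overnight case: still inside the previous day's last session.
        previous_day = ts.date() - timedelta(days=1)
        session_start = datetime.combine(previous_day, session_starts[-1])
    completed_bars = (ts - session_start) // bar_length
    return session_start + completed_bars * bar_length


sessions = [time(8, 45), time(15, 0)]
seven_minutes = timedelta(minutes=7)
day_bar = daily_aligned_bar(datetime(2019, 11, 7, 9, 0), sessions, seven_minutes)
night_bar = daily_aligned_bar(datetime(2019, 11, 8, 2, 0), sessions, seven_minutes)
print(day_bar, night_bar)
```

Bars are counted from the session start rather than from midnight, which is why a 7-minute bar can begin at 08:59 in the day session and at 01:58 in the overnight session.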

1.3 Overlapping K-line windows: using the wj function

In the examples above, the K-line windows do not overlap. To calculate overlapping K-line windows, use the wj (window join) function. With wj, you specify a time range relative to the time column of the left table, and the aggregation is performed over the matching rows of the right table.

Example 5 (two trading sessions per day, overlapping K-line windows): Simulate Chinese stock market data and calculate a 30-minute K-line every 5 minutes.

n = 1000000
sampleDate = 2019.11.07
symbols = `600519`000001`600000`601766
trade = table(take(sampleDate, n) as date, 
	(09:30:00.000 + rand(7200000, n/2)).sort!() join (13:00:00.000 + rand(7200000, n/2)).sort!() as time, 
	rand(symbols, n) as symbol, 
	100+cumsum(rand(0.02, n)-0.01) as price, 
	rand(1000, n) as volume)

First, generate the windows based on time, and use a cross join to generate every combination of stock and trading window.

barWindows = table(symbols as symbol).cj(table((09:30:00.000 + 0..23 * 300000).join(13:00:00.000 + 0..23 * 300000) as time))

Then use the wj function to calculate the K-line data of the overlapping windows:

OHLC = wj(barWindows, trade, 0:(30*60*1000), 
		<[first(price) as open, max(price) as high, min(price) as low, last(price) as close, sum(volume) as volume]>, `symbol`time)
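
The effect of overlapping windows can be sketched in plain Python for a single symbol. This is only an illustration: the function name window_ohlc is hypothetical, times are milliseconds relative to the session open, and treating both window bounds as inclusive is an assumption that should be checked against the wj documentation.

```python
def window_ohlc(window_starts, trades, window_len):
    """Aggregate trades that fall in [start, start + window_len] per window.

    trades: (time_ms, price, volume) tuples for one symbol, sorted by time.
    """
    results = {}
    for start in window_starts:
        rows = [r for r in trades if start <= r[0] <= start + window_len]
        if not rows:
            continue  # no trades in this window
        prices = [price for _, price, _ in rows]
        results[start] = {
            "open": prices[0], "high": max(prices), "low": min(prices),
            "close": prices[-1], "volume": sum(vol for _, _, vol in rows),
        }
    return results


MINUTE = 60 * 1000
trades = [(0 * MINUTE, 100.0, 10), (7 * MINUTE, 101.0, 20),
          (31 * MINUTE, 99.0, 30), (36 * MINUTE, 100.5, 40)]
starts = [0, 5 * MINUTE, 10 * MINUTE]  # a new 30-minute window every 5 minutes
bars = window_ohlc(starts, trades, 30 * MINUTE)
```

The trade at minute 7 contributes to both the first and the second window, which is exactly what distinguishes this from the non-overlapping group-by approach.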

1.4 Using trading volume to divide K-line windows

All the examples above use time as the dimension for dividing K-line windows. In practice, other dimensions can also be used. For example, K-lines can be calculated from the cumulative trading volume.

Example 6 (two trading sessions per day, K-lines based on cumulative trading volume): Simulate Chinese stock market data and calculate one K-line for every 10,000 increase in trading volume.

n = 1000000
sampleDate = 2019.11.07
symbols = `600519`000001`600000`601766
trade = table(take(sampleDate, n) as date, 
	(09:30:00.000 + rand(7200000, n/2)).sort!() join (13:00:00.000 + rand(7200000, n/2)).sort!() as time, 
	rand(symbols, n) as symbol, 
	100+cumsum(rand(0.02, n)-0.01) as price, 
	rand(1000, n) as volume)
	
volThreshold = 10000
select first(time) as barStart, first(price) as open, max(price) as high, min(price) as low, last(price) as close 
from (select symbol, time, price, cumsum(volume) as cumvol from trade context by symbol)
group by symbol, bar(cumvol, volThreshold) as volBar

The code uses a nested query. The subquery generates the cumulative trading volume cumvol for each stock, and the main query then uses the bar function to generate windows based on the cumulative trading volume.
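
The same two steps, a running total followed by bucketing, can be sketched in Python for one symbol. The function name volume_bars and the sample trades are hypothetical, and the threshold is lowered to 1,000 to keep the example short.

```python
def volume_bars(trades, vol_threshold):
    """Build K-lines keyed by cumulative-volume bucket.

    trades: (time, price, volume) tuples for one symbol, sorted by time.
    """
    bars = {}
    cumvol = 0
    for t, price, volume in trades:
        cumvol += volume                       # like cumsum(volume)
        key = cumvol - cumvol % vol_threshold  # like bar(cumvol, volThreshold)
        current = bars.get(key)
        if current is None:
            bars[key] = {"barStart": t, "open": price, "high": price,
                         "low": price, "close": price}
        else:
            current["high"] = max(current["high"], price)
            current["low"] = min(current["low"], price)
            current["close"] = price
    return bars


sample = [("09:30:01", 10.0, 400), ("09:30:02", 10.2, 500),
          ("09:30:03", 10.1, 300), ("09:30:04", 9.9, 900),
          ("09:30:05", 10.3, 100)]
vol_bars = volume_bars(sample, 1000)
```

Note that a trade which pushes the running total across a threshold is assigned entirely to the next bucket, so the volume in a bar is only approximately equal to the threshold.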

1.5 Using the Map-Reduce function to speed up the calculation

If you need to extract a relatively large amount of historical data from the database, calculate the K-lines, and then store them back in the database, you can use mr, DolphinDB's built-in Map-Reduce function, to read and calculate the data in parallel. This can significantly increase the speed.

This example uses trade data from the US stock market with nanosecond precision. The raw data is stored in the "trades" table of the "dfs://TAQ" database. The database uses a composite partition: a value partition on the trade date (Date) and a range partition on the stock symbol (Symbol).

(1) Load the metadata of the original data table stored on the disk into the memory:

login(`admin, `123456)
db = database("dfs://TAQ")
trades = db.loadTable("trades")

(2) Create an empty data table on the disk to store the calculation results. The following code creates a template table (model), and creates an empty OHLC table in the database "dfs://TAQ" according to the schema of this template table to store the K-line calculation results:

model=select top 1 Symbol, Date, Time.second() as bar, PRICE as open, PRICE as high, PRICE as low, PRICE as close, SIZE as volume from trades where Date=2007.08.01, Symbol=`EBAY
if(existsTable("dfs://TAQ", "OHLC"))
	db.dropTable("OHLC")
db.createPartitionedTable(model, `OHLC, `Date`Symbol)

(3) Use the mr function to calculate the K-line data, and write the result into the OHLC table:

def calcOHLC(inputTable){
	tmp=select first(PRICE) as open, max(PRICE) as high, min(PRICE) as low, last(PRICE) as close, sum(SIZE) as volume from inputTable where Time.second() between 09:30:00 : 15:59:59 group by Symbol, Date, 09:30:00+bar(Time.second()-09:30:00, 5*60) as bar
	loadTable("dfs://TAQ", `OHLC).append!(tmp)
	return tmp.size()
}
ds = sqlDS(<select Symbol, Date, Time, PRICE, SIZE from trades where Date between 2007.08.01 : 2019.08.01>)
mr(ds, calcOHLC, +)

In the above code, ds is a series of data sources generated by the sqlDS function; each data source represents the data extracted from one data partition. The user-defined function calcOHLC is the map function of the Map-Reduce algorithm: for each data source it calculates the K-line data, writes the result to the database, and returns the number of K-line rows written. "+" is the reduce function: it adds up the results of all map calls, that is, the numbers of rows written, and returns the total number of K-line rows written to the database.
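
The division of labour between the map and reduce steps can be sketched in Python. This is a sequential illustration only (mr runs the map calls in parallel); the names calc_ohlc, partitions, and ohlc_table are hypothetical, and an in-memory list stands in for the database table.

```python
from functools import reduce
from operator import add


def calc_ohlc(partition, output):
    """Map step: one OHLC row per symbol in the partition.

    Appends the rows to output and returns the number of rows written.
    """
    per_symbol = {}
    for symbol, price, size in partition:
        row = per_symbol.get(symbol)
        if row is None:
            # symbol, open, high, low, close, volume
            per_symbol[symbol] = [symbol, price, price, price, price, size]
        else:
            row[2] = max(row[2], price)  # high
            row[3] = min(row[3], price)  # low
            row[4] = price               # close
            row[5] += size               # volume
    output.extend(per_symbol.values())
    return len(per_symbol)


partitions = [
    [("EBAY", 30.0, 100), ("EBAY", 30.5, 200), ("IBM", 110.0, 50)],
    [("EBAY", 31.0, 100)],
]
ohlc_table = []
# Reduce step: add up the row counts returned by every map call.
total_rows = reduce(add, (calc_ohlc(p, ohlc_table) for p in partitions))
print(total_rows)
```

Because the map step writes its own results, the reduce step only has to combine small integers, which keeps the data exchanged between the two steps minimal.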

2. Real-time K-line calculation

The process of calculating real-time K-line in DolphinDB database is shown in the figure below:

[Figure: Flow chart for calculating real-time K-line in DolphinDB]

Real-time data providers generally offer data subscription services through APIs in Python, Java, or other common languages. In this example, Python is used to simulate receiving market data, which is written to a stream data table through the DolphinDB Python API. DolphinDB's time series aggregation engine for streaming data (TimeSeriesAggregator) can then calculate K-lines from the real-time data at a specified frequency and with a moving window.

The simulated real-time data source used in this example is the text file trades.csv. The file contains the following 4 columns (a row of sample data is given together):

[Image: the 4 columns of trades.csv with one row of sample data]

The following three subsections introduce the three steps of real-time candlestick calculation:

2.1 Use Python to receive real-time data and write it to the DolphinDB stream data table

  • Create a stream data table in DolphinDB
share streamTable(100:0, `Symbol`Datetime`Price`Volume,[SYMBOL,DATETIME,DOUBLE,INT]) as Trade
  • The Python program reads data from the data source trades.csv file and writes it into DolphinDB.

The Datetime column in the real-time data has second precision. Because a pandas DataFrame can only store timestamps as datetime64[ns], which corresponds to the NANOTIMESTAMP type in DolphinDB, the following code converts the data type before writing. The same approach applies to most scenarios where data needs to be cleaned or converted.

import dolphindb as ddb
import pandas as pd
import numpy as np
csv_file = "trades.csv"
csv_data = pd.read_csv(csv_file, dtype={'Symbol':str} )
csv_df = pd.DataFrame(csv_data)
s = ddb.session()
s.connect("127.0.0.1",8848,"admin","123456")
# Upload the DataFrame to DolphinDB and convert the type of the Datetime column
s.upload({"tmpData":csv_df})
s.run("data = select Symbol, datetime(Datetime) as Datetime, Price, Volume from tmpData")
s.run("tableInsert(Trade,data)")

2.2 Real-time calculation of K-line

In this example, the createTimeSeriesAggregator function creates a time series aggregation engine that calculates K-line data in real time and outputs the results to the stream data table OHLC.

Real-time calculation of K-line data can be divided into the following two situations according to different application scenarios:

  • The calculation is triggered only at the end of each time window
    • The time windows do not overlap at all; for example, the K-line data of the past 5 minutes is calculated every 5 minutes
    • The time windows partially overlap; for example, the K-line data of the past 5 minutes is calculated every 1 minute
  • The calculation is triggered at the end of each time window, and the data in each window is also updated at a certain frequency; for example, the K-line data of the past 1 minute is calculated every 1 minute, but instead of waiting until the end of the window, the K-line of the current minute is updated every 1 second

The following describes how to use the createTimeSeriesAggregator function to calculate K-line data in real time in each of these situations. Create the time series aggregation engine that matches your actual needs.

2.2.1 The calculation is triggered only at the end of each time window

When the calculation is triggered only at the end of each time window, there are two scenarios: the time windows do not overlap at all, or they partially overlap. Both are configured through the windowSize and step parameters of the createTimeSeriesAggregator function, as explained in detail below.

First define the output table:

share streamTable(100:0, `datetime`symbol`open`high`low`close`volume,[DATETIME, SYMBOL, DOUBLE,DOUBLE,DOUBLE,DOUBLE,LONG]) as OHLC

Then, depending on the usage scenario, choose one of the following scenarios to create the time series aggregation engine.

Scenario 1: To calculate the K-line data of the past 5 minutes every 5 minutes, define the time series aggregation engine with the following script, where the value of the windowSize parameter equals the value of the step parameter:

tsAggrKline = createTimeSeriesAggregator(name="aggr_kline", windowSize=300, step=300, metrics=<[first(Price),max(Price),min(Price),last(Price),sum(volume)]>, dummyTable=Trade, outputTable=OHLC, timeColumn=`Datetime, keyColumn=`Symbol)

Scenario 2: To calculate the K-line data of the past 5 minutes every 1 minute, define the time series aggregation engine with the following script, where the value of the windowSize parameter is a multiple of the value of the step parameter:

tsAggrKline = createTimeSeriesAggregator(name="aggr_kline", windowSize=300, step=60, metrics=<[first(Price),max(Price),min(Price),last(Price),sum(volume)]>, dummyTable=Trade, outputTable=OHLC, timeColumn=`Datetime, keyColumn=`Symbol)
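
How windowSize and step interact can be illustrated with a small Python sketch that lists every window a given timestamp falls into. The function name windows_containing is hypothetical, and two details are assumptions made for the illustration rather than documented engine behaviour: windows end on multiples of step, and each window is the half-open interval [start, end).

```python
def windows_containing(t, window_size, step):
    """Return the (start, end) pairs of all windows that contain time t.

    t, window_size, and step are in seconds.
    """
    if window_size % step != 0:
        raise ValueError("window_size must be a multiple of step")
    end = (t // step + 1) * step  # first window end after t
    windows = []
    while end - window_size <= t:
        windows.append((end - window_size, end))
        end += step
    return windows


scenario_1 = windows_containing(1000, window_size=300, step=300)
scenario_2 = windows_containing(1000, window_size=300, step=60)
print(scenario_1)
print(scenario_2)
```

With windowSize equal to step, every trade belongs to exactly one window; with windowSize equal to five times step, every trade contributes to five consecutive windows.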

Finally, define the streaming data subscription. If real-time data has already been written to the stream data table Trade at this point, it is subscribed to immediately and fed into the aggregation engine:

subscribeTable(tableName="Trade", actionName="act_tsaggr", offset=0, handler=append!{tsAggrKline}, msgAsTable=true)

The first 5 rows of data in the output table of scenario 1:

[Image: first 5 rows of the output table for scenario 1]

2.2.2 The calculation is triggered at the end of each time window, and the calculation result is updated at a certain frequency at the same time

Take the calculation of VWAP with a 1-minute window as an example. After the aggregation result is updated at 10:00, the next update happens at 10:01 at the earliest. Under this rule, no calculation is triggered during that minute even if many trades occur. This is unacceptable in many financial trading scenarios, where the information needs to be updated at a higher frequency. The updateTime parameter of the time series aggregation engine was introduced for this purpose.

The updateTime parameter represents the time interval for calculation. If updateTime is not specified, the time series aggregation engine will trigger a calculation only at the end of each time window. But if updateTime is specified, the calculation will be triggered in the following three cases:

  • At the end of each time window, the time series aggregation engine will trigger a calculation
  • Every updateTime time unit, the time series aggregation engine will trigger a calculation
  • If data entered the engine more than 2*updateTime time units ago (or 2 seconds ago, if 2*updateTime is less than 2 seconds) and the current window still contains data that has not been calculated, the time series aggregation engine will trigger a calculation

This ensures that the time series aggregation engine triggers a calculation at the end of each time window and also at a certain frequency within each time window.

It should be noted that when the updateTime parameter is used, the time series aggregation engine requires a keyedTable as the output table. The reasons are as follows:

  • If an ordinary table or a streamTable is used as the output table:
    Neither table nor streamTable restricts the writing of duplicate data. Therefore, when the data meets the condition that triggers updateTime but not yet the condition that triggers step, the engine keeps appending results with the same time to the output table. The output table ends up with a large number of records with the same time, which is meaningless.
  • If a keyedStreamTable is used as the output table:
    A keyedStreamTable allows neither updating existing records nor adding records whose key value already exists in the table. When a new record is added, the system checks its primary key value; if it equals the primary key value of an existing record, the new record is not written. In this scenario, when the data meets the condition that triggers updateTime but not yet the condition that triggers step, the engine tries to write the result of the most recent window to the output table, but the write is rejected because the time is the same. The updateTime parameter thus loses its meaning.
  • If a keyedTable is used as the output table:
    A keyedTable allows updating. When a new record is added, the system checks its primary key value; if it equals the primary key value of an existing record, that record is updated. In this scenario, the result for a given time can be updated in place: when the data meets the condition that triggers updateTime but not yet the condition that triggers step, the existing result is replaced by the result calculated from the data in the most recent window, and no new record is added. A new record is added only when the data meets the condition that triggers step. This is the expected behavior, which is why the engine requires a keyedTable as the output table when updateTime is used.
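
The three write behaviours described above can be modelled with ordinary Python containers. This is a conceptual sketch, not DolphinDB code: a list stands in for table/streamTable, a dictionary with first-write-wins semantics for keyedStreamTable, and a dictionary with overwrite semantics for keyedTable. The sample rows are hypothetical.

```python
def key_of(row):
    """Primary key used by the keyed tables: time plus symbol."""
    return (row["datetime"], row["symbol"])


# Three successive results for the same, still unfinished, window.
updates = [
    {"datetime": "10:00:00", "symbol": "AAPL", "close": 100.0},
    {"datetime": "10:00:00", "symbol": "AAPL", "close": 100.4},
    {"datetime": "10:00:00", "symbol": "AAPL", "close": 100.9},
]

plain_table = []         # append-only: duplicates accumulate
keyed_stream_table = {}  # first write wins: later writes are dropped
keyed_table = {}         # upsert: later writes replace the record

for row in updates:
    plain_table.append(row)
    keyed_stream_table.setdefault(key_of(row), row)
    keyed_table[key_of(row)] = row

print(len(plain_table))
print(keyed_stream_table[("10:00:00", "AAPL")]["close"])
print(keyed_table[("10:00:00", "AAPL")]["close"])
```

Only the upsert model ends with a single row that carries the latest value, which matches the reasoning for requiring a keyedTable.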

For example, suppose you want to calculate K-lines with a 1-minute window, but instead of waiting until the end of the window, you want the K-line of the current minute to be updated every 1 second. This can be implemented through the following steps.

First, create a keyedTable as the output table, with the time column and the stock code column as the primary key. When new data is injected into the output table and the time of the new record already exists in the table, the record for that time is updated. This ensures that every query sees the latest data for each time.

share keyedTable(`datetime`Symbol, 100:0, `datetime`Symbol`open`high`low`close`volume,[DATETIME,SYMBOL,DOUBLE,DOUBLE,DOUBLE,DOUBLE,LONG]) as OHLC
Please note: when a keyedTable is used as the output table of the time series aggregation engine and the engine specifies the keyColumn parameter, the keyedTable must use both the time column and the keyColumn column as its primary key.

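To calculate the K-line data of the past 1 minute every 1 minute and also update the K-line data of the current minute every 1 second, define the time series aggregation engine with the following script. The value of the windowSize parameter equals the value of the step parameter, and updateTime is set to 1 second, so the data of the most recent minute is updated every second. The useWindowStartTime parameter in the example specifies that the time in the output table is the start time of the data window.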

tsAggrKline = createTimeSeriesAggregator(name="aggr_kline", windowSize=60, step=60, metrics=<[first(Price),max(Price),min(Price),last(Price),sum(volume)]>, dummyTable=Trade, outputTable=OHLC, timeColumn=`Datetime, keyColumn=`Symbol,updateTime=1, useWindowStartTime=true)
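Please note that when using the time series aggregation engine, windowSize must be an integer multiple of step, and step must be an integer multiple of updateTime.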
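
A small pre-flight check can catch an invalid combination before the engine is created. The following Python sketch encodes only the two divisibility rules stated above; the function name check_engine_params is hypothetical and not part of any DolphinDB API.

```python
def check_engine_params(window_size, step, update_time=None):
    """Raise ValueError if the parameters violate the divisibility rules."""
    if window_size % step != 0:
        raise ValueError("windowSize must be an integer multiple of step")
    if update_time is not None and step % update_time != 0:
        raise ValueError("step must be an integer multiple of updateTime")


check_engine_params(60, 60, 1)  # 1-minute window, updated every second
check_engine_params(300, 60)    # 5-minute window every minute, no updateTime
```

Both calls correspond to configurations used in this tutorial and pass silently; a combination such as step=60 with updateTime=7 would raise an error.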

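Finally, define the streaming data subscription. If real-time data has already been written to the stream data table Trade at this point, it is subscribed to immediately and fed into the aggregation engine: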

subscribeTable(tableName="Trade", actionName="act_tsaggr", offset=0, handler=append!{tsAggrKline}, msgAsTable=true)

The first 5 rows of data in the output table:

[Image: first 5 rows of the output table]

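2.3 Displaying K-line data in Python

In this example, the output table of the aggregation engine is also defined as a stream data table, so a client can subscribe to it through the Python API and display the calculation results in the Python terminal.

The following code uses the Python API to subscribe to OHLC, the output table of the real-time aggregation, and prints the results with the print function.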

import dolphindb as ddb
import pandas as pd
import numpy as np
# Connect to DolphinDB (same server and credentials as in section 2.1)
s = ddb.session()
s.connect("127.0.0.1", 8848, "admin", "123456")
# Set the local port 20001 for subscribing to streaming data
s.enableStreaming(20001)
def handler(lst):
    print(lst)
# Subscribe to the OHLC stream data table on DolphinDB (local port 8848)
s.subscribe("127.0.0.1", 8848, handler, "OHLC")

You can also connect a visualization system such as Grafana to the DolphinDB database to query the output table and display the results as charts.



Origin blog.51cto.com/15022783/2642177