DolphinDB is a distributed time series database with rich built-in calculation and analysis functions. It can store terabytes of data across multiple physical machines, make full use of the CPU, and run high-performance analysis and calculations on that data. With Orca, we can run complex, efficient calculations on data in a DolphinDB distributed database from the Python environment, using scripts with the same syntax as pandas. This tutorial introduces Orca's operations on DolphinDB distributed tables.
This example uses DolphinDB stand-alone mode. First, create the sample database dfs://orca_stock for this tutorial. The DolphinDB script to create the database is as follows:
login("admin","123456")
if(existsDatabase("dfs://orca_stock")){
dropDatabase("dfs://orca_stock")
}
dates=2019.01.01..2019.01.31
syms="A"+string(1..30)
sym_range=cutPoints(syms,3)
db1=database("",VALUE,dates)
db2=database("",RANGE,sym_range)
db=database("dfs://orca_stock",COMPO,[db1,db2])
n=10000000
datetimes=2019.01.01T00:00:00..2019.01.31T23:59:59
t=table(rand(datetimes,n) as trade_time,rand(syms,n) as sym,rand(1000,n) as qty,rand(500.0,n) as price)
trades=db.createPartitionedTable(t,`trades,`trade_time`sym).append!(t)
n=200000
datetimes=2019.01.01T00:00:00..2019.01.02T23:59:59
syms="A"+string(1..30)
t2=table(rand(datetimes,n) as trade_time,rand(syms,n) as sym,rand(500.0,n) as bid,rand(500.0,n) as offer)
quotes=db.createPartitionedTable(t2,`quotes,`trade_time`sym).append!(t2)
syms="A"+string(1..30)
t3=table(syms as sym,rand(0 1,30) as type)
infos=db.createTable(t3,`infos).append!(t3)
Note: You need to create a distributed table on the DolphinDB client or through the DolphinDB Python API. You cannot create a distributed table directly in Orca.
Connect to the DolphinDB server through the connect function in Orca:
>>> import dolphindb.orca as orca
>>> orca.connect("localhost",8848,"admin","123456")
The user needs to modify the IP address and port number according to the actual situation.
1 Read the distributed table
Orca reads distributed tables through the read_table function, and the result returned is an Orca DataFrame. For example, read the table trades in the sample database dfs://orca_stock:
>>> trades = orca.read_table('dfs://orca_stock','trades')
>>> type(trades)
orca.core.frame.DataFrame
View the column names of trades:
>>> trades.columns
Index(['trade_time', 'sym', 'qty', 'price'], dtype='object')
View the data type of each column of trades:
>>> trades.dtypes
trade_time datetime64[s]
sym object
qty int32
price float64
dtype: object
View the number of rows of trades:
>>> len(trades)
10000000
The Orca DataFrame corresponding to a DolphinDB distributed table stores only metadata, such as the table name and the column names. Because a distributed table is not stored contiguously and there is no strict ordering between partitions, the DataFrame corresponding to a distributed table does not have the concept of RangeIndex. If you need to set an index, you can use the set_index function. For example, set trade_time in trades as the index:
>>> trades.set_index('trade_time')
If you want to convert the index column back into a data column, you can use the reset_index function.
>>> trades.reset_index()
2 Query and calculation
Orca uses lazy evaluation. Certain calculations are not executed immediately on the server side, but are converted into an intermediate expression. The calculation does not run until its result is actually needed. If the user needs to trigger the calculation immediately, the compute function can be called.
Note that the data in the sample database dfs://orca_stock is randomly generated, so the results of the user's operation will be different from the results in this chapter.
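The idea of lazy evaluation can be illustrated with a small, self-contained Python sketch. The LazySum class below is hypothetical and is not part of Orca; it only shows the general pattern of building an expression first and running it when compute is called.

```python
class LazySum:
    """Toy stand-in for a deferred computation; not Orca's implementation."""

    def __init__(self, values):
        self._values = values
        self.evaluated = False  # flips to True once the work has actually run
        self._result = None

    def compute(self):
        # Run the deferred work on the first call and cache the result.
        if not self.evaluated:
            self._result = sum(self._values)
            self.evaluated = True
        return self._result


expr = LazySum([1, 2, 3])      # builds the expression only; nothing is summed yet
was_lazy = not expr.evaluated  # no work has been done at construction time
total = expr.compute()         # forces the evaluation
```

In this sketch, repeated calls to compute reuse the cached result; whether Orca caches results in the same way is not stated in this tutorial.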
2.1 Get the first n records
The head function queries the first n records; the first 5 records are selected by default. For example, take the first 5 records of trades:
>>> trades.head()
trade_time sym qty price
0 2019-01-01 18:04:33 A16 855 482.526769
1 2019-01-01 13:57:38 A12 244 61.675293
2 2019-01-01 23:58:15 A10 36 297.623295
3 2019-01-01 23:02:43 A16 426 109.041012
4 2019-01-01 04:33:53 A1 472 75.778951
2.2 Sort
The sort_values method sorts by a given column. For example, sort trades in descending order by price and take the first 5 records:
>>> trades.sort_values(by='price', ascending=False).head()
trade_time sym qty price
0 2019-01-03 12:56:09 A22 861 499.999998
1 2019-01-18 17:25:21 A19 95 499.999963
2 2019-01-30 02:18:48 A30 114 499.999949
3 2019-01-23 08:31:56 A3 926 499.999926
4 2019-01-20 03:36:53 A3 719 499.999892
Sort by multiple columns:
>>> trades.sort_values(by=['qty','trade_time'], ascending=False).head()
trade_time sym qty price
0 2019-01-31 23:58:50 A24 999 359.887697
1 2019-01-31 23:57:26 A3 999 420.156175
2 2019-01-31 23:56:34 A2 999 455.228435
3 2019-01-31 23:52:58 A6 999 210.819227
4 2019-01-31 23:45:17 A14 999 310.813216
2.3 Query according to conditions
Orca supports queries based on a single condition or on multiple conditions. For example,
query the data on January 1, 2019 in trades:
>>> tmp = trades[trades.trade_time.dt.date == "2019.01.01"]
>>> tmp.head()
trade_time sym qty price
0 2019-01-01 00:32:21 A2 139 383.971293
1 2019-01-01 21:19:09 A2 263 100.932553
2 2019-01-01 18:50:48 A2 890 335.614454
3 2019-01-01 23:29:16 A2 858 469.223992
4 2019-01-01 09:58:51 A2 883 235.753424
Query the data in trades on January 30, 2019, with the stock code A2:
>>> tmp = trades[(trades.trade_time.dt.date == '2019.01.30') & (trades.sym == 'A2')]
>>> tmp.head()
trade_time sym qty price
0 2019-01-30 04:41:56 A2 880 428.552654
1 2019-01-30 14:13:53 A2 512 488.826978
2 2019-01-30 14:31:28 A2 536 478.578219
3 2019-01-30 04:09:41 A2 709 255.435903
4 2019-01-30 13:18:50 A2 355 404.782260
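Each condition in the combined query above is wrapped in parentheses. This matters because, in Python, `&` binds more tightly than `==`. The following sketch uses plain integers (not Orca objects) to show how the two forms are parsed differently:

```python
a, b = 1, 2

# Without parentheses, Python parses this as the chained comparison
# a == (a & b) == b, that is, 1 == 0 == 2.
without_parens = a == a & b == b

# With parentheses, each comparison is evaluated first and the results
# are then combined with &.
with_parens = (a == a) & (b == b)
```

Here without_parens is False while with_parens is True, which is why each comparison in a multi-condition filter should be parenthesized.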
2.4 groupby group query
The groupby function is used for grouping and aggregation. The following functions can be used on groupby objects:
- count: Return the number of non-NULL elements
- sum: Sum
- mean: Mean
- min: Minimum
- max: Maximum
- mode: Mode
- abs: Absolute value
- prod: Product
- std: Standard deviation
- var: Variance
- sem: Standard error of the mean
- skew: Skewness
- kurtosis: Kurtosis
- cumsum: Cumulative sum
- cumprod: Cumulative product
- cummax: Cumulative maximum
- cummin: Cumulative minimum
Calculate the number of records per day in trades:
>>> trades.groupby(trades.trade_time.dt.date)['sym'].count()
trade_time
2019-01-01 322573
2019-01-02 322662
2019-01-03 323116
2019-01-04 322436
2019-01-05 322156
2019-01-06 324191
2019-01-07 321879
2019-01-08 323319
2019-01-09 322262
2019-01-10 322585
2019-01-11 322986
2019-01-12 322839
2019-01-13 322302
2019-01-14 322032
2019-01-15 322409
2019-01-16 321810
2019-01-17 321566
2019-01-18 323651
2019-01-19 323463
2019-01-20 322675
2019-01-21 322845
2019-01-22 322931
2019-01-23 322598
2019-01-24 322404
2019-01-25 322454
2019-01-26 321760
2019-01-27 321955
2019-01-28 322013
2019-01-29 322745
2019-01-30 322193
2019-01-31 323190
dtype: int64
Calculate the number of records for each stock in trades per day:
>>> trades.groupby([trades.trade_time.dt.date,'sym'])['price'].count()
trade_time sym
2019-01-01 A1 10638
A10 10747
A11 10709
A12 10715
A13 10914
...
2019-01-31 A5 10717
A6 10934
A7 10963
A8 10907
A9 10815
Length: 930, dtype: int64
Orca supports applying multiple aggregate functions at once through agg. Unlike pandas, Orca uses strings in agg to indicate the aggregate function to be called. For example, to calculate the maximum, minimum and average values of daily prices in trades:
>>> trades.groupby(trades.trade_time.dt.date)['price'].agg(["min","max","avg"])
price
min max avg
trade_time
2019-01-01 0.003263 499.999073 249.913612
2019-01-02 0.000468 499.999533 249.956874
2019-01-03 0.000054 499.999998 249.927257
2019-01-04 0.000252 499.999762 249.982737
2019-01-05 0.001907 499.999704 250.097487
2019-01-06 0.000318 499.999824 249.991605
2019-01-07 0.003196 499.999548 249.560505
2019-01-08 0.000216 499.996703 250.024405
2019-01-09 0.002635 499.998985 249.966446
2019-01-10 0.000725 499.996717 249.663324
2019-01-11 0.003140 499.998267 250.243786
2019-01-12 0.000105 499.998453 250.077061
2019-01-13 0.004297 499.999139 250.097489
2019-01-14 0.003510 499.999452 249.775830
2019-01-15 0.002501 499.999638 250.021218
2019-01-16 0.000451 499.998059 250.044059
2019-01-17 0.002359 499.998462 249.808932
2019-01-18 0.000104 499.999963 249.918651
2019-01-19 0.000999 499.998000 249.899495
2019-01-20 0.000489 499.999892 249.606668
2019-01-21 0.000729 499.999774 249.839876
2019-01-22 0.000834 499.999331 249.632037
2019-01-23 0.001982 499.999926 249.955031
2019-01-24 0.000323 499.993956 249.557851
2019-01-25 0.000978 499.999716 249.722053
2019-01-26 0.002582 499.998753 249.897519
2019-01-27 0.000547 499.999809 250.404666
2019-01-28 0.002729 499.998545 249.622289
2019-01-29 0.000487 499.999598 249.950167
2019-01-30 0.000811 499.999949 250.182493
2019-01-31 0.000801 499.999292 249.317517
Orca groupby supports filtering functions. Unlike pandas, the filter conditions in Orca are represented by expressions in the form of strings instead of lambda functions.
For example, return the records in trades where the average price of each stock per day is greater than 200 and the number of records is greater than 11000:
>>> trades.groupby([trades.trade_time.dt.date,'sym'])['price'].filter("avg(price) > 200 and count(price) > 11000")
0 499.171179
1 375.553059
2 119.240890
3 370.198534
4 5.876941
...
88416 37.872317
88417 373.259785
88418 435.154484
88419 436.163806
88420 428.455914
Length: 88421, dtype: float64
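Note that the filter returns individual records, not one value per group: a record is kept when the group it belongs to satisfies the condition. The plain-Python sketch below mimics this behavior on a few made-up rows. The function filter_groups and its thresholds are hypothetical and only illustrate the semantics; this is not how Orca executes the query.

```python
from collections import defaultdict

# Hypothetical rows of (date, sym, price); the values are made up.
rows = [
    ("2019-01-01", "A1", 300.0),
    ("2019-01-01", "A1", 150.0),
    ("2019-01-01", "A2", 100.0),
    ("2019-01-01", "A2", 120.0),
    ("2019-01-02", "A1", 250.0),
]


def filter_groups(rows, min_avg, min_count):
    """Keep the prices of rows whose (date, sym) group passes both conditions."""
    groups = defaultdict(list)
    for date, sym, price in rows:
        groups[(date, sym)].append(price)
    keep = {
        key
        for key, prices in groups.items()
        if sum(prices) / len(prices) > min_avg and len(prices) > min_count
    }
    # Return individual records (not one value per group), in original order.
    return [price for date, sym, price in rows if (date, sym) in keep]


kept = filter_groups(rows, min_avg=200, min_count=1)
```

Only the group (2019-01-01, A1) has an average above 200 and more than one record, so kept contains its two prices, including the 150.0 record that is itself below the threshold.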
2.5 resample
Orca supports the resample function, which performs resampling and frequency conversion on regular time series data. Currently, the parameters of the resample function are as follows:
- rule: DateOffset, which can be a string or a dateoffset object
- on: time column, use this column for resampling
- level: string or integer. For MultiIndex, the column specified by level is used for resampling
The DateOffset supported by Orca is as follows:
'B':BDay or BusinessDay
'WOM':WeekOfMonth
'LWOM':LastWeekOfMonth
'M':MonthEnd
'MS':MonthBegin
'BM':BMonthEnd or BusinessMonthEnd
'BMS':BMonthBegin or BusinessMonthBegin
'SM':SemiMonthEnd
'SMS':SemiMonthBegin
'Q':QuarterEnd
'QS':QuarterBegin
'BQ':BQuarterEnd
'BQS':BQuarterBegin
'REQ':FY5253Quarter
'A':YearEnd
'AS' or 'BYS':YearBegin
'BA':BYearEnd
'BAS':BYearBegin
'RE':FY5253
'D':Day
'H':Hour
'T' or 'min':Minute
'S':Second
'L' or 'ms':Milli
'U' or 'us':Micro
'N':Nano
For example, resample the data in trades and calculate the sum of qty for every 3-minute interval:
>>> trades.resample('3T', on='trade_time')['qty'].sum()
trade_time
2019-01-01 00:00:00 321063
2019-01-01 00:03:00 354917
2019-01-01 00:06:00 329419
2019-01-01 00:09:00 340880
2019-01-01 00:12:00 356612
...
2019-01-31 23:45:00 322829
2019-01-31 23:48:00 344753
2019-01-31 23:51:00 330959
2019-01-31 23:54:00 336712
2019-01-31 23:57:00 328730
Length: 14880, dtype: int64
If trade_time has been set as the index of trades, you can also resample as follows:
>>> trades.resample('3T', level='trade_time')['qty'].sum()
To express the offset as a DateOffset object instead of a string, first import the offset class from pandas. Resampling by 3 minutes can also be written as follows:
>>> from pandas.tseries.offsets import Minute
>>> ofst = Minute(n=3)
>>> trades.resample(ofst,on='trade_time')['qty'].sum()
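Conceptually, resampling by 3 minutes assigns each record to a fixed 3-minute window and aggregates within that window. The sketch below reproduces this with the Python standard library on a few made-up records. The function resample_sum is hypothetical, and it assumes windows are counted from midnight and labelled by their start time, which matches the labels in the output shown earlier in this section.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def resample_sum(records, minutes):
    """Sum quantities in fixed windows of `minutes`, labelled by window start."""
    width = timedelta(minutes=minutes)
    totals = defaultdict(int)
    for ts, qty in records:
        midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
        # Floor the timestamp to the start of its window, counted from midnight.
        bucket = midnight + ((ts - midnight) // width) * width
        totals[bucket] += qty
    return dict(sorted(totals.items()))


# Hypothetical (trade_time, qty) records.
records = [
    (datetime(2019, 1, 1, 0, 0, 30), 100),
    (datetime(2019, 1, 1, 0, 2, 59), 50),
    (datetime(2019, 1, 1, 0, 3, 0), 70),
]
result = resample_sum(records, 3)
```

The first two records fall in the window starting at 00:00:00 and the third in the window starting at 00:03:00.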
2.6 rolling window
Orca provides the rolling function to perform calculations in moving windows. Currently, the parameters of the rolling function are as follows:
- window: integer, the length of the window
- on: string, calculate the window based on this column
The following functions can be used for the orca.DataFrame.rolling object:
- count: Return the number of non-NULL elements
- sum: Sum
- min: Minimum
- max: Maximum
- std: Standard deviation
- var: Variance
- corr: Correlation
- covar: Covariance
- skew: Skewness
- kurtosis: Kurtosis
For the DataFrame corresponding to a distributed table, a sliding-window calculation is performed separately within each partition, so the first window-1 values of the result for each partition are empty. For example, calculate the sum of price in a sliding window of length 3 over the data for 2019.01.01 and 2019.01.02 in trades:
>>> tmp = trades[(trades.trade_time.dt.date == '2019.01.01') | (trades.trade_time.dt.date == '2019.01.02')]
>>> tmp.rolling(window=3)['price'].sum()
0 NaN
1 NaN
2 792.386603
3 601.826312
4 444.858366
...
646057 1281.099161
646058 1287.816045
646059 963.262163
646060 865.797011
646061 719.050068
Name: price, Length: 646062, dtype: float64
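The per-partition behavior described above can be sketched in plain Python. The functions rolling_sum and rolling_sum_by_partition and the sample data are hypothetical; they only illustrate why each partition starts with window-1 empty values, and do not reflect how DolphinDB executes the calculation.

```python
def rolling_sum(values, window):
    """Rolling sum; the first window-1 positions have no full window, so they are None."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(values[i + 1 - window : i + 1]))
    return out


def rolling_sum_by_partition(partitions, window):
    """Apply the rolling sum to each partition independently and concatenate."""
    result = []
    for part in partitions:
        result.extend(rolling_sum(part, window))
    return result


# Two hypothetical partitions of prices.
partitions = [[1, 2, 3, 4], [10, 20, 30]]
per_partition = rolling_sum_by_partition(partitions, window=3)
```

Because the window never spans a partition boundary, the result has two empty values at the start of each partition rather than only at the start of the whole series.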
2.7 Data connection
Orca supports joining DataFrames. The DataFrame corresponding to a distributed table can be joined with the DataFrame corresponding to an ordinary in-memory table or with the DataFrame corresponding to another distributed table. When joining the DataFrames corresponding to two distributed tables, both of the following conditions must be met:
- The two distributed tables are in the same database
- The join columns must contain all partition columns
Orca provides the merge and join functions.
The merge function supports the following parameters:
- right: Orca DataFrame or Series
- how: string, indicating the type of connection, can be left, right, outer and inner, the default value is inner
- on: string, indicating the connection column
- left_on: string, representing the connection column of the left table
- right_on: string, representing the connection column of the right table
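- left_index: Boolean value, whether to use the index of the left table as the connection column
- right_index: Boolean value, whether to use the index of the right table as the connection column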
- suffixes: string, representing the suffix of repeated columns
The join function is a special case of the merge function. Its parameters and their meanings are basically the same as those of merge, except that join defaults to a left outer join, that is, how='left'.
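The difference between the two default join types can be shown with a plain-Python sketch. The function join_rows and the data below are hypothetical, and the sketch assumes each key appears at most once per table; it only illustrates inner versus left join semantics.

```python
# Hypothetical data keyed by (trade_time, sym): one table holds qty, the other bid.
sample_trades = {("09:30:00", "A1"): 100, ("09:30:01", "A2"): 200}
sample_quotes = {("09:30:00", "A1"): 10.5}


def join_rows(left, right, how):
    """Return (key, left_value, right_value) rows for an inner or left join."""
    rows = []
    for key, left_value in left.items():
        if key in right:
            rows.append((key, left_value, right[key]))
        elif how == "left":
            # Unmatched left rows are kept, with a missing value on the right.
            rows.append((key, left_value, None))
    return rows


inner_rows = join_rows(sample_trades, sample_quotes, how="inner")
left_rows = join_rows(sample_trades, sample_quotes, how="left")
```

The inner join keeps only the key present in both tables, while the left join keeps every row of the left table and fills the unmatched right-hand value with None, which corresponds to the NaN values that a left join produces for unmatched rows.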
For example, to inner join trades and quotes:
>>> quotes = orca.read_table('dfs://orca_stock','quotes')
>>> trades.merge(right=quotes, left_on=['trade_time','sym'], right_on=['trade_time','sym'], how='inner')
trade_time sym qty price bid offer
0 2019-01-01 02:36:34 A15 273 186.144261 317.458480 155.361661
1 2019-01-01 05:37:59 A13 185 420.397500 248.447426 115.722893
2 2019-01-01 00:59:43 A10 751 89.801687 193.925714 144.345473
3 2019-01-01 21:58:36 A16 175 251.753495 116.810807 439.178207
4 2019-01-01 10:53:54 A16 532 71.733640 240.927647 388.718680
... ... ... ... ... ... ...
25035 2019-01-02 03:59:51 A3 220 50.004418 107.905522 167.375994
25036 2019-01-02 17:54:01 A3 202 195.189216 134.463906 142.443428
25037 2019-01-02 16:57:50 A9 627 68.661644 440.421876 110.801070
25038 2019-01-02 10:27:43 A28 414 487.337282 169.081363 261.171073
25039 2019-01-02 17:02:51 A3 661 243.960836 92.999404 26.747609
[25040 rows x 6 columns]
Use the join function to left join trades and quotes:
>>> trades.set_index(['trade_time','sym'], inplace=True)
>>> quotes.set_index(['trade_time','sym'], inplace=True)
>>> trades.join(quotes)
qty price bid offer
trade_time sym
2019-01-01 18:04:25 A14 435 378.595626 NaN NaN
2019-01-01 20:38:47 A13 701 275.039372 NaN NaN
2019-01-01 02:43:03 A16 787 138.751605 NaN NaN
2019-01-01 20:32:42 A14 989 188.035335 NaN NaN
2019-01-01 16:59:16 A13 847 118.071427 NaN NaN
... ... ... ... ...
2019-01-31 17:21:27 A30 3 49.855063 NaN NaN
2019-01-31 13:49:01 A6 273 245.966115 NaN NaN
2019-01-31 16:42:29 A7 548 197.814548 NaN NaN
2019-01-31 03:42:11 A5 563 263.999224 NaN NaN
2019-01-31 20:48:57 A9 809 318.420522 NaN NaN
[10000481 rows x 4 columns]
3 Append a DataFrame to the dfs table
Orca provides the append function to append an Orca DataFrame to the dfs table.
The append function has the following parameters:
- other: the DataFrame to be appended
- ignore_index: Boolean value, whether to ignore the index. The default is False
- verify_integrity: Boolean value. The default is False
- sort: Boolean value, indicating whether to sort. Default is None
- inplace: Boolean value, indicating whether to insert into the dfs table. The default is False
For example, to append the dataframe to the distributed table corresponding to trades:
>>> import pandas as pd
>>> odf=orca.DataFrame({'trade_time':pd.date_range('20190101 12:30',periods=5,freq='T'),
'sym':['A1','A2','A3','A4','A5'],
'qty':[100,200,300,400,500],
'price':[100.5,263.1,254.9,215.1,245.6]})
>>> trades.append(odf,inplace=True)
>>> len(trades)
10000005
Orca extends the append function with the inplace parameter, which allows data to be added in place. If inplace is False, the behavior is the same as in pandas: the contents of the distributed table are copied into memory, the result corresponds only to an in-memory table, and the contents of odf are appended to that in-memory table rather than to the dfs table.
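The difference between the two modes can be pictured with a toy class. ToyTable below is hypothetical and is not an Orca class; it only mirrors the semantics described here, where inplace=True changes the stored table and inplace=False leaves it untouched and returns a separate copy.

```python
class ToyTable:
    """Hypothetical stand-in for a table handle; not an Orca class."""

    def __init__(self, rows):
        self.rows = list(rows)

    def append(self, other, inplace=False):
        if inplace:
            # Mutate this table and return nothing, like an in-place write.
            self.rows.extend(other.rows)
            return None
        # Otherwise build a new, independent copy and leave this table untouched.
        return ToyTable(self.rows + other.rows)


stored = ToyTable([1, 2, 3])
extra = ToyTable([4, 5])

combined = stored.append(extra)        # stored still has 3 rows
rows_after_copy = len(stored.rows)
stored.append(extra, inplace=True)     # stored now has 5 rows
rows_after_inplace = len(stored.rows)
```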
4 Summary
For distributed tables, Orca currently has some functional limitations. For example, the DataFrame corresponding to a partitioned table does not have the concept of RangeIndex, some functions do not support distributed tables, and there are restrictions on modifying the data in the table. Please refer to the Orca Quick Start Guide for details.