Practical guide | How to use Orca to operate DolphinDB distributed tables

DolphinDB is a distributed time series database with rich built-in calculation and analysis functions. It can store terabytes of data across multiple physical machines, make full use of the CPU, and run high-performance analysis and calculations on that data. Through Orca, we can perform complex and efficient calculations on the data in a DolphinDB distributed database from the Python environment, using scripts with the same syntax as pandas. This tutorial introduces Orca's operations on DolphinDB distributed tables.

This example uses DolphinDB stand-alone mode. First, create the sample database dfs://orca_stock for this tutorial. The DolphinDB script to create the database is as follows:

login("admin","123456")
if(existsDatabase("dfs://orca_stock")){
	dropDatabase("dfs://orca_stock")
}
dates=2019.01.01..2019.01.31
syms="A"+string(1..30)
sym_range=cutPoints(syms,3)
db1=database("",VALUE,dates)
db2=database("",RANGE,sym_range)
db=database("dfs://orca_stock",COMPO,[db1,db2])
n=10000000
datetimes=2019.01.01T00:00:00..2019.01.31T23:59:59
t=table(rand(datetimes,n) as trade_time,rand(syms,n) as sym,rand(1000,n) as qty,rand(500.0,n) as price)
trades=db.createPartitionedTable(t,`trades,`trade_time`sym).append!(t)

n=200000
datetimes=2019.01.01T00:00:00..2019.01.02T23:59:59
syms="A"+string(1..30)
t2=table(rand(datetimes,n) as trade_time,rand(syms,n) as sym,rand(500.0,n) as bid,rand(500.0,n) as offer)
quotes=db.createPartitionedTable(t2,`quotes,`trade_time`sym).append!(t2)

syms="A"+string(1..30)
t3=table(syms as sym,rand(0 1,30) as type)
infos=db.createTable(t3,`infos).append!(t3)
Note: You need to create a distributed table on the DolphinDB client or through the DolphinDB Python API. You cannot create a distributed table directly in Orca.
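
To make the partition layout created by the script above more concrete, here is a plain-Python sketch of how a row could be routed under a composite scheme: a VALUE partition per day combined with a RANGE partition on the stock code. The boundaries in SYM_BOUNDS and the function name partition_of are hypothetical and chosen only for illustration. The real boundaries are whatever cutPoints(syms,3) returns on the server.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical range boundaries for the sym column, compared as strings.
# They are NOT the values produced by cutPoints on a real server.
SYM_BOUNDS = ["A1", "A18", "A26", "A99"]

def partition_of(trade_time: datetime, sym: str) -> tuple:
    """Return (date partition, symbol-range partition) for one row."""
    day = trade_time.date().isoformat()      # VALUE partition: one per day
    i = bisect_right(SYM_BOUNDS, sym) - 1    # RANGE partition: [low, high)
    if not 0 <= i < len(SYM_BOUNDS) - 1:
        raise ValueError(f"{sym} is outside every symbol range")
    return day, f"{SYM_BOUNDS[i]}_{SYM_BOUNDS[i + 1]}"

print(partition_of(datetime(2019, 1, 1, 18, 4, 33), "A16"))
print(partition_of(datetime(2019, 1, 30, 4, 41, 56), "A2"))
```

Because the codes are compared as strings, "A2" sorts after "A18", which is why the two sample rows fall into different symbol ranges.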

Connect to the DolphinDB server through the connect function in Orca:

>>> import dolphindb.orca as orca
>>> orca.connect("localhost",8848,"admin","123456")

The user needs to modify the IP address and port number according to the actual situation.

1 Read the distributed table

Orca reads distributed tables through the read_table function, and the result returned is an Orca DataFrame. For example, read the table trades in the sample database dfs://orca_stock:

>>> trades = orca.read_table('dfs://orca_stock','trades')
>>> type(trades)
orca.core.frame.DataFrame

View the column names of trades:

>>> trades.columns
Index(['trade_time', 'sym', 'qty', 'price'], dtype='object')

View the data type of each column of trades:

>>> trades.dtypes
trade_time    datetime64[s]
sym                  object
qty                   int32
price               float64
dtype: object

View the number of rows of trades:

>>> len(trades)
10000000

The Orca DataFrame corresponding to a DolphinDB distributed table only stores metadata, such as the table name and the column names. Since a distributed table is not stored contiguously and there is no strict order between its partitions, the DataFrame corresponding to a distributed table does not have the concept of RangeIndex. If you need to set an index, you can use the set_index function. For example, set trade_time in trades as the index:

>>> trades.set_index('trade_time')

If you want to convert the index column back into a data column, you can use the reset_index function.

>>> trades.reset_index()

2 Query and calculation

Orca uses lazy evaluation. Certain calculations are not run immediately on the server side, but are converted into an intermediate expression. The calculation does not occur until its result is actually needed. To trigger the calculation immediately, call the compute function.
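
The idea of lazy evaluation can be sketched in a few lines of plain Python. The class LazyColumn below is a toy stand-in written for this tutorial and is not Orca's real implementation: it only records each operation, and nothing is calculated until compute is called.

```python
class LazyColumn:
    """Toy stand-in for a lazily evaluated column (not Orca's real class)."""

    def __init__(self, data, ops=()):
        self._data = data   # source values, never modified
        self._ops = ops     # pending operations, recorded but not yet run

    def __add__(self, k):
        return LazyColumn(self._data, self._ops + ((lambda x: x + k),))

    def __mul__(self, k):
        return LazyColumn(self._data, self._ops + ((lambda x: x * k),))

    def compute(self):
        """Run every recorded operation and return concrete values."""
        out = list(self._data)
        for op in self._ops:
            out = [op(x) for x in out]
        return out

expr = (LazyColumn([1, 2, 3]) + 1) * 10   # nothing is calculated yet
print(len(expr._ops))                     # two pending operations
print(expr.compute())                     # the calculation happens here
```

In this sketch the intermediate expression is just a tuple of Python functions. Orca presumably builds an expression that is sent to the DolphinDB server instead, but the calling pattern is the same.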

Note that the data in the sample database dfs://orca_stock is randomly generated, so your results will differ from the results shown in this tutorial.

2.1 Get the first n records

The head function returns the first n records; the first 5 records are returned by default. For example, take the first 5 records of trades:

>>> trades.head()
           trade_time  sym  qty       price
0 2019-01-01 18:04:33  A16  855  482.526769
1 2019-01-01 13:57:38  A12  244   61.675293
2 2019-01-01 23:58:15  A10   36  297.623295
3 2019-01-01 23:02:43  A16  426  109.041012
4 2019-01-01 04:33:53   A1  472   75.778951

2.2 Sort

The sort_values method sorts by a given column. For example, sort trades in descending order by price and take the first 5 records:

>>> trades.sort_values(by='price', ascending=False).head()
           trade_time  sym  qty       price
0 2019-01-03 12:56:09  A22  861  499.999998
1 2019-01-18 17:25:21  A19   95  499.999963
2 2019-01-30 02:18:48  A30  114  499.999949
3 2019-01-23 08:31:56   A3  926  499.999926
4 2019-01-20 03:36:53   A3  719  499.999892

Sort by multiple columns:

>>> trades.sort_values(by=['qty','trade_time'], ascending=False).head()
           trade_time  sym  qty       price
0 2019-01-31 23:58:50  A24  999  359.887697
1 2019-01-31 23:57:26   A3  999  420.156175
2 2019-01-31 23:56:34   A2  999  455.228435
3 2019-01-31 23:52:58   A6  999  210.819227
4 2019-01-31 23:45:17  A14  999  310.813216

2.3 Query according to conditions

Orca supports queries based on a single condition or on multiple conditions. For example:

Query the data on January 1, 2019 in trades:

>>> tmp = trades[trades.trade_time.dt.date == "2019.01.01"]
>>> tmp.head()
           trade_time sym  qty       price
0 2019-01-01 00:32:21  A2  139  383.971293
1 2019-01-01 21:19:09  A2  263  100.932553
2 2019-01-01 18:50:48  A2  890  335.614454
3 2019-01-01 23:29:16  A2  858  469.223992
4 2019-01-01 09:58:51  A2  883  235.753424

Query the data in trades on January 30, 2019, with the stock code A2:

>>> tmp = trades[(trades.trade_time.dt.date == '2019.01.30') & (trades.sym == 'A2')]
>>> tmp.head()
           trade_time sym  qty       price
0 2019-01-30 04:41:56  A2  880  428.552654
1 2019-01-30 14:13:53  A2  512  488.826978
2 2019-01-30 14:31:28  A2  536  478.578219
3 2019-01-30 04:09:41  A2  709  255.435903
4 2019-01-30 13:18:50  A2  355  404.782260

2.4 groupby group query

The groupby function is used for grouping and aggregation. The following functions can be applied to groupby objects:

  • count: Return the number of non-NULL elements
  • sum: Sum
  • mean: Mean
  • min: Minimum
  • max: Maximum
  • mode: Mode
  • abs: Absolute value
  • prod: Product
  • std: Standard deviation
  • var: Variance
  • sem: Standard error of the mean
  • skew: Skewness
  • kurtosis: Kurtosis
  • cumsum: Cumulative sum
  • cumprod: Cumulative product
  • cummax: Cumulative maximum
  • cummin: Cumulative minimum

Calculate the number of records per day in trades:

>>> trades.groupby(trades.trade_time.dt.date)['sym'].count()
trade_time
2019-01-01    322573
2019-01-02    322662
2019-01-03    323116
2019-01-04    322436
2019-01-05    322156
2019-01-06    324191
2019-01-07    321879
2019-01-08    323319
2019-01-09    322262
2019-01-10    322585
2019-01-11    322986
2019-01-12    322839
2019-01-13    322302
2019-01-14    322032
2019-01-15    322409
2019-01-16    321810
2019-01-17    321566
2019-01-18    323651
2019-01-19    323463
2019-01-20    322675
2019-01-21    322845
2019-01-22    322931
2019-01-23    322598
2019-01-24    322404
2019-01-25    322454
2019-01-26    321760
2019-01-27    321955
2019-01-28    322013
2019-01-29    322745
2019-01-30    322193
2019-01-31    323190
dtype: int64

Calculate the number of records for each stock in trades per day:

>>> trades.groupby([trades.trade_time.dt.date,'sym'])['price'].count()
trade_time  sym
2019-01-01  A1     10638
            A10    10747
            A11    10709
            A12    10715
            A13    10914
                   ...  
2019-01-31  A5     10717
            A6     10934
            A7     10963
            A8     10907
            A9     10815
Length: 930, dtype: int64

Orca supports applying multiple aggregate functions at once through agg. Unlike pandas, Orca uses strings in agg to indicate the aggregate functions to be called. For example, to calculate the minimum, maximum and average daily price in trades:

>>> trades.groupby(trades.trade_time.dt.date)['price'].agg(["min","max","avg"])
               price                        
                 min         max         avg
trade_time                                  
2019-01-01  0.003263  499.999073  249.913612
2019-01-02  0.000468  499.999533  249.956874
2019-01-03  0.000054  499.999998  249.927257
2019-01-04  0.000252  499.999762  249.982737
2019-01-05  0.001907  499.999704  250.097487
2019-01-06  0.000318  499.999824  249.991605
2019-01-07  0.003196  499.999548  249.560505
2019-01-08  0.000216  499.996703  250.024405
2019-01-09  0.002635  499.998985  249.966446
2019-01-10  0.000725  499.996717  249.663324
2019-01-11  0.003140  499.998267  250.243786
2019-01-12  0.000105  499.998453  250.077061
2019-01-13  0.004297  499.999139  250.097489
2019-01-14  0.003510  499.999452  249.775830
2019-01-15  0.002501  499.999638  250.021218
2019-01-16  0.000451  499.998059  250.044059
2019-01-17  0.002359  499.998462  249.808932
2019-01-18  0.000104  499.999963  249.918651
2019-01-19  0.000999  499.998000  249.899495
2019-01-20  0.000489  499.999892  249.606668
2019-01-21  0.000729  499.999774  249.839876
2019-01-22  0.000834  499.999331  249.632037
2019-01-23  0.001982  499.999926  249.955031
2019-01-24  0.000323  499.993956  249.557851
2019-01-25  0.000978  499.999716  249.722053
2019-01-26  0.002582  499.998753  249.897519
2019-01-27  0.000547  499.999809  250.404666
2019-01-28  0.002729  499.998545  249.622289
2019-01-29  0.000487  499.999598  249.950167
2019-01-30  0.000811  499.999949  250.182493
2019-01-31  0.000801  499.999292  249.317517
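
As a rough illustration of what the grouped aggregation above computes, the following plain-Python sketch groups a few made-up rows by calendar date and derives min, max and avg for each group. The sample rows and variable names are invented for this example. On a real distributed table the work is done by the DolphinDB server, not in Python.

```python
from collections import defaultdict
from datetime import datetime

rows = [  # (trade_time, sym, price): made-up sample rows
    (datetime(2019, 1, 1, 9, 0), "A1", 10.0),
    (datetime(2019, 1, 1, 9, 5), "A2", 30.0),
    (datetime(2019, 1, 2, 9, 0), "A1", 20.0),
]

groups = defaultdict(list)
for trade_time, _sym, price in rows:
    groups[trade_time.date()].append(price)   # group key: calendar date

# One result row per group, one entry per requested aggregate.
daily = {
    day: {"min": min(p), "max": max(p), "avg": sum(p) / len(p)}
    for day, p in sorted(groups.items())
}
for day, stats in daily.items():
    print(day, stats)
```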

Orca groupby supports the filter function. Unlike pandas, the filter conditions in Orca are represented by expressions in the form of strings instead of lambda functions.

For example, return the price values in trades that belong to groups (one group per stock per day) whose average price is greater than 200 and whose number of records is greater than 11000:

>>> trades.groupby([trades.trade_time.dt.date,'sym'])['price'].filter("avg(price) > 200 and count(price) > 11000")
0        499.171179
1        375.553059
2        119.240890
3        370.198534
4          5.876941
            ...    
88416     37.872317
88417    373.259785
88418    435.154484
88419    436.163806
88420    428.455914
Length: 88421, dtype: float64
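
Note that the result contains rows, not one value per group. The plain-Python sketch below shows this behavior on a few made-up rows, with much smaller thresholds than the example above. The function name group_filter and its parameters are hypothetical and only mirror the two conditions in the filter string.

```python
from collections import defaultdict

rows = [  # (date, sym, price): made-up sample rows
    ("2019-01-01", "A1", 300.0),
    ("2019-01-01", "A1", 200.0),
    ("2019-01-01", "A2", 100.0),
    ("2019-01-02", "A1", 400.0),
]

def group_filter(rows, min_avg, min_count):
    """Keep the prices of rows whose (date, sym) group passes both tests."""
    groups = defaultdict(list)
    for day, sym, price in rows:
        groups[(day, sym)].append(price)
    keep = {
        key for key, prices in groups.items()
        if sum(prices) / len(prices) > min_avg and len(prices) > min_count
    }
    # Return the original rows of the surviving groups, in input order.
    return [price for day, sym, price in rows if (day, sym) in keep]

print(group_filter(rows, min_avg=200, min_count=1))
```

Only the group ("2019-01-01", "A1") passes both conditions, so both of its prices are returned.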

2.5 resample

Orca supports the resample function, which performs resampling and frequency conversion on regular time series data. Currently, the parameters of the resample function are as follows:

  • rule: DateOffset, which can be a string or a dateoffset object
  • on: time column, use this column for resampling
  • level: string or integer. For MultiIndex, the column specified by level is used for resampling

The DateOffset supported by Orca is as follows:

'B':BDay or BusinessDay
'WOM':WeekOfMonth
'LWOM':LastWeekOfMonth
'M':MonthEnd
'MS':MonthBegin
'BM':BMonthEnd or BusinessMonthEnd
'BMS':BMonthBegin or BusinessMonthBegin
'SM':SemiMonthEnd
'SMS':SemiMonthBegin
'Q':QuarterEnd
'QS':QuarterBegin
'BQ':BQuarterEnd
'BQS':BQuarterBegin
'REQ':FY5253Quarter
'A':YearEnd
'AS' or 'BYS':YearBegin
'BA':BYearEnd
'BAS':BYearBegin
'RE':FY5253
'D':Day
'H':Hour
'T' or 'min':Minute
'S':Second
'L' or 'ms':Milli
'U' or 'us':Micro
'N':Nano

For example, resample the data in trades into 3-minute intervals and sum qty within each interval:

>>> trades.resample('3T', on='trade_time')['qty'].sum()
trade_time
2019-01-01 00:00:00    321063
2019-01-01 00:03:00    354917
2019-01-01 00:06:00    329419
2019-01-01 00:09:00    340880
2019-01-01 00:12:00    356612
                        ...  
2019-01-31 23:45:00    322829
2019-01-31 23:48:00    344753
2019-01-31 23:51:00    330959
2019-01-31 23:54:00    336712
2019-01-31 23:57:00    328730
Length: 14880, dtype: int64
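
The bucketing behind a 3-minute resample can be sketched in plain Python as follows. This is only an illustration on made-up rows: the function name resample_sum is hypothetical, and aligning the bins to the Unix epoch is an assumption that happens to give bins starting at midnight for a 3-minute width.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def resample_sum(rows, minutes):
    """Sum qty in fixed-width time bins; each bin is labeled by its start."""
    width = timedelta(minutes=minutes)
    epoch = datetime(1970, 1, 1)   # assumed bin origin
    totals = defaultdict(int)
    for trade_time, qty in rows:
        # Floor the timestamp to the start of the bin that contains it.
        bin_start = epoch + ((trade_time - epoch) // width) * width
        totals[bin_start] += qty
    return dict(sorted(totals.items()))

rows = [  # (trade_time, qty): made-up sample rows
    (datetime(2019, 1, 1, 0, 0, 10), 100),
    (datetime(2019, 1, 1, 0, 2, 59), 50),
    (datetime(2019, 1, 1, 0, 3, 0), 7),
]
print(resample_sum(rows, 3))
```

The first two rows fall into the bin starting at 00:00:00 and the third into the bin starting at 00:03:00, which matches the labels in the output above.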

If trade_time has been set as the index of trades, you can also resample as follows:

>>> trades.resample('3T', level='trade_time')['qty'].sum()

If you want to pass a DateOffset object instead of a string, you need to import the offset classes from pandas first. Resampling by 3 minutes can also be written as follows:

>>> from pandas.tseries.offsets import *
>>> ofst = Minute(n=3)
>>> trades.resample(ofst,on='trade_time')['qty'].sum()

2.6 rolling window

Orca provides the rolling function to perform calculations in moving windows. Currently, the parameters of the rolling function are as follows:

  • window: integer, the length of the window
  • on: string, the column on which the window is calculated

The following functions can be used for the orca.DataFrame.rolling object:

  • count: Return the number of non-NULL elements
  • sum: Sum
  • min: Minimum
  • max: Maximum
  • std: Standard deviation
  • var: Variance
  • corr: Correlation
  • covar: Covariance
  • skew: Skewness
  • kurtosis: Kurtosis

For a DataFrame corresponding to a distributed table, sliding-window calculations are performed separately within each partition, so the first window-1 values of each partition's result are empty. For example, calculate the sum of price in a sliding window of length 3 over the data of 2019.01.01 and 2019.01.02 in trades:

>>> tmp = trades[(trades.trade_time.dt.date == '2019.01.01') | (trades.trade_time.dt.date == '2019.01.02')]
>>> re = tmp.rolling(window=3)['price'].sum()
>>> re
0                 NaN
1                 NaN
2          792.386603
3          601.826312
4          444.858366
             ...     
646057    1281.099161
646058    1287.816045
646059     963.262163
646060     865.797011
646061     719.050068
Name: price, Length: 646062, dtype: float64
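
The per-partition behavior described above can be illustrated with a small plain-Python sketch. The two partitions and their values are made up, and rolling_sum is a hypothetical helper, not an Orca function.

```python
def rolling_sum(values, window):
    """Rolling sum; the first window-1 positions have no full window."""
    return [
        None if i + 1 < window else sum(values[i + 1 - window:i + 1])
        for i in range(len(values))
    ]

# Two made-up partitions of the price column.
partitions = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0]]

# The window is applied to each partition separately and the results are
# concatenated, so a window never spans a partition boundary.
result = [x for part in partitions for x in rolling_sum(part, 3)]
print(result)
```

Each partition starts with two empty values, so empty values also appear in the middle of the concatenated result and not only at its beginning.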

2.7 Data connection

Orca supports joining DataFrames. A DataFrame corresponding to a distributed table can be joined with a DataFrame corresponding to an ordinary in-memory table or with a DataFrame corresponding to another distributed table. When two DataFrames that both correspond to distributed tables are joined, the following conditions must be met at the same time:

  • The two distributed tables are in the same database
  • The join columns must contain all partition columns

Orca provides the merge and join functions.

The merge function supports the following parameters:

  • right: Orca DataFrame or Series
  • how: string, the type of join; can be left, right, outer or inner, and the default value is inner
  • on: string, the join column
  • left_on: string, the join column of the left table
  • right_on: string, the join column of the right table
  • left_index: use the index of the left table as the join key
  • right_index: use the index of the right table as the join key
  • suffixes: strings, the suffixes applied to duplicate column names

The join function is a special case of the merge function. Its parameters and their meanings are basically the same as those of merge, except that join defaults to a left outer join, that is, how='left'.

For example, to perform an inner join of trades and quotes:

>>> quotes = orca.read_table('dfs://orca_stock','quotes')
>>> trades.merge(right=quotes, left_on=['trade_time','sym'], right_on=['trade_time','sym'], how='inner')
               trade_time  sym  qty       price         bid       offer
0     2019-01-01 02:36:34  A15  273  186.144261  317.458480  155.361661
1     2019-01-01 05:37:59  A13  185  420.397500  248.447426  115.722893
2     2019-01-01 00:59:43  A10  751   89.801687  193.925714  144.345473
3     2019-01-01 21:58:36  A16  175  251.753495  116.810807  439.178207
4     2019-01-01 10:53:54  A16  532   71.733640  240.927647  388.718680
...                   ...  ...  ...         ...         ...         ...
25035 2019-01-02 03:59:51   A3  220   50.004418  107.905522  167.375994
25036 2019-01-02 17:54:01   A3  202  195.189216  134.463906  142.443428
25037 2019-01-02 16:57:50   A9  627   68.661644  440.421876  110.801070
25038 2019-01-02 10:27:43  A28  414  487.337282  169.081363  261.171073
25039 2019-01-02 17:02:51   A3  661  243.960836   92.999404   26.747609

[25040 rows x 6 columns]

Use the join function to perform a left join of trades and quotes:

>>> trades.set_index(['trade_time','sym'], inplace=True)
>>> quotes.set_index(['trade_time','sym'], inplace=True)
>>> trades.join(quotes)
                         qty       price  bid  offer
trade_time          sym                             
2019-01-01 18:04:25 A14  435  378.595626  NaN    NaN
2019-01-01 20:38:47 A13  701  275.039372  NaN    NaN
2019-01-01 02:43:03 A16  787  138.751605  NaN    NaN
2019-01-01 20:32:42 A14  989  188.035335  NaN    NaN
2019-01-01 16:59:16 A13  847  118.071427  NaN    NaN
...                      ...         ...  ...    ...
2019-01-31 17:21:27 A30    3   49.855063  NaN    NaN
2019-01-31 13:49:01 A6   273  245.966115  NaN    NaN
2019-01-31 16:42:29 A7   548  197.814548  NaN    NaN
2019-01-31 03:42:11 A5   563  263.999224  NaN    NaN
2019-01-31 20:48:57 A9   809  318.420522  NaN    NaN

[10000481 rows x 4 columns]
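
The difference between the inner join and the left join shown above can be sketched in plain Python on a few made-up rows. The sketch assumes that each (trade_time, sym) key appears at most once in quotes, which is not guaranteed for the randomly generated sample data.

```python
trades = [  # (trade_time, sym, qty): made-up sample rows
    ("2019-01-01 09:00:00", "A1", 100),
    ("2019-01-01 09:00:01", "A2", 200),
]
quotes = {  # (trade_time, sym) -> bid: made-up sample rows
    ("2019-01-01 09:00:00", "A1"): 99.5,
}

# Inner join: keep only trades whose key also appears in quotes.
inner = [(t, s, q, quotes[(t, s)]) for t, s, q in trades if (t, s) in quotes]

# Left join: keep every trade; a missing quote becomes None (NaN in Orca).
left = [(t, s, q, quotes.get((t, s))) for t, s, q in trades]

print(inner)
print(left)
```

The inner join returns one row and the left join returns two, with None in place of the missing bid. This mirrors why most bid and offer values are NaN in the join output above: quotes only covers two days, while trades covers the whole month.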

3 Append a DataFrame to the dfs table

Orca provides the append function to append an Orca DataFrame to a dfs table.

The append function has the following parameters:

  • other: the DataFrame to be appended
  • ignore_index: Boolean value, whether to ignore the index. The default is False
  • verify_integrity: Boolean value. The default is False
  • sort: Boolean value, indicating whether to sort. Default is None
  • inplace: Boolean value, indicating whether to insert into the dfs table. The default is False

For example, to append the dataframe to the distributed table corresponding to trades:

>>> import pandas as pd
>>> odf=orca.DataFrame({'trade_time':pd.date_range('20190101 12:30',periods=5,freq='T'),
                   'sym':['A1','A2','A3','A4','A5'],
                   'qty':[100,200,300,400,500],
                   'price':[100.5,263.1,254.9,215.1,245.6]})
>>> trades.append(odf,inplace=True)
>>> len(trades)
10000005

Orca extends the append function with the inplace parameter, which allows data to be added in place. If inplace is False, the behavior is the same as in pandas: the contents of the distributed table are copied into memory, trades then corresponds only to an in-memory table, and the contents of odf are appended to that in-memory table rather than to the dfs table.

4 Summary

For distributed tables, Orca currently has some functional limitations. For example, the DataFrame corresponding to a partitioned table does not have the concept of RangeIndex, some functions do not support distributed tables, and there are restrictions on modifying the data in the table. Please refer to the Orca Quick Start Guide for details.

Origin blog.csdn.net/qq_41996852/article/details/111505136