A Detailed Getting Started Guide to Orca

This article introduces how to install Orca, its basic operations, the differences between Orca and pandas, and the details users need to pay attention to when programming with Orca, so that users can take full advantage of DolphinDB and write efficient Orca code.

1. Installation

Orca supports Linux and Windows, and requires Python 3.6 or higher and pandas 0.25.1 or higher. The Orca project has been integrated into the DolphinDB Python API, so you can use Orca by installing the DolphinDB Python API with pip:

pip install dolphindb

Orca is developed on top of the DolphinDB Python API, so users need a running DolphinDB server. Connect to the server with the connect function, and then you can use Orca:

>>> import dolphindb.orca as orca
>>> orca.connect(MY_HOST, MY_PORT, MY_USERNAME, MY_PASSWORD)

If you already have an existing pandas program, you can simply replace the pandas import:

# import pandas as pd
import dolphindb.orca as pd

pd.connect(MY_HOST, MY_PORT, MY_USERNAME, MY_PASSWORD)

2. Quick Start

Create an Orca Series object by passing in a list of values. Orca automatically adds a default index:

>>> import numpy as np
>>> s = orca.Series([1, 3, 5, np.nan, 6, 8])
>>> s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

Create an Orca DataFrame object by passing in a dictionary, where each value must be convertible into a Series-like object:

>>> df = orca.DataFrame(
...     {"a": [1, 2, 3, 4, 5, 6],
...      "b": [100, 200, 300, 400, 500, 600],
...      "c": ["one", "two", "three", "four", "five", "six"]},
...      index=[10, 20, 30, 40, 50, 60])
>>> df
    a    b      c
10  1  100    one
20  2  200    two
30  3  300  three
40  4  400   four
50  5  500   five
60  6  600    six

You can also create an Orca DataFrame directly from a pandas DataFrame:

>>> dates = pd.date_range('20130101', periods=6)
>>> pdf = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
>>> df = orca.DataFrame(pdf)
>>> df
                   A         B         C         D
2013-01-01  0.758590 -0.180460 -0.066231  0.259408
2013-01-02  1.165941  0.961164 -0.716258  0.143499
2013-01-03  0.441121 -0.232495 -0.275688  0.516371
2013-01-04  0.281048 -0.782518 -0.683993 -1.474788
2013-01-05 -0.959676  0.860089  0.374714 -0.535574
2013-01-06  1.357800  0.729484  0.142948 -0.603437

Now df is an Orca DataFrame:

>>> type(df)
<class 'orca.core.frame.DataFrame'>

When an Orca object is printed directly, the server usually transfers the entire corresponding DolphinDB object to the client, which may cause unnecessary network overhead. Use the head function to view only the top rows of an Orca object:

>>> df.head()
                   A         B         C         D
2013-01-01  0.758590 -0.180460 -0.066231  0.259408
2013-01-02  1.165941  0.961164 -0.716258  0.143499
2013-01-03  0.441121 -0.232495 -0.275688  0.516371
2013-01-04  0.281048 -0.782518 -0.683993 -1.474788
2013-01-05 -0.959676  0.860089  0.374714 -0.535574

View the index and column names of the data through the index and columns attributes:

>>> df.index
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

>>> df.columns
Index(['A', 'B', 'C', 'D'], dtype='object')

Convert an Orca DataFrame into a pandas DataFrame with to_pandas:

>>> pdf1 = df.to_pandas()
>>> type(pdf1)
<class 'pandas.core.frame.DataFrame'>

Load a CSV file with read_csv. The CSV file must be located on the DolphinDB server, and the path given is its path on the server side:

>>> df = orca.read_csv("/home/DolphinDB/Orca/databases/USPrices.csv")

3. Orca's architecture

The top layer of Orca is the pandas API, and the bottom layer is the DolphinDB database; the Orca client communicates with the DolphinDB server through the DolphinDB Python API. Orca's basic working principle is to generate DolphinDB scripts on the client side in Python and send them through the DolphinDB Python API to the DolphinDB server to be parsed and executed. An Orca DataFrame only stores the metadata of the corresponding DolphinDB table; the actual storage and computation happen on the server side.

How Orca stores data

Orca objects are stored as DolphinDB tables. Whether it is an Orca DataFrame or an Orca Series, its underlying storage is a DolphinDB table in which the data columns and the index columns are stored together. The DolphinDB table represented by an Orca DataFrame contains several data columns and several index columns; the table represented by an Orca Series contains one data column and several index columns. This makes it easy to implement operations such as index alignment, per-column calculations, and grouped aggregation.

An Orca DataFrame only stores the metadata of the corresponding DolphinDB table, including the table name, the data column names, and the index column names. Accessing a column of a DataFrame does not create a new table: the returned Series and the original DataFrame share the same table; only the metadata recorded by the Orca object differs.

4. Functional limitations of Orca

Due to Orca's architecture, its interface has some limitations:

  • Column data type

Each column of a DolphinDB table must have a specified data type, and DolphinDB's ANY type cannot be used as a column type. Therefore, an Orca column cannot contain mixed data types, and the data in a column cannot be Python objects that DolphinDB does not support, such as Python's built-in list and dict, or datetime objects from the standard library.

Functions designed for these unsupported types, such as DataFrame.explode, have no practical meaning in Orca.

  • Column name restrictions

Column names in a DolphinDB table must be valid DolphinDB variable names: they may contain only letters, numbers, and underscores, must start with a letter, and must not be DolphinDB reserved words such as if.

DolphinDB does not allow duplicate column names, so Orca column names must be unique.

Column names beginning with the uppercase prefix ORCA_ are reserved by Orca, which internally names certain special columns (such as the index) in this form. Users should avoid using such strings as Orca column names, otherwise unexpected behavior may occur.
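As a sketch, the rules above could be checked before uploading with a helper like the following (a hypothetical function, not part of Orca; the reserved-word set is illustrative and far from complete):

```python
import re

# Partial, illustrative list of DolphinDB reserved words -- not exhaustive.
DOLPHINDB_RESERVED = {"if", "else", "def", "return", "select", "from", "where"}

def is_valid_column_name(name: str) -> bool:
    """Hypothetical check: letters, digits, or underscores only,
    starting with a letter, not a reserved word, and not using the
    ORCA_ prefix that Orca reserves for internal columns."""
    if not re.fullmatch(r"[A-Za-z][A-Za-z0-9_]*", name):
        return False
    if name in DOLPHINDB_RESERVED:
        return False
    if name.startswith("ORCA_"):
        return False
    return True

print(is_valid_column_name("price"))      # True
print(is_valid_column_name("2day"))       # False: starts with a digit
print(is_valid_column_name("ORCA_idx"))   # False: reserved prefix
```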

  • Partitioned tables have no strict ordering

If the DolphinDB table corresponding to a DataFrame is a partitioned table, its data storage is not contiguous, so there is no concept of a RangeIndex, and there is no strict ordering between the partitions. Therefore, if a DataFrame represents a DolphinDB partitioned table, the following operations cannot be performed:

(1) Accessing rows by position through iloc

(2) Assigning a Series or DataFrame of a different partition type to the DataFrame

  • Some functions do not support distributed calls

Some DolphinDB built-in functions, such as median, quantile, and mad, currently have no distributed implementation.

  • The null mechanism is different

DolphinDB represents the null value of a numeric type by the minimum value of that type, while pandas represents nulls with the floating-point nan. Orca's null mechanism is consistent with DolphinDB's. Only when data is transferred over the network (downloaded) are DolphinDB numeric columns containing nulls converted to floating-point types, with the nulls converted to nan.

For string types, pandas also uses nan as the null value, which means a string column containing nulls is actually stored by pandas as a mix of strings and floating-point numbers. Such mixed-type columns are not allowed in DolphinDB, which instead uses the empty string to represent a null string. To upload a string column containing nulls, preprocess the column and fill in the nulls first:

import numpy as np
import pandas as pd

df = pd.DataFrame({"str_col": ["hello", "world", np.nan]})
odf = orca.DataFrame(df)    # Error
odf = orca.DataFrame(df.fillna({"str_col": ""}))    # Correct way to upload a string column with null values
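The mixed-type problem can be observed in pandas itself; this sketch shows why the fillna preprocessing is needed:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"str_col": ["hello", "world", np.nan]})

# The column dtype is object, and the missing entry is a float nan,
# so the column actually mixes str and float values:
value_types = {type(v) for v in df["str_col"]}
print(value_types)                 # {<class 'str'>, <class 'float'>}

# Filling nulls with the empty string gives a homogeneous str column,
# which matches DolphinDB's representation of a null string:
filled = df.fillna({"str_col": ""})
all_str = all(isinstance(v, str) for v in filled["str_col"])
print(all_str)                     # True
```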

  • Limits of axis

As a columnar database, DolphinDB handles operations along axis=0 much better than along axis=1. For many operations, such as aggregations like sum and mean, cross-row aggregation (computing the function value of each column) performs better than cross-column aggregation (computing the function value of each row). Most functions support cross-row calculation, but only a few, such as sum, mean, max, min, var, and std, support cross-column calculation. In pandas, specifying axis=0 or axis='index' computes across rows, and axis=1 or axis='columns' computes across columns. Orca functions often support only axis=0 or axis='index'.

Orca's DataFrame also does not support the transpose operation, because a column of the transposed DataFrame could contain mixed types of data.
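In pandas terms, the two directions look like this (a runnable pandas sketch; in Orca, only the handful of functions listed above would accept axis=1):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=0 (cross-row): one aggregate per column -- the well-supported case in Orca
col_sums = df.sum(axis=0)
print(col_sums["a"], col_sums["b"])   # 6 60

# axis=1 (cross-column): one aggregate per row -- in Orca only a few
# functions (sum, mean, max, min, var, std, ...) support this
row_sums = df.sum(axis=1)
print(row_sums.tolist())              # [11, 22, 33]
```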

  • Python callables are not accepted as parameters

Because the DolphinDB Python API currently cannot parse Python functions, functions such as DataFrame.apply and DataFrame.agg cannot accept a Python callable object as a parameter.

As an alternative, Orca lets you pass in a DolphinDB string, which can be a DolphinDB built-in function, a user-defined function, or a conditional expression. For details, see the section on higher-order functions.

5. Best Practices

  • Reduce to_pandas and from_pandas calls

Orca communicates with the server through the DolphinDB Python API. The actual data storage, queries, and computation all happen on the server side; Orca is just a client providing a pandas-like interface. The bottleneck of the system is therefore often network communication. When writing high-performance Orca programs, users need to think about how to reduce network traffic.

Calling to_pandas to convert an Orca object into a pandas object makes the server transfer the entire DolphinDB object to the client. Unless necessary, such conversions should be avoided. In addition, the following operations implicitly call to_pandas and also deserve attention:

(1) Printing an Orca DataFrame or Series that represents a non-partitioned table

(2) Calling to_numpy or accessing values

(3) Calling Series.unique, orca.qcut, and other functions that return a numpy.ndarray

(4) Calling plotting functions

(5) Exporting Orca objects to third-party data formats

Similarly, from_pandas uploads a local pandas object to the DolphinDB server. When the data parameter of orca.DataFrame or orca.Series is not an Orca object, a pandas object is first created locally and then uploaded to the DolphinDB server. When writing Orca code, consider reducing round-trip network communication.

  • Orca is not always evaluated immediately

Orca uses a lazy evaluation strategy: certain operations are not computed immediately on the server but are converted into intermediate expressions, and the computation does not happen until it is really needed. To trigger the computation, the user calls the compute function. For example, arithmetic on columns of the same DataFrame does not trigger computation immediately:

>>> df = orca.DataFrame({"a": [1, 2, 3], "b": [10, 10, 30]})
>>> c = df["a"] + df["b"]
>>> c    # not calculated yet
<orca.core.operator.ArithExpression object at 0x0000027FA5527B70>

>>> c.compute()    # trigger the calculation
0    11
1    12
2    33
dtype: int64

As another example, a conditional filter query does not trigger the computation immediately:

>>> d = df[df["a"] > 2]
>>> d
<orca.core.frame.DataFrame object with a WHERE clause>

>>> d.compute()    # trigger the calculation
   a   b
2  3  30

Aggregating with functions such as cumsum after grouping, or calling transform, also does not return the result immediately:

>>> c = df.groupby("b").cumsum()
>>> c
<orca.core.operator.DataFrameContextByExpression object at 0x0000017C010692B0>

>>> c.compute()    # trigger the calculation
   a
0  1
1  3
2  3

>>> c = df.groupby("b").transform("count")
>>> c
<orca.core.operator.DataFrameContextByExpression object at 0x0000012C414FE128>

>>> c.compute()    # trigger the calculation
   a
0  2
1  2
2  1

  • Operate on columns of the same DataFrame to improve performance

If the operations are performed on columns of the same DataFrame, Orca can optimize them into a single DolphinDB SQL expression, which yields higher performance. For example:

(1) Element-wise calculations: df.x + df.y, df * df, df.x.abs()

(2) Row filtering: df[df.x > 0]

(3) isin operations: df[df.x.isin([1, 2, 3])]

(4) Datetime/string accessors: df.date.dt.month

(5) Assigning a computed result of the same length: df["ret"] = df["ret"].abs()

When the DataFrame is a filtered result, this optimization can also be achieved if the filter conditions are exactly the same object in Python (that is, the values returned by the id function are identical).

The following script cannot be optimized:

df[df.x > 0] = df[df.x > 0] + 1

In the script above, although the filter conditions on both sides of the assignment look the same, they are actually two different Python objects, so in the DolphinDB engine a select statement is executed first, followed by an update statement. If the filter condition is assigned to an intermediate variable, Orca can optimize the code into a single DolphinDB update statement:

df_x_gt_0 = df.x > 0
df[df_x_gt_0] = df[df_x_gt_0] + 1
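The requirement that the two filters be the very same Python object (identical id) can be illustrated with plain Python, independent of Orca:

```python
# Two syntactically identical expressions evaluate to two distinct objects,
# so their id() values differ -- which is why Orca cannot recognize
# df.x > 0 written twice as "the same" filter condition:
a = [1, 2, 3]
b = [1, 2, 3]
same_value = (a == b)       # True: equal values
same_object = (a is b)      # False: different objects, different id()

# Binding the expression to one variable and reusing it guarantees a
# single object, which is what assigning df_x_gt_0 achieves:
c = a
reused_same_object = (c is a)   # True: same object, same id()
print(same_value, same_object, reused_same_object)   # True False True
```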

  • Restrictions on modifying table data

In DolphinDB, the data type of a table column cannot be modified.

In addition, non-memory tables (such as DFS tables) have these limitations:

(1) New columns cannot be added

(2) Data cannot be modified through an update statement

And partitioned tables have these limitations:

(1) There is no strict ordering between data in different partitions

(2) A vector cannot be assigned to a column through an update statement

Therefore, attempts to modify an Orca object may fail. Modifications of Orca objects follow these rules:

(1) If the updated data type is incompatible, for example when a string is assigned to an integer column, an exception is thrown

(2) When a column is added to an Orca object representing a non-memory table, or its data is modified, the table is copied as an in-memory table and a warning is given

(3) When a default index would automatically be added to an Orca object representing a partitioned table, no column is added and a warning is given

(4) When setting or adding a column on an Orca object representing a partitioned table, an exception is thrown if the column is a Python or numpy array, or an Orca Series representing an in-memory table

When you try to add a column to, or modify the data of, an Orca object that represents a non-memory table, the data is first copied as an in-memory table and then modified. With massive amounts of data this may exhaust memory, so such modifications should be avoided as much as possible.

Some Orca functions do not support the inplace parameter, because inplace involves modifying the data itself.

For example, the following Orca script tries to add a column to df, which copies the DFS table as an in-memory table and may cause performance problems when the data volume is large:

df = orca.load_table("dfs://orca", "tb")
df["total"] = df["price"] * df["amount"]     # Will copy the DFS table as an in-memory segmented table!
total_group_by_symbol = df.groupby(["date", "symbol"])["total"].sum()

The script above can be optimized to avoid creating the new column, and thus avoid copying a large amount of data: use set_index to make the grouping fields date and symbol the index, group and aggregate by index level through the level parameter of groupby, and set the lazy parameter of groupby to True so that total is not computed immediately:

df = orca.load_table("dfs://orca", "tb")
df.set_index(["date", "symbol"], inplace=True)
total = df["price"] * df["amount"]     # The DFS table is not copied
total_group_by_symbol = total.groupby(level=[0,1], lazy=True).sum()

  • Higher-order functions

Many pandas interfaces, such as DataFrame.apply and GroupBy.filter, accept a Python callable as a parameter. Orca essentially works by parsing the user's program into DolphinDB scripts through the Python API, so Orca currently cannot parse Python callables. If the user passes in one or more callables, these functions try to convert the Orca object to a pandas object, call the corresponding pandas interface, and then convert the result back to an Orca object. This not only incurs extra network communication but also returns a new DataFrame, so some calculations can no longer achieve the high performance of operating on the same DataFrame.

As an alternative, these interfaces in Orca accept a string, which is passed to DolphinDB for computation. The string can be a DolphinDB built-in function (or a partial application of one), a DolphinDB user-defined function, a DolphinDB conditional expression, and so on. This alternative gives Orca flexibility: users can write a DolphinDB script fragment as needed and then, just as pandas calls a user-defined function, have the DolphinDB computing engine execute it.

The following examples rewrite pandas code that takes callables as parameters into Orca code:

(1) Grouped weighted average

pandas:

wavg = lambda df: (df["prc"] * df["vol"]).sum() / df["vol"].sum()
df.groupby("symbol").apply(wavg)

Orca:

df.groupby("symbol")["prc"].apply("wavg{,vol}")

Through the apply function, the Orca script calls the DolphinDB partial application wavg{,vol} on the prc column after the group by. Converted to a DolphinDB script, it is equivalent to:

select wavg{,vol}(prc) from df group by symbol

Expanding the partial application, it is equivalent to:

select wavg(prc,vol) from df group by symbol
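Assuming the usual definition of a weighted average, wavg(x, w) = sum(x*w)/sum(w), the equivalence between this form and the pandas lambda above can be checked in plain Python:

```python
# Plain-Python check of the weighted-average identity behind wavg{,vol}(prc):
prc = [10.0, 11.0, 12.0]
vol = [100, 300, 600]

wavg = sum(p * v for p, v in zip(prc, vol)) / sum(vol)
# (10*100 + 11*300 + 12*600) / 1000 = 11500 / 1000
print(wavg)   # 11.5
```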

(2) Conditional filtering after grouping

pandas:

df.groupby("symbol").filter(lambda x: len(x) > 1000)

Orca:

df.groupby("symbol").filter("size(*) > 1000")

In the Orca script above, the string accepted by filter is a filter condition expression. Converted to a DolphinDB script, it is equivalent to:

select * from df context by symbol having size(*) > 1000

That is, the filter string appears in the SQL having clause.
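A runnable pandas miniature of the same pattern (with the threshold lowered to 2 for a small sample; the Orca string form would be "size(*) > 2"):

```python
import pandas as pd

df = pd.DataFrame({"symbol": ["A", "A", "A", "B"],
                   "prc": [1, 2, 3, 4]})

# Keep only the groups with more than 2 rows:
kept = df.groupby("symbol").filter(lambda x: len(x) > 2)
print(kept["symbol"].tolist())   # ['A', 'A', 'A']
```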

(3) Applying a calculation function to an entire Series

pandas:

s.apply(lambda x: x + 1)

Orca:

s.apply("(x->x+1)")

pandas:

s.apply(np.log)

Orca:

s.apply("log")

Common calculation functions such as log, exp, floor, ceil, trigonometric functions, and inverse trigonometric functions are already integrated into Orca. For example, a logarithm can be computed simply with s.log().

(4) Using a comma (,) instead of & when filtering

In a DolphinDB where expression, a comma indicates execution order and is more efficient: the next condition is evaluated only after the previous one passes. Orca extends pandas's conditional filtering to support commas in filter statements:

pandas:

df[(df.x > 0) & (df.y < 0)]

Orca:

df[(df.x > 0), (df.y < 0)]

With the traditional & operator, the & in the where expression is converted to DolphinDB's and function when the DolphinDB script is finally generated. With a comma, a comma is used at the corresponding position in the where expression, achieving higher efficiency.

(5) How to implement DolphinDB's context by statement

DolphinDB's context by statement supports processing data within groups. In Orca, this can be done by calling transform after groupby, where transform usually requires a DolphinDB user-defined function string. Orca extends transform: if you call groupby on an intermediate expression with the extension parameter lazy=True and then call transform without arguments, Orca performs a context by computation on the expression that groupby was called on. For example:

pandas:

df.groupby("date")["prc"].transform(lambda x: x.shift(5))

Orca rewrite:

df.groupby("date")["prc"].transform("shift{,5}")

Orca extended usage:

df.shift(5).groupby("date", lazy=True)["prc"].transform()

This special Orca usage takes full advantage of lazy evaluation. In the code above, df.shift(5) is not actually computed; it only generates an intermediate expression (type(df.shift(5)) reveals an ArithExpression, not a DataFrame). With the extension parameter lazy=True, groupby does not group the computed result of the expression.

In the momentum trading strategy tutorial, we make full use of this extension to implement DolphinDB's context by.

6. What should I do if Orca currently cannot solve my problem?

This article has explained many differences between Orca and pandas, as well as some of Orca's limitations. If you cannot work around these limitations (for example, an Orca function does not support a certain parameter, or you need to apply a complex user-defined function that calls third-party library functions unavailable in DolphinDB), you can convert the Orca DataFrame/Series to a pandas DataFrame/Series with to_pandas, perform the computation in pandas, and then convert the result back to an Orca object.

For example, Orca currently does not support the method="average" and na_option="keep" parameters of the rank function. If you must use these parameters, you can do the following:

>>> df.rank(method='average', na_option='keep')
ValueError: method must be 'min'

>>> pdf = df.to_pandas()
>>> rank = pdf.rank(method='average', na_option='keep')
>>> rank = orca.DataFrame(rank)

This solves the problem, but it incurs extra network communication, and the table underlying the new DataFrame is no longer the one represented by the original DataFrame, so some of the optimizations for operations on the same DataFrame can no longer be applied.

Orca is still under development. We will add richer features to DolphinDB in the future, and Orca's interfaces and supported parameters will become more complete accordingly.


Origin blog.51cto.com/15022783/2633843