Orca Getting Started Guide

This article introduces how to install Orca, its basic operations, the differences between Orca and pandas, and the details users need to pay attention to when programming with Orca, so that users can write efficient Orca code.

1. Installation

Orca supports Linux and Windows, and requires Python 3.6 or higher and pandas 0.25.1 or higher. The Orca project has been integrated into the DolphinDB Python API; you can use Orca by installing the DolphinDB Python API with pip:

pip install dolphindb

Orca is developed on top of the DolphinDB Python API. Users therefore need a running DolphinDB server; connect to it with the connect function, and then run Orca:

>>> import dolphindb.orca as orca
>>> orca.connect(MY_HOST, MY_PORT, MY_USERNAME, MY_PASSWORD)

If you already have an existing pandas program, you can simply replace the pandas import:

# import pandas as pd
import dolphindb.orca as pd

pd.connect(MY_HOST, MY_PORT, MY_USERNAME, MY_PASSWORD)

2. Quick Start

Create an Orca Series object by passing in a list of values. Orca will automatically add a default index for it:

>>> import numpy as np
>>> s = orca.Series([1, 3, 5, np.nan, 6, 8])
>>> s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

Create an Orca DataFrame object by passing in a dictionary. Each value in the dictionary must be convertible to a Series-like object:

>>> df = orca.DataFrame(
...     {"a": [1, 2, 3, 4, 5, 6],
...      "b": [100, 200, 300, 400, 500, 600],
...      "c": ["one", "two", "three", "four", "five", "six"]},
...      index=[10, 20, 30, 40, 50, 60])
>>> df
    a    b      c
10  1  100    one
20  2  200    two
30  3  300  three
40  4  400   four
50  5  500   five
60  6  600    six

You can also directly pass in a pandas DataFrame to create an Orca DataFrame:

>>> import pandas as pd
>>> dates = pd.date_range('20130101', periods=6)
>>> pdf = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
>>> df = orca.DataFrame(pdf)
>>> df
                   A         B         C         D
2013-01-01  0.758590 -0.180460 -0.066231  0.259408
2013-01-02  1.165941  0.961164 -0.716258  0.143499
2013-01-03  0.441121 -0.232495 -0.275688  0.516371
2013-01-04  0.281048 -0.782518 -0.683993 -1.474788
2013-01-05 -0.959676  0.860089  0.374714 -0.535574
2013-01-06  1.357800  0.729484  0.142948 -0.603437

Now df is an Orca DataFrame:

>>> type(df)
<class 'orca.core.frame.DataFrame'>

When you print an Orca object directly, the server usually transfers the entire corresponding DolphinDB table to the client, which may cause unnecessary network overhead. You can view the first rows of an Orca object with the head function:

>>> df.head()
                   A         B         C         D
2013-01-01  0.758590 -0.180460 -0.066231  0.259408
2013-01-02  1.165941  0.961164 -0.716258  0.143499
2013-01-03  0.441121 -0.232495 -0.275688  0.516371
2013-01-04  0.281048 -0.782518 -0.683993 -1.474788
2013-01-05 -0.959676  0.860089  0.374714 -0.535574

View the index and column names of the data through index and columns:

>>> df.index
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

>>> df.columns
Index(['A', 'B', 'C', 'D'], dtype='object')

Convert an Orca DataFrame into a pandas DataFrame through to_pandas:

>>> pdf1 = df.to_pandas()
>>> type(pdf1)
<class 'pandas.core.frame.DataFrame'>

Load a CSV file with read_csv. The CSV file must be located on the DolphinDB server, and the path given is its server-side path:

>>> df = orca.read_csv("/home/DolphinDB/Orca/databases/USPrices.csv")

3. Orca's architecture

Orca exposes the pandas API at the top layer, with the DolphinDB database underneath; the DolphinDB Python API handles communication between the Orca client and the DolphinDB server. Orca works by generating DolphinDB scripts on the Python client and sending them through the DolphinDB Python API to the server, where they are parsed and executed. An Orca DataFrame stores only the metadata of the corresponding DolphinDB table; the actual storage and computation happen on the server side.

How Orca stores data

Orca objects are stored as a DolphinDB table in DolphinDB. Whether it is Orca DataFrame or Orca Series, their underlying storage is a DolphinDB table, and the data columns and index columns are stored in the same table. The DolphinDB table represented by an Orca DataFrame contains several data columns and several index columns. The DolphinDB table represented by an Orca Series contains one column of data and several index columns. This makes it easier to implement operations such as index alignment, calculation of each column in the table, grouping and aggregation.

Orca's DataFrame stores only the metadata of the corresponding DolphinDB table, including the table name, data column names, and index column names. If you access a column of a DataFrame, no new table is created: the returned Series and the original DataFrame share the same table, and only the metadata recorded by the Orca object changes.
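
This metadata model can be illustrated with a toy sketch in plain Python (not Orca's actual internals): both a DataFrame and a Series are views over one shared table of columns, and selecting a column only narrows the metadata.

```python
# Toy model, for illustration only: one shared "table" of columns, and
# DataFrame/Series objects that are just metadata views over it.
table = {"index_col": [10, 20, 30], "a": [1, 2, 3], "b": [4.0, 5.0, 6.0]}

df_meta = {"table": table, "index_cols": ["index_col"], "data_cols": ["a", "b"]}

# Selecting column "a" narrows the metadata; no new table is created.
series_meta = {"table": table, "index_cols": ["index_col"], "data_cols": ["a"]}

assert series_meta["table"] is df_meta["table"]  # both views share one table
```

The point of the sketch is the identity check on the last line: the Series view and the DataFrame view reference the very same table object.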

4. Functional limitations of Orca

Due to Orca's architecture, Orca's interface has some limitations:

  • Column data type

Each column of a DolphinDB table must have a specified data type, and DolphinDB's ANY type cannot be used as a column type. Therefore, an Orca column cannot contain mixed data types, nor can it hold Python objects that DolphinDB does not support, such as Python's built-in list and dict, or datetime objects from the standard library.

Functions designed around such unsupported types, such as DataFrame.explode, have no practical meaning in Orca.
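
A hypothetical client-side check (for illustration only, not part of Orca's API) conveys the rule: every value in a column must share one type, and Python containers are rejected:

```python
def column_type_ok(col):
    # Hypothetical pre-upload check, for illustration only: a DolphinDB column
    # must be homogeneous and cannot hold Python containers such as list or dict.
    if any(isinstance(v, (list, dict, set, tuple)) for v in col):
        return False
    types = {type(v) for v in col if v is not None}  # ignore nulls when typing
    return len(types) <= 1
```

For example, column_type_ok([1, 2, 3]) passes, while a mixed column such as [1, "a"] or a column of lists fails.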

  • Column name restrictions

Column names in a DolphinDB table must be valid DolphinDB variable names: they contain only letters, numbers, or underscores, begin with a letter, and are not DolphinDB reserved words such as if.

DolphinDB does not allow duplicate column names. Therefore, the column names of Orca cannot be repeated.

Column names beginning with the uppercase prefix ORCA_ are reserved by Orca, which internally names certain special columns (such as index columns) in this form. Users should avoid such strings as Orca column names; otherwise unexpected behavior may occur.
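
These rules can be sketched as a hypothetical pre-upload validator (not part of Orca's API; the reserved-word list here is abbreviated, not DolphinDB's full list):

```python
import re

# Abbreviated for illustration; DolphinDB's actual reserved-word list is longer.
DOLPHINDB_RESERVED = {"if", "else", "do", "for", "def", "return"}

def valid_column_name(name):
    # Hypothetical client-side check of the column-name rules described above.
    if name.startswith("ORCA_"):       # prefix reserved internally by Orca
        return False
    if name in DOLPHINDB_RESERVED:     # DolphinDB reserved words are disallowed
        return False
    # Only letters, digits, and underscores, starting with a letter
    return re.fullmatch(r"[A-Za-z][A-Za-z0-9_]*", name) is not None
```

Under this sketch, valid_column_name("price") is accepted, while "ORCA_INDEX", "2col", and "if" are all rejected.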

  • Partitioned tables have no strict order

If the DolphinDB table behind a DataFrame is a partitioned table, its data storage is not contiguous, so there is no concept of a RangeIndex, and there is no strict order among the partitions. Therefore, if a DataFrame represents a DolphinDB partitioned table, the following operations are not possible:

(1) Accessing the corresponding rows through iloc on the partitioned table

(2) Assigning a Series or DataFrame with a different partitioning scheme to the DataFrame

  • Some functions do not support distributed calls

Some DolphinDB built-in functions currently have no distributed version, such as median, quantile, and mad.

  • The NULL mechanism is different

DolphinDB represents a numeric NULL with the minimum value of the corresponding data type, while pandas represents a null with the floating-point nan. Orca follows DolphinDB's NULL mechanism. Only when a network transfer (download) occurs is a DolphinDB numeric column containing NULLs converted to a floating-point type, with the NULLs converted to nan.
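
A pure-Python sketch of that download-time conversion, assuming a 32-bit INT column whose NULL sentinel is the type's minimum value:

```python
INT_MIN = -2**31  # DolphinDB stores an INT NULL as the type's minimum value

def download_int_column(col):
    # Sketch of the download conversion described above: a numeric column
    # containing NULLs arrives client-side as floats, with NULL mapped to nan.
    return [float("nan") if v == INT_MIN else float(v) for v in col]
```

For instance, download_int_column([1, -2**31, 3]) yields 1.0, nan, and 3.0.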

For strings, pandas' null is still nan, so a pandas string column containing nulls is actually stored as a mixed column of strings and floats, and DolphinDB does not allow mixed-type columns. DolphinDB instead represents a string NULL as an empty string. To upload a string column containing null values, preprocess the column and fill in the nulls first:

import numpy as np
import pandas as pd
import dolphindb.orca as orca

df = pd.DataFrame({"str_col": ["hello", "world", np.nan]})
odf = orca.DataFrame(df)    # Error
odf = orca.DataFrame(df.fillna({"str_col": ""}))    # Correct way to upload a string column with NULL values

  • Limits of axis

As a columnar database, DolphinDB handles aggregation along axis=0 (computing a function value for each column over its rows) much better than along axis=1 (computing a function value for each row across columns). Most functions support the former, but only a few, such as sum, mean, max, min, var, and std, support the latter. In pandas, specifying axis=0 or axis='index' computes down each column, and specifying axis=1 or axis='columns' computes across each row. Orca functions often support only axis=0 or axis='index'.

Orca's DataFrame also does not support the transpose operation, because a column of the transposed DataFrame could contain mixed types of data.

  • Does not accept Python callables as parameters

The DolphinDB Python API currently cannot parse Python functions, so functions such as DataFrame.apply and DataFrame.agg cannot accept a Python callable object as a parameter.

As a workaround, Orca lets you pass in a DolphinDB script string instead, which can be a DolphinDB built-in function, a custom function, or a conditional expression. For details, see the section on higher-order functions.

5. Best Practices

  • Reduce to_pandas and from_pandas calls

Orca communicates with the server through the DolphinDB Python API. The actual data storage, queries, and computation all happen on the server; Orca is just a client providing a pandas-like interface. The system's bottleneck is therefore often network communication. When writing high-performance Orca programs, focus on optimizing them to reduce network traffic.

Calling to_pandas converts an Orca object into a pandas object, and the server transfers the entire DolphinDB object to the client. Unless necessary, such conversions should be minimized. In addition, the following operations implicitly call to_pandas, so watch out for them as well:

(1) Print an Orca DataFrame or Series representing a non-partitioned table

(2) Call to_numpy or access values

(3) Call Series.unique, orca.qcut and other functions that return numpy.ndarray

(4) Call plot related functions to draw

(5) Export Orca objects to third-party format data

Similarly, from_pandas uploads a local pandas object to the DolphinDB server. When the data parameter of orca.DataFrame or orca.Series is not an Orca object, a pandas object is created locally and then uploaded. When writing Orca code, consider reducing this back-and-forth network traffic.

  • Orca is not always evaluated immediately

Orca uses a lazy evaluation strategy: certain operations are not computed on the server immediately but are converted into an intermediate expression, and the computation is deferred until it is actually needed. To trigger the computation, call the compute function. For example, arithmetic on columns of the same DataFrame does not trigger computation immediately:

>>> df = orca.DataFrame({"a": [1, 2, 3], "b": [10, 10, 30]})
>>> c = df["a"] + df["b"]
>>> c    # not calculated yet
<orca.core.operator.ArithExpression object at 0x0000027FA5527B70>

>>> c.compute()    # trigger the calculation
0    11
1    12
2    33
dtype: int64

For another example, the conditional filter query will not trigger the calculation immediately:

>>> d = df[df["a"] > 2]
>>> d
<orca.core.frame.DataFrame object with a WHERE clause>

>>> d.compute()    # trigger the calculation
   a   b
2  3  30

After grouping, aggregating with functions such as cumsum, or calling transform, does not return the result immediately either:

>>> c = df.groupby("b").cumsum()
>>> c
<orca.core.operator.DataFrameContextByExpression object at 0x0000017C010692B0>

>>> c.compute()    # trigger the calculation
   a
0  1
1  3
2  3

>>> c = df.groupby("b").transform("count")
>>> c
<orca.core.operator.DataFrameContextByExpression object at 0x0000012C414FE128>

>>> c.compute()    # trigger the calculation
   a
0  2
1  2
2  1
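
The transform("count") result above can be reproduced with a pure-Python sketch: each row receives the size of its group.

```python
from collections import Counter

b = [10, 10, 30]                      # the grouping column from the example
counts = Counter(b)                   # group sizes: {10: 2, 30: 1}
transformed = [counts[v] for v in b]  # broadcast each group's count to its rows
assert transformed == [2, 2, 1]       # matches the Orca output above
```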

  • Operate on columns within the same DataFrame to improve performance

If you operate on columns within the same DataFrame, Orca can optimize these operations into a single DolphinDB SQL expression, and such operations will perform better. For example:

(1) Element-wise calculation: df.x + df.y, df * df, df.x.abs()

(2) Row filtering: df[df.x > 0]

(3) Isin operation: df[df.x.isin([1, 2, 3])]

(4) Time type/string accessor: df.date.dt.month

(5) Assign value with the calculation result of the same length: df["ret"] = df["ret"].abs()

This optimization also applies when the DataFrame is a filtered result, provided the filter conditions are exactly the same (the same Python object, that is, the id function returns the same value for both).

The following script cannot be optimized:

df[df.x > 0] = df[df.x > 0] + 1

In the above script, although the filter conditions on both sides of the equals sign look identical, they are actually two different Python objects, so the DolphinDB engine executes a select statement first and then an update statement. If you assign the filter condition to an intermediate variable, Orca can optimize the code into a single DolphinDB update statement:

df_x_gt_0 = df.x > 0
df[df_x_gt_0] = df[df_x_gt_0] + 1
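
The identity requirement is ordinary Python semantics, not Orca-specific: two structurally identical expressions are distinct objects, while a value bound once is a single object:

```python
# Equal in value, but two distinct objects (different id)
a = [1, 2, 3]
b = [1, 2, 3]
assert a == b
assert a is not b

# Binding one object to two names: identical, so id(c) == id(d)
c = d = [1, 2, 3]
assert c is d
```

This is why df.x > 0 written twice produces two objects, while a shared intermediate variable lets Orca recognize the conditions as the same.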

  • Restrictions on modifying table data

In DolphinDB, the data type of a table column cannot be modified.

In addition, a non-memory table (such as a DFS table) has these limitations:

(1) Cannot add new column

(2) The data cannot be modified through the update statement

And a partition table has these limitations:

(1) There is no strict order relationship between data in different partitions

(2) Cannot assign a vector to a column through the update statement

Therefore, when a user tries to modify an Orca object, the operation may fail. The modification of Orca objects has the following rules:

(1) If the new data type is incompatible with the column's type, for example when assigning a string to an integer column, an exception is thrown

(2) When adding a column to, or modifying data in, an orca object that represents a non-memory table, the table is copied into an in-memory table and a warning is given

(3) When a default index would be automatically added to an orca object representing a partitioned table, no column is added, and a warning is given

(4) When setting or adding a column on an orca object representing a partitioned table, if the value is a Python or numpy array, or an orca Series representing an in-memory table, an exception is thrown

When you try to add a column to, or modify data in, an orca object representing a non-memory table, the data is copied into an in-memory table and then modified, which may exhaust memory when processing massive data. Modifications to such orca objects should therefore be avoided whenever possible.

Some Orca functions do not support the inplace parameter, because inplace involves modifying the data itself.

For example, the following orca script tries to add a column to df, which copies the DFS table into memory and may cause performance problems when the data volume is large:

df = orca.load_table("dfs://orca", "tb")
df["total"] = df["price"] * df["amount"]     # Will copy the DFS table as an in-memory segmented table!
total_group_by_symbol = df.groupby(["date", "symbol"])["total"].sum()

The script can be optimized to avoid creating the new column, and thus avoid copying a large amount of data. The approach here is to make the grouping fields date and symbol the index via set_index, group and aggregate by index level through the level parameter of groupby, and set groupby's lazy parameter to True so that total is not computed immediately. This avoids adding a new column:

df = orca.load_table("dfs://orca", "tb")
df.set_index(["date", "symbol"], inplace=True)
total = df["price"] * df["amount"]     # The DFS table is not copied
total_group_by_symbol = total.groupby(level=[0,1], lazy=True).sum()

  • Higher-order functions

Many pandas interfaces, such as DataFrame.apply and GroupBy.filter, accept a Python callable object as a parameter. Orca works by parsing the user's program into DolphinDB scripts that are executed through the Python API, so it currently cannot parse Python callables. If you pass in one or more callable objects, these functions will convert the Orca object into a pandas object, call the corresponding pandas interface, and convert the result back into an Orca object. This not only adds network communication but also returns a new DataFrame, so some computations can no longer achieve the performance of operations on the same DataFrame.

As an alternative, Orca can accept a string for these interfaces and pass this string into DolphinDB for calculation. This string can be a DolphinDB built-in function (or a partial application of the built-in function), a DolphinDB custom function, or a DolphinDB conditional expression, etc. This alternative brings flexibility to Orca. Users can write a fragment of DolphinDB scripts according to their needs, and then use the DolphinDB computing engine to execute these scripts just like pandas calls user-defined functions.

The following is an example of rewriting the code that pandas accepts callable objects as parameters into Orca code:

(1) Find group weighted average

pandas:

wavg = lambda df: (df["prc"] * df["vol"]).sum() / df["vol"].sum()
df.groupby("symbol").apply(wavg)

Orca:

df.groupby("symbol")["prc"].apply("wavg{,vol}")

In the Orca script, apply calls the DolphinDB partial application wavg{,vol} on the prc column after grouping by symbol. The generated DolphinDB script is equivalent to:

select wavg{,vol}(prc) from df group by symbol

Expanding this partial application, it is equivalent to:

select wavg(prc,vol) from df group by symbol
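
For readers unfamiliar with wavg, its semantics can be sketched in plain Python as a volume-weighted mean (this is a sketch of the concept, not DolphinDB's implementation):

```python
def wavg(prc, vol):
    # Volume-weighted average price: sum(prc * vol) / sum(vol)
    return sum(p * v for p, v in zip(prc, vol)) / sum(vol)

# With prices [10.0, 20.0] and volumes [1, 3]: (10 + 60) / 4 = 17.5
result = wavg([10.0, 20.0], [1, 3])
```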

(2) Filter by conditions after grouping

pandas:

df.groupby("symbol").filter(lambda x: len(x) > 1000)

Orca:

df.groupby("symbol").filter("size(*) > 1000")

In this Orca script, the string accepted by filter is a filter-condition expression. The generated DolphinDB script is equivalent to:

select * from df context by symbol having size(*) > 1000

That is, the filter string becomes the SQL having clause.

(3) Apply an arithmetic function to the entire Series

pandas:

s.apply(lambda x: x + 1)

Orca:

s.apply("(x->x+1)")

pandas:

s.apply(np.log)

Orca:

s.apply("log")

Common calculation functions, such as log, exp, floor, ceil, trigonometric functions, inverse trigonometric functions, etc., have been integrated by Orca. For example, to find the logarithm, it can be realized by s.log().

(4) Use a comma (,) instead of an ampersand (&) when filtering

In a DolphinDB where expression, a comma indicates execution order and is more efficient: the next condition is only checked on rows that passed the previous one. Orca extends pandas' conditional filtering to support commas in filter statements:

pandas:

df[(df.x > 0) & (df.y < 0)]

Orca:

df[(df.x > 0), (df.y < 0)]

With the traditional ampersand, the & in the where expression is converted into DolphinDB's and function when the final DolphinDB script is generated. With a comma, a comma is emitted at the corresponding position in the where expression, achieving higher efficiency.
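
The efficiency gain resembles short-circuit evaluation in plain Python: with and, the second condition is only evaluated for rows that pass the first, a rough analogy for the staged filtering that the comma expresses (an illustration only, not Orca code):

```python
rows = [{"x": 1, "y": -2}, {"x": -1, "y": 5}, {"x": 2, "y": 1}]

# `and` short-circuits: r["y"] < 0 is evaluated only when r["x"] > 0 holds,
# analogous to a comma-separated where clause filtering stage by stage.
filtered = [r for r in rows if r["x"] > 0 and r["y"] < 0]
assert filtered == [{"x": 1, "y": -2}]
```

By contrast, a vectorized & evaluates both conditions for every row before combining them.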

(5) How to implement the context by statement of DolphinDB

DolphinDB's context by statement supports processing data within groups. In Orca, this can be achieved by calling transform after groupby, where transform usually takes a DolphinDB custom-function string. Orca also extends transform: call groupby on an intermediate expression with the extended parameter lazy=True, then call transform with no arguments, and Orca computes the expression as a context by over the grouping. For example:

pandas:

df.groupby("date")["prc"].transform(lambda x: x.shift(5))

Orca's rewrite:

df.groupby("date")["prc"].transform("shift{,5}")

Orca's extended usage:

df.shift(5).groupby("date", lazy=True)["prc"].transform()

This is a special Orca usage that takes full advantage of lazy evaluation. In the code above, df.shift(5) is not actually computed; it only generates an intermediate expression (type(df.shift(5)) shows that it is an ArithExpression, not a DataFrame). When groupby is given the extended parameter lazy=True, it does not evaluate the expression before grouping; the parameterless transform then computes the expression within each group.
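
What this transform ultimately computes, a shift within each group, can be sketched in plain Python (an illustration only; Orca performs this server-side with context by):

```python
def shift_within_groups(keys, values, n):
    # Sketch of a grouped shift: within each group (rows sharing a key),
    # shift values down by n positions, padding the first n slots with None.
    out = [None] * len(values)
    positions = {}
    for i, k in enumerate(keys):
        positions.setdefault(k, []).append(i)   # row positions per group
    for idxs in positions.values():
        for j, i in enumerate(idxs):
            if j >= n:
                out[i] = values[idxs[j - n]]    # value n rows earlier in the group
    return out
```

For instance, shifting [1, 2, 3, 4] by 1 within groups ["a", "a", "b", "a"] yields [None, 1, None, 2].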

In the momentum trading strategy tutorial, we make full use of this extended function to implement DolphinDB's context by.

6. What should I do if Orca cannot solve my problem at this time?

This article has explained many differences between Orca and pandas, as well as some of Orca's limitations. If you cannot avoid these restrictions (for example, an Orca function does not support a certain parameter, or you need to apply a complex custom function that calls third-party library functions not available in DolphinDB), you can convert the Orca DataFrame/Series to a pandas DataFrame/Series with the to_pandas function, perform the computation in pandas, and then convert the result back into an Orca object.

For example, Orca currently does not support the method="average" and na_option="keep" parameters of the rank function. If you must use these parameters, you can do this:

>>> df.rank(method='average', na_option='keep')
ValueError: method must be 'min'

>>> pdf = df.to_pandas()
>>> rank = pdf.rank(method='average', na_option='keep')
>>> rank = orca.DataFrame(rank)

This solves the problem but brings extra network communication. Moreover, the underlying table of the new DataFrame is no longer the table of the original DataFrame, so the optimizations available for operations on the same DataFrame can no longer be applied.

Origin blog.csdn.net/qq_41996852/article/details/111572703