Orca: a tutorial series on the distributed pandas interface built on DolphinDB

The Orca project implements the pandas API on top of DolphinDB, enabling users to analyze and process massive amounts of data more efficiently.

If you are already familiar with pandas, you can use the Orca package to take full advantage of the high performance and concurrency of DolphinDB and process massive amounts of data without having to learn a new interface. If you have existing pandas code, you can migrate it to Orca without major modifications.

The Orca project is still under development and iterating rapidly. We welcome feedback through GitHub issues as you use Orca.

Orca's design philosophy

pandas, a third-party Python library, is a powerful tool for analyzing structured data. It performs well, its interface is easy to use and easy to learn, and it is popular in data science and quantitative finance. However, once the data reaches the terabyte scale, pandas falls short because it runs on a single core, and its high memory usage further limits its performance. When more processor cores and multiple physical machines are available, we want to take full advantage of concurrency to process data more efficiently.

DolphinDB is a distributed data analysis engine that can store terabytes of data across multiple physical machines and make full use of the CPU to run high-performance analysis and calculations on that data. For calculations with the same function, DolphinDB is one to two orders of magnitude faster than pandas, and its memory footprint is usually less than half that of pandas. However, deploying and developing with DolphinDB differs significantly from working with pandas, so users who want to migrate from pandas to DolphinDB have to change a lot of existing code. To address this, DolphinDB has started the Orca project, an implementation of the pandas DataFrame API on top of the DolphinDB engine. It lets users keep the pandas programming style while relying on DolphinDB's performance to analyze massive amounts of data efficiently. Unlike pandas, which computes entirely in memory, Orca supports distributed storage and computing. For the same amount of data, Orca's memory usage is generally less than half that of pandas.

Orca's architecture

The top layer of Orca is the pandas API, and the bottom layer is the DolphinDB database. The Orca client and the DolphinDB server communicate through the DolphinDB Python API. Orca works by generating DolphinDB scripts in Python on the client and sending them through the DolphinDB Python API to the DolphinDB server, which parses and executes them. An Orca DataFrame stores only the metadata of the corresponding DolphinDB table; the actual storage and calculation happen on the server side.
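
The working principle described above can be illustrated with a toy sketch. This is not Orca's actual implementation: the `RemoteFrame` class, its methods, and the `fake_run` function are hypothetical names invented for illustration. The idea is only that the client object holds metadata (table name, column names, a pending filter), builds a script string, and hands that string to something that would send it to the server. In a real setup that role would be played by the DolphinDB Python API.

```python
class RemoteFrame:
    """Hypothetical client-side handle: keeps metadata only, never the rows."""

    def __init__(self, table, columns, where=None):
        self.table = table              # name of the server-side table
        self.columns = tuple(columns)   # column names known to the client
        self.where = where              # pending filter expression, if any

    def select(self, *cols):
        # Validate against client-side metadata; no data is touched.
        unknown = [c for c in cols if c not in self.columns]
        if unknown:
            raise KeyError(f"unknown columns: {unknown}")
        return RemoteFrame(self.table, cols, self.where)

    def filter(self, condition):
        # Combine with any earlier filter instead of executing anything.
        if self.where is None:
            combined = condition
        else:
            combined = f"({self.where}) and ({condition})"
        return RemoteFrame(self.table, self.columns, combined)

    def to_script(self):
        # Build the SQL-style script text that would be sent to the server.
        script = f"select {', '.join(self.columns)} from {self.table}"
        if self.where is not None:
            script += f" where {self.where}"
        return script

    def compute(self, run):
        # `run` stands in for whatever sends a script to the server.
        return run(self.to_script())


def fake_run(script):
    # Stand-in for a server round trip: just echo the script.
    return f"server received: {script}"


trades = RemoteFrame("trades", ["sym", "price", "qty"])
query = trades.select("sym", "price").filter("price > 100")
print(query.compute(fake_run))
```

Nothing is computed until `compute` is called, which mirrors the point that storage and calculation stay on the server while the client only describes the work.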

Therefore, Orca's interface has some limitations:

  • Each column in Orca's DataFrame cannot be of mixed type, and the column name must also be a valid DolphinDB variable name.
  • If the DolphinDB table corresponding to the DataFrame is a partitioned table, the data storage is not continuous, so there is no concept of RangeIndex, and it is impossible to assign an entire Series to a DataFrame column.
  • For DolphinDB partitioned tables, functions that do not yet have a distributed implementation, such as median, are temporarily not supported by Orca.
  • DolphinDB's null value mechanism is different from that of pandas. pandas uses the float value nan as the null value, while DolphinDB's null value is the minimum value of each type.
  • DolphinDB is a columnar storage database, so some pandas interfaces do not yet support the axis=columns parameter.
  • Orca currently cannot parse Python functions, so functions such as DataFrame.apply and DataFrame.agg cannot accept a Python function as a parameter.
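
The difference in null handling mentioned in the list above can be made concrete with a small sketch in plain Python. It does not call pandas or DolphinDB; it only contrasts the two conventions. The helper names are hypothetical, and the sketch assumes a 32-bit signed integer column, so the reserved "minimum value" is -2**31.

```python
import math

# Assumption for this sketch: a 32-bit signed integer column,
# whose minimum value is reserved to mean "missing".
INT32_NULL = -2**31


def with_nan(values):
    """NaN convention: missing entries become float nan, so every value becomes a float."""
    return [float("nan") if v is None else float(v) for v in values]


def with_sentinel(values):
    """Sentinel convention: missing entries become the type's minimum; values stay integers."""
    return [INT32_NULL if v is None else v for v in values]


def count_nulls_nan(column):
    # nan never equals itself, so it has to be detected with math.isnan.
    return sum(1 for v in column if math.isnan(v))


def count_nulls_sentinel(column):
    # The sentinel is an ordinary integer, so equality works.
    return sum(1 for v in column if v == INT32_NULL)


raw = [3, None, 7]
nan_column = with_nan(raw)
sentinel_column = with_sentinel(raw)

print(nan_column, count_nulls_nan(nan_column))
print(sentinel_column, count_nulls_sentinel(sentinel_column))
```

One consequence of the sentinel convention is that the reserved value cannot be stored as real data, whereas the NaN convention gives up the integer type of the column instead.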

For the detailed differences between Orca and pandas, as well as precautions for Orca programming, please refer to the Orca tutorial.

Installation

Orca supports Linux and Windows, and requires Python 3.6 or later and pandas 0.25.1 or later.

The Orca project has been integrated into the DolphinDB Python API. You can use Orca by installing the DolphinDB Python API with pip.
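
As a rough way to check these requirements before installing, the following sketch compares version numbers using only the standard library. The helper names `version_tuple` and `meets_minimum` are hypothetical, and the parsing is deliberately simple: it reads the leading digits of each dot-separated part and ignores suffixes such as "rc1".

```python
import sys


def version_tuple(text):
    """Turn '0.25.1' into (0, 25, 1), keeping only the leading digits of each part."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first part with no leading number
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """True if the installed version is at least the minimum version."""
    return version_tuple(installed) >= version_tuple(minimum)


# Python itself: compare against the 3.6 requirement stated above.
python_ok = sys.version_info >= (3, 6)

# pandas: only checked if it is installed in this environment.
try:
    import pandas
    pandas_ok = meets_minimum(pandas.__version__, "0.25.1")
except ImportError:
    pandas_ok = False

print("Python OK:", python_ok)
print("pandas OK:", pandas_ok)
```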

pip install dolphindb

Orca is built on the DolphinDB Python API. You therefore need a running DolphinDB server; connect to it with the connect function, and then run Orca:

>>> import dolphindb.orca as orca
>>> orca.connect(MY_HOST, MY_PORT, MY_USERNAME, MY_PASSWORD)

If you already have a pandas program, you can replace the pandas import with:

# import pandas as pd
import dolphindb.orca as pd

pd.connect(MY_HOST, MY_PORT, MY_USERNAME, MY_PASSWORD)

More information

Use tutorials and precautions

Detailed differences between Orca and pandas API

Orca access to DolphinDB distributed database tutorial

Orca save data tutorial

Develop quantitative strategies with Orca

DolphinDB Python API

Origin blog.csdn.net/qq_41996852/article/details/111354519