The Orca project implements the pandas API on top of DolphinDB, enabling users to analyze and process massive amounts of data more efficiently. Compared with pandas, Orca has the following significant advantages in data storage:
- More flexible options
Orca can not only perform in-memory calculations like pandas and export DataFrame data to disk, but can also append DataFrame data and calculation results to DolphinDB data tables at any time, making them available for subsequent queries and analysis.
- Better performance
When the data volume is large and the data needs to be persisted, pandas can only save the entire DataFrame to disk; the next time the Python program runs, the user must reload the data from disk into memory, and these import and export operations undoubtedly take considerable time. Orca optimizes this storage and computation workflow: the user only needs to write the data into a DolphinDB data table before the program ends, and the next time the Python program runs, analysis and calculation can begin immediately without reloading the entire table into memory.
This article will introduce how to save data through Orca.
1 Export data to disk
Orca's Series and DataFrame both support methods such as to_csv and to_excel, which export data to a file in a fixed format and save it to a specified path. Special instructions for to_csv are given below.
to_csv function

In pandas, the engine parameter of the to_csv function can be 'c' or 'python', indicating which engine is used for the export.

In Orca, the engine parameter of the to_csv function can be 'c', 'python', or 'dolphindb', with 'dolphindb' as the default. When the value is 'dolphindb', to_csv exports the data to a directory on the DolphinDB server and supports only the sep and append parameters; when the value is 'python' or 'c', to_csv exports the data to a directory on the Python client and supports all the parameters that pandas supports.
Example
Call the to_csv function to export the data, then import it back with the read_csv function. In the following script, 'YOUR_DIR' represents the path where the user saves the csv file. Since the data is randomly generated, the table contents differ on each execution; the output below is for reference only.
>>> YOUR_DIR = "/dolphindb/database" # e.g. data_dir
>>> odf = orca.DataFrame({"type": np.random.choice(list("abcde"),10), "value": np.random.sample(10)*100})
>>> odf.to_csv(path_or_buf=YOUR_DIR + "/demo.csv")
>>> df1 = orca.read_csv(path=YOUR_DIR + "/demo.csv")
>>> df1
# output
type value
0 c 93.697510
1 c 64.533273
2 e 11.699053
3 c 46.758312
4 d 0.262836
5 e 30.315109
6 a 72.641846
7 e 60.980473
8 c 89.597063
9 d 25.223624
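When engine='python' or 'c', Orca's to_csv delegates to pandas and accepts the full pandas parameter set. The following snippet uses pandas directly to illustrate that behavior; the column names are illustrative, and writing to an in-memory buffer instead of a file keeps the sketch self-contained:

```python
import io

import pandas as pd

df = pd.DataFrame({"type": list("abc"), "value": [1.0, 2.0, 3.0]})

# With the 'python'/'c' engines all pandas to_csv options are available,
# e.g. a custom separator and suppressing the index column.
buf = io.StringIO()
df.to_csv(buf, sep=";", index=False)
print(buf.getvalue())
# type;value
# a;1.0
# b;2.0
# c;3.0
```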
2 Save the data to the DolphinDB data table
An important scenario for using Orca is that users obtain data from other database systems or third-party Web APIs and store them in the DolphinDB data table. This section will introduce how to upload and save the obtained data to the DolphinDB data table through Orca.
Orca data tables are divided into three types according to storage methods:
- Memory table: the data is stored only in memory and access is fastest, but the data is lost once the node shuts down.
- Local disk table: the data is saved on the local disk and can be loaded from disk into memory.
- Distributed table: the data is stored on the DolphinDB server and is not loaded into memory; the client only obtains the database and table information. Through DolphinDB's distributed computing engine, it can still be queried in a unified way, just like a local table.
The following explains the difference between these three tables in the form of examples.
- Memory table
It can be imported with the read_csv function or created with the DataFrame function.

Importing with the read_csv function

Take the csv file from the example in Section 1 as an example. A table imported this way, whose data can be accessed directly, is called an Orca memory table.
>>> df1 = orca.read_csv(path=YOUR_DIR + "/demo.csv")
>>> df1
# output
type value
0 c 93.697510
1 c 64.533273
2 e 11.699053
3 c 46.758312
4 d 0.262836
5 e 30.315109
6 a 72.641846
7 e 60.980473
8 c 89.597063
9 d 25.223624
Creating with the DataFrame function
Data in a memory table created by the orca.DataFrame function can also be accessed directly:
>>> df = orca.DataFrame({"date":orca.date_range("20190101", periods=10),"price":np.random.sample(10)*100})
>>> df
# output
date price
0 2019-01-01 35.218404
1 2019-01-02 24.066378
2 2019-01-03 6.336181
3 2019-01-04 24.786319
4 2019-01-05 35.021376
5 2019-01-06 14.014935
6 2019-01-07 7.454209
7 2019-01-08 86.430214
8 2019-01-09 80.033767
9 2019-01-10 45.410883
- Disk table
Disk tables are divided into local disk tables and disk partitioned tables. A local disk table differs from a memory table in that it is a memory table stored on disk and does not need to be partitioned; a disk partitioned table is a partitioned table stored on disk. The local disk table is explained in detail below.
In Orca, a local disk table can be loaded with the read_table function.

Orca provides the read_table function to load the data of a DolphinDB data table by specifying the DolphinDB database and table name. The function supports the following parameters:
- database: database name
- table: table name
- partition: the partition to be imported, optional parameters
Please note: the read_table function requires that the database and table to be imported already exist on the DolphinDB server. If the database exists but no table has been created in it, the data cannot be imported into Python.
As the function definition indicates, read_table can be used to import an Orca partitioned table, but when the imported table is a DolphinDB disk table, Orca loads all of the table's data into memory as an Orca memory table for access.
Example
Suppose the following database and table exist on the DolphinDB server. In the script below, 'YOUR_DIR' represents the path where the user saves the disk table.
rows=10
tdata=table(rand(`a`b`c`d`e, rows) as type, rand(100.0, rows) as value)
saveTable(YOUR_DIR + "/testOnDiskDB", tdata, `tb1)
The path of the database created in the script is YOUR_DIR + "/testOnDiskDB", and the saved table name is "tb1". In the Python client, we can load this disk table into memory with the read_table function and store it in an Orca DataFrame object.
>>> df = orca.read_table(YOUR_DIR + "/testOnDiskDB", "tb1")
The executable script in Python that integrates the above process is as follows:
>>> s = orca.default_session()
>>> data_dir = "/dolphindb/database" # e.g. data_dir
>>> tableName = "tb1"
>>> create_onDisk_table = """
rows=10
tdata=table(rand(`a`b`c`d`e, rows) as type, rand(100.0, rows) as value)
saveTable("{YOUR_DIR}" + "/testOnDiskDB", tdata, `{tbName})
""".format(YOUR_DIR=data_dir, tbName=tableName)
>>> s.run(create_onDisk_table)
>>> df = orca.read_table(data_dir + "/testOnDiskDB", tableName)
>>> df
# output
type value
0 e 42.537911
1 b 44.813589
2 d 28.939636
3 a 73.719393
4 b 66.576416
5 c 36.265364
6 a 43.936593
7 e 56.951759
8 e 4.290316
9 d 29.229366
In the above script, default_session is actually the session created by the orca.connect function. On the Python side, we can interact with the DolphinDB server through this session. For more functionality, see the DolphinDB Python API.
- Distributed table
A distributed table is the data storage method DolphinDB recommends for production environments. It supports snapshot-level transaction isolation to ensure data consistency. Distributed tables support a multi-replica mechanism, which provides both data fault tolerance and load balancing for data access. In Orca, data can be imported into a specified distributed table with the read_csv function, or a distributed table's information can be loaded with the read_table function.
read_csv function

When calling Orca's read_csv function with the db_handle, table_name, and partition_columns parameters specified, the data is imported directly into a DolphinDB DFS table. For a detailed introduction to importing data into a partitioned table with read_csv, please refer to Orca's partitioned table documentation.
Example
Please note that distributed tables can only be used in a cluster environment with enableDFS=1 or in DolphinDB single-node mode.
Taking the csv file in the example in Section 1 as an example, we create a DFS database on the DolphinDB server and import demo.csv into the database:
dbPath="dfs://demoDB"
login('admin', '123456')
if(existsDatabase(dbPath))
dropDatabase(dbPath)
db=database(dbPath, VALUE, `a`b`c`d`e)
Please note: the above script needs to be executed on the DolphinDB server; it can be run from the Python client through the DolphinDB Python API.
Call Orca's read_csv function in the Python client, specifying the database db_handle as the DFS database "dfs://demoDB", the table name table_name as "tb1", and the partition column partition_columns as "type", to import the data into a DolphinDB partitioned table. At this point, read_csv returns an object representing the DolphinDB partitioned table, and the client cannot access the data in the table directly. In subsequent calculations, Orca downloads from the server only the data needed for the calculation.
>>> df = orca.read_csv(path=YOUR_DIR + "/demo.csv", dtype={"type": "SYMBOL", "value": np.float64}, db_handle="dfs://demoDB", table_name="tb1", partition_columns="type")
>>> df
# output
<'dolphindb.orca.core.frame.DataFrame' object representing a column in a DolphinDB segmented table>
If you need to view the data in df, you can call the to_pandas function. Since the partitioned table's data is distributed across partitions, calling to_pandas downloads all the data to the client and outputs it in partition order.
>>> df.to_pandas()
# output
type value
0 a 72.641846
1 c 93.697510
2 c 64.533273
3 c 46.758312
4 c 89.597063
5 d 0.262836
6 d 25.223624
7 e 11.699053
8 e 30.315109
9 e 60.980473
The executable script in Python that integrates the above process is as follows:
>>> YOUR_DIR = "/dolphindb/database" # e.g. data_dir
>>> s = orca.default_session()
>>> dfsDatabase = "dfs://demoDB"
>>> create_database = """
dbPath='{dbPath}'
login('admin', '123456')
if(existsDatabase(dbPath))
dropDatabase(dbPath)
db=database(dbPath, VALUE, `a`b`c`d`e)
""".format(dbPath=dfsDatabase)
>>> s.run(create_database)
>>> df=orca.read_csv(path=YOUR_DIR +"/demo.csv", dtype={"type": "SYMBOL", "value": np.float64},
db_handle=dfsDatabase, table_name="tb1", partition_columns="type")
Please note: before importing data into a specified database with the read_csv function, make sure the corresponding database has already been created on the DolphinDB server. read_csv imports data into the DolphinDB database according to the specified database, table name, and partition field; if the table exists, the data is appended, and if it does not, the table is created and the data imported.
Loading partitioned table information with the read_table function

When Orca calls the read_table function to load a disk partitioned table or a DFS partitioned table, no data is downloaded at load time. Take the DFS partitioned table created in the example above:
>>> df = orca.read_table("dfs://demoDB", "tb1")
>>> df
# output
<'orca.core.frame.DataFrame' object representing a column in a DolphinDB segmented table>
When a calculation is performed on df, the required data is then downloaded for the calculation:
>>> df.groupby("type").mean()
# output
value
type
a 72.641846
c 73.646539
d 12.743230
e 34.331545
The following describes the process of writing data to Orca's data table.
2.1 Save data to Orca memory table
The append function provided by pandas appends one DataFrame to another and returns a new DataFrame without modifying the original. In Orca, the append function additionally supports the inplace parameter: when it is True, the appended data is saved into the original DataFrame, modifying it in place. This appends the data to an Orca memory table.
>>> df1 = orca.DataFrame({"date":orca.date_range("20190101", periods=10),
"price":np.random.sample(10)*100})
>>> df1
# output
date price
0 2019-01-01 17.884136
1 2019-01-02 57.840625
2 2019-01-03 29.781247
3 2019-01-04 89.968203
4 2019-01-05 19.355847
5 2019-01-06 74.684634
6 2019-01-07 91.678632
7 2019-01-08 93.927549
8 2019-01-09 47.041906
9 2019-01-10 96.810450
>>> df2 = orca.DataFrame({"date":orca.date_range("20190111", periods=3),
"price":np.random.sample(3)*100})
>>> df2
# output
date price
0 2019-01-11 26.959939
1 2019-01-12 75.922693
2 2019-01-13 93.012894
>>> df1.append(df2, inplace=True)
>>> df1
# output
date price
0 2019-01-01 17.884136
1 2019-01-02 57.840625
2 2019-01-03 29.781247
3 2019-01-04 89.968203
4 2019-01-05 19.355847
5 2019-01-06 74.684634
6 2019-01-07 91.678632
7 2019-01-08 93.927549
8 2019-01-09 47.041906
9 2019-01-10 96.810450
0 2019-01-11 26.959939
1 2019-01-12 75.922693
2 2019-01-13 93.012894
Please note: when the inplace parameter is set to True, the ignore_index parameter cannot be set and can only be False.
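With the default inplace=False, append behaves as in pandas: it returns a new DataFrame and leaves the originals untouched. A plain-pandas illustration of that copy-versus-modify distinction (using pd.concat, since DataFrame.append was removed in pandas 2.0):

```python
import pandas as pd

df1 = pd.DataFrame({"x": [1, 2]})
df2 = pd.DataFrame({"x": [3]})

# concat returns a new frame; df1 itself is not modified
out = pd.concat([df1, df2])
print(len(df1), len(out))  # 2 3
```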
2.2 Save data to Orca disk table
Orca provides two ways to modify disk table data:
- the save_table function
- the append function
2.2.1 Save data to Orca local disk table
Orca provides the save_table function to save data to disk tables and distributed tables. Its parameters are as follows:
- db_path: database path
- table_name: table name
- df: the table to be saved
- ignore_index: whether to ignore the index when appending data
First, import the disk table created above with the read_table function.
>>> df = orca.read_table(YOUR_DIR + "/testOnDiskDB", "tb1")
>>> df
# output
type value
0 e 42.537911
1 b 44.813589
2 d 28.939636
3 a 73.719393
4 b 66.576416
5 c 36.265364
6 a 43.936593
7 e 56.951759
8 e 4.290316
9 d 29.229366
Generate the data to be appended, append it to df, and save the data with save_table.
>>> df2 = orca.DataFrame({"type": np.random.choice(list("abcde"),3),
"value": np.random.sample(3)*100})
>>> df.append(df2, inplace=True)
>>> df
# output
type value
0 e 42.537911
1 b 44.813589
2 d 28.939636
3 a 73.719393
4 b 66.576416
5 c 36.265364
6 a 43.936593
7 e 56.951759
8 e 4.290316
9 d 29.229366
0 d 20.702066
1 c 21.241707
2 a 97.333201
>>> orca.save_table(YOUR_DIR + "/testOnDiskDB", "tb1", df)
Note that for disk tables, if the specified table name does not exist in the database, save_table creates the corresponding table; if a table with the same name already exists in the database, save_table overwrites it.
2.2.2 Save data to Orca disk partition table
The difference between a disk partition table and a distributed table is that the database path of a distributed table starts with "dfs://", while the database path of a disk partition table is a local absolute path.
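The distinction is purely a matter of the path prefix, which a simple check can capture (is_distributed is a hypothetical helper for illustration, not part of the Orca API):

```python
def is_distributed(db_path: str) -> bool:
    # A distributed (DFS) database path starts with "dfs://";
    # any other path is treated as a local disk database.
    return db_path.startswith("dfs://")

print(is_distributed("dfs://demoDB"))            # True
print(is_distributed("/dolphindb/database/db"))  # False
```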
Saving data to a disk partitioned table with the save_table function

Calling the save_table function directly saves a memory table to disk in partitioned form. As with a non-partitioned disk table, if the table already exists, it is overwritten.
>>> df2 = orca.DataFrame({"type": np.random.choice(list("abcde"),3),
"value": np.random.sample(3)*100})
>>> orca.save_table(YOUR_DIR + "/testOnDisPartitionedkDB", "tb1", df2)
>>> df = orca.read_table(YOUR_DIR + "/testOnDisPartitionedkDB", "tb1")
>>> df
# output
type value
0 d 86.549417
1 e 61.852710
2 d 28.747059
Appending data to a disk partitioned table with the append function

For a disk partitioned table, calling the append function appends data to it.
First, create a disk partition table in DolphinDB:
dbPath=YOUR_DIR + "/testOnDisPartitionedkDB"
login('admin', '123456')
if(existsDatabase(dbPath))
dropDatabase(dbPath)
db=database(dbPath, VALUE, `a`b`c`d`e)
Import the csv file from the example in Section 1 in the Python client:
>>> df = orca.read_csv(path=YOUR_DIR + "/demo.csv", dtype={"type": "SYMBOL", "value": np.float64}, db_handle=YOUR_DIR + "/testOnDisPartitionedkDB", table_name="tb1", partition_columns="type")
>>> df.to_pandas()
# output
type value
0 a 72.641846
1 c 93.697510
2 c 64.533273
3 c 46.758312
4 c 89.597063
5 d 0.262836
6 d 25.223624
7 e 11.699053
8 e 30.315109
9 e 60.980473
Call the append function to append data to the df table, then reload the disk partitioned table to confirm that the data has been appended:
>>> df2 = orca.DataFrame({"type": np.random.choice(list("abcde"),3),
"value": np.random.sample(3)*100})
>>> df.append(df2,inplace=True)
>>> df = orca.read_table(YOUR_DIR + "/testOnDisPartitionedkDB", "tb1")
>>> df.to_pandas()
# output
type value
0 a 72.641846
1 c 93.697510
2 c 64.533273
3 c 46.758312
4 c 89.597063
5 c 29.233253
6 c 38.753028
7 d 0.262836
8 d 25.223624
9 d 55.085909
10 e 11.699053
11 e 30.315109
12 e 60.980473
The executable Python script that integrates the above process is as follows:
>>> YOUR_DIR = "/dolphindb/database" # e.g. data_dir
>>> s = orca.default_session()
>>> create_database = """
dbPath='{dbPath}'
login('admin', '123456')
if(existsDatabase(dbPath))
dropDatabase(dbPath)
db=database(dbPath, VALUE, `a`b`c`d`e)
""".format(dbPath=YOUR_DIR + "/testOnDisPartitionedkDB")
>>> s.run(create_database)
>>> df = orca.read_csv(path=YOUR_DIR + "/demo.csv", dtype={"type": "SYMBOL", "value": np.float64}, db_handle=YOUR_DIR + "/testOnDisPartitionedkDB", table_name="tb1", partition_columns="type")
>>> df2 = orca.DataFrame({"type": np.random.choice(list("abcde"),3),
"value": np.random.sample(3)*100})
>>> df.append(df2,inplace=True)
>>> df = orca.read_table(YOUR_DIR + "/testOnDisPartitionedkDB", "tb1")
2.3 Save data to Orca distributed table
Appending data to a distributed table with the append function

For distributed tables, data can be appended directly with the append function.
First, create a distributed table in DolphinDB:
dbPath="dfs://demoDB"
login('admin', '123456')
if(existsDatabase(dbPath))
dropDatabase(dbPath)
db=database(dbPath, VALUE, `a`b`c`d`e)
Import the csv file in the example in Section 1 in the Python client:
>>> df = orca.read_csv(path=YOUR_DIR + "/demo.csv", dtype={"type": "SYMBOL", "value": np.float64}, db_handle="dfs://demoDB", table_name="tb1", partition_columns="type")
>>> df.to_pandas()
# output
type value
0 a 72.641846
1 c 93.697510
2 c 64.533273
3 c 46.758312
4 c 89.597063
5 d 0.262836
6 d 25.223624
7 e 11.699053
8 e 30.315109
9 e 60.980473
Call the append function to append data to the df table, then reload the distributed table to confirm that the data has been appended:
>>> df2 = orca.DataFrame({"type": np.random.choice(list("abcde"),3),
"value": np.random.sample(3)*100})
>>> df.append(df2,inplace=True)
>>> df = orca.read_table("dfs://demoDB", "tb1")
>>> df.to_pandas()
# output
type value
0 a 72.641846
1 a 55.429765
2 a 51.230669
3 c 93.697510
4 c 64.533273
5 c 46.758312
6 c 89.597063
7 c 71.821263
8 d 0.262836
9 d 25.223624
10 e 11.699053
11 e 30.315109
12 e 60.980473
The executable Python script that integrates the above process is as follows:
>>> YOUR_DIR = "/dolphindb/database" # e.g. data_dir
>>> s = orca.default_session()
>>> create_database = """
dbPath='{dbPath}'
login('admin', '123456')
if(existsDatabase(dbPath))
dropDatabase(dbPath)
db=database(dbPath, VALUE, `a`b`c`d`e)
""".format(dbPath="dfs://demoDB")
>>> s.run(create_database)
>>> df = orca.read_csv(path=YOUR_DIR + "/demo.csv", dtype={"type": "SYMBOL", "value": np.float64}, db_handle="dfs://demoDB", table_name="tb1", partition_columns="type")
>>> df2 = orca.DataFrame({"type": np.random.choice(list("abcde"),3),
"value": np.random.sample(3)*100})
>>> df.append(df2,inplace=True)
>>> df = orca.read_table("dfs://demoDB", "tb1")
Appending data to a distributed table with the save_table function

Unlike with disk tables, calling save_table on a distributed table appends data directly rather than overwriting it. Compared with the append function, save_table does not need to first obtain the target table's information on the client via read_table; it appends the data directly on the DolphinDB server.
In the following example, the data of a memory table is appended directly to the specified table with the save_table function:
>>> df2 = orca.DataFrame({"type": np.random.choice(list("abcde"),3),
"value": np.random.sample(3)*100})
>>> orca.save_table("dfs://demoDB", "tb1", df2)
>>> df = orca.read_table("dfs://demoDB", "tb1")
>>> df.to_pandas()
# output
type value
0 a 72.641846
1 a 55.429765
2 a 51.230669
3 b 40.724064
4 c 93.697510
5 c 64.533273
6 c 46.758312
7 c 89.597063
8 c 71.821263
9 c 93.533380
10 d 0.262836
11 d 25.223624
12 d 47.238962
13 e 11.699053
14 e 30.315109
15 e 60.980473
3 Summary
1. Orca's to_csv function, with the default engine='dolphindb', supports only the sep and append parameters.
2. For tables other than ordinary disk tables, the append method appends data when the inplace parameter is set to True.
3. The save_table function overwrites the original table for a local disk table; for a DFS table, it appends the data to the table.