How to write data with Orca

The Orca project implements the pandas API on top of DolphinDB, enabling users to analyze and process massive amounts of data more efficiently. Compared with pandas, Orca has the following significant advantages in data storage:

  • More flexible options

Like pandas, Orca can perform calculations in memory and export DataFrame data to disk. In addition, it can append DataFrame data and calculation results to DolphinDB data tables at any time, making them available for subsequent queries and analysis.

  • Better performance

When the dataset is very large and needs to be persisted, pandas can only save the entire DataFrame to disk; the next time the Python program runs, the user must reload the data from disk into memory, which makes the import and export operations very time-consuming. Orca optimizes this storage and calculation process: the user only needs to write the data into a DolphinDB data table before the program ends. The next time the Python program runs, analysis and calculations can start immediately, without reloading the entire table into memory.
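
As a rough sketch of this workflow (assuming a DolphinDB server reachable at localhost:8848 and that the DFS database dfs://demoDB used in Section 2 already exists; the csv path is a placeholder):

>>> import dolphindb.orca as orca
>>> orca.connect("localhost", 8848)
>>> # first run: import the data and persist it to a DolphinDB table
>>> df = orca.read_csv("/dolphindb/database/demo.csv")
>>> orca.save_table("dfs://demoDB", "tb1", df)
>>> # a later run: no reload into memory is needed; read_table only fetches
>>> # the table information, and calculations run on the DolphinDB server
>>> df = orca.read_table("dfs://demoDB", "tb1")
>>> df.groupby("type").mean()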

This article will introduce how to save data through Orca.

1 Export data to disk

Orca's Series and DataFrame both support methods such as to_csv and to_excel, which export data to a file in a fixed format and save it to a specified path. Special instructions for to_csv are given below.

  • to_csv function

In pandas, the engine parameter of the to_csv function can take the value 'c' or 'python', indicating which engine is used.

In Orca, the engine parameter of the to_csv function can take the value 'c', 'python', or 'dolphindb', with 'dolphindb' as the default. When engine='dolphindb', to_csv exports the data to a directory on the DolphinDB server and supports only the sep and append parameters. When engine='python' or 'c', to_csv exports the data to a directory on the Python client and supports all the parameters supported by pandas.

Example

Call the to_csv function to export the data, then import it back with the read_csv function. In the following script, 'YOUR_DIR' represents the path where the user saves the csv file. Since the data is randomly generated, the table produced by each execution is different, so the output below is for reference only.

>>> import dolphindb.orca as orca
>>> import numpy as np
>>> orca.connect("localhost", 8848)  # replace with your DolphinDB server's host and port
>>> YOUR_DIR = "/dolphindb/database" # e.g. data_dir
>>> odf = orca.DataFrame({"type": np.random.choice(list("abcde"),10), "value": np.random.sample(10)*100})
>>> odf.to_csv(path_or_buf=YOUR_DIR + "/demo.csv")
>>> df1 = orca.read_csv(path=YOUR_DIR + "/demo.csv")
>>> df1
# output
  type      value
0    c  93.697510
1    c  64.533273
2    e  11.699053
3    c  46.758312
4    d   0.262836
5    e  30.315109
6    a  72.641846
7    e  60.980473
8    c  89.597063
9    d  25.223624
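
When engine is set to 'python' or 'c', to_csv behaves like its pandas counterpart and writes to a path on the Python client, so the full set of pandas parameters is available. A minimal sketch (the client-side directory is a placeholder):

>>> CLIENT_DIR = "/home/user/data"  # hypothetical directory on the Python client
>>> odf.to_csv(CLIENT_DIR + "/demo_local.csv", engine="c", sep=",", index=False)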

2 Save data to a DolphinDB data table

An important scenario for Orca is that users obtain data from other database systems or third-party Web APIs and save it to a DolphinDB data table. This section introduces how to upload and save such data to DolphinDB data tables through Orca.

Orca data tables are divided into three types according to how they are stored:

  • Memory table: the data is stored only in memory, and access is fastest, but the data is lost once the node is shut down.
  • Local disk table: the data is saved on the local disk and can be loaded from disk into memory.
  • Distributed table: the data is stored on the DolphinDB server and is not loaded into memory; the client only obtains the database and table information. Through DolphinDB's distributed computing engine, such tables can still be queried in a unified way, just like local tables.

The following examples illustrate the differences between these three types of tables.

  • Memory table

A memory table can be imported with the read_csv function or created with the DataFrame function.

  • Import with the read_csv function

Take the csv file from the example in Section 1. A table imported this way, whose data can be accessed directly, is called an Orca memory table.

>>> df1 = orca.read_csv(path=YOUR_DIR + "/demo.csv")
>>> df1
# output
  type      value
0    c  93.697510
1    c  64.533273
2    e  11.699053
3    c  46.758312
4    d   0.262836
5    e  30.315109
6    a  72.641846
7    e  60.980473
8    c  89.597063
9    d  25.223624
  • Create with the DataFrame function

The data in a memory table created by the orca.DataFrame function can also be accessed directly:

>>> df = orca.DataFrame({"date":orca.date_range("20190101", periods=10),"price":np.random.sample(10)*100})
>>> df
# output
         date      price
0 2019-01-01  35.218404
1 2019-01-02  24.066378
2 2019-01-03   6.336181
3 2019-01-04  24.786319
4 2019-01-05  35.021376
5 2019-01-06  14.014935
6 2019-01-07   7.454209
7 2019-01-08  86.430214
8 2019-01-09  80.033767
9 2019-01-10  45.410883
  • Disk table

Disk tables are divided into local disk tables and disk partition tables. A local disk table differs from a memory table in that it is a memory table stored on disk, and it does not need to be partitioned. A disk partition table is a partitioned table stored on disk. The local disk table is explained in detail below.

In Orca, a local disk table can be loaded with the read_table function.

Orca's read_table function loads the data of a DolphinDB data table given the DolphinDB database and table name. It supports the following parameters:

  • database: database name
  • table: table name
  • partition: the partition to be loaded (optional)

Please note: the read_table function requires that the database and table to be loaded already exist on the DolphinDB server. If the database exists but the table has not been created, the data cannot be successfully loaded into Python.

As the function definition suggests, read_table can also be used to load Orca's partitioned tables; however, when the loaded table is a DolphinDB disk table, Orca loads all of the table's data into memory and accesses it as an Orca memory table.

Example

Suppose the following database and table exist on the DolphinDB server. In the script below, 'YOUR_DIR' represents the path where the user saves the disk table.

rows=10
tdata=table(rand(`a`b`c`d`e, rows) as type, rand(100.0, rows) as value)
saveTable(YOUR_DIR + "/testOnDiskDB", tdata, `tb1)

The database created by the script is located at YOUR_DIR + "/testOnDiskDB", and the saved table is named "tb1". In the Python client, we can load this disk table into memory with the read_table function and store it in an Orca DataFrame object.

>>> df = orca.read_table(YOUR_DIR + "/testOnDiskDB", "tb1")

A complete Python script for the above process is as follows:

>>> s = orca.default_session()
>>> data_dir = "/dolphindb/database" # e.g. data_dir
>>> tableName = "tb1"
>>> create_onDisk_table = """
rows=10
tdata=table(rand(`a`b`c`d`e, rows) as type, rand(100.0, rows) as value)
saveTable("{YOUR_DIR}" + "/testOnDiskDB", tdata, `{tbName})
""".format(YOUR_DIR=data_dir, tbName=tableName)
>>> s.run(create_onDisk_table)
>>> df = orca.read_table(data_dir + "/testOnDiskDB", tableName)
>>> df
# output
  type      value
0    e  42.537911
1    b  44.813589
2    d  28.939636
3    a  73.719393
4    b  66.576416
5    c  36.265364
6    a  43.936593
7    e  56.951759
8    e   4.290316
9    d  29.229366

In the script above, default_session returns the session created by the orca.connect function. On the Python side, we can interact with the DolphinDB server through this session. For more functionality, see the DolphinDB Python API.
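
For reference, a minimal sketch of how that session is established (host, port, and the script being run are placeholders):

>>> import dolphindb.orca as orca
>>> orca.connect("localhost", 8848)   # creates the session
>>> s = orca.default_session()        # retrieves the same session
>>> s.run("version()")                # run arbitrary DolphinDB script server-side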

  • Distributed table

Distributed tables are the storage method DolphinDB recommends for production environments. They support snapshot-level transaction isolation to ensure data consistency, and they support a multi-replica mechanism that provides both fault tolerance and load balancing for data access. In Orca, you can import data into a distributed table and load its information with the read_csv function, or load its information with the read_table function.

  • read_csv function

When calling Orca's read_csv function, specifying the db_handle, table_name, and partition_columns parameters imports the data directly into a DolphinDB DFS table. For a detailed introduction to importing data into partitioned tables with read_csv, please refer to Orca's documentation on partitioned tables.

Example

Please note: distributed tables can only be used in a cluster environment with enableDFS=1 or in DolphinDB single-node mode.

Using the csv file from the example in Section 1, we create a DFS database on the DolphinDB server and import demo.csv into the database:

dbPath="dfs://demoDB"
login('admin', '123456')
if(existsDatabase(dbPath))
      dropDatabase(dbPath)
db=database(dbPath, VALUE, `a`b`c`d`e)

Please note: the above script needs to be executed on the DolphinDB server; in the Python client, it can be executed through the DolphinDB Python API.

Call Orca's read_csv function in the Python client, specifying db_handle as the DFS database "dfs://demoDB", table_name as "tb1", and partition_columns as "type", to import the data into a DolphinDB partitioned table. In this case, read_csv returns an object representing the DolphinDB partitioned table, and the client cannot directly access the data in the table. In subsequent calculations, Orca downloads from the server only the data required for the calculation.

>>> df = orca.read_csv(path=YOUR_DIR + "/demo.csv", dtype={"type": "SYMBOL", "value": np.float64}, db_handle="dfs://demoDB", table_name="tb1", partition_columns="type")
>>> df
# output
<'dolphindb.orca.core.frame.DataFrame' object representing a column in a DolphinDB segmented table>

To view the data in df, call the to_pandas function. Since the data of a partitioned table is distributed across partitions, calling to_pandas downloads all of the data to the client and outputs it in partition order.

>>> df.to_pandas()
# output
   type      value
 0    a  72.641846
 1    c  93.697510
 2    c  64.533273
 3    c  46.758312
 4    c  89.597063
 5    d   0.262836
 6    d  25.223624
 7    e  11.699053
 8    e  30.315109
 9    e  60.980473

A complete Python script for the above process is as follows:

>>> YOUR_DIR = "/dolphindb/database" # e.g. data_dir
>>> s = orca.default_session()
>>> dfsDatabase = "dfs://demoDB"
>>> create_database = """
dbPath='{dbPath}'
login('admin', '123456')
if(existsDatabase(dbPath))
    dropDatabase(dbPath)
db=database(dbPath, VALUE, `a`b`c`d`e)
""".format(dbPath=dfsDatabase)
>>> s.run(create_database)
>>> df=orca.read_csv(path=YOUR_DIR +"/demo.csv", dtype={"type": "SYMBOL", "value": np.float64},
                     db_handle=dfsDatabase, table_name="tb1", partition_columns="type")

Please note: before importing data into a specified database with the read_csv function, make sure that the corresponding database has already been created on the DolphinDB server. read_csv imports data into the DolphinDB database according to the specified database, table name, and partition column. If the table exists, the data is appended; if the table does not exist, the table is created and the data is imported.
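
A sketch of this append-if-exists behavior, assuming the script above has just been run (tb1 already holds the 10 rows of demo.csv, so importing the same file again doubles the row count):

>>> df = orca.read_csv(path=YOUR_DIR + "/demo.csv", dtype={"type": "SYMBOL", "value": np.float64},
                       db_handle=dfsDatabase, table_name="tb1", partition_columns="type")
>>> orca.read_table(dfsDatabase, "tb1").to_pandas().shape
# output
(20, 2)
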
  • Load partitioned table information with the read_table function

When Orca's read_table function loads a disk partitioned table or a DFS partitioned table, no data is downloaded at load time. Take the DFS partitioned table created in the example above:

>>> df = orca.read_table("dfs://demoDB", "tb1")
>>> df
# output
<'orca.core.frame.DataFrame' object representing a column in a DolphinDB segmented table>

When a calculation is performed on df, the required data is then downloaded:

>>> df.groupby("type").mean()
# output
          value
type           
 a     72.641846
 c     73.646539
 d     12.743230
 e     34.331545

The following sections describe the process of writing data to each kind of Orca data table.

2.1 Save data to Orca memory table

The append function provided by pandas appends one DataFrame to another and returns a new DataFrame without modifying the original. In Orca, the append function also supports the inplace parameter: when it is True, the appended data is saved into the original DataFrame, which is modified in place. This process appends the data to an Orca memory table.

>>> df1 = orca.DataFrame({"date":orca.date_range("20190101", periods=10),
                          "price":np.random.sample(10)*100})
>>> df1
# output
        date      price
0 2019-01-01  17.884136
1 2019-01-02  57.840625
2 2019-01-03  29.781247
3 2019-01-04  89.968203
4 2019-01-05  19.355847
5 2019-01-06  74.684634
6 2019-01-07  91.678632
7 2019-01-08  93.927549
8 2019-01-09  47.041906
9 2019-01-10  96.810450

>>> df2 = orca.DataFrame({"date":orca.date_range("20190111", periods=3),
                          "price":np.random.sample(3)*100})
>>> df2
# output 
        date      price
0 2019-01-11  26.959939
1 2019-01-12  75.922693
2 2019-01-13  93.012894

>>> df1.append(df2, inplace=True)
>>> df1
# output
        date      price
0 2019-01-01  17.884136
1 2019-01-02  57.840625
2 2019-01-03  29.781247
3 2019-01-04  89.968203
4 2019-01-05  19.355847
5 2019-01-06  74.684634
6 2019-01-07  91.678632
7 2019-01-08  93.927549
8 2019-01-09  47.041906
9 2019-01-10  96.810450
0 2019-01-11  26.959939
1 2019-01-12  75.922693
2 2019-01-13  93.012894

Please note: when the inplace parameter is set to True, the ignore_index parameter cannot be set and can only be False.
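
For illustration (a hypothetical call; the exact error depends on the Orca version):

>>> df1.append(df2, inplace=True, ignore_index=True)
# raises an error: ignore_index must be False when inplace=True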

2.2 Save data to Orca disk table

Orca provides two ways to modify disk table data:

  • the save_table function
  • the append function

2.2.1 Save data to Orca local disk table

Orca provides the save_table function for saving data to disk tables and distributed tables. Its parameters are as follows:

  • db_path: database path
  • table_name: table name
  • df: the table to be saved
  • ignore_index: whether to ignore the index when appending data

First, load the disk table created above with the read_table function.

>>> df = orca.read_table(YOUR_DIR + "/testOnDiskDB", "tb1")
>>> df
# output 
  type      value
0    e  42.537911
1    b  44.813589
2    d  28.939636
3    a  73.719393
4    b  66.576416
5    c  36.265364
6    a  43.936593
7    e  56.951759
8    e   4.290316
9    d  29.229366

Generate the data to be appended, append it to df, and save the data with save_table.

>>> df2 = orca.DataFrame({"type": np.random.choice(list("abcde"),3), 
                          "value": np.random.sample(3)*100})
>>> df.append(df2, inplace=True)
>>> df
# output
  type      value
0    e  42.537911
1    b  44.813589
2    d  28.939636
3    a  73.719393
4    b  66.576416
5    c  36.265364
6    a  43.936593
7    e  56.951759
8    e   4.290316
9    d  29.229366
0    d  20.702066
1    c  21.241707
2    a  97.333201
>>> orca.save_table(YOUR_DIR + "/testOnDiskDB", "tb1", df)

Note that for disk tables, if the specified table name does not exist in the database, save_table creates the corresponding table; if a table with the same name already exists in the database, save_table overwrites it.
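
A quick sketch verifying the overwrite behavior (note that this replaces the contents of tb1 used above):

>>> df_new = orca.DataFrame({"type": ["a"], "value": [1.0]})
>>> orca.save_table(YOUR_DIR + "/testOnDiskDB", "tb1", df_new)
>>> orca.read_table(YOUR_DIR + "/testOnDiskDB", "tb1").to_pandas().shape
# output
(1, 2)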

2.2.2 Save data to Orca disk partition table

The difference between a disk partition table and a distributed table is that the database path of a distributed table starts with "dfs://", while the database path of a disk partition table is a local absolute path.

  • Save data to a disk partition table with the save_table function

Calling the save_table function directly saves a memory table to disk in partitioned form, similar to the non-partitioned disk table case. If the table already exists, it is overwritten.

>>> df2 = orca.DataFrame({"type": np.random.choice(list("abcde"),3), 
                          "value": np.random.sample(3)*100})
>>> orca.save_table(YOUR_DIR + "/testOnDisPartitionedkDB", "tb1", df2)
>>> df = orca.read_table(YOUR_DIR + "/testOnDisPartitionedkDB", "tb1")
>>> df
# output
  type      value
0    d  86.549417
1    e  61.852710
2    d  28.747059
  • Append data to a disk partition table with the append function

For a disk partition table, calling the append function appends data to it.

First, create a disk partitioned database in DolphinDB:

dbPath=YOUR_DIR + "/testOnDisPartitionedkDB"
login('admin', '123456')
if(existsDatabase(dbPath))
   dropDatabase(dbPath)
db=database(dbPath, VALUE, `a`b`c`d`e)

Import the csv file from the example in Section 1 in the Python client:

>>> df = orca.read_csv(path=YOUR_DIR + "/demo.csv", dtype={"type": "SYMBOL", "value": np.float64}, db_handle=YOUR_DIR + "/testOnDisPartitionedkDB", table_name="tb1", partition_columns="type")
>>> df.to_pandas()
# output
  type      value
0    a  72.641846
1    c  93.697510
2    c  64.533273
3    c  46.758312
4    c  89.597063
5    d   0.262836
6    d  25.223624
7    e  11.699053
8    e  30.315109
9    e  60.980473

Call the append function to append data to df, then reload the disk partition table to see that the data has been appended:

>>> df2 = orca.DataFrame({"type": np.random.choice(list("abcde"),3), 
                          "value": np.random.sample(3)*100})
>>> df.append(df2,inplace=True)
>>> df = orca.read_table(YOUR_DIR + "/testOnDisPartitionedkDB", "tb1")
>>> df.to_pandas()
# output
   type      value
0     a  72.641846
1     c  93.697510
2     c  64.533273
3     c  46.758312
4     c  89.597063
5     c  29.233253
6     c  38.753028
7     d   0.262836
8     d  25.223624
9     d  55.085909
10    e  11.699053
11    e  30.315109
12    e  60.980473

A complete Python script for the above process is as follows:

>>> YOUR_DIR = "/dolphindb/database" # e.g. data_dir
>>> s = orca.default_session()
>>> create_database = """
dbPath='{dbPath}'
login('admin', '123456')
if(existsDatabase(dbPath))
   dropDatabase(dbPath)
db=database(dbPath, VALUE, `a`b`c`d`e)
""".format(dbPath=YOUR_DIR + "/testOnDisPartitionedkDB")
>>> s.run(create_database)
>>> df = orca.read_csv(path=YOUR_DIR + "/demo.csv", dtype={"type": "SYMBOL", "value": np.float64}, db_handle=YOUR_DIR + "/testOnDisPartitionedkDB", table_name="tb1", partition_columns="type")
>>> df2 = orca.DataFrame({"type": np.random.choice(list("abcde"),3), 
                          "value": np.random.sample(3)*100})
>>> df.append(df2,inplace=True)
>>> df = orca.read_table(YOUR_DIR + "/testOnDisPartitionedkDB", "tb1")

2.3 Save data to Orca distributed table

  • Append data to a distributed table with the append function

For distributed tables, you can append data directly with the append function.

First, create a DFS database in DolphinDB:

dbPath="dfs://demoDB"
login('admin', '123456')
if(existsDatabase(dbPath))
   dropDatabase(dbPath)
db=database(dbPath, VALUE, `a`b`c`d`e)

Import the csv file from the example in Section 1 in the Python client:

>>> df = orca.read_csv(path=YOUR_DIR + "/demo.csv", dtype={"type": "SYMBOL", "value": np.float64}, db_handle="dfs://demoDB", table_name="tb1", partition_columns="type")
>>> df.to_pandas()
# output
  type      value
0    a  72.641846
1    c  93.697510
2    c  64.533273
3    c  46.758312
4    c  89.597063
5    d   0.262836
6    d  25.223624
7    e  11.699053
8    e  30.315109
9    e  60.980473

Call the append function to append data to df, then reload the distributed table to see that the data has been appended:

>>> df2 = orca.DataFrame({"type": np.random.choice(list("abcde"),3), 
                          "value": np.random.sample(3)*100})
>>> df.append(df2,inplace=True)
>>> df = orca.read_table("dfs://demoDB", "tb1")
>>> df.to_pandas()
# output
   type      value
0     a  72.641846
1     a  55.429765
2     a  51.230669
3     c  93.697510
4     c  64.533273
5     c  46.758312
6     c  89.597063
7     c  71.821263
8     d   0.262836
9     d  25.223624
10    e  11.699053
11    e  30.315109
12    e  60.980473

A complete Python script for the above process is as follows:

>>> YOUR_DIR = "/dolphindb/database" # e.g. data_dir
>>> s = orca.default_session()
>>> create_database = """
dbPath='{dbPath}'
login('admin', '123456')
if(existsDatabase(dbPath))
   dropDatabase(dbPath)
db=database(dbPath, VALUE, `a`b`c`d`e)
""".format(dbPath="dfs://demoDB")
>>> s.run(create_database)
>>> df = orca.read_csv(path=YOUR_DIR + "/demo.csv", dtype={"type": "SYMBOL", "value": np.float64}, db_handle="dfs://demoDB", table_name="tb1", partition_columns="type")
>>> df2 = orca.DataFrame({"type": np.random.choice(list("abcde"),3), 
                          "value": np.random.sample(3)*100})
>>> df.append(df2,inplace=True)
>>> df = orca.read_table("dfs://demoDB", "tb1")
  • Append data to a distributed table with the save_table function

Unlike with disk tables, calling save_table on a distributed table appends the data directly instead of overwriting it. Moreover, compared with the append function, save_table does not require first obtaining the table information on the client with read_table; it appends the data directly on the DolphinDB server.

In the following example, the data of a memory table is appended directly to the specified table with the save_table function:

>>> df2 = orca.DataFrame({"type": np.random.choice(list("abcde"),3), 
                          "value": np.random.sample(3)*100})
>>> orca.save_table("dfs://demoDB", "tb1", df2)
>>> df = orca.read_table("dfs://demoDB", "tb1")
>>> df.to_pandas()
# output
   type      value
0     a  72.641846
1     a  55.429765
2     a  51.230669
3     b  40.724064
4     c  93.697510
5     c  64.533273
6     c  46.758312
7     c  89.597063
8     c  71.821263
9     c  93.533380
10    d   0.262836
11    d  25.223624
12    d  47.238962
13    e  11.699053
14    e  30.315109
15    e  60.980473

3 Summary

  1. Orca's to_csv function supports only the sep and append parameters under the default engine='dolphindb'.
  2. For tables other than ordinary disk tables, the append method appends data when the inplace parameter is set to True.
  3. The save_table function overwrites the original table for local disk tables, and appends data to the table for DFS tables.
