Orca Data Writing Tutorial

The Orca project implements the pandas API on top of DolphinDB, enabling users to analyze and process massive amounts of data more efficiently. Compared with pandas, Orca has the following significant advantages in data storage:

  • More flexible options

Orca can not only perform in-memory calculations like pandas and export DataFrame data to disk, but can also append DataFrame data and calculation results to DolphinDB data tables at any time, making them available for subsequent queries and analysis.

  • Better performance

When the data volume is very large and the data needs to be persisted, pandas can only save the entire DataFrame to disk; the next time the Python program runs, the user must reload the data from disk into memory, which inevitably costs a lot of time in import and export operations. Orca optimizes this storage and computation workflow: the user only needs to write the data to a DolphinDB data table before the program ends, and the next time the Python program runs, analysis and calculation can start immediately without reloading the entire table into memory.

This article will introduce how to save data through Orca.

1 Export data to disk

Orca's Series and DataFrame both support methods such as to_csv and to_excel, which export data to a file in a fixed format at a specified path. Special instructions for to_csv are given below.

  • to_csv function

The engine parameter of pandas' to_csv function can take the value 'c' or 'python', indicating which engine is used for the export.

The engine parameter of Orca's to_csv function can take a value from {'c', 'python', 'dolphindb'}, with 'dolphindb' as the default. When the value is 'dolphindb', to_csv exports the data to a directory on the DolphinDB server and supports only the sep and append parameters; when the value is 'python' or 'c', to_csv exports the data to a directory on the Python client and supports all the parameters that pandas supports.

Example

Call the to_csv function to export the data, then import it back with the read_csv function. In the following script, 'YOUR_DIR' represents the path where the user saves the csv file. Since the data is randomly generated, the table produced differs on every execution, and the following output is for reference only.

>>> YOUR_DIR = "/dolphindb/database" # e.g. data_dir
>>> odf = orca.DataFrame({"type": np.random.choice(list("abcde"),10), "value": np.random.sample(10)*100})
>>> odf.to_csv(path_or_buf=YOUR_DIR + "/demo.csv")
>>> df1 = orca.read_csv(path=YOUR_DIR + "/demo.csv")
>>> df1
# output
  type      value
0    c  93.697510
1    c  64.533273
2    e  11.699053
3    c  46.758312
4    d   0.262836
5    e  30.315109
6    a  72.641846
7    e  60.980473
8    c  89.597063
9    d  25.223624
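
The append and engine parameters described above can be combined as in the following sketch. These calls are assumptions based on the parameter description in this section, not output-verified examples: append=True is assumed to append to the existing server-side csv, and engine='python' is assumed to accept the usual pandas to_csv parameters.

>>> # Append the same data to the existing server-side csv (default engine='dolphindb',
>>> # which supports only the sep and append parameters):
>>> odf.to_csv(path_or_buf=YOUR_DIR + "/demo.csv", append=True)
>>> # Export to a client-side path instead; with engine='python' (or 'c'),
>>> # all pandas to_csv parameters are available, e.g. index:
>>> odf.to_csv(path_or_buf="./demo_local.csv", engine="python", index=False)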

2 Save data to DolphinDB data tables

An important usage scenario for Orca is when users obtain data from other database systems or third-party Web APIs and store it in a DolphinDB data table. This section introduces how to upload and save such data to DolphinDB data tables through Orca.

Orca data tables are divided into three types according to how they are stored:

  • Memory table: The data is stored only in memory and access is fastest, but the data is lost once the node shuts down.
  • Local disk table: The data is saved on the local disk and can be loaded from disk into memory.
  • Distributed table: The data is stored on the DolphinDB server and is not loaded into memory; the client only obtains the database and table metadata. Through the DolphinDB distributed computing engine, it can still be queried in a unified way, just like a local table.

The following examples illustrate the differences between these three kinds of tables.

  • Memory table

A memory table can be imported with the read_csv function or created with the DataFrame function.

  • Importing with the read_csv function

Take the csv file from the example in Section 1. Importing it as follows yields a table whose data can be accessed directly; such a table is called an Orca memory table.

>>> df1 = orca.read_csv(path=YOUR_DIR + "/demo.csv")
>>> df1
# output
  type      value
0    c  93.697510
1    c  64.533273
2    e  11.699053
3    c  46.758312
4    d   0.262836
5    e  30.315109
6    a  72.641846
7    e  60.980473
8    c  89.597063
9    d  25.223624

  • Creating with the DataFrame function

A memory table created with the orca.DataFrame function likewise allows its data to be accessed directly:

>>> df = orca.DataFrame({"date":orca.date_range("20190101", periods=10),"price":np.random.sample(10)*100})
>>> df
# output
         date      price
0 2019-01-01  35.218404
1 2019-01-02  24.066378
2 2019-01-03   6.336181
3 2019-01-04  24.786319
4 2019-01-05  35.021376
5 2019-01-06  14.014935
6 2019-01-07   7.454209
7 2019-01-08  86.430214
8 2019-01-09  80.033767
9 2019-01-10  45.410883

  • Disk table

Disk tables are divided into local disk tables and disk partitioned tables. A local disk table differs from a memory table in that it is a memory table stored on disk and does not need to be partitioned, whereas a disk partitioned table is a partitioned table stored on disk. The local disk table is explained in detail below.

In Orca, a local disk table can be loaded with the read_table function.

Orca provides the read_table function to load the data of a DolphinDB data table by specifying the DolphinDB database and table name. The function supports the following parameters:

  • database: database name
  • table: table name
  • partition: the partition(s) to import; optional parameter

Please note: the read_table function requires that the database and table to be imported already exist on the DolphinDB server. If only the database exists and no table has been created, the data cannot be imported into Python.

As the function definition shows, read_table can be used to import an Orca partitioned table. However, when the imported table is a DolphinDB disk table, Orca loads all of the table's data into memory, where it is accessed as an Orca memory table.

Example

Assume the following database and table exist on the DolphinDB server. In the script below, 'YOUR_DIR' represents the path where the user saves the disk table.

rows=10
tdata=table(rand(`a`b`c`d`e, rows) as type, rand(100.0, rows) as value)
saveTable(YOUR_DIR + "/testOnDiskDB", tdata, `tb1)

The path of the database created in the script is YOUR_DIR + "/testOnDiskDB", and the table is saved under the name "tb1". On the Python client, we can load this disk table into memory with the read_table function and store it in an Orca DataFrame object.

>>> df = orca.read_table(YOUR_DIR + "/testOnDiskDB", "tb1")

The above steps, combined into an executable Python script, are as follows:

>>> s = orca.default_session()
>>> data_dir = "/dolphindb/database" # e.g. data_dir
>>> tableName = "tb1"
>>> create_onDisk_table = """
rows=10
tdata=table(rand(`a`b`c`d`e, rows) as type, rand(100.0, rows) as value)
saveTable("{YOUR_DIR}" + "/testOnDiskDB", tdata, `{tbName})
""".format(YOUR_DIR=data_dir, tbName=tableName)
>>> s.run(create_onDisk_table)
>>> df = orca.read_table(data_dir + "/testOnDiskDB", tableName)
>>> df
# output
  type      value
0    e  42.537911
1    b  44.813589
2    d  28.939636
3    a  73.719393
4    b  66.576416
5    c  36.265364
6    a  43.936593
7    e  56.951759
8    e   4.290316
9    d  29.229366

In the script above, the default_session we use is in fact the session created by the orca.connect function; on the Python side, we can interact with the DolphinDB server through this session. For more features, see the DolphinDB Python API.
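
For reference, a minimal sketch of how such a session is typically established, assuming a DolphinDB server listening at localhost:8848 with the default admin account (host, port, and credentials are placeholders):

>>> import dolphindb.orca as orca
>>> orca.connect("localhost", 8848, "admin", "123456")  # creates the default session
>>> s = orca.default_session()  # the same session used with s.run(...) above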

  • Distributed table

Distributed tables are the storage method DolphinDB recommends for production environments. They support snapshot-level transaction isolation, ensuring data consistency, and they support a multi-replica mechanism, which provides both fault tolerance and load balancing for data access. In Orca, the read_csv function can import data into a specified distributed table and load the table's metadata, or the read_table function can load the metadata of a distributed table.

  • The read_csv function

When Orca's read_csv function is called with the db_handle, table_name, and partition_columns parameters, the data can be imported directly into a DolphinDB DFS table. For a detailed introduction to importing data into partitioned tables with read_csv, see Orca's partitioned tables documentation.

Example

Please note: distributed tables can only be used in a cluster environment with enableDFS=1 enabled, or with DolphinDB running in single-node (standalone) mode.

Taking the csv file from the example in Section 1, we create a DFS database on the DolphinDB server and import demo.csv into it:

dbPath="dfs://demoDB"
login('admin', '123456')
if(existsDatabase(dbPath))
      dropDatabase(dbPath)
db=database(dbPath, VALUE, `a`b`c`d`e)

Please note: the above script must be executed on the DolphinDB server. On the Python client, it can be executed through the DolphinDB Python API.

On the Python client, call Orca's read_csv function, specifying db_handle as the DFS database "dfs://demoDB", table_name as "tb1", and partition_columns as "type" (the partitioning column), to import the data into a DolphinDB partitioned table. Here read_csv returns an object representing the DolphinDB partitioned table, and the client cannot access the table's data directly; only in subsequent computations does Orca download the data it needs from the server.

>>> df = orca.read_csv(path=YOUR_DIR + "/demo.csv", dtype={"type": "SYMBOL", "value": np.float64}, db_handle="dfs://demoDB", table_name="tb1", partition_columns="type")
>>> df
# output
<'dolphindb.orca.core.frame.DataFrame' object representing a column in a DolphinDB segmented table>

To view the data in df, call the to_pandas function. Since the data of a partitioned table is distributed across the partitions, calling to_pandas downloads all of the data to the client and outputs it in partition order.

>>> df.to_pandas()
# output
   type      value
 0    a  72.641846
 1    c  93.697510
 2    c  64.533273
 3    c  46.758312
 4    c  89.597063
 5    d   0.262836
 6    d  25.223624
 7    e  11.699053
 8    e  30.315109
 9    e  60.980473

The above steps, combined into an executable Python script, are as follows:

>>> YOUR_DIR = "/dolphindb/database" # e.g. data_dir
>>> s = orca.default_session()
>>> dfsDatabase = "dfs://demoDB"
>>> create_database = """
dbPath='{dbPath}'
login('admin', '123456')
if(existsDatabase(dbPath))
    dropDatabase(dbPath)
db=database(dbPath, VALUE, `a`b`c`d`e)
""".format(dbPath=dfsDatabase)
>>> s.run(create_database)
>>> df=orca.read_csv(path=YOUR_DIR +"/demo.csv", dtype={"type": "SYMBOL", "value": np.float64},
                     db_handle=dfsDatabase, table_name="tb1", partition_columns="type")

Please note: before importing data into a specified database with read_csv, make sure the corresponding database has already been created on the DolphinDB server. read_csv imports the data into the DolphinDB database according to the specified database, table name, and partitioning column; if the table already exists, the data is appended, and if it does not, the table is created and the data is imported.

  • Loading partitioned table metadata with the read_table function

If the table Orca loads with read_table is a disk partitioned table or a DFS partitioned table, the data is not downloaded at load time. Taking the DFS partitioned table created in the example above:

>>> df = orca.read_table("dfs://demoDB", "tb1")
>>> df
# output
<'orca.core.frame.DataFrame' object representing a column in a DolphinDB segmented table>

When a computation is performed on df, the data needed is downloaded for the calculation:

>>> df.groupby("type").mean()
# output
          value
type           
 a     72.641846
 c     73.646539
 d     12.743230
 e     34.331545

The following describes the process of writing data to Orca's data tables.

2.1 Save data to an Orca memory table

The append function provided by pandas appends one DataFrame to another and returns a new DataFrame, without modifying the original. In Orca, the append function additionally supports an inplace parameter: when it is True, the appended data is saved into the calling DataFrame, modifying the original DataFrame. This is how data is appended to an Orca memory table.

>>> df1 = orca.DataFrame({"date":orca.date_range("20190101", periods=10),
                          "price":np.random.sample(10)*100})
>>> df1
# output
        date      price
0 2019-01-01  17.884136
1 2019-01-02  57.840625
2 2019-01-03  29.781247
3 2019-01-04  89.968203
4 2019-01-05  19.355847
5 2019-01-06  74.684634
6 2019-01-07  91.678632
7 2019-01-08  93.927549
8 2019-01-09  47.041906
9 2019-01-10  96.810450

>>> df2 = orca.DataFrame({"date":orca.date_range("20190111", periods=3),
                          "price":np.random.sample(3)*100})
>>> df2
# output 
        date      price
0 2019-01-11  26.959939
1 2019-01-12  75.922693
2 2019-01-13  93.012894

>>> df1.append(df2, inplace=True)
>>> df1
# output
        date      price
0 2019-01-01  17.884136
1 2019-01-02  57.840625
2 2019-01-03  29.781247
3 2019-01-04  89.968203
4 2019-01-05  19.355847
5 2019-01-06  74.684634
6 2019-01-07  91.678632
7 2019-01-08  93.927549
8 2019-01-09  47.041906
9 2019-01-10  96.810450
0 2019-01-11  26.959939
1 2019-01-12  75.922693
2 2019-01-13  93.012894

Please note: when the inplace parameter is set to True, the ignore_index parameter cannot be set and must remain False.
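
As a hedged illustration of this note, reusing df1 and df2 from above: pandas-style append returns a new table whose index may be renumbered, while inplace=True modifies df1 itself and always keeps the original indices, as the output above shows.

>>> out = df1.append(df2, ignore_index=True)  # returns a new DataFrame; index renumbered 0..12
>>> df1.append(df2, inplace=True)             # modifies df1 in place; original indices kept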

2.2 Save data to an Orca disk table

Orca provides two ways to modify the data of a disk table:

  • The save_table function
  • The append function

2.2.1 Save data to an Orca local disk table

Orca provides the save_table function for saving data to disk tables and distributed tables. Its parameters are as follows:

  • db_path: database path
  • table_name: table name
  • df: the table to save
  • ignore_index: whether to ignore the index when appending data
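
A hedged sketch of the call shape implied by this parameter list (matching the positional usage in the example below; the keyword form of ignore_index is an assumption based on the list above):

>>> orca.save_table(YOUR_DIR + "/testOnDiskDB", "tb1", df, ignore_index=True)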

First, import the disk table created above using the read_table function.

>>> df = orca.read_table(YOUR_DIR + "/testOnDiskDB", "tb1")
>>> df
# output 
  type      value
0    e  42.537911
1    b  44.813589
2    d  28.939636
3    a  73.719393
4    b  66.576416
5    c  36.265364
6    a  43.936593
7    e  56.951759
8    e   4.290316
9    d  29.229366

Generate the data to append, append it to df, and save the data with save_table.

>>> df2 = orca.DataFrame({"type": np.random.choice(list("abcde"),3), 
                          "value": np.random.sample(3)*100})
>>> df.append(df2, inplace=True)
>>> df
# output
  type      value
0    e  42.537911
1    b  44.813589
2    d  28.939636
3    a  73.719393
4    b  66.576416
5    c  36.265364
6    a  43.936593
7    e  56.951759
8    e   4.290316
9    d  29.229366
0    d  20.702066
1    c  21.241707
2    a  97.333201
>>> orca.save_table(YOUR_DIR + "/testOnDiskDB", "tb1", df)

Note that, for disk tables, if the specified table name does not exist in the database, save_table creates the corresponding table; if a table with the same name already exists, save_table overwrites it.

2.2.2 Save data to an Orca disk partitioned table

The difference between a disk partitioned table and a distributed table is that the database path of a distributed table starts with "dfs://", while the database path of a disk partitioned table is an absolute local path.
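
A brief illustration of the two path styles (the database names here are hypothetical):

>>> on_disk_db = YOUR_DIR + "/testDB"  # disk partitioned database: an absolute local path
>>> dfs_db = "dfs://testDB"            # distributed database: path starts with "dfs://"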

  • Saving data to a disk partitioned table with the save_table function

Calling save_table directly saves a memory table to disk in partitioned form. As with non-partitioned disk tables, if the table already exists, it is overwritten.

>>> df2 = orca.DataFrame({"type": np.random.choice(list("abcde"),3), 
                          "value": np.random.sample(3)*100})
>>> orca.save_table(YOUR_DIR + "/testOnDisPartitionedkDB", "tb1", df2)
>>> df = orca.read_table(YOUR_DIR + "/testOnDisPartitionedkDB", "tb1")
>>> df
# output
  type      value
0    d  86.549417
1    e  61.852710
2    d  28.747059

  • Appending data to a disk partitioned table with the append function

For a disk partitioned table, calling the append function appends data to the table.

First, create a disk partitioned table in DolphinDB:

dbPath=YOUR_DIR + "/testOnDisPartitionedkDB"
login('admin', '123456')
if(existsDatabase(dbPath))
   dropDatabase(dbPath)
db=database(dbPath, VALUE, `a`b`c`d`e)

On the Python client, import the csv file from the example in Section 1:

>>> df = orca.read_csv(path=YOUR_DIR + "/demo.csv", dtype={"type": "SYMBOL", "value": np.float64}, db_handle=YOUR_DIR + "/testOnDisPartitionedkDB", table_name="tb1", partition_columns="type")
>>> df.to_pandas()
# output
type      value
0    a  72.641846
1    c  93.697510
2    c  64.533273
3    c  46.758312
4    c  89.597063
5    d   0.262836
6    d  25.223624
7    e  11.699053
8    e  30.315109
9    e  60.980473

Call append to add data to the df table, then reload the disk partitioned table; the data has been appended:

>>> df2 = orca.DataFrame({"type": np.random.choice(list("abcde"),3), 
                          "value": np.random.sample(3)*100})
>>> df.append(df2,inplace=True)
>>> df = orca.read_table(YOUR_DIR + "/testOnDisPartitionedkDB", "tb1")
>>> df.to_pandas()
# output
   type      value
0     a  72.641846
1     c  93.697510
2     c  64.533273
3     c  46.758312
4     c  89.597063
5     c  29.233253
6     c  38.753028
7     d   0.262836
8     d  25.223624
9     d  55.085909
10    e  11.699053
11    e  30.315109
12    e  60.980473

The above steps, combined into an executable Python script, are as follows:

>>> YOUR_DIR = "/dolphindb/database" # e.g. data_dir
>>> s = orca.default_session()
>>> create_database = """
dbPath='{dbPath}'
login('admin', '123456')
if(existsDatabase(dbPath))
   dropDatabase(dbPath)
db=database(dbPath, VALUE, `a`b`c`d`e)
""".format(dbPath=YOUR_DIR + "/testOnDisPartitionedkDB")
>>> s.run(create_database)
>>> df = orca.read_csv(path=YOUR_DIR + "/demo.csv", dtype={"type": "SYMBOL", "value": np.float64}, db_handle=YOUR_DIR + "/testOnDisPartitionedkDB", table_name="tb1", partition_columns="type")
>>> df2 = orca.DataFrame({"type": np.random.choice(list("abcde"),3), 
                          "value": np.random.sample(3)*100})
>>> df.append(df2,inplace=True)
>>> df = orca.read_table(YOUR_DIR + "/testOnDisPartitionedkDB", "tb1")

2.3 Save data to an Orca distributed table

  • Appending data to a distributed table with the append function

For a distributed table, data can be appended directly with the append function.

First, create a distributed table in DolphinDB:

dbPath="dfs://demoDB"
login('admin', '123456')
if(existsDatabase(dbPath))
   dropDatabase(dbPath)
db=database(dbPath, VALUE, `a`b`c`d`e)

On the Python client, import the csv file from the example in Section 1:

>>> df = orca.read_csv(path=YOUR_DIR + "/demo.csv", dtype={"type": "SYMBOL", "value": np.float64}, db_handle="dfs://demoDB", table_name="tb1", partition_columns="type")
>>> df.to_pandas()
# output
type      value
0    a  72.641846
1    c  93.697510
2    c  64.533273
3    c  46.758312
4    c  89.597063
5    d   0.262836
6    d  25.223624
7    e  11.699053
8    e  30.315109
9    e  60.980473

Call append to add data to the df table, then reload the distributed table; the data has been appended:

>>> df2 = orca.DataFrame({"type": np.random.choice(list("abcde"),3), 
                          "value": np.random.sample(3)*100})
>>> df.append(df2,inplace=True)
>>> df = orca.read_table("dfs://demoDB", "tb1")
>>> df.to_pandas()
# output
   type      value
0     a  72.641846
1     a  55.429765
2     a  51.230669
3     c  93.697510
4     c  64.533273
5     c  46.758312
6     c  89.597063
7     c  71.821263
8     d   0.262836
9     d  25.223624
10    e  11.699053
11    e  30.315109
12    e  60.980473

The above steps, combined into an executable Python script, are as follows:

>>> YOUR_DIR = "/dolphindb/database" # e.g. data_dir
>>> s = orca.default_session()
>>> create_database = """
dbPath='{dbPath}'
login('admin', '123456')
if(existsDatabase(dbPath))
   dropDatabase(dbPath)
db=database(dbPath, VALUE, `a`b`c`d`e)
""".format(dbPath="dfs://demoDB")
>>> s.run(create_database)
>>> df = orca.read_csv(path=YOUR_DIR + "/demo.csv", dtype={"type": "SYMBOL", "value": np.float64}, db_handle="dfs://demoDB", table_name="tb1", partition_columns="type")
>>> df2 = orca.DataFrame({"type": np.random.choice(list("abcde"),3), 
                          "value": np.random.sample(3)*100})
>>> df.append(df2,inplace=True)
>>> df = orca.read_table("dfs://demoDB", "tb1")

  • Appending data to a distributed table with the save_table function

Unlike with disk tables, calling save_table on a distributed table appends the data directly instead of overwriting it. Moreover, compared with append, save_table does not require first obtaining the target table's metadata on the client via read_table; the append operation happens directly on the DolphinDB server.

In the following example, save_table appends the data of a memory table directly to the specified table:

>>> df2 = orca.DataFrame({"type": np.random.choice(list("abcde"),3), 
                          "value": np.random.sample(3)*100})
>>> orca.save_table("dfs://demoDB", "tb1", df2)
>>> df = orca.read_table("dfs://demoDB", "tb1")
>>> df.to_pandas()
# output
   type      value
0     a  72.641846
1     a  55.429765
2     a  51.230669
3     b  40.724064
4     c  93.697510
5     c  64.533273
6     c  46.758312
7     c  89.597063
8     c  71.821263
9     c  93.533380
10    d   0.262836
11    d  25.223624
12    d  47.238962
13    e  11.699053
14    e  30.315109
15    e  60.980473

3 Summary

  1. With the default engine='dolphindb', Orca's to_csv function supports only the sep and append parameters.

  2. For tables other than ordinary disk tables, the append method appends the data when the inplace parameter is set to True.

  3. The save_table function overwrites the original table for local disk tables; for DFS tables, the data is appended to the table.

