Orca data loading tutorial

This article describes how to load data in Orca.

1 Establish a database connection

Connect to the DolphinDB server through the connect function in Orca:

>>> import dolphindb.orca as orca
>>> orca.connect(MY_HOST, MY_PORT, MY_USERNAME, MY_PASSWORD)

2 Import data

The following tutorial uses a data file: quotes.csv.

2.1 read_csv function

Orca provides a read_csv function for importing data sets. Note that the engine parameter of Orca's read_csv function can take the value 'c', 'python', or 'dolphindb', with 'dolphindb' as the default. When the value is 'dolphindb', read_csv looks for the data file to be imported in the DolphinDB server's directory. When the value is 'python' or 'c', read_csv looks for the data file to be imported in the directory of the Python client.

Note that when the engine parameter is set to 'python' or 'c', Orca's read_csv function behaves like the pandas read_csv function. The rest of this section describes Orca's read_csv function under the assumption that the engine parameter is 'dolphindb'.
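For example, a minimal sketch of the difference (the file paths below are placeholders and must exist on the corresponding machine):

>>> # engine='dolphindb' (default): the path is resolved on the DolphinDB server
>>> df_server = orca.read_csv("/path/on/server/quotes.csv")
>>> # engine='python': the path is resolved on the Python client, pandas-style
>>> df_client = orca.read_csv("/path/on/client/quotes.csv", engine="python")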

When the engine parameter is set to 'dolphindb', Orca's read_csv function currently supports the following parameters (a short example combining several of them follows the list):

  • path: file path
  • sep: separator
  • delimiter: delimiter
  • names: specify column names
  • index_col: specify the column to use as the index
  • engine: the import engine to use
  • usecols: specify the columns to import
  • squeeze: whether to squeeze the data into a Series when the file has only one row or one column
  • prefix: prefix string added to each column name
  • dtype: specify the data types for the import
  • partitioned: whether to allow the data to be imported in a partitioned manner
  • db_handle: the path of the database to import into
  • table_name: the name of the table to import into
  • partition_columns: column names for partitioning
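A quick sketch combining a few of the parameters above, using the quotes.csv file from this tutorial (the column names follow the output shown in section 2.1.1; "DATA_DIR" is a placeholder as in the dtype example below):

>>> # read only three of the columns and use the time column as the index
>>> df = orca.read_csv("DATA_DIR/quotes.csv",
...                    usecols=["time", "Symbol", "Bid_Price"],
...                    index_col="time")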

The following is a detailed introduction to several parameters that differ between Orca and pandas.

  • dtype parameter

When importing a CSV file, Orca automatically recognizes the data types of the file's columns, and supports various common time formats. Users can also force specific data types through the dtype parameter.

Note that Orca's read_csv function supports not only the various numpy data types (np.bool, np.int8, np.float32, etc.), but also all of the data types provided by DolphinDB, specified as strings, including all time types and string types.

For example:

>>> import numpy as np
>>> dfcsv = orca.read_csv("DATA_DIR/quotes.csv", dtype={"TIME": "NANOTIMESTAMP", "Exchange": "SYMBOL", "SYMBOL": "SYMBOL", "Bid_Price": np.float64, "Bid_Size": np.int32, "Offer_Price": np.float64, "Offer_Size": np.int32})

  • partitioned parameter

A bool parameter that defaults to True. When it is True and the data reaches a certain size, the data is imported as a partitioned in-memory table. When it is set to False, the CSV is imported directly as a regular, unpartitioned DolphinDB in-memory table.

Please note: Orca's partitioned tables differ from Orca's in-memory tables in many operations; for details, see the section on the special differences of Orca's partitioned tables. If your data volume is not very large and you value consistency between Orca and pandas, try not to import the data in a partitioned manner. If your data volume is large and you have very high performance requirements, it is recommended that you import the data in a partitioned manner.

  • db_handle, table_name and partition_columns parameters

Orca's read_csv function also supports the db_handle, table_name and partition_columns parameters. They are used to import the data directly into a DolphinDB partitioned table by specifying the DolphinDB database, table and related information at import time.

DolphinDB supports importing data into a DolphinDB database in a variety of ways. When the db_handle, table_name and partition_columns parameters are specified, Orca's read_csv function essentially calls DolphinDB's loadTextEx function, which imports the data directly into a DolphinDB partitioned table.
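As a rough illustration, specifying these three parameters is comparable to running a script like the following on the server through the session (a sketch only; the database path and CSV path are placeholders, and loadTextEx is documented in the DolphinDB manual):

>>> s = orca.default_session()
>>> # loadTextEx(db, tableName, partitionColumns, fileName) loads the CSV
>>> # directly into the partitioned table "quotes", partitioned on "time"
>>> s.run("""
... db = database("/your/db/path")
... loadTextEx(db, "quotes", `time, "/your/data/quotes.csv")
... """)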

2.1.1 Import to an in-memory table

  • Import as a partitioned in-memory table

Call the read_csv function directly and the data is imported in parallel. Thanks to the parallel import, the import speed is fast, but the memory usage is twice that of a regular table. In the following example, 'DATA_DIR' is the path where the data file is stored.

>>> DATA_DIR = "dolphindb/database" # e.g. data_dir
>>> df = orca.read_csv(DATA_DIR + "/quotes.csv")
>>> df.head()
# output

                        time Exchange Symbol  Bid_Price  Bid_Size  \
0 2017-01-01 04:40:11.686699        T   AAPL       0.00         0   
1 2017-01-01 06:42:50.247631        P   AAPL      26.70        10   
2 2017-01-01 07:00:12.194786        P   AAPL      26.75         5   
3 2017-01-01 07:15:03.578071        P   AAPL      26.70        10   
4 2017-01-01 07:59:39.606882        K   AAPL      26.90         1   
   Offer_Price  Offer_Size  
0        27.42           1  
1        27.47           1  
2        27.47           1  
3        27.47           1  
4         0.00           0

  • Import as a regular in-memory table

When the partitioned parameter is set to False, the data is imported as a regular in-memory table. This import has lower memory requirements, but the computation speed is slightly lower than with the import method above:

>>> df = orca.read_csv(DATA_DIR + "/quotes.csv", partitioned=False)

2.1.2 Import to an on-disk table

DolphinDB partitioned tables can be stored on the local disk or in the DFS. The difference between an on-disk partitioned table and a distributed table is that the database path of a distributed table starts with "dfs://", while the database path of an on-disk partitioned table is a local path.

Example

We create an on-disk partitioned database on the DolphinDB server. In the following script, 'YOUR_DIR' is the path where the on-disk database is saved:

dbPath=YOUR_DIR + "/demoOnDiskPartitionedDB"
login('admin', '123456')
if(existsDatabase(dbPath))
   dropDatabase(dbPath)
db=database(dbPath, RANGE, datehour(2017.01.01 00:00:00+(0..24)*3600))

Please note: the above script needs to be executed on the DolphinDB server. On the Python client, the script can be executed through the DolphinDB Python API.

On the Python client, call Orca's read_csv function with db_handle set to the on-disk partitioned database YOUR_DIR + "/demoOnDiskPartitionedDB", table_name set to "quotes", and the partitioning column partition_columns set to "time". This imports the data into a DolphinDB on-disk partitioned table and returns an object representing that DolphinDB table to df for subsequent computation.

>>> df = orca.read_csv(path=DATA_DIR+"/quotes.csv", dtype={"Exchange": "SYMBOL", "SYMBOL": "SYMBOL"}, db_handle=YOUR_DIR + "/demoOnDiskPartitionedDB", table_name="quotes", partition_columns="time")
>>> df
# output
<'dolphindb.orca.core.frame.DataFrame' object representing a column in a DolphinDB segmented table>
>>> df.head()
# output
                        time Exchange Symbol  Bid_Price  Bid_Size  \
0 2017-01-01 04:40:11.686699        T   AAPL       0.00         0   
1 2017-01-01 06:42:50.247631        P   AAPL      26.70        10   
2 2017-01-01 07:00:12.194786        P   AAPL      26.75         5   
3 2017-01-01 07:15:03.578071        P   AAPL      26.70        10   
4 2017-01-01 07:59:39.606882        K   AAPL      26.90         1   
   Offer_Price  Offer_Size  
0        27.42           1  
1        27.47           1  
2        27.47           1  
3        27.47           1  
4         0.00           0

The above steps can be combined into the following script, executable from Python:

>>> s = orca.default_session()
>>> DATA_DIR = "/dolphindb/database" # e.g. data_dir
>>> YOUR_DIR = "/dolphindb/database" # e.g. database_dir
>>> create_onDiskPartitioned_database = """
dbPath="{YOUR_DIR}" + "/demoOnDiskPartitionedDB"
login('admin', '123456')
if(existsDatabase(dbPath))
   dropDatabase(dbPath)
db=database(dbPath, RANGE, datehour(2017.01.01 00:00:00+(0..24)*3600))
""".format(YOUR_DIR=YOUR_DIR)
>>> s.run(create_onDiskPartitioned_database)
>>> df = orca.read_csv(path=DATA_DIR+"/quotes.csv", dtype={"Exchange": "SYMBOL", "SYMBOL": "SYMBOL"}, db_handle=YOUR_DIR + "/demoOnDiskPartitionedDB", table_name="quotes", partition_columns="time")

In the script above, the default_session we use is in fact the session created by the orca.connect function. On the Python side, this session lets us interact with the DolphinDB server. For more features, see the DolphinDB Python API.

Please note: before importing data into a specified database through the read_csv function, make sure that the corresponding database has already been created on the DolphinDB server. The read_csv function imports data into the DolphinDB database according to the specified database, table name and partitioning field; if the table exists, the data is appended to it, and if the table does not exist, the table is created and the data is imported.
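For instance, you can verify from the Python client that the database exists before importing (a minimal sketch using the session created by orca.connect; existsDatabase is a DolphinDB built-in function):

>>> s = orca.default_session()
>>> s.run('existsDatabase("{dbPath}")'.format(dbPath=YOUR_DIR + "/demoOnDiskPartitionedDB"))
True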

2.1.3 Import to a distributed table

If the db_handle parameter of the read_csv function is set to a DFS database path, the data is imported directly into a DolphinDB DFS database.

Example

Please note that distributed tables can only be used in a cluster environment with enableDFS=1, or with DolphinDB running in single-node mode.

As with on-disk partitioned tables, the distributed table's database must first be created on the DolphinDB server; the only change required is a database path that starts with "dfs://".

dbPath="dfs://demoDB"
login('admin', '123456')
if(existsDatabase(dbPath))
   dropDatabase(dbPath)
db=database(dbPath, RANGE, datehour(2017.01.01 00:00:00+(0..24)*3600))

On the Python client, call Orca's read_csv function with db_handle set to the distributed database "dfs://demoDB", table_name set to "quotes", and the partitioning column partition_columns set to "time", to import the data into a DolphinDB distributed table.

>>> df = orca.read_csv(path=DATA_DIR+"/quotes.csv", dtype={"Exchange": "SYMBOL", "SYMBOL": "SYMBOL"}, db_handle="dfs://demoDB", table_name="quotes", partition_columns="time")
>>> df
# output
<'dolphindb.orca.core.frame.DataFrame' object representing a column in a DolphinDB segmented table>
>>> df.head()
# output
                        time Exchange Symbol  Bid_Price  Bid_Size  \
0 2017-01-01 04:40:11.686699        T   AAPL       0.00         0   
1 2017-01-01 06:42:50.247631        P   AAPL      26.70        10   
2 2017-01-01 07:00:12.194786        P   AAPL      26.75         5   
3 2017-01-01 07:15:03.578071        P   AAPL      26.70        10   
4 2017-01-01 07:59:39.606882        K   AAPL      26.90         1   
   Offer_Price  Offer_Size  
0        27.42           1  
1        27.47           1  
2        27.47           1  
3        27.47           1  
4         0.00           0

The above steps can be combined into the following script, executable from Python:

>>> s = orca.default_session()
>>> DATA_DIR = "/dolphindb/database" # e.g. data_dir
>>> create_dfs_database = """
dbPath="dfs://demoDB"
login('admin', '123456')
if(existsDatabase(dbPath))
   dropDatabase(dbPath)
db=database(dbPath, RANGE, datehour(2017.01.01 00:00:00+(0..24)*3600))
"""
>>> s.run(create_dfs_database)
>>> df = orca.read_csv(path=DATA_DIR+"/quotes.csv", dtype={"Exchange": "SYMBOL", "SYMBOL": "SYMBOL"}, db_handle="dfs://demoDB", table_name="quotes", partition_columns="time")

2.2 read_table function

Orca provides a read_table function that loads the data of a DolphinDB table given the DolphinDB database and table name. It can be used to load DolphinDB on-disk tables, on-disk partitioned tables and distributed tables. If you have already created a database and table in DolphinDB, you can call this function directly in Orca to load the data stored on the DolphinDB server. The read_table function supports the following parameters:

  • database: database name
  • table: table name
  • partition: the partition(s) to load (optional)

Load a DolphinDB on-disk table

The read_table function can be used to load a DolphinDB on-disk table. First, create a local on-disk table on the DolphinDB server:

>>> s = orca.default_session()
>>> YOUR_DIR = "/dolphindb/database" # e.g. database_dir
>>> create_onDisk_database="""
saveTable("{YOUR_DIR}"+"/demoOnDiskDB", table(2017.01.01..2017.01.10 as date, rand(10.0,10) as prices), "quotes")
""".format(YOUR_DIR=YOUR_DIR)
>>> s.run(create_onDisk_database)

Load the on-disk table through the read_table function:

>>> df = orca.read_table(YOUR_DIR + "/demoOnDiskDB", "quotes")
>>> df.head()
# output
      date    prices
0 2017-01-01  8.065677
1 2017-01-02  2.969041
2 2017-01-03  3.688191
3 2017-01-04  4.773723
4 2017-01-05  5.567130

Please note: the read_table function requires that the database and table to be loaded already exist on the DolphinDB server. If only the database exists and the table has not been created, the data cannot be loaded into Python.

Load a DolphinDB on-disk partitioned table

Tables that already exist in DolphinDB can be loaded directly through the read_table function. For example, load the on-disk partitioned table created in section 2.1.2:

>>> df = orca.read_table(YOUR_DIR + "/demoOnDiskPartitionedDB", "quotes")

Load a DolphinDB distributed table

Distributed tables can likewise be loaded through the read_table function. For example, load the distributed table created in section 2.1.3:

>>> df = orca.read_table("dfs://demoDB", "quotes")

2.3 from_pandas function

Orca provides a from_pandas function, which takes a pandas DataFrame as its argument and returns an Orca DataFrame. In this way, Orca can directly load data that was originally stored in a pandas DataFrame.

>>> import pandas as pd
>>> import numpy as np

>>> pdf = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),  columns=['a', 'b', 'c'])
>>> odf = orca.from_pandas(pdf)
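
Orca DataFrames also provide a to_pandas method for the reverse direction, pulling the data back from the server into a client-side pandas DataFrame:

>>> pdf2 = odf.to_pandas()  # back to a pandas DataFrame on the client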

3 Support for other file formats

For importing other data formats, Orca also provides pandas-like interfaces. These methods include: read_pickle, read_fwf, read_msgpack, read_clipboard, read_excel, read_json, json_normalize, build_table_schema, read_html, read_hdf, read_feather, read_parquet, read_sas, read_sql_table, read_sql_query, read_sql, read_gbq and read_stata.
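For example, a pickled pandas DataFrame can be loaded through the pandas-like interface (a sketch; the file path is a placeholder and is read on the Python client):

>>> df = orca.read_pickle("/path/on/client/quotes.pkl")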

