This article describes how to load data in Orca.
1 Establish a database connection

Connect to the DolphinDB server through Orca's connect function:

>>> import dolphindb.orca as orca
>>> orca.connect(MY_HOST, MY_PORT, MY_USERNAME, MY_PASSWORD)
2 Import data
The following tutorial uses a data file: quotes.csv.
2.1 The read_csv function

Orca provides the read_csv function for importing data sets. Note that the engine parameter of Orca's read_csv function can take the values {'c', 'python', 'dolphindb'}, with 'dolphindb' as the default. When the value is 'dolphindb', read_csv looks for the data file to be imported in the directory of the DolphinDB server. When the value is 'python' or 'c', read_csv looks for the data file to be imported in the directory of the Python client.

Note that when the engine parameter is set to 'python' or 'c', calling Orca's read_csv function is equivalent to calling pandas' read_csv function. This section explains Orca's read_csv function under the premise that the engine parameter is set to 'dolphindb'.

With engine set to 'dolphindb', Orca's read_csv function currently supports the following parameters:
- path: the file path
- sep: separator
- delimiter: delimiter
- names: specify column names
- index_col: specify the column to use as the index
- engine: the engine to use for the import
- usecols: specify the columns to import
- squeeze: whether to compress the DataFrame into a Series when the data file has only one row or one column
- prefix: the prefix string added to each column
- dtype: specify the data types of the imported columns
- partitioned: whether to allow the data to be imported in a partitioned manner
- db_handle: the path of the database to import into
- table_name: the name of the table to import into
- partition_columns: the column names used for partitioning
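When engine is 'python' or 'c', these parameters follow pandas semantics, since the call is delegated to pandas' read_csv. The following minimal pandas sketch illustrates usecols, index_col and dtype; the two-row CSV content below is a hypothetical stand-in for quotes.csv:

```python
import io

import pandas as pd

# Hypothetical two-row stand-in for quotes.csv.
csv_text = """time,Exchange,Symbol,Bid_Price,Bid_Size
2017-01-01 04:40:11,T,AAPL,0.00,0
2017-01-01 06:42:50,P,AAPL,26.70,10
"""

# usecols keeps only the listed columns, index_col promotes a column
# to the index, and dtype forces per-column types.
df = pd.read_csv(io.StringIO(csv_text),
                 usecols=["time", "Symbol", "Bid_Price"],
                 index_col="time",
                 dtype={"Bid_Price": float})

print(df.columns.tolist())  # ['Symbol', 'Bid_Price']
print(df.index.name)        # time
```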
The following is a detailed introduction to several parameters that differ between Orca and pandas.
- dtype parameter
When importing a CSV file, Orca automatically recognizes the data types of the columns and supports various common time formats. The user can also force specific data types through the dtype parameter.

Note that Orca's read_csv function not only supports specifying various numpy data types (np.bool, np.int8, np.float32, etc.), but also supports specifying, as strings, all data types provided by DolphinDB, including all time types and string types.

For example:
dfcsv = orca.read_csv("DATA_DIR/quotes.csv",
                      dtype={"TIME": "NANOTIMESTAMP", "Exchange": "SYMBOL", "SYMBOL": "SYMBOL",
                             "Bid_Price": np.float64, "Bid_Size": np.int32,
                             "Offer_Price": np.float64, "Offer_Size": np.int32})
- partitioned parameter
A bool, defaulting to True. When True and the data reaches a certain size, the data is imported as a partitioned in-memory table; when False, the CSV is imported directly as an ordinary, unpartitioned DolphinDB in-memory table.

Please note: an Orca partitioned table differs from an Orca in-memory table in many operations; for details, see the special differences of Orca's partitioned tables. If your data volume is not very large and you value consistency between Orca and pandas, try not to import the data in a partitioned manner. If the data volume is large and you have very high performance requirements, importing in a partitioned manner is recommended.
- db_handle, table_name and partition_columns parameters
Orca's read_csv function also supports the db_handle, table_name and partition_columns parameters. They import the data into a DolphinDB partitioned table by specifying the DolphinDB database, table, and related information at import time.

DolphinDB supports importing data into a DolphinDB database in a variety of ways. When the db_handle, table_name and partition_columns parameters are specified in a call to Orca's read_csv function, it essentially calls DolphinDB's loadTextEx function, which imports the data directly into a DolphinDB partitioned table.
2.1.1 Import to an in-memory table

- Import as a partitioned in-memory table

Call the read_csv function directly and the data is imported in parallel. The parallel import is fast, but the memory usage is twice that of an ordinary table. In the following example, 'DATA_DIR' is the path where the data file is stored.
>>> DATA_DIR = "dolphindb/database"  # e.g. data_dir
>>> df = orca.read_csv(DATA_DIR + "/quotes.csv")
>>> df.head()
# output
                        time Exchange Symbol  Bid_Price  Bid_Size  \
0 2017-01-01 04:40:11.686699        T   AAPL       0.00         0
1 2017-01-01 06:42:50.247631        P   AAPL      26.70        10
2 2017-01-01 07:00:12.194786        P   AAPL      26.75         5
3 2017-01-01 07:15:03.578071        P   AAPL      26.70        10
4 2017-01-01 07:59:39.606882        K   AAPL      26.90         1

   Offer_Price  Offer_Size
0        27.42           1
1        27.47           1
2        27.47           1
3        27.47           1
4         0.00           0
- Import as an ordinary in-memory table

When the partitioned parameter is set to False, the data is imported as an ordinary in-memory table. This import requires less memory, but computation is slightly slower than with the import method above:

df = orca.read_csv(DATA_DIR + "/quotes.csv", partitioned=False)
2.1.2 Import to an on-disk partitioned table

A DolphinDB partitioned table can be stored on local disk or in DFS. The difference between an on-disk partitioned table and a distributed table is that the database path of a distributed table starts with "dfs://", while the database path of an on-disk partitioned table is a local path.

Example

Create an on-disk partitioned database on the DolphinDB server. In the following script, 'YOUR_DIR' is the path where the on-disk database is saved:

dbPath=YOUR_DIR + "/demoOnDiskPartitionedDB"
login('admin', '123456')
if(existsDatabase(dbPath))
    dropDatabase(dbPath)
db=database(dbPath, RANGE, datehour(2017.01.01 00:00:00+(0..24)*3600))

Please note: the script above needs to be executed on the DolphinDB server; in the Python client it can be executed through the DolphinDB Python API.

Call Orca's read_csv function in the Python client, specifying db_handle as the on-disk partitioned database YOUR_DIR + "/demoOnDiskPartitionedDB", table_name as "quotes", and partition_columns as "time". The data is imported into the DolphinDB on-disk partitioned table, and an object representing the DolphinDB table is returned to df for subsequent computation.
>>> df = orca.read_csv(path=DATA_DIR+"/quotes.csv", dtype={"Exchange": "SYMBOL", "SYMBOL": "SYMBOL"},
                       db_handle=YOUR_DIR + "/demoOnDiskPartitionedDB", table_name="quotes",
                       partition_columns="time")
>>> df
# output
<'dolphindb.orca.core.frame.DataFrame' object representing a column in a DolphinDB segmented table>
>>> df.head()
# output
                        time Exchange Symbol  Bid_Price  Bid_Size  \
0 2017-01-01 04:40:11.686699        T   AAPL       0.00         0
1 2017-01-01 06:42:50.247631        P   AAPL      26.70        10
2 2017-01-01 07:00:12.194786        P   AAPL      26.75         5
3 2017-01-01 07:15:03.578071        P   AAPL      26.70        10
4 2017-01-01 07:59:39.606882        K   AAPL      26.90         1

   Offer_Price  Offer_Size
0        27.42           1
1        27.47           1
2        27.47           1
3        27.47           1
4         0.00           0
The above steps can be combined into the following executable Python script:

>>> s = orca.default_session()
>>> DATA_DIR = "/dolphindb/database"  # e.g. data_dir
>>> YOUR_DIR = "/dolphindb/database"  # e.g. database_dir
>>> create_onDiskPartitioned_database = """
dbPath="{YOUR_DIR}" + "/demoOnDiskPartitionedDB"
login('admin', '123456')
if(existsDatabase(dbPath))
    dropDatabase(dbPath)
db=database(dbPath, RANGE, datehour(2017.01.01 00:00:00+(0..24)*3600))
""".format(YOUR_DIR=YOUR_DIR)
>>> s.run(create_onDiskPartitioned_database)
>>> df = orca.read_csv(path=DATA_DIR+"/quotes.csv", dtype={"Exchange": "SYMBOL", "SYMBOL": "SYMBOL"},
                       db_handle=YOUR_DIR + "/demoOnDiskPartitionedDB", table_name="quotes",
                       partition_columns="time")
In the script above, default_session is actually the session created by the orca.connect function; on the Python side, this session is used to interact with the DolphinDB server. For more features, see the DolphinDB Python API.

Please note: before importing data into a database specified through the read_csv function, make sure the corresponding database has already been created on the DolphinDB server. The read_csv function imports data into the DolphinDB database according to the specified database, table name and partitioning columns; if the table exists, the data is appended to it, and if not, the table is created and the data imported into it.
2.1.3 Import to a distributed table

If the db_handle parameter of the read_csv function is set to a DFS database path, the data is imported directly into the DolphinDB DFS database.

Example

Please note that distributed tables can only be used in a cluster environment with enableDFS=1 or in DolphinDB single-node mode.

Similar to the on-disk partitioned table, first create a distributed table on the DolphinDB server; the only change needed is to make the database path a string starting with "dfs://".

dbPath="dfs://demoDB"
login('admin', '123456')
if(existsDatabase(dbPath))
    dropDatabase(dbPath)
db=database(dbPath, RANGE, datehour(2017.01.01 00:00:00+(0..24)*3600))

Call Orca's read_csv function in the Python client, specifying db_handle as the distributed database "dfs://demoDB", table_name as "quotes", and partition_columns as "time", to import the data into a DolphinDB distributed table.
>>> df = orca.read_csv(path=DATA_DIR+"/quotes.csv", dtype={"Exchange": "SYMBOL", "SYMBOL": "SYMBOL"},
                       db_handle="dfs://demoDB", table_name="quotes", partition_columns="time")
>>> df
# output
<'dolphindb.orca.core.frame.DataFrame' object representing a column in a DolphinDB segmented table>
>>> df.head()
# output
                        time Exchange Symbol  Bid_Price  Bid_Size  \
0 2017-01-01 04:40:11.686699        T   AAPL       0.00         0
1 2017-01-01 06:42:50.247631        P   AAPL      26.70        10
2 2017-01-01 07:00:12.194786        P   AAPL      26.75         5
3 2017-01-01 07:15:03.578071        P   AAPL      26.70        10
4 2017-01-01 07:59:39.606882        K   AAPL      26.90         1

   Offer_Price  Offer_Size
0        27.42           1
1        27.47           1
2        27.47           1
3        27.47           1
4         0.00           0
The above steps can be combined into the following executable Python script:

>>> s = orca.default_session()
>>> DATA_DIR = "/dolphindb/database"  # e.g. data_dir
>>> create_dfs_database = """
dbPath="dfs://demoDB"
login('admin', '123456')
if(existsDatabase(dbPath))
    dropDatabase(dbPath)
db=database(dbPath, RANGE, datehour(2017.01.01 00:00:00+(0..24)*3600))
"""
>>> s.run(create_dfs_database)
>>> df = orca.read_csv(path=DATA_DIR+"/quotes.csv", dtype={"Exchange": "SYMBOL", "SYMBOL": "SYMBOL"},
                       db_handle="dfs://demoDB", table_name="quotes", partition_columns="time")
2.2 The read_table function

Orca provides the read_table function, which loads data from a DolphinDB table given the DolphinDB database and table name. It can be used to load DolphinDB disk tables, on-disk partitioned tables and distributed tables. If you have already created a database and table in DolphinDB, you can call this function in Orca directly to load data stored on the DolphinDB server. The read_table function supports the following parameters:

- database: database name
- table: table name
- partition: the partitions to load (optional)
Loading a DolphinDB disk table

The read_table function can be used to load a DolphinDB disk table. First create a local disk table on the DolphinDB server:

>>> s = orca.default_session()
>>> YOUR_DIR = "/dolphindb/database"  # e.g. database_dir
>>> create_onDisk_database="""
saveTable("{YOUR_DIR}"+"/demoOnDiskDB", table(2017.01.01..2017.01.10 as date, rand(10.0,10) as prices), "quotes")
""".format(YOUR_DIR=YOUR_DIR)
>>> s.run(create_onDisk_database)
Load the disk table through the read_table function:

>>> df = orca.read_table(YOUR_DIR + "/demoOnDiskDB", "quotes")
>>> df.head()
# output
        date    prices
0 2017-01-01  8.065677
1 2017-01-02  2.969041
2 2017-01-03  3.688191
3 2017-01-04  4.773723
4 2017-01-05  5.567130
Please note: the read_table function requires that the database and table to be loaded already exist on the DolphinDB server; if only the database exists and the table has not been created, the data cannot be loaded into Python.

Loading a DolphinDB on-disk partitioned table

Tables already created in DolphinDB can be loaded directly through the read_table function. For example, to load the on-disk partitioned table created in section 2.1.2:

>>> df = orca.read_table(YOUR_DIR + "/demoOnDiskPartitionedDB", "quotes")

Loading a DolphinDB distributed table

Distributed tables can likewise be loaded through the read_table function. For example, to load the distributed table created in section 2.1.3:

>>> df = orca.read_table("dfs://demoDB", "quotes")
2.3 The from_pandas function

Orca provides the from_pandas function, which takes a pandas DataFrame as a parameter and returns an Orca DataFrame. In this way, Orca can directly load data previously stored in a pandas DataFrame.

>>> import pandas as pd
>>> import numpy as np
>>> pdf = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), columns=['a', 'b', 'c'])
>>> odf = orca.from_pandas(pdf)
3 Support for other file formats

For importing other data formats, Orca also provides pandas-like interfaces. These functions include: read_pickle, read_fwf, read_msgpack, read_clipboard, read_excel, read_json, json_normalize, build_table_schema, read_html, read_hdf, read_feather, read_parquet, read_sas, read_sql_table, read_sql_query, read_sql, read_gbq and read_stata.
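Since these readers mirror pandas' interfaces, a pure-pandas round trip illustrates the call shape. The sketch below uses pd.read_pickle; it assumes orca.read_pickle accepts the same file-path argument, and the file name is hypothetical:

```python
import os
import tempfile

import pandas as pd

# A small DataFrame to round-trip through a pickle file.
pdf = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.pkl")   # hypothetical file name
    pdf.to_pickle(path)                    # write the DataFrame as a pickle file
    restored = pd.read_pickle(path)        # read it back

print(restored.equals(pdf))  # True
```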