This article describes how to load data in Orca.
1 Establish a database connection
Connect to the DolphinDB server through the connect function in Orca:
>>> import dolphindb.orca as orca
>>> orca.connect(MY_HOST, MY_PORT, MY_USERNAME, MY_PASSWORD)
2 Import data
The following tutorial uses a data file: quotes.csv.
2.1 read_csv function
Orca provides the read_csv function for importing data sets. Note that the engine parameter of Orca's read_csv function can take the values {'c', 'python', 'dolphindb'}, with 'dolphindb' as the default. When the value is 'dolphindb', the read_csv function looks for the data file to be imported in the DolphinDB server directory. When the value is 'python' or 'c', it looks for the data file in the directory of the Python client.
Note that when the engine parameter is set to 'python' or 'c', calling Orca's read_csv function is equivalent to calling the pandas read_csv function. The rest of this section describes read_csv under the assumption that the engine parameter is set to 'dolphindb'.
When the engine parameter is set to 'dolphindb', the parameters currently supported by Orca's read_csv function are as follows:
- path: the file path
- sep: the separator
- delimiter: the delimiter
- names: specify the column names
- index_col: specify the column(s) to use as the index
- engine: the import engine
- usecols: specify the columns to import
- squeeze: when the data file has only one row or one column, whether to compress the DataFrame into a Series
- prefix: the prefix string added to each column
- dtype: specify the data types of the columns
- partitioned: whether to import the data in a partitioned manner
- db_handle: the path of the database to import into
- table_name: the name of the table to import into
- partition_columns: the column names for partitioning
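Most of these parameters keep their pandas semantics, so their behavior can be sketched with pandas itself. A minimal runnable example of names and usecols, using a small inline headerless CSV (the sample rows and column names are illustrative assumptions, not taken from quotes.csv):

```python
import io

import pandas as pd

# Hypothetical headerless sample standing in for a CSV file on disk.
csv_text = "AAPL,26.70,10\nMSFT,57.10,8\n"

# names supplies column names for a headerless file;
# usecols then selects a subset of those columns.
df = pd.read_csv(io.StringIO(csv_text),
                 names=["Symbol", "Bid_Price", "Bid_Size"],
                 usecols=["Symbol", "Bid_Price"])

print(list(df.columns))  # ['Symbol', 'Bid_Price']
print(len(df))           # 2
```

The same keyword arguments can be passed to orca.read_csv.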
The following describes in detail several parameters that differ between Orca and pandas.
- dtype parameter
When importing a CSV file, Orca automatically recognizes the data types and supports various common time formats. The user can also force specific data types through the dtype parameter.
Note that Orca's read_csv function supports not only the various numpy data types (np.bool, np.int8, np.float32, etc.) but also all data types provided by DolphinDB, specified as strings, including all time types and string types.
For example:
dfcsv = orca.read_csv("DATA_DIR/quotes.csv", dtype={"TIME": "NANOTIMESTAMP", "Exchange": "SYMBOL", "SYMBOL": "SYMBOL", "Bid_Price": np.float64, "Bid_Size": np.int32, "Offer_Price": np.float64, "Offer_Size": np.int32})
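For the numpy part of such a dtype mapping, the effect can be checked without a DolphinDB server, since with engine='python' or 'c' Orca delegates to pandas. A minimal sketch using an inline CSV (sample rows are hypothetical):

```python
import io

import numpy as np
import pandas as pd

# A tiny inline CSV standing in for quotes.csv (hypothetical sample rows).
csv_text = """Symbol,Bid_Price,Bid_Size
AAPL,26.70,10
AAPL,26.75,5
"""

# The numpy entries of the dtype mapping behave the same way in pandas
# and in Orca; DolphinDB-specific string types like "SYMBOL" would only
# be understood by Orca with engine='dolphindb'.
df = pd.read_csv(io.StringIO(csv_text),
                 dtype={"Bid_Price": np.float64, "Bid_Size": np.int32})

print(df.dtypes["Bid_Size"])   # int32, forced by the dtype mapping
print(df.dtypes["Bid_Price"])  # float64
```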
- partitioned parameter
A bool, defaulting to True. When this parameter is True and the data reaches a certain size, the data is imported as a partitioned in-memory table. When it is set to False, the CSV file is imported as an ordinary (unpartitioned) DolphinDB in-memory table.
Please note: an Orca partitioned table differs from an Orca in-memory table in many operations; for details, see the special differences of Orca's partitioned tables. If your data volume is not very large and you value consistency between Orca and pandas, try not to import the data in a partitioned manner. If your data volume is huge and you have extremely high performance requirements, partitioned import is recommended.
- db_handle, table_name and partition_columns parameters
Orca's read_csv also supports the db_handle, table_name and partition_columns parameters. By specifying the DolphinDB database, table and related information at import time, these parameters import the data into a DolphinDB partitioned table.
DolphinDB supports importing data into a database in a variety of ways. When the db_handle, table_name and partition_columns parameters are specified, Orca's read_csv essentially calls DolphinDB's loadTextEx function, so the data is imported directly into a DolphinDB partitioned table.
2.1.1 Import to memory table
- Import as a partitioned in-memory table
Call the read_csv function directly and the data is imported in parallel. Parallel import is fast, but its memory usage is twice that of an ordinary table. In the following example, 'DATA_DIR' is the path where the data file is stored.
>>> DATA_DIR = "/dolphindb/database" # e.g. data_dir
>>> df = orca.read_csv(DATA_DIR + "/quotes.csv")
>>> df.head()
# output
time Exchange Symbol Bid_Price Bid_Size \
0 2017-01-01 04:40:11.686699 T AAPL 0.00 0
1 2017-01-01 06:42:50.247631 P AAPL 26.70 10
2 2017-01-01 07:00:12.194786 P AAPL 26.75 5
3 2017-01-01 07:15:03.578071 P AAPL 26.70 10
4 2017-01-01 07:59:39.606882 K AAPL 26.90 1
Offer_Price Offer_Size
0 27.42 1
1 27.47 1
2 27.47 1
3 27.47 1
4 0.00 0
- Import as an ordinary in-memory table
Set the partitioned parameter to False to import the data as an ordinary in-memory table. This import mode has lower memory requirements, but its computation speed is slightly lower than that of the method above:
>>> df = orca.read_csv(DATA_DIR + "/quotes.csv", partitioned=False)
2.1.2 Import to disk table
DolphinDB's partitioned tables can be stored on the local disk or in DFS. The difference between a disk partitioned table and a distributed table lies in the database path: the path of a distributed table starts with "dfs://", while that of a disk partitioned table is a local path.
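The distinction lives entirely in the path string, so it can be captured in a small helper (is_dfs_path is not part of Orca or DolphinDB; it is just a hypothetical illustration):

```python
def is_dfs_path(db_path: str) -> bool:
    """Return True if db_path refers to a DolphinDB distributed (DFS) database."""
    return db_path.startswith("dfs://")

# A distributed table's database path starts with "dfs://" ...
print(is_dfs_path("dfs://demoDB"))                                  # True
# ... while a disk partitioned table's database path is a local path.
print(is_dfs_path("/dolphindb/database/demoOnDiskPartitionedDB"))   # False
```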
Example
We create a disk partitioned database on the DolphinDB server. In the script below, 'YOUR_DIR' is the path where the disk database is saved:
dbPath=YOUR_DIR + "/demoOnDiskPartitionedDB"
login('admin', '123456')
if(existsDatabase(dbPath))
dropDatabase(dbPath)
db=database(dbPath, RANGE, datehour(2017.01.01 00:00:00+(0..24)*3600))
Please note: the above script needs to be executed on the DolphinDB server; from the Python client, it can be run through the DolphinDB Python API.
Call Orca's read_csv function in the Python client, specifying db_handle as the disk partitioned database YOUR_DIR + "/demoOnDiskPartitionedDB", table_name as "quotes" and partition_columns as "time". The data is imported into the DolphinDB disk partitioned table, and an object representing that DolphinDB table is returned and assigned to df for subsequent calculations.
>>> df = orca.read_csv(path=DATA_DIR+"/quotes.csv", dtype={"Exchange": "SYMBOL", "SYMBOL": "SYMBOL"}, db_handle=YOUR_DIR + "/demoOnDiskPartitionedDB", table_name="quotes", partition_columns="time")
>>> df
# output
<'dolphindb.orca.core.frame.DataFrame' object representing a column in a DolphinDB segmented table>
>>> df.head()
# output
time Exchange Symbol Bid_Price Bid_Size \
0 2017-01-01 04:40:11.686699 T AAPL 0.00 0
1 2017-01-01 06:42:50.247631 P AAPL 26.70 10
2 2017-01-01 07:00:12.194786 P AAPL 26.75 5
3 2017-01-01 07:15:03.578071 P AAPL 26.70 10
4 2017-01-01 07:59:39.606882 K AAPL 26.90 1
Offer_Price Offer_Size
0 27.42 1
1 27.47 1
2 27.47 1
3 27.47 1
4 0.00 0
The executable script in Python that integrates the above process is as follows:
>>> s = orca.default_session()
>>> DATA_DIR = "/dolphindb/database" # e.g. data_dir
>>> YOUR_DIR = "/dolphindb/database" # e.g. database_dir
>>> create_onDiskPartitioned_database = """
dbPath="{YOUR_DIR}" + "/demoOnDiskPartitionedDB"
login('admin', '123456')
if(existsDatabase(dbPath))
dropDatabase(dbPath)
db=database(dbPath, RANGE, datehour(2017.01.01 00:00:00+(0..24)*3600))
""".format(YOUR_DIR=YOUR_DIR)
>>> s.run(create_onDiskPartitioned_database)
>>> df = orca.read_csv(path=DATA_DIR+"/quotes.csv", dtype={"Exchange": "SYMBOL", "SYMBOL": "SYMBOL"}, db_handle=YOUR_DIR + "/demoOnDiskPartitionedDB", table_name="quotes", partition_columns="time")
In the above script, default_session returns the session created by the orca.connect function. Through this session, the Python client can interact with the DolphinDB server. For more functions, see the DolphinDB Python API.
Note: before importing data into a database with read_csv, make sure that the corresponding database has already been created on the DolphinDB server. read_csv imports the data into the DolphinDB database according to the specified database, table name and partition columns. If the table already exists, the data is appended; if it does not, the table is created and the data imported.
2.1.3 Import to distributed table
If the db_handle parameter of read_csv is specified as a DFS database path, the data is imported directly into the DolphinDB DFS database.
Example
Please note: distributed tables can only be used in a cluster environment with enableDFS=1 or in DolphinDB single-node mode.
Similar to the disk partitioned table, you first need to create a distributed database on the DolphinDB server; simply change the database path to a string starting with "dfs://".
dbPath="dfs://demoDB"
login('admin', '123456')
if(existsDatabase(dbPath))
dropDatabase(dbPath)
db=database(dbPath, RANGE, datehour(2017.01.01 00:00:00+(0..24)*3600))
Call Orca's read_csv function in the Python client, specifying db_handle as the distributed database "dfs://demoDB", table_name as "quotes" and partition_columns as "time", to import the data into a DolphinDB distributed table.
>>> df = orca.read_csv(path=DATA_DIR+"/quotes.csv", dtype={"Exchange": "SYMBOL", "SYMBOL": "SYMBOL"}, db_handle="dfs://demoDB", table_name="quotes", partition_columns="time")
>>> df
# output
<'dolphindb.orca.core.frame.DataFrame' object representing a column in a DolphinDB segmented table>
>>> df.head()
# output
time Exchange Symbol Bid_Price Bid_Size \
0 2017-01-01 04:40:11.686699 T AAPL 0.00 0
1 2017-01-01 06:42:50.247631 P AAPL 26.70 10
2 2017-01-01 07:00:12.194786 P AAPL 26.75 5
3 2017-01-01 07:15:03.578071 P AAPL 26.70 10
4 2017-01-01 07:59:39.606882 K AAPL 26.90 1
Offer_Price Offer_Size
0 27.42 1
1 27.47 1
2 27.47 1
3 27.47 1
4 0.00 0
The executable script in Python that integrates the above process is as follows:
>>> s = orca.default_session()
>>> DATA_DIR = "/dolphindb/database" # e.g. data_dir
>>> create_dfs_database = """
dbPath="dfs://demoDB"
login('admin', '123456')
if(existsDatabase(dbPath))
dropDatabase(dbPath)
db=database(dbPath, RANGE, datehour(2017.01.01 00:00:00+(0..24)*3600))
"""
>>> s.run(create_dfs_database)
>>> df = orca.read_csv(path=DATA_DIR+"/quotes.csv", dtype={"Exchange": "SYMBOL", "SYMBOL": "SYMBOL"}, db_handle="dfs://demoDB", table_name="quotes", partition_columns="time")
2.2 read_table function
Orca provides the read_table function, which loads data from a DolphinDB table given the database and the table name. It can be used to load DolphinDB disk tables, disk partitioned tables and distributed tables. If you have already created databases and tables in DolphinDB, you can call this function directly in Orca to load the data stored on the DolphinDB server. The parameters supported by read_table are as follows:
- database: the database name
- table: the table name
- partition: the partition(s) to load; optional
Load a DolphinDB disk table
The read_table function can be used to load a DolphinDB disk table. First, create a local disk table on the DolphinDB server:
>>> s = orca.default_session()
>>> YOUR_DIR = "/dolphindb/database" # e.g. database_dir
>>> create_onDisk_database="""
saveTable("{YOUR_DIR}"+"/demoOnDiskDB", table(2017.01.01..2017.01.10 as date, rand(10.0,10) as prices), "quotes")
""".format(YOUR_DIR=YOUR_DIR)
>>> s.run(create_onDisk_database)
Load the disk table through the read_table function:
>>> df = orca.read_table(YOUR_DIR + "/demoOnDiskDB", "quotes")
>>> df.head()
# output
date prices
0 2017-01-01 8.065677
1 2017-01-02 2.969041
2 2017-01-03 3.688191
3 2017-01-04 4.773723
4 2017-01-05 5.567130
Please note: the read_table function requires that the database and table to be loaded already exist on the DolphinDB server. If the database exists but the table has not been created, the data cannot be loaded into Python.
Load a DolphinDB disk partitioned table
Tables that have already been created in DolphinDB can be loaded directly through the read_table function. For example, load the disk partitioned table created in section 2.1.2:
>>> df = orca.read_table(YOUR_DIR + "/demoOnDiskPartitionedDB", "quotes")
Load a DolphinDB distributed table
Distributed tables can also be loaded through the read_table function. For example, load the distributed table created in section 2.1.3:
>>> df = orca.read_table("dfs://demoDB", "quotes")
2.3 from_pandas function
Orca provides a from_pandas function that accepts a pandas DataFrame as its argument and returns an Orca DataFrame. In this way, Orca can directly load data originally stored in a pandas DataFrame.
>>> import pandas as pd
>>> import numpy as np
>>> pdf = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), columns=['a', 'b', 'c'])
>>> odf = orca.from_pandas(pdf)
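from_pandas accepts any pandas DataFrame, however it was built. A sketch constructing the same table from a dict of columns (only the pandas side is runnable here; the from_pandas call itself requires a live orca.connect session, so it is shown commented out):

```python
import pandas as pd

# Equivalent construction from a dict of columns; from_pandas would accept
# either this or the np.array-based DataFrame above unchanged.
pdf = pd.DataFrame({"a": [1, 4, 7], "b": [2, 5, 8], "c": [3, 6, 9]})

print(pdf.shape)          # (3, 3)
print(list(pdf.columns))  # ['a', 'b', 'c']

# With a connected session:
# odf = orca.from_pandas(pdf)
```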
3 Support for files in other formats
For importing data in other formats, Orca also provides pandas-like interfaces, including: read_pickle, read_fwf, read_msgpack, read_clipboard, read_excel, read_json, json_normalize, build_table_schema, read_html, read_hdf, read_feather, read_parquet, read_sas, read_sql_table, read_sql_query, read_sql, read_gbq and read_stata.