Orca data loading tutorial

This article describes how to load data in Orca.

1 Establish a database connection

Connect to the DolphinDB server through the connect function in Orca:

>>> import dolphindb.orca as orca
>>> orca.connect(MY_HOST, MY_PORT, MY_USERNAME, MY_PASSWORD)
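
For example, assuming a DolphinDB server running locally on the default port 8848, with the default administrator account used in the scripts later in this tutorial:

>>> orca.connect("localhost", 8848, "admin", "123456")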

2 Import data

The following tutorial uses a data file: quotes.csv.

2.1 read_csv function

Orca provides the read_csv function for importing data sets. Note that the engine parameter of Orca's read_csv function can take the value 'c', 'python' or 'dolphindb', and its default value is 'dolphindb'. When the value is 'dolphindb', read_csv looks for the data file to be imported in the DolphinDB server's directory. When the value is 'python' or 'c', read_csv looks for the data file in the directory of the Python client.

Note that when the engine parameter is set to 'python' or 'c', the Orca read_csv call corresponds to the pandas read_csv function. The rest of this section explains the read_csv function under the premise that the engine parameter is set to 'dolphindb'.
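
For instance, a minimal sketch of reading a file through the pandas engine (assuming quotes.csv sits in the Python client's working directory):

>>> df = orca.read_csv("quotes.csv", engine="python")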

When the engine parameter is set to 'dolphindb', the parameters currently supported by Orca's read_csv function are as follows (a brief example follows the list):

  • path: the file path
  • sep: the separator
  • delimiter: the delimiter
  • names: specify the column names
  • index_col: specify the column to use as the index
  • engine: the engine used for the import
  • usecols: specify the columns to import
  • squeeze: when the data file has only one row or one column, whether to compress the DataFrame into a Series
  • prefix: the prefix string added to each column
  • dtype: specify the data types for the import
  • partitioned: whether to allow the data to be imported in a partitioned manner
  • db_handle: the path of the database to import into
  • table_name: the name of the table to import into
  • partition_columns: the column names used for partitioning
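
A minimal sketch combining several of these parameters (usecols and index_col; the column names come from the quotes.csv sample used throughout this tutorial, and passing a column name to index_col, as in pandas, is an assumption here):

>>> df = orca.read_csv("DATA_DIR/quotes.csv", usecols=["time", "Symbol", "Bid_Price", "Offer_Price"], index_col="time")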

The following describes in detail several parameters that differ between Orca and pandas.

  • dtype parameter

Orca automatically recognizes the data types of the file to be imported when importing a csv, and supports various common time formats. The user can also force specific data types through the dtype parameter.

It should be noted that Orca's read_csv function not only supports specifying various numpy data types (np.bool, np.int8, np.float32, etc.), but also supports specifying, as strings, all data types provided by DolphinDB, including all time types and string types.

For example:

import numpy as np

dfcsv = orca.read_csv("DATA_DIR/quotes.csv", dtype={"TIME": "NANOTIMESTAMP", "Exchange": "SYMBOL", "SYMBOL": "SYMBOL", "Bid_Price": np.float64, "Bid_Size": np.int32, "Offer_Price": np.float64, "Offer_Size": np.int32})
  • partitioned parameter

This is a bool parameter that defaults to True. When it is True and the data reaches a certain size, the data will be imported as a partitioned in-memory table. If it is set to False, the csv will be imported directly as an unpartitioned ordinary DolphinDB in-memory table.

Please note: compared with Orca's in-memory tables, Orca's partitioned tables behave differently in many operations; for details, see the special differences of Orca's partitioned tables. If your data volume is not very large and you want higher consistency between Orca and pandas, try not to import the data in a partitioned manner. If your data volume is huge and you have extremely high performance requirements, it is recommended to import the data by partition.
  • db_handle, table_name and partition_columns parameters

Orca's read_csv function also supports the db_handle, table_name and partition_columns parameters. These parameters import the data into a DolphinDB partitioned table by specifying the target DolphinDB database, table and related information during the import.

DolphinDB supports importing data into a DolphinDB database in a variety of ways. When Orca's read_csv function is called with the db_handle, table_name and partition_columns parameters specified, it essentially calls DolphinDB's loadTextEx function. In this way, the data can be imported directly into a DolphinDB partitioned table.
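
For reference, a minimal sketch of the equivalent server-side loadTextEx call, executed through the API session (the database is assumed to exist already; it is created in section 2.1.2 below, and DATA_DIR and YOUR_DIR are the same placeholders used there):

>>> s = orca.default_session()
>>> s.run("""
loadTextEx(database("{YOUR_DIR}" + "/demoOnDiskPartitionedDB"), `quotes, `time, "{DATA_DIR}" + "/quotes.csv")
""".format(YOUR_DIR=YOUR_DIR, DATA_DIR=DATA_DIR))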

2.1.1 Import to memory table

  • Import as a partitioned in-memory table

Call the read_csv function directly and the data will be imported in parallel. Thanks to the parallel import, the import speed is fast, but the memory usage is twice that of an ordinary table. In the following example, 'DATA_DIR' is the path where the data file is stored.

>>> DATA_DIR = "/dolphindb/database" # e.g. data_dir
>>> df = orca.read_csv(DATA_DIR + "/quotes.csv")
>>> df.head()
# output

                        time Exchange Symbol  Bid_Price  Bid_Size  \
0 2017-01-01 04:40:11.686699        T   AAPL       0.00         0   
1 2017-01-01 06:42:50.247631        P   AAPL      26.70        10   
2 2017-01-01 07:00:12.194786        P   AAPL      26.75         5   
3 2017-01-01 07:15:03.578071        P   AAPL      26.70        10   
4 2017-01-01 07:59:39.606882        K   AAPL      26.90         1   
   Offer_Price  Offer_Size  
0        27.42           1  
1        27.47           1  
2        27.47           1  
3        27.47           1  
4         0.00           0  
  • Import as an ordinary in-memory table

When the partitioned parameter is set to False, the file is imported as an ordinary in-memory table. This import has lower memory requirements, but the computation speed is slightly lower than that of the partitioned import above:

df = orca.read_csv(DATA_DIR + "/quotes.csv", partitioned=False)

2.1.2 Import to disk table

DolphinDB partitioned tables can be stored on the local disk or on DFS. The difference between a disk partitioned table and a distributed table is that the database path of a distributed table starts with "dfs://", while the database path of a disk partitioned table is a local path.

Example

We create a disk partitioned table on the DolphinDB server. In the script below, 'YOUR_DIR' is the path where the disk database is saved:

dbPath=YOUR_DIR + "/demoOnDiskPartitionedDB"
login('admin', '123456')
if(existsDatabase(dbPath))
   dropDatabase(dbPath)
db=database(dbPath, RANGE, datehour(2017.01.01 00:00:00+(0..24)*3600))

Please note: the above script needs to be executed on the DolphinDB server; it can also be executed from the Python client through the DolphinDB Python API.

Call Orca's read_csv function in the Python client, specifying the database db_handle as the disk partitioned database YOUR_DIR + "/demoOnDiskPartitionedDB", the table name table_name as "quotes", and the partition column partition_columns as "time". This imports the data into a DolphinDB disk partitioned table and assigns to df an object representing that DolphinDB table for subsequent calculations.

>>> df = orca.read_csv(path=DATA_DIR+"/quotes.csv", dtype={"Exchange": "SYMBOL", "SYMBOL": "SYMBOL"}, db_handle=YOUR_DIR + "/demoOnDiskPartitionedDB", table_name="quotes", partition_columns="time")
>>> df
# output
<'dolphindb.orca.core.frame.DataFrame' object representing a column in a DolphinDB segmented table>
>>> df.head()
# output
                        time Exchange Symbol  Bid_Price  Bid_Size  \
0 2017-01-01 04:40:11.686699        T   AAPL       0.00         0   
1 2017-01-01 06:42:50.247631        P   AAPL      26.70        10   
2 2017-01-01 07:00:12.194786        P   AAPL      26.75         5   
3 2017-01-01 07:15:03.578071        P   AAPL      26.70        10   
4 2017-01-01 07:59:39.606882        K   AAPL      26.90         1   
   Offer_Price  Offer_Size  
0        27.42           1  
1        27.47           1  
2        27.47           1  
3        27.47           1  
4         0.00           0  

The executable script in Python that integrates the above process is as follows:

>>> s = orca.default_session()
>>> DATA_DIR = "/dolphindb/database" # e.g. data_dir
>>> YOUR_DIR = "/dolphindb/database" # e.g. database_dir
>>> create_onDiskPartitioned_database = """
dbPath="{YOUR_DIR}" + "/demoOnDiskPartitionedDB"
login('admin', '123456')
if(existsDatabase(dbPath))
   dropDatabase(dbPath)
db=database(dbPath, RANGE, datehour(2017.01.01 00:00:00+(0..24)*3600))
""".format(YOUR_DIR=YOUR_DIR)
>>> s.run(create_onDiskPartitioned_database)
>>> df = orca.read_csv(path=DATA_DIR+"/quotes.csv", dtype={"Exchange": "SYMBOL", "SYMBOL": "SYMBOL"}, db_handle=YOUR_DIR + "/demoOnDiskPartitionedDB", table_name="quotes", partition_columns="time")

In the above script, the default_session function we use actually returns the session created by the orca.connect function. On the Python side, we can interact with the DolphinDB server through this session. For more functionality, see the DolphinDB Python API.

Note: before using read_csv to import data into a specified database, you need to make sure that the corresponding database has already been created on the DolphinDB server. The read_csv function imports data according to the specified database, table name and partition columns: if the table exists, the data is appended; if the table does not exist, the table is created and the data is imported.
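
A quick way to verify this from the Python side is to check for the database through the same session (a sketch; existsDatabase is a standard DolphinDB function, and YOUR_DIR is the placeholder from the script above):

>>> s.run('existsDatabase("{YOUR_DIR}" + "/demoOnDiskPartitionedDB")'.format(YOUR_DIR=YOUR_DIR))
# output
True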

2.1.3 Import to distributed table

If the db_handle parameter of the read_csv function is specified as a DFS database path, the data will be imported directly into DolphinDB's DFS database.

Example

Please note that only a cluster environment with enableDFS=1 or DolphinDB single-node mode can use distributed tables.

Similar to the disk partitioned table, you first need to create a distributed table on the DolphinDB server; you only need to change the database path to a string starting with "dfs://".

dbPath="dfs://demoDB"
login('admin', '123456')
if(existsDatabase(dbPath))
   dropDatabase(dbPath)
db=database(dbPath, RANGE, datehour(2017.01.01 00:00:00+(0..24)*3600))

Call Orca's read_csv function in the Python client, specifying the database db_handle as the distributed database "dfs://demoDB", the table name table_name as "quotes", and the partition column partition_columns as "time", to import the data into a DolphinDB distributed table.

>>> df = orca.read_csv(path=DATA_DIR+"/quotes.csv", dtype={"Exchange": "SYMBOL", "SYMBOL": "SYMBOL"}, db_handle="dfs://demoDB", table_name="quotes", partition_columns="time")
>>> df
# output
<'dolphindb.orca.core.frame.DataFrame' object representing a column in a DolphinDB segmented table>
>>> df.head()
# output
                        time Exchange Symbol  Bid_Price  Bid_Size  \
0 2017-01-01 04:40:11.686699        T   AAPL       0.00         0   
1 2017-01-01 06:42:50.247631        P   AAPL      26.70        10   
2 2017-01-01 07:00:12.194786        P   AAPL      26.75         5   
3 2017-01-01 07:15:03.578071        P   AAPL      26.70        10   
4 2017-01-01 07:59:39.606882        K   AAPL      26.90         1   
   Offer_Price  Offer_Size  
0        27.42           1  
1        27.47           1  
2        27.47           1  
3        27.47           1  
4         0.00           0  

The executable script in Python that integrates the above process is as follows:

>>> s = orca.default_session()
>>> DATA_DIR = "/dolphindb/database" # e.g. data_dir
>>> create_dfs_database = """
dbPath="dfs://demoDB"
login('admin', '123456')
if(existsDatabase(dbPath))
   dropDatabase(dbPath)
db=database(dbPath, RANGE, datehour(2017.01.01 00:00:00+(0..24)*3600))
"""
>>> s.run(create_dfs_database)
>>> df = orca.read_csv(path=DATA_DIR+"/quotes.csv", dtype={"Exchange": "SYMBOL", "SYMBOL": "SYMBOL"}, db_handle="dfs://demoDB", table_name="quotes", partition_columns="time")

2.2 read_table function

Orca provides a read_table function that loads data from a DolphinDB table given the DolphinDB database and table name. It can be used to load DolphinDB disk tables, disk partitioned tables and distributed tables. If you have already created databases and tables in DolphinDB, you can call this function directly in Orca to load the data stored on the DolphinDB server. The parameters supported by the read_table function are as follows:

  • database: database name
  • table: table name
  • partition: the partition(s) to load, an optional parameter (see the sketch after this list)
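
A minimal sketch of the optional partition parameter. The partition value below is an assumption for illustration: here we suppose a partition is identified by its partitioning-column value, a datehour matching the RANGE scheme from section 2.1.3; the accepted format depends on the database's partitioning scheme.

>>> df = orca.read_table("dfs://demoDB", "quotes", partition="2017.01.01T00")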

Load DolphinDB's disk table

The read_table function can be used to load a DolphinDB disk table. First, create a local disk table on the DolphinDB server:

>>> s = orca.default_session()
>>> YOUR_DIR = "/dolphindb/database" # e.g. database_dir
>>> create_onDisk_database="""
saveTable("{YOUR_DIR}"+"/demoOnDiskDB", table(2017.01.01..2017.01.10 as date, rand(10.0,10) as prices), "quotes")
""".format(YOUR_DIR=YOUR_DIR)
>>> s.run(create_onDisk_database)

Load the disk table through the read_table function:

>>> df = orca.read_table(YOUR_DIR + "/demoOnDiskDB", "quotes")
>>> df.head()
# output
      date    prices
0 2017-01-01  8.065677
1 2017-01-02  2.969041
2 2017-01-03  3.688191
3 2017-01-04  4.773723
4 2017-01-05  5.567130

Please note: the read_table function requires that the database and table to be loaded already exist on the DolphinDB server. If only the database exists and the table has not been created, the data cannot be loaded into Python successfully.

Load DolphinDB disk partition table

Data tables that have already been created on DolphinDB can be loaded directly through the read_table function. For example, load the disk partitioned table created in section 2.1.2:

>>> df = orca.read_table(YOUR_DIR + "/demoOnDiskPartitionedDB", "quotes")

Load DolphinDB distributed table

Distributed tables can also be loaded through the read_table function. For example, load the distributed table created in section 2.1.3:

>>> df = orca.read_table("dfs://demoDB", "quotes")

2.3 from_pandas function

Orca provides a from_pandas function that accepts a pandas DataFrame as a parameter and returns an Orca DataFrame. In this way, Orca can directly load data that was originally stored in a pandas DataFrame.

>>> import pandas as pd
>>> import numpy as np

>>> pdf = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),  columns=['a', 'b', 'c'])
>>> odf = orca.from_pandas(pdf)
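
To move data in the opposite direction, an Orca DataFrame can be converted back to a pandas DataFrame (a sketch; to_pandas pulls the table from the DolphinDB server into local memory):

>>> odf.to_pandas()
# output
   a  b  c
0  1  2  3
1  4  5  6
2  7  8  9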

3 Support for other format files

For importing other data formats, Orca also provides pandas-like interfaces. These methods include: read_pickle, read_fwf, read_msgpack, read_clipboard, read_excel, read_json, json_normalize, build_table_schema, read_html, read_hdf, read_feather, read_parquet, read_sas, read_sql_table, read_sql_query, read_sql, read_gbq and read_stata.
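
For instance, a pickle file could be loaded through this pandas-style interface (a sketch; the quotes.pkl path is a hypothetical placeholder):

>>> df = orca.read_pickle(DATA_DIR + "/quotes.pkl")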
