DolphinDB Time-Series Database: Data Import Tutorial

When an enterprise uses a big data analytics platform, it first needs to migrate massive amounts of data from multiple data sources onto the platform.


Before importing data, we need to understand the basic concepts and features of the DolphinDB database.

DolphinDB data tables are divided into 3 types according to storage media:

  • Memory table: data is stored only in the memory of the node; access is the fastest, but the data is lost when the node shuts down.
  • Local disk table: data is saved on the local disk; even if the node restarts, the data can easily be loaded back into memory with a script.
  • Distributed table: data is physically distributed across different nodes, but through DolphinDB's distributed computing engine it can still be queried logically as if it were a single local table.
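To make the three storage types concrete, here is a minimal sketch; the paths and names (localDB, demoStorageDB, etc.) are assumptions for illustration only:

// in-memory table: lives only in the memory of this node
t = table(1..3 as id, rand(10.0, 3) as price)

// local disk table: saved in an on-disk database and can be reloaded after a restart
dbDisk = database("/home/data/localDB")
saveTable(dbDisk, t, "pricesOnDisk")

// distributed table: created in a DFS database from template t and spread across the cluster's data nodes
dbDfs = database("dfs://demoStorageDB", VALUE, 1..3)
dbDfs.createPartitionedTable(t, "pricesDfs", "id")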

DolphinDB data tables are divided into 2 types according to whether they are partitioned:

  • Ordinary table
  • Partitioned table

In traditional databases, partitioning is defined per table: each table in the same database can have its own partitioning scheme. In DolphinDB, partitioning is defined per database: a database can use only one partitioning scheme, so two tables with different partition schemes cannot be placed in the same database.
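For example, the following sketch (the database path and table names are assumed) creates two tables in one database; both must follow the database's single VALUE partition scheme on the date column:

db = database("dfs://demoPartitionDB", VALUE, 2018.01.01..2018.01.31)
// both tables are partitioned on tradingDay under the same scheme
db.createPartitionedTable(table(1:0, `tradingDay`price, [DATE,DOUBLE]), "quotes", "tradingDay")
db.createPartitionedTable(table(1:0, `tradingDay`qty, [DATE,INT]), "trades", "tradingDay")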


DolphinDB provides 3 flexible data import methods:

  • Import via CSV text file
  • Import via HDF5 file
  • Import via ODBC


1. Import via CSV text file

Transferring data via CSV files is a common way of migrating data. DolphinDB provides three functions, loadText, ploadText and loadTextEx, to import CSV files. Below we use a sample CSV file, candle_201801.csv, to illustrate the usage of these three functions.

1.1 loadText

Syntax: loadText(filename, [delimiter=','], [schema])

Parameters:

filename is the file name.

Both delimiter and schema are optional parameters.

delimiter is used to specify the delimiter of different fields, the default is ",".

The schema parameter specifies the data type of each field after the data is imported; it is a table. DolphinDB recognizes field types automatically, but in some cases the automatically recognized type does not meet the requirements. For example, when importing the sample CSV candle_201801.csv, the volume field is recognized as INT, while we actually need LONG; in such cases the schema parameter is needed.

Script to create the schema table:

nameCol = `symbol`exchange`cycle`tradingDay`date`time`open`high`low`close`volume`turnover`unixTime
typeCol = [SYMBOL,SYMBOL,INT,DATE,DATE,INT,DOUBLE,DOUBLE,DOUBLE,DOUBLE,LONG,DOUBLE,LONG]
schemaTb = table(nameCol as name,typeCol as type)

When a table has many fields, the script that creates the schema table becomes very verbose. To avoid this, DolphinDB provides the extractTextSchema function, which extracts the table structure from the text file; we then only need to modify the types of the fields that must be changed.

dataFilePath = "/home/data/candle_201801.csv"
schemaTb=extractTextSchema(dataFilePath)
update schemaTb set type=`LONG where name=`volume        
tt=loadText(dataFilePath,,schemaTb)


1.2 ploadText

ploadText loads a data file into memory in parallel as a partitioned in-memory table. Its syntax is exactly the same as that of loadText, but ploadText is faster. It is mainly used to load large files quickly and is designed to make full use of multiple cores to load a file in parallel; the degree of parallelism depends on the number of cores on the server and the node's localExecutors configuration.

Below we compare the performance of loadText and ploadText.

First, generate a 4 GB CSV file with the following script:

filePath = "/home/data/testFile.csv"
appendRows = 100000000
dateRange = 2010.01.01..2018.12.30
ints = rand(100, appendRows)
symbols = take(string('A'..'Z'), appendRows)
dates = take(dateRange, appendRows)
floats = rand(float(100), appendRows)
times = 00:00:00.000 + rand(86400000, appendRows)
t = table(ints as int, symbols as symbol, dates as date, floats as float, times as time)
t.saveText(filePath)

Import the file with loadText and ploadText respectively. The test node has a 4-core, 8-thread CPU.

timer loadText(filePath);
//Time elapsed: 39728.393 ms
timer ploadText(filePath);
//Time elapsed: 10685.838 ms

The results show that the performance of ploadText is almost 4 times that of loadText.


1.3 loadTextEx

Syntax: loadTextEx(dbHandle, tableName, [partitionColumns], fileName, [delimiter=','], [schema])

Parameters:

dbHandle is the database handle.

tableName is the name of the distributed table that holds the data.

partitionColumns , delimiter and schema are optional parameters.

When the partition scheme is not sequential (SEQ) partitioning, partitionColumns must be specified to indicate the partition columns.

fileName represents the name of the imported file.

delimiter is used to specify the delimiter of different fields, the default is ",".

The schema parameter specifies the data type of each field after the data is imported; it is a table.

The loadText function always imports the whole file into memory. When the data file is very large, the memory of the machine easily becomes a bottleneck. loadTextEx solves this problem well: it imports and saves the static CSV file into a DolphinDB distributed table in a relatively smooth streaming fashion, rather than loading everything into memory first and then saving it as a partitioned table, which greatly reduces the memory requirements.

First create a distributed table for storing data:

dataFilePath = "/home/data/candle_201801.csv"
tb = loadText(dataFilePath)
db=database("dfs://dataImportCSVDB",VALUE,2018.01.01..2018.01.31)  
db.createPartitionedTable(tb, "cycle", "tradingDay")

Then import the file into the distributed table:

loadTextEx(db, "cycle", "tradingDay", dataFilePath)

When the data is needed for analysis, first load the partition metadata into memory with the loadTable function. When a query is actually executed, DolphinDB loads the required data into memory on demand.

tb = database("dfs://dataImportCSVDB").loadTable("cycle")


2. Import via HDF5 file

HDF5 is a binary data file format that is more efficient than CSV and is widely used in the field of data analysis. DolphinDB also supports importing data through HDF5 format files.

DolphinDB accesses HDF5 files through the HDF5 plug-in, which provides the following methods:

  • hdf5::ls: List all Group and Dataset objects in the h5 file.
  • hdf5::lsTable: List all Dataset objects in the h5 file.
  • hdf5::hdf5DS: Returns the metadata of the Dataset in the h5 file.
  • hdf5::loadHdf5: Import the h5 file into the memory table.
  • hdf5::loadHdf5Ex: Import the h5 file into the partition table.
  • hdf5::extractHdf5Schema: Extract the table structure from the h5 file.

When calling a plug-in method, the namespace must be prefixed to the method name, for example hdf5::loadHdf5 when calling loadHdf5. If you do not want to write the namespace on every call, you can use the use keyword:

use hdf5
loadHdf5(filePath,tableName)

To use the plug-in, first download the HDF5 plug-in and deploy it to the plugins directory of the node, then load it before use with the following script:

loadPlugin("plugins/hdf5/PluginHdf5.txt")

Importing an HDF5 file is similar to importing a CSV file. For example, to import the sample HDF5 file candle_201801.h5, which contains one Dataset named candle_201801, the simplest way is as follows:

dataFilePath = "/home/data/candle_201801.h5"
datasetName = "candle_201801"
tmpTB = hdf5::loadHdf5(dataFilePath,datasetName)
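If the dataset name inside an h5 file is not known in advance, the hdf5::ls and hdf5::lsTable methods listed above can be used to inspect the file first, for example:

hdf5::ls(dataFilePath)        // lists all Group and Dataset objects in the file
hdf5::lsTable(dataFilePath)   // lists only the Dataset objects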

If you need to specify the data types for the import, you can use hdf5::extractHdf5Schema; the script is as follows:

dataFilePath = "/home/data/candle_201801.h5"
datasetName = "candle_201801"
schema=hdf5::extractHdf5Schema(dataFilePath,datasetName)
update schema set type=`LONG where name=`volume        
tt=hdf5::loadHdf5(dataFilePath,datasetName,schema)

If the HDF5 file is too large for the machine's memory to load in full, you can use hdf5::loadHdf5Ex to load the data.

First create a distributed table for storing data:

dataFilePath = "/home/data/candle_201801.h5"
datasetName = "candle_201801"
dfsPath = "dfs://dataImportHDF5DB"
tb = hdf5::loadHdf5(dataFilePath,datasetName)
db=database(dfsPath,VALUE,2018.01.01..2018.01.31)  
db.createPartitionedTable(tb, "cycle", "tradingDay")

Then import the HDF5 file through the hdf5::loadHdf5Ex function:

hdf5::loadHdf5Ex(db, "cycle", "tradingDay", dataFilePath,datasetName)
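As in the CSV example, the resulting partitioned table can then be loaded on demand for analysis:

tb = database("dfs://dataImportHDF5DB").loadTable("cycle")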


3. Import via ODBC interface

DolphinDB supports connecting to third-party databases through the ODBC interface and reading their tables directly into DolphinDB in-memory tables. With the ODBC plug-in provided by DolphinDB, data can easily be migrated from ODBC-supported databases to DolphinDB.

The ODBC plug-in provides the following four methods for operating data from third-party data sources:

  • odbc::connect: Open the connection.
  • odbc::close: Close the connection.
  • odbc::query: Query data according to the given SQL statement and return it to the DolphinDB memory table.
  • odbc::execute: Execute the given SQL statement in the third-party database without returning data.

Before using the ODBC plug-in, you need to install the ODBC driver; please refer to the ODBC plug-in tutorial.

The following takes connecting to SQL Server as an example; the configuration of the existing database is:

Server: 172.18.0.15

Default port: 1433

Connection user name: sa

Password: 123456

Database name: SZ_TAQ

The database table holds the data for January 2018; the table name is candle_201801, and its fields are the same as those of the CSV file.

To connect to the SQL Server database with the ODBC plug-in, first download the plug-in, unzip it, and copy all the files in its plugins\odbc directory to the plugins/odbc directory of the DolphinDB server, then initialize the plug-in with the following script:

//Load the plug-in
loadPlugin("plugins/odbc/odbc.cfg")
//Connect to SQL Server
conn=odbc::connect("Driver=ODBC Driver 17 for SQL Server;Server=172.18.0.15;Database=SZ_TAQ;Uid=sa;Pwd=123456;")

Before importing the data, create a distributed database on disk to save it:

//Get the table structure from SQL Server as the template for the DolphinDB table
tb = odbc::query(conn,"select top 1 * from candle_201801")
db=database("dfs://dataImportODBC",VALUE,2018.01.01..2018.01.31)
db.createPartitionedTable(tb, "cycle", "tradingDay")

Import data from SQL Server and save it as a DolphinDB partition table:

data = odbc::query(conn,"select * from candle_201801")
tb = database("dfs://dataImportODBC").loadTable("cycle")
tb.append!(data);

Importing data through ODBC avoids exporting and importing intermediate files, and combined with DolphinDB's scheduled job mechanism it can also serve as a data channel for periodically synchronizing data.
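As a hedged sketch of such a scheduled synchronization, the import above can be wrapped in a function and registered with DolphinDB's scheduleJob function; the helper function syncFromSqlServer, the job id, the schedule time and the end date below are assumptions for illustration:

def syncFromSqlServer(){
	// a real job would normally query only the incremental rows instead of the whole table
	conn = odbc::connect("Driver=ODBC Driver 17 for SQL Server;Server=172.18.0.15;Database=SZ_TAQ;Uid=sa;Pwd=123456;")
	data = odbc::query(conn, "select * from candle_201801")
	database("dfs://dataImportODBC").loadTable("cycle").append!(data)
	odbc::close(conn)
}
// run the synchronization once a day at 17:30
scheduleJob("syncCandle", "daily sync from SQL Server", syncFromSqlServer, 17:30m, today(), 2030.12.31, 'D')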


4. Financial data import case

The following example imports daily K-line (candlestick) data files of the securities market. The data is saved on disk in CSV format, about 100 GB in total covering 10 years, organized into one sub-directory per year. The directory layout is as follows:

2008
---- 000001.csv
---- 000002.csv
---- 000003.csv
---- 000004.csv
---- ...
2009
...
2018
The structure of every file is identical, containing the same fields as the candle_201801.csv sample used earlier.



4.1 Partition planning

Before importing data, you must first plan the partition of the data, which involves two considerations:

  • Determine the partition field.
  • Determine the granularity of the partition.

First, based on how queries are typically executed day to day, we partition on the two fields tradingDay and symbol, using a composite partition with a RANGE partition on each dimension. Partitioning on the most commonly queried fields greatly improves the efficiency of data retrieval and analysis.

The next thing to do is to define the granularity of the two partitions respectively.

The existing data spans 2008 to 2018, so it is partitioned by year along the time dimension. When planning the time partition, enough room should be left for future data, so the time range is set to 2008-2030:

yearRange =date(2008.01M + 12*0..22)

There are several thousand stock codes. If the symbol column were partitioned by value (VALUE), each partition would only be a few megabytes and the number of partitions would be huge. When a distributed system executes a query, it splits the query into multiple subtasks and distributes them to the relevant partitions; such a partitioning scheme would therefore produce an enormous number of tasks with extremely short execution times, and the system would spend more time managing tasks than executing them, which is clearly unreasonable. Instead, we sort all stock codes and divide them into 100 ranges, each range forming one partition, so that each partition ends up around 100 MB. Considering that new stocks will arrive later, a virtual code 999999 is appended to form the last partition together with the last stock code, to hold the data of future new stocks.

Generate the partition ranges for the symbol field with the following script:

//Traverse all the annual directories, collect the stock code list, and divide it into 100 ranges with cutPoints
symbols = array(SYMBOL, 0, 100)
yearDirs = files(rootDir)[`filename]
for(yearDir in yearDirs){
	path = rootDir + "/" + yearDir
	symbols.append!(files(path)[`filename].upper().strReplace(".CSV",""))
}
//Remove duplicates and reserve room for future stocks
symbols = symbols.distinct().sort!().append!("999999");
//Divide equally into 100 ranges
symRanges = symbols.cutPoints(100)

Use the following script to define the two-dimensional composite (COMPO) partition and create the database and the partitioned table:

columns=`symbol`exchange`cycle`tradingDay`date`time`open`high`low`close`volume`turnover`unixTime
types = [SYMBOL,SYMBOL,INT,DATE,DATE,TIME,DOUBLE,DOUBLE,DOUBLE,DOUBLE,LONG,DOUBLE,LONG]

dbDate = database("", RANGE, yearRange)
dbID = database("", RANGE, symRanges)
db = database(dbPath, COMPO, [dbDate, dbID])

pt = db.createPartitionedTable(table(1000000:0,columns,types), tableName, `tradingDay`symbol)

Note that a partition is the smallest unit in which DolphinDB stores data, and writes to a partition are exclusive, so when tasks run in parallel they must not write to the same partition at the same time. In this case each year's data is handled by a separate task, and the data ranges of the tasks do not overlap, so no two tasks can write to the same partition.


4.2 Import data

The main idea of the import script is simple: loop over the directory tree, read the CSV files one by one, and write them into the distributed database table dfs://SAMPLE_TRDDB. However, there are still many details to handle during the actual import.

The first problem is that the data format saved in the CSV files differs from DolphinDB's internal format. For example, in the files the time field is stored as a number such as 9390100000, representing a time accurate to the millisecond; if read directly it is recognized as a numeric type rather than a time type, so during import it must be converted with the parsing function datetimeParse combined with the formatting function format. The key script is as follows:

datetimeParse(format(time,"000000000"),"HHmmssSSS")
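For example (using an assumed sample value), an integer such as 93901000 is zero-padded to a 9-digit string and then parsed into a TIME value of 09:39:01.000:

format(93901000, "000000000")                              // "093901000"
datetimeParse(format(93901000, "000000000"), "HHmmssSSS")  // 09:39:01.000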

Although a straightforward loop-based import is easy to implement, the 100 GB of data actually consists of a very large number of small files of about 5 MB each, so a single-threaded import would take a long time. To make full use of the cluster's resources, we split the import by year into multiple subtasks and send them in turn to the task queues of the nodes to execute in parallel, which improves the import efficiency. This is implemented in the following two steps:

(1) Define a custom function that imports all the files under a given annual directory:

//Loop over all the data files in the annual directory
def loadCsvFromYearPath(path, dbPath, tableName){
	symbols = files(path)[`filename]
	for(sym in symbols){
		filePath = path + "/" + sym
		t=loadText(filePath)
		database(dbPath).loadTable(tableName).append!(select symbol, exchange,cycle, tradingDay,date,datetimeParse(format(time,"000000000"),"HHmmssSSS"),open,high,low,close,volume,turnover,unixTime from t )			
	}
}

(2) Submit the function defined above to each node for execution through the rpc function combined with the submitJob function:

nodesAlias="NODE" + string(1..4)
years= files(rootDir)[`filename]

index = 0;
for(year in years){	
	yearPath = rootDir + "/" + year
	des = "loadCsv_" + year
	rpc(nodesAlias[index%nodesAlias.size()],submitJob,des,des,loadCsvFromYearPath,yearPath,dbPath,tableName)
	index=index+1
}

During the import, the completion status of the background tasks can be observed with pnodeRun(getRecentJobs).

Case complete script

