DolphinDB text data loading tutorial

DolphinDB provides the following 4 functions for importing text data into memory or into a database:

  • loadText: imports a text file as an in-memory table.
  • ploadText: imports a text file in parallel as a partitioned in-memory table. It is faster than loadText.
  • loadTextEx: imports a text file into a database, which can be a distributed database, a local disk database, or an in-memory database.
  • textChunkDS: divides a text file into multiple small data sources, which can then be processed flexibly with the mr function.
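
For orientation, a minimal sketch of the four call styles (the file path and the database name "dfs://demoDB" are placeholders; each function is covered in detail in the sections below):

t1=loadText("/home/data/candle_201801.csv")                  // in-memory table
t2=ploadText("/home/data/candle_201801.csv")                 // parallel load into a partitioned in-memory table
db=database("dfs://demoDB",VALUE,2018.01.02..2018.01.30)     // hypothetical target database
pt=loadTextEx(dbHandle=db,tableName=`pt,partitionColumns=`date,filename="/home/data/candle_201801.csv")
ds=textChunkDS("/home/data/candle_201801.csv",100)           // data sources of about 100MB each, for use with mr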

DolphinDB's text data import is not only flexible and rich in features, but also very fast. Compared with popular systems such as ClickHouse, MemSQL, Druid, and pandas, DolphinDB's single-threaded import is faster, with up to an order of magnitude advantage; with multi-threaded parallel import, the speed advantage is even more pronounced.

This tutorial introduces common problems, corresponding solutions and precautions when importing text data.

1. Automatically identify data format

In most other systems, the user needs to specify the format of the data when importing text data. For the convenience of users, DolphinDB automatically recognizes the data format during import.

Automatic format recognition consists of two parts: field name recognition and data type recognition. If none of the columns in the first line of the file starts with a number, the system treats the first line as the file header containing the field names. DolphinDB then extracts a small sample of the data and infers the data type of each column from it. Because the inference is based on only part of the data, the data types of some columns may be identified incorrectly. For most text files, however, the data can be imported correctly without manually specifying field names or column types.

Please note: versions before 1.20.0 do not support the INT128, UUID, and IPADDR data types. If a CSV file contains these data types, please make sure the version used is not lower than 1.20.0.
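
If in doubt, the server version can be checked with the built-in version function:

version();    // returns the version string of the current server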

The loadText function imports data into a DolphinDB in-memory table. The following example calls loadText to import data and view the structure of the generated table. Please refer to the appendix for the data file used in the examples.

dataFilePath="/home/data/candle_201801.csv"
tmpTB=loadText(filename=dataFilePath);

View the first 5 rows of data in the data table:

select top 5 * from tmpTB;

symbol exchange cycle tradingDay date       time     open  high  low   close volume  turnover   unixTime
------ -------- ----- ---------- ---------- -------- ----- ----- ----- ----- ------- ---------- -------------
000001 SZSE     1     2018.01.02 2018.01.02 93100000 13.35 13.39 13.35 13.38 2003635 2.678558E7 1514856660000
000001 SZSE     1     2018.01.02 2018.01.02 93200000 13.37 13.38 13.33 13.33 867181  1.158757E7 1514856720000
000001 SZSE     1     2018.01.02 2018.01.02 93300000 13.32 13.35 13.32 13.35 903894  1.204971E7 1514856780000
000001 SZSE     1     2018.01.02 2018.01.02 93400000 13.35 13.38 13.35 13.35 1012000 1.352286E7 1514856840000
000001 SZSE     1     2018.01.02 2018.01.02 93500000 13.35 13.37 13.35 13.37 1601939 2.140652E7 1514856900000

Call the schema function to view the table structure (field names, data types, etc.):

tmpTB.schema().colDefs;

name       typeString typeInt comment
---------- ---------- ------- -------
symbol     SYMBOL     17
exchange   SYMBOL     17
cycle      INT        4
tradingDay DATE       6
date       DATE       6
time       INT        4
open       DOUBLE     16
high       DOUBLE     16
low        DOUBLE     16
close      DOUBLE     16
volume     INT        4
turnover   DOUBLE     16
unixTime   LONG       5


2. Specify the data import format

In the 4 data loading functions described in this tutorial, the schema parameter can be used to specify a table that describes each field: its name, data type, format, and which columns to import. The schema table can contain the following 4 columns:

column meaning
------ --------------------------------------------------------------
name   string, the field name
type   string, the data type of the field
format string, the format of a date or time column, e.g. "MM/dd/yyyy"
col    integer, the index of the column to be imported

Among them, name and type are required and must be the first two columns. format and col are optional, and there is no ordering requirement between them.

For example, we can construct a schema table like the one sketched below and pass it as the schema parameter.
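
A minimal sketch, reusing the field names and types inferred from the sample file in section 1 (adjust them to the actual file):

schemaTB=table(`symbol`exchange`cycle`tradingDay`date`time`open`high`low`close`volume`turnover`unixTime as name, `SYMBOL`SYMBOL`INT`DATE`DATE`INT`DOUBLE`DOUBLE`DOUBLE`DOUBLE`INT`DOUBLE`LONG as type)
tmpTB=loadText(filename=dataFilePath,schema=schemaTB);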


2.1 Extract the schema of the text file

The extractTextSchema function obtains the schema of a text file, including information such as field names and data types.

For example, use the extractTextSchema function to get the table structure of the sample file in this tutorial:

dataFilePath="/home/data/candle_201801.csv"
schemaTB=extractTextSchema(dataFilePath)
schemaTB;

name       type
---------- ------
symbol     SYMBOL
exchange   SYMBOL
cycle      INT
tradingDay DATE
date       DATE
time       INT
open       DOUBLE
high       DOUBLE
low        DOUBLE
close      DOUBLE
volume     INT
turnover   DOUBLE
unixTime   LONG


2.2 Specify the field name and type

When the field names or data types automatically recognized by the system do not meet expectations, you can modify the schema table generated by extractTextSchema, or directly create a schema table, to specify the field name and data type of each column in the text file.

For example, if the volume column is automatically recognized as the INT type but the required type is LONG, modify the schema table to set the type of the volume column to LONG.

dataFilePath="/home/data/candle_201801.csv"
schemaTB=extractTextSchema(dataFilePath)
update schemaTB set type="LONG" where name="volume";

Use the loadText function to import the text file; the data is imported with the field data types specified in schemaTB.

tmpTB=loadText(filename=dataFilePath,schema=schemaTB);

The above example shows how to modify the data type. To modify a field name in the table, use the same method, as sketched below.
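
For example, a one-line sketch that renames the symbol column (the new name ticker is purely illustrative):

update schemaTB set name="ticker" where name="symbol"
tmpTB=loadText(filename=dataFilePath,schema=schemaTB);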

Please note: if the automatic parsing of date- and time-related data types does not meet expectations, use the method described in section 2.3 of this tutorial.


2.3 Specify the format of date and time types

For a date or time column whose automatically recognized data type does not meet expectations, you need to specify not only the data type in the type column of the schema table, but also the format (a string such as "MM/dd/yyyy") in the format column. Refer to the documentation on parsing and formatting of dates and times for the available format strings.

The following is an example of how to specify data types for date and time columns.

Execute the following script in DolphinDB to generate the data file required for this example.

dataFilePath="/home/data/timeData.csv"
t=table(["20190623 14:54:57","20190623 15:54:23","20190623 16:30:25"] as time,`AAPL`MS`IBM as sym,2200 5400 8670 as qty,54.78 59.64 65.23 as price)
saveText(t,dataFilePath);

Before loading the data, use the extractTextSchema function to get the schema of the data file:

schemaTB=extractTextSchema(dataFilePath)
schemaTB;

name  type
----- ------
time  SECOND
sym   SYMBOL
qty   INT
price DOUBLE

Clearly, the data type recognized for the time column does not meet expectations. If the file is loaded directly, the data in the time column will be empty. To load the time column correctly, specify its data type as DATETIME and its format as "yyyyMMdd HH:mm:ss".

update schemaTB set type="DATETIME" where name="time"
schemaTB[`format]=["yyyyMMdd HH:mm:ss",,,];

Import the data and view it; the time column is now displayed correctly:

tmpTB=loadText(dataFilePath,,schemaTB)
tmpTB;

time                sym  qty  price
------------------- ---- ---- -----
2019.06.23T14:54:57 AAPL 2200 54.78
2019.06.23T15:54:23 MS   5400 59.64
2019.06.23T16:30:25 IBM  8670 65.23


2.4 Import the specified column

When importing data, you can use the schema parameter to import only certain columns of the text file.

In the following example, only 7 columns of the text file need to be loaded: symbol, date, open, high, close, volume, and turnover.

First, call the extractTextSchema function to get the table structure of the target text file.

dataFilePath="/home/data/candle_201801.csv"
schemaTB=extractTextSchema(dataFilePath);

Use the rowNo function to generate a column number for each column and assign the result to the col column of the schema table. Then modify the schema table, keeping only the rows for the fields to be imported.

update schemaTB set col = rowNo(name)
schemaTB=select * from schemaTB where name in `symbol`date`open`high`close`volume`turnover;

Caution:

  1. Column numbers start from 0. In the above example, the column number of the symbol column, the first column, is 0.
  2. The order of the columns cannot be changed during import. To adjust the column order, use the reorderColumns! function after loading the data file (see the sketch at the end of this section).

Finally, use the loadText function and configure the schema parameters to import the specified columns in the text file.

tmpTB=loadText(filename=dataFilePath,schema=schemaTB);

Looking at the first 5 rows in the table, only the required columns are imported:

select top 5 * from tmpTB

symbol date       open  high  close volume  turnover  
------ ---------- ----- ----- ----- ------- ----------
000001 2018.01.02 13.35 13.39 13.38 2003635 2.678558E7
000001 2018.01.02 13.37 13.38 13.33 867181  1.158757E7
000001 2018.01.02 13.32 13.35 13.35 903894  1.204971E7
000001 2018.01.02 13.35 13.38 13.35 1012000 1.352286E7
000001 2018.01.02 13.35 13.37 13.37 1601939 2.140652E7
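
As noted in the caution above, the reorderColumns! function can adjust the column order after loading. A sketch with an illustrative target order:

tmpTB.reorderColumns!(`date`symbol`open`high`close`volume`turnover);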

2.5 Skip the first few lines of text data

When importing data, if you need to skip the first n lines of a file (which may be a file description), specify the skipRows parameter as n. Since file descriptions are usually not very lengthy, the maximum value of this parameter is 1024. All 4 data loading functions described in this tutorial support the skipRows parameter.

In the following example, a data file is imported through loadText, and then the total number of rows and the first 5 rows of the imported table are viewed.

dataFilePath="/home/data/candle_201801.csv"
tmpTB=loadText(filename=dataFilePath)
select count(*) from tmpTB;

count
-----
5040

select top 5 * from tmpTB;

symbol exchange cycle tradingDay date       time     open  high  low   close volume  turnover   unixTime
------ -------- ----- ---------- ---------- -------- ----- ----- ----- ----- ------- ---------- -------------
000001 SZSE     1     2018.01.02 2018.01.02 93100000 13.35 13.39 13.35 13.38 2003635 2.678558E7 1514856660000
000001 SZSE     1     2018.01.02 2018.01.02 93200000 13.37 13.38 13.33 13.33 867181  1.158757E7 1514856720000
000001 SZSE     1     2018.01.02 2018.01.02 93300000 13.32 13.35 13.32 13.35 903894  1.204971E7 1514856780000
000001 SZSE     1     2018.01.02 2018.01.02 93400000 13.35 13.38 13.35 13.35 1012000 1.352286E7 1514856840000
000001 SZSE     1     2018.01.02 2018.01.02 93500000 13.35 13.37 13.35 13.37 1601939 2.140652E7 1514856900000

Specify the skipRows parameter as 1000 to skip the first 1000 lines of the text file when importing:

tmpTB=loadText(filename=dataFilePath,skipRows=1000)
select count(*) from tmpTB;

count
-----
4041

select top 5 * from tmpTB;

col0   col1 col2 col3       col4       col5      col6  col7  col8  col9  col10  col11      col12
------ ---- ---- ---------- ---------- --------- ----- ----- ----- ----- ------ ---------- -------------
000001 SZSE 1    2018.01.08 2018.01.08 101000000 13.13 13.14 13.12 13.14 646912 8.48962E6  1515377400000
000001 SZSE 1    2018.01.08 2018.01.08 101100000 13.13 13.14 13.13 13.14 453647 5.958462E6 1515377460000
000001 SZSE 1    2018.01.08 2018.01.08 101200000 13.13 13.14 13.12 13.13 700853 9.200605E6 1515377520000
000001 SZSE 1    2018.01.08 2018.01.08 101300000 13.13 13.14 13.12 13.12 738920 9.697166E6 1515377580000
000001 SZSE 1    2018.01.08 2018.01.08 101400000 13.13 13.14 13.12 13.13 469800 6.168286E6 1515377640000

Please note: as shown in the above example, when the first n rows are skipped, if the first row of the data file contains the column names, that row is skipped together with the rest of the first n rows.

In the above example, because the row of column names is skipped, the column names become the default names col0, col1, col2, and so on. To keep the column names while skipping the first n rows, first obtain the schema of the text file with the extractTextSchema function, and specify the schema parameter when importing:

schema=extractTextSchema(dataFilePath)
tmpTB=loadText(filename=dataFilePath,schema=schema,skipRows=1000)
select count(*) from tmpTB;

count
-----
4041

select top 5 * from tmpTB;

symbol exchange cycle tradingDay date       time      open  high  low   close volume turnover   unixTime
------ -------- ----- ---------- ---------- --------- ----- ----- ----- ----- ------ ---------- -------------
000001 SZSE     1     2018.01.08 2018.01.08 101000000 13.13 13.14 13.12 13.14 646912 8.48962E6  1515377400000
000001 SZSE     1     2018.01.08 2018.01.08 101100000 13.13 13.14 13.13 13.14 453647 5.958462E6 1515377460000
000001 SZSE     1     2018.01.08 2018.01.08 101200000 13.13 13.14 13.12 13.13 700853 9.200605E6 1515377520000
000001 SZSE     1     2018.01.08 2018.01.08 101300000 13.13 13.14 13.12 13.12 738920 9.697166E6 1515377580000
000001 SZSE     1     2018.01.08 2018.01.08 101400000 13.13 13.14 13.12 13.13 469800 6.168286E6 1515377640000


3. Import data in parallel


3.1 Load a single file into memory with multiple threads

The ploadText function loads a text file into memory with multiple threads. Its syntax is the same as that of loadText; the difference is that ploadText generates a partitioned in-memory table and can load large files quickly. It makes full use of multi-core CPUs to load a file in parallel; the degree of parallelism depends on the number of CPU cores on the server and the localExecutors configuration of the node.

The following compares the performance of the loadText and ploadText functions when importing the same file.

First, generate a text file of about 4GB through the script:

filePath="/home/data/testFile.csv"
appendRows=100000000
t=table(rand(100,appendRows) as int,take(string('A'..'Z'),appendRows) as symbol,take(2010.01.01..2018.12.30,appendRows) as date,rand(float(100),appendRows) as float,00:00:00.000 + rand(86400000,appendRows) as time)
t.saveText(filePath);

Load the file with loadText and ploadText respectively. The node used in this example has a CPU with 6 cores and 12 hyperthreads.

timer loadText(filePath);
Time elapsed: 12629.492 ms

timer ploadText(filePath);
Time elapsed: 2669.702 ms

The results show that with this configuration, ploadText is about 4.7 times as fast as loadText.

3.2 Parallel import of multiple files

In big data applications, data import is often not the import of one or two files, but the batch import of dozens or even hundreds of large files. To achieve better import performance, it is recommended to import batches of data files in parallel.

The loadTextEx function imports text files into a specified database, which can be a distributed database, a local disk database, or an in-memory database. Because DolphinDB's partitioned tables support concurrent reads and writes, they support multi-threaded data import.

When loadTextEx imports text data into a distributed database, the data is first loaded into memory and then written from memory into the database. Both steps are completed by the same function to ensure high efficiency.

The following example shows how to batch-write multiple files on disk into a DolphinDB partitioned table. First, execute the following script in DolphinDB to generate 100 files, about 778MB in total, containing 10 million records.

n=100000
dataFilePath="/home/data/multi/multiImport_"+string(1..100)+".csv"
for (i in 0..99){
    trades=table(sort(take(100*i+1..100,n)) as id,rand(`IBM`MSFT`GM`C`FB`GOOG`V`F`XOM`AMZN`TSLA`PG`S,n) as sym,take(2000.01.01..2000.06.30,n) as date,10.0+rand(2.0,n) as price1,100.0+rand(20.0,n) as price2,1000.0+rand(200.0,n) as price3,10000.0+rand(2000.0,n) as price4,10000.0+rand(3000.0,n) as price5)
    trades.saveText(dataFilePath[i])
};

Create the database and table:

login(`admin,`123456)
dbPath="dfs://DolphinDBdatabase"
db=database(dbPath,VALUE,1..10000)
tb=db.createPartitionedTable(trades,`tb,`id);

DolphinDB's cut function can group the elements of a vector. Next, call cut to group the file paths to be imported, and then call the submitJob function to assign a write job to each thread and import the data in batches.

def writeData(db,file){
   loop(loadTextEx{db,`tb,`id,},file)    // partial application: only the filename argument remains unfixed
}
parallelLevel=10
for(x in dataFilePath.cut(100/parallelLevel)){    // split the 100 paths into 10 groups of 10
    submitJob("loadData"+string(parallelLevel),"loadData",writeData{db,x})
};

Please note: DolphinDB's partitioned tables do not allow multiple threads to write to the same partition at the same time. In the above example, the values of the partition column (the id column) differ across files, so no two threads write to the same partition. When designing concurrent reads and writes of a partitioned table, make sure that no two threads write to the same partition at the same time.

The getRecentJobs function returns the status of the last n batch jobs on the local node. Use a select statement to calculate the time taken to import the batch of files in parallel: on a CPU with 6 cores and 12 hyperthreads, it took about 1.59 seconds.

select max(endTime) - min(startTime) from getRecentJobs() where jobId like "loadData"+string(parallelLevel)+"%";

max_endTime_sub
---------------
1590

Execute the following script to import the 100 files into the database sequentially in a single thread and record the time required: about 8.65 seconds.

timer writeData(db, dataFilePath);
Time elapsed: 8647.645 ms

The results show that with this configuration, importing with 10 threads in parallel is about 5.5 times as fast as single-threaded import.

View the number of records in the data table:

select count(*) from loadTable("dfs://DolphinDBdatabase", `tb);

count
------
10000000


4. Preprocessing before importing into the database

Before importing data into the database, if the data needs preprocessing, such as converting date and time data types or filling empty values, you can specify the transform parameter when calling the loadTextEx function. The transform parameter accepts a function that takes exactly one parameter: its input is an unpartitioned in-memory table, and its output is also an unpartitioned in-memory table. Note that only the loadTextEx function provides the transform parameter.
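
The contract of the transform function can be sketched as follows (myTransform and its body are placeholders):

def myTransform(mutable t){
    // t is the unpartitioned in-memory table loaded from the text file;
    // preprocess it here and return an unpartitioned in-memory table
    return t
}
// usage: loadTextEx(dbHandle=db,tableName=`pt,partitionColumns=`date,filename=dataFilePath,transform=myTransform)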


4.1 Specify the data type of date and time data


4.1.1 Convert dates and times represented as numeric types to the specified type

Time data in a data file may be stored as integers or long integers. During data analysis, such data often needs to be converted into a time type and stored in the database. For this scenario, the transform parameter of loadTextEx can be used to specify the target data type for the date and time columns of the text file.

First, create a distributed database and a table.

login(`admin,`123456)
dataFilePath="/home/data/candle_201801.csv"
dbPath="dfs://DolphinDBdatabase"
db=database(dbPath,VALUE,2018.01.02..2018.01.30)
schemaTB=extractTextSchema(dataFilePath)
update schemaTB set type="TIME" where name="time"
tb=table(1:0,schemaTB.name,schemaTB.type)
tb=db.createPartitionedTable(tb,`tb1,`date);

The custom function i2t preprocesses the data and returns the processed table.

def i2t(mutable t){
    // replace the integer time column with a TIME value: t.time/10 is interpreted as milliseconds since midnight
    return t.replaceColumn!(`time,time(t.time/10))
}

Please note: when processing data in the body of a custom function, try to use in-place modifications (functions ending in !) to improve performance.

Call the loadTextEx function and specify the transform parameter as the i2t function. The system executes i2t on the data from the text file and saves the result to the database.

tmpTB=loadTextEx(dbHandle=db,tableName=`tb1,partitionColumns=`date,filename=dataFilePath,transform=i2t);

View the first 5 rows of data in the table. The time column is stored as the TIME type rather than the INT type in the text file:

select top 5 * from loadTable(dbPath,`tb1);

symbol exchange cycle tradingDay date       time               open  high  low   close volume  turnover   unixTime
------ -------- ----- ---------- ---------- ------------------ ----- ----- ----- ----- ------- ---------- -------------
000001 SZSE     1     2018.01.02 2018.01.02 02:35:10.000000000 13.35 13.39 13.35 13.38 2003635 2.678558E7 1514856660000
000001 SZSE     1     2018.01.02 2018.01.02 02:35:20.000000000 13.37 13.38 13.33 13.33 867181  1.158757E7 1514856720000
000001 SZSE     1     2018.01.02 2018.01.02 02:35:30.000000000 13.32 13.35 13.32 13.35 903894  1.204971E7 1514856780000
000001 SZSE     1     2018.01.02 2018.01.02 02:35:40.000000000 13.35 13.38 13.35 13.35 1012000 1.352286E7 1514856840000
000001 SZSE     1     2018.01.02 2018.01.02 02:35:50.000000000 13.35 13.37 13.35 13.37 1601939 2.140652E7 1514856900000


4.1.2 Conversion between date or time data types

If the dates in the text file are stored as the DATE type but should be stored in the database as MONTH, the data type of the date column can likewise be converted through the transform parameter of loadTextEx. The steps are the same as in the previous section.

login(`admin,`123456)
dbPath="dfs://DolphinDBdatabase"
db=database(dbPath,VALUE,2018.01.02..2018.01.30)
schemaTB=extractTextSchema(dataFilePath)
update schemaTB set type="MONTH" where name="tradingDay"
tb=table(1:0,schemaTB.name,schemaTB.type)
tb=db.createPartitionedTable(tb,`tb1,`date)
def d2m(mutable t){
    return t.replaceColumn!(`tradingDay,month(t.tradingDay))
}
tmpTB=loadTextEx(dbHandle=db,tableName=`tb1,partitionColumns=`date,filename=dataFilePath,transform=d2m);

View the first 5 rows of data in the table. The tradingDay column is stored as the MONTH type rather than the DATE type in the text file:

select top 5 * from loadTable(dbPath,`tb1);

symbol exchange cycle tradingDay date       time     open  high  low   close volume  turnover   unixTime
------ -------- ----- ---------- ---------- -------- ----- ----- ----- ----- ------- ---------- -------------
000001 SZSE     1     2018.01M   2018.01.02 93100000 13.35 13.39 13.35 13.38 2003635 2.678558E7 1514856660000
000001 SZSE     1     2018.01M   2018.01.02 93200000 13.37 13.38 13.33 13.33 867181  1.158757E7 1514856720000
000001 SZSE     1     2018.01M   2018.01.02 93300000 13.32 13.35 13.32 13.35 903894  1.204971E7 1514856780000
000001 SZSE     1     2018.01M   2018.01.02 93400000 13.35 13.38 13.35 13.35 1012000 1.352286E7 1514856840000
000001 SZSE     1     2018.01M   2018.01.02 93500000 13.35 13.37 13.35 13.37 1601939 2.140652E7 1514856900000


4.2 Fill in empty values

The transform parameter can also be used with DolphinDB's built-in functions. When a built-in function requires multiple parameters, partial application can convert it into a one-parameter function. For example, call the nullFill! function to fill the empty values in a text file.

db=database(dbPath,VALUE,2018.01.02..2018.01.30)
tb=db.createPartitionedTable(tb,`tb1,`date)
tmpTB=loadTextEx(dbHandle=db,tableName=`tb1,partitionColumns=`date,filename=dataFilePath,transform=nullFill!{,0});
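
Here the partial application nullFill!{,0} fixes the second argument of nullFill! at 0, leaving a function of one parameter. Assuming DolphinDB's lambda syntax, an equivalent form would be transform=x->nullFill!(x,0).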


5. Use Map-Reduce to import custom data

DolphinDB supports customized data import with Map-Reduce: the text is divided into chunks, and the chunks are imported into DolphinDB through Map-Reduce.

You can use the textChunkDS function to divide a file into multiple small data sources, and then write them into the database with the mr function. Before calling mr to store the data in the database, you can also perform flexible data processing to meet more complex import requirements.


5.1 Store the stock and futures data in the file into two different data tables

Execute the following script in DolphinDB to generate a data file of approximately 1GB in size, including stock data and futures data.

n=10000000
dataFilePath="/home/data/chunkText.csv"
trades=table(rand(`stock`futures,n) as type, rand(`IBM`MSFT`GM`C`FB`GOOG`V`F`XOM`AMZN`TSLA`PG`S,n) as sym,take(2000.01.01..2000.06.30,n) as date,10.0+rand(2.0,n) as price1,100.0+rand(20.0,n) as price2,1000.0+rand(200.0,n) as price3,10000.0+rand(2000.0,n) as price4,10000.0+rand(3000.0,n) as price5,10000.0+rand(4000.0,n) as price6,rand(10,n) as qty1,rand(100,n) as qty2,rand(1000,n) as qty3,rand(10000,n) as qty4,rand(10000,n) as qty5,rand(10000,n) as qty6)
trades.saveText(dataFilePath);

Create distributed databases and tables for storing stock data and futures data respectively:

login(`admin,`123456)
dbPath1="dfs://stocksDatabase"
dbPath2="dfs://futuresDatabase"
db1=database(dbPath1,VALUE,`IBM`MSFT`GM`C`FB`GOOG`V`F`XOM`AMZN`TSLA`PG`S)
db2=database(dbPath2,VALUE,2000.01.01..2000.06.30)
tb1=db1.createPartitionedTable(trades,`stock,`sym)
tb2=db2.createPartitionedTable(trades,`futures,`date);

Define the following function to split the data and write it to the two databases.

def divideImport(tb, mutable stockTB, mutable futuresTB)
{
	tdata1=select * from tb where type="stock"
	tdata2=select * from tb where type="futures"
	append!(stockTB, tdata1)
	append!(futuresTB, tdata2)
}

Then divide the text file with the textChunkDS function in units of 300MB; the file is divided into 4 parts.

ds=textChunkDS(dataFilePath,300)
ds;

(DataSource<readTableFromFileSegment, DataSource<readTableFromFileSegment, DataSource<readTableFromFileSegment, DataSource<readTableFromFileSegment)

Call the mr function, specifying the result of textChunkDS as the data source, to import the file into the databases. Since the map function (specified by the mapFunc parameter) accepts only a table as its parameter, we again use partial application to convert the multi-parameter function into a one-parameter function.

mr(ds=ds, mapFunc=divideImport{,tb1,tb2}, parallel=false);

Please note: different small data sources may contain data of the same partition. Since DolphinDB does not allow multiple threads to write to the same partition at the same time, the parallel parameter of mr must be set to false; otherwise an exception is thrown.

View the first 5 rows of the table in each database. The stock database contains only stock data, and the futures database contains only futures data.

stock table:

select top 5 * from loadTable("dfs://DolphinDBTickDatabase", `stock);

type  sym  date       price1    price2     price3      price4       price5       price6       qty1 qty2 qty3 qty4 qty5 qty6
----- ---- ---------- --------- ---------- ----------- ------------ ------------ ------------ ---- ---- ---- ---- ---- ----
stock AMZN 2000.02.14 11.224234 112.26763  1160.926836 11661.418403 11902.403305 11636.093467 4    53   450  2072 9116 12
stock AMZN 2000.03.29 10.119057 111.132165 1031.171855 10655.048121 12682.656303 11182.317321 6    21   651  2078 7971 6207
stock AMZN 2000.06.16 11.61637  101.943971 1019.122963 10768.996906 11091.395164 11239.242307 0    91   857  3129 3829 811
stock AMZN 2000.02.20 11.69517  114.607763 1005.724332 10548.273754 12548.185724 12750.524002 1    39   270  4216 8607 6578
stock AMZN 2000.02.23 11.534805 106.040664 1085.913295 11461.783565 12496.932604 12995.461331 4    35   488  4042 6500 4826

futures table:

select top 5 * from loadTable("dfs://DolphinDBFuturesDatabase", `futures);

type    sym  date       price1    price2     price3      price4       price5       price6       qty1 qty2 qty3 qty4 qty5 ...
------- ---- ---------- --------- ---------- ----------- ------------ ------------ ------------ ---- ---- ---- ---- ---- ---
futures MSFT 2000.01.01 11.894442 106.494131 1000.600933 10927.639217 10648.298313 11680.875797 9    10   241  524  8325 ...
futures S    2000.01.01 10.13728  115.907379 1140.10161  11222.057315 10909.352983 13535.931446 3    69   461  4560 2583 ...
futures GM   2000.01.01 10.339581 112.602729 1097.198543 10938.208083 10761.688725 11121.888288 1    1    714  6701 9203 ...
futures IBM  2000.01.01 10.45422  112.229537 1087.366764 10356.28124  11829.206165 11724.680443 0    47   741  7794 5529 ...
futures TSLA 2000.01.01 11.901426 106.127109 1144.022732 10465.529256 12831.721586 10621.111858 4    43   136  9858 8487 ...


5.2 Quickly load some data at the beginning and end of a large file

You can use textChunkDS to divide a large file into multiple small data sources (chunks), and then load only the first and last data sources. Execute the following script in DolphinDB to generate the data file:

n=10000000
dataFilePath="/home/data/chunkText.csv"
trades=table(rand(`IBM`MSFT`GM`C`FB`GOOG`V`F`XOM`AMZN`TSLA`PG`S,n) as sym,sort(take(2000.01.01..2000.06.30,n)) as date,10.0+rand(2.0,n) as price1,100.0+rand(20.0,n) as price2,1000.0+rand(200.0,n) as price3,10000.0+rand(2000.0,n) as price4,10000.0+rand(3000.0,n) as price5,10000.0+rand(4000.0,n) as price6,rand(10,n) as qty1,rand(100,n) as qty2,rand(1000,n) as qty3,rand(10000,n) as qty4, rand(10000,n) as qty5, rand(1000,n) as qty6)
trades.saveText(dataFilePath);

Then divide the text file with the textChunkDS function in units of 10MB.

ds=textChunkDS(dataFilePath, 10);

Call the mr function to load the data of the first and the last chunk. Because these two chunks are very small, loading is very fast.

head_tail_tb = mr(ds=[ds.head(), ds.tail()], mapFunc=x->x, finalFunc=unionAll{,false});

View the number of records in the head_tail_tb table:

select count(*) from head_tail_tb;

count
------
192262


6. Other notes


6.1 Handling data with different encodings

Since DolphinDB strings are encoded in UTF-8, if a loaded file is not UTF-8 encoded, its string columns need to be converted after import. DolphinDB provides the convertEncode, fromUTF8, and toUTF8 functions for converting string encodings after data import.

For example, use the convertEncode function to convert the encoding of the exchange column in the tmpTB table:

dataFilePath="/home/data/candle_201801.csv"
tmpTB=loadText(filename=dataFilePath, skipRows=0)
tmpTB.replaceColumn!(`exchange, convertEncode(tmpTB.exchange,"gbk","utf-8"));
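
Alternatively, assuming the source file is GBK-encoded, the toUTF8 function achieves the same conversion:

tmpTB.replaceColumn!(`exchange, toUTF8(tmpTB.exchange,"gbk"));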


6.2 Parsing of numeric types

The first section of this tutorial introduced DolphinDB's automatic data type inference during import. This section explains how numeric data (the CHAR, SHORT, INT, LONG, FLOAT, and DOUBLE types) is parsed. The system recognizes numeric data in the following forms:

  • A plain number, for example: 123
  • A number with thousands separators, for example: 100,000
  • A number with a decimal point (a floating-point number), for example: 1.231
  • A number in scientific notation, for example: 1.23E5

If a column is specified as a numeric type, DolphinDB automatically ignores the letters and other symbols before and after the number when importing. If no digits appear at all, the value is parsed as NULL. This is illustrated with examples below.

First, execute the following script to create a text file.

dataFilePath="/home/data/testSym.csv"
prices1=["2131","$2,131", "N/A"]
prices2=["213.1","$213.1", "N/A"]
totals=["2.658E7","-2.658e7","2.658e-7"]
tt=table(1..3 as id, prices1 as price1, prices2 as price2, totals as total)
saveText(tt,dataFilePath);

In this text file, the price1 and price2 columns contain both numbers and characters. If the schema parameter is not specified, the system recognizes both columns as the SYMBOL type:

tmpTB=loadText(dataFilePath)
tmpTB;

id price1 price2 total
-- ------ ------ --------
1  2131   213.1  2.658E7
2  $2,131 $213.1 -2.658E7
3  N/A    N/A    2.658E-7

tmpTB.schema().colDefs;

name   typeString typeInt comment
------ ---------- ------- -------
id     INT        4
price1 SYMBOL     17
price2 SYMBOL     17
total  DOUBLE     16

If the price1 column is specified as INT and the price2 column as DOUBLE, the system ignores the letters and other symbols before and after the numbers during import. Where no digits appear, the value is parsed as NULL.

schemaTB=table(`id`price1`price2`total as name, `INT`INT`DOUBLE`DOUBLE as type) 
tmpTB=loadText(dataFilePath,,schemaTB)
tmpTB;

id price1 price2 total
-- ------ ------ --------
1  2131   213.1  2.658E7
2  2131   213.1  -2.658E7
3                2.658E-7


6.3 Automatically remove double quotes

In CSV files, double quotes are sometimes used to handle fields containing special characters, such as thousands separators. When DolphinDB processes such data, it automatically removes the double quotes around the text. This is illustrated with an example below.

In the data file used in the following example, the num column contains numbers with thousands separators, wrapped in double quotes.

dataFilePath="/home/data/test.csv"
tt=table(1..3 as id,  ["\"500\"","\"3,500\"","\"9,000,000\""] as num)
saveText(tt,dataFilePath);

Import the data and view the table. DolphinDB has automatically removed the double quotes around the text.

tmpTB=loadText(dataFilePath)
tmpTB;

id num
-- -------
1  500
2  3500
3  9000000


Appendix

The data file used in the examples in this tutorial: candle_201801.csv.
