Practical guide | How to migrate MySQL data to DolphinDB at high speed

DolphinDB provides two ways to import MySQL data: the ODBC plugin and the MySQL plugin. We recommend the MySQL plugin for two reasons. First, it is faster: when importing 6.5 GB of data, the MySQL plugin is about 4 times as fast as the ODBC plugin. Second, it requires no configuration, whereas the ODBC plugin needs a data source to be configured.

Before using the MySQL plugin, please refer to the DolphinDB Installation Guide to install DolphinDB.

1. Download the plugin

The MySQL plugin is already included in the server/plugins/mysql directory of the DolphinDB installation, so it can be used directly. Users who need to compile it themselves can refer to https://github.com/dolphindb/DolphinDBPlugin/blob/master/mysql/README_CN.md.

2. Load the plugin

In the GUI, use the loadPlugin function to load the MySQL plugin:

loadPlugin(server_dir+"/plugins/mysql/PluginMySQL.txt")

3. Interface functions

DolphinDB's MySQL plugin provides the following interface functions:

  • connect
  • showTables
  • extractSchema
  • load
  • loadEx

 

We can call plug-in interface functions in the following two ways:

(1) moduleName::apiFunction. For example, call the connect method of the MySQL plugin.

mysql::connect(host, port, user, password, db)

(2) Execute use moduleName first, and then call the interface function directly. The use statement only needs to be executed once; subsequent calls to the interface functions do not require it again. We therefore generally recommend this approach.

use mysql
connect(host, port, user, password, db)

 

3.1 connect

Syntax

connect(host, port, user, password, db)

Parameters

host is the host name of the MySQL server.

port is the port number of the MySQL server, and the default is 3306.

user is the username in the MySQL server.

password is the password corresponding to user.

db is the database name in MySQL.

Details

Creates a MySQL connection and returns the connection handle. We recommend mysql_native_password as the authentication type for MySQL users.

Example

Connect to the employees database in the local MySQL server.

conn=connect("127.0.0.1",3306,"root","123456","employees")

 

3.2 showTables

Syntax

showTables(connection)

Parameters

connection is the connection handle returned by the connect function.

Details

Returns a DolphinDB table containing the names of all tables in the MySQL database.

Example

View the tables in the employees database.

showTables(conn);

Tables_in_employees
current_dept_emp
departments
dept_emp
dept_emp_latest_date
dept_manager
employees
salaries
test_datatypes
titles

 

3.3 extractSchema

Syntax

extractSchema(connection, tableName)

Parameters

connection is the connection handle returned by the connect function.

tableName is the name of the data table in the MySQL database.

Details

Returns a DolphinDB table. The first column is the field name in the MySQL table, the second column is the data type the field will have after import into DolphinDB, and the third column is the data type in MySQL.

Example

View the data type of each column in the employees table.

extractSchema(conn,`employees);

name        type    MySQL describe type
emp_no      LONG    int(11)
birth_date  DATE    date
first_name  STRING  varchar(14)
last_name   STRING  varchar(16)
gender      SYMBOL  enum('M','F')
hire_date   DATE    date

3.4 load

Syntax

load(connection, table|query, [schema], [startRow], [rowNum])

Parameters

connection is the connection handle returned by the connect function.

table is the name of the table in the MySQL server.

query is a query statement in MySQL.

schema is a DolphinDB table with two columns: the first contains the field names and the second the data types. It is optional. Specify it to change the data types used when the data is loaded into DolphinDB.

startRow is a non-negative integer giving the row at which reading starts. It is optional and defaults to 0, which means reading starts from the first record.

rowNum is a positive integer giving the number of records to read. It is optional; if it is not specified, all data is read. If the second parameter is a query, startRow and rowNum have no effect.

Details

Loads MySQL data into a DolphinDB in-memory table.

Example

1. Load all data in the employees table into a DolphinDB in-memory table.

t=load(conn,"employees");

emp_no	birth_date	first_name	last_name	gender	hire_date
10,001	1953.09.02	Georgi	        Facello	        M	1986.06.26
10,002	1964.06.02	Bezalel	        Simmel	        F	1985.11.21
10,003	1959.12.03	Parto	        Bamford	        M	1986.08.28
10,004	1954.05.01	Chirstian	Koblick	        M	1986.12.01
10,005	1955.01.21	Kyoichi	        Maliniak	M	1989.09.12
10,006	1953.04.20	Anneke	        Preusig	        F	1989.06.02
10,007	1957.05.23	Tzvetan	        Zielinski	F	1989.02.10
10,008	1958.02.19	Saniya	        Kalloufi	M	1994.09.15
10,009	1952.04.19	Sumant	        Peac	        F	1985.02.18
10,010	1963.06.01	Duangkaew	Piveteau	F	1989.08.24
...

2. Load the first 10 rows of the employees table into a DolphinDB in-memory table.

t=load(conn,"select * from employees limit 10");

emp_no	birth_date	first_name	last_name	gender	hire_date
10,001	1953.09.02	Georgi	        Facello	        M	1986.06.26
10,002	1964.06.02	Bezalel	        Simmel	        F	1985.11.21
10,003	1959.12.03	Parto	        Bamford	        M	1986.08.28
10,004	1954.05.01	Chirstian	Koblick	        M	1986.12.01
10,005	1955.01.21	Kyoichi	        Maliniak	M	1989.09.12
10,006	1953.04.20	Anneke	        Preusig	        F	1989.06.02
10,007	1957.05.23	Tzvetan	        Zielinski	F	1989.02.10
10,008	1958.02.19	Saniya	        Kalloufi	M	1994.09.15
10,009	1952.04.19	Sumant	        Peac	        F	1985.02.18
10,010	1963.06.01	Duangkaew	Piveteau	F	1989.08.24

3. Change the data type of last_name to SYMBOL when loading.

schema=select name,type from extractSchema(conn,`employees)
update schema set type="SYMBOL" where name="last_name"
t=load(conn,"employees",schema)
//View the schema of table t
schema(t);

chunkPath->
partitionColumnIndex->-1
colDefs->
name       typeString typeInt
---------- ---------- -------
emp_no     LONG       5
birth_date DATE       6
first_name STRING     18
last_name  SYMBOL     18
gender     SYMBOL     17
hire_date  DATE       6
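The startRow and rowNum parameters of load make it possible to read a large table in fixed-size batches instead of in one pass. As a rough illustration, written in Python rather than DolphinDB script so that it runs without a server, the sketch below computes the (startRow, rowNum) pairs needed to cover a table. The helper name chunk_ranges and the batch size of 100,000 are assumptions made for this example and are not part of the plugin; 300,024 is the row count of the employees table used in this article.

```python
def chunk_ranges(total_rows, chunk_size):
    """Yield (startRow, rowNum) pairs that cover total_rows rows in order.

    startRow is 0-based, matching the plugin's default of 0 for the first record.
    chunk_ranges is a hypothetical helper, not a function of the MySQL plugin.
    """
    if total_rows < 0 or chunk_size <= 0:
        raise ValueError("total_rows must be >= 0 and chunk_size must be > 0")
    start = 0
    while start < total_rows:
        # The last batch may be shorter than chunk_size.
        yield start, min(chunk_size, total_rows - start)
        start += chunk_size


# Print the batches needed for a 300,024-row table read 100,000 rows at a time.
for start_row, row_num in chunk_ranges(300_024, 100_000):
    print(start_row, row_num)
```

Each pair would then be supplied as the startRow and rowNum arguments of one load call on a table name. As noted in the parameter description, these two arguments have no effect when a query is passed.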

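3.5 loadEx

Syntax

loadEx(connection, dbHandle, tableName, partitionColumns, table|query, [schema], [startRow], [rowNum])

Parameters

connection is the connection handle returned by the connect function.

dbHandle is a DolphinDB database handle, usually the object returned by the database function.

tableName is the name of the table in the DolphinDB database.

partitionColumns is a string scalar or vector specifying the partitioning column(s).

table is a string giving the name of the table on the MySQL server.

query is a query statement in MySQL.

schema is a DolphinDB table with two columns: the first contains the field names and the second the data types. It is optional. Specify it to change the data types used when the data is loaded into DolphinDB.

startRow is a non-negative integer giving the row at which reading starts. It is optional and defaults to 0, which means reading starts from the first record.

rowNum is a positive integer giving the number of records to read. It is optional; if it is not specified, all data is read. If a query is passed instead of a table name, startRow and rowNum have no effect.

Details

Loads MySQL data into a DolphinDB partitioned table. loadEx does not support loading data into a sequential (SEQ) partitioned table in DolphinDB.

Example

Load the employees table into an on-disk VALUE-partitioned table in DolphinDB.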

db=database("H:/DolphinDB/Data/mysql",VALUE,`F`M)
pt=loadEx(conn,db,"pt","gender","employees")
select count(*) from loadTable(db,"pt");

count
300,024

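To load the data into an in-memory partitioned table, change the database path to an empty string. To load the data into a distributed table, change the database path to one that starts with "dfs://", for example "dfs://mysql". Distributed tables can only be used in a cluster. For cluster deployment, refer to the single-server cluster deployment and multi-server cluster deployment guides.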

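4. Data type conversion

When the MySQL plugin imports data into DolphinDB, it converts the data types. The conversion rules are shown in the following table:

[Image: table of MySQL-to-DolphinDB data type conversion rules]

Notes:

(1) The integer types in DolphinDB (SHORT, INT, LONG) are all signed. To prevent overflow, unsigned MySQL types are converted to the next larger signed type in DolphinDB. For example, unsigned tinyint is converted to SHORT and unsigned smallint is converted to INT. The MySQL plugin does not currently support conversion of 64-bit unsigned types.

(2) In DolphinDB, the minimum value of each integer type represents NULL: -128 for CHAR, -32,768 for SHORT, -2,147,483,648 for INT and -9,223,372,036,854,775,808 for LONG.

(3) The MySQL bigint unsigned type is converted to the DolphinDB LONG type by default. If overflow occurs, use the schema parameter to specify DOUBLE or FLOAT as the type.

(4) MySQL char and varchar columns with a length of 10 or less are converted to the DolphinDB SYMBOL type; those with a length greater than 10 are converted to STRING. SYMBOL values are stored internally as integers, which makes sorting and comparison more efficient and saves storage space. However, mapping strings to integers takes time, and the mapping table occupies memory. Use the following rule to decide whether a column should be SYMBOL: if the column's values repeat heavily, use SYMBOL. Typical cases are stock tickers, exchanges and contract codes in financial data, and device IDs in IoT data.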
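The arithmetic behind these notes can be checked with a short Python sketch (Python rather than DolphinDB script, so that it runs without a server). It only restates the ranges and the length threshold given in the notes. The function names are hypothetical, and the code is not the plugin's implementation.

```python
# Bit widths of the signed integer types named in the notes above.
SIGNED_BITS = {"CHAR": 8, "SHORT": 16, "INT": 32, "LONG": 64}


def null_sentinel(type_name):
    """Minimum value of the signed type, which the notes say represents NULL."""
    return -(2 ** (SIGNED_BITS[type_name] - 1))


def max_value(type_name):
    """Largest value the signed type can hold."""
    return 2 ** (SIGNED_BITS[type_name] - 1) - 1


def smallest_signed_type(unsigned_bits):
    """Smallest signed type that holds every value of an unsigned integer
    of the given width, or None if even LONG is too small."""
    unsigned_max = 2 ** unsigned_bits - 1
    for name in SIGNED_BITS:  # dicts keep insertion order: CHAR, SHORT, INT, LONG
        if unsigned_max <= max_value(name):
            return name
    return None


def default_string_type(declared_length):
    """Default char/varchar mapping from note (4): length <= 10 gives SYMBOL."""
    return "SYMBOL" if declared_length <= 10 else "STRING"


print(null_sentinel("CHAR"))     # -128
print(smallest_signed_type(8))   # SHORT: an unsigned 8-bit value can reach 255
print(smallest_signed_type(64))  # None: a 64-bit unsigned value can overflow LONG
print(default_string_type(14))   # STRING, as for first_name varchar(14)
```

The None result for 64-bit unsigned values is the overflow case in note (3): no signed integer type is wide enough, so DOUBLE or FLOAT has to be chosen through the schema parameter.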

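5. Performance test

We ran a performance test on an ordinary PC (16 GB of memory, 4 cores and 8 threads, SSD). The dataset consists of daily quotes from the US stock market from 1990 to 2016: 6.5 GB of data with 22 fields and 50,591,907 records, occupying 7.2 GB on disk in MySQL. Importing the data from MySQL into a DolphinDB partitioned database with the loadEx function took only 160.5 seconds, a read speed of 41.4 MB/s, and the data occupies 1.3 GB on disk in DolphinDB. On the same PC, importing all the data at once through ODBC causes MySQL to run out of memory, so we imported 1 million records at a time, which took 660 seconds in total. Importing the same data into ClickHouse took 171.9 seconds, a read speed of 37.8 MB/s. DolphinDB is also more convenient than ClickHouse for processing time-series data and managing partitions.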
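The headline figures can be rederived from the raw measurements. The Python sketch below assumes 1 GB = 1024 MB, which the article does not state. Under that assumption the MySQL plugin's throughput comes out at roughly 41 MB/s, and the plugin is roughly 4 times as fast as the batched ODBC import.

```python
# Measurements quoted in the performance test above.
data_gb = 6.5            # size of the dataset
rows = 50_591_907        # number of records
mysql_disk_gb = 7.2      # disk usage in MySQL
dolphindb_disk_gb = 1.3  # disk usage in DolphinDB
plugin_seconds = 160.5   # loadEx through the MySQL plugin
odbc_seconds = 660.0     # ODBC import in batches of 1 million records

# Assumption: 1 GB = 1024 MB (the article does not say which convention it uses).
throughput_mb_s = data_gb * 1024 / plugin_seconds
rows_per_second = rows / plugin_seconds
speedup_vs_odbc = odbc_seconds / plugin_seconds
disk_ratio = mysql_disk_gb / dolphindb_disk_gb

print(f"MySQL plugin throughput: {throughput_mb_s:.1f} MB/s")
print(f"rows imported per second: {rows_per_second:,.0f}")
print(f"speedup over ODBC: {speedup_vs_odbc:.1f}x")
print(f"MySQL / DolphinDB disk usage: {disk_ratio:.1f}x")
```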


Origin blog.51cto.com/15022783/2575501