SequoiaDB Tech | High-Performance Data Import and Migration Practice with SequoiaDB

SequoiaDB is a self-developed financial-grade distributed database. It supports standard SQL, distributed transactions, and complex queries with indexes, and is compatible with SQL access layers such as MySQL, PGSQL, and SparkSQL. For distributed storage, SequoiaDB provides more data partitioning rules than typical big-data products, including horizontal (hash) partitioning, range partitioning, main/sub-table partitioning, and multi-dimensional partitioning. Users can choose the partitioning method that fits their scenario to increase the storage capacity and operating performance of the system.

In order to provide simple and convenient data import functionality, and to make it easier to integrate with traditional databases at the data layer, SequoiaDB supports a variety of data import methods; users can choose the most suitable way to load data according to their needs.

This article describes the common high-performance methods for importing data into SequoiaDB, covering four approaches in total: the sdbimprt import tool from the SequoiaDB tool matrix, SparkSQL, MySQL, and the native API interface.

 

sdbimprt Tool Import

sdbimprt is the data import tool of SequoiaDB and an important component of the SequoiaDB tool matrix. It can import data in CSV or JSON format into a SequoiaDB database.

For a description of the tool and its parameters, please refer to: http://doc.sequoiadb.com/cn/sequoiadb-cat_id-1479195620-edition_id-0.

I. Example

The following outlines how to use the sdbimprt tool to import a CSV file into the collection user_info in the collection space site:

1. The data file is named "user.csv" and contains the following records:

"Jack",18,"China"
"Mike",20,"USA"

2. Import command

sdbimprt --hosts=localhost:11810 --type=csv --file=user.csv -c site -l user_info --fields='name string default "Anonymous", age int, country'
  • --hosts: specifies the host address (hostname:svcname)

  • --type: format of the data to import, either csv or json

  • --file: name of the data file to import

  • -c (--csname): name of the collection space

  • -l (--clname): name of the collection

  • --fields: specifies the field names, types, and default values of the imported data

 

II. Import Performance Optimization

The following describes how to improve performance when importing data with the sdbimprt tool:

1. Specify multiple nodes with --hosts. When importing data, specify the addresses of multiple coord nodes, separated by ','. The sdbimprt tool will send data randomly to the coord nodes on different machines, achieving load balancing (see Figure 1), as in the example command below.
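As a minimal sketch building on the earlier user.csv example (the host names sdbserver1/sdbserver2/sdbserver3 are placeholders for your own coord node addresses):

sdbimprt --hosts=sdbserver1:11810,sdbserver2:11810,sdbserver3:11810 --type=csv --file=user.csv -c site -l user_info --fields='name string default "Anonymous", age int, country'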

 

 

2. --insertnum (-n) parameter. When importing data, the --insertnum (-n) parameter imports records in batches, which reduces the number of network round trips during data transfer and thus speeds up the import. Its value ranges from 1 to 100000, and the default is 100.

3. --jobs (-j) parameter. Specifies the number of import connections (one thread per connection), so that the import runs with multiple threads.

4. Splitting files. sdbimprt imports data with multiple threads, but reads the data with a single thread, so as the number of import threads grows, reading the data becomes the performance bottleneck. In this case, a large data file can be split into several smaller files, and a separate sdbimprt process started for each file, so that the imports run concurrently and performance improves. If the cluster has multiple coord nodes distributed on different machines, the sdbimprt processes can be started on several machines, with each sdbimprt connecting to the coord node on its local machine; this avoids the network transfer that would otherwise occur when data is sent to the coord nodes (see Figure 2). A rough sketch follows.
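A rough sketch of splitting a large file and importing the parts in parallel (file names, line counts, and addresses are placeholders; each command would normally run on the machine hosting the coord node it connects to):

split -l 1000000 user.csv user_part_            # produces user_part_aa, user_part_ab, ...
sdbimprt --hosts=sdbserver1:11810 --type=csv --file=user_part_aa -c site -l user_info --fields='name string default "Anonymous", age int, country' &
sdbimprt --hosts=sdbserver2:11810 --type=csv --file=user_part_ab -c site -l user_info --fields='name string default "Anonymous", age int, country' &
wait                                            # wait for the background imports to finish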

 

 

5. Import data first, then build indexes. When importing a large volume of data into a table that has many indexes, it is recommended to drop the indexes first and rebuild them after the import completes, as sketched below; this helps speed up the import. While data is being imported, if the target table has many indexes, the index files must be written in addition to the data itself, which lowers import performance. This technique also applies to the other import methods described below.
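An illustrative sketch in the SequoiaDB shell (the index name idx_age and its key are hypothetical; adjust to the indexes actually defined on the collection):

var db = new Sdb("localhost", 11810)
db.site.user_info.dropIndex("idx_age")                        // drop before the bulk import
// ... run the bulk import (e.g. sdbimprt) here ...
db.site.user_info.createIndex("idx_age", {age: 1}, false)     // rebuild after the import completes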

SparkSQL Import

SparkSQL can conveniently read from many kinds of data sources. Through the Spark connector provided by SequoiaDB, SparkSQL can be used to write data into SequoiaDB or to read data from it.

For how to connect SparkSQL with SequoiaDB, please refer to: http://doc.sequoiadb.com/cn/sequoiadb-cat_id-1432190712-edition_id-0.

I. Example

The following example shows how to import a csv file stored in HDFS into a SequoiaDB collection through SparkSQL, and how to optimize the import performance.

1. Map the csv file in HDFS to a temporary Spark table

CREATE TABLE hdfstable USING org.apache.spark.sql.execution.datasources.csv.CSVFileFormat OPTIONS (path "hdfs://usr/local/data/test.csv", header "true")

2. Map the SDB collection to a temporary Spark table

create temporary table sdbtable (a string,b int,c date) using com.sequoiadb.spark OPTIONS ( host 'sdbserver1:11810,sdbserver2:11810,sdbserver3:11810', username 'sdbadmin',password 'sdbadmin',collectionspace 'sample', collection 'employee',bulksize '500');

3. Import

sparkSession.sql("insert into sdbtable select * from hdfstable");

 

II. Import Performance Optimization

The following two parameters can be tuned when writing data through SparkSQL:

  • host

Specify the addresses of as many coord nodes as possible, separated by ','. Data will be sent randomly to different coord nodes, achieving load balancing.

  • bulksize

The default value of this parameter is 500, which means that when the connector writes data to SequoiaDB, it groups 500 records into one network packet before sending the write request to SequoiaDB. The value of bulksize can be adjusted according to the actual size of the records, as in the mapping sketch below.
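An illustrative sketch reusing the earlier mapping, with a larger batch size on the assumption that the records are small:

create temporary table sdbtable_tuned (a string,b int,c date) using com.sequoiadb.spark OPTIONS ( host 'sdbserver1:11810,sdbserver2:11810,sdbserver3:11810', username 'sdbadmin',password 'sdbadmin',collectionspace 'sample', collection 'employee',bulksize '2000');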

MySQL Import

SequoiaDB integrates with MySQL as a storage engine, which allows users to access the data in SequoiaDB through MySQL's SQL interface and perform insert, delete, update, and query operations.

For how to integrate with MySQL, please refer to:

http://doc.sequoiadb.com/cn/sequoiadb-cat_id-1521595283-edition_id-302。

 

I. Example

There are several ways to import data into SequoiaDB through MySQL:

1. Import from a SQL file

mysql> source /opt/table1.sql

2. Import from a CSV file. MySQL provides the load data infile statement to insert data:

mysql> load data local infile '/opt/table2.csv' into table table2 fields terminated by ',' enclosed by '"' lines terminated by '\n';

 

II. Import Performance Optimization

The following suggestions help improve MySQL import performance:

1. Specify multiple addresses with sequoiadb_conn_addr. For the engine configuration parameter "sequoiadb_conn_addr", specify the addresses of as many coord nodes as possible, separated by ','. Data will be sent randomly to different coord nodes, achieving load balancing.

2. Enable bulkinsert. The engine configuration parameter "sequoiadb_use_bulk_insert" specifies whether bulk insert is enabled; the default value is "ON", meaning enabled. The configuration parameter "sequoiadb_bulk_insert_size" specifies the number of records inserted per batch, with a default value of 2000. Insert performance can be improved by tuning the bulk insert size (see the configuration sketch after this list).

3. Split files. A large data file can be split into several smaller files, and a separate import process started for each small file; importing multiple files concurrently increases import speed.
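A minimal configuration sketch for points 1 and 2, assuming these engine parameters are set in the MySQL configuration file (addresses and batch size are placeholders; consult the SequoiaDB-MySQL documentation for where your installation reads them):

[mysqld]
sequoiadb_conn_addr=sdbserver1:11810,sdbserver2:11810,sdbserver3:11810
sequoiadb_use_bulk_insert=ON
sequoiadb_bulk_insert_size=5000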

API Interface Import

SequoiaDB provides an API for inserting data, namely the "insert" interface. The insert interface uses different insertion modes depending on the arguments passed in: if only one record is passed in at a time, the interface sends records to the database engine one by one; if a collection or array containing multiple records is passed in, the interface sends the whole batch of records to the database engine in a single request, and the engine then writes them into the database one by one.
The difference between the two insertion modes therefore lies in how data is sent to the database engine. Passing multiple records at a time is called "bulkinsert"; it reduces the number of network round trips when sending data and gives better insert performance. A minimal sketch follows.
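A minimal sketch using the SequoiaDB Java driver (the host, credentials, and the sample/employee collection reuse names from earlier examples and are placeholders; check the exact method overloads against the driver version in use):

import com.sequoiadb.base.DBCollection;
import com.sequoiadb.base.Sequoiadb;
import org.bson.BSONObject;
import org.bson.BasicBSONObject;

import java.util.ArrayList;
import java.util.List;

public class BulkInsertDemo {
    public static void main(String[] args) {
        Sequoiadb sdb = new Sequoiadb("sdbserver1:11810", "sdbadmin", "sdbadmin");
        DBCollection cl = sdb.getCollectionSpace("sample").getCollection("employee");

        // Single-record mode: each call sends one record to the engine.
        BSONObject one = new BasicBSONObject();
        one.put("a", "Jack");
        one.put("b", 18);
        cl.insert(one);

        // Bulk mode ("bulkinsert"): the whole list is sent in one request,
        // reducing network round trips.
        List<BSONObject> batch = new ArrayList<BSONObject>();
        for (int i = 0; i < 1000; i++) {
            BSONObject rec = new BasicBSONObject();
            rec.put("a", "user" + i);
            rec.put("b", i);
            batch.add(rec);
        }
        cl.insert(batch);

        sdb.disconnect();
    }
}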
Summary

How to achieve the maximum data loading speed is a common problem in database migration and data import. This article introduced methods for optimizing performance during SequoiaDB data migration/import from the following four aspects:
1) For import with the sdbimprt tool from the SequoiaDB tool matrix, import speed can be optimized by specifying multiple nodes in the hosts parameter, adjusting the number of connections, splitting files, adjusting the insertnum parameter, rebuilding indexes after import, and so on.
2) For SparkSQL-based import, optimization can be done by adjusting the host parameter (multiple coord node addresses) and the bulksize parameter.
3) For MySQL-based import, optimization can be done by specifying multiple coord node addresses, setting the bulkinsert parameters, and splitting files.
4) For API-based import, bulkinsert can be used to insert data in batches and reduce network interaction.
Readers can refer to the data import methods in this article for hands-on practice and verification when migrating from a traditional database to SequoiaDB.
