026 - Tips for speeding up Kettle Table input / Table output by up to 50 times

This is article 26 in my ongoing technical writing project (translations included). The modest target is 999 articles, at a minimum of two per week.

A recent project required migrating data from Oracle to MySQL, with appropriate cleaning and conversion along the way.
The data volume is about 500 million rows. The hardware is Windows Server 2008 R2 64-bit with 64 GB of RAM, 48 cores, and a 1 TB HDD. Kettle is version 8.2, extracting from Oracle (11g, on a Linux server, over a LAN connection) into MySQL (5.7, local, on the Windows server).
Before optimization, reads ran at about 1,000 rows/s (Oracle) and writes at about 1,000 rows/s.
After optimization, reads run at about 80,000 rows/s (Oracle) and writes at about 40,000 rows/s.
Throughput depends on the field sizes and types in each table and on whether the table has indexes, so overall the improvement is about 20 to 50 times.
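
As a rough check on what these rates mean for the whole job, the sketch below estimates the end-to-end transfer time. It assumes the write rate is the bottleneck and that the 500 million rows move at a constant speed; the function name is made up for illustration.

```python
def transfer_hours(total_rows: int, rows_per_second: int) -> float:
    """Hours needed to move total_rows at a constant rows_per_second."""
    return total_rows / rows_per_second / 3600


TOTAL_ROWS = 500_000_000  # approximate volume quoted above

if __name__ == "__main__":
    # Write rates before and after optimization, in rows per second.
    for label, rate in (("before", 1_000), ("after", 40_000)):
        print(f"{label}: {transfer_hours(TOTAL_ROWS, rate):.1f} hours")
```

At about 1,000 rows/s the load would take nearly six days; at about 40,000 rows/s it finishes in under four hours.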

MySQL optimization

MySQL is used here only as a migration target; CSV or ClickHouse would also have met the need. However, CSV may cause problems when handling dates, ClickHouse does not run on Windows, and under our constraints there was no spare Linux machine. MySQL, on the other hand, works conveniently with third-party open source frameworks: importing into HDFS, serving as an external table for ClickHouse, data visualization (Superset, Metabase, etc.), or migrating to TiDB. So in the end we decided to use MySQL.

Note: MySQL here only holds temporarily migrated data, so mysqld can be freely restarted with modified parameters. If the instance is shared with production workloads, consult your DBA to make sure other services are not affected.

MySQL installation and configuration

  • Download the 64-bit MySQL zip archive from dev.mysql.com/downloads/m...
  • Extract mysql-5.7.26-winx64.zip to a directory, such as D:\mysql-5.7.26-winx64
  • Create D:\mysql-5.7.26-winx64\my.ini with the following content:
[mysqld]
port=3306
basedir=D:\mysql-5.7.26-winx64\
datadir=D:\mysql-5.7.26-winx64\data
net_buffer_length=5242880
max_allowed_packet=104857600
bulk_insert_buffer_size=104857600
max_connections = 1000
innodb_flush_log_at_trx_commit = 2
# In this scenario, MyISAM tested roughly twice as fast as InnoDB
default-storage-engine=MyISAM
general_log = 1
general_log_file=D:\mysql-5.7.26-winx64\logs\mysql.log
innodb_buffer_pool_size = 36G
innodb_log_files_in_group=2
innodb_log_file_size = 500M
innodb_log_buffer_size = 50M
sync_binlog=1
innodb_lock_wait_timeout = 50
innodb_thread_concurrency = 16
key_buffer_size=82M
read_buffer_size=64K
read_rnd_buffer_size=256K
sort_buffer_size=256K
myisam_max_sort_file_size=100G
myisam_sort_buffer_size=100M
transaction_isolation=READ-COMMITTED
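
Several of the values above are raw byte counts, which are hard to read at a glance. The following sketch, with hypothetical helper names, converts MySQL-style size values to bytes and back to a short suffix form, and shows how much of the machine's 64 GB the InnoDB buffer pool setting takes. One thing it makes visible: net_buffer_length is set to 5M here, while the MySQL 5.7 manual, as far as I know, lists 1M as the maximum for that variable, so the effective value may be lower than written. This is worth checking on your own server.

```python
UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def to_bytes(value: str) -> int:
    """Parse a my.ini size such as '36G', '64K' or '5242880' into bytes."""
    value = value.strip().upper()
    if value[-1] in UNITS:
        return int(value[:-1]) * UNITS[value[-1]]
    return int(value)


def to_suffix(size: int) -> str:
    """Render a byte count with the largest suffix that divides it evenly."""
    for suffix in ("G", "M", "K"):
        if size >= UNITS[suffix] and size % UNITS[suffix] == 0:
            return f"{size // UNITS[suffix]}{suffix}"
    return str(size)


# Values copied from the my.ini above.
SETTINGS = {
    "net_buffer_length": "5242880",
    "max_allowed_packet": "104857600",
    "bulk_insert_buffer_size": "104857600",
    "innodb_buffer_pool_size": "36G",
}

if __name__ == "__main__":
    for name, raw in SETTINGS.items():
        print(f"{name} = {to_suffix(to_bytes(raw))}")
    # Share of the 64 GB of RAM taken by the InnoDB buffer pool.
    share = to_bytes(SETTINGS["innodb_buffer_pool_size"]) / to_bytes("64G")
    print(f"buffer pool share of RAM: {share:.0%}")
```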

Kettle optimization

Startup parameter optimization

This machine has plenty of memory, so to prevent out-of-memory (OOM) errors, raise the JVM memory settings. Create the environment variable PENTAHO_DI_JAVA_OPTIONS = -Xms20480m -Xmx30720m -XX:MaxPermSize=1024m, which sets the initial heap to 20 GB and the maximum to 30 GB.
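
The option string is easy to get wrong when converting gigabytes into the megabyte values the JVM flags use. Here is a small sketch that builds it; the function names are hypothetical. It leaves out -XX:MaxPermSize because, as far as I know, Kettle 8.2 runs on Java 8, which no longer has a permanent generation and ignores that flag with a warning.

```python
import os


def pentaho_java_options(initial_gb: int, max_gb: int) -> str:
    """Build the heap flags for PENTAHO_DI_JAVA_OPTIONS from sizes in GB."""
    if initial_gb > max_gb:
        raise ValueError("initial heap must not exceed maximum heap")
    return f"-Xms{initial_gb * 1024}m -Xmx{max_gb * 1024}m"


def spoon_environment(initial_gb: int, max_gb: int) -> dict:
    """Copy the current environment and add the Kettle JVM options."""
    env = dict(os.environ)
    env["PENTAHO_DI_JAVA_OPTIONS"] = pentaho_java_options(initial_gb, max_gb)
    return env


if __name__ == "__main__":
    print(pentaho_java_options(20, 30))
```

The dictionary returned by spoon_environment can be passed as the env argument of subprocess.Popen when launching Spoon.bat from a script, which avoids changing the system-wide environment.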

Running Table input and Table output with multiple threads

If Table input is simply started with multiple threads, the data gets duplicated. For example, with select * from test and three threads, the query runs three times and the result contains three copies of the data. That is clearly wrong and achieves no optimization at all.
Because the source is Oracle, the query can be split using Oracle's rownum pseudocolumn and mod function together with the Kettle variables Internal.Step.Unique.Count and Internal.Step.Unique.Number:

select * from (SELECT test.*,rownum rn FROM test ) where mod(rn,${Internal.Step.Unique.Count}) = ${Internal.Step.Unique.Number}
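
To make the substitution concrete, this sketch renders the statement each step copy would run. It assumes that Kettle numbers the copies from 0 to Count - 1, which the query needs for the rows with remainder 0 to be picked up; the render function is a stand-in for Kettle's own variable replacement, not part of Kettle.

```python
QUERY = (
    "select * from (SELECT test.*,rownum rn FROM test ) "
    "where mod(rn,${Internal.Step.Unique.Count}) = ${Internal.Step.Unique.Number}"
)


def render(query: str, copies: int, copy_number: int) -> str:
    """Replace the two Kettle variables with concrete values."""
    return (
        query.replace("${Internal.Step.Unique.Count}", str(copies))
        .replace("${Internal.Step.Unique.Number}", str(copy_number))
    )


if __name__ == "__main__":
    # With three copies, each copy gets its own remainder: 0, 1 and 2.
    for number in range(3):
        print(render(QUERY, 3, number))
```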

Explanation

  • Oracle's rownum is a number the system assigns sequentially to the rows a query returns: the first row returned gets 1, the second gets 2, and so on. This means that if the sort order or the data changes, rownum changes too; it does not correspond to the physical data. (For a stable correspondence you would use rowid, but rowid is not a number; it is an encoded string such as AAAR3sAAEAAAACXAAA.) So rownum has to be fixed first, which is why SELECT test.*,rownum rn FROM test is used as a subquery.
  • Oracle's mod is the modulo function. For example, mod(5,3) means 5 % 3 = 2, that is, 5 / 3 = 1 remainder 2. Taking the row number modulo the total number of threads and comparing the result with the current thread number splits the result set: mod(row number, total threads) = current thread number.
  • The Kettle built-in variables ${Internal.Step.Unique.Count} and ${Internal.Step.Unique.Number} hold the total number of threads and the number of the current thread, respectively.
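
The splitting rule can be simulated outside the database. The sketch below applies mod(row number, total threads) = current thread number to ten row numbers and three threads. It assumes thread numbers start at 0 and that every thread sees the rows in the same order; without an ORDER BY, Oracle does not promise the latter, so treat that as an assumption to verify on real data.

```python
def partition(row_numbers, copies: int) -> dict:
    """Group row numbers by the thread whose number equals rn mod copies."""
    return {
        number: [rn for rn in row_numbers if rn % copies == number]
        for number in range(copies)
    }


if __name__ == "__main__":
    parts = partition(range(1, 11), 3)
    for number, rows in parts.items():
        print(f"thread {number}: {rows}")
    # Every row appears exactly once across all threads.
    merged = sorted(rn for rows in parts.values() for rn in rows)
    print(merged == list(range(1, 11)))
```

Each row number lands in exactly one group, so the threads together read the table once instead of three times.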

Table output needs no such handling: however many threads you start, Kettle divides the incoming rows evenly among them.
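
A minimal sketch of such an even split, assuming rows are handed to the step copies in round-robin order (to my understanding this is Kettle's default "distribute" behaviour, but the code below is only an illustration, not Kettle's implementation):

```python
from itertools import cycle


def distribute(rows, copies: int) -> list:
    """Hand rows to the copies one at a time, in rotation."""
    buckets = [[] for _ in range(copies)]
    for bucket, row in zip(cycle(buckets), rows):
        bucket.append(row)
    return buckets


if __name__ == "__main__":
    for number, bucket in enumerate(distribute(range(10), 4)):
        print(f"copy {number}: {len(bucket)} rows")
```

The bucket sizes differ by at most one row, which is what "evenly" means here.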


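Right-click the Table input or Table output step and choose "Change number of copies to start..." (改变开始复制的数量...). Raising the number blindly does not necessarily improve throughput, so test it.
In Table input, check "Replace variables in script?" so that the ${...} variables are substituted.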

In Table output:

  • Adjust the commit size (100 by default; bigger is not always better).
  • Uncheck "Truncate table". With multiple threads, you do not want thread B deleting what thread A has just inserted.
  • Specify the database fields explicitly, because the Table input query adds an extra row number field (rn), which would otherwise make the insert fail. If the target table was created with an extra row number column, for example as an auto-increment id, this step is not needed.
  • Enable batch inserts.
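
The effect of the commit size, the explicit field list, and batch inserts can be illustrated with a small stand-in. This sketch uses Python's built-in sqlite3 instead of MySQL, and the table and column names are invented; it only shows the pattern of dropping the rn helper column, sending rows in batches, and committing once per batch.

```python
import sqlite3


def load(rows, commit_size: int):
    """Insert rows in batches; return (rows stored, number of commits)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE test (id INTEGER, name TEXT)")
    insert = "INSERT INTO test (id, name) VALUES (?, ?)"
    commits = 0
    batch = []
    for row in rows:
        # Map only the target fields; the extra rn column is left out.
        batch.append((row["id"], row["name"]))
        if len(batch) == commit_size:
            conn.executemany(insert, batch)  # one batched statement
            conn.commit()
            commits += 1
            batch = []
    if batch:  # flush the final partial batch
        conn.executemany(insert, batch)
        conn.commit()
        commits += 1
    stored = conn.execute("SELECT COUNT(*) FROM test").fetchone()[0]
    conn.close()
    return stored, commits


if __name__ == "__main__":
    source = [{"id": i, "name": f"n{i}", "rn": i} for i in range(1, 2501)]
    print(load(source, 1000))
```

A larger commit size means fewer commits but bigger transactions, which is why the best value has to be found by testing.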

Note: running with multiple threads alone can improve the speed more than fivefold.

Enable the connection pool and tune the JDBC parameters
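
The specific settings for this step are not listed in the text. As an assumption on my part, the sketch below assembles a MySQL JDBC URL with three Connector/J properties that are commonly used for bulk loads: rewriteBatchedStatements, useServerPrepStmts and useCompression. In Kettle these would normally be entered as options in the connection dialog rather than written into a URL; the host, port and database name are placeholders.

```python
from urllib.parse import urlencode

# Assumed options, not taken from the article; verify each against the
# MySQL Connector/J documentation before relying on it.
OPTIONS = {
    "rewriteBatchedStatements": "true",  # send batched inserts as multi-row statements
    "useServerPrepStmts": "false",       # keep prepared statements client-side
    "useCompression": "true",            # compress client/server traffic
}


def mysql_jdbc_url(host: str, port: int, database: str, options: dict) -> str:
    """Build a jdbc:mysql URL with the given connection properties."""
    return f"jdbc:mysql://{host}:{port}/{database}?{urlencode(options)}"


if __name__ == "__main__":
    print(mysql_jdbc_url("localhost", 3306, "test", OPTIONS))
```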



Observing the run


Adjust the different parameters (number of threads, commit size) and observe the effect on speed.

Remaining room for improvement

  1. Switch to an SSD
  2. Continue tuning the MySQL parameters
  3. Switch to another storage engine, for example TokuDB
  4. Switch to another extraction tool, such as StreamSets or DataX
  5. Switch to another database, such as ClickHouse, TiDB, or Cassandra
  6. Use a Kettle cluster

Recruitment notice

Friends in Jinan, send us your resume and join us to build things together.
We are hiring long-term: Java programmers, big data engineers, operations engineers, and front-end engineers.

Reproduced from: https://juejin.im/post/5d0087f9f265da1bd6059cc1

Origin blog.csdn.net/weixin_33912246/article/details/93165178