025 - StreamSets: installing the big data ETL tool and subscribing to the MySQL binlog

This is article 25 in my ongoing technical writing project (translations included). The modest target is 999 articles, at a minimum of two per week.

This article covers CDH 6.2 + StreamSets 3.9.

StreamSets is a big data collection and processing tool. It lets you design and schedule data pipelines (Pipelines) visually by drag and drop. Its main features are:

  • A visual drag-and-drop interface that is quick to work with.
  • Good support for common data processing components (data sources, data operations, data outputs).
  • Built-in monitoring, so the data flow can be observed.

A similar open-source product is Apache NiFi. For an online comparison of NiFi and StreamSets, see "Open Source ETL: Apache NiFi vs Streamsets" (a Chinese translation is also available online).

The ETL tools more commonly encountered in China are probably DataX, Kettle, and Sqoop. For a simple comparison, see "Data integration comparison: Kettle, Sqoop, DataX, StreamSets".

Installing StreamSets 3.9

Download the parcel installation package

Download version 3.9 from archives.streamsets.com/index.html.


Then place it in the www directory of an HTTP server. This article uses CentOS 7.6 as the example:

wget -P /var/www/html/streamsets3.9.0/ https://archives.streamsets.com/datacollector/3.9.0/parcel/manifest.json
wget -P /var/www/html/streamsets3.9.0/ https://archives.streamsets.com/datacollector/3.9.0/parcel/STREAMSETS_DATACOLLECTOR-3.9.0-el7.parcel.sha
wget -P /var/www/html/streamsets3.9.0/ https://archives.streamsets.com/datacollector/3.9.0/parcel/STREAMSETS_DATACOLLECTOR-3.9.0-el7.parcel

Configure the CSD

Download it from streamsets.com/opensource:

wget -P /opt/cloudera/csd/ https://archives.streamsets.com/datacollector/3.9.0/csd/STREAMSETS-3.9.0.jar
cd /opt/cloudera/csd/
sudo chown cloudera-scm:cloudera-scm STREAMSETS-3.9.0.jar && sudo chmod 644 STREAMSETS-3.9.0.jar
systemctl restart cloudera-scm-server

Download and distribute the parcel





Download the parcel, then activate it. In my own test, however, the listed total size was 4.6 GB while the actual download came to 5.2 GB, so the sha1sum check failed and an error was reported.

On the host where Cloudera Manager (CM) runs, inspect the repository with ls -lah /opt/cloudera/parcel-repo

Copy the file downloaded earlier from archives.streamsets.com/datacollect... into /opt/cloudera/parcel-repo.
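Because the failure described here is a size and hash mismatch, it may help to check the parcel against its .sha file before copying it. The following is a minimal sketch, assuming the .sha file holds the hex SHA-1 digest as its first field; verify_parcel is a hypothetical helper name, not part of Cloudera Manager or StreamSets.

```shell
#!/usr/bin/env bash
# Hypothetical helper: compare a parcel's SHA-1 with the digest stored in <parcel>.sha.
# Assumption: the .sha file contains the hex digest as its first whitespace-separated field.
verify_parcel() {
  local parcel="$1" expected actual
  if [ ! -f "$parcel" ] || [ ! -f "${parcel}.sha" ]; then
    echo "MISSING: need $parcel and ${parcel}.sha"
    return 2
  fi
  expected="$(awk '{print $1; exit}' "${parcel}.sha")"
  actual="$(sha1sum "$parcel" | awk '{print $1}')"
  if [ "$expected" = "$actual" ]; then
    echo "OK"
  else
    echo "MISMATCH: expected $expected, got $actual"
    return 1
  fi
}
```

For example, verify_parcel /var/www/html/streamsets3.9.0/STREAMSETS_DATACOLLECTOR-3.9.0-el7.parcel prints OK when the digests match, and a MISMATCH line with both digests otherwise.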


If you had already tried the download and got the hash error, replace the file directly. If the page still shows the hash error, click Download again; this time it moves on to distribution.
After activation it looks like this:



Creation is complete.

Basic StreamSets usage

Open StreamSets. The default username and password are admin / admin.



For the official tutorial, see Basic Tutorial.

This article shows how to synchronize data by subscribing to the MySQL binlog.

MySQL binlog

Enable the binlog

Edit the MySQL configuration file my.cnf and add the following under the [mysqld] section (note: MySQL 5.7 will not start properly with the binlog enabled unless server-id is set):

server-id=1
log-bin=mysql-bin
binlog_format=ROW
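To confirm that the three settings landed in the right section before restarting MySQL, a small check like the following can be run against my.cnf. This is only a sketch: check_binlog_cnf is a hypothetical helper, it reads a single file and does not follow !include directives, and it treats hyphens and underscores in option names as equivalent, as MySQL does.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report which binlog settings are missing from [mysqld].
# Prints nothing and returns 0 when server-id, log-bin and binlog_format=ROW are all present.
check_binlog_cnf() {
  awk '
    # Track whether the current section is [mysqld].
    /^[ \t]*\[/ { in_mysqld = ($0 ~ /^[ \t]*\[mysqld\][ \t]*$/); next }
    # Inside [mysqld], record every non-comment, non-blank option line.
    in_mysqld && /^[ \t]*[^#; \t]/ {
      line = $0
      gsub(/[ \t]/, "", line)          # drop whitespace around key and value
      split(line, kv, "=")
      key = kv[1]
      gsub(/-/, "_", key)              # normalise server-id to server_id, etc.
      seen[key] = toupper(kv[2])
    }
    END {
      missing = 0
      if (!("server_id" in seen))         { print "missing: server-id"; missing = 1 }
      if (!("log_bin" in seen))           { print "missing: log-bin"; missing = 1 }
      if (seen["binlog_format"] != "ROW") { print "missing: binlog_format=ROW"; missing = 1 }
      exit missing
    }
  ' "$1"
}
```

Calling check_binlog_cnf /etc/my.cnf (the path is an assumption; use wherever your my.cnf lives) lists each missing setting on its own line and returns a non-zero status if any are absent.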

Create and configure the synchronization account

GRANT ALL on slave_test.* to 'slave_test'@'%' identified by 'slave_test';
GRANT SELECT, REPLICATION CLIENT, REPLICATION SLAVE on *.* to 'slave_test'@'%';
FLUSH PRIVILEGES;

Install the MySQL JDBC driver

wget -P /opt/cloudera/parcels/STREAMSETS_DATACOLLECTOR/streamsets-libs/streamsets-datacollector-mysql-binlog-lib/lib/ https://repo1.maven.org/maven2/mysql/mysql-connector-java/5.1.47/mysql-connector-java-5.1.47.jar

Restart StreamSets.

Create a pipeline

Configure MySQL binlog parsing and processing






Configure the destination

Run

Test

For testing, use mysqlslap, the load-testing tool that ships with MySQL:

bin/mysqlslap --user=root --password=xxxxxx --concurrency=50 --number-int-cols=5 --number-char-cols=20 --auto-generate-sql --number-of-queries=100000 --auto-generate-sql-load-type=write --host=192.168.0.123 --port=3306
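--user user (needs privileges to create databases and tables)
--password password
--concurrency number of concurrent clients
--number-int-cols the table has 5 integer columns
--number-char-cols the table has 20 string columns
--auto-generate-sql generate the SQL statements automatically
--number-of-queries total number of queries to run
--auto-generate-sql-load-type=write run write operations only
--host MySQL host
--port port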

Monitoring reports are shown below.

Common Errors

    ![image.png](https://cdn.nlark.com/yuque/0/2019/png/226273/1561021775509-fa60a34d-8e71-4e30-aa65-88a23521fb26.png)


Manually recovering from an inconsistent synchronization error


Set the offset

If you get the error Pipeline Status: RUNNING_ERROR: For input string: "xxxx", change my.cnf as follows:

server-id=1
log-bin=mysql-bin
binlog_format=ROW
sync_binlog=1
binlog_gtid_simple_recovery=ON
log_slave_updates=ON
gtid_mode=ON
enforce_gtid_consistency=ON

Reference material

Reproduced from: https://juejin.im/post/5d0b5bbcf265da1b7f29850a


Origin blog.csdn.net/weixin_34148340/article/details/93165174