This is article 25 in my ongoing commitment to technical writing (including translations). The small target is 999 articles, with a minimum of two per week.
This article covers CDH 6.2 + StreamSets 3.9.
StreamSets is a big data collection and processing tool. Data pipelines (Pipelines) are designed and scheduled visually by drag and drop. Its characteristics are:
- A visual drag-and-drop interface that is quick to work with.
- Good support for common data processing (data sources, data operations, data outputs).
- Built-in monitoring, so the data flow can be observed.
A similar open-source product is Apache NiFi. For a comparison of NiFi and StreamSets, see "Open Source ETL: Apache NiFi vs Streamsets" (a Chinese translation is available online).
The ETL tools most often encountered in China are probably DataX, Kettle, and Sqoop. For a simple comparison, see "Comparison of data integration tools: Kettle, Sqoop, DataX, StreamSets".
Installing StreamSets 3.9
Download the parcel installation package
Download version 3.9 from archives.streamsets.com/index.html
and place it in the www directory of an HTTP server. This article uses CentOS 7.6 as the example:
wget -P /var/www/html/streamsets3.9.0/ https://archives.streamsets.com/datacollector/3.9.0/parcel/manifest.json
wget -P /var/www/html/streamsets3.9.0/ https://archives.streamsets.com/datacollector/3.9.0/parcel/STREAMSETS_DATACOLLECTOR-3.9.0-el7.parcel.sha
wget -P /var/www/html/streamsets3.9.0/ https://archives.streamsets.com/datacollector/3.9.0/parcel/STREAMSETS_DATACOLLECTOR-3.9.0-el7.parcel
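The three downloads above differ only in the file name, so the URLs can be derived from the version and OS tag. The following is a minimal sketch under that assumption; the function name parcel_urls is hypothetical, and the URL layout is simply the one used by the commands above.

```shell
# Hypothetical helper: print the manifest, .sha, and parcel URLs
# for a given StreamSets version and OS tag (for example 3.9.0 and el7).
parcel_urls() {
  version="$1"
  os_tag="$2"
  # Assumption: the archive keeps the same layout as the wget commands above.
  base="https://archives.streamsets.com/datacollector/${version}/parcel"
  parcel="STREAMSETS_DATACOLLECTOR-${version}-${os_tag}.parcel"
  printf '%s\n' "${base}/manifest.json" "${base}/${parcel}.sha" "${base}/${parcel}"
}

# Usage (not run here): fetch every file into the web root used above.
# parcel_urls 3.9.0 el7 | xargs -n1 wget -P /var/www/html/streamsets3.9.0/
```

Keeping the version in one place makes it less likely that the manifest and the parcel come from different releases.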
Configure the CSD
Download it from streamsets.com/opensource:
wget -P /opt/cloudera/csd/ https://archives.streamsets.com/datacollector/3.9.0/csd/STREAMSETS-3.9.0.jar
cd /opt/cloudera/csd/
sudo chown cloudera-scm:cloudera-scm STREAMSETS-3.9.0.jar && sudo chmod 644 STREAMSETS-3.9.0.jar
systemctl restart cloudera-scm-server
Download and distribute the parcel
In Cloudera Manager, download, distribute, and activate the parcel. In my actual test, however, the reported total size was 4.6 GB while the downloaded file came to 5.2 GB, so the sha1sum check failed and an error was reported.
On the host where CM (Cloudera Manager) runs, inspect the repository with ls -lah /opt/cloudera/parcel-repo
Copy the file downloaded from archives.streamsets.com/datacollect... into /opt/cloudera/parcel-repo
If the directory was empty, you tried the download, and a hash error was reported, replace the file directly. If the page still shows the hash error afterwards, click Download again; after a moment the status changes to distribution.
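Before replacing the file, it can help to confirm locally whether the parcel really matches its hash. The following is a minimal sketch, assuming the .sha file holds the SHA-1 hex digest as its first field and that sha1sum (GNU coreutils) is available; the function name parcel_sha_ok is hypothetical.

```shell
# Hypothetical helper: succeed only if the parcel's SHA-1 matches its .sha file.
parcel_sha_ok() {
  parcel="$1"
  sha_file="$2"
  # Assumption: the expected digest is the first whitespace-separated field.
  expected="$(awk 'NR==1 {print $1}' "$sha_file")"
  actual="$(sha1sum "$parcel" | awk '{print $1}')"
  [ -n "$expected" ] && [ "$expected" = "$actual" ]
}

# Usage (not run here):
# parcel_sha_ok /opt/cloudera/parcel-repo/STREAMSETS_DATACOLLECTOR-3.9.0-el7.parcel \
#   /opt/cloudera/parcel-repo/STREAMSETS_DATACOLLECTOR-3.9.0-el7.parcel.sha \
#   && echo "hash matches" || echo "hash mismatch, download again"
```

A mismatch here points to a corrupted or oversized download rather than a problem in Cloudera Manager itself.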
After activation, it looks as follows
The service is created
Basic StreamSets usage
Open StreamSets. The default username and password are admin / admin
For the official tutorial, see Basic Tutorial
This article demonstrates data synchronization by subscribing to the MySQL binlog
MySQL binlog
Enable the binlog
Edit the MySQL configuration file my.cnf and add the following under [mysqld] (note: MySQL 5.7 will not start properly without server-id)
server-id=1
log-bin=mysql-bin
binlog_format=ROW
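Since a missing server-id keeps MySQL 5.7 from starting, it can be worth checking the file before restarting. The following is a minimal sketch; the function name mysqld_has_setting is hypothetical, it only looks at the one file it is given, and it matches key names literally (it does not treat log-bin and log_bin as the same option, although MySQL does).

```shell
# Hypothetical helper: succeed if key=value appears in the [mysqld]
# section of a my.cnf-style file. Keys are compared literally.
mysqld_has_setting() {
  cnf="$1"
  key="$2"
  value="$3"
  awk -v k="$key" -v v="$value" '
    # A section header switches the "inside [mysqld]" flag on or off.
    /^[ \t]*\[/ { in_mysqld = ($0 ~ /^[ \t]*\[mysqld\]/); next }
    # Skip comment lines.
    /^[ \t]*[#;]/ { next }
    in_mysqld {
      line = $0
      gsub(/[ \t\r]/, "", line)   # ignore spaces around "="
      if (line == k "=" v) found = 1
    }
    END { exit (found ? 0 : 1) }
  ' "$cnf"
}

# Usage (not run here; the path /etc/my.cnf is an assumption):
# mysqld_has_setting /etc/my.cnf server-id 1 || echo "server-id missing"
# mysqld_has_setting /etc/my.cnf binlog_format ROW || echo "binlog_format is not ROW"
```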
Create and configure the synchronization account
GRANT ALL on slave_test.* to 'slave_test'@'%' identified by 'slave_test';
GRANT SELECT, REPLICATION CLIENT, REPLICATION SLAVE on *.* to 'slave_test'@'%';
FLUSH PRIVILEGES;
Install the MySQL JDBC driver
wget -P /opt/cloudera/parcels/STREAMSETS_DATACOLLECTOR/streamsets-libs/streamsets-datacollector-mysql-binlog-lib/lib/ https://repo1.maven.org/maven2/mysql/mysql-connector-java/5.1.47/mysql-connector-java-5.1.47.jar
Restart StreamSets
Create a pipeline
Configure MySQL binlog parsing and processing
Configure the destination
Run
Test
Here the stress-testing tool bundled with MySQL, mysqlslap.exe, is used for testing:
bin/mysqlslap --user=root --password=xxxxxx --concurrency=50 --number-int-cols=5 --number-char-cols=20 --auto-generate-sql --number-of-queries=100000 --auto-generate-sql-load-type=write --host=192.168.0.123 --port=3306
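--user user (needs permission to create databases and tables)
--password password
--concurrency number of concurrent clients
--number-int-cols the table has 5 integer columns
--number-char-cols the table has 20 string columns
--auto-generate-sql generate the SQL automatically
--number-of-queries total number of queries to execute
--auto-generate-sql-load-type=write perform write operations only
--host MySQL host
--port port
The monitoring reports are shown below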
Common Errors
![image.png](https://cdn.nlark.com/yuque/0/2019/png/226273/1561021775509-fa60a34d-8e71-4e30-aa65-88a23521fb26.png)
For synchronization inconsistency errors, set the offset manually
If the error is Pipeline Status: RUNNING_ERROR: For input string: "xxxx", change my.cnf to:
server-id=1
log-bin=mysql-bin
binlog_format=ROW
sync_binlog=1
binlog_gtid_simple_recovery=ON
log_slave_updates=ON
gtid_mode=ON
enforce_gtid_consistency=ON
Reproduced from: https://juejin.im/post/5d0b5bbcf265da1b7f29850a