Collecting MySQL Incremental Data in Real Time with Flume

01-Introduction

A related project needs to collect data stored in a MySQL database into Kafka incrementally and in real time, for downstream consumers. One approach found online is to use the third-party Flume source plugin "flume-ng-sql-source", which can collect MySQL data incrementally in real time.

GitHub repository for flume-ng-sql-source: https://github.com/keedio/flume-ng-sql-source

02-Environment

1. Basic environment information

  • Cluster environment
    • CDH 5.8.3
    • Flume shipped with CDH
    • Kafka shipped with CDH
  • Data source
    • MySQL 5.6
  • Source plugin
    • flume-ng-sql-source 1.4.3
  • MySQL JDBC driver
    • mysql-connector-java-5.1.35.jar

2. Plugin version selection

According to the notes on the project's site, Flume 1.7 and earlier versions should use version 1.4.3 of the plugin.

03-Test Setup

1. Building and packaging the plugin

Build and package the plugin with Maven; the default Java version is 1.7. A typical build command is sketched below.
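A minimal build sketch, assuming the plugin source has been cloned from the GitHub repository above and Maven plus a JDK 1.7 are installed:

# Run from the root of the cloned plugin project
cd flume-ng-sql-source
mvn clean package

The packaged jar is written to the project's target/ directory.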

2. Creating the plugin directories and uploading the jars

Following the Flume home directory and plugin directory configured for the CDH cluster, create the following two directories on the node where the agent runs (a command sketch follows the list):

  • /var/lib/flume-ng/plugins.d/sql-source/lib
    • holds the flume-ng-sql-source jar
  • /var/lib/flume-ng/plugins.d/sql-source/libext
    • holds the MySQL JDBC jar
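A command sketch for creating the directories and placing the jars; the /tmp upload location is an assumption, and the jar names follow the versions listed in the environment section:

mkdir -p /var/lib/flume-ng/plugins.d/sql-source/lib
mkdir -p /var/lib/flume-ng/plugins.d/sql-source/libext
# Copy the jars from wherever they were uploaded
cp /tmp/flume-ng-sql-source-1.4.3.jar /var/lib/flume-ng/plugins.d/sql-source/lib/
cp /tmp/mysql-connector-java-5.1.35.jar /var/lib/flume-ng/plugins.d/sql-source/libext/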

Because incremental collection is implemented through a status file, the configuration file specifies the status file's name and the directory it lives in. Create the directory and the status file in advance.

Note: since Flume runs as the flume user, the status directory and file must be made accessible to the flume user, either by granting permissions or by changing the owner and group to flume:flume; Flume records the incremental offset in this file. A preparation sketch follows.
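A sketch of the preparation, matching the status.file.path and status.file.name values used in the configuration below:

mkdir -p /var/log/flume
touch /var/log/flume/sqlSource.status
# Make the status directory and file writable by the flume user
chown -R flume:flume /var/log/flume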

 

3. Configuration file

tier1.sources = sqlSource
tier1.channels = sqlChannel
tier1.sinks = kafkaSink
# Source configuration
# For each one of the sources, the type is defined
tier1.sources.sqlSource.type = org.keedio.flume.source.SQLSource
tier1.sources.sqlSource.hibernate.connection.url = jdbc:mysql://132.35.231.131:3306/zhaoxj_db
# Hibernate database connection properties
tier1.sources.sqlSource.hibernate.connection.user = zhaoxj
tier1.sources.sqlSource.hibernate.connection.password = *********
tier1.sources.sqlSource.hibernate.connection.autocommit = true
tier1.sources.sqlSource.hibernate.dialect = org.hibernate.dialect.MySQL5Dialect
tier1.sources.sqlSource.hibernate.connection.driver_class = com.mysql.jdbc.Driver
#tier1.sources.sqlSource.table = employee1
# Columns to import to Kafka (the default * imports the entire row)
#tier1.sources.sqlSource.columns.to.select = *
# Query delay: the query is sent every configured number of milliseconds
tier1.sources.sqlSource.run.query.delay = 10000
# Source status file configuration
tier1.sources.sqlSource.status.file.path = /var/log/flume
tier1.sources.sqlSource.status.file.name = sqlSource.status
# Source query configuration
# Starting value for the increment
tier1.sources.sqlSource.start.from = 0
# The quotes here must be escaped with a backslash. $@$ is replaced by:
# 1. the value of start.from when the status file has no incremental marker yet (i.e. on the first query);
# 2. the value stored in the status file once the marker exists.
tier1.sources.sqlSource.custom.query = select * from test_line_feed where t_id > \'$@$\'
tier1.sources.sqlSource.batch.size = 1000
# Fetch 1000 rows per query; "limit 1000" is appended to the query automatically
tier1.sources.sqlSource.max.rows = 1000
# Field delimiter
tier1.sources.sqlSource.delimiter.entry = |
# Connection pool configuration
tier1.sources.sqlSource.hibernate.connection.provider_class = org.hibernate.connection.C3P0ConnectionProvider
tier1.sources.sqlSource.hibernate.c3p0.min_size = 1
tier1.sources.sqlSource.hibernate.c3p0.max_size = 10

# Channel configuration
# The channel can be defined as follows.
tier1.sources.sqlSource.channels = sqlChannel

tier1.channels.sqlChannel.type = memory
tier1.channels.sqlChannel.capacity = 500

# Sink configuration
# Uses Flume's built-in Kafka sink
tier1.sinks.kafkaSink.type = org.apache.flume.sink.kafka.KafkaSink
tier1.sinks.kafkaSink.flumeBatchSize = 640
tier1.sinks.kafkaSink.kafka.bootstrap.servers = 132.35.231.160:9092,132.35.231.161:9092,132.35.231.162:9092
tier1.sinks.kafkaSink.kafka.topic = zhaoxj_test

# Wiring: bind the source and sink to the channel
tier1.sources.sqlSource.channels = sqlChannel
tier1.sinks.kafkaSink.channel = sqlChannel
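Once the agent is running, one way to verify the pipeline end to end is to read the topic with Kafka's console consumer. The exact flags vary with the Kafka version shipped in CDH (older versions use --zookeeper instead of --bootstrap-server), so treat this as a sketch:

kafka-console-consumer --bootstrap-server 132.35.231.160:9092 --topic zhaoxj_test --from-beginning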

Note: you cannot designate which field serves as the incremental marker; the code uses the first column of the query result by default. Workaround: when writing the query, put the column you want to use as the incremental basis in the first position, as in the sketch below.
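For example, if t_id drives the increment and the table also has columns t_name and t_value (hypothetical names for illustration), list t_id first:

# t_id comes first, so the plugin tracks it as the incremental marker
tier1.sources.sqlSource.custom.query = select t_id, t_name, t_value from test_line_feed where t_id > \'$@$\'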

Reposted from www.cnblogs.com/zhaoxj127/p/12021176.html