About two years ago our databases kept growing, and the daily full sync from the database into Hive via Sqoop was taking noticeably longer. At the time we considered using canal to parse the binlog and impala+kudu to apply incremental updates. That would not only have solved the sync problem but also given us near-real-time data, and it could double as a message hub; the project was never finished, though. Maxwell parses the binlog into JSON, which makes downstream processing a bit more convenient, so now I want to build a Maxwell-based incremental sync.
There is no really good general answer to incremental sync. Some people suggest using a timestamp or the primary key as the increment marker. That can work, but it raises two problems. First, old rows get updated, so you have to deduplicate. Second, if the increment is time-based, the obvious implementation is to query for the newest rows by timestamp and copy them over; if the table has 100 million rows, running that scan every day hardly seems like a clever solution.
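To make the first problem concrete, here is a small Python sketch of the dedup step (my own illustration, not from any library; `id` is assumed to be the primary key and `ts` the update timestamp):

```python
# Deduplicate a timestamp-based incremental extract: when an old row was
# updated, the extract contains a newer copy of the same primary key, and
# the load must keep only the latest version per key.
def dedup_latest(rows):
    # rows: iterable of dicts, each with a primary key "id" and a "ts" column
    latest = {}
    for row in rows:
        key = row["id"]
        if key not in latest or row["ts"] > latest[key]["ts"]:
            latest[key] = row
    return list(latest.values())

extract = [
    {"id": 1, "name": "old", "ts": 100},
    {"id": 2, "name": "b",   "ts": 150},
    {"id": 1, "name": "new", "ts": 200},  # same row, updated later
]
assert dedup_latest(extract) == [
    {"id": 1, "name": "new", "ts": 200},
    {"id": 2, "name": "b",   "ts": 150},
]
```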
So what about syncing from the database log instead? With Hive alone it is not practical: Hive has no efficient row-level UPDATE/DELETE. impala+kudu does support it; parse the log, translate each event into SQL, and apply it through Impala. But that means Impala becomes your only front-end query and analysis engine, because Hive cannot read Kudu tables.
Alternatively, extract the table into HBASE and replay the parsed log against HBASE to keep it up to date. That seems workable. The drawback is that the front end can only reach the data through external tables mapped onto the HBASE tables, which is somewhat slower; the upside is that Hive, Spark SQL, and Impala all remain available for queries and analysis.
None of the three approaches above is great, but if the data volume is really large I would probably pick the last one: it handles the increments while keeping all the front-end tools usable.
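To make the HBASE option concrete, here is a minimal Python sketch of replaying change events against an HBase-like table. A plain dict stands in for the HBASE table; a real job would use an HBase client (e.g. happybase), and `id` is assumed to be the row key:

```python
# Replay binlog-style change events against a key-value table.
# HBase puts are idempotent upserts, which is exactly what replaying
# inserts/updates from the binlog needs.
def apply_event(table: dict, event: dict) -> None:
    row_key = str(event["data"]["id"])  # "id" assumed to be the row key
    if event["type"] in ("insert", "update"):
        table[row_key] = event["data"]
    elif event["type"] == "delete":
        table.pop(row_key, None)

hbase_table = {}
events = [
    {"type": "insert", "data": {"id": 10, "name": "a"}},
    {"type": "update", "data": {"id": 10, "name": "b"}},
    {"type": "insert", "data": {"id": 11, "name": "c"}},
    {"type": "delete", "data": {"id": 11}},
]
for e in events:
    apply_event(hbase_table, e)
assert hbase_table == {"10": {"id": 10, "name": "b"}}
```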
Now let's look at Maxwell. After downloading it, nothing else is needed to get it running and parsing the binlog. The principle is fairly simple: Maxwell creates a maxwell database on the source MySQL instance to record the binlog position. If Maxwell dies for any reason, on restart it resumes fetching the binlog from the position saved in that database; on the very first connection it starts from the latest log position. You can, of course, also update the saved position manually.
Each event it fetches is parsed straight into JSON and sent to a destination: RabbitMQ, Kafka, stdout, and so on. The concrete steps are as follows:
Start Maxwell; the meaning of each option is clear from its name.
[root@10-10-192-88 maxwell-1.4.2]# bin/maxwell --user='admin' --password='internal' --host='localhost' --producer=kafka --kafka.bootstrap.servers=namenode01.isesol.com:9092,namenode02.isesol.com:9092,datanode03.isesol.com:9092,datanode04.isesol.com:9092 --output_ddl=true --ddl_kafka_topic=maxwell_ddl
Thu Dec 14 17:31:49 CST 2017 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
17:31:49,555 INFO ProducerConfig - ProducerConfig values: compression.type = gzip metric.reporters = [] metadata.max.age.ms = 300000 metadata.fetch.timeout.ms = 5000 reconnect.backoff.ms = 50 sasl.kerberos.ticket.renew.window.factor = 0.8 bootstrap.servers = [namenode01.isesol.com:9092, namenode02.isesol.com:9092, datanode03.isesol.com:9092, datanode04.isesol.com:9092] retry.backoff.ms = 100 sasl.kerberos.kinit.cmd = /usr/bin/kinit buffer.memory = 33554432 timeout.ms = 30000 key.serializer = class org.apache.kafka.common.serialization.StringSerializer sasl.kerberos.service.name = null sasl.kerberos.ticket.renew.jitter = 0.05 ssl.keystore.type = JKS ssl.trustmanager.algorithm = PKIX block.on.buffer.full = false ssl.key.password = null max.block.ms = 60000 sasl.kerberos.min.time.before.relogin = 60000 connections.max.idle.ms = 540000 ssl.truststore.password = null max.in.flight.requests.per.connection = 5 metrics.num.samples = 2 client.id = ssl.endpoint.identification.algorithm = null ssl.protocol = TLS request.timeout.ms = 30000 ssl.provider = null ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1] acks = 1 batch.size = 16384 ssl.keystore.location = null receive.buffer.bytes = 32768 ssl.cipher.suites = null ssl.truststore.type = JKS security.protocol = PLAINTEXT retries = 1 max.request.size = 1048576 value.serializer = class org.apache.kafka.common.serialization.StringSerializer ssl.truststore.location = null ssl.keystore.password = null ssl.keymanager.algorithm = SunX509 metrics.sample.window.ms = 30000 partitioner.class = class org.apache.kafka.clients.producer.internals.DefaultPartitioner send.buffer.bytes = 131072 linger.ms = 0
17:31:49,561 WARN KafkaProducer - metadata.fetch.timeout.ms config is deprecated and will be removed soon. Please use max.block.ms
17:31:49,714 INFO AppInfoParser - Kafka version : 0.9.0.1
17:31:49,715 INFO AppInfoParser - Kafka commitId : 23c69d62a0cabf06
17:31:49,733 INFO Maxwell - Maxwell v1.4.2 is booting (MaxwellKafkaProducer), starting at BinlogPosition[mysql-bin.000004:828606]
17:31:49,822 INFO OpenReplicator - starting replication at mysql-bin.000004:828606
17:31:49,957 INFO MysqlSavedSchema - Restoring schema id 5 (last modified at BinlogPosition[mysql-bin.000004:827505])
17:31:50,068 INFO MysqlSavedSchema - Restoring schema id 1 (last modified at BinlogPosition[mysql-bin.000003:4493])
17:31:50,123 INFO MysqlSavedSchema - beginning to play deltas...
17:31:50,125 INFO MysqlSavedSchema - played 4 deltas in 2ms
The DML output in Kafka looks like this:
{"database":"test","table":"test","type":"insert","ts":1513238989,"xid":5102,"commit":true,"data":{"id":11,"name":"fdfdfdd"}}
{"database":"test","table":"test1","type":"insert","ts":1513239969,"xid":5129,"commit":true,"data":{"id":10}}
{"database":"test","table":"test1","type":"delete","ts":1513239981,"xid":5141,"commit":true,"data":{"id":10}}
{"database":"test","table":"test1","type":"insert","ts":1513239996,"xid":5155,"data":{"id":10}}
{"database":"test","table":"test1","type":"insert","ts":1513239998,"xid":5155,"commit":true,"data":{"id":11}}
{"database":"test","table":"test1","type":"insert","ts":1513240035,"xid":5171,"data":{"id":12}}
{"database":"test","table":"test1","type":"insert","ts":1513240038,"xid":5171,"data":{"id":13}}
{"database":"test","table":"test1","type":"insert","ts":1513240041,"xid":5171,"commit":true,"data":{"id":14}}
{"database":"test","table":"test1","type":"delete","ts":1513240067,"xid":5186,"data":{"id":10}}
{"database":"test","table":"test1","type":"delete","ts":1513240067,"xid":5186,"data":{"id":11}}
{"database":"test","table":"test1","type":"delete","ts":1513240067,"xid":5186,"data":{"id":12}}
{"database":"test","table":"test1","type":"delete","ts":1513240067,"xid":5186,"data":{"id":13}}
{"database":"test","table":"test1","type":"delete","ts":1513240067,"xid":5186,"commit":true,"data":{"id":14}}
{"database":"test","table":"test","type":"insert","ts":1513240300,"xid":5201,"commit":true,"data":{"id":1,"name":"fdfd"}}
{"database":"test","table":"test","type":"insert","ts":1513240317,"xid":5212,"commit":true,"data":{"id":2,"name":"fdfdafsfdfdfd"}}
The DDL output in Kafka looks like this:
{"type":"table-create","database":"test","table":"test2","def":{"database":"test","charset":"utf8mb4","table":"test2","columns":[{"type":"int","name":"id","signed":true}],"primary-key":[]},"ts":1513240757000,"sql":"create table test2(id int)"}
{"type":"table-alter","database":"test","table":"test2","old":{"database":"test","charset":"utf8mb4","table":"test2","columns":[{"type":"int","name":"id","signed":true}],"primary-key":[]},"def":{"database":"test","charset":"utf8mb4","table":"test2","columns":[{"type":"int","name":"id","signed":true},{"type":"varchar","name":"name","charset":"utf8mb4"}],"primary-key":[]},"ts":1513241096000,"sql":"alter table test2 add column name varchar(10)"}
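Such JSON events can be translated back into SQL and applied through Impala or any other engine. A minimal Python sketch of that translation (my own illustration, with naive quoting/escaping, and `id` assumed to be the primary key):

```python
import json

# Turn a Maxwell-style JSON event back into an SQL statement. Real code
# would use proper identifier quoting and parameterized statements.
def to_sql(event: dict) -> str:
    table = f'{event["database"]}.{event["table"]}'
    data = event["data"]
    def lit(v):  # naive SQL literal rendering, for illustration only
        return str(v) if isinstance(v, (int, float)) else "'%s'" % str(v).replace("'", "''")
    if event["type"] == "insert":
        cols = ", ".join(data)
        vals = ", ".join(lit(v) for v in data.values())
        return f"INSERT INTO {table} ({cols}) VALUES ({vals})"
    if event["type"] == "delete":
        where = " AND ".join(f"{k} = {lit(v)}" for k, v in data.items())
        return f"DELETE FROM {table} WHERE {where}"
    if event["type"] == "update":
        # Maxwell puts the new row in "data" (and changed old values in
        # "old"); the primary key "id" is assumed unchanged here.
        sets = ", ".join(f"{k} = {lit(v)}" for k, v in data.items() if k != "id")
        return f"UPDATE {table} SET {sets} WHERE id = {lit(data['id'])}"
    raise ValueError(event["type"])

msg = '{"database":"test","table":"test","type":"insert","ts":1513238989,"xid":5102,"commit":true,"data":{"id":11,"name":"fdfdfdd"}}'
print(to_sql(json.loads(msg)))
# INSERT INTO test.test (id, name) VALUES (11, 'fdfdfdd')
```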
Then all that is left is to write a middleware that translates this JSON, and it can feed any other system. But here I have a big question. Kafka only guarantees ordering within a single partition; across multiple partitions there is no global order. When applying binlog events, as everyone knows, the order of the SQL is critical: update-then-delete and delete-then-update produce completely different results.
So if the Kafka topic has multiple partitions, how do we guarantee the SQL is applied in order? The only answer I can come up with is to use a single partition.
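A single partition works but caps throughput. A gentler alternative, my suggestion rather than something shown above (newer Maxwell releases expose it as a producer partition-by setting; check your version's documentation), is to key each message by database/table/primary key: Kafka's default partitioner sends equal keys to the same partition, so all events touching one row stay ordered relative to each other, while events for different rows may interleave. A Python sketch, with `id` assumed to be the primary key and MD5 standing in for Kafka's actual murmur2-based partitioner:

```python
import hashlib

NUM_PARTITIONS = 4

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    # Stand-in for Kafka's default partitioner: any stable hash works for
    # the argument, as long as equal keys always map to the same partition.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

def row_key(event: dict) -> str:
    # Key by database/table/primary key so every event touching the same
    # row lands on the same partition.
    return f'{event["database"]}.{event["table"]}.{event["data"]["id"]}'

update = {"database": "test", "table": "test1", "type": "update", "data": {"id": 10}}
delete = {"database": "test", "table": "test1", "type": "delete", "data": {"id": 10}}

# The update and delete for row id=10 always share a partition, so their
# relative order survives; ordering across different rows is not guaranteed.
assert partition_for(row_key(update)) == partition_for(row_key(delete))
```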