MongoDB -> Kafka: a high-performance real-time synchronization (collection) solution for MongoDB data

The purpose of writing this blog

To let more people know that Alibaba's open-source MongoShake meets the need for high-performance, highly available real-time synchronization from MongoDB to Kafka very well (project address: https://github.com/alibaba/MongoShake, downloads: https://github.com/alibaba/MongoShake/releases). With that, this blog post is essentially over and you can happily go chew on the project. Or take a look at the official description first:

MongoShake is a universal data replication platform based on MongoDB's oplog. Redundant replication and active-active replication are its two most important functions. As an oplog-based replication tool for MongoDB clusters, it meets migration and synchronization needs and further enables disaster recovery and multi-site active-active deployments.

A title without a title

Haha, if you're interested, bear with my rambling for a bit. Recently I had a requirement to collect incremental MongoDB data in real time (roughly one billion records per day), so I needed to research possible solutions. I searched Baidu and Google with keywords such as mongodb kafka sync / 同步 / 采集 / 实时. At the time of writing this post, the top results pointed to the kafka-connect approach (there is an official connector, https://github.com/mongodb/mongo-kafka, and unofficial implementations exist as well). I am fairly familiar with kafka-connect (if you are not familiar with building and deploying it, expect to spend some time on that), but based on earlier experience I suspected it would not meet my collection performance requirements, and actual testing confirmed that it did not. Later I also came across https://github.com/rwynn/route81; compiling and deploying it is rather troublesome, and it likewise failed to meet the performance requirements. When searching I usually don't scroll very far down the results; if nothing turns up, I just keep changing keywords (including English ones) and searching again. This time was a reminder to dig further down next time, since the good stuff is not necessarily ranked in the first few results.

Later I searched GitHub with in:readme mongodb kafka sync, and one result caught my eye.

GitHub search results for the mongodb, kafka, sync keywords

I clicked through and quickly read the README; it was exactly what I wanted (and my own testing later confirmed that it is indeed high-performance and highly available and meets my needs). The project also provides an official MongoShake performance test report.

This post will not go into MongoShake's architecture, principles, or implementation, nor into how it achieves high performance and high availability (quite possibly because my skills are too weak to grasp them (●´ω`●)). It has a single purpose: when other people search for real-time MongoDB data synchronization, I hope the MongoShake solution ranks at the very top (it has earned it; whoever uses it knows, and shared joy beats solitary joy, hence this post), so they can avoid detours and wasted effort.

Things worth noting when using MongoShake for the first time

Data processing flow

The data processing flow in MongoShake versions before v2.2.1:

MongoDB (the source, i.e. the data to be synchronized)
-->MongoShake (the collector.linux process, which does the collecting)
-->Kafka (raw format: unparsed data carrying a header + body)
-->receiver (the receiver.linux process, which does the parsing, so that downstream components receive, for example, parsed JSON records one by one)
-->downstream components (consume the MongoDB data for their own business processing)

Before v2.2.1, parsing the data into Kafka requires starting both the collector.linux and receiver.linux processes. Moreover, receiver.linux has to be completed with your own business logic and then compiled; by default it merely writes the parsed data to a log.

The relevant code in src/mongoshake/receiver/replayer.go is shown in the figure:

The place in the receiver where you need to fill in your own logic

For details see: https://github.com/alibaba/MongoShake/wiki/FAQ#q-how-to-connect-to-different-tunnel-except-direct

Starting with v2.2.1, MongoShake's collector.conf has a configuration item tunnel.message:

# the message format in the tunnel, used when tunnel is kafka.
# "raw": batched raw data format which has good performance but encoded so that users
# should parse it by receiver.
# "json": single oplog format by json.
# "bson": single oplog format by bson.
# The type of data in the tunnel; only used for the kafka and file tunnel types.
# raw is the default type. It writes and reads in a batched (aggregated) mode, but because
# it carries some control information, it has to be parsed by a dedicated receiver.
# json writes to kafka in json format, so users can read it directly.
# bson writes to kafka in bson binary format.
tunnel.message = json
  • If the raw format is chosen, the data processing flow is the same as before (MongoDB -> MongoShake -> Kafka -> receiver -> downstream components).
  • If json or bson is chosen, the receiver step drops out and the flow becomes MongoDB -> MongoShake -> Kafka -> downstream components.

The advantage of setting json in v2.2.1 is that downstream components which previously had to go through a receiver can now consume from Kafka directly. That removes the receiver from the pipeline, users need no extra development, and the cost for open-source users goes down.

In short: the raw format squeezes out the best performance, but costs users an extra receiver deployment. The json and bson formats reduce deployment cost and let users consume directly from Kafka, and for most users the performance loss relative to raw is acceptable.
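
To make the json path concrete, here is a minimal sketch of what a downstream component consuming json-format oplog entries from Kafka might look like. It is not part of MongoShake: the consumer library (segmentio/kafka-go), the broker address, the topic name, and the oplog field names in the struct are all assumptions for illustration, so check your collector.conf and an actual message before relying on them.

// A minimal sketch (not from the MongoShake docs) of a downstream consumer reading
// json-format oplog entries produced with tunnel.message = json. The broker, topic
// and oplog field names below are assumptions; adjust them to your deployment.
package main

import (
	"context"
	"encoding/json"
	"log"

	"github.com/segmentio/kafka-go"
)

// oplogEntry lists only a few fields for illustration; inspect a real message to
// see the exact shape MongoShake writes.
type oplogEntry struct {
	Op string                 `json:"op"` // assumed field: operation type (i/u/d/...)
	Ns string                 `json:"ns"` // assumed field: namespace (db.collection)
	O  map[string]interface{} `json:"o"`  // assumed field: the document / change body
}

func main() {
	r := kafka.NewReader(kafka.ReaderConfig{
		Brokers: []string{"192.168.31.175:9092"}, // assumed broker address
		Topic:   "mongoshake",                    // assumed topic; must match collector.conf
		GroupID: "my-downstream",                 // consumer group of this downstream component
	})
	defer r.Close()

	for {
		m, err := r.ReadMessage(context.Background())
		if err != nil {
			log.Fatal(err)
		}
		var entry oplogEntry
		if err := json.Unmarshal(m.Value, &entry); err != nil {
			log.Printf("skip unparsable message at offset %d: %v", m.Offset, err)
			continue
		}
		// business logic goes here; this sketch just logs the parsed entry
		log.Printf("op=%s ns=%s doc=%v", entry.Op, entry.Ns, entry.O)
	}
}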

High-availability deployment scheme

I use version v2.2.1, and its high-availability deployment is very simple: in collector.conf, turn on master election:

# high availability option.
# enable master election if set true. only one mongoshake can become master
# and do sync, the others will wait and at most one of them become master once
# previous master die. The master information stores in the `mongoshake` db in the source
# database by default.
# Enable this option if a primary and one or more standby mongoshake instances pull from the same source.
master_quorum = true

# Where the checkpoint is stored: database means it is stored in MongoDB; api means an http interface is provided for writing the checkpoint.
context.storage = database

I also keep the default checkpoint storage location, database, which by default stores the checkpoint in the mongoshake db on the source MongoDB. There we can query the checkpoint records and some related information:

rs0:PRIMARY> use mongoshake
switched to db mongoshake
rs0:PRIMARY> show collections;
ckpt_default
ckpt_default_oplog
election
rs0:PRIMARY> db.election.find()
{ "_id" : ObjectId("5204af979955496907000001"), "pid" : 6545, "host" : "192.168.31.175", "heartbeat" : NumberLong(1582045562) }

I started three MongoShake instances in total, on 192.168.31.174, 192.168.31.175 and 192.168.31.176; from the election record above you can see that the process on 192.168.31.175 is currently the one doing the work. During my own testing, while writing data into MongoDB at high speed, I manually killed the collector process on 192.168.31.175, waited until 192.168.31.174 became master, and then killed that one too, eventually leaving only the process on 192.168.31.176 working. When I checked the final data counts I found some duplicated records; my guess is that the killed instances did not get a chance to persist their checkpoint before dying.
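
If duplicates from such failovers matter to your business, the usual remedy is to make downstream handling idempotent. The sketch below is deliberately naive: it drops byte-identical replays by hashing the raw message value, on the assumption that a replayed record is byte-for-byte the same. In practice you would more likely key on a unique field of your own data (for example the document _id together with the oplog timestamp) or use upsert-style writes; none of this is something MongoShake provides, it is just one way to cope.

// A naive de-duplication sketch for replayed records (an assumption-based illustration,
// not a MongoShake feature). It keys on a hash of the raw message bytes; real systems
// would bound the seen-set or key on a unique business field instead.
package main

import (
	"crypto/sha256"
	"fmt"
)

// seen remembers hashes of already-processed messages; unbounded here for brevity.
var seen = make(map[[32]byte]struct{})

// process returns true if the message is new, false if it is a replayed duplicate.
func process(value []byte) bool {
	key := sha256.Sum256(value)
	if _, dup := seen[key]; dup {
		return false // duplicate, e.g. replayed after an unflushed checkpoint
	}
	seen[key] = struct{}{}
	// business logic would go here
	return true
}

func main() {
	msg := []byte(`{"op":"i","ns":"test.user","o":{"_id":1}}`)
	fmt.Println(process(msg)) // true: handled the first time
	fmt.Println(process(msg)) // false: the replayed duplicate is skipped
}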

