An Introduction to MongoShake

Introduction: Zhu Zhao, a technical expert on the Alibaba Cloud NoSQL database team, introduced the basic principles of MongoShake and walked through its application scenarios with typical cases in the "Tuesday Open Source Day" live broadcast of the Alibaba Cloud Developer Community. This article is a transcript of that broadcast; the replay link is below.

Watch the replay: https://developer.aliyun.com/live/45078

Contents

1. Background

2. Basic principles of MongoShake

3. Application scenarios

 

1. Background

(1) Basic concepts of MongoDB

Three basic MongoDB concepts matter here: the Oplog, the ReplicaSet, and Replication.

 

  1. Oplog
    The Oplog stores MongoDB's incremental data, similar to MySQL's Binlog; every write operation generates a corresponding Oplog entry, for example:

[Figure: example Oplog entry]

 

The figure above shows the basic format of an Oplog entry.
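To make the format concrete, here is a minimal sketch (not MongoShake's own code) that tails local.oplog.rs with the official MongoDB Go driver and prints the fields an entry carries; the connection URI is a placeholder.

```go
package main

import (
	"context"
	"fmt"
	"log"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

func main() {
	ctx := context.Background()
	// Connect to a replica-set member; the URI is a placeholder.
	client, err := mongo.Connect(ctx, options.Client().ApplyURI("mongodb://localhost:27017"))
	if err != nil {
		log.Fatal(err)
	}
	defer client.Disconnect(ctx)

	// The oplog lives in the capped collection local.oplog.rs.
	oplog := client.Database("local").Collection("oplog.rs")
	cur, err := oplog.Find(ctx, bson.M{},
		options.Find().SetCursorType(options.TailableAwait)) // keep waiting for new entries
	if err != nil {
		log.Fatal(err)
	}
	defer cur.Close(ctx)

	for cur.Next(ctx) {
		var entry bson.M
		if err := cur.Decode(&entry); err != nil {
			log.Fatal(err)
		}
		// Typical fields: ts (timestamp), op (i/u/d/c/n), ns ("db.collection"), o (the change body).
		fmt.Printf("ts=%v op=%v ns=%v o=%v\n", entry["ts"], entry["op"], entry["ns"], entry["o"])
	}
}
```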

 

  2. ReplicaSet
    A replica set contains Primary, Secondary, Hidden and other roles; Secondaries replicate data from the Primary.

[Figure: replica set topology]

 

For example, a typical deployment has one Primary, one Secondary, and one Hidden node. By default the Secondary replicates from the Primary, that is, it pulls data from the Primary. The nodes exchange heartbeats; if the Primary fails, the Secondaries hold an election, one of them becomes the new Primary, and service continues.

Generally speaking, the Primary serves both reads and writes, while Secondaries serve reads.

 

  3. Replication
    Replication includes full data replication and incremental data replication:

 

  • Full replication: scan the collections on the Primary and write them to the Secondary.
  • Incremental replication: pull the Oplog from the Primary and apply it to the Secondary.

 

(2) Replica set disaster recovery

[Figure: single-node failure in a replica set]

When a single node in the ReplicaSet fails, the disaster-tolerance mechanism keeps the service running normally.

 

For example, if the Primary goes down, a Secondary is elected as the new Primary so that service continues. Likewise, losing the Secondary or the Hidden node does not affect operation. Whichever single node of the three fails, the business keeps running normally.

 

But if more than half of the nodes in the ReplicaSet fail, it can no longer provide service.

[Figure: majority of replica-set nodes down]

Typically all nodes of a database are deployed in the same data center. If something unusual happens to that data center, such as a power outage, a fire, a fiber cut, or even an earthquake, the risk is much higher, so cross-region disaster recovery is essential: if the data center in one location becomes unavailable, the whole workload can be switched quickly to a remote disaster-recovery data center to keep the business stable.

 

2. Basic principles of MongoShake

(1) Hand-off between full and incremental synchronization

[Figure: timeline of the full-to-incremental hand-off]

As shown in the figure above, suppose the user starts MongoShake at 8 o'clock; full synchronization begins at that moment. If the full synchronization finishes at 9 o'clock, incremental synchronization starts at 9 o'clock, but the Oplog is still pulled starting from 8 o'clock.

 

This is because the full pull is not a snapshot of MongoDB: the data may change while the full copy is running, so the Oplog must be replayed from 8 o'clock to stitch the two phases together without losing changes.
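A function-level sketch of that hand-off is below (illustrative only; runFullSync and runIncrSync are hypothetical placeholders for the dump phase and the oplog-replay phase, and the imports are the same mongo-driver packages as in the earlier example plus bson/primitive): the newest oplog timestamp is recorded before the full dump starts, and incremental replay later begins from that timestamp.

```go
// runSync sketches the full/incremental hand-off. runFullSync and runIncrSync
// are hypothetical placeholders for the dump phase and the oplog-replay phase.
func runSync(ctx context.Context, client *mongo.Client) error {
	oplog := client.Database("local").Collection("oplog.rs")

	// 1. Before the full copy starts ("8 o'clock"), remember the newest oplog timestamp.
	var last struct {
		Ts primitive.Timestamp `bson:"ts"`
	}
	err := oplog.FindOne(ctx, bson.M{},
		options.FindOne().SetSort(bson.D{{Key: "$natural", Value: -1}})).Decode(&last)
	if err != nil {
		return err
	}

	// 2. Run the full copy; whatever changes in the meantime is covered by step 3.
	if err := runFullSync(ctx, client); err != nil {
		return err
	}

	// 3. Replay the oplog from the timestamp recorded in step 1 ("8 o'clock"),
	//    not from the moment the full copy finished ("9 o'clock").
	return runIncrSync(ctx, client, bson.M{"ts": bson.M{"$gte": last.Ts}})
}
```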

 

(2) Data flow diagram

[Figure: MongoShake internal data flow]

The figure above shows the internal structure of the data flow.

 

In the full-synchronization path shown in the box, the full data dump is read directly and the Fetcher pushes it to the writer. Documents are written together in batches to increase write throughput, and index building (Build Indexes) can be done either before or after the data is written.

 

In the incremental path, the Oplog is tailed continuously. The pulled entries pass through several stages such as Fetcher, Batcher, and Worker, and finally the Replayer writes them to MongoDB. Besides writing directly to MongoDB, users can also write to Kafka, TCP, RPC and other tunnels, and a user-defined mode is provided as well.

 

Around the core process there are supporting components: monitoring of the synchronization status, HA for real-time switchover, traffic monitoring, full-data verification, and a resume manager (checkpoint-based resumable transfer) that handles interruptions during incremental synchronization.

 

(3) Full synchronization

[Figure: full synchronization with concurrent pull and write]

The principle of full synchronization is that multiple threads pull and write concurrently.

 

Data is pulled concurrently, table by table: multiple tables can be pulled at the same time, documents within a table are aggregated into batches and pushed to a queue, and multiple writer threads concurrently take batches from the queue and write them to the destination library. Indexes can be created either before or after the full pull.
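The fan-out can be sketched roughly as below, again with the official Go driver and not MongoShake's real reader (a function-level sketch; standard imports such as context, log, strings, sync and the mongo-driver packages are assumed): one goroutine per source collection groups documents into batches and pushes them to a shared queue, while several writer goroutines drain the queue with InsertMany.

```go
// batch groups documents copied from one source namespace ("db.collection").
type batch struct {
	ns   string
	docs []interface{}
}

// fullSync: one reader goroutine per collection feeds a shared queue; several
// writer goroutines drain it and insert in bulk. Errors simply abort the sketch.
func fullSync(ctx context.Context, src, dst *mongo.Client, namespaces []string) {
	const numWriters, batchSize = 4, 128
	queue := make(chan batch, 64)

	var writers sync.WaitGroup
	for i := 0; i < numWriters; i++ {
		writers.Add(1)
		go func() {
			defer writers.Done()
			for b := range queue {
				db, coll, _ := strings.Cut(b.ns, ".")
				if _, err := dst.Database(db).Collection(coll).InsertMany(ctx, b.docs); err != nil {
					log.Fatal(err) // a real implementation would retry or report
				}
			}
		}()
	}

	var readers sync.WaitGroup
	for _, ns := range namespaces {
		readers.Add(1)
		go func(ns string) {
			defer readers.Done()
			db, coll, _ := strings.Cut(ns, ".")
			cur, err := src.Database(db).Collection(coll).Find(ctx, bson.M{})
			if err != nil {
				log.Fatal(err)
			}
			defer cur.Close(ctx)
			docs := make([]interface{}, 0, batchSize)
			for cur.Next(ctx) {
				var d bson.M
				if err := cur.Decode(&d); err != nil {
					log.Fatal(err)
				}
				docs = append(docs, d)
				if len(docs) == batchSize { // aggregate documents into a batch
					queue <- batch{ns, docs}
					docs = make([]interface{}, 0, batchSize)
				}
			}
			if len(docs) > 0 {
				queue <- batch{ns, docs}
			}
		}(ns)
	}

	readers.Wait() // all collections scanned
	close(queue)   // let the writers finish the remaining batches
	writers.Wait()
}
```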

 

(4) Incremental synchronization

[Figure: incremental synchronization pipeline]

Compared with full synchronization, incremental synchronization has more details to handle, so its implementation is more complicated.

 

First, the Fetcher pulls the Oplog from the source; the numbers in the figure above are Oplog sequence numbers. The fetched entries are pushed to multiple decoders for parsing, and the order of the data must be preserved during parsing to avoid mixing entries up. The parsed entries are then passed to the Batcher for sequential aggregation.

 

After batching, the entries are hashed either by id or by ns (the user can choose) and dispatched to multiple Worker threads. The data can then be checksummed and compressed as required and written to different tunnels. The Replayer reads what was written to the tunnel, performs the symmetric (reverse) operations, and writes the result to the destination library using batched writes.

 

(5) Hash principle

As mentioned above, there are two hash methods for dispatching from the Batcher to the Workers: by id and by ns. Each has its own advantages and disadvantages; the default is the ns method. A small routing sketch follows at the end of this subsection.

[Figure: hashing oplog entries to workers]

1. Hash in ns mode

 Hash by namespace: db.collection

 crc32(ns) % n

 Advantages

1) Order within the same table is guaranteed;

2) When there are many tables, concurrency is used effectively.

 Disadvantages

If writes are skewed toward a single table, synchronization performance degrades.

 

2. Hash by id

 Hash by the document's primary key, Oplog._id

 _id % n

 Advantages

1) The load across all workers is relatively balanced;

2) There is no skew problem for a single large table;

3) Operations on the same _id are guaranteed to stay ordered.

 Disadvantages

1) Operations on documents with different _id values are not ordered relative to each other, so transient inconsistent (phantom-read-like) states can be observed;

2) Concurrency is not possible when a unique index exists.

Here is an example of the second disadvantage.

[Figure: out-of-order writes under a unique index]

 

As shown in the figure above, the user has created a unique index on field a. If entries are dispatched concurrently by _id, the two operations writing a:1 may belong to documents with different _id values and land on different workers; the later a:1 can then be applied before the earlier one, so in the end only one record remains and the result no longer matches the source.

 

An optimization in version 2.4.12: some tables are hashed by _id while others are hashed by namespace.
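The two routing rules can be sketched like this (illustrative only; the exact hash MongoShake applies to _id may differ): ns mode applies crc32 to the namespace, id mode hashes the document's _id, and both take the result modulo the worker count.

```go
package main

import (
	"fmt"
	"hash/crc32"
	"hash/fnv"
)

const numWorkers = 8

// workerByNS routes an oplog entry by its namespace: crc32(ns) % n.
// All entries of one collection go to the same worker, so order inside
// a table is preserved.
func workerByNS(ns string) uint32 {
	return crc32.ChecksumIEEE([]byte(ns)) % numWorkers
}

// workerByID routes by the document _id (shown here as its string form;
// the hash MongoShake really uses may differ). Entries for the same _id
// stay ordered, but one table's entries spread across all workers.
func workerByID(id string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(id))
	return h.Sum32() % numWorkers
}

func main() {
	fmt.Println(workerByNS("shop.orders"))              // same table -> same worker
	fmt.Println(workerByID("5f1d7f3e2c8b4a0012ab34cd")) // same _id -> same worker
}
```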

 

(6) DDL processing

When a user creates a database, creates an index, or drops a table, these DDL operations cannot simply be replayed fully concurrently alongside the ordinary document operations.

For example, creating a table and then inserting data gives a completely different result from inserting data and then creating the table.

[Figure: global barrier around a DDL (command) oplog entry]

As shown in the figure above, C denotes a command operation, i.e. a DDL operation in MongoDB. When a C entry appears, a global barrier is placed in front of it: synchronization waits until all earlier data has been applied, then the DDL itself is executed, the barrier behind C is released, and the data after C is synchronized concurrently again. This preserves correctness while still allowing concurrency.
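The barrier can be pictured with the following sketch (a simplification, not MongoShake's actual scheduler; bson.M from the mongo driver is assumed for the change body): ordinary entries are dispatched concurrently, and when a command ("c") entry arrives the dispatcher waits for everything in flight, applies the DDL alone, and then resumes.

```go
// oplogEntry is a reduced oplog record: op is "i"/"u"/"d" for DML, "c" for a command (DDL).
type oplogEntry struct {
	Op string
	Ns string
	O  bson.M
}

// dispatch replays entries with DML concurrency but serializes every DDL
// behind a global barrier, as described above.
func dispatch(entries []oplogEntry, apply func(oplogEntry)) {
	var inflight sync.WaitGroup
	for _, e := range entries {
		if e.Op == "c" {
			inflight.Wait() // barrier: wait for all earlier entries to finish
			apply(e)        // run the DDL on its own
			continue        // then resume concurrent dispatch
		}
		inflight.Add(1)
		go func(e oplogEntry) {
			defer inflight.Done()
			apply(e) // DML entries may run concurrently (hashed to workers in reality)
		}(e)
	}
	inflight.Wait()
}
```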

 

(7) Checkpoint principle

 Checkpoint

1) Records the synchronized position, so that after a restart synchronization can resume from the last recorded position;

2) DML is idempotent, so replaying it repeatedly causes no problem;

3) For DDL, the checkpoint is force-flushed every time one is synchronized;

4) For performance, checkpoints are persisted periodically.

[Figure: LSN positions along the oplog stream]

Checkpointing is implemented in terms of LSNs. In the figure above, each box represents one Oplog entry. A minimal persistence sketch follows the list below.

 LSN

1) LSN: the Oplog sequence number, parsed from the ts timestamp field; unique within one replica set;

2) LSN_ACK: the largest LSN that has been confirmed as successfully written;

3) LSN_UNACK: the largest LSN that has been sent through the tunnel but not yet confirmed as written to the destination MongoDB;

  • For non-direct tunnels, the Oplog between LSN_ACK and LSN_UNACK may need to be retransmitted.

4) LSN_CKPT: the checkpoint, persisted by default in the mongoshake.ckpt_default collection of the source MongoDB;

  • After a restart, everything starting from LSN_CKPT is retransmitted.

5) Default invariant: LSN_UNACK >= LSN_ACK >= LSN_CKPT
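A minimal flush sketch is below, assuming the official Go driver; the collection name mongoshake.ckpt_default comes from the text, while the document layout and the 10-second interval are illustrative assumptions (function-level sketch, standard mongo-driver imports assumed).

```go
// saveCheckpoint persists the acked position (LSN_ACK) into
// mongoshake.ckpt_default on the source. The document layout here is illustrative.
func saveCheckpoint(ctx context.Context, src *mongo.Client, name string, ack primitive.Timestamp) error {
	coll := src.Database("mongoshake").Collection("ckpt_default")
	_, err := coll.UpdateOne(ctx,
		bson.M{"name": name},
		bson.M{"$set": bson.M{"ckpt": ack}},
		options.Update().SetUpsert(true))
	return err
}

// checkpointLoop flushes the checkpoint periodically (the "stored regularly" rule);
// a DDL would additionally force an immediate flush.
func checkpointLoop(ctx context.Context, src *mongo.Client, ack func() primitive.Timestamp) {
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if err := saveCheckpoint(ctx, src, "default", ack()); err != nil {
				log.Println("checkpoint flush failed:", err)
			}
		}
	}
}
```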

 

(8) Sharded cluster architecture: synchronization modes

A sharded cluster has multiple shards working in parallel, and there are two synchronization modes: Oplog mode, which requires the Balancer to be disabled, and Change Stream mode, which does not.

 

  1. Oplog mode synchronization (the Balancer must be disabled)

[Figure: per-shard oplog pulling]

Oplog mode starts multiple threads, each pulling from one shard. Its constraint is that the Balancer must be disabled.

 

  2. Change Stream mode synchronization (the Balancer can stay enabled)

[Figure: Change Stream pulling through mongos]

Change Stream mode pulls data directly from mongos without disabling the Balancer; it is the recommended incremental-synchronization mode for sharded-cluster users.
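A minimal Change Stream pull might look like the sketch below (official Go driver, not MongoShake's code; the mongos address is a placeholder): a single client connected to mongos watches the whole deployment, and the server keeps the event order correct across chunk migrations.

```go
package main

import (
	"context"
	"fmt"
	"log"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

func main() {
	ctx := context.Background()
	// Connect to mongos (placeholder address); the server merges the per-shard streams.
	client, err := mongo.Connect(ctx, options.Client().ApplyURI("mongodb://mongos:27017"))
	if err != nil {
		log.Fatal(err)
	}
	defer client.Disconnect(ctx)

	// Watch every database/collection; full documents are fetched for update events.
	stream, err := client.Watch(ctx, mongo.Pipeline{},
		options.ChangeStream().SetFullDocument(options.UpdateLookup))
	if err != nil {
		log.Fatal(err)
	}
	defer stream.Close(ctx)

	for stream.Next(ctx) {
		var event bson.M
		if err := stream.Decode(&event); err != nil {
			log.Fatal(err)
		}
		fmt.Println(event["operationType"], event["ns"], event["documentKey"])
	}
}
```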

 

What can go wrong if MongoShake synchronizes a sharded cluster while the Balancer is enabled? Here is an example.

1.png

As shown in the figure above, there are Shard1 and Shard2. A document is updated on Shard1 and then moved to Shard2 by a Move Chunk; later the document is updated again on Shard2, setting a=3.

 

The expected replay order is Shard1's oplog first, then Shard2's, ending with a=3.

 

If the Balancer is not disabled, the oplog on Shard2 may actually be applied first and the oplog on Shard1 afterwards, leaving a=1 and breaking consistency.

 

(9) Solving the sharded-cluster Move Chunk problem

[Figure: Change Stream cursors across shards and chunks]

Why does Change Stream pulling not require the Balancer to be disabled?

 

Change Stream is a feature introduced in MongoDB 3.6. As shown in the figure above, it solves the ordering problem in exactly this scenario: Shard1 may hold 3 chunks, each containing many Oplog entries, and the same is true for Shard2 and Shard3.

 

Change Stream creates one cursor per shard (3 cursors for 3 shards), and each cursor pulls data from its shard. If Chunk1 on Shard1 is moved to Shard2, Change Stream still preserves the order across the Move Chunk while pulling concurrently.

 

(10) Delayed synchronization

[Figure: delay controller holding back oplog entries]

MongoShake also supports delayed synchronization. For example, with a delay controller configured for a 1-hour delay, Oplog3 must wait one hour before it is replayed.
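The delay controller can be sketched as a simple gate (illustrative only; function-level sketch using the mongo driver's primitive.Timestamp): an entry is held back until its oplog timestamp is at least the configured delay old.

```go
// waitForDelay blocks until the given oplog timestamp is at least `delay` old,
// mimicking the delay controller described above (sketch only).
func waitForDelay(ctx context.Context, ts primitive.Timestamp, delay time.Duration) error {
	opTime := time.Unix(int64(ts.T), 0) // the oplog ts carries seconds since the epoch
	release := opTime.Add(delay)        // earliest moment this entry may be replayed
	if wait := time.Until(release); wait > 0 {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(wait):
		}
	}
	return nil
}
```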

 

(11) Multi-tunnel integration

[Figure: tunnel options after the collector]

As shown in the figure above, after pulling the data MongoShake can write it to several kinds of tunnels.

 

At present most users use the direct tunnel, where data is written straight to the destination after being pulled. Some users integrate through the Receiver, for example with the RPC or TCP tunnels, but that integration is currently fairly cumbersome. The Collector can also write directly to Kafka, from which users can run their own downstream consumption; each Kafka message is a single JSON document.
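A minimal consumer sketch for the Kafka tunnel is below, assuming the segmentio/kafka-go client and the one-JSON-document-per-message format mentioned above; the broker address, topic, and consumer group are placeholders, not MongoShake defaults.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"log"

	"github.com/segmentio/kafka-go"
)

func main() {
	// Broker address, topic, and group are placeholders for a real deployment.
	r := kafka.NewReader(kafka.ReaderConfig{
		Brokers: []string{"localhost:9092"},
		Topic:   "mongoshake",
		GroupID: "downstream-consumer",
	})
	defer r.Close()

	for {
		msg, err := r.ReadMessage(context.Background())
		if err != nil {
			log.Fatal(err)
		}
		// Each message carries one oplog entry serialized as a single JSON document.
		var entry map[string]interface{}
		if err := json.Unmarshal(msg.Value, &entry); err != nil {
			log.Printf("skip malformed message: %v", err)
			continue
		}
		fmt.Println(entry["ns"], entry["op"]) // downstream consumption goes here
	}
}
```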

 

(12) Monitoring: how incremental synchronization finds its current bottleneck

How do you monitor MongoShake? The following example shows how to locate the bottleneck of incremental synchronization.

[Figure: internal queues between Syncer, Worker, and Executor]

As shown in the figure above, Syncer, Worker, and Executor are groups of internal threads connected by queues; by watching how full each queue is, the user can judge where the synchronization bottleneck lies.

 

If queue 1 is often not full while the later queues are full, the bottleneck is the pull from the source. If queue 2 is full, the bottleneck is the data parsing between Syncer and Worker. If queue 3 is full, it is the path from Worker to Executor. If queue 4 is full, the write side is the bottleneck, and in practice the bottleneck usually shows up there.

 

For the specific monitoring metrics, see the monitoring document in the GitHub wiki.

 

3. Application scenarios

(1) Function overview

 Data synchronization

1) Disaster recovery: same-city and cross-region disaster recovery

2) Migration: across versions, deployment forms, clouds, and regions

3) Multi-active: data replicated in multiple directions, writable in every region

4) Analysis: real-time analysis, monitoring, alerting, etc. based on the data

5) Heterogeneous: converting MongoDB data into other database forms

6) Hybrid cloud: building a hybrid cloud platform

 

(2) Cases

1. AutoNavi (Amap)

[Figure: three-region, 6-way replication topology at AutoNavi]

AutoNavi has deployed three data centers, in North China, East China, and South China. The three databases are connected by six MongoShake replication links, and every region is writable: data written in North China is synchronized to East China and South China, data written in East China is synchronized to North China and South China, and data written in South China is synchronized to East China and North China. All of these links are implemented with MongoShake.

 

The following shows what happens when a disaster-recovery scenario occurs.

[Figure: two data centers with bidirectional MongoShake replication]

As shown in the figure above, the two data centers are synchronized in both directions through MongoShake. Above them sits a routing layer that hashes requests to decide which data center each piece of data is sent to.

[Figure: traffic switched to the surviving data center]

If the data center on the left goes down or becomes unserviceable, the routing layer can redirect the traffic that was going to the left data center to the one on the right, achieving disaster recovery.

 

2. Alibaba Cloud BLS service

[Figure: BLS bidirectional replication architecture]

The figure above shows the BLS service previously sold on Alibaba Cloud, which is built on MongoShake. A user purchases two databases and then the BLS service, which directly builds a bidirectional replication link between them for the user.

 

The implementation works as follows: each cluster has a Collector process responsible for pulling, and each Collector has a standby process for disaster recovery. The pulled data is written into a Kafka channel, and a Receiver pulls it from Kafka and writes it to the destination. A central Manager in the middle performs periodic keep-alive checks, runs the HA mechanism, and monitors the service.

 

3. Data analysis for an e-commerce customer

[Figure: one-way US-to-China synchronization for analysis]

The figure above shows a one-way synchronization link that an e-commerce customer built with MongoShake for a data-analysis scenario. The customer runs a MongoDB cluster in the United States for read and write services, pulls the MongoDB data to China, and serves reads and runs data analysis there.

 

4. Disaster recovery scenario for a game customer

[Figure: dual data centers behind SLB for a game customer]

As shown in the figure above, a game customer deploys a complete set of services and databases in each of two data centers, with each set kept within its own data center. A routing layer above them distributes access traffic and handles traffic switching.

 

With the data kept in sync, if the data center on the left fails, traffic can be switched directly to the data center on the right through the SLB, achieving disaster recovery.

 

5. Delayed synchronization and real-time rollback for a game customer

[Figure: delayed replica used for rollback]

The figure above shows a game customer using MongoShake for delayed synchronization.

 

In normal operation, user requests are written to the source database, and MongoShake maintains a delayed replica that lags by one or two hours. If the source database on the left fails, the user can switch traffic directly to the delayed database, rolling the data back to a point before the delay window, which amounts to a data rollback with second-level RTO.

 

6. Global cascading synchronization scenario

[Figure: cascading replication across four regions]

The figure above shows a global customer using MongoShake to build a worldwide cascading replication chain: from Singapore to Beijing, from Beijing to Mumbai, and from Mumbai to Frankfurt.

 

7. Monitoring and analysis scenarios

[Figure: MongoShake feeding Kafka for monitoring and analysis]

The figure above shows a monitoring and analysis scenario: the user pulls data out of the database with MongoShake and writes it to Kafka; a Receiver reads it from Kafka and pushes it to downstream monitoring or analysis platforms.

Original article: blog.csdn.net/weixin_43970890/article/details/115310666