Building a highly available mongodb cluster (3) - in-depth replica set internal mechanism

In the previous article "Building a highly available mongodb cluster (2) - replica set", the configuration of a replica set was introduced. This article takes a deep dive into the internal mechanism of the replica set. Let's start with a few questions about replica sets:

  • How is the primary node elected during replica set failover? Can we manually intervene to step a master node down?
  • Officially, the number of replica set members is preferably odd. Why?
  • How do mongodb replica sets synchronize? What happens if synchronization lags? Will there be inconsistencies?
  • Does mongodb failover ever happen automatically for no apparent reason? What conditions trigger it? Could frequent triggering increase the system load?

The Bully algorithm  The mongodb replica set's failover capability comes from its election mechanism, which uses the Bully algorithm to easily select a master node from among the distributed nodes. In a distributed cluster architecture there is usually a so-called master node, which can serve many purposes, such as caching the cluster's machine metadata or acting as the access entry point for the cluster. If there is a master node anyway, why do we need a Bully algorithm? To understand this, let's first look at these two architectures:

  1. Architecture with a designated master node. This kind of architecture generally declares one node to be the master and the other nodes to be slaves, as with our commonly used mysql. However, as we said in the first article, if the master node in such a setup goes down, the whole cluster has to be operated on manually: promoting a new master node or recovering data from a slave node is not very flexible.

    [Figure: architecture with a fixed, manually designated master node]

  2. Architecture without a designated master node, where any node in the cluster can become the master. mongodb adopts this architecture; once the master node goes down, another slave node automatically takes over and becomes the master, as shown below:

    [Figure: mongodb failover, a slave node automatically takes over as the master node]

Well, here is the problem: since all nodes are equal, once the master node goes down, how do we choose the next node to become the master? This is the problem the Bully algorithm solves.

So what is the Bully algorithm? The Bully algorithm is a coordinator (master node) election algorithm. The main idea is that every member of the cluster can declare itself the master node and notify the other nodes; each of the other nodes can either accept the claim or reject it and enter the competition for master itself. Only a node accepted by all the other nodes can become the master. Nodes judge who should win based on some attribute, which can be a static ID or an up-to-date metric such as the last transaction ID (the node with the newest data wins). For details, please refer to the coordinator election section of the NoSQL Database Distributed Algorithm article and Wikipedia's explanation.
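To make the idea concrete, here is a minimal, illustrative sketch of a Bully-style selection in mongo-shell JavaScript. It is not mongodb's actual implementation and skips the message passing; the node objects and the lastOpTime field are invented for the example, which simply lets the live node with the newest metric win:

    // Toy Bully-style election: the surviving node with the "biggest" metric wins.
    // (Illustration only; the node structure here is made up for the example.)
    function electPrimary(nodes) {
        var candidates = nodes.filter(function (n) { return n.alive; });
        if (candidates.length === 0) return null;
        // Every candidate "bullies" the current claimant: a higher metric takes over.
        return candidates.reduce(function (best, n) {
            return (n.lastOpTime > best.lastOpTime) ? n : best;
        });
    }

    var nodes = [
        { id: 0, lastOpTime: 1001, alive: true },
        { id: 1, lastOpTime: 1005, alive: true },   // newest data, should win
        { id: 2, lastOpTime: 1003, alive: false }   // down, cannot claim
    ];
    print("elected: node " + electPrimary(nodes).id);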

How does mongodb conduct elections?  The official description is as follows:

We use a consensus protocol to pick a primary. Exact details will be spared here but that basic process is:

  1. get maxLocalOpOrdinal from each server.
  2. if a majority of servers are not up (from this server’s POV), remain in Secondary mode and stop.
  3. if the last op time seems very old, stop and await human intervention.
  4. else, using a consensus protocol, pick the server with the highest maxLocalOpOrdinal as the Primary.

Roughly translated: a consensus protocol is used to pick the primary. The basic steps are:

  1. Get the last operation timestamp of each server node. Every mongodb instance has an oplog mechanism that records its operations; this makes it easy to compare with the master server whether the data is in sync, and it can also be used for error recovery (a shell example follows this list).
  2. If most of the servers in the cluster are down, the surviving nodes all stay in the secondary state and stop; no election takes place.
  3. If the elected master node, or the last sync time of all the slave nodes in the cluster, looks very old, stop the election and wait for human intervention.
  4. If none of the above applies, elect the server node with the most recent last-operation timestamp (to ensure its data is the newest) as the master node.
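You can look at the raw material these steps work from in the shell: rs.status() reports each member's state and last applied operation time (the exact field names, such as optimeDate, may differ slightly between mongodb versions):

    // Run on any member: compare the members' last applied operation times.
    rs.status().members.forEach(function (m) {
        print(m.name + "  state=" + m.stateStr + "  last op=" + m.optimeDate);
    });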

A consensus protocol (in fact the Bully algorithm) is mentioned here, which is somewhat different from a database consistency protocol. The consensus protocol mainly emphasizes that everyone reaches agreement through some mechanism, while the consistency protocol emphasizes the sequential consistency of operations, for example whether reading and writing the same piece of data at the same time produces dirty data. Among consensus protocols there is a classic distributed algorithm called the "Paxos algorithm", which will be introduced later.

One problem remains from the above: what if the last operation times of all the slave nodes are the same? In that case, whichever node is fastest to claim the role first becomes the master node.

Election triggering conditions  Elections are not triggered all the time; they are triggered in the following situations.

  1. When initializing a replica set.
  2. The replica set loses contact with the primary, possibly due to a network problem.
  3. The master node is down.

There is also a prerequisite for an election: the number of nodes taking part must be greater than half of the total number of nodes in the replica set. If it is not, all nodes remain read-only and the following message appears in the log:

can't see a majority of the set, relinquishing primary

Can the master node be made to step down by human intervention? The answer is yes.

  1. The master node can be stepped down with the replSetStepDown command. Log in to the master node and run:
    db.adminCommand({replSetStepDown : 1})

    If that does not take effect, you can use the force option:

    db.adminCommand({replSetStepDown : 1, force : true})

    Or use rs.stepDown(120) to achieve the same effect. The number means the node cannot become the master node again for that many seconds after stepping down. (A small shell sketch covering these manual-intervention options follows at the end of this list.)

  2. Set a slave node to have a higher priority than the master node.
    First check the current priorities in the cluster with the rs.conf() command. The default priority is 1 and is normally not shown; it is pointed out here.

     

    rs.conf();
    {
            "_id" : "rs0",
            "version" : 9,
            "members" : [
                    {
                            "_id" : 0,
                            "host" : "192.168.1.136:27017"
                    },
                    {
                            "_id" : 1,
                            "host" : "192.168.1.137:27017"
                    },
                    {
                            "_id" : 2,
                            "host" : "192.168.1.138:27017"
                    }
            ]
    }

    Now let's configure things so that the host with _id 1 is preferred as the master node.

    cfg = rs.conf()
    cfg.members[0].priority = 1
    cfg.members[1].priority = 2
    cfg.members[2].priority = 1
    rs.reconfig(cfg)

    Then run the rs.conf() command again to see that the priority has been set successfully; a master node election will also be triggered.

    {
            "_id" : "rs0",
            "version" : 9,
            "members" : [
                    {
                            "_id" : 0,
                            "host" : "192.168.1.136:27017"
                    },
                    {
                            "_id" : 1,
                            "host" : "192.168.1.137:27017",
                            "priority" : 2
                    },
                    {
                            "_id" : 2,
                            "host" : "192.168.1.138:27017"
                    }
            ]
    }

    What can you do if you do not want a slave node to become the master node?
    a. Use rs.freeze(120) to freeze it for the specified number of seconds, during which it cannot be elected master.
    b. Set the node to the Non-Voting type as described in the previous article.

  3. When the master node cannot communicate with most of the slave nodes. Just unplug the master node's network cable, heh :)

    Priority can also be used like this: what if we do not want to set up a hidden node, but want a secondary-type node as a backup that still should not become the master node? See the figure below: three nodes are spread across two data centers, and the node in data center 2 has priority 0, so it cannot become the master node, but it can still take part in elections and data replication (see the sketch after the figure). The architecture is quite flexible!

    [Figure: three nodes across two data centers; the priority-0 node in data center 2 can vote and replicate but never becomes the master node]
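Putting the manual-intervention options from this list together, here is a hedged mongo-shell sketch. The member index and the 120-second freeze are only examples, and rs.reconfig() must be run on the master node:

    // 1. After db.adminCommand({replSetStepDown : 1}) on the old master,
    //    confirm on that node that its role has changed.
    var info = db.isMaster();
    print("ismaster: " + info.ismaster + ", current primary: " + info.primary);

    // 2. On a secondary you do not want elected during the next 2 minutes:
    rs.freeze(120);

    // 3. Keep a member's vote and data but never let it become master:
    //    set its priority to 0 and reconfigure (run on the master node).
    var cfg = rs.conf();
    cfg.members[2].priority = 0;   // e.g. the data-center-2 member in the figure
    rs.reconfig(cfg);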

Odd numbers  The official recommendation is that a replica set have an odd number of members, with at most 12 replica set members and at most 7 members taking part in elections. The limit of 12 members exists because there is no need to replicate one copy of the data that many times; too many backups only increase the network load and drag down cluster performance. The limit of 7 voting members exists because with too many voting nodes the internal election mechanism may fail to pick a master node within a minute; moderation in all things. The numbers "12" and "7" are reasonable, since they were defined through official performance testing. For the other limits, see the official document "MongoDB Limits and Thresholds". But I never quite understood why the cluster as a whole should have an odd number of members; testing shows that a cluster with an even number of members also runs, see http://www.itpub.net/thread-1740982-1-1.html. Later a stackoverflow post suddenly made it click: mongodb is designed as a distributed database that can span IDCs, so we should look at it in that larger context.

Suppose four nodes are split between two IDCs, two machines in each, as in the figure below. This creates a problem: if the network between the two IDCs is cut, which happens easily over a WAN, then, as mentioned in the election section above, a new round of elections starts as soon as the master node loses contact with a majority of the cluster. But each side of this replica set has only two nodes, and an election requires more than half of the nodes to take part, so no node in the cluster can be elected and everything stays read-only. With an odd number of nodes this problem does not occur: with 3 nodes, any 2 surviving nodes can hold an election; 3 out of 5, 4 out of 7, and so on.
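A quick back-of-the-envelope check of the majority rule makes the odd-number recommendation obvious (the numbers below are just the arithmetic, not output from mongodb):

    // An election needs floor(N/2) + 1 reachable voting members.
    [3, 4, 5, 7].forEach(function (n) {
        var majority = Math.floor(n / 2) + 1;
        print(n + " members: " + majority + " needed to elect a primary");
    });
    // With 4 members split 2/2 across IDCs, neither side reaches 3, so both go read-only.
    // With 3 members (2 + 1), the two-member side still reaches its majority of 2.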

[Figure: four nodes split 2/2 between two IDCs; after a network partition neither side has a majority]

Heartbeat  To sum up, the cluster needs to keep up a certain amount of communication to know which nodes are alive and which are down. Every mongodb node sends a ping to the other members of the replica set every two seconds; if another node does not respond within 10 seconds, it is marked as unreachable. Each node internally maintains a status map recording every member's current role, oplog timestamp and other key information. If the node is the master, then besides maintaining the map it also checks whether it can still communicate with a majority of the cluster; if it cannot, it demotes itself to a read-only secondary.
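The result of this heartbeat bookkeeping is visible from the shell; rs.status() shows each member's health, state and last heartbeat as the current node sees them (field availability varies slightly by version, and the entry for the node itself has no lastHeartbeat):

    // Run on any member: what does this node currently believe about its peers?
    rs.status().members.forEach(function (m) {
        print(m.name + "  health=" + m.health + "  state=" + m.stateStr +
              (m.lastHeartbeat ? "  lastHeartbeat=" + m.lastHeartbeat : "  (self)"));
    });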

Synchronization  Replica set synchronization is divided into initial sync and "keep" replication. Initial sync means copying all data from the master node; if the master holds a lot of data, this can take quite a while. "Keep" replication refers to the real-time, generally incremental, synchronization between nodes after the initial sync. Initial sync is not triggered only on the first join; it is triggered in the following two cases:

  1. A secondary joins for the first time; this one is a given.
  2. The amount by which a secondary lags behind exceeds the size of the oplog, in which case it is also fully re-copied.

So how big is the oplog? As mentioned earlier, the oplog records data operations, and a secondary copies entries from the oplog and replays the operations locally. But the oplog is itself a mongodb collection, stored in local.oplog.rs. It is a capped collection, that is, a fixed-size collection, and once it is full the oldest entries are overwritten by new data. Therefore, take care to set an appropriate oplogSize for cross-IDC replication to avoid frequent full resyncs in production. The size can be set with --oplogSize; on 64-bit linux and windows the oplog defaults to 5% of the free disk space.
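To see how big the oplog currently is, and how large a time window it covers (which tells you how far a secondary can fall behind before needing a full resync), you can check from the shell:

    // Run in the mongo shell on any member.
    var oplogStats = db.getSiblingDB("local").oplog.rs.stats();
    print("oplog max size (MB): " + (oplogStats.maxSize / 1024 / 1024));
    db.printReplicationInfo();   // prints the oplog size and the time range it covers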

Synchronization does not only happen from the master node. Suppose the cluster has 3 nodes: node 1 is the master node in IDC1, and nodes 2 and 3 are in IDC2. During initialization, nodes 2 and 3 sync their data from node 1. Afterwards, nodes 2 and 3 follow the principle of proximity and replicate from a replica set member within their own IDC, as long as one node there replicates data from node 1 in IDC1.
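You can check which member each node is currently syncing from, and suggest a different source if needed. rs.status() reports the sync source (as syncSourceHost in newer versions, syncingTo in older ones), and rs.syncFrom() asks a member to pull from a specific host; the address below is just a placeholder:

    // Show where each member is pulling its oplog from.
    rs.status().members.forEach(function (m) {
        var src = m.syncSourceHost || m.syncingTo;
        if (src) print(m.name + " syncs from " + src);
    });

    // Suggest a nearby member (e.g. in the same IDC) as this node's sync source.
    // rs.syncFrom("192.168.1.137:27017");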

Also note the following when setting up synchronization:

  1. Secondary does not replicate data from delayed and hidden members.
  2. As long as two members need to synchronize with each other, their buildIndexes settings must match, whether true or false. buildIndexes mainly controls whether this node's data is used for queries; the default is true (a hedged example follows this list).
  3. If the synchronization operation does not respond for 30 seconds, a node will be re-selected for synchronization.
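For reference, buildIndexes is set per member when the member is added, and as far as I know it can only be false on a priority-0 member; a hedged example (the host below is hypothetical):

    // Add a backup-only member that does not build secondary indexes.
    // buildIndexes generally must be set when the member is first added
    // and can only be false when priority is 0.
    rs.add({
        _id : 3,
        host : "192.168.1.139:27017",   // hypothetical host for illustration
        priority : 0,
        buildIndexes : false
    });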

At this point, the problems mentioned earlier in this chapter have all been solved, and I have to say that the design of mongodb is really powerful!

Continuing with the questions from the previous article:

  • Can the master node automatically switch connections when it hangs? Manual switching is currently required.
  • How to solve the excessive read and write pressure on the master node?

There are two more problems to be solved later:

  • The data on each slave node is a full copy of the database. Will the pressure on the slave node be too high?
  • Can automatic expansion be achieved when the data pressure is so great that the machine cannot support it?
 
