Introduction to MongoDB elections

This article describes when MongoDB starts an election, how the election works, and some of the parameters involved.

An election is triggered when a replica set is initiated or when the primary fails. The process requires no manual intervention.

    An election takes some time, and during it the replica set cannot accept writes. This holds even if the old primary is still alive but cut off from the secondaries by a network partition. All members, including the old primary, are read-only for the duration. If fewer than a majority of members are active, no primary can be elected; until enough members rejoin, the whole replica set stays read-only and cannot accept write requests.

Several factors affect an election:

    1) Heartbeats: All members of a replica set keep heartbeat connections to each other and send a heartbeat every 2 seconds. If no reply is received within 10 seconds, the member is marked as "unavailable".

    2) Priority comparison: Each member has a priority value, 1 by default, and members tend to vote for the member with the highest priority. A member with priority 0 cannot be elected primary and cannot initiate an election. As long as the current primary has the highest priority, or no secondary with a higher priority holds the latest oplog data, the set does not trigger an election. If a member with a higher priority joins the set, it first catches up with the current primary, then starts a new round of election and is elected primary.

    3) Optime: The timestamp of the last operation from the primary's oplog that a member has applied. The primary generates this timestamp, and every operation recorded in the oplog carries one. A precondition for becoming primary is that the member holds the highest optime among all reachable members.

    4) Connections: To become primary, a member must be able to connect to a majority of the members in the set; if it cannot reach enough members, it cannot be elected. When deciding who becomes primary, "majority" refers to the total number of votes rather than the number of members, because each member can be given its own "votes" setting. For example, with three members holding one vote each, an election can proceed as long as at least two members can communicate with each other (a majority of 3 is 2). If two members are unavailable, no election can succeed; if the remaining member is the primary, it steps down and becomes a secondary, and the whole replica set is read-only.

    5) Network partitions: A network partition splits the set into groups of members that cannot communicate with each other. Because an election is based on a majority (a member must connect to a majority, counted in votes), no election can succeed on a side of the partition that cannot form a majority. To reduce this risk, we recommend deploying the members of the set across multiple physical network segments.
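
The heartbeat and majority rules above can be sketched in a few lines of Python. This is only an illustrative model, not MongoDB's implementation: the `Member` class, the function names, and the use of plain seconds for time are assumptions made for the example. The 10-second timeout is the value quoted in the text.

```python
from dataclasses import dataclass
from typing import List

HEARTBEAT_TIMEOUT_S = 10  # from the text: no reply within 10 seconds


@dataclass
class Member:
    """Hypothetical view of one replica set member (illustration only)."""
    name: str
    votes: int = 1              # default: one vote per member
    last_reply_at: float = 0.0  # time of the last heartbeat reply, in seconds


def is_available(member: Member, now: float) -> bool:
    """A member counts as available if it replied within the timeout."""
    return now - member.last_reply_at <= HEARTBEAT_TIMEOUT_S


def has_majority(members: List[Member], now: float) -> bool:
    """True if the available members hold a strict majority of all votes."""
    total_votes = sum(m.votes for m in members)
    available_votes = sum(m.votes for m in members if is_available(m, now))
    return available_votes > total_votes / 2


members = [
    Member("a", last_reply_at=100.0),
    Member("b", last_reply_at=99.0),
    Member("c", last_reply_at=80.0),  # silent for 20 s: unavailable
]
two_of_three = has_majority(members, now=100.0)

members[1].last_reply_at = 85.0       # "b" has now been silent for 15 s too
one_of_three = has_majority(members, now=100.0)
```

With two of three votes reachable the set can still elect a primary; with only one it cannot, which matches the three-member example in item 4.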

 

When elections are triggered:

1) The replica set is being initiated and has no primary.

2) A secondary loses its connection to the primary and calls for an election. If a majority of the secondaries have also lost their connection to the primary, the election can succeed; otherwise the members that still see the primary veto it.

3) The primary steps down.

Note:

A member with priority 0 never initiates an election, even if it loses its connection to the primary. It can only vote for, or veto, an election called by another member.

A primary can be made to step down with the rs.stepDown() method or the replSetStepDown command. It also steps down when a secondary with a higher priority joins, or when it cannot connect to a majority of members. After the primary steps down, a new round of election is triggered.

In addition, changing the replica set configuration with rs.reconfig() may also trigger an election. When the primary steps down, it closes all client connections, so clients are temporarily unable to write.

 

Election rules:

    Each member has a priority property that expresses its eligibility to become primary, and the election favors the member with the highest priority. By default every member has priority 1, so all members have the same chance of becoming primary. To make a particular member more likely to become primary, raise its priority value. By default each member also holds one vote (votes); a member that receives the votes of a majority becomes primary. A non-voting member has votes set to 0, and currently no member's votes may be greater than 1. (That is, each member has either one vote or none.)
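
As a rough illustration of these rules, the sketch below models the members array of a replica set configuration as Python dictionaries. The field names `_id`, `host`, `priority`, and `votes` are those of the replica set configuration document; the host names and the two helper functions are hypothetical and exist only for this example.

```python
def validate_votes(members):
    """Raise ValueError unless every member has votes equal to 0 or 1."""
    for m in members:
        if m.get("votes", 1) not in (0, 1):
            raise ValueError(f"member {m['_id']}: votes must be 0 or 1")


def preferred_candidate(members):
    """Return the electable member with the highest priority.

    Members with priority 0 are skipped; on a tie the first listed wins.
    Returns None if no member is electable.
    """
    electable = [m for m in members if m.get("priority", 1) > 0]
    return max(electable, key=lambda m: m.get("priority", 1), default=None)


# Hypothetical three-member configuration.
members = [
    {"_id": 0, "host": "db0.example.net:27017"},                 # defaults: priority 1, votes 1
    {"_id": 1, "host": "db1.example.net:27017", "priority": 2},  # favored as primary
    {"_id": 2, "host": "db2.example.net:27017", "priority": 0, "votes": 0},  # non-voting
]

validate_votes(members)
preferred = preferred_candidate(members)  # the member with _id 1
```

The helper only ranks by priority; a real election also depends on optime and connectivity, as described earlier.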

 

Election veto:

    Any member can veto an election, including non-voting members. A member vetoes the candidate in the following cases:

1) The candidate does not belong to the voter's replica set (it is not declared in the configuration).

2) The candidate's data is stale.

3) The candidate's priority is lower than that of another known member.

4) The current primary holds a higher optime than the candidate.

If the member with the highest priority does not hold the highest optime, it first catches up with the secondary that has the highest optime and only then becomes primary. This implies that the highest-priority member eventually becomes primary.
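
The four veto cases can be modeled as a single decision function. The sketch below is a simplified illustration, not MongoDB's code: `MemberView`, `veto_reason`, the integer optime, and the `max_lag` threshold that defines "stale" are all assumptions introduced for the example.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class MemberView:
    """What a voter knows about a member (hypothetical, simplified)."""
    name: str
    priority: float
    optime: int              # simplified to a single increasing integer
    is_primary: bool = False


def veto_reason(candidate: MemberView, config_names: Set[str],
                known: List[MemberView], max_lag: int = 10) -> Optional[str]:
    """Return why the voter vetoes the candidate, or None if it does not."""
    others = [m for m in known if m.name != candidate.name]

    # Case 1: the candidate is not declared in the voter's configuration.
    if candidate.name not in config_names:
        return "not in replica set configuration"

    # Case 2: the candidate is too far behind the freshest known member.
    # "Too far" is modeled by max_lag, an assumption for this sketch.
    newest = max([m.optime for m in others], default=candidate.optime)
    if newest - candidate.optime > max_lag:
        return "data is stale"

    # Case 3: another known member has a higher priority.
    if any(m.priority > candidate.priority for m in others):
        return "lower priority than another member"

    # Case 4: the current primary is ahead of the candidate.
    if any(m.is_primary and m.optime > candidate.optime for m in others):
        return "current primary has a higher optime"

    return None


config = {"a", "b", "c"}
known = [
    MemberView("a", priority=1, optime=100, is_primary=True),
    MemberView("b", priority=1, optime=100),
    MemberView("c", priority=2, optime=98),
]

behind = veto_reason(MemberView("c", 2, 98), config, known)      # vetoed by case 4
caught_up = veto_reason(MemberView("c", 2, 100), config, known)  # not vetoed
```

The last two calls mirror the paragraph above: the highest-priority member `c` is vetoed while it lags behind the primary and is accepted once it has caught up.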

 

Rollback

    A rollback arises as follows. Before it fails, the primary accepts writes that it has not managed to replicate to a majority of secondaries. A new primary is then elected and accepts further write requests. When the old primary rejoins the set, some of its data does not line up with the current primary's, so the old primary must roll back those divergent writes to become consistent with the new primary. Rollbacks should normally be avoided, because they are a rather dangerous form of data inconsistency. The rolled-back data is stored in the "rollback" directory under the dbPath path.
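
To make the divergence concrete, here is a small sketch that compares two oplogs modeled as plain Python lists. It only illustrates the idea of a common point followed by divergent entries; the function name and the list representation are assumptions, and the real rollback procedure is considerably more involved.

```python
def split_at_common_point(old_primary_oplog, new_primary_oplog):
    """Return (shared_prefix, entries_to_roll_back) for the old primary."""
    common = 0
    for old_entry, new_entry in zip(old_primary_oplog, new_primary_oplog):
        if old_entry != new_entry:
            break
        common += 1
    return old_primary_oplog[:common], old_primary_oplog[common:]


old_primary = ["w1", "w2", "w3", "w4", "w5"]  # w4 and w5 never reached a secondary
new_primary = ["w1", "w2", "w3", "w6"]        # w6 was accepted after the election

shared, rolled_back = split_at_common_point(old_primary, new_primary)
```

Here `shared` holds the three writes both members agree on, and `rolled_back` holds `w4` and `w5`, the writes the old primary has to undo before it can follow the new primary.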

    By default (in MongoDB versions before 5.0), the client write concern is {w: 1}: the result is returned to the client as soon as the write succeeds on the primary. If the primary then fails before the data has been replicated to the secondaries, a rollback occurs. To avoid rollbacks, the client can use {w: "majority"}, in which case the result is returned only after the write has succeeded on a majority of members.
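
The difference between the two write concerns comes down to how many members must confirm a write before the client is told it succeeded. The sketch below models only that counting rule; the function names are hypothetical, and the majority is computed over a simple count of voting members, which is a simplification.

```python
def acks_required(w, voting_members):
    """Number of members that must confirm a write for write concern w."""
    if w == "majority":
        return voting_members // 2 + 1
    return int(w)


def is_acknowledged(w, acks, voting_members):
    """True if `acks` confirmations (primary included) satisfy w."""
    return acks >= acks_required(w, voting_members)


# Three-member set, write applied on the primary only (one confirmation).
w1_ok = is_acknowledged(1, acks=1, voting_members=3)
majority_ok = is_acknowledged("majority", acks=1, voting_members=3)

# After one secondary has replicated the write (two confirmations).
majority_ok_later = is_acknowledged("majority", acks=2, voting_members=3)
```

With {w: 1} the client is answered while the write exists on a single member, which is exactly the window in which a failover leads to a rollback. With {w: "majority"} the answer waits until the write is on enough members to survive an election.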

 

Origin www.cnblogs.com/gentlemanhai/p/11672225.html