Automatic failover
In a failover scenario, our concerns are:
How does the standby node detect that the primary node has failed?
How can the impact of failover on the business be reduced?
One factor that affects the detection mechanism is the heartbeat. After the replica set is established, each member node starts a timer and continuously sends heartbeats to the other members. The parameter involved
here is heartbeatIntervalMillis, the heartbeat interval, which defaults to 2 seconds. If a heartbeat succeeds, heartbeats continue to be sent at the 2-second interval; if a heartbeat fails, it is retried immediately until the heartbeat succeeds again.
Another important factor is the election timeout detection: a single heartbeat failure does not immediately trigger a re-election.
In addition to the heartbeat, each member node also starts an election timeout detection timer, which runs at an interval of 10 seconds by default and can be set with the electionTimeoutMillis parameter. If a heartbeat response succeeds, the last electionTimeout task is cancelled (guaranteeing that no election is initiated) and a new round of electionTimeout scheduling begins. If heartbeat responses keep failing for a long time, the electionTimeout task fires, causing the standby node to initiate an election and become the new primary node. In MongoDB's implementation, the actual election timeout detection period is slightly longer than the electionTimeoutMillis setting: a random offset is added, making the period roughly 10-11.5 seconds. This design staggers the moments at which multiple standby nodes call elections and improves the success rate.
Therefore, the electionTimeout task triggers an election only when all of the following conditions are met:
(1) The current node is a standby node.
(2) The current node has election privileges.
(3) It still has not received a successful heartbeat from the primary node within the detection period.
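Both intervals live in the settings section of the replica set configuration and can be inspected or adjusted from the mongo shell. A minimal sketch, assuming you are connected to the Primary (the values shown are simply the defaults mentioned above):
# Run on the Primary: defaults are a 2s heartbeat interval and a 10s election timeout
cfg = rs.conf()
cfg.settings.heartbeatIntervalMillis = 2000
cfg.settings.electionTimeoutMillis = 10000
rs.reconfig(cfg)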
Business Impact Assessment
When the active/standby roles in the replica set switch, there is a short period with no primary node, during which business write operations cannot be accepted.
If the switch is caused by a failure of the primary node, all read and write operations on that node will time out. If you use a MongoDB 3.6 or later driver, you can reduce the impact by enabling retryable writes (retryWrites).
# Enable retryable writes in MongoDB drivers
mongodb://localhost/?retryWrites=true
# mongo shell
mongo --retryWrites
If the primary node is forcibly powered off, the whole failover process takes longer: the failure may only be detected by the other nodes after the election timer expires, and this time window is generally within 12 seconds. In practice, however, the estimate of lost service calls should also account for the time the client or mongos needs to detect the change of roles in the replica set (in reality this can take 30 seconds or more).
For very important businesses, it is recommended to implement some protection strategies at the business level, such as designing a retry mechanism.
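For example, a business-level retry might look like the following sketch (illustrative only; the collection name and retry policy are assumptions):
// Retry a write a few times with a short back-off, e.g. to ride out a failover window
function writeWithRetry(doWrite, maxRetries) {
    for (var i = 0; i <= maxRetries; i++) {
        try {
            return doWrite();                  // attempt the write
        } catch (e) {
            if (i === maxRetries) throw e;     // give up after the last attempt
            sleep(1000);                       // wait briefly before retrying
        }
    }
}
writeWithRetry(function() { return db.orders.insertOne({status: "new"}); }, 3)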
Thinking: How to gracefully restart the replica set?
If you want to restart the replica set without losing data, a more graceful procedure is as follows:
1. Restart all the Secondary nodes in the replica set one by one
2. Send the rs.stepDown() command to the Primary and wait for it to be demoted to Secondary (see the sketch after this list)
3. Restart the demoted Primary
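A minimal sketch of step 2, assuming the shell is connected to the current Primary (the 60-second value is only an example):
# Ask the Primary to step down and not seek re-election for 60 seconds
rs.stepDown(60)
# Confirm the node's role before restarting it (myState 2 means SECONDARY)
rs.status().myState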
Replica set data synchronization mechanism
In the replica set architecture, data is synchronized between the primary node and the standby nodes through the oplog.
The oplog is a special capped collection: whenever a write operation completes on the primary node, a corresponding record is written to the oplog collection, and the standby nodes continuously pull new records from this oplog and replay them locally to achieve data synchronization.
What is the oplog
The MongoDB oplog is a collection in the local database that stores the incremental logs generated by write operations (similar to the binlog in MySQL).
It is a capped collection (fixed-size collection): when the configured maximum size is exceeded, the oldest historical data is automatically deleted. MongoDB has special optimizations for oplog deletion to improve its efficiency.
The primary node generates new oplog entries, and the standby nodes keep their state consistent with the primary by copying the oplog and applying the entries.
View the oplog
use local
db.oplog.rs.find().sort({$natural:-1}).pretty()
local.system.replset: Used to record the members of the current replica set.
local.startup_log: Used to record the startup log information of the local database.
local.replset.minvalid: Used to record the tracking information of the replica set, such as the fields required for initial synchronization.
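These collections can be inspected directly from the local database; a minimal sketch:
use local
show collections
# Current replica set configuration and members
db.system.replset.findOne()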
An oplog record contains the following main fields:
ts: operation time; the current timestamp plus a counter, and the counter is reset every second
v: oplog version information
op: operation type:
i: insert operation
u: update operation
d: delete operation
c: execute a command (such as createDatabase, dropDatabase)
n: no operation, for special purposes
ns: the namespace (collection) being operated on
o: the operation content
o2: the query condition of the operation; only update operations contain this field
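Putting these fields together, a hypothetical insert entry might look like the following (values are illustrative):
{
    "ts" : Timestamp(1646222226, 1),
    "v" : NumberLong(2),
    "op" : "i",
    "ns" : "test.coll",
    "o" : { "_id" : 1, "x" : [ 1, 2, 3 ] }
}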
The ts field is the timestamp at which the oplog entry was generated, also called the optime.
The optime is the key to the standby node's incremental log synchronization; it keeps the oplog ordered. It consists of two parts:
The current system time, i.e. the number of seconds since the UNIX epoch (32 bits).
An integer counter; a different time value resets the counter (32 bits).
The optime is of the BSON Timestamp type, which is generally reserved for internal use by MongoDB. Since the oplog guarantees ordering at the node level, the standby node can pull it by polling, using the tailable cursor technique.
Each standby node maintains its own offset, i.e. the optime of the last entry pulled from the primary node. When synchronizing, it uses this optime to query the primary node's oplog collection.
To avoid repeatedly opening new query connections, the cursor can be kept open after the first query (by marking it as tailable). As long as new records appear in the oplog, the standby node can obtain them through the same request channel. A tailable cursor is only allowed when the queried collection is a capped collection.
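A tailable pull against the oplog can be sketched in the legacy mongo shell roughly as follows (the starting optime and the filter are illustrative; real drivers achieve the same effect through their own cursor options):
use local
# Offset of the last entry already applied (hypothetical value)
var lastOptime = Timestamp(1646223051, 1)
var cursor = db.oplog.rs.find({ts: {$gt: lastOptime}}).addOption(DBQuery.Option.tailable).addOption(DBQuery.Option.awaitData)
# New oplog entries keep arriving through the same cursor/channel
while (cursor.hasNext()) { printjson(cursor.next()) }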
The size of the oplog collection
The size of the oplog collection can be set with the parameter replication.oplogSizeMB. For 64-bit systems, the default oplog size is:
oplogSizeMB = min(5% of free disk space, 50GB)
For most business scenarios it is difficult to estimate an appropriate oplogSize up front. Fortunately, since version 4.0 MongoDB provides the replSetResizeOplog command, which can change the oplogSize dynamically without restarting the server.
# Change the oplog size of this replica set member to 60GB; the specified size must be greater than 990MB
db.adminCommand({replSetResizeOplog: 1, size: 60000})
# Check the oplog size
use local
db.oplog.rs.stats().maxSize
Idempotence
Each oplog record describes an atomic change of data.
The oplog must be idempotent: no matter how many times the same oplog entry is replayed, the final state of the data remains the same. For example, suppose the current value of field x in a document is 100 and the client sends {$inc: {x: 1}} to the Primary; when the operation is recorded in the oplog it is converted into {$set: {x: 101}}, which guarantees idempotence.
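A minimal sketch of that conversion, using an illustrative document (the exact oplog representation varies with the server version):
db.coll.insertOne({_id: 100, x: 100})
db.coll.updateOne({_id: 100}, {$inc: {x: 1}})
# The Primary records the absolute result rather than the increment, roughly:
#   "op" : "u", "o" : { "$set" : { "x" : 101 } }, "o2" : { "_id" : 100 }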
The price of idempotence
For operations on simple values, converting $inc to $set has no downside and the execution cost is similar, but when it comes to array operations the situation is different.
test
db.coll.insert({_id:1,x:[1,2,3]})
Push 2 elements onto the end of the array, then check the oplog: the $push operation has been converted into a $set operation (setting the elements at specific positions in the array to the given values).
rs0:PRIMARY> db.coll.update({_id: 1}, {$push: {x: { $each: [4, 5] }}})
WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })
rs0:PRIMARY> db.coll.find()
{ "_id" : 1, "x" : [ 1, 2, 3, 4, 5 ] }
rs0:PRIMARY> use local
switched to db local
rs0:PRIMARY> db.oplog.rs.find({ns:"test.coll"}).sort({$natural:-1}).pretty()
{
"op" : "u",
"ns" : "test.coll",
"ui" : UUID("69c871e8‐8f99‐4734‐be5f‐c9c5d8565198"),
"o" : {
"$v" : 1,
"$set" : {
"x.3" : 4,
"x.4" : 5
}
},
"o2" : {
"_id" : 1
},
"ts" : Timestamp(1646223051, 1),
"t" : NumberLong(4),
"v" : NumberLong(2),
"wall" : ISODate("2022‐03‐02T12:10:51.882Z")
}
Converting $push to $set with specific positions has a similar cost, but now let's look at adding 2 elements to the head of the array.
It can be seen that when elements are added to the head of the array, the $set operation in the oplog no longer sets the value at a particular position (because essentially all element positions have shifted); instead it sets the final result of the array, i.e. the contents of the entire array must be written to the oplog. When the push operation specifies the $slice or $sort parameter, the oplog is recorded in the same way: the entire array content becomes the parameter of $set.
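Continuing the same example, a head insert can be sketched as follows (the new values are illustrative, and the oplog content shown paraphrases the behaviour described above rather than captured output):
db.coll.update({_id: 1}, {$push: {x: {$each: [6, 7], $position: 0}}})
# The resulting oplog entry rewrites the whole array, roughly:
#   "$set" : { "x" : [ 6, 7, 1, 2, 3, 4, 5 ] }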
Update operators such as $pull and $addToSet behave similarly: after the array is updated, the oplog is converted into a $set of the array's final content to guarantee idempotence.
Oplog write amplification that makes synchronization unable to catch up: large array updates
When an array is very large, a small update to it may require the contents of the entire array to be recorded in the oplog. In one real production case, a user's documents contained a large array field with 1,000 elements and a total size of about 64KB. The elements were stored in reverse chronological order, and newly inserted elements were placed at the front of the array ($position: 0), after which only the first 1,000 elements were kept ($slice: 1000).
In this scenario, every time a new element was inserted into the array on the Primary (a request of only a few hundred bytes), the contents of the entire array had to be recorded in the oplog. The Secondaries pull the oplog and replay it during synchronization, so the oplog traffic from Primary to Secondary was hundreds of times the client-to-Primary network traffic, saturating the network interfaces between primary and standby. Moreover, because the oplog volume was so large, old entries were deleted quickly, and the Secondary eventually could not catch up and fell into the RECOVERING state. When using arrays in documents you must pay attention to this problem, to avoid array updates amplifying synchronization overhead without bound. When using arrays, try to observe the following:
1. The number of elements in the array should not be too many, and the total size should not be too large
2. Try to avoid updating the array
3. If you must update, try to insert elements only at the end; complex logic can be supported at the business level instead
Replication delay
Since the oplog collection has a fixed size, the records stored in it may be overwritten by new records at any time.
If replication on the standby node is not fast enough and cannot keep up with the primary node, a replication lag problem arises.
This must not be ignored: once the standby node's lag grows too large, there is a constant risk of the replication breaking, which means the standby node's optime (its latest synchronized record) has already been aged out by the primary node, and the standby node can no longer continue data synchronization.
In order to avoid the risk of replication delay as much as possible, we can take some measures, such as:
Increase the size of the oplog and keep monitoring the replication window.
Reduce the write speed of the primary node through some scaling means.
Optimize the network between the active and standby nodes.
Avoid using overly large arrays in fields (which may cause oplog bloat).
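The replication window and each member's lag can be monitored from the shell; a minimal sketch (on versions before 4.4 the second helper is named rs.printSlaveReplicationInfo()):
# Oplog size and the time range it covers (the replication window)
rs.printReplicationInfo()
# How far each secondary lags behind the primary
rs.printSecondaryReplicationInfo()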