MongoDB Replica Set in Practice and Analysis of Its Principles - 04

MongoDB replica set

Replica Set Architecture

In a production environment, it is not recommended to use a standalone MongoDB server. The reasons are as follows:
       A standalone MongoDB instance cannot guarantee reliability. Once the process crashes or the server goes down, the business becomes immediately unavailable.
        Once the disk on the server is damaged, the data is lost outright, with no copy available for recovery.
A MongoDB replica set (Replication Set) consists of a group of mongod instances (processes), including one Primary node and multiple Secondary nodes. The MongoDB Driver (client) writes all data to the Primary, and the Secondaries synchronize the written data from the Primary, so that all members of the replica set store the same data set and the data is highly available. Replica sets provide redundancy and high availability and are the basis for all production deployments. Their high availability relies on two capabilities:
       1. When data is written, it is quickly replicated to another independent node.
       2. When the node that received the write fails, a new node is automatically elected to take its place.
While achieving high availability, a replica set also provides several additional capabilities:
Data distribution: copy data from one region to another, reducing read latency in the remote region.
Read-write separation: different types of load are served by different nodes (see the connection-string sketch below).
Off-site disaster recovery: quickly switch to a remote site when a data center fails.
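For example, read-write separation is typically enabled on the driver side through the readPreference option of the connection string. A minimal sketch, assuming the three hosts and ports used later in this article:
mongodb://192.168.65.174:28017,192.168.65.174:28018,192.168.65.174:28019/test?replicaSet=rs0&readPreference=secondaryPreferred
With secondaryPreferred, reads are served by a secondary node when one is available, while writes always go to the Primary.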
Three-node replica set mode
A common replica set architecture consists of 3 member nodes, of which there are several different modes.
PSS mode (officially recommended)
The PSS mode consists of one primary node and two secondary nodes, i.e. Primary + Secondary + Secondary.

This mode always keeps two complete copies of the data set. If the primary node becomes unavailable, the replica set elects one of the secondaries as the new primary and continues normal operation. The old primary rejoins the replica set when it becomes available again.

PSA mode
The PSA mode consists of one primary node, one secondary node, and one arbiter node, i.e.
Primary + Secondary + Arbiter

The Arbiter node does not store a copy of the data and does not serve business reads or writes; its failure does not affect the business, it only casts an election vote. This mode keeps only one complete copy of the data; if the primary node becomes unavailable, the replica set elects the secondary as the new primary.

Typical three-node replica set environment setup
        Even if there is only one server for now, start it as a replica set in single-node mode, so that more nodes can be added later.
        Below, we start a replica set by running multiple instances on a single machine.
        A single-node replica set is started in the same way.
Replica Set Considerations
        About hardware:
                Because any normal replica set node may become the primary and their roles are interchangeable, the hardware configuration of all nodes should be the same;
                To ensure that nodes do not go down at the same time, the hardware used by each node should be independent.
        About software:
                The software version of every node in the replica set should be identical to avoid unpredictable problems.
                Adding nodes does not increase write performance.
Environment preparation
        Install MongoDB and configure environment variables.
        Make sure there is more than 10GB of free disk space.
Prepare configuration files
        Each mongod process of a replica set should normally run on a different server. Here we run 3 processes on one machine, so each must be configured with:
        Different ports (28017/28018/28019)
        Different data directories:
mkdir -p /data/db{1,2,3}
        Different log file paths (for example: /data/db1/mongod.log)
Create a configuration file /data/db1/mongod.conf with the following content:
systemLog:
  destination: file
  path: /data/db1/mongod.log # log path
  logAppend: true
storage:
  dbPath: /data/db1 # data directory
net:
  bindIp: 0.0.0.0
  port: 28017 # port
replication:
  replSetName: rs0
processManagement:
  fork: true
Using the configuration above as a template, change the port and paths to configure db2 and db3 in turn. Note that the file must be in YAML format.
Start the MongoDB process
mongod -f /data/db1/mongod.conf
mongod -f /data/db2/mongod.conf
mongod -f /data/db3/mongod.conf
Note: If SELinux is enabled, it may prevent the above processes from starting. For simplicity, turn off SELinux:
# Permanently disable: change SELINUX=enforcing to SELINUX=disabled; a reboot is required for this to take effect
vim /etc/selinux/config
# Check SELinux status
/usr/sbin/sestatus -v
Configure the replica set
The replica set is initialized with the replSetInitiate command or with rs.initiate() in the mongo shell. After initialization, the members start sending heartbeat messages to each other and initiate a Primary election; the node that receives votes from a majority of the members becomes the Primary, and the remaining nodes become Secondaries.
Method 1
# mongo --port 28017
# Initialize the replica set
> rs.initiate()
# Add the remaining members to the replica set
> rs.add("192.168.65.174:28018")
> rs.add("192.168.65.174:28019")
Method 2
# mongo --port 28017
# Initialize the replica set
> rs.initiate({
    _id: "rs0",
    members: [{
            _id: 0,
            host: "192.168.65.174:28017"
        },{
            _id: 1,
            host: "192.168.65.174:28018"
        },{
            _id: 2,
            host: "192.168.65.174:28019"
    }]
})
Verify
Write on the MongoDB primary node:
# mongo --port 28017
rs0:PRIMARY> db.user.insert([{name:"fox"},{name:"monkey"}])
Read on a MongoDB secondary node:
# mongo --port 28018
# Allow reads on the secondary node
rs0:SECONDARY> rs.secondaryOk()
rs0:SECONDARY> db.user.find()
Replica set status query
View the overall status of the replica set:
rs.status()
You can view the current status of each member, including whether it is healthy, whether it is in full synchronization, heartbeat information, incremental synchronization information, election information, last heartbeat time, etc.
The members field reflects the status of all replica set members, mainly the following (a quick extraction sketch follows the list):
health: whether the member is healthy, detected via heartbeats.
state/stateStr: the state of the member; PRIMARY means the primary node and SECONDARY means a standby node. If the node fails, other states such as RECOVERING may appear.
uptime: how long the member has been up.
optime/optimeDate: the time of the last oplog entry applied by the member.
optimeDurable/optimeDurableDate: the time of the last oplog entry durably written (journaled) by the member.
pingMs: the ping latency between the member and the current node.
syncingTo: the member's synchronization source.
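A quick way to pull out just these fields for each member from the mongo shell (a minimal sketch using the fields listed above):
rs.status().members.map(function (m) {
    // pingMs is only reported for remote members, not for the node you are connected to
    return { name: m.name, state: m.stateStr, health: m.health, pingMs: m.pingMs };
})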

View the role of the current node:

db.isMaster()
Compared with rs.status(), the output is more concise. Besides the role of the current node, it also returns the member list of the whole replica set, which node is the actual primary, protocol-related configuration, and so on; the Driver sends this command when it first connects to the replica set.
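For example, to find the current primary's address from any member (a small sketch):
db.isMaster().primary   // e.g. "192.168.65.174:28017"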

Mongo Shell Replica Set Commands

Command                                Description
rs.add()                               Add a new member node to the replica set
rs.addArb()                            Add an arbiter node
rs.conf()                              Return the replica set configuration
rs.freeze()                            Prevent the current node from being elected primary for a period of time
rs.help()                              Return command help for the replica set
rs.initiate()                          Initialize a new replica set
rs.printReplicationInfo()              Return a replication status report from the primary's perspective
rs.printSecondaryReplicationInfo()     Return a replication status report from the secondaries' perspective
rs.reconfig()                          Update the replica set configuration with a new configuration document
rs.remove()                            Remove a node from the replica set
rs.secondaryOk()                       Allow the current connection to read from secondary nodes
rs.status()                            Return replica set status information
rs.stepDown()                          Step the current primary down to a secondary and trigger an election
rs.syncFrom()                          Set the member from which this node syncs, overriding the default sync source selection logic

Replica set connection method

Method 1: connect directly to the address of a single node (ip:port). If that node fails or is no longer the primary, the application has to discover and reconnect to the new primary by itself.

Method 2 (strongly recommended): connect to MongoDB through a high-availability URI that lists the replica set members. When the Primary fails over, the MongoDB Driver automatically detects the change and routes traffic to the new Primary node.
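A sketch of such a high-availability connection from the mongo shell, assuming the three members created earlier in this article:
mongo "mongodb://192.168.65.174:28017,192.168.65.174:28018,192.168.65.174:28019/test?replicaSet=rs0"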

Spring Boot replica set configuration

spring:
  data:
    mongodb:
      uri: mongodb://yanqiuxiang:[email protected]:28017,192.168.30.130:28018,192.168.30.130:28019/test?authSource=admin&replicaSet=rs0
Replica Set Member Roles
There are multiple nodes in a replica set, and each node has different responsibilities. Before looking at the member roles, it helps to understand two important member attributes:
Attribute 1: Priority = 0
A member whose Priority is 0 cannot be elected primary by the replica set. The higher a member's Priority value, the greater its probability of being elected primary. This attribute is often used when deploying a replica set across data centers. Suppose data centers A and B are used and the main business is closer to data center A; setting the Priority of the members in data center B to 0 guarantees that the primary is always a member in data center A.
Attribute 2: Vote = 0
A member with Vote = 0 cannot participate in election voting, and its Priority must also be 0, i.e. it cannot be elected primary either. Since a replica set can have at most 7 voting members, any additional members must have their vote attribute set to 0, which means those members cannot vote, as sketched below.
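A sketch of setting these attributes on an existing member via rs.reconfig() (the member index 2 is just an example):
cfg = rs.conf()
cfg.members[2].priority = 0   // this member can never become primary
cfg.members[2].votes = 0      // this member no longer counts toward the 7-voting-member limit
rs.reconfig(cfg)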
Member roles
        Primary: the primary node, which receives all write requests and then replicates the changes to all standby nodes. A replica set can have only one primary node; when the primary goes down, the other nodes re-elect a new primary.
        Secondary: a standby node that maintains the same data set as the primary. When the primary goes down, it participates in the election of a new primary. Secondaries come in three different flavors:
        Hidden = false: a normal read-only node; whether it can be elected primary and whether it can vote depend on the values of Priority and Vote.
        Hidden = true: a hidden node, invisible to clients. It can participate in elections, but its Priority must be 0, i.e. it cannot be promoted to primary. Since hidden nodes do not receive business traffic, they can be used for data backups and offline computing tasks without affecting the rest of the replica set.
        Delayed: a delayed node must also be a hidden node with Priority 0, and it replicates changes from its upstream with a certain delay (determined by the slaveDelay setting). It is typically used for fast rollback scenarios.
        Arbiter: an arbitration node that only participates in election voting; it carries no data and acts purely as a voter. For example, in a replica set with 2 data-bearing nodes (1 primary and 1 secondary), if either node goes down the replica set cannot elect a primary and stops serving writes; adding an Arbiter node allows a primary to still be elected even if one node is down. The Arbiter stores no data and is a very lightweight service. When the number of replica set members is even, it is best to add an Arbiter node to improve the availability of the replica set.

 

Configure hidden nodes

In many cases, hidden nodes are configured to support delayed members. If we only need to prevent a node from becoming the primary, a priority-0 member is sufficient.
cfg = rs.conf()
cfg.members[1].priority = 0
cfg.members[1].hidden = true
rs.reconfig(cfg)
Once set, the priority of the node is changed to 0 so it cannot be promoted to primary, and it is also invisible to the application. Executing db.isMaster() on other nodes will not show the hidden node.
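This can be quickly verified from another member (a small check, using the db.isMaster() output mentioned above):
rs0:PRIMARY> db.isMaster().hosts   // the hidden member no longer appears in this list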
Configure delay nodes
When we configure a delayed node, replication to that node and its oplog are delayed, so the data set on the delayed node lags behind the data set on the primary. For example, if it is 09:52 now and the node is delayed by 1 hour, the delayed node's data set will not contain any operations made after 08:52.
cfg = rs.conf()
cfg.members[1].priority = 0
cfg.members[1].hidden = true
// delay by 1 minute (60 seconds)
cfg.members[1].slaveDelay = 60
rs.reconfig(cfg)
View replication lag
To view the oplog status of the current node, use the rs.printReplicationInfo() command.

Its output shows the configured size of the oplog, the times of the earliest and latest oplog entries, and the "log length start to end", i.e. the replication window (the time span covered by the oplog). Usually, with a fixed oplog size, the more frequent the business writes, the shorter the replication window becomes. Executing rs.printSecondaryReplicationInfo() on a node lists the synchronization delay of every standby member.
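The replication window can also be computed by hand from the first and last entries of the oplog. A rough sketch for the legacy mongo shell (it assumes Timestamp objects expose their seconds via the .t property, which is the case in the legacy shell):
use local
var first = db.oplog.rs.find().sort({$natural: 1}).limit(1).next().ts
var last  = db.oplog.rs.find().sort({$natural: -1}).limit(1).next().ts
(last.t - first.t) / 3600   // replication window in hours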

Add an arbiter (voting) node

# Create a data directory for the arbiter to hold configuration data. This directory will not hold a data set
mkdir /data/arb
# Start the arbiter, specifying the data directory and the replica set name
mongod --port 30000 --dbpath /data/arb --replSet rs0
# In the mongo shell, add the arbiter to the replica set
rs.addArb("ip:30000")
Remove a replica set node
Use rs.remove() to remove nodes
rs.remove("ip:port")
Remove nodes by rs.reconfig()
cfg = rs.conf()
cfg.members.splice(2,1)  // remove 1 element starting at index 2
rs.reconfig(cfg)
Change replica set nodes
cfg = rs.conf()
cfg.members[0].host = "ip:port"
rs.reconfig(cfg)
Replica set high availability
Replica set election
MongoDB replica set elections are implemented with the Raft algorithm ( https://raft.github.io/ ). The necessary condition for a successful election is that a majority of the voting nodes are alive. In its implementation, MongoDB adds several extensions to the Raft protocol, including:
Support for chained replication (chainingAllowed): a standby node does not have to sync from the primary; it can pick the node closest to itself (with the smallest heartbeat delay) to replicate from.
A pre-voting phase (preVote), mainly used to avoid the Term value surging when the network is partitioned.
Support for vote priority: if a standby node finds that its priority is higher than the primary's, it proactively initiates an election and tries to become the new primary.
A replica set can have up to 50 members, but only 7 of them can vote. This is because having too many members participate in data replication and voting brings more reliability problems.

Voting members    Majority    Failures tolerated
1                 1           0
2                 2           0
3                 2           1
4                 3           1
5                 3           2
6                 4           2
7                 4           3

When the number of surviving members in a replica set is less than a majority, the replica set cannot elect a primary and cannot provide write services; the remaining nodes are read-only. In addition, to avoid tied votes, it is best to use an odd number of members, such as 3 or 5. That said, the MongoDB replica set implementation already provides solutions to the tie problem:
        A small random offset is added to the election timer, so that nodes do not all initiate an election at the same moment, improving the success rate.
        The arbiter role can be used; it performs no data replication and serves no reads or writes, it only votes.
Automatic failover
In a failover scenario, the questions we care about are:
        How does a standby node detect that the primary has failed?
        How is the impact of failover on the business minimized?

        One factor in failure detection is the heartbeat. After the replica set is established, each member starts a timer and keeps sending heartbeats to the other members. The relevant parameter is heartbeatIntervalMillis, the heartbeat interval, which defaults to 2s. If a heartbeat succeeds, heartbeats continue at the 2s interval; if it fails, it is retried immediately until the heartbeat succeeds again.
        Another important factor is election-timeout detection: a single failed heartbeat does not immediately trigger a re-election. In addition to heartbeats, each member also runs an election-timeout timer, which by default fires every 10s and can be tuned with the electionTimeoutMillis parameter. If a heartbeat response succeeds, the previously scheduled electionTimeout task is cancelled (guaranteeing no election is started) and a new electionTimeout task is scheduled. If heartbeat responses fail for a long time, the electionTimeout task fires, causing the standby node to initiate an election and become the new primary. In MongoDB's implementation, the election-timeout period is slightly longer than the electionTimeoutMillis setting: a random offset is added, making it roughly 10-11.5s. This design staggers the times at which standby nodes initiate elections and improves the success rate (a tuning sketch is shown after the list below).
Therefore, the electionTimeout task triggers an election only when all of the following hold:
(1) The current node is a standby node.
(2) The current node has election authority.
(3) No heartbeat to the primary has succeeded within the detection period.
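Both intervals live in the settings document of the replica set configuration. A minimal sketch of tuning them with rs.reconfig() (the values here are only examples):
cfg = rs.conf()
cfg.settings.electionTimeoutMillis = 5000      // example: fail over faster than the 10s default
cfg.settings.heartbeatIntervalMillis = 2000    // the default heartbeat interval, shown for completeness
rs.reconfig(cfg)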

 Business Impact Assessment

When the replica set switches its primary, there is a short period without a primary during which write operations cannot be accepted. If the switch is caused by a failure of the primary node, all reads and writes on that node will time out. If you use a MongoDB 3.6 or later driver, you can reduce the impact by enabling retryWrites.
# Enable retryable writes in MongoDB Drivers
mongodb://localhost/?retryWrites=true
# mongo shell
mongo --retryWrites
If the primary node is forcibly powered off, the whole failover process takes longer; the failure may only be detected and recovered from after the election timers on the other nodes expire, a window that is generally within 12s. In practice, however, the estimate of lost service calls should also include the time it takes the client or mongos to detect and react to the change of roles in the replica set (in reality this can take 30s or more).
For very important businesses, it is recommended to implement some protection strategies at the business level, such as designing a retry mechanism.
Thinking: how do you gracefully restart a replica set?
To restart the replica set without losing data, a more graceful procedure is as follows (a command sketch follows the list):
1. Restart all the Secondary nodes in the replica set one by one.
2. Send rs.stepDown() to the Primary and wait for the Primary to be demoted to a Secondary.
3. Restart the demoted Primary.
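A sketch of these steps for the three local instances used in this article (the ports and config file paths are the ones assumed above):
# 1. restart each Secondary in turn
mongo --port 28018 --eval "db.getSiblingDB('admin').shutdownServer()"
mongod -f /data/db2/mongod.conf
mongo --port 28019 --eval "db.getSiblingDB('admin').shutdownServer()"
mongod -f /data/db3/mongod.conf
# 2. ask the Primary to step down (the shell connection may drop when it does), then restart it
mongo --port 28017 --eval "rs.stepDown()"
mongo --port 28017 --eval "db.getSiblingDB('admin').shutdownServer()"
mongod -f /data/db1/mongod.conf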
Replica set data synchronization mechanism
       In a replica set, data is synchronized between the primary node and the standby nodes through the oplog. The oplog is a special capped collection: when a write operation on the primary completes, a corresponding record is written to the oplog collection, and the standby nodes continuously pull new entries from this oplog and replay them locally to keep their data in sync.

What is the oplog

        The MongoDB oplog is a collection in the local database that stores the incremental logs produced by write operations (similar to the binlog in MySQL).
        It is a Capped Collection (fixed-size collection): once the configured maximum size is exceeded, the oldest entries are deleted automatically. MongoDB has special optimizations for oplog deletion to improve its efficiency.
        The primary node generates new oplog entries, and the standby nodes keep the same state as the primary by copying the oplog and applying the entries.
View the oplog
use local
db.oplog.rs.find().sort({$natural: -1}).pretty()
local.system.replset: Used to record the members of the current replica set.
local.startup_log: Used to record the startup log information of the local database.
local.replset.minvalid: Used to record the tracking information of the replica set, such as the fields required for initial synchronization.
ts: operation time, current timestamp + counter, the counter is reset every second
v: oplog version information
op: operation type:
i: insert operation
u: update operation
d: delete operation
c: Execute commands (such as createDatabase, dropDatabase)
n: no operation, special purpose
ns: the set to operate on
o: operation content
o2: operation query condition, only update operation contains this field
The ts field is the timestamp at which the oplog entry was generated, also called the optime. The optime is the key to incremental log synchronization on standby nodes; it guarantees that the oplog is ordered. It consists of two parts:
        The current system time, i.e. the number of seconds since the UNIX epoch (32 bits).
        An integer counter, which is reset whenever the time value changes (32 bits).
        The optime is of the BSON Timestamp type, which is generally reserved for MongoDB internal use. Since the oplog is ordered at the node level, standby nodes can pull it by polling, using tailable cursors.
Each standby node maintains its own offset, namely the optime of the last entry it pulled from the primary, and uses this optime to query the primary's oplog collection when synchronizing. To avoid repeatedly opening new query connections, the cursor can be kept open after the first query (by making the cursor tailable). This way, whenever new records appear in the oplog, the standby node can fetch them over the same request channel, as sketched below. Tailable cursors are only allowed on capped collections.
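A rough sketch of such a pull loop in the legacy mongo shell, assuming lastOptime is the optime of the last entry already applied (DBQuery.Option.tailable and DBQuery.Option.awaitData are legacy-shell query flags):
use local
var lastOptime = Timestamp(1646223051, 1)   // example value; in practice this is the last applied optime
var cur = db.oplog.rs.find({ts: {$gt: lastOptime}})
             .addOption(DBQuery.Option.tailable)
             .addOption(DBQuery.Option.awaitData)
while (cur.hasNext()) {
    printjson(cur.next())   // a real secondary would apply the entry instead of printing it
}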

The size of the oplog collection
The size of the oplog collection can be set with the parameter replication.oplogSizeMB. On 64-bit systems, the default oplog size is:
oplogSizeMB = min(5% of available disk space, 50GB)
For most business scenarios it is difficult to estimate an appropriate oplogSize up front. Fortunately, since version 4.0 MongoDB provides the replSetResizeOplog command, which can change the oplog size dynamically without restarting the server.
# Change the oplog size of this replica set member to 60GB; the specified size must be greater than 990MB
db.adminCommand({replSetResizeOplog: 1, size: 60000})
# Check the oplog size
use local
db.oplog.rs.stats().maxSize
Idempotence
Each oplog record describes an atomic change to the data. The oplog must be idempotent: no matter how many times the same oplog entry is replayed, the final state of the data stays the same. For example, if the current value of field x in a document is 100 and the client sends {$inc: {x: 1}} to the Primary, the operation is converted to {$set: {x: 101}} when recorded in the oplog, which guarantees idempotence, as shown below.
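A quick way to see this conversion on the test replica set (a sketch; the items collection is just an example, and the exact extra fields in the oplog entry depend on the MongoDB version):
rs0:PRIMARY> db.items.insert({_id: 1, x: 100})
rs0:PRIMARY> db.items.update({_id: 1}, {$inc: {x: 1}})
rs0:PRIMARY> use local
rs0:PRIMARY> db.oplog.rs.find({ns: "test.items", op: "u"}).sort({$natural: -1}).limit(1).pretty()
The o field of the returned entry should contain {"$set": {"x": 101}} rather than the original $inc.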
The price of idempotence
For operations on simple fields, converting $inc to $set makes no difference and the execution cost is similar; but when it comes to operations on array elements, the situation is different.
Test
db.coll.insert({_id:1,x:[1,2,3]})
Push 2 elements to the end of the array; checking the oplog shows that the $push operation has been converted into a $set operation (setting the elements at the specified positions of the array to the given values).
rs0:PRIMARY> db.coll.update({_id: 1}, {$push: {x: { $each: [4, 5] }}})
WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })
rs0:PRIMARY> db.coll.find()
{ "_id" : 1, "x" : [ 1, 2, 3, 4, 5 ] }
rs0:PRIMARY> use local
switched to db local
rs0:PRIMARY> db.oplog.rs.find({ns:"test.coll"}).sort({$natural: -1}).pretty()
{
    "op" : "u",
    "ns" : "test.coll",
    "ui" : UUID("69c871e8-8f99-4734-be5f-c9c5d8565198"),
    "o" : {
        "$v" : 1,
        "$set" : {
            "x.3" : 4,
            "x.4" : 5
        }
    },
    "o2" : {
        "_id" : 1
    },
    "ts" : Timestamp(1646223051, 1),
    "t" : NumberLong(4),
    "v" : NumberLong(2),
    "wall" : ISODate("2022-03-02T12:10:51.882Z")
}
Converting a $push at a specific position into a $set has a similar cost, but now consider adding 2 elements to the head of the array.
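A sketch of that head insert and of checking the resulting oplog entry ($position: 0 places the new elements at the front; the exact output will vary):
rs0:PRIMARY> db.coll.update({_id: 1}, {$push: {x: {$each: [6, 7], $position: 0}}})
rs0:PRIMARY> use local
rs0:PRIMARY> db.oplog.rs.find({ns: "test.coll"}).sort({$natural: -1}).limit(1).pretty()
This time the o field should contain a $set of the entire resulting array rather than per-position assignments.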

It turns out that when elements are added to the head of the array, the $set operation in the oplog no longer sets values at particular positions (because essentially all element positions have shifted); instead it $sets the final state of the array, i.e. the entire array content is written to the oplog. When a push operation specifies the $slice or $sort parameter, the oplog is recorded in the same way, with the whole array content as the argument of $set. Update operators such as $pull and $addToSet behave similarly: after the array is updated, the oplog entry is converted into a $set of the array's final content to guarantee idempotence.
Write amplification in the oplog can cause synchronization to fall behind: large array updates
When an array is very large, a small update to it may require the entire array content to be recorded in the oplog. I encountered a real production case where the user's documents contained a large array field: 1000 elements totalling about 64KB. The elements were stored in reverse time order, with newly inserted elements placed at the front of the array ($position: 0), and only the first 1000 elements kept ($slice: 1000).
In this scenario, every time a new element was inserted into the array on the Primary (a request of only a few hundred bytes), the entire array content had to be recorded in the oplog. The Secondaries pull and replay the oplog during synchronization, so the oplog traffic from Primary to Secondary was hundreds of times the client-to-Primary traffic, saturating the network cards between primary and secondaries. Moreover, because the oplog grew so quickly, old entries were deleted rapidly, and eventually the Secondaries could not catch up and switched to the RECOVERING state. When using arrays in documents you must keep this problem in mind, so that array updates do not amplify the synchronization overhead without bound. When using arrays, try to ensure that:
        1. The number of elements in the array is not too large, and the total size is not too big.
        2. Updates to the array are avoided where possible.
        3. If updates are necessary, elements are only inserted at the end; more complex logic can be handled at the business level.

Replication delay
Because the oplog collection has a fixed size, the entries stored in it may be overwritten by new records at any time. If a standby node replicates too slowly to keep up with the primary, a replication lag problem arises. This must not be ignored: once a standby node's lag becomes too large, there is a constant risk that replication breaks off entirely, meaning the standby node's optime (its latest synchronized record) has already been aged out of the primary's oplog, and the standby node can no longer continue data synchronization.
To reduce the risk of replication lag as much as possible, measures such as the following can be taken:
        Increase the oplog size and keep monitoring the replication window.
        Reduce the write rate on the primary through scaling measures.
        Optimize the network between the primary and the standby nodes.
        Avoid overly large array fields (which can bloat the oplog).

Data rollback
        Since replication lag is unavoidable, the data on the primary and standby nodes cannot be kept in absolute sync. When the primary node of a replica set goes down, a standby node is elected as the new primary. When the old primary rejoins, it must roll back its earlier "dirty" log entries to make its data set consistent with the new primary. The larger the replication gap between primary and standby, the higher the risk of rolling back a large amount of data.
        Business data that has already been replicated to a majority of the nodes in the replica set is safe from rollback. An application can guarantee this durability by using a higher write concern (writeConcern: majority). The data rolled back by the old primary is written to a separate rollback directory, from which it can still be restored if necessary.
When a rollback occurs, MongoDB stores the rolled-back data as BSON files in the rollback folder under the dbpath. The BSON files are named as follows: <database>.<collection>.<timestamp>.bson
mongorestore --host 192.168.30.130:27018 --db test --collection emp -u yanqiuxiang -p yanqiuxiang --authenticationDatabase=admin rollback/emp_rollback.bson
Sync source selection
MongoDB allows standby nodes to replicate from other standby nodes. This happens in the following cases:
When settings.chainingAllowed is enabled, a standby node automatically selects the nearest node (the one with the smallest ping latency) to sync from. settings.chainingAllowed is enabled by default, which means that by default a standby node does not necessarily sync from the primary. A side effect of this chaining is increased replication delay; it can be disabled as follows:
cfg = rs.config()
cfg.settings.chainingAllowed = false
rs.reconfig(cfg)
The replSetSyncFrom command can temporarily change the sync source of the current node, for example pointing the sync source at a standby node during initial sync to reduce the load on the primary.
db.adminCommand( { replSetSyncFrom: "hostname:port" })

Origin blog.csdn.net/u011134399/article/details/131270334