[Technical Deep Dive] The Election Principle of SequoiaDB Replication Groups

1. Introduction to SequoiaDB

SequoiaDB is a distributed document database developed independently in China. Unlike the relational databases developers have long been familiar with, it stores data as BSON, a format very similar to JSON.

Besides the obvious difference in data model, SequoiaDB differs from relational databases in that it natively supports distributed storage. When building a system that must handle massive data volumes and highly concurrent operations, users no longer need to do complex table and database splitting at the business level as they did in the past. When defining a table, they simply tell the database which field and which rule to use for distribution, and distributed storage becomes transparent to them. Users can focus on business logic instead of on how to split tables and databases.
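To make "which field and which rule" concrete, here is a minimal sketch of hash-based distribution. It is not SequoiaDB's actual implementation: the group names, the `route` helper and the choice of MD5 as the hash are all assumptions made for illustration.

```python
import hashlib

# Hypothetical data group names, for illustration only.
DATA_GROUPS = ["group1", "group2", "group3"]


def route(record: dict, sharding_key: str, groups=DATA_GROUPS) -> str:
    """Pick a data group from a stable hash of the sharding-key value."""
    value = str(record[sharding_key]).encode("utf-8")
    digest = hashlib.md5(value).digest()
    # Use the first 4 bytes of the digest as an integer, then map it to a group.
    index = int.from_bytes(digest[:4], "big") % len(groups)
    return groups[index]


orders = [{"order_id": i, "amount": 10 * i} for i in range(6)]
placement = {o["order_id"]: route(o, "order_id") for o in orders}
for order_id, group in placement.items():
    print(order_id, "->", group)
```

The application only names the field (`order_id` here). The same key value always lands in the same group, which is what lets the database hide the distribution from the user.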

2. Introduction to the overall architecture of SequoiaDB

 

Figure 1: Schematic diagram of the overall architecture of SequoiaDB

In a SequoiaDB cluster there are three main roles: the coordinator node, the catalog node and the data node.

1.1  Coordinator Node

The coordinator node (Coord) is the task-distribution node of SequoiaDB (ordinary users often call it the master node). It stores no data itself and is mainly responsible for receiving access requests from applications. Ordinary users therefore interact with SequoiaDB through the coordinator nodes; accessing nodes in other roles directly is generally not recommended.

In older versions of SequoiaDB, coordinator nodes were deployed like islands in the database cluster. This is because each coordinator node learned the catalog node information through a catalogaddr parameter, coordinator nodes had no way to communicate with each other, and the catalog nodes stored no information about the coordinator nodes.

In newer versions, the coordinator nodes have a group of their own, called CoordRG. All coordinator nodes in the cluster belong to CoordRG, and this information is stored in CatalogRG.

1.2  Catalog Node

The catalog node (Catalog) mainly stores the deployment structure of the entire database and node status information. It also records the parameters of collection spaces and collections, the index information of each collection, and the main-table/sub-table information. In addition, the catalog node records how the data of each collection is split, so its role is similar to that of a Metastore for relational databases.

Catalog nodes belong to the catalog node group (CatalogRG). A database cluster must have exactly one catalog node group, which contains at least one and at most 7 catalog nodes.

1.3  Data Nodes

Data nodes are responsible for storing data, executing computation tasks, building indexes, and providing data backup and transaction functions.

A data node belongs to a data group (DataRG). In a SequoiaDB cluster, there is at least one data group, and each data group contains at least one data node. A data group can have up to 7 data nodes. SequoiaDB does not have a clear limit on the number of data groups, and users can deploy and configure them according to the number of their own servers.

1.4  Logical view of SequoiaDB node roles

 

 

3. SequoiaDB: replication group or data group?

Because SequoiaDB has three kinds of role groups, many users are confused by the terminology when they first learn its concepts.

In SequoiaDB, catalog node groups and data groups are collectively referred to as replication groups. However, because users rarely interact with the catalog group, they have gradually come to equate replication groups with data groups.

A data group is a DataRG composed of data nodes. It is the smallest unit of division when a user splits the data of a collection. There is no explicit limit on the number of data groups in a database.

4. What is an election?

The election discussed here is not a presidential election or a beauty contest, as the word is commonly understood. It is the process by which, in a large-scale distributed environment, one process is selected as the master node of its group according to an election algorithm. A distributed environment usually has two roles, master node and slave node. The master node has the highest authority and can perform any insert, delete, query and update operation, while a slave node must keep its data synchronized with the master node and usually serves only reads. In some strict scenarios, a slave node serves only as a data backup.

Some users may wonder why master and slave nodes need to be distinguished in a distributed environment. Wouldn't it be ideal to treat all nodes equally?

In a distributed environment, however, data is not stored centrally as in the past but is spread across multiple servers. While the network is healthy and all nodes can communicate, letting every node handle writes and deletes causes no problem in theory. Once the network is disconnected, though, data becomes inconsistent if every node keeps accepting writes and deletes.

For example, in a banking scenario, user A has only 100 yuan on his card and spends 100 yuan through application A. Because application A can reach only node A and node B, node C never learns that the balance has dropped by 100 yuan. When the user then checks the balance through application B, which reads from node C, the balance appears unchanged, and he can spend another 100 yuan through application B.

This example shows why node roles must be differentiated in a distributed environment: to keep the data safe and consistent.
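The banking example can be replayed with a small sketch. It is a toy model rather than SequoiaDB code: the `Node` class and the `spend` helper are hypothetical, and every node is deliberately allowed to accept writes so that the inconsistency becomes visible.

```python
class Node:
    """A replica holding its own copy of the card balance."""

    def __init__(self, name: str, balance: int):
        self.name = name
        self.balance = balance


def spend(reachable_nodes, amount: int) -> bool:
    """Apply a debit on every node the application can reach.

    Nodes outside `reachable_nodes` never learn about the write.
    """
    if any(n.balance < amount for n in reachable_nodes):
        return False  # at least one reachable copy says the money is gone
    for n in reachable_nodes:
        n.balance -= amount
    return True


a, b, c = Node("A", 100), Node("B", 100), Node("C", 100)

first = spend([a, b], 100)   # application A reaches only nodes A and B
second = spend([c], 100)     # application B reaches only node C

balances = {n.name: n.balance for n in (a, b, c)}
print(first, second, balances)
```

Both debits succeed, so 200 yuan is spent from a card that held only 100. Routing all writes through a single master node removes this possibility.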

 

Once it is clear that node roles within a group must be distinguished, another question arises: when should the group initiate an election, and is there a threshold for doing so?

In SequoiaDB, when a replication group already has a master node, the database does not initiate a new round of election. A user can, however, issue a command that forcibly switches the master node, which makes the group hold a new election and produce a new master node.

In addition, before a replication group can initiate an election, the number of surviving nodes in the group must be at least (the total number of nodes in the group/2+1), with the division rounded down. In other words, more than half of the nodes in the group must be able to take part for the election to be valid. Why this restriction?

Consider a replication group with four nodes, and suppose the rule required only half of the nodes rather than more than half. If the network suddenly fails and splits the group into two sides with two nodes each, the nodes on each side will conclude that the processes of the other two nodes have exited. Because the number of surviving nodes on each side equals half of the total, each side can elect a new master node, so two master nodes appear at the same time, one on each side of the network.

As described above, if writes and deletes can be performed on both sides of the network at the same time, data inconsistency will occur in some scenarios. In databases this condition is called "split brain".
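The arithmetic behind the four-node example can be checked with a short sketch. The `can_elect` function and its `require_majority` flag are hypothetical names; the majority branch follows the (total/2+1) rule with the division rounded down, and the other branch models the flawed "half is enough" rule.

```python
def can_elect(alive: int, total: int, require_majority: bool = True) -> bool:
    """Return True if a partition with `alive` nodes may hold an election."""
    if require_majority:
        # Rule from the text: alive >= total/2 + 1, using integer division.
        return alive >= total // 2 + 1
    # Flawed alternative: exactly half of the nodes is already enough.
    return alive >= total / 2


total = 4
sides = [2, 2]  # the network splits the group into two halves

masters_half_rule = sum(can_elect(s, total, require_majority=False) for s in sides)
masters_majority_rule = sum(can_elect(s, total) for s in sides)

print("masters with the half rule:", masters_half_rule)
print("masters with the majority rule:", masters_majority_rule)
```

With the half rule both sides may elect a master, which is split brain. With the majority rule neither side can, so the group stays without a master until the network recovers.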

 

Every database developer should guard against a "split-brain" situation.

Therefore, in a distributed database the election mechanism is an important safeguard for data consistency when users operate on data.

5. SequoiaDB election principle

In SequoiaDB, elections take place in both the catalog node group and the data groups, and the election mechanism and principle are the same for both. To keep the explanation concrete, the author uses a data group as the example.

SequoiaDB elections follow several important criteria:

Criterion 1

If the master node already exists in the data group, other slave nodes cannot request a new round of election in the group;

Criterion 2

To hold an election, the data group must satisfy surviving nodes >= (total number of nodes in the group/2+1), where the division is rounded down;

Criterion 3

In the data group, the nodes compete to be elected as the master node, and there are several priorities. The priorities are arranged from high to low as follows: Node LSN > Node Election Weight > Node NodeID

Criterion 4

Elections within the data group follow the two-phase commit principle to ensure the correctness of the election.
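Criteria 1 to 3 can be sketched as a single function. This is a simplified model under stated assumptions, not SequoiaDB's implementation: the `Node` dataclass, its field names and the `elect` function are hypothetical, LSN values are plain integers, and the two-phase commit of Criterion 4 is not modeled.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    node_id: int
    weight: int
    lsn: int
    alive: bool = True
    is_master: bool = False


def elect(group: List[Node]) -> Optional[Node]:
    """Return the master of the group, or None if no election is possible."""
    alive = [n for n in group if n.alive]

    # Criterion 1: an existing live master blocks any new election.
    current = next((n for n in alive if n.is_master), None)
    if current is not None:
        return current

    # Criterion 2: surviving nodes >= total/2 + 1 (integer division).
    if len(alive) < len(group) // 2 + 1:
        return None

    # Criterion 3: compare LSN first, then election weight, then NodeID.
    winner = max(alive, key=lambda n: (n.lsn, n.weight, n.node_id))
    winner.is_master = True
    return winner


# Old master A is down; B and C hold identical logs.
group = [
    Node("A", 1000, 10, lsn=101, alive=False, is_master=True),
    Node("B", 1001, 10, lsn=100),
    Node("C", 1002, 10, lsn=100),
]
winner = elect(group)
print(winner.name if winner else "no master")
```

Sorting on the tuple `(lsn, weight, node_id)` gives the priority order directly: a later field is consulted only when all earlier fields are equal.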

6. Election scenario simulation

Having introduced the election principle of SequoiaDB, the author now walks through several scenarios one by one so that users can understand the election strategy more intuitively.

1.5  Scenario 1

Suppose the data group contains three data nodes: node A (master node, NodeID=1000, weight=10), node B (NodeID=1001, weight=10) and node C (NodeID=1002, weight=10). At the start, resources such as disk and network are sufficient.

Suppose the master node of the data group (node A) receives a write request and completes the write, but before it can synchronize the write to nodes B and C, its process is forcibly shut down by an external factor. Which node in the data group will be elected as the master node?

Answer: Node C.

Principle description:

When the surviving nodes meet the election requirement and have the same LSN and the same election weight, the node with the larger NodeID is elected as the master node and takes over the read and write operations of the data group. Here neither node B nor node C received the last write, so their LSNs are equal, and node C has the larger NodeID.

 

1.6  Scenario 2

Suppose the data group contains three data nodes: node A (master node, NodeID=1000, weight=10), node B (NodeID=1001, weight=10) and node C (NodeID=1002, weight=10). At the start, resources such as disk and network are sufficient.

Suppose the master node of the data group (node A) receives a write request, completes the write and synchronizes it to node B, but before it can synchronize the write to node C, its process is forcibly shut down by an external factor. Which node in the data group will be elected as the master node?

Answer: Node B.

Principle description:

When the surviving nodes meet the election requirement, the LSNs of the nodes are compared, and the node with the larger LSN is elected as the master node. It then synchronizes the latest database log to the other surviving nodes. Here the election weights are the same, node B received the last write and node C did not, so node B has the larger LSN.
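The following sketch replays this scenario, including the log synchronization after the election. It rests on simplifying assumptions: the `Replica` class and `catch_up` helper are hypothetical, the LSN is modeled as the number of log records a node holds (a real LSN is a log position), and the election weight is left out because all weights are equal here.

```python
class Replica:
    def __init__(self, name: str, node_id: int):
        self.name = name
        self.node_id = node_id
        self.log = []  # replicated write log

    @property
    def lsn(self) -> int:
        # Stand-in for a real LSN: how many log records this node has.
        return len(self.log)


def catch_up(master: Replica, slave: Replica) -> None:
    """Copy the log records the slave is missing from the master."""
    slave.log.extend(master.log[slave.lsn:])


a, b, c = Replica("A", 1000), Replica("B", 1001), Replica("C", 1002)

# An earlier write reached all three nodes.
for node in (a, b, c):
    node.log.append("insert #1")

# The last write reached A and B only; then A's process is killed.
a.log.append("insert #2")
b.log.append("insert #2")
survivors = [b, c]

# Larger LSN wins; NodeID only breaks ties.
new_master = max(survivors, key=lambda n: (n.lsn, n.node_id))
for node in survivors:
    if node is not new_master:
        catch_up(new_master, node)

print(new_master.name, b.lsn, c.lsn)
```

Choosing the node with the largest LSN means the new master already holds every write that any survivor has, so synchronization only ever flows from the master to the slaves.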

1.7  Scenario 3

Suppose the data group contains three data nodes: node A (master node, NodeID=1000, weight=10), node B (NodeID=1001, weight=10) and node C (NodeID=1002, weight=10). At the start, resources such as disk and network are sufficient.

Suppose all three nodes of the data group are running normally at first, and then the processes of node A and node B are forcibly shut down by an external factor. Which node in the data group will be elected as the master node?

Answer: The election fails, and the data group temporarily has no master node.

Principle description:

When a data group holds an election, it first broadcasts messages throughout the group to detect which data nodes are currently alive. If the number of surviving nodes does not satisfy >= (total number of nodes/2+1), with the division rounded down [in this scenario the number of surviving nodes is 1 and the total number of nodes is 3, so at least 2 are required], the data group does not meet the basic requirement for an election. The election is abandoned, and the data group temporarily remains without a master node.

1.8  Scenario 4

Suppose the data group contains three data nodes: node A (master node, NodeID=1000, weight=10), node B (NodeID=1001, weight=10) and node C (NodeID=1002, weight=10). At the start, resources such as disk and network are sufficient.

Suppose all three nodes of the data group are running normally at first, then the process of node A is forcibly shut down by an external factor and node C is elected as the new master node. After a while, the process of node A is started again. Will node A become the master node again?

Answer: After node A restarts, it automatically becomes a slave node.

Principle description:

If a master node stops and the data group elects a new master node before it restarts, the restarted node automatically becomes a slave node when it rejoins the data group and actively requests data synchronization from the current master node.

1.9  Scenario 5

[This scenario covers the case where the group has no master node and another node is then started.]

Suppose the data group contains three data nodes: node A (master node, NodeID=1000, weight=10), node B (NodeID=1001, weight=10) and node C (NodeID=1002, weight=10). At the start, resources such as disk and network are sufficient.

Suppose all three nodes of the data group are running normally at first, and then the processes of node A and node B are forcibly shut down by an external factor. The data group no longer meets the minimum requirement for an election, so node C stays in the slave state, the data group temporarily has no master node, and it cannot serve writes for the time being (reads still work). If node A is restarted after a while, can the data group hold an election, and if so, which node will be elected as the master node?

Answer: The data group can hold an election, and node C will be elected as the master node.

Principle description:

After node A restarts, the number of surviving nodes again satisfies >= (total number of nodes/2+1), so the data group can hold an election. Because node A and node C have the same LSN and election weight, and the NodeID of node C is larger than that of node A, node C is elected as the master node.
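Scenarios 4 and 5 both hinge on what happens when a stopped node starts again. The sketch below replays them under simplifying assumptions: the group is a plain dictionary, the helper names (`make_group`, `stop_node`, `start_node`, `try_elect`) are hypothetical, all LSNs are given the same arbitrary value, and the two-phase commit of the real election is not modeled.

```python
from typing import Optional


def make_group() -> dict:
    # LSN 50 is an arbitrary value; the scenarios only need the LSNs to be equal.
    return {
        "A": {"node_id": 1000, "weight": 10, "lsn": 50, "alive": True, "role": "master"},
        "B": {"node_id": 1001, "weight": 10, "lsn": 50, "alive": True, "role": "slave"},
        "C": {"node_id": 1002, "weight": 10, "lsn": 50, "alive": True, "role": "slave"},
    }


def stop_node(group: dict, name: str) -> None:
    group[name]["alive"] = False
    group[name]["role"] = "stopped"


def try_elect(group: dict) -> Optional[str]:
    alive = [n for n, s in group.items() if s["alive"]]
    for n in alive:
        if group[n]["role"] == "master":
            return n  # a master already exists, so no new election
    if len(alive) < len(group) // 2 + 1:
        return None  # not enough surviving nodes
    winner = max(
        alive,
        key=lambda n: (group[n]["lsn"], group[n]["weight"], group[n]["node_id"]),
    )
    group[winner]["role"] = "master"
    return winner


def start_node(group: dict, name: str) -> Optional[str]:
    group[name]["alive"] = True
    group[name]["role"] = "slave"  # a restarted node rejoins as a slave
    return try_elect(group)


# Scenario 4: A stops, C is elected, then A restarts.
g4 = make_group()
stop_node(g4, "A")
master_after_failure = try_elect(g4)
master_after_restart = start_node(g4, "A")

# Scenario 5: A and B stop, no election is possible, then A restarts.
g5 = make_group()
stop_node(g5, "A")
stop_node(g5, "B")
master_without_quorum = try_elect(g5)
master_after_a_returns = start_node(g5, "A")

print(master_after_failure, master_after_restart)
print(master_without_quorum, master_after_a_returns)
```

In both cases the restarted node A comes back as a slave. In scenario 4 that is because a master already exists, and in scenario 5 because node C wins the tie on NodeID.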

7. Summary

In this article, the author introduced the basic architecture of SequoiaDB and the election principle that operates within it, and walked through several simulated scenarios so that users can understand the election mechanism of SequoiaDB replication groups more intuitively.

The election mechanism of SequoiaDB is distinctive. The choice of master node is determined by each node's LSN, election weight and NodeID. Many users assume the master node is chosen at random, but the election actually follows well-defined rules. Users who understand these operating principles can use, operate and maintain SequoiaDB with more confidence, and can locate problems and propose solutions in a more targeted way when the database cluster behaves abnormally.

 

