WeChat interview: what is consistent hashing? In what scenarios is it used? What problems does it solve?

Hello everyone, my name is Xiaolin.

When I was browsing posts on Nowcoder (Niuke.com), I found that a classmate had been asked this question in a WeChat interview:

The first question was: what is consistent hashing, what are its usage scenarios, and what problems does it solve?

This question is quite interesting, so I will talk about it today.

Go!

How to distribute requests?

Most websites are definitely not served by a single server, because the concurrency and data volume a single machine can handle are limited. Instead, multiple servers form a cluster that provides services to the outside world.

But the question is: now that there are so many nodes (hereinafter simply called "nodes" rather than "server nodes", since it's shorter), how do we distribute the clients' requests?

This is actually the "load balancing" problem. There are many load balancing algorithms; different algorithms correspond to different allocation strategies and suit different business scenarios.

The simplest way is to introduce an intermediate load balancing layer and let it forward external requests to the internal cluster "in turn". For example, if the cluster has 3 nodes and 3 external requests arrive, each node processes 1 request, achieving the goal of distributing requests. This is the round-robin algorithm.

Considering that each node may have a different hardware configuration, we can introduce a weight value: set a higher weight for nodes with better hardware, and then allocate requests to nodes in proportion to their weights, so that better-equipped nodes take on more requests. This algorithm is called weighted round-robin.
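As a rough illustration, here is a minimal weighted round-robin sketch in Python; the node names and weights are made up for the example:

    import itertools

    # Hypothetical cluster: node name -> weight (better hardware, higher weight).
    NODES = {"node-1": 5, "node-2": 3, "node-3": 1}

    # Naive approach: repeat each node once per unit of weight, then cycle forever.
    pool = itertools.cycle(
        [name for name, weight in NODES.items() for _ in range(weight)]
    )

    def pick_node() -> str:
        """Return the node that should handle the next request."""
        return next(pool)

This naive version hands a node its requests in bursts; production balancers typically use a smoother variant, but the proportion of requests each node receives is the same.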

The weighted round-robin algorithm is used under the premise that every node stores the same data, so any read request can be served by any node.

However, the weighted round-robin algorithm cannot cope with "distributed systems", because in a distributed system each node stores different data.

When we want to increase the capacity of the system, the data is horizontally partitioned across different nodes for storage, that is, the data is distributed over different nodes. For example, in a distributed KV (key-value) cache system, we must be able to determine which node a given key should be fetched from; it is not the case that any node can return the cached result.

Therefore, we need a load balancing algorithm that can deal with distributed systems.

What's wrong with using a hash algorithm?

Some students may quickly think of the hash algorithm. Hashing the same key always produces the same value, so a given key can be deterministically assigned to one node, which meets the load balancing requirement of a distributed system.

The simplest hash-based approach is the modulo operation. For example, if there are 3 nodes in a distributed system, data is mapped to nodes with the formula hash(key) % 3.

If the client wants to get the data of the specified key, the node can be located by the following formula:

hash(key) % 3

If the value calculated by the above formula is 0, it means that the key needs to be obtained from the first node.
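A minimal sketch of this addressing scheme, using crc32 purely as a stand-in for whatever hash function the system actually uses:

    import zlib

    NODES = ["node-A", "node-B", "node-C"]

    def locate(key: str) -> str:
        # The hash must be deterministic across processes, which is why we
        # use crc32 here rather than Python's salted built-in hash().
        index = zlib.crc32(key.encode()) % len(NODES)
        return NODES[index]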

But there is a fatal problem: if the number of nodes changes, that is, when the system is expanded or shrunk, any data whose mapping relationship has changed must be migrated, otherwise it can no longer be found.

For example, suppose we have a distributed KV cache system composed of three nodes A, B, and C. The data is distributed with hash(key) % 3, and each node stores different data:

Now 3 requests arrive to query key-01, key-02, and key-03. Suppose the hash() function gives hash(key-01) = 6, hash(key-02) = 7, and hash(key-03) = 8; we then take these values modulo 3.

Through such a hash algorithm, each key can be located to the corresponding node.

When 3 nodes can no longer meet the business requirements, we add a node, and the number of nodes changes from 3 to 4. This means the modulus in the hash formula changes, which causes most of the mapping relationships to change, as shown in the following figure:

For example, hash(key-01) % 3 = 0 previously, but now hash(key-01) % 4 = 2. A query for key-01 is therefore routed to node C, while the data of key-01 actually sits on node A, not node C, so no data can be found.
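Plugging the assumed hash values from the worked example into code makes the breakage obvious:

    # Hash values assumed in the worked example above.
    hashes = {"key-01": 6, "key-02": 7, "key-03": 8}

    for key, value in hashes.items():
        print(f"{key}: 3 nodes -> index {value % 3}, 4 nodes -> index {value % 4}")

    # key-01: 3 nodes -> index 0, 4 nodes -> index 2
    # key-02: 3 nodes -> index 1, 4 nodes -> index 3
    # key-03: 3 nodes -> index 2, 4 nodes -> index 0

Every single mapping changed, so every one of these keys would have to be migrated.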

In the same way, if we shrink the distributed system, for example by removing a node, the changed modulus may likewise make data impossible to find.

To solve this problem, we need to migrate the data. For example, when the number of nodes changes from 3 to 4, we need to remap the data and nodes based on the new calculation formula hash(key) % 4.

Assuming the total number of data items is M, in the worst case all data must be migrated when the number of nodes changes, so the data migration scale of the hash algorithm is O(M). The migration cost is far too high.
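A quick simulation with made-up keys shows how bad this is in practice:

    import zlib

    # Count how many of 10,000 keys map to a different node when the
    # cluster grows from 3 to 4 nodes under hash(key) % N.
    keys = [f"key-{i:05d}" for i in range(10_000)]
    moved = sum(
        zlib.crc32(k.encode()) % 3 != zlib.crc32(k.encode()) % 4 for k in keys
    )
    print(f"{moved} of {len(keys)} keys must migrate")  # roughly 75%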

Therefore, we should rethink a new algorithm to avoid excessive data migration when the distributed system expands or shrinks.

What's wrong with using a consistent hashing algorithm?

The consistent hash algorithm solves the problem of excessive data migration when the distributed system expands or shrinks.

The consistent hash algorithm also uses modulo, but unlike the plain hash algorithm, which takes the modulo over the number of nodes, consistent hashing takes the modulo over 2^32, a fixed value.

We can organize the value range of this modulo-2^32 operation into a circle, like a clock face: a clock's circle can be seen as consisting of 60 points, and here we imagine a circle consisting of 2^32 points. This ring is called the hash ring, as shown in the following figure:

Consistent hashing involves two hashing steps:

  • Step 1: hash the storage nodes, for example by each node's IP address, to map them onto the ring;
  • Step 2: when data is stored or accessed, hash the data's key to map it onto the ring as well.

Therefore, consistent hashing refers to mapping both "storage nodes" and "data" to an end-to-end hash ring.

The question is: after hashing a piece of "data" and getting a position, how do we find the node that stores that data?

The answer: from the mapped position, move clockwise; the first node encountered is the node that stores the data.

For example, there are 3 nodes that have been hashed and mapped to the following positions:

Next, hash the key-01 we want to query to determine its position on the hash ring, and then find the first node clockwise from that position; that node is the one storing the data of key-01.

For example, in the following figure, the first node found clockwise from key-01's mapped position is node A.

Therefore, when you need to read or write the value of a given key, addressing takes the following two steps (a code sketch follows the list):

  • First, hash the key to determine the position of the key on the ring;
  • Then, walk clockwise from this position, and the first node encountered is the node that stores the key.
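Here is a minimal hash ring sketch implementing those two steps. The crc32 hash is again just an assumed stand-in (real systems often use MD5 or MurmurHash); it conveniently returns values in [0, 2^32), matching the size of the ring:

    import bisect
    import zlib

    def ring_hash(s: str) -> int:
        return zlib.crc32(s.encode())  # a position on the 2^32-point ring

    class HashRing:
        def __init__(self, nodes):
            # Step 1: map each node onto the ring by hashing its identifier.
            self._ring = sorted((ring_hash(n), n) for n in nodes)
            self._points = [point for point, _ in self._ring]

        def locate(self, key: str) -> str:
            # Step 2: hash the key, then take the first node clockwise from
            # that position, wrapping around from 2^32 - 1 back to 0.
            i = bisect.bisect_right(self._points, ring_hash(key))
            return self._ring[i % len(self._points)][1]

    ring = HashRing(["node-A", "node-B", "node-C"])
    print(ring.locate("key-01"))  # first node clockwise from key-01's position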

Now that we know how consistent hashing addresses data, let's see whether adding or removing a node causes a large amount of data migration.

Assuming that the number of nodes increases from 3 to 4, the new node D is mapped to the position in the following figure after hash calculation:

You can see that key-01 and key-03 are not affected; only key-02 needs to be migrated to node D.

Suppose the number of nodes is reduced from 3 to 2, such as removing node A:

As you can see, key-02 and key-03 are not affected; only key-01 needs to be migrated to node B.

Therefore, in the consistent hash algorithm, adding or removing a node affects only the data between that node and its clockwise neighbor on the hash ring; other data is not affected.
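Reusing the HashRing sketch from above, we can measure this directly (key names made up):

    # Compare key placement before and after adding a fourth node.
    before = HashRing(["node-A", "node-B", "node-C"])
    after = HashRing(["node-A", "node-B", "node-C", "node-D"])

    keys = [f"key-{i:05d}" for i in range(10_000)]
    moved = sum(before.locate(k) != after.locate(k) for k in keys)
    print(f"{moved} of {len(keys)} keys migrate")  # ~1/4 on average, not ~75%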

In the diagrams above, the three nodes' positions are relatively scattered over the hash ring, so it looks as if requests will be "balanced" across the nodes.

However, the consistent hashing algorithm does not guarantee that nodes are evenly distributed on the hash ring, which introduces a problem: a large number of requests can end up concentrated on one node.

For example, the mapping positions of the three nodes in the following figure are all on the right half of the hash ring:

In this case, more than half of the data is addressed to node A, so access requests concentrate mainly on node A. This is clearly unacceptable; as load balancing goes, this situation is not balanced at all.

In addition, when nodes are unevenly distributed, disaster recovery and capacity changes easily put pressure on adjacent nodes on the hash ring, making an avalanche-style chain reaction likely.

For example, if node A in the figure above goes down, then according to the rules of consistent hashing all of its data must be migrated to the adjacent node B. Node B's data volume and traffic will surge many times over; once this new pressure exceeds node B's processing capacity, node B will collapse as well, setting off an avalanche-style chain reaction.

Therefore, although the consistent hash algorithm reduces the amount of data migration, there is a problem of uneven distribution of nodes.

How to improve balance through virtual nodes?

To solve the uneven distribution of nodes on the hash ring, we need a large number of nodes: the more nodes there are, the more evenly they spread across the ring.

But the problem is that in practice we don't have that many real nodes. So we introduce virtual nodes, that is, multiple replicas of each real node.

The specific approach is to map virtual nodes onto the hash ring instead of real nodes, and to map each virtual node to an actual node, so there is a "two-layer" mapping relationship.

For example, set 3 virtual nodes for each node:

  • Add numbers to node A as virtual nodes: A-01, A-02, A-03
  • Add numbers to node B as virtual nodes: B-01, B-02, B-03
  • Add numbers to node C as virtual nodes: C-01, C-02, C-03

After the introduction of virtual nodes, the original situation where there are only 3 nodes on the hash ring will become 9 virtual nodes mapped to the hash ring, and the number of nodes on the hash ring will be tripled.

You can see that as the number of nodes on the ring increases, their distribution becomes relatively uniform. Now, if an access request is addressed to the virtual node "A-01", we resolve "A-01" to the real node A, and the request is served by node A.
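Extending the earlier HashRing sketch with virtual nodes, using the "A-01"-style naming from the example above (the replica count is a parameter):

    import bisect
    import zlib

    def ring_hash(s: str) -> int:
        return zlib.crc32(s.encode())

    class VirtualHashRing:
        def __init__(self, nodes, replicas=3):
            points = []
            for node in nodes:
                for i in range(1, replicas + 1):
                    # Layer 1: virtual node name (e.g. "A-01") -> ring position.
                    # Layer 2: remember which real node it resolves to.
                    points.append((ring_hash(f"{node}-{i:02d}"), node))
            self._ring = sorted(points)
            self._points = [point for point, _ in self._ring]

        def locate(self, key: str) -> str:
            i = bisect.bisect_right(self._points, ring_hash(key))
            return self._ring[i % len(self._points)][1]  # the real node

    ring = VirtualHashRing(["A", "B", "C"], replicas=3)  # 9 points on the ring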

To keep the explanation easy to follow, each real node above has only 3 virtual nodes, so the balancing effect is actually quite limited. In real projects the number of virtual nodes is much larger; for example, in Nginx's consistent hashing algorithm, each real node with weight 1 has 160 virtual nodes.

In addition, virtual nodes not only improve the balance of the ring but also the stability of the system: when nodes change, different real nodes share the resulting load, so stability is higher.

For example, when a node is removed, all of its virtual nodes are removed with it. The clockwise successors of these virtual nodes may belong to different real nodes, which means the pressure caused by the change is shared among several real nodes.

Moreover, with virtual nodes we can also give more weight to nodes with better hardware, for example by assigning more virtual nodes to higher-weight nodes.

Therefore, consistent hashing with virtual nodes is suitable not only for clusters whose nodes have different hardware configurations, but also for scenarios where the number of nodes changes.

Summary

Different load balancing algorithms are applicable to different business scenarios.

Strategies such as round-robin can only be applied when every node stores the same data, so that any node can serve a request. They are not applicable to distributed systems, where data is horizontally partitioned across different nodes: to access data, we must address the node that stores it.

Although the hash algorithm can establish a mapping between data and nodes, whenever the number of nodes changes, in the worst case all data must be migrated, which is far too disruptive, so it is not suitable for scenarios where the number of nodes changes.

In order to reduce the amount of data to be migrated, a consistent hashing algorithm has emerged.

Consistent hashing maps both "storage nodes" and "data" onto an end-to-end hash ring. Adding or removing a node affects only the data between that node and its clockwise neighbor on the ring; other data is not affected.

However, the consistent hash algorithm cannot guarantee an even distribution of nodes, so a large number of requests may concentrate on one node; in that case, disaster recovery and scaling are prone to avalanche-style chain reactions.

To solve consistent hashing's uneven node distribution, virtual nodes are introduced, that is, multiple replicas of each real node. Virtual nodes are mapped onto the hash ring instead of real nodes, and each virtual node maps to an actual node, giving a "two-layer" mapping relationship.

Introducing virtual nodes improves both the balance of nodes and the stability of the system. Therefore, consistent hashing with virtual nodes is suitable not only for clusters whose nodes have different hardware configurations, but also for scenarios where the number of nodes changes.

over!

Origin: blog.csdn.net/qq_34827674/article/details/123044451