Why do you need consistent hashing when you have a hashing algorithm?

Hello, I'm Weiki and welcome to ape java.

foreword

In actual development work, we often run into the performance bottleneck of a single cluster, such as the master-slave mode of Redis or the master-slave mode of MySQL. Because write operations can only be performed on the master node, the write performance of the master node becomes the write-capacity bottleneck of the entire cluster. How do we solve this?

Answer: split into multiple sub-clusters, breaking through the performance limit of a single cluster.

Friends with more development experience will immediately think of adding a Proxy layer. The Proxy layer receives read and write requests from clients, hashes the key, and routes each request to the corresponding cluster. Codis, for example, is based on this Proxy approach. Two classic hash routing algorithms are introduced below.

Hash algorithm

Assume there are three nodes A, B, and C. When a user request arrives, the Proxy layer computes a hash of the key and takes hash(key) modulo the total number of nodes, so that requests with the same key are always routed to the same node, as shown in the figure below:
(Figure: hash cluster)

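A minimal Java sketch of this routing rule (the node list, hash function, and class name are illustrative assumptions, not code from any particular Proxy):

```java
import java.util.List;

public class ModuloRouter {
    private final List<String> nodes; // e.g. ["A", "B", "C"]

    public ModuloRouter(List<String> nodes) {
        this.nodes = nodes;
    }

    // Route a key by taking hash(key) modulo the total number of nodes.
    public String route(String key) {
        int hash = key.hashCode() & 0x7fffffff; // keep the hash non-negative
        return nodes.get(hash % nodes.size());
    }

    public static void main(String[] args) {
        ModuloRouter router = new ModuloRouter(List.of("A", "B", "C"));
        // The same key always produces the same hash, so it is always routed to the same node.
        System.out.println(router.route("user:1001"));
    }
}
```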
If, as the business develops, the number of users keeps growing and the original three nodes A, B, and C can no longer bear the traffic from the client side, a node needs to be added, as shown in the following figure:
(Figure: adding node D to the cluster)

expansion

From the figure above, we can see that after node D is added, the routing algorithm has to change from hash(key) % 3 to hash(key) % 4. So what happens?

Suppose hash(key) = 100. Before expansion, hash(key) % 3 = 1; after expansion, hash(key) % 4 = 0, so the result changes. The same request that was routed to node number 1 (node B) before the expansion is routed to node number 0 (node A) after it. Because the addressing has changed, the request can find its data on node B but not on node A, so this way of expanding causes data lookups to fail, and the consequences can be catastrophic.
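To see the scale of the problem, we can count how many keys change nodes when the modulus goes from 3 to 4. This is a small illustrative experiment, not from the original article; roughly three quarters of the keys end up on a different node:

```java
public class RemapCount {
    public static void main(String[] args) {
        int keys = 1_000_000, moved = 0;
        for (int hash = 0; hash < keys; hash++) {
            // Node index before (3 nodes) vs. after (4 nodes) the expansion.
            if (hash % 3 != hash % 4) {
                moved++;
            }
        }
        System.out.printf("%d of %d keys remapped (%.1f%%)%n",
                moved, keys, 100.0 * moved / keys);
    }
}
```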

Shrinkage

In the same way, suppose that due to the epidemic the number of users keeps shrinking. Originally 3 nodes were needed, but now 2 nodes are enough, so the cluster needs to be scaled down, as shown in the following figure:

(Figure: scaling the cluster down from 3 nodes to 2)

When scaling down, the same addressing failure occurs as when scaling up. Both can be solved by migrating data:

For example:

For an expansion operation, if a key was routed to node A before expansion and to node B after expansion, the data can be migrated from node A to node B to match the new routing rule;

The scaling-down operation works the same way.
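A sketch of what such a migration loop might look like under the modulo scheme; the in-memory `stores` structure is a hypothetical stand-in for the real per-node storage:

```java
import java.util.List;
import java.util.Map;

public class ModuloMigration {
    // stores.get(i) holds the key/value data currently on node i (hypothetical in-memory stand-in).
    // stores must already contain newSize entries before migration starts.
    static void migrate(List<Map<String, String>> stores, int oldSize, int newSize) {
        for (int node = 0; node < oldSize; node++) {
            // Copy the key set so entries can be removed while iterating.
            for (String key : List.copyOf(stores.get(node).keySet())) {
                int hash = key.hashCode() & 0x7fffffff;
                int target = hash % newSize;   // where the new routing rule points
                if (target != node) {          // the key's node index changed, so its data has to move
                    stores.get(target).put(key, stores.get(node).remove(key));
                }
            }
        }
    }
}
```

Note that every key on every node has to be rehashed, and with the modulo scheme most of them actually move.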

But when the amount of data is large, migrating it has a real cost. Is there a better way to reduce this data migration?

Answer: consistent hashing

consistent hash algorithm

The consistent hash algorithm also uses a modulo operation, but with a difference. The hash algorithm above takes the modulus of the total number of nodes, so whenever the number of nodes changes the result of the modulo changes. The consistent hash algorithm instead takes the modulus of the fixed value 2^32 (so positions fall in the range 0 to 2^32 - 1), so as long as hash(key) stays the same, the result of the modulo stays the same as well.

In fact, consistent hashing organizes the entire hash value space into a virtual ring, that is, a hash ring, as shown in the following figure:
(Figure: the hash ring)

If there are 3 nodes A, B, and C, when you need to read or write the value of a key, you can address it in the following two steps:

  • Compute hash(key) % 2^32 to determine the key's position on the ring;
  • From that position, walk clockwise along the hash ring; the first node encountered is the node the key belongs to.

As shown in the figure below: hash(keyA) % 2^32 addresses node A, hash(keyB) % 2^32 addresses node B, and hash(keyC) % 2^32 addresses node C.
(Figure: keyA, keyB, and keyC addressed to nodes A, B, and C on the ring)
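A minimal Java sketch of this two-step clockwise lookup, using a sorted map as the ring; the CRC32 hash and the class name are illustrative assumptions, not code from the article:

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.CRC32;

public class ConsistentHashRing {
    // ring position (0 .. 2^32 - 1) -> node name
    private final TreeMap<Long, String> ring = new TreeMap<>();

    public void addNode(String node) {
        ring.put(hash(node), node);
    }

    public void removeNode(String node) {
        ring.remove(hash(node));
    }

    // Step 1: hash the key onto the ring; step 2: walk clockwise to the first node at or after it.
    // Assumes at least one node has been added.
    public String route(String key) {
        Map.Entry<Long, String> entry = ring.ceilingEntry(hash(key));
        return (entry != null ? entry : ring.firstEntry()).getValue(); // wrap around past the top of the ring
    }

    // CRC32 maps a string into the ring's 0 .. 2^32 - 1 range.
    private static long hash(String s) {
        CRC32 crc = new CRC32();
        crc.update(s.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }
}
```

With this structure, expansion is just addNode and shrinkage is removeNode; only the keys on the arc next to the changed node are re-addressed, as the next two sections show.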

expansion

When a node D is added, keyC, which was originally addressed to node C, is now addressed to node D. The affected keys are only those that fall between node B and node D on the ring; all other keys are unaffected.
(Figure: adding node D to the ring)

Shrinkage

When node C is removed, keyC, which was originally addressed to node C, walks clockwise and finds the next node, node A. Therefore only the keys between node B and node C are affected, and they are re-addressed to node A.
(Figure: removing node C from the ring)
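A small demo of both cases using the ConsistentHashRing sketch above. The key and node names are illustrative, and the node each key lands on depends on the hash values, so the printout will not necessarily match the figures:

```java
public class ResizeDemo {
    public static void main(String[] args) {
        ConsistentHashRing ring = new ConsistentHashRing();
        ring.addNode("A");
        ring.addNode("B");
        ring.addNode("C");
        System.out.println("keyC -> " + ring.route("keyC"));

        ring.addNode("D");    // expansion: only keys on the arc between D's predecessor and D change owner
        System.out.println("after adding D, keyC -> " + ring.route("keyC"));

        ring.removeNode("C"); // shrinkage: only keys that belonged to C move to the next node clockwise
        System.out.println("after removing C, keyC -> " + ring.route("keyC"));
    }
}
```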

Through the above analysis of expansion and shrinkage under consistent hashing, we can see that when nodes are added or removed, only some of the keys are affected and the amount of data to migrate is controllable; and as the number of nodes increases, the proportion of affected keys decreases.

fairness

As shown in the figure below, when the hash maps the nodes onto the ring unevenly, node A may bear 70% of the traffic while nodes B and C only bear 30%. This leads to uneven, unfair data access. How do we solve it?

Answer: virtual nodes
(Figure: uneven node distribution on the ring, with node A bearing most of the traffic)

As shown in the figure below, a virtual node X is added between node C and node A, and it actually points to node B. keyD, which was originally routed to node A, is now routed to node B, which relieves the skew and solves the fairness problem caused by uneven data access.
(Figure: virtual node X added to the ring, pointing to node B)
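One way to sketch virtual nodes on top of the ring above is to insert several replicas of each physical node at different positions; the replica count and the "#i" naming scheme here are illustrative assumptions:

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.CRC32;

public class VirtualNodeRing {
    private static final int REPLICAS = 100; // virtual nodes per physical node (illustrative)
    private final TreeMap<Long, String> ring = new TreeMap<>(); // ring position -> physical node

    public void addNode(String node) {
        // Each physical node is hashed onto the ring many times under different labels,
        // so its load is spread over many small arcs instead of one large one.
        for (int i = 0; i < REPLICAS; i++) {
            ring.put(hash(node + "#" + i), node);
        }
    }

    public void removeNode(String node) {
        for (int i = 0; i < REPLICAS; i++) {
            ring.remove(hash(node + "#" + i));
        }
    }

    // Clockwise lookup is unchanged; it lands on a virtual node and returns its physical owner.
    public String route(String key) {
        Map.Entry<Long, String> entry = ring.ceilingEntry(hash(key));
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    private static long hash(String s) {
        CRC32 crc = new CRC32();
        crc.update(s.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }
}
```

With enough virtual nodes, each physical node owns many small arcs scattered around the ring, so the traffic shares of A, B, and C even out.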

Summary

  • The hash algorithm addresses and routes by taking the modulus of the total number of nodes, so it is not suitable for business scenarios with frequent expansion and contraction or with a large amount of data, where massive data migration can occur.
  • Consistent hashing is a special hash algorithm: adding or removing nodes only affects the routing of part of the data, so only part of the data needs to be migrated to keep the cluster stable.
  • With the consistent hash algorithm, when the number of nodes is small the nodes may be unevenly distributed on the hash ring, which leads to uneven access across nodes; this can be solved by introducing virtual nodes.
  • The consistent hash algorithm therefore has better fault tolerance and scalability.

Finally

If you find this blog post helpful, please share it with more friends. We will bring you more practical content; welcome to follow the official account: ape java.

Origin blog.csdn.net/m0_54369189/article/details/126089954