Data Structure and Algorithm Lecture 16: Consistent Hash Algorithm of Distributed Algorithms

This article is the sixteenth lecture on data structures and algorithms. The consistent hash algorithm is a classic distributed algorithm: the hash ring is introduced to address monotonicity, and virtual nodes are introduced to address balance.

1. Why introduce a consistent hash algorithm

In a distributed cluster, adding and removing machines, or having a failed machine automatically leave the cluster, are among the most basic functions of cluster management. With the commonly used hash(object) % N scheme, adding or removing a machine changes N, so most of the existing data can no longer be found at its old location, which seriously violates the principle of monotonicity.
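A small sketch (mine, not from the original article) makes the scale of the problem concrete: with a toy integer "hash", about 75% of keys change buckets when the node count goes from 3 to 4.

```python
# Toy demonstration of the hash(object) % N problem: when the node count
# changes from 3 to 4, most keys land in a different bucket.

def bucket(key_hash: int, n: int) -> int:
    return key_hash % n  # the naive placement rule

total = 10_000
moved = sum(1 for h in range(total) if bucket(h, 3) != bucket(h, 4))
print(f"{moved}/{total} keys ({moved / total:.0%}) change buckets")  # ~75% move
```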

2. Introduction to Consistent Hash Algorithm

The consistent hashing algorithm was proposed at the Massachusetts Institute of Technology in 1997 as an algorithm for implementing distributed hash tables (DHT). Its design goal was to solve hot-spot problems on the Internet, with an intent very similar to CARP. Consistent hashing corrects the problems caused by the simple hashing algorithm used by CARP, allowing DHTs to be truly practical in P2P environments.

The consistent hash algorithm proposes four criteria for judging the quality of a hash algorithm in a dynamically changing cache environment:

  • Balance: hash results should be distributed across all buffers as evenly as possible, so that all buffer space is utilized. Many hash algorithms satisfy this condition.
  • Monotonicity: if some content has already been assigned to a buffer by hashing, and a new buffer is then added to the system, the hash result should guarantee that previously allocated content maps either to its original buffer or to the new buffer, and never to another buffer from the old buffer set.
  • Spread: in a distributed environment, a terminal may not see all buffers, only part of them. When terminals map content to buffers through hashing, different terminals may see different buffer ranges and thus produce inconsistent results: the same content gets mapped to different buffers by different terminals. This should clearly be avoided, since it stores the same content redundantly and reduces storage efficiency. Spread is defined as the degree to which this occurs; a good hash algorithm minimizes it.
  • Load: load is the spread problem viewed from the other side. Since different terminals may map the same content to different buffers, a given buffer may also be mapped different content by different users. Like spread, this should be avoided, so a good hash algorithm should minimize the load on each buffer.

3. Consistent Hash Algorithm

3.1. Hash ring

Using a common hash algorithm, a key can be hashed into a space with 2^32 buckets, i.e. into the integer space [0, 2^32). We treat this space as a ring whose two ends join. As shown below:

[Figure: the ring-shaped hash space of 2^32 positions]
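As a sketch of this mapping (MD5 is my assumption; any well-mixed hash works), a string can be placed on the [0, 2^32) ring like so:

```python
import hashlib

def ring_hash(value: str) -> int:
    """Map an arbitrary string to a point on the [0, 2**32) ring."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")  # first 32 bits of the digest

point = ring_hash("key1")
print(point)  # a fixed position in [0, 2**32)
```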

Assuming we now have 4 key values, key1, key2, key3, and key4, we use a certain hash algorithm to map them to the ring-shaped hash space above.

k1=hash(key1);
k2=hash(key2);
k3=hash(key3);
k4=hash(key4);

[Figure: k1–k4 placed on the ring]

Similarly, suppose we have 3 cache servers; we place them on the ring with the same hash algorithm, usually hashing the machine's IP address or a unique hostname.

c1=hash(cache1);
c2=hash(cache2);
c3=hash(cache3);

[Figure: c1–c3 placed on the ring]

Next, how is data stored on the cache servers? After a key is hashed, walk clockwise around the ring to find the nearest server node, and store the data there. In the figure above, k1 is stored on server c3; k3 and k4 are stored on c1; and k2 is stored on c2. In a diagram:

[Figure: each key stored on the first server found clockwise]
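The clockwise lookup can be sketched with a sorted list and binary search (the names `HashRing` and `locate` are mine, not from any library):

```python
import bisect
import hashlib

def ring_hash(value: str) -> int:
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest()[:4], "big")

class HashRing:
    def __init__(self, nodes):
        # The ring is just the sorted list of (point, node) pairs.
        self.ring = sorted((ring_hash(n), n) for n in nodes)
        self.points = [p for p, _ in self.ring]

    def locate(self, key: str) -> str:
        """Return the first node clockwise from hash(key), wrapping past 2**32 - 1."""
        i = bisect.bisect_right(self.points, ring_hash(key))
        return self.ring[i % len(self.ring)][1]

ring = HashRing(["cache1", "cache2", "cache3"])
for k in ("key1", "key2", "key3", "key4"):
    print(k, "->", ring.locate(k))
```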

3.2. Delete node

Suppose the cache3 server goes down and must be removed from the cluster. The key k1 previously stored on c3 is then reassigned by searching clockwise for the nearest node, c1, so k1 is now stored on c1. The figure below makes this clear.

[Figure: after c3 is removed, k1 moves to c1]

After c3 is removed, only k1, which was stored on c3, is affected; k2, k3, and k4 are untouched. This avoids the mass remapping, and the resulting avalanche risk, of the naive hash(key) % N scheme.
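A small check (reusing the sketch functions from above, re-declared here so the snippet stands alone) confirms that only the keys that lived on the removed node move:

```python
import bisect
import hashlib

def ring_hash(value: str) -> int:
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest()[:4], "big")

def locate(nodes, key: str) -> str:
    ring = sorted((ring_hash(n), n) for n in nodes)
    points = [p for p, _ in ring]
    i = bisect.bisect_right(points, ring_hash(key))
    return ring[i % len(ring)][1]

keys = [f"key{i}" for i in range(1000)]
before = {k: locate(["c1", "c2", "c3"], k) for k in keys}
after = {k: locate(["c1", "c2"], k) for k in keys}  # c3 removed

moved = [k for k in keys if before[k] != after[k]]
assert all(before[k] == "c3" for k in moved)  # only c3's keys were remapped
print(f"{len(moved)} of {len(keys)} keys moved; all of them lived on c3")
```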

3.3. Add nodes

After a new node c4 is added, k4, originally stored on c1, migrates to c4, sharing c1's storage and traffic pressure.

[Figure: after c4 is added, k4 moves from c1 to c4]
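The mirror-image check for expansion (same hypothetical sketch functions as before): every key that moves must land on the new node, and nothing else is disturbed.

```python
import bisect
import hashlib

def ring_hash(value: str) -> int:
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest()[:4], "big")

def locate(nodes, key: str) -> str:
    ring = sorted((ring_hash(n), n) for n in nodes)
    points = [p for p, _ in ring]
    i = bisect.bisect_right(points, ring_hash(key))
    return ring[i % len(ring)][1]

keys = [f"key{i}" for i in range(1000)]
before = {k: locate(["c1", "c2", "c3"], k) for k in keys}
after = {k: locate(["c1", "c2", "c3", "c4"], k) for k in keys}  # c4 added

moved = [k for k in keys if before[k] != after[k]]
assert all(after[k] == "c4" for k in moved)  # every moved key landed on c4
print(f"{len(moved)} of {len(keys)} keys migrated to the new node c4")
```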

3.4. The problem of imbalance

The simple consistent hashing scheme above still has a problem in some cases: when a node goes down, all of its data falls to the single nearest node. That node's pressure rises suddenly, which may trigger an avalanche and bring down the whole service.

As shown below:

[Figure: c3's entire load falling on c1]

When node c3 is removed, k1, previously on c3, migrates to c1, which brings two kinds of pressure:

  • The request traffic previously handled by c3 shifts to c1, increasing c1's traffic. If c3 held hot data, the extra load may cause c1 to fail.
  • The keys previously stored on c3 migrate to c1, increasing c1's memory usage, which may become a bottleneck.

Under these two pressures, c1 may also go down. Its load is then passed on to c2, the snowballing repeats, and the service pressure avalanches until the whole service is unavailable. This violates the balance principle among the four listed at the beginning: after the node failure, the distribution of traffic and memory breaks the original balance.

3.5. Virtual nodes

A "virtual node" is a replica of an actual node (machine) in the hash space; one actual node corresponds to several virtual nodes.

Again, a picture explains it best. Assume the following correspondence between real nodes and virtual nodes:

Virtual100 —> Real1
Virtual101 —> Real1
Virtual200 —> Real2
Virtual201 —> Real2
Virtual300 —> Real3
Virtual301 —> Real3

Similarly, the hashed positions are as follows:

hash(Virtual100) —> V100 —> Real1
hash(Virtual101) —> V101 —> Real1
hash(Virtual200) —> V200 —> Real2
hash(Virtual201) —> V201 —> Real2
hash(Virtual300) —> V300 —> Real3
hash(Virtual301) —> V301 —> Real3

The keys are hashed onto the ring in the same way; we omit their values here.

[Figure: virtual nodes V100–V301 on the ring]
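Following the table above, virtual-node placement can be sketched by suffixing each real node's name to generate its virtual points (the `#i` naming is my assumption; any scheme producing distinct strings per replica works):

```python
import bisect
import hashlib

def ring_hash(value: str) -> int:
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest()[:4], "big")

REPLICAS = 2  # two virtual nodes per real node, matching the table above

def build_ring(real_nodes):
    # Each virtual point remembers which real node it belongs to.
    return sorted(
        (ring_hash(f"{node}#{i}"), node)
        for node in real_nodes
        for i in range(REPLICAS)
    )

def locate(ring, key: str) -> str:
    points = [p for p, _ in ring]
    i = bisect.bisect_right(points, ring_hash(key))
    return ring[i % len(ring)][1]

ring = build_ring(["Real1", "Real2", "Real3"])
print(len(ring), "virtual points;", "k1 ->", locate(ring, "k1"))
```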

As with the earlier setup without virtual nodes, the interesting case is what happens after a machine goes down.

Assume the Real1 machine fails. Then:

  • The k1 data originally stored on virtual node V100 is migrated to V301, i.e. onto the Real3 machine.
  • The k4 data originally stored on virtual node V101 is migrated to V200, i.e. onto the Real2 machine.

The result is as follows:

[Figure: after Real1 fails, its keys spread across Real2 and Real3]

This solves the earlier problem: after a node goes down, its storage and traffic are not all shifted onto a single machine but are spread across multiple nodes, avoiding the avalanche a node failure could otherwise trigger.

The more physical and virtual nodes there are, the more evenly a failure's load is spread and the smaller the avalanche risk.

4. Application of Consistent Hash

4.1. Expansion in Redis cluster

Usage scenario:

  • Redis expansion

Why use consistent hashing?

  • As the business grows and the site's cache service must be expanded, problems arise: expanding from 3 cache servers to 4 under hash % N causes 75% of the data to miss.

The consistent hash algorithm solves the cluster's scalability problem.

  • The algorithm works as follows: construct an integer ring of length 2^32, place each cache server node on the ring according to the hash of its name, then hash each data key and walk clockwise to the nearest server node, which stores the data. When the cluster expands from 3 servers to 4, consistent hashing keeps the data hit rate at about 75%.
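A rough comparison of the two schemes (my own sketch; 200 virtual points per node are assumed so each node's share of the ring stays close to even) shows the hit rates after expanding from 3 to 4 servers:

```python
import bisect
import hashlib

def ring_hash(value: str) -> int:
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest()[:4], "big")

def build_ring(nodes, replicas=200):
    return sorted(
        (ring_hash(f"{n}#{i}"), n) for n in nodes for i in range(replicas)
    )

def locate(ring, key: str) -> str:
    points = [p for p, _ in ring]
    i = bisect.bisect_right(points, ring_hash(key))
    return ring[i % len(ring)][1]

keys = [f"user:{i}" for i in range(5000)]
old_ring = build_ring(["s1", "s2", "s3"])
new_ring = build_ring(["s1", "s2", "s3", "s4"])

# Fraction of keys still found on their old server after expansion.
mod_hit = sum(ring_hash(k) % 3 == ring_hash(k) % 4 for k in keys) / len(keys)
ch_hit = sum(locate(old_ring, k) == locate(new_ring, k) for k in keys) / len(keys)
print(f"modulo hit rate: {mod_hit:.0%}; consistent hash hit rate: {ch_hit:.0%}")
```

With well-spread virtual nodes the consistent-hash hit rate stays near 75% (only the new node's share of keys moves), while the modulo scheme keeps only about 25%.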

4.2. Load balancing in Dubbo

Dubbo ships a ConsistentHashLoadBalance strategy that applies the same algorithm: requests with the same parameters are routed to the same provider, and each provider is expanded into virtual nodes (160 by default) to keep the load even.

Origin blog.csdn.net/qq_28959087/article/details/131840721