Summary of Consistent Hash Algorithms


title: Consistent Hash Algorithm Summary
date: 2023-05-22 11:25:13
tags:
  • Algorithm
categories:
  • Data structure and algorithm
cover: https://cover.png
feature: false

1. Background

Suppose we have three cache servers for caching pictures, numbered 0, 1, and 2. There are 30,000 pictures to cache, and we want them spread evenly across the 3 servers so that the servers share the caching load. In other words, we want each server to cache about 10,000 pictures. How should we do it?

If we spread the 30,000 pictures evenly across the 3 servers without any rule, does that meet the requirement? It does. But then, to access a cached item, we have to search all 3 cache servers, looking through up to 30,000 items to find the one we want. That traversal is far too slow. So what should we do instead?

The original method is to hash the key of the cache item and take the hash value modulo the number of cache servers; the remainder determines which server the item is cached on. For example, suppose we use the picture name as the key for accessing a picture. Assuming picture names are unique, we can use the following formula to calculate which server a picture should be stored on, where N is the number of cache servers:

hash (picture name)% N

Because picture names are assumed to be unique, hashing the same picture name always gives the same result. If we have 3 servers and take the hash modulo 3, the remainder must be 0, 1, or 2, which matches our server numbers exactly. If the remainder is 0, we cache the picture on server 0; if it is 1, on server 1; and if it is 2, on server 2.

Then, when we access any picture, we only need to repeat the above calculation on the picture name to find out which cache server the picture is stored on, and we only need to look for the picture on that server. This meets our needs. For now, we will call the above algorithm the HASH algorithm or the modulo algorithm.
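The modulo placement described above can be sketched in a few lines of Python. This is only an illustration: the article does not name a hash function, so MD5 (via `hashlib`) is an assumption, and `stable_hash`, `server_for`, and the picture names are hypothetical.

```python
import hashlib

def stable_hash(key: str) -> int:
    """Deterministic hash of a string (assumption: MD5 stands in for hash())."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")

def server_for(picture_name: str, n_servers: int) -> int:
    """Return the server number (0 .. n_servers - 1) for a picture."""
    return stable_hash(picture_name) % n_servers

# Hypothetical picture names, 30,000 of them as in the example.
pictures = [f"picture-{i}.png" for i in range(30_000)]

counts = [0, 0, 0]
for name in pictures:
    counts[server_for(name, 3)] += 1

# Each server should end up with roughly 10,000 pictures.
print(counts)
```

Python's built-in `hash()` is deliberately not used here, because for strings it is salted per process and would send the same name to different servers after a restart.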

However, the HASH algorithm above has a defect. Imagine that 3 cache servers can no longer meet our caching needs. What should we do? It is simple: add more cache servers. Suppose we add one cache server, so the number of cache servers changes from 3 to 4. If we still use the above method to locate the same picture, the server number computed for it will most likely differ from the number computed with 3 servers, because the divisor has changed from 3 to 4: when the dividend stays the same and the divisor changes, the remainder usually changes too. As a result, when the number of servers changes, most cached items are effectively invalidated for a period of time, because lookups go to servers that do not hold them. When the application cannot obtain data from the cache, it requests the data from the backend server.
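To get a feel for how many pictures are affected, the following sketch counts the pictures whose server number changes when the divisor goes from 3 to 4. It assumes MD5 as the hash function and uses hypothetical picture names. For a uniformly distributed hash, the two remainders agree only when the hash modulo 12 is 0, 1, or 2, so about three quarters of the pictures should move.

```python
import hashlib

def stable_hash(key: str) -> int:
    """Deterministic hash of a string (assumption: MD5 stands in for hash())."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")

# Hypothetical picture names.
pictures = [f"picture-{i}.png" for i in range(30_000)]

# A picture "moves" if its server number with 3 servers differs from
# its server number with 4 servers.
moved = sum(
    1 for name in pictures
    if stable_hash(name) % 3 != stable_hash(name) % 4
)
moved_fraction = moved / len(pictures)

print(f"{moved_fraction:.1%} of pictures change server when going from 3 to 4 servers")
```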

In the same way, if one of the 3 cache servers suddenly fails and can no longer cache, we need to remove the faulty machine, and the number of cache servers changes from 3 to 2. Again the divisor changes, a large number of cached items are invalidated at the same time, the requests fall through to the backend server, and the backend server may be overwhelmed. We should find a way to prevent this, but because of the HASH algorithm itself, the situation is unavoidable when the modulo method is used for caching. The consistent hash algorithm was created to solve these problems.

2. Basic concepts

In fact, the consistent hash algorithm also uses the modulo method. The difference is that the modulo method just described takes the hash modulo the number of servers, while the consistent hash algorithm takes it modulo 2^32.

First, imagine the 2^32 values as a circle, like a clock face. A clock face can be understood as a circle made up of 60 points; here we imagine a circle made up of 2^32 points. The schematic diagram is as follows:

The point directly above the ring represents 0, the first point to the right of 0 represents 1, and so on: 2, 3, 4, 5, 6, ... up to 2^32 - 1. In other words, the first point to the left of 0 represents 2^32 - 1.

We call this ring of 2^32 points a hash ring.

So, what does the consistent hashing algorithm have to do with the circle in the above figure?

Continuing with the earlier scenario, suppose we have 3 cache servers: server A, server B, and server C. In a production environment, each of these servers has its own IP address. We hash each server's IP address and take the result modulo 2^32, as in the following formula:

hash (IP address of server A) % 2^32

The result of the above formula is always an integer between 0 and 2^32 - 1, and we use this integer to represent server A. Since the integer lies between 0 and 2^32 - 1, there must be a point on the hash ring corresponding to it, so server A can be mapped onto the ring, as shown in the following figure

Similarly, server B and server C can also be mapped to the hash ring in the above figure through the same method
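As a rough illustration of this mapping, the sketch below computes a ring position for each server. The hash function (MD5) and the three IP addresses are assumptions made for the example; the article does not specify either.

```python
import hashlib

RING_SIZE = 2 ** 32  # number of points on the hash ring

def ring_position(key: str) -> int:
    """Map a string to a point on the ring: hash(key) % 2^32 (MD5 is an assumption)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % RING_SIZE

# Hypothetical IP addresses for servers A, B, and C.
servers = {"A": "192.168.0.1", "B": "192.168.0.2", "C": "192.168.0.3"}

positions = {name: ring_position(ip) for name, ip in servers.items()}

# Print the servers in clockwise order, starting from point 0.
for name, pos in sorted(positions.items(), key=lambda item: item[1]):
    print(f"server {name} -> point {pos} on the ring")
```

Where the servers land depends entirely on the hash values, so there is no guarantee that they are evenly spaced.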

Suppose the three servers are mapped to the hash ring as shown in the above figure (of course, this is an ideal situation). So far, we have mapped the cache servers onto the hash ring. Using the same method, we can also map the objects that need to be cached onto the hash ring

We want to cache pictures on the cache servers, and we still use the picture name as the key for finding a picture, so we can use the following formula to map a picture onto the hash ring in the above figure

hash (picture name) % 2^32

The schematic diagram after mapping is as follows; the orange circle in the figure below represents the picture

Now both the servers and the picture are mapped onto the hash ring, so which server should the picture in the above figure be cached on? It will be cached on server A. Why? Because, starting from the position of the picture and moving clockwise, the first server encountered is server A, as shown in the figure below

The consistent hashing algorithm uses this method to determine which server an object should be cached on. After the cache servers and the cached object are mapped onto the hash ring, the first server encountered when moving clockwise from the position of the cached object is the server on which the object is cached. Since the hash values of the cached object and of the servers are fixed, a picture is always cached on the same server as long as the set of servers does not change. The next time you want to access the picture, repeat the same calculation to find out which server it is cached on, and go directly to that server to fetch it
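The clockwise lookup can be sketched with a sorted list and a binary search. This is a minimal illustration rather than a production implementation: `HashRing`, `ring_position`, the MD5 hash, and the IP addresses are all assumptions introduced for the example.

```python
import bisect
import hashlib

RING_SIZE = 2 ** 32  # number of points on the hash ring

def ring_position(key: str) -> int:
    """Map a string to a point on the ring: hash(key) % 2^32 (MD5 is an assumption)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % RING_SIZE

class HashRing:
    def __init__(self, servers):
        # (position, server) pairs sorted by position, i.e. in clockwise order.
        self._ring = sorted((ring_position(s), s) for s in servers)
        self._positions = [pos for pos, _ in self._ring]

    def lookup(self, key: str) -> str:
        """Return the first server at or clockwise after the key's position."""
        index = bisect.bisect_left(self._positions, ring_position(key))
        # Past the last server, wrap around to the first one on the ring.
        return self._ring[index % len(self._ring)][1]

# Hypothetical server IP addresses and picture name.
ring = HashRing(["192.168.0.1", "192.168.0.2", "192.168.0.3"])
print(ring.lookup("picture-1.png"))
```

The modulo on the index handles the wrap-around: a picture that lands after the last server on the ring is cached on the first server past point 0.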

The previous example used only one picture for demonstration. Suppose four pictures need to be cached; the schematic diagram is as follows

Pictures No. 1 and No. 2 will be cached on server A, picture No. 3 will be cached on server B, and picture No. 4 will be cached on server C

3. Advantages

Suppose server B has failed and we need to remove it from the hash ring in the above figure. After removing server B, the schematic diagram is as follows

Before server B was removed, picture 3 was cached on server B. After server B is removed, the rules of the consistent hash algorithm described earlier say that picture 3 should be cached on server C, because the first cache server encountered clockwise from the position of picture 3 is now server C. In other words, if server B fails and is removed, the cache location of picture 3 changes

However, picture 4 is still cached on server C, and pictures 1 and 2 are still cached on server A, exactly as before server B was removed. This is the advantage of the consistent hash algorithm. With the earlier HASH algorithm, a change in the number of servers invalidates most of the cached items on all servers at the same time. With the consistent hash algorithm, a change in the number of servers invalidates only part of the cache. The front-end cache can still absorb part of the load of the whole system, instead of all of the load landing on the backend server at the same time
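The claim that only the removed server's pictures are affected can be checked with a small experiment. The sketch below is an illustration under assumptions: MD5 as the hash function, and hypothetical server and picture names. The helper `lookup` rebuilds the ring on every call, which is fine for a demonstration but not something to do in real code.

```python
import bisect
import hashlib

RING_SIZE = 2 ** 32  # number of points on the hash ring

def ring_position(key: str) -> int:
    """Map a string to a point on the ring: hash(key) % 2^32 (MD5 is an assumption)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % RING_SIZE

def lookup(servers, key):
    """Return the first server clockwise from the key's position."""
    ring = sorted((ring_position(s), s) for s in servers)
    positions = [pos for pos, _ in ring]
    index = bisect.bisect_left(positions, ring_position(key))
    return ring[index % len(ring)][1]

# Hypothetical server and picture names.
pictures = [f"picture-{i}.png" for i in range(3_000)]

before = {p: lookup(["server-A", "server-B", "server-C"], p) for p in pictures}
after = {p: lookup(["server-A", "server-C"], p) for p in pictures}  # server B removed

moved = [p for p in pictures if before[p] != after[p]]

# Every picture that changed location was previously cached on server B.
print(len(moved), all(before[p] == "server-B" for p in moved))
```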

4. Skew of the hash ring

When introducing the concept of consistent hashing, we assumed the ideal case in which the three servers are mapped evenly onto the hash ring, as shown in the figure below

In actual mapping, the server may be mapped as follows

If the server is mapped as shown in the figure above, then most of the cached objects are likely to be cached on a certain server, as shown in the figure below

In the figure above, pictures No. 1, No. 2, No. 3, No. 4, and No. 6 are all cached on server A, only picture No. 5 is cached on server B, and no picture at all is cached on server C. In this situation the three servers A, B, and C are not used evenly, and the cache distribution is extremely unbalanced. Moreover, if server A fails at this point, the number of invalidated cache items is at its maximum, and in extreme cases the system may still crash. The situation in the above figure is called the skew of the hash ring. So how should we prevent it?
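One way to see the skew is to compute how much of the ring each server is responsible for, which is the arc from the previous server (counter-clockwise) up to the server itself. The sketch below does this for three servers. It assumes MD5 as the hash function and hypothetical IP addresses, and `ring_share` is a helper introduced only for this example. It also assumes at least two servers with distinct ring positions.

```python
import hashlib

RING_SIZE = 2 ** 32  # number of points on the hash ring

def ring_position(key: str) -> int:
    """Map a string to a point on the ring: hash(key) % 2^32 (MD5 is an assumption)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % RING_SIZE

def ring_share(servers):
    """Fraction of the ring owned by each server (needs >= 2 distinct positions)."""
    ring = sorted((ring_position(s), s) for s in servers)
    shares = {}
    for i, (pos, server) in enumerate(ring):
        prev_pos = ring[i - 1][0]  # for i == 0 this wraps to the last server
        shares[server] = ((pos - prev_pos) % RING_SIZE) / RING_SIZE
    return shares

# Hypothetical server IP addresses.
shares = ring_share(["192.168.0.1", "192.168.0.2", "192.168.0.3"])
for server, share in shares.items():
    print(f"{server}: {share:.1%} of the ring")
```

With only three points on the ring, the shares are rarely close to one third each, which is the skew described above.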

5. Virtual nodes

Since we only have 3 servers, when we map the servers to the hash ring, it is very likely that the hash ring will be skewed. When the hash ring is skewed, the cache will often be extremely unevenly distributed on each server.

If you want to distribute the cache across the 3 servers in a balanced way, it is best for the servers to appear on the hash ring as many times as possible and as evenly as possible. However, there are only 3 real servers. How can we create more out of thin air? Since there are no spare physical server nodes, we can only replicate the existing physical nodes virtually. The nodes replicated from actual nodes in this way are called "virtual nodes". The hash ring after adding virtual nodes is as follows

"Virtual node" is a replica of "actual node" (actual physical server) on the hash ring, and one actual node can correspond to multiple virtual nodes

As can be seen from the above figure, servers A, B, and C have each been given one virtual node. You can create more virtual nodes if needed. After virtual nodes are introduced, the cache distribution becomes more balanced: in the figure above, pictures No. 1 and No. 3 are cached on server A, pictures No. 5 and No. 4 on server B, and pictures No. 6 and No. 2 on server C. If you are still worried, you can create more virtual nodes to reduce the impact of hash ring skew. The more virtual nodes there are, the more nodes there are on the hash ring, and the more likely it is that the cache is evenly distributed
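A hash ring with virtual nodes can be sketched as follows. The article does not say how virtual nodes are named or hashed, so the `"server#i"` naming scheme, the MD5 hash, the class name `VirtualNodeRing`, and the replica counts are all assumptions chosen for illustration.

```python
import bisect
import hashlib
from collections import Counter

RING_SIZE = 2 ** 32  # number of points on the hash ring

def ring_position(key: str) -> int:
    """Map a string to a point on the ring: hash(key) % 2^32 (MD5 is an assumption)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % RING_SIZE

class VirtualNodeRing:
    def __init__(self, servers, replicas):
        # Each real server appears `replicas` times on the ring, hashed under
        # the hypothetical names "server#0", "server#1", ...
        self._ring = sorted(
            (ring_position(f"{server}#{i}"), server)
            for server in servers
            for i in range(replicas)
        )
        self._positions = [pos for pos, _ in self._ring]

    def lookup(self, key: str) -> str:
        """Return the real server behind the first node clockwise from the key."""
        index = bisect.bisect_left(self._positions, ring_position(key))
        return self._ring[index % len(self._ring)][1]

# Hypothetical server and picture names.
servers = ["server-A", "server-B", "server-C"]
pictures = [f"picture-{i}.png" for i in range(30_000)]

# Compare the distribution for different numbers of nodes per server.
for replicas in (1, 10, 200):
    ring = VirtualNodeRing(servers, replicas)
    counts = Counter(ring.lookup(p) for p in pictures)
    print(replicas, dict(counts))
```

Every virtual node maps back to its real server, so a lookup still returns one of the three physical servers. As the number of virtual nodes per server grows, the per-server counts should move closer to 10,000 each.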


Origin blog.csdn.net/ACE_U_005A/article/details/131768729