Consistent hashing algorithms (distribution and balance)

Reprinted from: https://blog.csdn.net/baidu_30000217/article/details/53671716

Introduction:

The consistent hashing algorithm is a distributed hash table (DHT) implementation algorithm proposed at the Massachusetts Institute of Technology in 1997. It was designed to solve hot-spot problems on the Internet, with goals very similar to those of CARP. Consistent hashing fixes the problems caused by the simple hash algorithm used by CARP, so that distributed hash tables (DHTs) can be truly practical in P2P environments.

Scenario:

For example, suppose you have N cache servers (hereafter referred to as caches). How do you map an object to one of the N caches? You would likely use a general method like the following: compute the object's hash value, then map it evenly across the N caches:

hash(object) % N

The above modulo method is generally called the simple hash algorithm. It is true that it achieves a fairly uniform distribution (mapping), but consider the following two situations:

1) A cache server m goes down (a situation that must be considered in practical applications), so that all objects mapped to cache m become invalid. What then? Cache m must be removed from the pool; the cache count is now N-1, and the mapping formula becomes hash(object)%(N-1);

2) Access grows and a cache needs to be added; the cache count is now N+1, and the mapping formula becomes hash(object)%(N+1);

What do 1) and 2) mean? Whether cache servers are added or removed, all of a sudden almost all cached mappings become invalid. For the servers this is a disaster: flooding traffic goes straight to the backend servers.
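The churn described in 1) and 2) is easy to demonstrate. The sketch below (a Python illustration; MD5 stands in for the article's unspecified hash function) counts how many keys change servers when one of 100 caches is removed:

```python
import hashlib

def h(key: str) -> int:
    # Deterministic hash; Python's built-in hash() is salted per process
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

def assign(keys, n):
    # The simple hash algorithm: hash(object) % N
    return {k: h(k) % n for k in keys}

keys = [f"object{i}" for i in range(10000)]
before = assign(keys, 100)  # 100 cache servers
after = assign(keys, 99)    # one cache server removed
moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved / len(keys):.0%} of keys were remapped")  # nearly all keys move
```

With 10,000 keys, removing a single server remaps roughly 99% of them, which is exactly the disaster described above.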

To solve the above problems, we introduce the consistent hashing algorithm.

Hash Algorithms and Monotonicity

One measure of a hash algorithm is monotonicity, which is defined as follows:

  Monotonicity means that if some content has already been hashed to its buffer and a new buffer is added to the system, the result of the hash should guarantee that previously allocated content is only ever remapped to the new buffer, never to another buffer in the old buffer set.

Simply put, monotonicity requires that when a cache (machine, IP) is removed or added, the existing key mappings change as little as possible.

It is easy to see that the simple hash algorithm hash(object)%N cannot satisfy the monotonicity requirement, because any change in N changes the modulo result for almost every key.

Consistent Hash algorithm principle:

Simply put, the consistent hashing algorithm changes the existing key mappings as little as possible when a cache is removed or added, satisfying the monotonicity requirement as far as possible.

Let's briefly talk about the basic principles of the consistent Hash algorithm in 6 steps.

Step 1: Ring hash space

Consider that the usual hash algorithm maps a value to a 32-bit key (on which the modulo is then taken), i.e., the numeric space 0 to 2^32-1. We can imagine this space bent into a ring, with the head (0) joined to the tail (2^32-1). As shown below:

Step 2: Process the object into an integer and map it to the ring hash space

For example, now we have four objects object1~object4, and the four objects are processed into integer keys through the hash function:

key1 = hash(object1); 
key2 = hash(object2); 
key3 = hash(object3); 
key4 = hash(object4);
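Concretely, the key1..key4 computation above might look like this (a sketch; the article does not fix a hash function, so truncating MD5 to 32 bits is an assumption):

```python
import hashlib

def ring_hash(key: str) -> int:
    # Map any string to a 32-bit position on the ring: 0 .. 2**32 - 1.
    # MD5 is used only for its even distribution, not for security;
    # Python's built-in hash() is avoided because it is salted per process.
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big")

key1 = ring_hash("object1")
key2 = ring_hash("object2")
key3 = ring_hash("object3")
key4 = ring_hash("object4")
for k in (key1, key2, key3, key4):
    assert 0 <= k < 2**32  # every key lands somewhere on the ring
```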

Then map these objects to the ring hash space according to the value of the key:

Step 3: Map the cache to the ring hash space

The basic idea of the consistent hashing algorithm is to map both the objects and the caches into the same hash value space, using the same hash algorithm for both.

Assuming there are three cache servers: cacheA, cacheB, and cacheC, their corresponding key values are obtained through the same hash function:

keyA = hash(cacheA);
keyB = hash(cacheB);
keyC = hash(cacheC);

Map the three cache servers to the ring hash space according to the key value:

Speaking of which, a note on computing the cache's hash: a common method is to use the cache machine's IP address or machine name as the hash input.

After the above steps, we map both the object and the cache server to the same ring hash space. The next consideration is how to map objects to the cache server.

Step 4: Map the object to the cache server

Start from an object's key (key1 in the figure) and move clockwise around the ring until a cache server (cacheB) is encountered; the object corresponding to that key is mapped to this server. Because the hash values of the objects and caches are fixed, the target cache is unique and deterministic. By this rule, object1 is mapped to cacheB, object2 and object3 are mapped to cacheC, and object4 is mapped to cacheA. As shown in the figure:
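The clockwise search can be sketched with a sorted list and binary search (the 32-bit truncated-MD5 hash is an assumption, so the actual placements may differ from the figure's):

```python
import bisect
import hashlib

def ring_hash(key: str) -> int:
    # 32-bit position on the ring (MD5 truncated, for determinism across runs)
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big")

servers = ["cacheA", "cacheB", "cacheC"]
ring = sorted((ring_hash(s), s) for s in servers)  # servers in ring order
points = [p for p, _ in ring]

def lookup(obj: str) -> str:
    # First server clockwise from the object's position; wrap past 2**32 - 1
    i = bisect.bisect_right(points, ring_hash(obj)) % len(ring)
    return ring[i][1]

for obj in ["object1", "object2", "object3", "object4"]:
    print(obj, "->", lookup(obj))
```

The modulo on the index implements the wrap-around: an object hashed past the last server on the ring falls to the first one.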

As mentioned earlier, the biggest problem with the ordinary hash algorithm (hash, then take the remainder) is that it cannot satisfy monotonicity: when the number of caches changes (add/remove), almost all mappings become invalid, which has a huge impact on the backend servers. Now let's see how the consistent hashing algorithm fares.

Step 5: Add a cache server

Now suppose access increases and a server cacheD needs to be added. After the hash calculation (keyD = hash(cacheD)), we find the value falls between key3 and key2, so cacheD's position on the ring also lies between them. The objects affected are those found by walking counterclockwise from keyD until the next cache server (keyB) is encountered; these objects were originally mapped to cacheC, and they are simply remapped to cacheD.

In our case only object2 (key2) needs to change; simply remap it to cacheD:

Step 6: Remove the cache server

Going back to the original diagram (before step 5): suppose the cacheB server goes down and must be removed. The only objects affected are those found by walking counterclockwise from keyB until the next server (cacheA) is encountered, i.e., the objects that were originally mapped to cacheB.

In our example, only object1 (key1) needs to change; it is remapped to cacheC:
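Steps 5 and 6 can be checked with a small experiment: adding a server only pulls keys onto the new server, and removing one only moves the keys it held. This is a sketch assuming a truncated-MD5 ring; the server names are illustrative:

```python
import bisect
import hashlib

def ring_hash(key: str) -> int:
    # 32-bit ring position (MD5 truncated, deterministic across runs)
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big")

def lookup(obj: str, servers) -> str:
    # First server clockwise from the object's position on the ring
    ring = sorted((ring_hash(s), s) for s in servers)
    points = [p for p, _ in ring]
    i = bisect.bisect_right(points, ring_hash(obj)) % len(ring)
    return ring[i][1]

objects = [f"object{i}" for i in range(10000)]
base = {o: lookup(o, ["cacheA", "cacheB", "cacheC"]) for o in objects}

# Step 5: add cacheD -- the only keys that move are those cacheD takes over
added = {o: lookup(o, ["cacheA", "cacheB", "cacheC", "cacheD"]) for o in objects}
assert all(added[o] == "cacheD" for o in objects if base[o] != added[o])

# Step 6: remove cacheB -- the only keys that move are those cacheB held
removed = {o: lookup(o, ["cacheA", "cacheC"]) for o in objects}
assert all(base[o] == "cacheB" for o in objects if base[o] != removed[o])
```

Contrast this with the simple modulo scheme, where nearly every key moves on any change in server count.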

Balance and virtual nodes:

Another measure of a hash algorithm is balance, which is defined as follows:

balance

  Balance means that the hash results should be distributed across all buffers as evenly as possible, so that all of the buffer space is used.

A hash algorithm does not guarantee absolute balance. If there are few caches, objects cannot be mapped to them evenly. In the example above, with only cache A and cache C deployed, cache A stores only object1 while cache C stores object2, object3, and object4; the distribution is very uneven.

To address this situation, the consistent hashing algorithm introduces the concept of the "virtual node", which can be defined as follows:

virtual node

"Virtual node" (virtual node) is a replica of an actual node in the hash space. One actual node corresponds to several "virtual nodes"; this number of replicas is called the "replica count". "Virtual nodes" are arranged by hash value in the hash space.

Taking again the case where only cache A and cache C are deployed: the diagram for removing the cacheB server showed that the cache distribution was uneven. Now we introduce virtual nodes and set the "replica count" to 2, which means there will be 4 "virtual nodes" in total: cache A1 and cache A2 represent cache A, while cache C1 and cache C2 represent cache C. Assuming an ideal placement, as shown in the figure:

At this point, the mapping relationship between objects and "virtual nodes" is:

object1->cache C2 ; object2->cache A1 ; object3->cache C1 ; object4->cache A2 ;

Therefore, objects object4 and object2 are mapped to cache A, while object3 and object1 are mapped to cache C; the balance has been greatly improved.

After the introduction of "virtual nodes", the mapping relationship is transformed from {object->node} to {object->virtual node}. The mapping relationship when querying the cache where the object is located is shown in the figure.

The hash for a "virtual node" can be computed by appending a numeric suffix to the IP address of the corresponding node. For example, suppose the IP address of cache A is 202.168.14.241.

Before introducing the "virtual node", calculate the hash value of cache A:

Hash("202.168.14.241");

After introducing "virtual nodes", calculate the hash values of the virtual nodes cache A1 and cache A2:

Hash("202.168.14.241#1"); // cache A1

Hash("202.168.14.241#2"); // cache A2
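Putting the suffix trick together, a minimal virtual-node ring might look like this (a sketch: the second IP address and the class/helper names are illustrative, and truncated MD5 is assumed as the hash function):

```python
import bisect
import hashlib

def ring_hash(key: str) -> int:
    # 32-bit ring position (MD5 truncated, deterministic across runs)
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big")

class VirtualNodeRing:
    def __init__(self, nodes, replicas=2):
        # Each physical node gets `replicas` virtual nodes: "ip#1", "ip#2", ...
        self.ring = sorted(
            (ring_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(1, replicas + 1)
        )
        self.points = [p for p, _ in self.ring]

    def lookup(self, obj: str) -> str:
        # First virtual node clockwise; it resolves to its physical node
        i = bisect.bisect_right(self.points, ring_hash(obj)) % len(self.ring)
        return self.ring[i][1]

ring = VirtualNodeRing(["202.168.14.241", "202.168.14.242"], replicas=2)
print(ring.lookup("object1"))
```

Real deployments use far more than 2 replicas per node (often 100 or more) so that each physical node's arcs cover the ring evenly.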
