[Reprinted] [Distributed] Consistent Hashing Algorithm

This article is reprinted from:
https://www.cnblogs.com/lpfuture/p/5796398.html
If there is any infringement, please let me know and I will remove it immediately. Thank you!

0. Purpose

In a distributed system, nodes store data according to the data's hash value. When a node is added or taken offline, an ordinary hash algorithm forces all of the data to be redistributed, while a consistent hashing algorithm only needs to relocate the data held by the affected node. The key idea is that the data lies on a ring and is assigned to the nearest node in the clockwise direction.

1. Properties of Consistent Hashing

Every node in a distributed system may fail, and new nodes may be added dynamically. How to keep providing good service when the number of nodes changes is worth careful thought, especially when designing a distributed cache system. If one server fails and the system does not use a suitable algorithm to maintain consistency, all of the cached data may effectively be lost: because the number of nodes has decreased, a client requesting an object must recompute its hash value (which usually depends on the number of nodes in the system), and since the hash value has changed, the client will most likely fail to find the server that stores the object. Consistent hashing is therefore critical. A good consistent hashing algorithm for a distributed cache system should satisfy the following properties:

  • Balance
    Balance means that the hash results should be spread among all the buffers as evenly as possible, so that all of the buffer space gets used. Many hash algorithms can satisfy this condition.
  • Monotonicity
    Monotonicity means that if some content has already been assigned to buffers by hashing and new buffers are then added to the system, the hash result should guarantee that the previously assigned content is mapped either to its original buffer or to one of the new buffers, and never to some other buffer in the old buffer set. Simple hash algorithms often fail to meet this requirement. Take the simplest linear hash, x = (ax + b) mod P, where P is the total number of buffers: when the number of buffers changes (from P1 to P2), all of the original hash results change, so monotonicity is not satisfied. Changed hash results mean that whenever the buffer space changes, every mapping in the system has to be updated. In a P2P system, a change of buffers corresponds to a peer joining or leaving, which happens frequently, so this would impose a huge computation and data transfer load. Monotonicity asks the hash algorithm to cope with this situation; see the sketch after this list, which shows how badly simple modular hashing behaves here.
  • Spread
    In a distributed environment, a terminal may not see all of the buffers but only part of them. When a terminal maps content to buffers through hashing, different terminals may see different buffer ranges and thus produce inconsistent hash results, with the end result that the same content gets mapped into different buffers by different terminals. This situation should clearly be avoided, because it causes the same content to be stored in multiple buffers and lowers the efficiency of the storage system. Spread is defined as the severity of this problem. A good hash algorithm should avoid such inconsistency as much as possible, that is, keep the spread low.
  • Load
    The load problem is really the spread problem viewed from another angle. Since different terminals may map the same content to different buffers, a given buffer may likewise have different content mapped to it by different users. Like spread, this situation should be avoided, so a good hash algorithm should keep the load on each buffer as low as possible.
  • Smoothness
    Smoothness means that a smooth change in the number of cache servers should go together with an equally smooth change in the set of cached objects.
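
To make the monotonicity problem concrete, here is a minimal Python sketch (not part of the original article) that counts how many keys change servers when a plain "hash mod N" scheme grows from 4 to 5 servers. The key names, the server counts, and the use of MD5 are illustrative assumptions.

```python
import hashlib

def bucket(key: str, num_servers: int) -> int:
    """Map a key to a server index with simple modular hashing."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % num_servers

keys = [f"object-{i}" for i in range(10000)]

# Compare assignments before and after adding one server.
before = {k: bucket(k, 4) for k in keys}
after = {k: bucket(k, 5) for k in keys}
moved = sum(1 for k in keys if before[k] != after[k])

# With modular hashing most keys (roughly 4 out of 5 here) end up on a
# different server after the change, which violates monotonicity.
print(f"{moved} of {len(keys)} keys were remapped")
```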

2. Principle

The consistent hashing algorithm (Consistent Hashing) was first proposed in the paper "Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web". In short, consistent hashing organizes the whole hash value space into a virtual ring. For example, suppose the value space of the hash function H is 0 to 2^32 - 1 (i.e., the hash value is a 32-bit unsigned integer). The whole hash space ring looks as follows:

The whole space is organized in the clockwise direction, with 0 and 2^32 - 1 coinciding at the zero point.
The next step is to hash each server with the same hash function H, typically using the server's IP address or hostname as the key, so that every machine gets a position on the hash ring. Suppose that after hashing the IP addresses of the four servers above, their positions in the ring space are as follows:

Next, data is located to its server with the following rule: compute the hash of the data's key with the same function H to determine its position on the ring, then "walk" clockwise along the ring from that position; the first server encountered is the server the data should be assigned to.
For example, suppose we have four data objects, Object A, Object B, Object C and Object D. After hashing, their positions on the ring are as follows:

Under the consistent hashing algorithm, Object A is assigned to Node A, B to Node B, C to Node C, and D to Node D.
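
As a rough illustration of the lookup just described, the following Python sketch (an assumption of this write-up, not code from the original article) places node names and data keys on the 0 to 2^32 - 1 ring with MD5 and walks clockwise, via a binary search over the sorted positions, to find the owning node. The actual assignments depend on the hash values, so they will not necessarily match the figures.

```python
import bisect
import hashlib

def ring_hash(key: str) -> int:
    """Hash a string to a position in the 0 .. 2**32 - 1 ring space."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big")

class ConsistentHashRing:
    def __init__(self, nodes):
        # Sorted (position, node) pairs form the ring.
        self._ring = sorted((ring_hash(n), n) for n in nodes)
        self._points = [p for p, _ in self._ring]

    def locate(self, key: str) -> str:
        """Return the first node at or clockwise after the key's position."""
        i = bisect.bisect_left(self._points, ring_hash(key))
        # Walking past 2**32 - 1 wraps around to the first node on the ring.
        return self._ring[i % len(self._ring)][1]

ring = ConsistentHashRing(["Node A", "Node B", "Node C", "Node D"])
for obj in ["Object A", "Object B", "Object C", "Object D"]:
    print(obj, "->", ring.locate(obj))
```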
Now let us look at the fault tolerance and scalability of the consistent hashing algorithm. Suppose that Node C unfortunately goes down. Object A, B and D are unaffected; only Object C is relocated to Node D. In general, under consistent hashing, if a server becomes unavailable, the only data affected is the data between that server and the previous server on the ring (i.e., up to the first server encountered walking counterclockwise from the failed server); everything else is unaffected.
Consider the other case: a new server, Node X, is added to the system, as shown below:

Now Object A, B and D are unaffected; only Object C needs to be relocated to the new Node X. In general, under consistent hashing, if a server is added, the only data affected is the data between the new server and the previous server on the ring (i.e., up to the first server encountered walking counterclockwise from the new server); other data is unaffected.
In summary, when nodes are added or removed, the consistent hashing algorithm only has to relocate a small portion of the data in the ring space, so it has good fault tolerance and scalability.
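
This fault-tolerance claim can be checked with a small standalone sketch along the same hypothetical lines as above: remove one node and count how many keys change owner. Only the keys previously held by the removed node should move, and all of them should land on its clockwise successor.

```python
import bisect
import hashlib

def h32(s: str) -> int:
    """32-bit position on the ring."""
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:4], "big")

def owner(key: str, nodes: list[str]) -> str:
    """First node clockwise from the key's position."""
    ring = sorted((h32(n), n) for n in nodes)
    i = bisect.bisect_left([p for p, _ in ring], h32(key))
    return ring[i % len(ring)][1]

keys = [f"object-{i}" for i in range(10000)]
nodes = ["Node A", "Node B", "Node C", "Node D"]

before = {k: owner(k, nodes) for k in keys}
after = {k: owner(k, [n for n in nodes if n != "Node C"]) for k in keys}
moved = [k for k in keys if before[k] != after[k]]

# Every moved key was owned by the failed Node C; all other keys stay put.
print(f"{len(moved)} of {len(keys)} keys moved")
print(all(before[k] == "Node C" for k in moved))
```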
On the other hand, when there are too few service nodes, consistent hashing easily suffers from data skew caused by an uneven distribution of nodes on the ring. For example, if the system has only two servers, the ring may be distributed as follows:

In that case a large amount of data will inevitably concentrate on Node A, while only a very small amount lands on Node B. To deal with this data skew problem, the consistent hashing algorithm introduces a virtual node mechanism: several hashes are computed for each service node, and a copy of the service node is placed at each resulting position; these copies are called virtual nodes. In practice this can be done by appending a number to the server's IP address or hostname. In the example above we can create three virtual nodes per server, computing hashes for "Node A#1", "Node A#2", "Node A#3", "Node B#1", "Node B#2" and "Node B#3" respectively, which gives six virtual nodes:

The data location algorithm stays the same; there is just one extra step of mapping virtual nodes back to actual nodes. For example, data located at any of the three virtual nodes "Node A#1", "Node A#2" and "Node A#3" is assigned to Node A. This solves the data skew problem that arises when there are few service nodes. In practical applications, the number of virtual nodes is usually set to 32 or more, so that even a small number of service nodes can achieve a reasonably uniform data distribution.
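
Below is a hypothetical sketch of the virtual node mechanism just described: each physical server is hashed several times under names like "Node A#1", and a lookup that lands on a virtual node is mapped back to its physical node. The replica count of 3 and the naming scheme follow the example above; a real system would typically use 32 or more replicas per node.

```python
import bisect
import hashlib
from collections import Counter

def h32(s: str) -> int:
    """32-bit position on the 0 .. 2**32 - 1 ring."""
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:4], "big")

class VirtualNodeRing:
    def __init__(self, nodes, replicas=3):
        # Each physical node contributes `replicas` virtual nodes,
        # e.g. "Node A#1", "Node A#2", "Node A#3".
        self._ring = sorted(
            (h32(f"{node}#{i}"), node)
            for node in nodes
            for i in range(1, replicas + 1)
        )
        self._points = [p for p, _ in self._ring]

    def locate(self, key: str) -> str:
        """Find the owning virtual node clockwise, then return its physical node."""
        i = bisect.bisect_left(self._points, h32(key))
        return self._ring[i % len(self._ring)][1]

# With only two physical servers, the virtual nodes spread keys more evenly.
ring = VirtualNodeRing(["Node A", "Node B"], replicas=3)
print(Counter(ring.locate(f"object-{i}") for i in range(10000)))
```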
