[Consistent Hashing] What Is the Consistent Hashing Algorithm?


This article is reprinted from a blog.


1. Introduction to the consistent hashing algorithm

Consistent hashing is a distributed hash table (DHT) algorithm proposed at MIT in 1997. Its design goal was to solve hot-spot problems on the Internet, and its intent is very similar to that of CARP. Consistent hashing fixed the problems caused by the simple hashing algorithm CARP used, making DHTs truly practical in P2P environments.

The consistent hashing work proposed four properties that determine whether a hash algorithm is good or bad in an environment where caches change dynamically:

1. Balance: the hash results should be distributed across all buffers as evenly as possible, so that all of the buffer space gets utilized. Most hash algorithms can satisfy this condition.

2. Monotonicity: if some content has already been assigned to buffers by hashing, and new buffers are then added to the system, the hash results should guarantee that the previously assigned content is remapped only to its original buffer or to one of the new buffers, never to a different buffer in the old buffer set.

3. Spread: in a distributed environment, a terminal may not be able to see all of the buffers, only part of them. When a terminal maps content to buffers by hashing, different terminals may see different buffer ranges, leading to inconsistent hash results: the same content ends up mapped to different buffers by different terminals. This situation should clearly be avoided, because storing the same content in multiple buffers reduces the efficiency of the storage system. Spread is defined as the severity of this problem; a good hash algorithm should keep spread as low as possible.

4. Load: the load problem is really the spread problem seen from another angle. Since different terminals may map the same content to different buffers, a particular buffer may also have different content mapped to it by different users. Like spread, this situation should be avoided, so a good hash algorithm should keep the load on each buffer as low as possible.

In a distributed cluster, adding and removing machines, or automatically removing a failed machine from the cluster, are the most basic functions of distributed cluster management. With the common hash(object) % N algorithm, after a machine is added or removed, a large fraction of the original data can no longer be found, which seriously violates monotonicity. The rest of this article explains how the consistent hashing algorithm is designed.
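To see how badly hash(object) % N behaves, here is a minimal sketch (not from the original article) that distributes 1000 keys over 4 servers with the modulo rule, removes one server, and counts how many keys change owners. MD5 stands in for the unspecified hash function.

```python
# Sketch (assumed setup): distribute 1000 keys over N servers with
# hash(key) % N, then shrink N by one and count how many keys move.
import hashlib

def bucket(key: str, n: int) -> int:
    # Use a stable hash; Python's built-in hash() is salted per process.
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % n

keys = [f"object{i}" for i in range(1000)]
before = {k: bucket(k, 4) for k in keys}  # 4 servers
after = {k: bucket(k, 3) for k in keys}   # one server removed
moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved} of {len(keys)} keys changed servers")
```

In expectation roughly three quarters of the keys move, even though only a quarter of the capacity disappeared, which is exactly the monotonicity violation described above.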

The hash ring

A conventional hash function maps keys into a space of 2^32 buckets, i.e. the integers from 0 to 2^32 - 1. Now imagine connecting this number range end to end to form a closed ring, as shown below.

Mapping data onto the ring with a hash function

Now take four objects, object1 through object4, compute the key of each with a chosen hash function, and place them on the hash ring, as in the figure below:
    Hash(object1) = KEY1;
    Hash(object2) = KEY2;
    Hash(object3) = KEY3;
    Hash(object4) = KEY4;
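The mapping above can be sketched in a few lines. The article does not name a hash function, so MD5 truncated to 4 bytes is assumed here to land each key in the 0 to 2^32 - 1 ring space.

```python
# Minimal sketch: map keys onto the 0 .. 2**32 - 1 hash ring.
# MD5 is an assumed stand-in; the article does not specify a hash.
import hashlib

RING_SIZE = 2 ** 32

def ring_hash(key: str) -> int:
    digest = hashlib.md5(key.encode()).digest()
    # Take the first 4 bytes so the result falls in [0, 2**32).
    return int.from_bytes(digest[:4], "big")

for obj in ["object1", "object2", "object3", "object4"]:
    print(obj, "->", ring_hash(obj))
```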

Mapping machines onto the ring with the same hash function

In a distributed cluster that uses consistent hashing, newly added machines are mapped onto the same ring using the same hash function as the stored objects (typically the machine's IP address or another unique machine name is used as the hash input). Then, moving clockwise from each object's position, every object is stored on the nearest machine.
Suppose there are three machines, NODE1, NODE2, and NODE3. The KEY values obtained from the hash function map them onto the ring as in the figure below:
Hash(NODE1) = KEY1;
Hash(NODE2) = KEY2;
Hash(NODE3) = KEY3;

With the objects and the machines in the same hash space, moving clockwise, object1 is stored on NODE1, object3 on NODE2, and object2 and object4 on NODE3. In such a deployment the hash ring does not change, so the hash value computed for an object quickly locates the corresponding machine, and thus the location where the object is actually stored.
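The clockwise lookup described above can be sketched as a binary search over the sorted node positions: find the first node at or after the key's hash, wrapping around to the start of the ring if necessary. The node names follow the article; the MD5-based hash is an assumption.

```python
# Sketch of the clockwise lookup on the consistent hash ring.
import bisect
import hashlib

def ring_hash(key: str) -> int:
    # Assumed hash: first 4 bytes of MD5, giving a value in [0, 2**32).
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big")

nodes = ["NODE1", "NODE2", "NODE3"]
ring = sorted((ring_hash(n), n) for n in nodes)
positions = [pos for pos, _ in ring]

def locate(key: str) -> str:
    # First node clockwise from the key; modulo wraps past the top.
    idx = bisect.bisect_right(positions, ring_hash(key)) % len(ring)
    return ring[idx][1]

for obj in ["object1", "object2", "object3", "object4"]:
    print(obj, "is stored on", locate(obj))
```

Note that the concrete object-to-node assignment depends on the hash function, so this sketch will not necessarily reproduce the exact placement in the article's figure.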

Adding and removing machines

The biggest flaw of the ordinary modulo hash algorithm is that after adding or removing a machine, the storage locations of a large number of objects become invalid, badly violating monotonicity. Let's analyze how the consistent hashing algorithm handles these operations.

  1. Removing a node (machine)
        Taking the distribution above as an example, if NODE2 fails and is removed, then following the clockwise migration rule, object3 is migrated to NODE3. Only object3's mapping changes; no other object is affected. As shown below:

  2. Adding a node (machine)
    If a new node NODE4 is added to the cluster, the hash function yields a corresponding KEY4, which is mapped onto the ring as shown below:

Following the clockwise migration rule, only object2 is migrated to NODE4; the other objects keep their original storage locations. This analysis of node additions and removals shows that consistent hashing preserves monotonicity while keeping data migration to a minimum. Such an algorithm is well suited to distributed clusters: it avoids large-scale data migration and reduces pressure on the servers.
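The minimal-migration claim can be checked directly. The sketch below (assumed setup, MD5-based hash) builds the ring lookup twice, with and without NODE2, and verifies that the only keys that change owners are the ones that were on the removed node.

```python
# Sketch: count how many of 1000 keys move when NODE2 leaves the ring.
import bisect
import hashlib

def ring_hash(key: str) -> int:
    # Assumed hash: first 4 bytes of MD5.
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big")

def build_ring(nodes):
    ring = sorted((ring_hash(n), n) for n in nodes)
    positions = [p for p, _ in ring]
    def locate(key):
        idx = bisect.bisect_right(positions, ring_hash(key)) % len(ring)
        return ring[idx][1]
    return locate

keys = [f"object{i}" for i in range(1000)]
before = build_ring(["NODE1", "NODE2", "NODE3"])
after = build_ring(["NODE1", "NODE3"])
moved = [k for k in keys if before(k) != after(k)]
# Every key that moved must have lived on the node that was removed.
assert all(before(k) == "NODE2" for k in moved)
print(f"{len(moved)} of {len(keys)} keys moved")
```

Compare this with the modulo scheme earlier, where removing one of four servers relocated most of the keys.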

Balance

The analysis above shows that consistent hashing satisfies monotonicity, spread, and the load property, which ordinary hash algorithms do not. But one reason it was not more widely adopted is a lack of balance. The following analyzes how consistent hashing can be made to satisfy balance. The hash function does not guarantee balance by itself: with only NODE1 and NODE3 deployed as described above (NODE2 having been removed, as in the figure), object1 is stored on NODE1 while object2, object3, and object4 are all stored on NODE3, a very unbalanced state. To satisfy balance as much as possible, consistent hashing introduces virtual nodes.
    A "virtual node" is a replica of an actual node (machine) in the hash space. One actual node corresponds to several virtual nodes; this number is also called the "replica count". Virtual nodes are arranged in the hash space by their hash values.
Take the case above with only NODE1 and NODE3 deployed (NODE2 removed) as an example. The distribution of objects across the machines was very unbalanced. Now set the replica count to 2, so that the hash ring contains four virtual nodes in total. The resulting object mapping is shown in the figure below:

The figure shows the mapping: object1 -> NODE1-1, object2 -> NODE1-2, object3 -> NODE3-2, object4 -> NODE3-1. By introducing virtual nodes, the objects are distributed much more evenly. So how does an object lookup actually work in practice? The conversion from an object's hash to a virtual node, and then to the actual node, works as follows.

To compute the hash of a "virtual node", the node's IP address plus a numeric suffix can be used. For example, suppose NODE1's IP address is 192.168.1.100. Before introducing virtual nodes, NODE1's hash is computed as:
Hash("192.168.1.100");
After introducing virtual nodes, the hashes of the virtual nodes NODE1-1 and NODE1-2 are computed as:
Hash("192.168.1.100#1"); // NODE1-1
Hash("192.168.1.100#2"); // NODE1-2
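Putting the pieces together, here is a minimal sketch of a ring with virtual nodes using the "ip#i" suffix convention from the text. The second IP address and the MD5-based hash are assumptions for illustration; each virtual node points back to its physical node, so a lookup still returns a real machine.

```python
# Sketch of a consistent hash ring with virtual nodes ("ip#i" suffixes).
import bisect
import hashlib

def ring_hash(key: str) -> int:
    # Assumed hash: first 4 bytes of MD5, in [0, 2**32).
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big")

class ConsistentHashRing:
    def __init__(self, nodes, replicas=2):
        # One ring entry per virtual node, each mapped to its real node.
        self.ring = sorted(
            (ring_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(1, replicas + 1)
        )
        self.positions = [p for p, _ in self.ring]

    def locate(self, key):
        # Clockwise lookup over virtual-node positions.
        idx = bisect.bisect_right(self.positions, ring_hash(key)) % len(self.ring)
        return self.ring[idx][1]

# 192.168.1.100 is from the article; the second IP is made up.
ring = ConsistentHashRing(["192.168.1.100", "192.168.1.101"], replicas=2)
print("object1 is stored on", ring.locate("object1"))
```

Raising the replica count spreads each physical node more evenly around the ring; real systems often use dozens or hundreds of virtual nodes per machine.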


Origin: www.cnblogs.com/54chensongxia/p/11596962.html