Consistent Hashing Algorithm (Reprint)

Please credit the source when reprinting: http://blog.csdn.net/cywosp/article/details/23397179

The consistent hashing algorithm is a distributed hash table (DHT) implementation algorithm proposed at the Massachusetts Institute of Technology in 1997. Its design goal was to solve the hot spot problem on the Internet, an intent very similar to that of CARP. Consistent hashing corrects the problems caused by the simple hashing algorithm that CARP uses, so that DHTs can be truly applied in P2P environments.

Consistent hashing proposes four criteria for judging how good a hash algorithm is in a dynamically changing cache environment:

1. Balance: Balance means that the hash results are spread across all caches as evenly as possible, so that all cache space is utilized. Many hash algorithms satisfy this condition.

2. Monotonicity: Monotonicity means that if some content has already been assigned to caches by hashing and a new cache is then added to the system, the hash result should guarantee that previously assigned content maps either to its original cache or to the new cache, and never to a different cache in the old cache set.

3. Spread: In a distributed environment, a terminal may not see all of the caches, only some of them. When terminals map content to caches through hashing, different terminals may see different sets of caches, so their hash results may be inconsistent, and the same content ends up mapped to different caches by different terminals. This should obviously be avoided, because it causes the same content to be stored in several caches and reduces the storage efficiency of the system. Spread is defined as the severity of this situation. A good hash algorithm should avoid such inconsistency as far as possible, that is, keep the spread as low as possible.

4. Load: The load problem looks at spread from another angle. Since different terminals may map the same content to different caches, a particular cache may also be assigned different content by different terminals. Like spread, this should be avoided, so a good hash algorithm should keep the load on each cache as low as possible.

In a distributed cluster, adding and removing machines, and having a machine leave the cluster automatically after a failure, are the most basic functions of cluster management. If the commonly used hash(object)%N algorithm is used, then after a machine is added or removed much of the existing data can no longer be found, which seriously violates the monotonicity principle. The rest of this article explains how the consistent hashing algorithm is designed:
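To make the problem with hash(object)%N concrete, here is a minimal sketch that counts how many keys change machine when N grows from 4 to 5. The helper names (stable_hash, modulo_node), the key names, and the choice of SHA-256 truncated to 32 bits are assumptions made for illustration; they do not come from the original article.

```python
import hashlib

def stable_hash(key: str) -> int:
    """Deterministic 32-bit hash (first 4 bytes of SHA-256; an illustrative choice)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")

def modulo_node(key: str, n: int) -> int:
    """Pick a machine index with the naive hash(object) % N rule."""
    return stable_hash(key) % n

keys = [f"object{i}" for i in range(10000)]

# A key "moves" if its machine index differs between N=4 and N=5.
moved = sum(1 for k in keys if modulo_node(k, 4) != modulo_node(k, 5))
print(f"{moved} of {len(keys)} keys change machine when N goes from 4 to 5")
```

With a roughly uniform hash, a key keeps its machine only when its hash gives the same remainder modulo 4 and modulo 5, so the large majority of keys are expected to move.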

Ring Hash Space

A commonly used hash algorithm maps each key into a space with 2^32 buckets, that is, the integer range 0 to 2^32-1. Now imagine joining these numbers head to tail to form a closed ring, as shown below.

Map the data onto the ring through a hash algorithm

Now we compute the key values of four objects, object1, object2, object3, and object4, with a specific hash function and place them on the hash ring, as shown below:

Hash(object1) = key1;

Hash(object2) = key2;

Hash(object3) = key3;

Hash(object4) = key4;
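As a rough illustration of the Hash(object) = key step, the sketch below places the four object names on a ring of 2^32 positions. The function name ring_position and the use of SHA-256 truncated to 4 bytes are assumptions for this example; the article does not say which hash function is used.

```python
import hashlib

RING_SIZE = 2 ** 32  # positions 0 .. 2^32 - 1, joined head to tail

def ring_position(name: str) -> int:
    """Map a name to a ring position using the first 4 bytes of SHA-256."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    # 4 bytes are always below 2^32, so no extra modulo is needed.
    return int.from_bytes(digest[:4], "big")

for obj in ["object1", "object2", "object3", "object4"]:
    print(obj, "->", ring_position(obj))
```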

Map the machines to the ring through the hash algorithm

When a new machine joins a distributed cluster that uses consistent hashing, the machine is mapped onto the ring with the same hash algorithm used for object storage (in general, the machine's IP address or unique alias is used as the input to the hash). Each object is then stored on the nearest machine found by moving clockwise from the object's position.

Suppose there are three machines, NODE1, NODE2, and NODE3. Their KEY values are obtained through the hash algorithm and mapped onto the ring, as shown in the following diagram:

Hash(NODE1) = KEY1;

Hash(NODE2) = KEY2;

Hash(NODE3) = KEY3;

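As the figure above shows, objects and machines live in the same hash space. Moving clockwise, object1 is stored on NODE1, object3 on NODE2, and object2 and object4 on NODE3. In such a deployment the hash ring itself does not change, so computing an object's hash value is enough to quickly locate the corresponding machine, and with it the object's actual storage location.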
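The clockwise lookup can be sketched with a sorted list of node positions and a binary search. The class name HashRing, the helper ring_position, and the SHA-256 based hash are hypothetical choices for this illustration; which node each object actually lands on depends on the hash function, so the result will not necessarily match the figure.

```python
import bisect
import hashlib

def ring_position(name: str) -> int:
    """Map a name to a position in 0 .. 2^32 - 1 (first 4 bytes of SHA-256)."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")

class HashRing:
    def __init__(self, nodes):
        if not nodes:
            raise ValueError("the ring needs at least one node")
        # Sorted (position, node) pairs represent the ring in clockwise order.
        self._ring = sorted((ring_position(n), n) for n in nodes)
        self._positions = [pos for pos, _ in self._ring]

    def lookup(self, key: str) -> str:
        """Return the first node at or clockwise after the key's position."""
        i = bisect.bisect_left(self._positions, ring_position(key))
        if i == len(self._positions):
            i = 0  # past the last node: wrap around to the start of the ring
        return self._ring[i][1]

ring = HashRing(["NODE1", "NODE2", "NODE3"])
for obj in ["object1", "object2", "object3", "object4"]:
    print(obj, "->", ring.lookup(obj))
```

Keeping the positions sorted makes each lookup O(log n) in the number of nodes, which is why locating an object only requires computing its hash.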

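Removing and Adding Machines

The biggest weakness of the ordinary hash-modulo algorithm is that adding or removing a machine invalidates the storage location of a large number of objects, which badly violates monotonicity. The following analyzes how consistent hashing handles this.

1. Removing a node (machine)

Taking the distribution above as an example, if NODE2 fails and is removed, then by the clockwise migration rule object3 is migrated to NODE3. Only the mapping of object3 changes; no other object is affected. As shown below:

2. Adding a node (machine)

If a new node NODE4 is added to the cluster, the hash algorithm yields KEY4, which is mapped onto the ring, as shown below:

By the clockwise migration rule, object2 is migrated to NODE4, while the other objects keep their original storage locations. The analysis of node addition and removal shows that consistent hashing preserves monotonicity while keeping data migration to a minimum. This makes the algorithm well suited to distributed clusters: it avoids large-scale data migration and reduces the pressure on the servers.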
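The effect of removing and adding a node can be checked with a small experiment. This is a sketch under assumptions: the helper names (ring_position, owner), the synthetic key names, and the SHA-256 based hash are illustrative only and are not taken from the original article.

```python
import bisect
import hashlib

def ring_position(name: str) -> int:
    """Map a name to a position in 0 .. 2^32 - 1 (first 4 bytes of SHA-256)."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")

def owner(key: str, nodes) -> str:
    """Return the first node at or clockwise after the key's position."""
    ring = sorted((ring_position(n), n) for n in nodes)
    positions = [pos for pos, _ in ring]
    i = bisect.bisect_left(positions, ring_position(key)) % len(ring)
    return ring[i][1]

keys = [f"object{i}" for i in range(2000)]
before = {k: owner(k, ["NODE1", "NODE2", "NODE3"]) for k in keys}
after_remove = {k: owner(k, ["NODE1", "NODE3"]) for k in keys}
after_add = {k: owner(k, ["NODE1", "NODE2", "NODE3", "NODE4"]) for k in keys}

# Keys whose owner changed in each scenario.
moved_on_remove = [k for k in keys if before[k] != after_remove[k]]
moved_on_add = [k for k in keys if before[k] != after_add[k]]

print("moved after removing NODE2:", len(moved_on_remove))
print("moved after adding NODE4:", len(moved_on_add))
```

In this sketch every key that moves after the removal was previously owned by NODE2, and every key that moves after the addition ends up on NODE4, which is the monotonicity property described above.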

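Balance

According to the illustrated analysis above, consistent hashing satisfies monotonicity and load balancing, as well as the spread property of ordinary hash algorithms, but that alone does not explain its wide adoption, because balance is still missing. This section analyzes how consistent hashing achieves balance. A hash algorithm does not guarantee balance: in the case above where only NODE1 and NODE3 are deployed (the figure with NODE2 removed), object1 is stored on NODE1 while object2, object3, and object4 are all stored on NODE3, which is a very unbalanced state. To satisfy balance as far as possible, consistent hashing introduces virtual nodes.

A "virtual node" is a replica of an actual node (machine) in the hash space. One actual node corresponds to several virtual nodes, and this number is also called the "replica count". Virtual nodes are arranged in the hash space by their hash values.

Take again the case where only NODE1 and NODE3 are deployed (the figure with NODE2 removed). Previously the objects were distributed very unevenly across the machines. With a replica count of 2, the hash ring contains 4 virtual nodes, and the resulting object mapping is shown below:

From the figure above, the object mapping is: object1->NODE1-1, object2->NODE1-2, object3->NODE3-2, object4->NODE3-1. With virtual nodes, the distribution of objects becomes much more balanced. So how does an actual object lookup work in practice? The conversion from an object's hash to a virtual node and then to an actual node is shown below:

The hash of a virtual node can be computed from the corresponding node's IP address plus a numeric suffix. For example, suppose the IP address of NODE1 is 192.168.1.100. Before virtual nodes are introduced, the hash value of NODE1 is computed as:

Hash("192.168.1.100");

After virtual nodes are introduced, the hash values of the virtual nodes NODE1-1 and NODE1-2 are computed as:

Hash("192.168.1.100#1"); // NODE1-1

Hash("192.168.1.100#2"); // NODE1-2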
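The virtual node scheme can be sketched by hashing each "IP#suffix" string onto the ring and letting every virtual node point back to its physical machine. The class name VirtualNodeRing, the second IP address 192.168.1.102, the synthetic keys, and the SHA-256 based hash are assumptions made for this example; only the "IP#number" naming comes from the text above.

```python
import bisect
import hashlib
from collections import Counter

def ring_position(name: str) -> int:
    """Map a name to a position in 0 .. 2^32 - 1 (first 4 bytes of SHA-256)."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")

class VirtualNodeRing:
    def __init__(self, nodes, replicas=2):
        # Each machine contributes `replicas` virtual nodes named "<ip>#<i>".
        self._ring = sorted(
            (ring_position(f"{node}#{i}"), node)
            for node in nodes
            for i in range(1, replicas + 1)
        )
        self._positions = [pos for pos, _ in self._ring]

    def lookup(self, key: str) -> str:
        """Object hash -> nearest clockwise virtual node -> physical machine."""
        i = bisect.bisect_left(self._positions, ring_position(key))
        return self._ring[i % len(self._ring)][1]

nodes = ["192.168.1.100", "192.168.1.102"]  # second address is hypothetical
keys = [f"object{i}" for i in range(10000)]
for replicas in (1, 2, 100):
    ring = VirtualNodeRing(nodes, replicas)
    counts = Counter(ring.lookup(k) for k in keys)
    print(f"replicas={replicas}: {dict(counts)}")
```

Raising the replica count generally evens out the share of keys per machine, at the cost of a larger sorted list to search.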

Reference:

[1] http://blog.huanghao.me/?p=14
