Understanding the consistent hashing algorithm

Scenario analysis

In a distributed cache design, an important goal is scalability: after a new cache server is added, as much of the data already cached in the cluster as possible should remain accessible. The key to managing the server cluster is the routing algorithm, which decides which server in the cluster a client accesses.


Remainder hashing

A simple routing algorithm is remainder hashing:

take the hash value of the cached data's key modulo the number of servers; the remainder is the index of the server in the server list.

The idea is simple, and the code to implement it is as follows:

class RemainderHash {

    private List<String> serverNodes;
    private int serverNodeSize;

    public RemainderHash(List<String> serverNodes) {
        this.serverNodes = serverNodes;
        this.serverNodeSize = serverNodes.size();
    }

    public String getServerNode(String key) {
        return serverNodes.get(hash(key));
    }

    public int hash(String key) {
        return Math.abs(key.hashCode() % serverNodeSize);
    }
}

This algorithm does distribute cached data evenly across the whole server cluster.
But if the cluster grows from 3 servers to 4, the repeat hit rate across the expansion (the fraction of keys that still map to the same cache server before and after) is only 24.99%, and if it grows from 100 servers to 101, the repeat hit rate drops to a mere 0.71%. Both the balance and the repeat hit rate can be measured with a small piece of test code.
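The original test code is not reproduced here, but the repeat hit rate is easy to check with a short simulation (a sketch, not the article's test: the synthetic key set and the use of String.hashCode() are arbitrary choices of this example):

```java
public class RemainderHitRate {

    // route a key onto one of n servers by remainder hashing
    static int index(String key, int n) {
        return Math.abs(key.hashCode() % n);
    }

    // fraction of keys that map to the same server before and after
    // growing the cluster from `before` to `after` servers
    static double repeatHitRate(int before, int after, int keys) {
        int same = 0;
        for (int i = 0; i < keys; i++) {
            String key = "key-" + i;
            if (index(key, before) == index(key, after)) {
                same++;
            }
        }
        return (double) same / keys;
    }

    public static void main(String[] args) {
        // roughly a quarter of keys survive a 3 -> 4 expansion,
        // and under a few percent survive 100 -> 101
        System.out.printf("3 -> 4:     %.2f%%%n", 100 * repeatHitRate(3, 4, 1_000_000));
        System.out.printf("100 -> 101: %.2f%%%n", 100 * repeatHitRate(100, 101, 1_000_000));
    }
}
```

A key keeps its server across a 3 -> 4 expansion only when its hash has the same residue modulo 3 and modulo 4, which happens for about a quarter of uniformly distributed hashes, matching the figure above.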

So remainder hashing cannot meet real production needs.


The idea behind the consistent Hash algorithm

To solve these problems, the popular choice is the consistent Hash algorithm, which uses a ring data structure to map keys to cache servers.

The algorithm works as follows:
① First, construct an integer ring of length 2^32 (called the consistent Hash ring).
② Place each cache server node on the ring at the position given by the hash value of its node name (a value in the range [0, 2^32 - 1]). With n server nodes, they should end up spread evenly around the ring.
③ To locate the server for a cached data item, compute the hash value of its key (also in the range [0, 2^32 - 1]), then search the ring clockwise from that position; the nearest cache server node found is the server the data should be placed on.
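Step ③, the clockwise search, maps naturally onto a sorted map: find the first ring position at or after the key's hash, wrapping around to the ring's first position if none exists. A minimal sketch (the node names and ring positions below are made up for illustration; real positions would come from a well-distributed hash of the node names):

```java
import java.util.Map;
import java.util.TreeMap;

public class RingLookup {

    // ring position -> server node name
    static final TreeMap<Long, String> ring = new TreeMap<>();

    static {
        ring.put(100L, "Node1");
        ring.put(2000L, "Node2");
        ring.put(400000L, "Node3");
    }

    static String lookup(long keyHash) {
        // clockwise search: first node at or after the key's position...
        Map.Entry<Long, String> e = ring.ceilingEntry(keyHash);
        // ...wrapping around to the ring's first node if we ran off the end
        return (e != null ? e : ring.firstEntry()).getValue();
    }

    public static void main(String[] args) {
        System.out.println(lookup(150));     // falls between Node1 and Node2 -> Node2
        System.out.println(lookup(999999));  // past the last node -> wraps to Node1
    }
}
```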

(Figure: consistent Hash ring schematic)

In the figure, the circles whose names begin with "Node" are servers, the squares whose names begin with "key" are cached data keys, and the dashed arrows show which server each piece of cached data is placed on. As the figure shows, after the new server is added, only the cached data falling between Node2 and Node4 (clockwise) is redistributed.

However, the algorithm as described has a problem: the newly added Node4 relieves only Node3, while the original Node1 and Node2 are unaffected. After the expansion, the servers carry different loads, and the cached data is no longer balanced across the cluster.

The solution to this load-imbalance problem is a layer of virtual nodes: each physical server is virtualized into a group of virtual cache servers, and it is the hash values of the virtual servers that are placed on the Hash ring. To locate a cached data item, first find the virtual server node for its key's hash, then map that virtual node back to the physical server it belongs to.
With virtual nodes, the distribution of cached data across server nodes looks like the following (source: The Simple Magic of Consistent Hashing):

(Figure: consistent Hash ring with virtual nodes)

Now, when a new physical server node is added, a group of virtual nodes (say n of them) is added to the ring. These n virtual nodes each take load from a neighboring virtual node, and those affected virtual nodes belong to different physical servers; the net effect is that the new server takes over part of the load of all existing servers in the cluster, achieving load balancing. (The more virtual nodes per physical node, the more evenly the load spreads across physical nodes, but too many hurt performance; in practice 150 is a common empirical value.)
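The effect of the replica count on balance can be seen in a small experiment (a sketch, not from the article: MD5 is used here only as a conveniently well-mixed hash, and the server and key names are made up):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class VirtualNodeBalance {

    // a well-mixed 32-bit ring position derived from an MD5 digest
    static int position(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
            return ((d[0] & 0xFF) << 24) | ((d[1] & 0xFF) << 16) | ((d[2] & 0xFF) << 8) | (d[3] & 0xFF);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // ratio of the busiest server's load to the average load,
    // for a given number of virtual nodes per physical server
    static double imbalance(int servers, int replicas, int keys) {
        TreeMap<Integer, Integer> ring = new TreeMap<>();
        for (int s = 0; s < servers; s++)
            for (int r = 0; r < replicas; r++)
                ring.put(position("server-" + s + "#" + r), s);
        Map<Integer, Integer> load = new HashMap<>();
        for (int k = 0; k < keys; k++) {
            Integer pos = ring.ceilingKey(position("key-" + k));
            int server = ring.get(pos != null ? pos : ring.firstKey());
            load.merge(server, 1, Integer::sum);
        }
        int max = load.values().stream().mapToInt(Integer::intValue).max().orElse(0);
        return max / ((double) keys / servers);
    }

    public static void main(String[] args) {
        // more virtual nodes per physical server -> busiest server closer to average
        System.out.printf("1 replica:    busiest = %.2fx average%n", imbalance(10, 1, 100_000));
        System.out.printf("150 replicas: busiest = %.2fx average%n", imbalance(10, 150, 100_000));
    }
}
```

With a single position per server, the busiest server typically carries several times the average load; with 150 virtual nodes per server the busiest server is only slightly above average.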


A consistent Hash implementation

import java.util.Collection;
import java.util.SortedMap;
import java.util.TreeMap;

public class ConsistentHash<T> {

    /**
     * The hash function.
     */
    private final HashFunction hashFunction;

    /**
     * Number of virtual nodes created for each physical server node.
     */
    private final int numberOfReplicas;

    /**
     * The Hash ring.
     *
     * This map holds only the virtual server nodes, never the actual cached data!
     * Its keys are the positions of the virtual server nodes on the ring.
     *
     * A TreeMap is used because it is backed by a red-black tree (a balanced
     * search tree), so once the position of a cache key's hash on the ring is
     * known, the physical server that should hold the data can be found quickly.
     */
    private final SortedMap<Integer, T> circle = new TreeMap<Integer, T>();

    /**
     * Constructor.
     *
     * @param hashFunction     the hash function
     * @param numberOfReplicas number of virtual nodes per physical server node
     * @param nodes            the physical server nodes
     */
    public ConsistentHash(HashFunction hashFunction, int numberOfReplicas, Collection<T> nodes) {
        this.hashFunction = hashFunction;
        this.numberOfReplicas = numberOfReplicas;
        // place the physical servers on the ring
        for (T node : nodes) {
            add(node);
        }
    }

    /**
     * Add a physical server node.
     *
     * Implementation: the node is placed at numberOfReplicas positions on the
     * ring (adding one physical server node adds numberOfReplicas virtual nodes).
     *
     * @param node the physical server node to add
     */
    public void add(T node) {
        for (int i = 0; i < numberOfReplicas; i++) {
            circle.put(hashFunction.hash(node.toString() + i), node);
        }
    }

    /**
     * Remove a physical server node.
     *
     * Implementation: the node occupies several positions on the ring, so all
     * of them must be deleted (removing one physical server node removes its
     * numberOfReplicas virtual nodes).
     *
     * @param node the physical server node to remove
     */
    public void remove(T node) {
        for (int i = 0; i < numberOfReplicas; i++) {
            circle.remove(hashFunction.hash(node.toString() + i));
        }
    }

    /**
     * Find the physical server node for a cache key.
     *
     * @param key the key of the cached data
     * @return the physical server node this data lives on
     */
    public T get(Object key) {
        if (circle.isEmpty()) {
            return null;
        }
        int hash = hashFunction.hash(key);
        // if the key's hash does not land exactly on a virtual server node
        if (!circle.containsKey(hash)) {
            // sub-map of every ring position greater than the key's hash
            SortedMap<Integer, T> tailMap = circle.tailMap(hash);
            // if the sub-map is non-empty, take its first element (the map is
            // sorted, so the first element is the smallest);
            // if it is empty, the key's hash lies past the last virtual server
            // node on the ring (clockwise), so the data belongs to the first
            // virtual server node on the ring (clockwise)
            hash = tailMap.isEmpty() ? circle.firstKey() : tailMap.firstKey();
        }
        return circle.get(hash);
    }
}

class HashFunction {
    int hash(Object key) {
        return key.hashCode();
    }
}
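A usage sketch of the idea above, written as a self-contained miniature rather than by reusing the exact class so it can run standalone (the MD5-based hash, server names, and key set are choices of this example, not the article's): it measures how many keys keep their server assignment when the cluster grows from 3 to 4 nodes, with 150 virtual nodes per server.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

public class ConsistentHashDemo {

    // 32-bit ring position from an MD5 digest (stand-in for the article's
    // HashFunction; plain hashCode() is too poorly distributed for this)
    static int position(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
            return ((d[0] & 0xFF) << 24) | ((d[1] & 0xFF) << 16) | ((d[2] & 0xFF) << 8) | (d[3] & 0xFF);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // build a ring with 150 virtual nodes per physical server
    static TreeMap<Integer, String> ring(List<String> nodes) {
        TreeMap<Integer, String> circle = new TreeMap<>();
        for (String node : nodes)
            for (int i = 0; i < 150; i++)
                circle.put(position(node + i), node);
        return circle;
    }

    // clockwise lookup with wrap-around
    static String get(TreeMap<Integer, String> circle, String key) {
        Integer pos = circle.ceilingKey(position(key));
        return circle.get(pos != null ? pos : circle.firstKey());
    }

    // fraction of keys still routed to the same server after adding one node
    static double repeatHitRate(int keys) {
        TreeMap<Integer, String> before = ring(Arrays.asList("server1", "server2", "server3"));
        TreeMap<Integer, String> after = ring(Arrays.asList("server1", "server2", "server3", "server4"));
        int same = 0;
        for (int k = 0; k < keys; k++)
            if (get(before, "key-" + k).equals(get(after, "key-" + k)))
                same++;
        return (double) same / keys;
    }

    public static void main(String[] args) {
        // roughly 3/4 of keys survive a 3 -> 4 expansion,
        // versus ~25% with remainder hashing
        System.out.printf("repeat hit rate: %.1f%%%n", 100 * repeatHitRate(100_000));
    }
}
```

Since adding server4 only inserts new positions into the ring, a key moves only if one of server4's virtual nodes falls between the key and its old server; that is how the repeat hit rate stays near 1 - 1/n instead of collapsing the way remainder hashing does.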

The hash function in the code above could be improved. The hash function is critical, and for most objects hashCode() does not produce well-distributed values; the results are often small integers clustered in a narrow range.

Popular hash algorithms include MD5, SHA-1 and SHA-2. MD5, however, has many known collisions, so for some years it has not been recommended as an application's hash algorithm; the secure hash algorithms (the SHA family) have replaced it.
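As a sketch of one such replacement, the HashFunction from the code above could delegate to a digest from the JDK's java.security.MessageDigest (SHA-256 here; the class name and the folding of the digest into an int are choices of this example):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class DigestHashFunction {

    // derive a 32-bit hash from the first four bytes of a SHA-256 digest;
    // the digest output is far more uniformly distributed than hashCode()
    int hash(Object key) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] d = md.digest(key.toString().getBytes(StandardCharsets.UTF_8));
            return ((d[0] & 0xFF) << 24) | ((d[1] & 0xFF) << 16)
                    | ((d[2] & 0xFF) << 8) | (d[3] & 0xFF);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 unavailable", e);
        }
    }

    public static void main(String[] args) {
        DigestHashFunction f = new DigestHashFunction();
        System.out.println(f.hash("node1"));
        System.out.println(f.hash("node2"));
    }
}
```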

// TODO hash algorithm classification

So how should consistent hashing be used in practice? Usually you should rely on a battle-tested library rather than writing your own; for example, the clients of most distributed cache components support consistent hashing. Note that only the client needs to implement the algorithm: the cache servers themselves do nothing beyond caching data.


References

① "Large-Scale Website Technology Architecture: Core Principles and Case Studies" (for the textual descriptions)
② Consistent Hashing (for the consistent hashing code)
③ A summary of hash algorithms
④ The Simple Magic of Consistent Hashing



Author: maxwellyue
Link: https://www.jianshu.com/p/05ad6637e66b
Source: Jianshu
Copyright belongs to the author. For reproduction in any form, please contact the author for authorization and cite the source.
