2-2-1 Consistent Hash: problem and solution

Preface: distributed systems and clusters

The concepts of distributed systems and clusters

Distributed systems and clusters are not the same thing. A distributed system is always a cluster, but a cluster is not necessarily distributed. A cluster is multiple instances working together, and splitting a system into parts produces multiple instances, so a distributed system is a cluster. A cluster built by replication, on the other hand, copies the same instance instead of splitting the system, so it is not distributed.


1. Consistent Hash algorithm

1.1 A review of the Hash algorithm

Hash algorithms are used in many places: MD5 and SHA in the security field (strictly speaking these are digest algorithms rather than encryption algorithms), and Hash tables in data storage and lookup.

Why use Hash?
Hash algorithms are most commonly used in data storage and lookup. The classic example is the Hash table, whose query efficiency is very high.
If the hash algorithm is well designed, the lookup time complexity of a hash table can approach O(1).

1.1.1 Case
  • Case: given the data set 1,5,7,6,3,4,8, store it, then, for an arbitrary number n, determine whether n exists in the data set

  • Sequential search

    import java.util.Arrays;
    import java.util.List;

    List<Integer> list = Arrays.asList(1, 5, 7, 6, 3, 4, 8);
    // check each element in turn
    boolean found = false;
    for (int element : list) {
        if (element == n) {
            found = true; // equal: n exists in the data set
            break;
        }
    }

    This is the sequential search method. It works by looping over every element, which is primitive and inefficient.



  • Binary search

    Once the data is sorted, binary search is more efficient than sequential search, because each comparison halves the remaining range. It still needs several comparisons per lookup, so the efficiency is not ideal.
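As a minimal sketch of binary search on the case data (the class name `BinarySearchDemo` and the method `contains` are illustrative names, not from the original):

```java
import java.util.Arrays;

class BinarySearchDemo {
    // Returns true if target is present in the sorted array.
    static boolean contains(int[] sorted, int target) {
        int low = 0;
        int high = sorted.length - 1;
        while (low <= high) {
            int mid = low + (high - low) / 2; // avoids int overflow of (low + high)
            if (sorted[mid] == target) {
                return true;
            } else if (sorted[mid] < target) {
                low = mid + 1;  // target can only be in the upper half
            } else {
                high = mid - 1; // target can only be in the lower half
            }
        }
        return false;
    }

    public static void main(String[] args) {
        int[] data = {1, 5, 7, 6, 3, 4, 8};
        Arrays.sort(data); // binary search requires sorted input
        System.out.println(contains(data, 6)); // true
        System.out.println(contains(data, 2)); // false
    }
}
```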

  • Direct addressing

    Define an array whose length is determined by the largest value in the data set (here the length is 9, subscripts 0 to 8). Data 1 is stored at subscript 1, 3 is stored at subscript 3, and so on.
    To check whether 5 exists, it is enough to check whether array[5] is empty. If it is empty, 5 is not in the data set; if it is not empty, 5 is in the data set. The answer is found in a single lookup, so the time complexity is O(1). This is called the "direct addressing method": the data is bound directly to the array subscript, and a lookup simply reads array[n].

    Advantages: fast; the result is found in a single lookup
    Disadvantages:
    1) Wasted space. For data such as 1,5,7,6,3,4,8,12306 the maximum value is 12306, so this approach needs an array of length 12307 to store only a few items, and the remaining slots are wasted
    2) Duplicates cannot be stored. For data such as 1,5,7,6,3,4,8,12,12,12,12,12,12,12,12,12,12,12 the maximum value is 12, so the array has only 13 slots, one per value, and cannot hold that many items
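A minimal sketch of direct addressing for the case data, assuming all values are non-negative and no larger than a known maximum (the names `DirectAddressDemo`, `build` and `contains` are illustrative):

```java
class DirectAddressDemo {
    // Builds a presence table indexed directly by value, i.e. H(key) = key.
    static boolean[] build(int[] data, int maxValue) {
        boolean[] present = new boolean[maxValue + 1]; // one slot per possible value
        for (int value : data) {
            present[value] = true;
        }
        return present;
    }

    // A single array read answers the question, so the lookup is O(1).
    static boolean contains(boolean[] present, int n) {
        return n >= 0 && n < present.length && present[n];
    }

    public static void main(String[] args) {
        int[] data = {1, 5, 7, 6, 3, 4, 8};
        boolean[] present = build(data, 8); // array of length 9, subscripts 0 to 8
        System.out.println(contains(present, 5)); // true
        System.out.println(contains(present, 2)); // false
    }
}
```

Both disadvantages show up directly in the sketch: the table size follows the largest value rather than the number of items, and a repeated value only sets the same slot again.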

  • Zipper method (chaining)

    Consider another design. Suppose the data is 3, 5, 7, 12306, four items in total. We allocate an array of any size, say 5 slots. To decide where each item is stored, take the item modulo the number of slots (5) and use the remainder as the subscript of the storage location. For example, 3%5=3, so 3 is stored at subscript 3; 12306%5=1, so 12306 is stored at subscript 1.

The modulo operation above (data % number of slots) is a hash algorithm, although a very simple one. This way of constructing a hash algorithm is called the division-remainder method.

If the data is 1, 6, 7, 8 and we store these 4 items in the array above, 1 and 6 both map to subscript 1 (1%5=1, 6%5=1), which is a hash collision.

Zipper method: the array length is fixed, so to store more items, compute the Hash value and keep a linked list at each array slot. Items whose Hash values collide are appended to the same linked list.

If the Hash algorithm is well designed, query efficiency is close to O(1). If it is poorly designed, many items end up in the same linked list and query efficiency becomes very low.
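The zipper method with the division-remainder hash can be sketched as follows. This is only an illustration under simple assumptions (integer keys, a fixed number of slots, no resizing), and `ChainedHashSet` and its methods are hypothetical names:

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

class ChainedHashSet {
    // One linked list ("chain") per array slot.
    private final List<List<Integer>> slots = new ArrayList<>();

    ChainedHashSet(int capacity) {
        for (int i = 0; i < capacity; i++) {
            slots.add(new LinkedList<>());
        }
    }

    // Division-remainder hash: key % number of slots (floorMod handles negative keys).
    private int indexFor(int key) {
        return Math.floorMod(key, slots.size());
    }

    void add(int key) {
        List<Integer> chain = slots.get(indexFor(key));
        if (!chain.contains(key)) {
            chain.add(key); // colliding keys share the same chain
        }
    }

    boolean contains(int key) {
        return slots.get(indexFor(key)).contains(key);
    }

    int chainLength(int index) {
        return slots.get(index).size();
    }

    public static void main(String[] args) {
        ChainedHashSet set = new ChainedHashSet(5);
        for (int key : new int[]{1, 6, 7, 8}) {
            set.add(key);
        }
        System.out.println(set.chainLength(1)); // 2, because 1 and 6 collide at subscript 1
        System.out.println(set.contains(6));    // true
        System.out.println(set.contains(11));   // false
    }
}
```

A lookup first jumps to one slot and then walks only that slot's chain, which is why an even distribution of keys keeps the cost close to O(1).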

Therefore, the query efficiency of a Hash table depends on the Hash algorithm: one that distributes the data evenly both saves space and improves query efficiency. Hash algorithm design is a deep and fairly complicated subject. The Hash algorithms used inside Hash tables have been revised over the years, and many mathematicians study the topic.

Common ways of constructing a Hash algorithm:
  • Division-remainder method, for example 3%5
  • Linear construction: H(key)=a*key + b (a, b are constants). Direct addressing is the simplest form of this, with the expression H(key)=key

A hashcode is also obtained through a hash algorithm.

1.2 Load balancing application scenarios of consistent Hash algorithm

Hash algorithms are used in many distributed cluster products, such as Redis, Hadoop, Elasticsearch, MySQL database and table sharding, and Nginx load balancing.
The main application scenarios fall into two categories.

1.2.1 Load balancing of requests (similar to nginx's ip_hash strategy)


Nginx's ip_hash strategy always routes requests from the same client to the same target server as long as the client's IP does not change. This achieves session stickiness and avoids having to share sessions between servers.
Without the ip_hash strategy, how could sticky sessions be implemented?
You could maintain a mapping table that stores the relationship between the client IP or sessionid and the target server:
<ip,tomcat1>
Disadvantages:
1) With many clients the mapping table becomes very large and wastes memory
2) Clients going online and offline, and target servers going online and offline, both force the mapping table to be updated, so it is costly to maintain

With a hash algorithm this becomes much simpler: compute the hash value of the client IP or sessionid, then take it modulo the number of servers. The result is the number of the server the current request should be routed to. In this way, requests from the same client IP are always routed to the same target server, which achieves session stickiness.

1.2.2 A look at the core C source code of Nginx's ip_hash strategy

1.2.3 Distributed storage application scenarios of the consistent Hash algorithm

Take the distributed in-memory database Redis as an example. The cluster has three Redis servers: redis1, redis2, and redis3. When storing data, which server should <key1, value1> be stored on? Hash the key:
hash(key1)%3=index, then use the remainder index to select the server node that stores the data.
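The rule hash(key1)%3=index can be sketched as below. This assumes `String.hashCode()` stands in for the unspecified hash function, and `ModSharding` and `serverFor` are illustrative names; a real product may use a different hash.

```java
class ModSharding {
    static final String[] SERVERS = {"redis1", "redis2", "redis3"};

    // hash(key) % number of servers selects the node that stores the key.
    static String serverFor(String key) {
        // floorMod keeps the index non-negative even when hashCode() is negative
        int index = Math.floorMod(key.hashCode(), SERVERS.length);
        return SERVERS[index];
    }

    public static void main(String[] args) {
        // The same key always maps to the same server.
        System.out.println("key1 -> " + serverFor("key1"));
        System.out.println("key1 -> " + serverFor("key1"));
        System.out.println("key2 -> " + serverFor("key2"));
    }
}
```

The same function works for request routing if the client IP or sessionid is passed in place of the key.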

2. Implementing the algorithms by hand

2.1 Problems with the ordinary Hash algorithm

2.2 Principle of the consistent Hash algorithm

Hash ring

2.3 Analysis of scaling in and scaling out with the consistent Hash algorithm

2.4 Consistent Hash algorithm + virtual node scheme

1) As described above, each server is responsible for one segment of the ring, and when nodes are added or removed the consistent hash algorithm only needs to relocate a small part of the data in the ring space, so it has good fault tolerance and scalability. However, when there are too few service nodes, uneven node distribution easily causes data skew. For example, if the system has only two servers and they sit close together on the ring, node 2 is responsible for only a very small segment and most client requests fall on node 1. This is the data (request) skew problem.
2) To solve the data skew problem, the consistent hash algorithm introduces virtual nodes: several hashes are computed for each service node, and a copy of the service node is placed at each resulting position. These copies are called virtual nodes. This can be done by appending a number to the server IP or host name. For example, with three virtual nodes per server, compute the hash values of "node 1 ip#1", "node 1 ip#2", "node 1 ip#3", "node 2 ip#1", "node 2 ip#2", and "node 2 ip#3", which gives six virtual nodes. When a client is routed to a virtual node, it is actually routed to the real node behind that virtual node.


2.5 Implementing the ordinary Hash algorithm by hand

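A minimal sketch of ordinary Hash routing, written for illustration rather than taken from the original. It assumes `String.hashCode()` as the hash function, made-up client IPs, and a scale-in where the server with the highest index goes offline; `OrdinaryHashDemo`, `route` and `countRemapped` are hypothetical names.

```java
class OrdinaryHashDemo {
    // Ordinary hash routing: hash the client IP, then take it modulo the server count.
    static int route(String clientIp, int serverCount) {
        // floorMod keeps the index non-negative even when hashCode() is negative
        return Math.floorMod(clientIp.hashCode(), serverCount);
    }

    // Counts the clients whose server index changes when the server count changes.
    static int countRemapped(String[] clientIps, int before, int after) {
        int remapped = 0;
        for (String ip : clientIps) {
            if (route(ip, before) != route(ip, after)) {
                remapped++;
            }
        }
        return remapped;
    }

    public static void main(String[] args) {
        String[] clients = {"10.78.12.3", "113.25.63.1", "126.12.3.8"};
        for (String ip : clients) {
            System.out.println(ip + " -> server " + route(ip, 3));
        }
        // Assumed scale-in: the server with the highest index goes offline.
        System.out.println("remapped after 3 -> 2 servers: "
                + countRemapped(clients, 3, 2));
    }
}
```

Because the divisor changes from 3 to 2, clients on the surviving servers can also be moved, not only those on the server that went offline. This is the weakness of the ordinary Hash algorithm that consistent hashing is meant to reduce.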

2.6 Implementing the consistent Hash algorithm by hand

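A minimal sketch of a Hash ring without virtual nodes, offered as one possible implementation rather than the original one. It assumes CRC32 as the hash function (ring positions 0 to 2^32-1) and a sorted `TreeMap` as the ring; `ConsistentHashRing` and its methods are hypothetical names, and the IPs are made up.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.CRC32;

class ConsistentHashRing {
    // Ring position -> server, kept sorted so we can walk clockwise.
    private final TreeMap<Long, String> ring = new TreeMap<>();

    // Assumption: CRC32 as the hash, giving positions in 0 .. 2^32-1.
    static long hash(String value) {
        CRC32 crc = new CRC32();
        crc.update(value.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    void addServer(String server) {
        ring.put(hash(server), server);
    }

    void removeServer(String server) {
        ring.remove(hash(server));
    }

    // Finds the first server at or after the client's position, wrapping around.
    String route(String clientKey) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("no servers on the ring");
        }
        Map.Entry<Long, String> entry = ring.ceilingEntry(hash(clientKey));
        if (entry == null) {
            entry = ring.firstEntry(); // past the last server: wrap to the start
        }
        return entry.getValue();
    }

    public static void main(String[] args) {
        ConsistentHashRing ring = new ConsistentHashRing();
        ring.addServer("10.78.12.3");
        ring.addServer("10.78.12.4");
        ring.addServer("10.78.12.5");
        String[] clients = {"113.25.63.1", "126.12.3.8", "192.168.0.7"};
        for (String client : clients) {
            System.out.println(client + " -> " + ring.route(client));
        }
        ring.removeServer("10.78.12.5"); // scale in: only its clients move
        for (String client : clients) {
            System.out.println(client + " -> " + ring.route(client));
        }
    }
}
```

When a server is removed, only the clients that were routed to it move to the next server on the ring; every other client keeps its server.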

2.7 Implementing the consistent Hash algorithm with the virtual node scheme by hand
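A minimal sketch of the virtual node scheme from section 2.4, where each server is placed on the ring several times under the names "ip#1", "ip#2" and so on. As assumptions for illustration, it uses CRC32 as the hash function and a `TreeMap` as the ring; `VirtualNodeRing` and its methods are hypothetical names, and the IPs are made up.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.CRC32;

class VirtualNodeRing {
    // Ring position of a virtual node -> real server behind it.
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final int virtualCount;

    VirtualNodeRing(int virtualCount) {
        this.virtualCount = virtualCount;
    }

    // Assumption: CRC32 as the hash, giving positions in 0 .. 2^32-1.
    static long hash(String value) {
        CRC32 crc = new CRC32();
        crc.update(value.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    // Places virtualCount virtual nodes named "server#1", "server#2", ...
    void addServer(String server) {
        for (int i = 1; i <= virtualCount; i++) {
            ring.put(hash(server + "#" + i), server);
        }
    }

    void removeServer(String server) {
        for (int i = 1; i <= virtualCount; i++) {
            ring.remove(hash(server + "#" + i));
        }
    }

    // Routing to a virtual node returns the real server it stands for.
    String route(String clientKey) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("no servers on the ring");
        }
        Map.Entry<Long, String> entry = ring.ceilingEntry(hash(clientKey));
        if (entry == null) {
            entry = ring.firstEntry(); // wrap around the ring
        }
        return entry.getValue();
    }

    int positions() {
        return ring.size();
    }

    public static void main(String[] args) {
        VirtualNodeRing ring = new VirtualNodeRing(3);
        ring.addServer("10.78.12.3");
        ring.addServer("10.78.12.4");
        System.out.println("virtual nodes on the ring: " + ring.positions());
        for (String client : new String[]{"113.25.63.1", "126.12.3.8", "192.168.0.7"}) {
            System.out.println(client + " -> " + ring.route(client));
        }
    }
}
```

Raising the number of virtual nodes per server spreads each server over more ring segments, which tends to even out the load at the cost of a larger map.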

3. Configuring a consistent Hash load balancing strategy in Nginx

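A minimal sketch of such a configuration, assuming the built-in `hash` directive with its `consistent` parameter from ngx_http_upstream_module. The original setup may have used a different module, and the upstream name, addresses and ports below are placeholders.

```nginx
upstream backend_servers {
    # Assumption: route by request URI using consistent hashing;
    # $remote_addr could be used instead to hash on the client IP.
    hash $request_uri consistent;
    server 127.0.0.1:8080;
    server 127.0.0.1:8081;
}

server {
    listen 80;

    location / {
        proxy_pass http://backend_servers;
    }
}
```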


Origin blog.csdn.net/qq_42082278/article/details/112512777