Application of Consistent Hash Algorithm in Distributed Cache

Purpose


This article introduces the consistent hashing algorithm (Consistent Hashing), explains the principle behind it, and describes its application in distributed caches.

Application scenarios


Suppose we run a website. As traffic grows, the load on the server keeps rising, and reading and writing the database directly, as we did before, no longer performs well, so we want to introduce Redis as a cache. We have three machines in total that can be used as Redis servers, as shown in the figure below.

Schematic diagram of distributed cache.png

Problems to be solved


Generally speaking, a distributed cache is used under heavy access and highly concurrent traffic: cheap machines in the same subnet are deployed as a multi-machine cluster, and read requests are then distributed through load balancing and certain routing rules, which map each request to the corresponding cache server. Two things are the pain points of deploying a distributed cache: mapping requests to cache servers accurately, and adding or removing cache servers gracefully.
Next, we will analyze some traditional methods to solve the above problems.

1. The problem of accurate mapping between the request and the cache server.

  • Simplest strategy - random selection:
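    Meaning: send each Redis request to a randomly chosen Redis server.
    Problems arising:
    1. The same data may be stored on different machines, causing data redundancy.
    2. Data may already be cached and yet the access misses, because there is no guarantee that all accesses to the same key are sent to the same server.
    Therefore, the random strategy is very poor in both time efficiency and space efficiency.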
  • Ensuring that the same key always reaches the same Redis server - calculate the hash:
    Meaning: guarantee that every access to the same key is sent to the same server.
    Scheme description:
    For each access, a hash value can be calculated as follows:
    h = Hash(key) % 3
    where Hash is a hash function that maps strings to non-negative integers. If we number the Redis servers 0, 1, and 2, we can compute the server number h from the key with the formula above and then access that server.
    Although this method solves the two problems mentioned above, it introduces others. Generalized, the method is:
    h = Hash(key) % N
    This formula determines which server the request for each key is sent to, where N is the number of servers and the servers are numbered from 0 to N-1.
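The modulo scheme above can be sketched in a few lines of Python. This is only an illustration: the server names, the `stable_hash` helper, and the choice of MD5 as the Hash function are assumptions, not something the scheme prescribes.

```python
import hashlib

# Hypothetical server names; the article only says there are three Redis servers.
SERVERS = ["Redis-0", "Redis-1", "Redis-2"]


def stable_hash(key: str) -> int:
    """Map a string to a non-negative integer (first 4 bytes of its MD5 digest)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def pick_server(key: str, servers=SERVERS) -> str:
    """Compute h = Hash(key) % N and use h as the server number."""
    h = stable_hash(key) % len(servers)
    return servers[h]


if __name__ == "__main__":
    for key in ["user:1", "user:2", "session:abc"]:
        # The same key always yields the same server.
        print(key, "->", pick_server(key))
```

A digest from `hashlib` is used here rather than Python's built-in `hash()`, because string hashes from `hash()` are randomized per process by default and would send the same key to different servers after a restart.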

2. The problem of gracefully adding and removing cache servers.
When the Redis cache server is located by a modulo operation on the hash of the requested key, fault tolerance and scalability become extremely poor.

  • Fault tolerance: whether the entire system can still operate correctly and efficiently when one or several servers in the system become unavailable;
  • Scalability: whether the entire system can still operate correctly and efficiently when new servers are added.
    Now suppose one server goes down. To fill the vacancy, the failed server is removed from the numbered list, and every server after it moves forward one place, so its number is decremented by one. Every key must then be recalculated according to h = Hash(key) % (N-1). Similarly, if a new server is added, the existing server numbers do not change, but every key must be recalculated according to h = Hash(key) % (N+1). Therefore, whenever the set of servers changes, a large number of keys are relocated to different servers, resulting in a large number of cache misses.
    This is very bad for a distributed system.
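To get a feel for how many keys are relocated, the following sketch counts the keys whose server number changes when N changes. The keys are synthetic, MD5 stands in for Hash, and for the removal case it is assumed that the last server (number N-1) is the one that fails, so the remaining servers keep their numbers. For a uniformly distributed hash, about three quarters of the keys would be expected to move when going from 3 to 4 servers.

```python
import hashlib


def stable_hash(key: str) -> int:
    """Stand-in for Hash: first 4 bytes of the MD5 digest as an integer."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def moved_fraction(keys, old_n: int, new_n: int) -> float:
    """Fraction of keys whose server number changes when N goes old_n -> new_n."""
    moved = sum(
        1 for k in keys if stable_hash(k) % old_n != stable_hash(k) % new_n
    )
    return moved / len(keys)


# Synthetic keys, for illustration only.
keys = [f"key-{i}" for i in range(10_000)]

print("3 -> 4 servers:", moved_fraction(keys, 3, 4))
# Assumes the server with the highest number is the one removed.
print("3 -> 2 servers:", moved_fraction(keys, 3, 2))
```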

A well-designed distributed hashing scheme should have good monotonicity: adding or removing service nodes should not cause a large number of keys to be relocated. The consistent hashing algorithm is such a scheme.

Solution - Consistent hash algorithm


Algorithm Brief
Consistent hashing was first proposed in the paper "Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web". In simple terms, consistent hashing organizes the entire hash value space into a virtual ring. For example, assuming the value space of a hash function H is 0 to 2^32 - 1 (that is, the hash value is a 32-bit unsigned integer), the entire hash space ring is as follows:

                                                


 Consistent hash function value space.png

The entire space is organized clockwise, and 0 and 2^32 - 1 coincide at the zero point of the ring.

The next step is to hash each server with H. Specifically, the IP address or host name of the server can be used as the key for hashing, so that each machine gets a position on the hash ring. Here, assume that the three servers above, hashed by IP address, are located in the ring space as follows:

                               

Consistent hash function value space (1).png
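Placing servers on the ring might look like the sketch below. The IP addresses are hypothetical, and `ring_hash` (the first 4 bytes of an MD5 digest) is only one possible choice for H; the positions it produces will not match the figure.

```python
import hashlib

RING_SIZE = 2 ** 32  # hash values are 32-bit unsigned integers: 0 .. 2^32 - 1


def ring_hash(key: str) -> int:
    """Stand-in for H: first 4 bytes of MD5 as an unsigned 32-bit integer."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


# Hypothetical IP addresses; the article does not give the real ones.
SERVER_IPS = {
    "Redis-0": "192.168.0.10",
    "Redis-1": "192.168.0.11",
    "Redis-2": "192.168.0.12",
}

# Each server's position on the ring is the hash of its IP address.
# Sorting by position gives the clockwise order starting from 0.
positions = sorted((ring_hash(ip), name) for name, ip in SERVER_IPS.items())

for pos, name in positions:
    print(f"{name}: position {pos} ({pos / RING_SIZE:.1%} of the way around)")
```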

Next, the following algorithm locates the server for a piece of data: use the same function H to calculate the hash value h of the data key, determine the position of the data on the ring from h, and "walk" clockwise along the ring from this position. The first server encountered is the server the data should be located on.
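The clockwise walk can be implemented as a binary search over the sorted server positions. The sketch below is a minimal illustration, not a production client: `HashRing` and `ring_hash` are hypothetical names, MD5 stands in for H, and server names are hashed instead of IP addresses.

```python
import bisect
import hashlib


def ring_hash(key: str) -> int:
    """Stand-in for H: first 4 bytes of MD5 as an unsigned 32-bit integer."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


class HashRing:
    """Minimal consistent hash ring without virtual nodes."""

    def __init__(self, servers):
        # Sorted (position, server) pairs; sorted order is the clockwise order.
        self._ring = sorted((ring_hash(s), s) for s in servers)
        self._positions = [pos for pos, _ in self._ring]

    def locate(self, key: str) -> str:
        """Return the first server at or after the key's position, clockwise."""
        h = ring_hash(key)
        i = bisect.bisect_left(self._positions, h)
        if i == len(self._ring):
            i = 0  # walked past the largest position: wrap around the ring
        return self._ring[i][1]


# Server names are hashed here instead of IP addresses (an assumption).
ring = HashRing(["Redis-0", "Redis-1", "Redis-2"])
for key in ["A", "B", "C", "D"]:
    print(key, "->", ring.locate(key))
```

Keeping the positions in a sorted list makes each lookup O(log n) in the number of servers on the ring.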

For example, suppose our cache servers hold data objects for four keys A, B, C, and D. After hashing, their positions in the ring space are as follows:

                               


Consistent hash function value space (2).png

So far there seems to be nothing magical about it. Please read on.

Analysis of Fault Tolerance and Scalability
Let's analyze the fault tolerance and scalability of the consistent hashing algorithm. Now suppose Redis-2 goes down:

                               

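Consistent hash function value space (3).png

We can see that keys A, C, and D are not affected, and only key B is redirected to Redis-0. In general, if a server becomes unavailable, only the data between it and the previous server on the ring (the first server met when walking counterclockwise) is affected. Now consider another situation: we add a server, Redis-3, to the system: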



Consistent hash function value space (4).png

We can see that key C is relocated to Redis-3, and the other keys are not affected.

To sum up, when nodes are added or removed, the consistent hash algorithm only needs to relocate a small part of the data in the ring space, so it has good fault tolerance and scalability.
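Both cases can be checked with a small experiment. The sketch below uses synthetic keys, hashes server names with MD5 as a stand-in for H, and rebuilds the ring on every lookup for brevity. It verifies that removing Redis-2 moves only the keys that lived on Redis-2, and that adding Redis-3 moves only the keys that end up on Redis-3.

```python
import bisect
import hashlib


def ring_hash(key: str) -> int:
    """Stand-in for H: first 4 bytes of MD5 as an unsigned 32-bit integer."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def locate(servers, key: str) -> str:
    """Walk clockwise from the key's position to the first server."""
    ring = sorted((ring_hash(s), s) for s in servers)
    positions = [pos for pos, _ in ring]
    # The modulo wraps an index past the end back to the first server.
    i = bisect.bisect_left(positions, ring_hash(key)) % len(ring)
    return ring[i][1]


def relocated(keys, before, after):
    """Keys whose server differs between two server lists."""
    return [k for k in keys if locate(before, k) != locate(after, k)]


keys = [f"key-{i}" for i in range(2_000)]  # synthetic keys
base = ["Redis-0", "Redis-1", "Redis-2"]

# Remove Redis-2: only keys that lived on Redis-2 move.
down = relocated(keys, base, ["Redis-0", "Redis-1"])
assert all(locate(base, k) == "Redis-2" for k in down)

# Add Redis-3: only keys that now land on Redis-3 move.
up = relocated(keys, base, base + ["Redis-3"])
assert all(locate(base + ["Redis-3"], k) == "Redis-3" for k in up)

print(f"removed Redis-2: {len(down)} of {len(keys)} keys moved")
print(f"added Redis-3:   {len(up)} of {len(keys)} keys moved")
```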

Data skew problem


Solution - Virtual nodes
When there are too few service nodes, the consistent hash algorithm is prone to data skew, because the nodes are unevenly distributed on the ring. For example, suppose there are only two servers in our system, and their distribution on the ring is as follows:

                                     

Consistent hash function value space (5).png

At this point, a large amount of data will inevitably be concentrated on Redis-1, and only a very small amount will be located on Redis-0. To solve this data skew problem, the consistent hash algorithm introduces a virtual node mechanism: multiple hashes are calculated for each service node, and a copy of the service node, called a virtual node, is placed at each resulting position. In practice this can be done by appending a number to the server IP or host name. For example, in the case above we can decide to calculate three virtual nodes for each server, computing the hash values of "Redis-1 #1", "Redis-1 #2", "Redis-1 #3", "Redis-0 #1", "Redis-0 #2", and "Redis-0 #3", thus forming six virtual nodes:

                         

Consistent hash function value space (6).png

The data positioning algorithm itself remains unchanged; there is only one extra step that maps a virtual node to its actual node. For example, data located on the virtual nodes "Redis-1 #1", "Redis-1 #2", and "Redis-1 #3" is all stored on Redis-1. This solves the problem of data skew when there are few service nodes. In practical applications, the number of virtual nodes is usually set to 32 or more, so even a few service nodes can achieve a relatively uniform data distribution.
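A ring with virtual nodes can be sketched as follows. `VirtualNodeRing`, the `replicas` parameter, and the use of MD5 for H are assumptions made for illustration; the virtual node labels follow the "Redis-1 #1" naming used above. Running it prints how 10,000 synthetic keys are split between two servers for 1, 3, and 32 virtual nodes per server, so the effect on skew can be observed directly.

```python
import bisect
import hashlib
from collections import Counter


def ring_hash(key: str) -> int:
    """Stand-in for H: first 4 bytes of MD5 as an unsigned 32-bit integer."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


class VirtualNodeRing:
    """Consistent hash ring where each server owns several virtual nodes."""

    def __init__(self, servers, replicas: int = 32):
        # Each ring entry is (position, real server). The label such as
        # "Redis-1 #1" is only used to compute the position.
        self._ring = sorted(
            (ring_hash(f"{server} #{i}"), server)
            for server in servers
            for i in range(1, replicas + 1)
        )
        self._positions = [pos for pos, _ in self._ring]

    def locate(self, key: str) -> str:
        """Find the first virtual node clockwise, then return its real server."""
        i = bisect.bisect_left(self._positions, ring_hash(key)) % len(self._ring)
        return self._ring[i][1]


keys = [f"key-{i}" for i in range(10_000)]  # synthetic keys
for replicas in (1, 3, 32):
    ring = VirtualNodeRing(["Redis-0", "Redis-1"], replicas)
    print(replicas, dict(Counter(ring.locate(k) for k in keys)))
```

Storing the real server name directly in each ring entry makes the mapping from virtual node to actual node a plain tuple lookup.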

Summary

At present, consistent hashing has basically become a standard feature of distributed system components. For example, a number of Redis clients provide built-in consistent hashing support. This article only briefly introduces the idea of this algorithm and its application scenarios in distributed applications.



Author: fxliutao
Link: https://www.jianshu.com/p/793c76ee84fc
Source: Jianshu
The copyright belongs to the author. For commercial reprints, please contact the author for authorization, and for non-commercial reprints, please indicate the source.
