Learn "consistent hashing" in one go: 18 pictures are all you need

When architect Da Liu saw the out-of-order message problem reported by intern Xiao Li, he knew exactly what to reach for: this time, he would solve it with an old friend, consistent hashing.

Consistent hashing is a must-have remedy in every distributed architect's medicine cabinet, so let's take a dose together.

1. Shades of myself 20 years ago: let's start with plain hashing

N years ago, distributed architectures were on the rise across the Internet industry. Driven by business needs, Da Liu's company adopted a business architecture designed by an IBM team.

The architecture followed distributed principles, with components communicating through the RabbitMQ message middleware. For its era, it counted as black technology: forward-thinking design built on technology few had seen.

However, distributed technology was not widely deployed back then, and many parts of the design were still immature. Over years of use, some problems gradually surfaced, the two most typical being:

  1. RabbitMQ itself was a single point of failure; once it went down, the whole system was paralyzed.

  2. The business systems that sent and received messages were also single points. If any of them failed, messages in the corresponding queue either went unconsumed or piled up in large numbers.

Either way, the entire distributed system ended up unusable, and cleaning up afterward was painful.

As for RabbitMQ's single point: its clustering feature was very weak at the time, and even normal mode still left each queue itself as a single point. In the end, high availability was achieved with Keepalived fronting two independent RabbitMQ instances.

For the single point in the business systems, the road to a solution had a few twists. In general, the way to eliminate a single point is to pile on machines and application instances. If sending and receiving is a single point, just deploy a few more copies of the application. Technically, it comes down to multiple sender and receiver applications competing to put messages into MQ and take messages out.

However, it was precisely after clustering the message-sending and message-receiving applications that problems appeared.

The architecture serves all kinds of business inside the company, and some of that business has strict requirements on message order.

Take the company's internal IM application: whether in one-to-one chats or group chats, conversation messages must arrive strictly in order. Once the applications that produce and consume messages were clustered, the problem appeared:

Chat history was out of order.

When A chats with B, some messages do not reach B in the order A sent them, so the whole conversation turns into a mess.

After investigation, the root cause turned out to be the application cluster itself. Because nothing special was done about how the clustered applications send and receive messages, when A sends B several messages in a short period, the messages A puts into RabbitMQ may be grabbed by different consumer applications on B's side.

That is where the disorder comes from: although the business logic is identical, the clustered applications can still differ in processing speed, and the chat messages users see end up scrambled.

The problem is found; what is the solution?

As noted above, the disorder comes from different applications in the cluster grabbing messages and processing them at different speeds. If we can guarantee that a conversation between A and B is always consumed by the same application in B's consumer cluster, from the start of the session to its end, then we can guarantee message order: that application queues the messages it grabs and processes them one by one.

So, how is this guarantee achieved?

First, we create queues in RabbitMQ that share a common prefix followed by a queue number, such as chat00 and chat01. Then different applications in the cluster each listen to one of these numbered queues. When A sends a message, we apply a simple hash to it:

m = hash(id) mod n

Here, id identifies the user, n is the number of deployed instances of B's business system in the cluster, and m is the number of the destination queue we should send to.

Suppose hash(id) is 2000 and n is 2; then m = 0. A therefore sends his conversation with B to the chat00 queue, and B's side displays the messages to the end user in order. The out-of-order chat problem is solved.
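As a minimal sketch of this routing rule (the helper names and the `chat` prefix are my own assumptions; a digest-based hash stands in for whatever hash the team actually used, since Python's built-in `hash()` is salted per process):

```python
import hashlib

def stable_hash(key: str) -> int:
    # Deterministic hash; Python's built-in hash() is randomized per process.
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

def route_queue(user_id: str, n: int) -> str:
    # m = hash(id) mod n picks the destination queue number.
    m = stable_hash(user_id) % n
    return f"chat{m:02d}"
```

Every message keyed by the same conversation id then lands in the same queue, so a single consumer sees the whole conversation and can process it in order.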

So is that the end of it? Is this solution perfect?

2. It seems we need more application instances

As the company grew, headcount shot up, the number of internal IM users grew with it, and a new problem emerged.

The main symptom: chat messages were arriving more and more slowly. The reason was simple: there were not enough machines in the cluster consuming chat messages. The fix could be equally simple and direct: add another machine.

However, since a new machine joins the message-consuming cluster, we now need to do a few extra things:

  1. Add a new queue, chat02, for the application on the new machine.

  2. Modify the message-distribution rule, changing the original hash(id) mod 2 to hash(id) mod 3.

  3. Restart the message-sending project so the modified rule takes effect.

  4. Deploy the message-consuming application to the new machine.

So far everything is under control. Whenever capacity is needed, developers just add a queue and make a small change to the distribution rule.

But what they didn't know was that a storm was coming.

3. New problems arrive; maybe that's just life

Many people inside the company used this IM tool, and sometimes, for convenience, the company's customers and some partners used it too. That complicated things. At first the developers carried on as usual, adding a machine whenever people complained that messages were slow.

Worst of all, the company's customers began complaining that the IM was sometimes completely unavailable. That was no small matter. Internal problems can be smoothed over with internal communication, but customer problems cannot be taken lightly: the product's reputation is at stake.

So what on earth was going on?

It turned out the root cause was the service restart after every rule change. Each time the configuration rule was modified, an appropriate downtime window had to be planned to redeploy the project.

That approach stops working once the company's customers use the IM too, because many of them are overseas. In other words, someone is probably using the IM at any hour, day or night.

This forced the developers, every time they added a machine, to coordinate a release window with multiple parties, publish an announcement, and only then go live. The endless cycle of negotiate, redeploy, negotiate again, redeploy again nearly wore them out.

Often, by the time the negotiation ended, the release date had slipped half a month out, and during that half month the developers had to endure the complaints of countless internal IM users. The exhausting coordination, the hoarse explanations, the lost sleep: all of it pushed the developers toward changing the technical solution in front of them.

4. Turn the idea around, bend the queues into a ring

The essential requirement for the new technical solution was:

Whether the message-distribution rule changes or machines are added to the cluster, the service must not stop.

One good answer is dynamic timed detection of the project's configuration file: check it periodically, and when a change is found, refresh the configuration rules.
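A sketch of what such timed detection might look like (the file format, the `queue_count` field, and the `RuleWatcher` class are all hypothetical; a real deployment would poll on a background timer):

```python
import json
import os

class RuleWatcher:
    """Hypothetical sketch: hot-reload the 'hash(id) mod n' rule from a
    config file so the message sender never needs a restart."""

    def __init__(self, path: str):
        self.path = path
        self._mtime = None
        self.n = None
        self.refresh()

    def refresh(self):
        # Re-read the rule only when the file's modification time changes.
        mtime = os.path.getmtime(self.path)
        if mtime != self._mtime:
            with open(self.path) as f:
                self.n = json.load(f)["queue_count"]
            self._mtime = mtime

    def route(self, hashed_id: int) -> int:
        self.refresh()  # in production this would run on a timer instead
        return hashed_id % self.n
```

Editing the config file is then enough to change the rule; no redeploy, no downtime window.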

Everything looks good. With dynamic timed detection, whenever we need to add a machine to the cluster, only three steps remain:

  1. Add a queue

  2. Modify the message-distribution rule

  3. Deploy the new machine

Customers notice nothing, and developers no longer need to coordinate special release windows with users. Still, this solution has problems:

  1. As we deploy more and more systems, more and more of them need their rules modified by hand.

  2. If a consumer machine goes down, we need to delete its queue and also update the distribution rule; when the machine recovers, we have to change the rule back again.

The distribution rule is a real nuisance: every change means touching it again. Is there a way to make this assignment more automatic?

Suppose there are 100 queues in MQ for sending and receiving chat messages (100 being a number our IM will never actually reach). Then we only need to configure the rule once:

m = hash(id) mod 100

Then, after our message-sending application starts, it dynamically detects all the real queues that actually exist for chat messages.

When the number computed by the hash has no corresponding real queue, we find a real queue according to a fixed rule, and that is the queue we send the message to.

With this in place, whenever the set of queues changes, growing or shrinking, we no longer have to think about the distribution rule at all. We simply remove a problematic queue, or add a queue along with its consumer. That's it.

This is the idea behind consistent hashing.

How to do it?

In the first step, we assume there are 100 queues for sending and receiving chat messages, arranged on a ring.

In the second step, we obtain the actual number of chat queues; suppose there are 5.

In the third step, we map the real queues onto the ring we assumed in the first step.

In the fourth step, we compute the queue number with the distribution rule hash(id) mod 100.

If hash(id) is 2000, the computed queue number is m = 0. We check and find that queue chat00, at position 0, really exists, so we send the message straight to chat00.

If hash(id) is 1999, the computed queue number is m = 99. Checking the queue mapping, we find no real queue at position 99. What now? Simple: keep searching clockwise. The first real queue we meet is chat00 at position 0, so we send the message to chat00.

These four steps make up the basic consistent hashing algorithm.
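The four steps can be sketched as a toy illustration (the queue names and the dict-based ring are my assumptions, not the article's actual implementation):

```python
RING_SIZE = 100

def lookup(slot: int, real_queues: dict) -> str:
    # Step 4 plus the clockwise rule: walk forward (wrapping around the
    # ring) from the computed slot until a real queue is found.
    for step in range(RING_SIZE):
        pos = (slot + step) % RING_SIZE
        if pos in real_queues:
            return real_queues[pos]
    raise RuntimeError("no real queues on the ring")

# Steps 1-3: a 100-slot ring with 5 real queues mapped at positions 0..4.
queues = {i: f"chat{i:02d}" for i in range(5)}

print(lookup(2000 % RING_SIZE, queues))  # slot 0  -> chat00 directly
print(lookup(1999 % RING_SIZE, queues))  # slot 99 -> wraps clockwise to chat00
```

Both example ids from the text resolve to chat00: one lands on a real queue directly, the other slides clockwise past the empty positions and wraps around.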

So does this consistent hashing scheme meet our goal of never having to update the message-distribution rule? Let's verify:

  1. Suppose we need to add a machine to the message-consuming cluster.
    Adding a machine means adding a queue to MQ, while the distribution rule stays hash(id) mod 100. After the addition, the actual number of queues is 6. Now, whenever hash(id) mod 100 is less than 5, the assignment is exactly as before: whichever queue a message went to previously, it still goes to. When the result equals 5, something changes: those messages are now automatically assigned to chat05, the new consumer starts working on its own, and we need no manual intervention and no change to the distribution rule. Results greater than 5 still wrap clockwise to chat00, just as before.

    (figure: the ring before the new queue is added)

    (figure: the ring after the new queue is added)

  2. Suppose a machine in the consumer cluster goes down.
    To simulate the outage, we remove one queue, bringing the actual count back to 5: exactly the reverse of adding one. For m less than 5, nothing changes: whichever queue was assigned before is still assigned. For m = 5, since chat05 no longer exists, the clockwise search kicks in and lands on chat00. Messages previously assigned to chat05 now go to chat00, which does have a live consumer, so users notice nothing and we can concentrate on repairing the machine. Once it recovers, just as with a newly added machine, messages whose result is 5 are redistributed back to chat05.

So far we can see that once consistent hashing is introduced, whether a new machine is added or a cluster machine goes down, we only need to do one thing, in step with the machine's state: add or remove the queue in MQ. Everything is simplified.
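Both scenarios can be checked with a small self-contained sketch (queue names and the dict-based ring are my assumptions):

```python
RING_SIZE = 100

def lookup(slot: int, real_queues: dict) -> str:
    # Clockwise search from the computed slot to the next real queue.
    for step in range(RING_SIZE):
        pos = (slot + step) % RING_SIZE
        if pos in real_queues:
            return real_queues[pos]
    raise RuntimeError("no real queues on the ring")

before = {i: f"chat{i:02d}" for i in range(5)}  # chat00..chat04
after = {i: f"chat{i:02d}" for i in range(6)}   # chat05 added at slot 5

# Slots 0..4 are untouched by the change...
assert all(lookup(s, before) == lookup(s, after) for s in range(5))
# ...slot 5 now routes to the new chat05 instead of wrapping to chat00...
assert lookup(5, before) == "chat00" and lookup(5, after) == "chat05"
# ...and slots 6..99 still wrap clockwise to chat00 either way.
assert lookup(42, before) == lookup(42, after) == "chat00"
```

Removing chat05 again is the mirror image: only slot 5 reroutes, back to chat00, while every other slot keeps its assignment.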

So does this plan still have problems?

5. An unbalanced ring may be the straw that breaks the camel's back

Suppose we currently have 5 queues and the distribution rule is m = hash(id) mod 100. Here a problem arises.

Whenever m is greater than or equal to 5, there is no corresponding real queue, so the system searches clockwise along the hash ring and always ends up at chat00.

In other words, every message whose id yields m greater than or equal to 5 eventually lands in the chat00 queue.

In the extreme, a flood of messages pours into chat00, the consumer behind chat00 cannot keep up, and it may collapse.

Then, once that queue is removed, the same rule funnels the flood into chat01, the next queue after chat00, crashing its application too, and so on until the whole cluster falls. This is the avalanche effect.

We need a smarter way to solve this problem.

6. From real to virtual: dare to think a little bigger

From the discussion above, the reason the load is unbalanced is that the queues themselves are unevenly distributed around the ring.

All our real queues sit consecutively, in clockwise number order, on the ring. In the scenario above we have only 5 real queues while assuming a ring of 100 positions. Under the formula m = hash(id) mod 100, then:

there is a 95% chance that m is greater than or equal to 5.

Since the 5 queues occupy positions 0 through 4 consecutively, every message with m ≥ 5 maps to a nonexistent queue and, by the rule, slides clockwise all the way around to chat00 at position 0.

If we could spread the real queues evenly around the ring, would this severe imbalance still happen? As the figure shows, an even distribution of the real queues greatly alleviates it.

So how do we distribute the queues evenly around the ring? Remember when we agonized over constantly revising the distribution rule, and boldly assumed a queue count our IM would never reach?

We assumed MQ has 100 queues, then checked whether each actually exists; if not, we slide clockwise until we find a real one.

If we are a little bolder and quietly push the assumption one step further, mapping some of the positions that would otherwise be judged nonexistent onto queues that actually exist, haven't we effectively spread the real queues evenly around the ring?

This method of mapping a small number of real queues onto many hypothetical positions, as in the picture above, is the virtual node technique of consistent hashing.

There are many ways to implement the mapping from a few real queues to many hypothetical positions.

For example, we can append suffixes to the real queue names and hash each variant separately, such as hash(chat00) mod 100 and hash(chat00#1) mod 100, then map chat00 onto the ring position given by each remainder.

If hash(chat00) mod 100 = 31, then position 31 corresponds to chat00, and any message whose m = hash(id) mod 100 equals 31 is sent directly to the chat00 queue.

Likewise, if hash(chat00#1) mod 100 = 56, messages with m = 56 also go directly to chat00.

In this way we spread the real queues uniformly around the ring, in an indirect fashion, and greatly reduce the imbalance.
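One possible sketch of this virtual-node mapping (the replica count, MD5, and collision handling are my assumptions; any stable hash and replica count would do):

```python
import hashlib

RING_SIZE = 100
VNODES_PER_QUEUE = 10  # assumed number of virtual nodes per real queue

def stable_hash(key: str) -> int:
    # Deterministic hash for placing virtual nodes and routing messages.
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

def build_ring(queue_names):
    # Hash "chat00", "chat00#1", "chat00#2", ... and map each remainder
    # back to the real queue, scattering it around the ring.
    ring = {}
    for name in queue_names:
        for i in range(VNODES_PER_QUEUE):
            label = name if i == 0 else f"{name}#{i}"
            ring.setdefault(stable_hash(label) % RING_SIZE, name)
    return ring

def lookup(slot: int, ring: dict) -> str:
    # Same clockwise search as before.
    for step in range(RING_SIZE):
        pos = (slot + step) % RING_SIZE
        if pos in ring:
            return ring[pos]
    raise RuntimeError("empty ring")

ring = build_ring([f"chat{i:02d}" for i in range(5)])
# Route a message: m = stable_hash(user_id) % RING_SIZE, then lookup(m, ring).
```

Because each real queue now owns several scattered positions instead of one consecutive block, the clockwise search no longer dumps 95% of the traffic onto chat00.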

7. Understanding an algorithm's idea matters more than its implementation

That wraps up our scenario-driven analysis of the idea of consistent hashing.

As a classic algorithmic idea, consistent hashing is widely used in major distributed projects to solve all kinds of sharding and task-distribution problems.

One correction, though: many people online say that Redis uses consistent hashing. That is not accurate. Redis only borrows ideas from consistent hashing, such as the ring-style distribution and the mapping of virtual nodes onto real nodes.

Redis Cluster does not run the consistent hashing algorithm itself: it maps every key to one of 16384 fixed hash slots via CRC16 and assigns slot ranges to nodes. If you are interested, the details are worth a closer look. The Redis example shows that only by understanding the idea behind an algorithm can we decompose, adapt, and improve it to suit local conditions, so that it fits naturally into our own projects.

In this article we started from plain hashing and worked all the way up to consistent hashing with virtual nodes. How does this dose of medicine feel?

This was my first attempt at an illustrated article, and drawing all the pictures nearly did me in. If it helped, please give it a thumbs up!
