Understand the evolution of Redis architecture in one article (with diagrams)

This article takes approximately 13 minutes to read.

In this article, I want to talk to you about the architectural evolution of Redis.

Nowadays, Redis is becoming more and more popular and appears in almost every project. When you use Redis, have you ever thought about how it manages to provide services stably and with high performance?

You can also try answering these questions:

  • The scenario where I use Redis is very simple. Are there any problems if I only use the stand-alone version of Redis?

  • My Redis is down and my data is lost. What should I do? How can I ensure that my business applications are not affected?

  • Why do we need a master-slave cluster? What are its advantages?

  • What is a sharded cluster? Do I really need a sharded cluster?

  • ...

If you already know something about Redis, you must have heard of data persistence, master-slave replication, and sentinels. What are the differences and connections between them?

If you have such doubts, in this article, I will take you step by step from 0 to 1, and then from 1 to N, to build a stable and high-performance Redis cluster.

In this process, you will learn what optimizations Redis has adopted in order to achieve stability and high performance, and why they are done that way.

Master these principles, and you will be able to use Redis with ease.

This article has a lot of useful information, I hope you can read it patiently.

Start with the simplest: stand-alone version of Redis

First, let's start with the simplest scenario.

Suppose you now have a business application and need to introduce Redis to improve the performance of the application. At this time, you can choose to deploy a stand-alone version of Redis for use, like this:

This architecture is very simple. Your business application uses Redis as a cache: it queries data from MySQL, writes it into Redis, and afterwards reads that data from Redis. Because Redis keeps its data in memory, these reads are very fast.
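The read path described above is the classic cache-aside pattern. Here is a minimal sketch of it; plain dicts stand in for Redis and MySQL so the snippet is self-contained, and the keys and the `get_user` function are purely illustrative (a real project would use a Redis client and a database driver).

```python
# Cache-aside sketch: dicts stand in for Redis and MySQL.
mysql = {"user:1": "Alice", "user:2": "Bob"}   # stand-in for the database
redis_cache = {}                               # stand-in for Redis

def get_user(key):
    # 1. Try the cache first.
    value = redis_cache.get(key)
    if value is not None:
        return value
    # 2. On a cache miss, fall back to the database...
    value = mysql.get(key)
    # 3. ...and write the result back to the cache for next time.
    if value is not None:
        redis_cache[key] = value
    return value

print(get_user("user:1"))  # miss: read from MySQL, fill the cache
print(get_user("user:1"))  # hit: served from the cache
```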

If your business size is not large, such an architectural model can basically meet your needs. Isn't it simple?

As time goes by, your business volume gradually develops, and more and more data is stored in Redis. At this time, your business applications become more and more dependent on Redis.

However, one day your Redis goes down for some reason. At this point, all your business traffic hits the back-end MySQL, causing the pressure on MySQL to increase dramatically; in severe cases, it may even overwhelm MySQL.

What should you do at this time?

I guess your plan is to restart Redis quickly so that it can continue to provide services.

However, because Redis kept its data only in memory, all the previous data is lost even after a restart. Redis works normally again, but with no data in it, business traffic still hits the back-end MySQL, which remains under great pressure.

What should you do now? You are lost in thought.

Is there any good way to solve this problem?

Since Redis only stores data in memory, can it also write a copy of this data to disk?

If we adopt this method, when Redis restarts, we can quickly restore the data in the disk to the memory so that it can continue to provide services normally.

Yes, this is a good solution. The process of writing memory data to disk is "data persistence".

Data Persistence: Be Prepared

Now, the Redis data persistence you envision is like this:

However, how should data persistence be done specifically?

I guess the easiest solution you can think of is that every time Redis performs a write operation, in addition to writing to the memory, it also writes a copy to the disk, like this:

Yes, this is the simplest and most straightforward solution.

But if you think about it carefully, there is a problem with this solution: every client write must go to both memory and disk, and writing to disk is much slower than writing to memory! This will inevitably hurt the performance of Redis.

How to avoid this problem?

We can optimize it like this: the main thread only writes to memory and then returns the result to the client, while another thread writes the data to disk. This avoids the performance impact of the main thread touching the disk.

This is indeed a good solution. In addition, can we change perspective and think of other ways to persist data?

At this time, you have to consider it based on the usage scenarios of Redis.

Recall, when we use Redis, what scenarios do we usually use it for?

Yes, cache.

Using Redis as a cache means that Redis does not hold the full data set: for data that is not in the cache, our business application can still get results by querying the back-end database. Querying the back end is slower, but it has no impact on the correctness of the business.

Based on this feature, our Redis data persistence can also be done in the form of "data snapshot".

So what is a data snapshot?

To put it simply, you can understand it this way:

  1. If you think of Redis as a water cup, writing data to Redis is equivalent to pouring water into this cup.

  2. At this time, you take a camera and take a picture of the water cup. At the moment when you take the picture, the water capacity in the water cup is recorded in the photo, which is the data snapshot of the water cup.

In other words, the data snapshot of Redis records the data in Redis at a certain moment, and then only needs to write this data snapshot to the disk.

Its advantage is that data is written to the disk "once" only when persistence is required, and there is no need to operate the disk at other times.

Based on this solution, we can regularly take data snapshots for Redis and persist the data to the disk.

In fact, the persistence solutions mentioned above are Redis’s “RDB” and “AOF”:

  • RDB: Only persist data snapshots at a certain moment to the disk (create a sub-process to do this)

  • AOF: Every write operation is persisted to the disk (the main thread writes to memory; depending on the configured policy, either the main thread or a background thread flushes the data to disk)

In addition to the above mentioned differences, they also have the following characteristics:

  1. RDB uses binary + data compression to write to disk, so that the file size is small and the data recovery speed is also fast.

  2. AOF records every write command and has the most complete data, but the file size is large and the data recovery speed is slow.

If you were asked to choose a persistence solution, you could choose like this:

  1. If your business is not sensitive to data loss, use RDB solution to persist data

  2. If your business has high data integrity requirements, use the AOF solution to persist data.
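For reference, these two choices map onto directives in Redis's own configuration file, redis.conf. The values below are illustrative, not recommendations:

```conf
# RDB: dump a snapshot if at least 1 key changed within 900 seconds
save 900 1

# AOF: log every write command
appendonly yes
# fsync policy: always / everysec / no
# "everysec" risks at most ~1 second of writes for much better performance
appendfsync everysec
```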

Assuming that your business has relatively high requirements for Redis data integrity and chooses the AOF solution, you will encounter these problems at this time:

  1. AOF records every write operation. As time goes by, the AOF file size will become larger and larger.

  2. With such a large AOF file, data recovery becomes very slow.

What to do? Data integrity requirements have become higher, yet data recovery has become harder. Is there any way to reduce the file size and improve the recovery speed?

Let's continue to analyze the characteristics of AOF.

The AOF file records every write operation, but the same key may be modified multiple times. Can we retain only the value last written for each key?

Yes, this is the "AOF rewrite" we often hear. You can also understand it as AOF "slimming".

We can rewrite the AOF file regularly to avoid continuous expansion of the file size, so that the recovery time can be shortened during recovery.
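The idea behind the rewrite can be shown with a toy simulation. This is only an illustration of why the file shrinks; inside Redis, the real rewrite is performed by a forked child process, and the commands and log format here are made up for the example.

```python
# Toy AOF rewrite: the log records every write, but only the last
# value per key matters for recovery.
aof_log = [
    ("SET", "counter", "1"),
    ("SET", "counter", "2"),
    ("SET", "name", "redis"),
    ("SET", "counter", "3"),
]

def rewrite(log):
    # Replay the log to obtain the final state...
    state = {}
    for cmd, key, value in log:
        if cmd == "SET":
            state[key] = value
    # ...then emit one command per key instead of one per write.
    return [("SET", k, v) for k, v in state.items()]

compact = rewrite(aof_log)
print(len(aof_log), "->", len(compact))  # 4 -> 2
```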

Thinking about it further, is there any way to continue reducing the AOF file?

Let’s review what we mentioned earlier, the respective characteristics of RDB and AOF:

  1. RDB is stored in binary + data compression mode, and the file size is small

  2. AOF records every write command, with the most complete data

Can we leverage the strengths of each?

Of course, this is the "hybrid persistence" of Redis.

Specifically, during an AOF rewrite, Redis first writes a data snapshot into the AOF file in RDB format, and then appends the write commands generated during this period to the file. Because the RDB part is binary-compressed, the AOF file becomes much smaller.

At this time, when you use AOF files to recover data, the recovery time will be shorter!

Only Redis 4.0 and above support hybrid persistence.
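In supported versions, hybrid persistence is controlled by a single redis.conf directive (enabled by default in newer releases):

```conf
# Write an RDB-format preamble into the AOF file during rewrite
aof-use-rdb-preamble yes
```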

With such optimization, your Redis no longer has to worry about instance downtime. When a downtime occurs, you can use persistent files to quickly restore the data in Redis.

But is this enough?

Think about it carefully, although we have optimized the persistent files to the minimum, it still takes time to restore the data. During this period, your business applications will still be affected. What should you do?

Let's analyze whether there is a better solution.

If an instance goes down, it can only be fixed by restoring data. Instead, can we deploy multiple Redis instances and keep their data synchronized in real time? Then, when one instance goes down, we can simply pick one of the remaining instances to continue providing services.

Yes, this solution is the "master-slave replication: multiple copies" that I will talk about next.

Master-slave replication: multiple copies

At this point, you can deploy multiple Redis instances, and the architectural model becomes like this:

We call the node that handles reads and writes the master, and the node that synchronizes data from it in real time the slave.

The advantages of using a multi-copy solution are:

  1. Shorten the unavailability time: If the master goes down, we can manually promote the slave to the master to continue providing services.

  2. Improve read performance: Let the slave share part of the read requests to improve the overall performance of the application

This solution is good. It not only saves data recovery time, but also improves performance. So are there any problems with it?

You can think about it.

In fact, the problem is: when the master goes down, we need to "manually" promote the slave to the master, and this process also takes time.

Although it is much faster than recovering data, it still requires manual intervention. Once manual intervention is required, human reaction time and operation time must be taken into account. Therefore, your business applications will still be affected during this period.

How do we solve this problem? Can we automate this switching process?

For this situation, we need an "automatic failover" mechanism, which is the capability of the "Sentinel" we often hear about.

Sentinel: failover

Now, we can introduce an "observer" and let this observer monitor the health status of the master in real time. This observer is the "sentinel".

How to do it?

  1. Every once in a while, the sentinel asks the master whether it is normal.

  2. If the master replies in time, its status is normal; if the reply times out, it is considered abnormal.

  3. Sentinel detects an abnormality and initiates master-slave switching

With this solution, there is no need for humans to intervene and everything becomes automated. Isn’t it great?

But there is another problem here: if the master's status is actually normal but the network between the sentinel and the master has a problem at the moment the sentinel asks, the sentinel may misjudge.

How to solve this problem?

The answer is that we can deploy multiple sentinels and distribute them on different machines. They monitor the status of the master together. The process becomes like this:

  1. Multiple sentinels ask the master if it is normal every once in a while.

  2. If the master replies in time, its status is normal; if the reply times out, it is considered abnormal.

  3. Once a sentinel determines that the master is abnormal (whether it is a network problem or not), it will ask other sentinels. If multiple sentinels (set a threshold) all think that the master is abnormal, then it will be determined that the master has indeed failed.

  4. After multiple sentinels negotiate, it is determined that the master is faulty, and a master-slave switch is initiated.

Therefore, we use multiple sentinels to negotiate with each other to determine the status of the master. This can greatly reduce the probability of misjudgment.
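The quorum judgment above can be sketched as follows. This is a deliberately simplified model (real Sentinel distinguishes "subjectively down" from "objectively down" and exchanges messages to reach agreement); the function name and vote representation are made up for the example.

```python
# Toy sentinel quorum: one sentinel's "the master timed out" is only an
# opinion; the failure is confirmed only when at least `quorum` sentinels
# agree, which filters out single-sentinel network blips.

def master_is_down(sentinel_votes, quorum):
    # sentinel_votes: each sentinel's opinion, True = "master looks down"
    return sum(sentinel_votes) >= quorum

# 3 sentinels, quorum of 2:
print(master_is_down([True, False, False], quorum=2))  # likely a network blip
print(master_is_down([True, True, False], quorum=2))   # confirmed failure
```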

After the sentinels agree that the master is abnormal, there is another question: which sentinel should initiate the master-slave switch?

The answer is to select a sentinel "leader", and this leader will perform master-slave switching.

The question comes again, how to choose this leader?

Imagine how elections are done in real life?

Yes, vote.

When electing the sentinel leader, we can formulate such an election rule:

  1. Each sentinel asks the other sentinels to vote for it.

  2. Each sentinel only votes for the first sentinel that requests a vote, and can only vote once

  3. The sentinel that gets more than half of the votes first is elected as the leader and initiates master-slave switching.
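The three rules above can be sketched as a toy, greatly simplified Raft-style election. Here each voter simply backs the first candidate whose request reached it; terms, timeouts, and message exchange are all omitted, and the names are invented for the example.

```python
# Toy leader election: each sentinel votes once, for the first candidate
# that asked it; whoever collects more than half the votes is leader.

def elect_leader(first_requests, num_sentinels):
    # first_requests: for each voter, the candidate whose request arrived first
    votes = {}
    for candidate in first_requests:
        votes[candidate] = votes.get(candidate, 0) + 1
    for candidate, count in votes.items():
        if count > num_sentinels // 2:
            return candidate
    return None  # no majority: a new election round would be needed

# 5 sentinels; sentinel A's request reached 3 of them first:
print(elect_leader(["A", "A", "B", "A", "B"], num_sentinels=5))  # A wins
```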

In fact, this election process is what we often hear: the "consensus algorithm" in the field of distributed systems.

What is a consensus algorithm?

We deploy sentinels on multiple machines, and they need to work together to complete a task, so they form a "distributed system."

In the field of distributed systems, the algorithm for how multiple nodes reach consensus on an issue is called a consensus algorithm.

In this scenario, multiple sentinels negotiate together to elect a leader that is recognized by all, which is accomplished using a consensus algorithm.

This algorithm also stipulates that the number of nodes should be odd. This ensures that even if one node fails, more than half of the remaining nodes are still healthy and can produce a correct result. In other words, the algorithm tolerates a failed node.

There are many consensus algorithms in the field of distributed systems, such as Paxos and Raft. In the scenario of Sentinel election leader, the Raft consensus algorithm is used because it is simple enough and easy to implement.

Now, we use multiple sentinels to jointly monitor the status of Redis. In this way, we can avoid the problem of misjudgment. The architectural model becomes like this:

Okay, let’s summarize it here.

Your Redis has been optimized step by step from the simplest stand-alone version: data persistence, master-slave replicas, and a sentinel cluster. Both performance and stability keep getting better, and even if a node fails, you no longer need to worry.

If your Redis is deployed in this architecture mode, it can basically run stably for a long time.

...

As time goes by, your business volume begins to experience explosive growth. At this time, can your architectural model still be able to bear such a large amount of traffic?

Let’s analyze it together:

  1. Stability: If Redis fails, we have Sentinel+ replicas, which can automatically complete master-slave switching.

  2. Performance: As the number of read requests increases, we can deploy multiple slaves to separate reading and writing to share the reading pressure.

  3. Performance: The volume of write requests increases, but we only have one master instance. What should we do if this instance reaches a bottleneck?

See, when your write request volume becomes larger and larger, a single master instance may not be able to bear such large write traffic.

To perfectly solve this problem, you need to consider using "sharded clusters" at this time.

Sharded clusters: scale out

What is a "sharded cluster"?

To put it simply: if one instance cannot withstand the write pressure, can we deploy multiple instances, organize them according to certain rules into one logical whole, and serve clients as a unit? That way, the write bottleneck of a single instance is solved.

Therefore, the current architectural model becomes like this:

Now the question comes again, how to organize so many instances?

Our rules are as follows:

  1. Each node stores a portion of data, and the sum of all node data is the full amount of data.

  2. Develop a routing rule to route different keys to a fixed instance for reading and writing.
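Rule 2 can be sketched with a toy routing function. Note the simplification: Redis Cluster actually computes CRC16(key) mod 16384 to get a "hash slot" and then maps slot ranges to nodes; here `zlib.crc32` mod the node count stands in for that, and the node names are hypothetical.

```python
# Toy key routing: hash the key and always send it to the same instance.
import zlib

nodes = ["redis-node-0", "redis-node-1", "redis-node-2"]  # hypothetical names

def route(key):
    # Deterministic hash -> fixed node for a given key.
    slot = zlib.crc32(key.encode()) % len(nodes)
    return nodes[slot]

# The same key always lands on the same node:
assert route("user:1") == route("user:1")
print(route("user:1"), route("user:2"), route("order:99"))
```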

Sharded clusters can also be divided into two categories based on the location of routing rules:

  1. Client Sharding

  2. Server-side sharding

Client-side sharding means that the routing rules for keys are done on the client side, as follows:

The disadvantage of this solution is that the client needs to maintain the routing rules, that is to say, you need to write the routing rules into your business code.

How to avoid coupling routing rules in business code?

You can optimize this way: encapsulate the routing rules into a module, and integrate that module wherever it is needed.

This is the solution adopted by Redis Cluster.

Redis Cluster has built-in sentinel logic, eliminating the need to deploy sentinels.

When you use Redis Cluster, your business application needs to use the supporting Redis SDK. The routing rules are integrated in this SDK, and you do not need to write it yourself.

Let’s look at server-side sharding.

This solution places the routing rules not on the client but in an "intermediate proxy layer" between the client and the server, the Proxy we often hear about.

The data routing rules are maintained in this Proxy layer.

In this way, you don't need to care about how many Redis nodes there are on the server, you only need to interact with this Proxy.

The Proxy forwards your request to the corresponding Redis node according to the routing rules. Moreover, when the cluster can no longer support the traffic, it can scale horizontally by adding new Redis instances to improve performance. All of this is transparent and imperceptible to the client.

The industry's open source Redis sharding cluster solutions, such as Twemproxy and Codis, adopt this solution.

There are many details involved in data expansion in sharded clusters. This content is not the focus of this article, so we will not go into details yet.

At this point, when you use sharded clusters, you can calmly face greater traffic pressure in the future!

Summary

Okay, let's summarize how we build a stable and high-performance Redis cluster step by step.

First, with the simplest stand-alone version of Redis, we found that data could not be recovered after a crash, so we thought of "data persistence": persist the data in memory to disk, so that it can be quickly restored after Redis restarts.

When persisting data, we are faced with the problem of how to persist data to disk more efficiently. Later we found that Redis provides two solutions: RDB and AOF, which correspond to data snapshots and real-time command records respectively. When we do not have high requirements for data integrity, we can choose the RDB persistence solution. If you have high data integrity requirements, you can choose the AOF persistence solution.

But we also found that the AOF file grows larger and larger over time. The optimization we thought of was AOF rewrite, which slims the file down. Later, we found that we could combine the respective advantages of RDB and AOF: "hybrid persistence" uses both during AOF rewrite, further reducing the size of the AOF file.

Later, we found that although data could be restored from persistence files, recovery still took time, which meant business applications would still be affected. We further optimized with a "multi-copy" solution, keeping multiple instances synchronized in real time; when one instance fails, another can be manually promoted to continue providing services.

But manually promoting an instance requires human intervention, which also takes time. We looked for ways to automate this process, so we introduced the "Sentinel" cluster: the sentinels negotiate with each other, discover faulty nodes, and complete the switchover automatically, greatly reducing the impact on business applications.

Finally, we focused on how to support greater write traffic, so we introduced "sharded clusters" to solve this problem, allowing multiple Redis instances to share the write pressure and face greater traffic in the future. We can also add new instances and expand horizontally to further improve the performance of the cluster.

At this point, our Redis cluster can provide long-term stable and high-performance services for our business.

Here I have drawn a mind map to help you better understand the relationship between them and the evolution process.

Postscript

After reading this, I think you should have your own opinions on how to build a stable and high-performance Redis cluster.

In fact, the optimization ideas discussed in this article revolve around the core ideas of "architectural design":

  • High performance: read and write separation, sharded cluster

  • High availability: data persistence, multiple copies, automatic failover

  • Easy to expand: sharded cluster, horizontal expansion

When we talk about sentinel clusters and sharded clusters, this also involves knowledge related to "distributed systems":

  • Distributed consensus: sentinel leader election

  • Load balancing: sharded cluster data sharding, data routing

Of course, beyond Redis, you can apply this way of thinking when building or studying any data cluster: examine how they do it and how they optimize.

For example, when you are using MySQL, you can think about the differences between MySQL and Redis. How does MySQL achieve high performance and high availability? In fact, the ideas are similar.

Distributed systems and data clusters are everywhere now. I hope that through this article you can understand how these systems evolved step by step, what problems they encountered along the way, what solutions their designers came up with, and what trade-offs they made.

Once you understand the principles and master the ability to analyze and solve problems, then in future development, or when learning other excellent software, you can quickly find the key points, master them in the shortest time, and take advantage of them in practice.

In fact, this thinking process is also the idea behind "architectural design". When designing a software architecture, you discover problems, analyze them, solve them, and evolve your architecture step by step, finally reaching a balance between performance and reliability. Although new software keeps emerging, the ideas of architectural design do not change. I hope you truly absorb these ideas, so that you can cope with the ever-changing landscape.

This article takes approximately 13 minutes to read.

In this article, I want to talk to you about the architectural evolution of Redis.

Nowadays, Redis is becoming more and more popular and is used in almost many projects. When you use Redis, have you ever thought about how Redis provides services stably and with high performance?

You can also try answering these questions:

  • The scenario where I use Redis is very simple. Are there any problems if I only use the stand-alone version of Redis?

  • My Redis is down and my data is lost. What should I do? How can I ensure that my business applications are not affected?

  • Why do we need a master-slave cluster? What are its advantages?

  • What is a sharded cluster? Do I really need a sharded cluster?

  • ...

If you already know something about Redis, you must have heard of the concepts of data persistence, master-slave replication, and sentinels . What are the differences and connections between them?

If you have such doubts, in this article, I will take you step by step from 0 to 1, and then from 1 to N, to build a stable and high-performance Redis cluster.

In this process, you can learn what optimization solutions Redis has adopted in order to achieve stability and high performance, and why it is done so?

Master these principles, so that you can be "with ease" when using Redis.

This article has a lot of useful information, I hope you can read it patiently.

Start with the simplest: stand-alone version of Redis

First, let's start with the simplest scenario.

Suppose you now have a business application and need to introduce Redis to improve the performance of the application. At this time, you can choose to deploy a stand-alone version of Redis for use, like this:

This architecture is very simple. Your business application can use Redis as a cache, query data from MySQL, and then write it to Redis. Then the business application reads the data from Redis, because Redis data is stored in memory. Medium, so the speed is very fast.

If your business size is not large, then such an architectural model can basically meet your needs. Is not it simple?

As time goes by, your business volume gradually develops, and more and more data is stored in Redis. At this time, your business applications become more and more dependent on Redis.

However, suddenly one day, your Redis goes down for some reason. At this time, all your business traffic will hit the back-end MySQL. This will cause the pressure on your MySQL to increase dramatically, and in severe cases, it may even overwhelm MySQL. .

What should you do at this time?

I guess your plan is to restart Redis quickly so that it can continue to provide services.

However, because the data in Redis was previously in memory, even if you restart Redis now, all the previous data will be lost. Although Redis can work normally after restarting, because there is no data in Redis, business traffic will still hit the back-end MySQL, and MySQL is still under great pressure.

This is how to do? You are lost in thought.

Is there any good way to solve this problem?

Since Redis only stores data in memory, can it also write a copy of this data to disk?

If we adopt this method, when Redis restarts, we can quickly restore the data in the disk to the memory so that it can continue to provide services normally.

Yes, this is a good solution. The process of writing memory data to disk is "data persistence".

Data Persistence: Be Prepared

Now, the Redis data persistence you envision is like this:

However, how should data persistence be done specifically?

I guess the easiest solution you can think of is that every time Redis performs a write operation, in addition to writing to the memory, it also writes a copy to the disk, like this:

Yes, this is the simplest and most straightforward solution.

But if you think about it carefully, there is a problem with this solution: every write operation by the client requires writing to both memory and disk, and the time-consuming of writing to disk is definitely much slower than writing to memory! This will inevitably affect the performance of Redis.

How to avoid this problem?

We can optimize it like this: Redis writes memory by the main thread. After writing the memory, the result is returned to the client, and then Redis uses another thread to write to the disk. This can avoid the impact of the main thread writing to the disk on performance.

This is indeed a good solution. In addition, we can change the perspective and think about other ways to persist data?

At this time, you have to consider it based on the usage scenarios of Redis.

Recall, when we use Redis, what scenarios do we usually use it for?

Yes, cache.

Using Redis as a cache means that although the full amount of data is not saved in Redis, for data that is not in the cache, our business applications can still get results by querying the back-end database, but the speed of querying the back-end data will be slower. , but it actually has no impact on business results.

Based on this feature, our Redis data persistence can also be done in the form of "data snapshot".

So what is a data snapshot?

To put it simply, you can understand it this way:

  1. If you think of Redis as a water cup, writing data to Redis is equivalent to pouring water into this cup.

  2. At this time, you take a camera and take a picture of the water cup. At the moment when you take the picture, the water capacity in the water cup is recorded in the photo, which is the data snapshot of the water cup.

In other words, the data snapshot of Redis records the data in Redis at a certain moment, and then only needs to write this data snapshot to the disk.

Its advantage is that data is written to the disk "once" only when persistence is required, and there is no need to operate the disk at other times.

Based on this solution, we can regularly take data snapshots for Redis and persist the data to the disk.

In fact, the persistence solutions mentioned above are Redis’s “RDB” and “AOF”:

  • RDB: Only persist data snapshots at a certain moment to the disk (create a sub-process to do this)

  • AOF: Every write operation is persisted to the disk (the main thread writes to the memory, and according to the policy, you can configure whether the main thread or the sub-thread performs data persistence)

In addition to the above mentioned differences, they also have the following characteristics:

  1. RDB uses binary + data compression to write to disk, so that the file size is small and the data recovery speed is also fast.

  2. AOF records every write command and has the most complete data, but the file size is large and the data recovery speed is slow.

If you were asked to choose a persistence solution, you could choose like this:

  1. If your business is not sensitive to data loss, use RDB solution to persist data

  2. If your business has high data integrity requirements, use the AOF solution to persist data.

Assuming that your business has relatively high requirements for Redis data integrity and chooses the AOF solution, you will encounter these problems at this time:

  1. AOF records every write operation. As time goes by, the AOF file size will become larger and larger.

  2. With such a large AOF file, data recovery becomes very slow.

What to do? Data integrity requirements have become higher and data recovery has become more difficult? Is there any way to reduce the file size? What about improving recovery speed?

Let's continue to analyze the characteristics of AOF.

Since the AOF file records every write operation, but the same key may be modified multiple times, we only retain the last modified value. Is that okay?

Yes, this is the "AOF rewrite" we often hear. You can also understand it as AOF "slimming".

We can rewrite the AOF file regularly to avoid continuous expansion of the file size, so that the recovery time can be shortened during recovery.

Thinking about it further, is there any way to continue reducing the AOF file?

Let’s review what we mentioned earlier, the respective characteristics of RDB and AOF:

  1. RDB is stored in binary + data compression mode, and the file size is small

  2. AOF records every write command, with the most complete data

Can we leverage the strengths of each?

Of course, this is the "hybrid persistence" of Redis.

Specifically, during an AOF rewrite, Redis first writes a snapshot of the data in RDB format at the beginning of the AOF file, and then appends each write command generated during the rewrite to the file. Because the RDB part is binary and compressed, the AOF file becomes smaller.

At this time, when you use AOF files to recover data, the recovery time will be shorter!

Only Redis 4.0 and above support hybrid persistence.
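
Recovery under hybrid persistence can be modeled in a few lines (a simplification of the real format: the dict stands in for the RDB preamble, and the tail is the list of commands appended after it):

```python
def recover(snapshot, aof_tail):
    """Hybrid-persistence recovery, simplified: start from the snapshot
    (the RDB-format preamble), then replay the commands appended after it."""
    data = dict(snapshot)
    for cmd, key, value in aof_tail:
        if cmd == "SET":
            data[key] = value
        elif cmd == "DEL":
            data.pop(key, None)
    return data

snapshot = {"a": "1", "b": "2"}          # compact binary part in real Redis
aof_tail = [("SET", "b", "3"), ("SET", "c", "4")]
print(recover(snapshot, aof_tail))  # → {'a': '1', 'b': '3', 'c': '4'}
```

The bulk of the data loads from the compact snapshot; only the short tail needs command-by-command replay, which is why recovery gets faster.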

With such optimization, your Redis no longer has to worry about instance downtime. When a downtime occurs, you can use persistent files to quickly restore the data in Redis.

But is this enough?

Think about it carefully: although we have trimmed the persistence files down as much as possible, restoring data from them still takes time, and during that window your business applications are still affected. What should you do?

Let's analyze whether there is a better solution.

If an instance goes down, restoring data is the only remedy. But what if we deploy multiple Redis instances and keep their data synchronized in real time? Then, when one instance goes down, we can simply pick one of the remaining instances to continue providing services.

Yes, this solution is the "master-slave replication: multiple copies" that I will talk about next.

Master-slave replication: multiple copies

At this point, you can deploy multiple Redis instances, and the architectural model becomes like this:

We call the node that handles real-time reads and writes the master, and the node that synchronizes the master's data in real time the slave.

The advantages of using a multi-copy solution are:

  1. Shorten the unavailability time: If the master goes down, we can manually promote the slave to the master to continue providing services.

  2. Improve read performance: Let the slave share part of the read requests to improve the overall performance of the application
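
The read-sharing idea in point 2 can be sketched as a toy request router (hostnames and the write-command list here are illustrative; in practice your client library handles this):

```python
import itertools

class ReadWriteRouter:
    """Toy read/write splitter: writes go to the master, reads are
    spread round-robin over the replicas."""

    def __init__(self, master, replicas):
        self.master = master
        self._reads = itertools.cycle(replicas)

    def route(self, command):
        # A tiny, illustrative subset of Redis write commands.
        write_cmds = {"SET", "DEL", "EXPIRE"}
        return self.master if command in write_cmds else next(self._reads)

router = ReadWriteRouter("master:6379", ["replica1:6379", "replica2:6379"])
print(router.route("SET"))   # → master:6379
print(router.route("GET"))   # → replica1:6379
print(router.route("GET"))   # → replica2:6379
```

Adding another replica to the list immediately spreads the read load further, while all writes still converge on the single master.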

This solution is good. It not only saves data recovery time, but also improves performance. So are there any problems with it?

You can think about it.

In fact, the problem is: when the master goes down, we need to "manually" promote the slave to the master, and this process also takes time.

Although it is much faster than recovering data, it still requires manual intervention. Once manual intervention is required, human reaction time and operation time must be taken into account. Therefore, your business applications will still be affected during this period.

How do we solve this problem? Can we automate the switching process?

For this situation, we need an "automatic failover" mechanism, which is the capability of the "Sentinel" we often hear about.

Sentinel: failover

Now, we can introduce an "observer" and let this observer monitor the health status of the master in real time. This observer is the "sentinel".

How to do it?

  1. Every once in a while, the sentinel asks the master whether it is normal.

  2. If the master responds in time, it is considered healthy; if the reply times out, it is considered abnormal.

  3. Sentinel detects an abnormality and initiates master-slave switching

With this solution, there is no need for humans to intervene and everything becomes automated. Isn’t it great?

But there is another problem here: if the master's status is actually normal but the network between a sentinel and the master fails while the sentinel is asking, the sentinel may misjudge.

How to solve this problem?

The answer is that we can deploy multiple sentinels and distribute them on different machines. They monitor the status of the master together. The process becomes like this:

  1. Multiple sentinels ask the master if it is normal every once in a while.

  2. If the master responds in time, it is considered healthy; if the reply times out, it is considered abnormal.

  3. Once a sentinel judges the master to be abnormal (whether due to a real failure or a network problem), it asks the other sentinels. If enough sentinels (a configurable threshold) also consider the master abnormal, the master is determined to have actually failed.

  4. After multiple sentinels negotiate, it is determined that the master is faulty, and a master-slave switch is initiated.

Therefore, we use multiple sentinels to negotiate with each other to determine the status of the master. This can greatly reduce the probability of misjudgment.
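
This "multiple sentinels plus a threshold" judgment boils down to a quorum check. A toy sketch (the vote list and quorum value are illustrative; real Sentinel calls the two states "subjectively down" and "objectively down"):

```python
def master_is_down(votes, quorum):
    """The master is declared down only when at least `quorum` sentinels
    have individually judged it down (e.g. their pings timed out)."""
    return sum(votes) >= quorum

# 3 sentinels, quorum 2: one sentinel with a flaky network link
# cannot trigger a failover on its own.
print(master_is_down([True, False, False], quorum=2))  # → False
print(master_is_down([True, True, False], quorum=2))   # → True
```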

After the sentinels have agreed that the master is abnormal, another question arises: which sentinel should initiate the master-slave switch?

The answer is to select a sentinel "leader", and this leader will perform master-slave switching.

The question comes again, how to choose this leader?

Think about how elections are done in real life.

Yes, vote.

When electing the sentinel leader, we can formulate such an election rule:

  1. Each sentinel asks the other sentinels to vote for it.

  2. Each sentinel votes only for the first sentinel that requests its vote, and votes only once.

  3. The first sentinel to receive more than half of the votes is elected leader and initiates the master-slave switch.
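
The three rules above can be sketched as a toy simulation (sentinel names and the request order are made up; real Sentinel runs a Raft-style election with epochs and retries):

```python
def elect_leader(request_order, num_sentinels):
    """Toy leader election: each sentinel votes for the first candidate
    that asks it, one vote each; a strict majority wins."""
    votes = {}
    voted = set()
    for candidate, voter in request_order:
        if voter not in voted:       # rule 2: only the first request counts
            voted.add(voter)
            votes[candidate] = votes.get(candidate, 0) + 1
    majority = num_sentinels // 2 + 1
    winners = [c for c, n in votes.items() if n >= majority]
    return winners[0] if winners else None  # no majority → retry in practice

# Sentinel "s1" reaches s2 and s3 first (and votes for itself),
# so s2's later request is ignored.
requests = [("s1", "s1"), ("s1", "s2"), ("s1", "s3"), ("s2", "s2")]
print(elect_leader(requests, num_sentinels=3))  # → s1
```

Note how a split vote returns no leader: this is why an odd number of sentinels matters, as discussed below.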

In fact, this election process is an instance of what we often hear about in the field of distributed systems: the "consensus algorithm".

What is a consensus algorithm?

We deploy sentinels on multiple machines, and they need to work together to complete a task, so they form a "distributed system."

In the field of distributed systems, the algorithm for how multiple nodes reach consensus on an issue is called a consensus algorithm.

In this scenario, multiple sentinels negotiate together to elect a leader that is recognized by all, which is accomplished using a consensus algorithm.

This algorithm also stipulates that the number of nodes must be odd. This ensures that even if one node in the system fails, more than "half" of the remaining nodes are still healthy and can still produce a correct result. In other words, the algorithm tolerates a faulty node.

There are many consensus algorithms in the field of distributed systems, such as Paxos and Raft. In the scenario of Sentinel election leader, the Raft consensus algorithm is used because it is simple enough and easy to implement.

Now, we use multiple sentinels to jointly monitor the status of Redis. In this way, we can avoid the problem of misjudgment. The architectural model becomes like this:

Okay, let’s summarize it here.

Your Redis has been optimized step by step from the simplest stand-alone version: data persistence, master-slave replicas, and a sentinel cluster. Its performance and stability keep improving, and even if a node fails, service is restored automatically, so you no longer need to worry.

If your Redis is deployed in this architecture mode, it can basically run stably for a long time.

...

As time goes by, your business volume begins to experience explosive growth. At this time, can your architectural model still be able to bear such a large amount of traffic?

Let’s analyze it together:

  1. Stability: If Redis fails, we have Sentinel + replicas, which can automatically complete the master-slave switch.

  2. Performance: As the number of read requests increases, we can deploy multiple slaves to separate reading and writing to share the reading pressure.

  3. Performance: The volume of write requests increases, but we only have one master instance. What should we do if this instance reaches a bottleneck?

See, when your write request volume becomes larger and larger, a master instance may not be able to bear such a large write traffic.

To perfectly solve this problem, you need to consider using "sharded clusters" at this time.

Sharded clusters: scale out

What is a "sharded cluster"?

To put it simply: if one instance cannot withstand the write pressure, can we deploy multiple instances, organize them according to certain rules, treat them as a whole, and provide services to the outside world together? In this way, the single-instance write bottleneck disappears.

Therefore, the current architectural model becomes like this:

Now the question comes again, how to organize so many instances?

Our rules are as follows:

  1. Each node stores a portion of data, and the sum of all node data is the full amount of data.

  2. Develop a routing rule to route different keys to a fixed instance for reading and writing.
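
Rule 2 can be sketched concretely. Redis Cluster maps each key to one of 16384 hash slots via CRC16 (Python's `binascii.crc_hqx` with initial value 0 computes the same CRC-16/XMODEM); the even slot split below is an assumption for illustration, since real clusters assign slot ranges explicitly:

```python
import binascii

NUM_SLOTS = 16384

def key_slot(key: str) -> int:
    """Map a key to one of 16384 hash slots, as Redis Cluster does
    (CRC16/XMODEM of the key, modulo 16384; hash tags are ignored
    in this sketch)."""
    return binascii.crc_hqx(key.encode(), 0) % NUM_SLOTS

def node_for(key: str, nodes: list) -> str:
    """Pick the node owning the key's slot, assuming the slot range
    is split evenly across `nodes`."""
    slot = key_slot(key)
    slots_per_node = NUM_SLOTS // len(nodes)
    return nodes[min(slot // slots_per_node, len(nodes) - 1)]

nodes = ["node-a:6379", "node-b:6379", "node-c:6379"]
print(key_slot("user:1000"))         # deterministic slot in [0, 16383]
print(node_for("user:1000", nodes))  # always the same node for this key
```

Because the mapping is deterministic, every client (or proxy) that applies the same rule routes a given key to the same instance.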

Sharded clusters can also be divided into two categories based on the location of routing rules:

  1. Client-side sharding

  2. Server-side sharding

Client-side sharding means that the routing rules for keys are done on the client side, as follows:

The disadvantage of this solution is that the client needs to maintain the routing rules, that is to say, you need to write the routing rules into your business code.

How to avoid coupling routing rules in business code?

You can optimize this way and encapsulate this routing rule into a module. When you need to use it, just integrate this module.

This is the solution adopted by Redis Cluster.

Redis Cluster has built-in sentinel logic, eliminating the need to deploy sentinels.

When you use Redis Cluster, your business application needs to use the supporting Redis SDK. The routing rules are integrated in this SDK, and you do not need to write it yourself.

Let’s look at server-side sharding.

This solution means that the routing rules are not placed on the client, but an "intermediate proxy layer" is added between the client and the server. This proxy is the Proxy we often hear.

The data routing rules are maintained in this Proxy layer.

In this way, you don't need to care about how many Redis nodes there are on the server, you only need to interact with this Proxy.

Proxy forwards your request to the corresponding Redis node according to the routing rules. Moreover, when the cluster can no longer support larger traffic, it can scale horizontally by adding new Redis instances to improve performance. All of this is transparent and imperceptible to your client.

The industry's open source Redis sharding cluster solutions, such as Twemproxy and Codis, adopt this solution.

There are many details involved in data expansion in sharded clusters. This content is not the focus of this article, so we will not go into details yet.

At this point, when you use sharded clusters, you can calmly face greater traffic pressure in the future!

Summary

Okay, let's summarize how we build a stable and high-performance Redis cluster step by step.

First of all, when using the simplest stand-alone version of Redis, we found that when Redis crashed, the data could not be recovered, so we thought of "data persistence": persisting the in-memory data to disk, so that data can be quickly restored from disk after Redis restarts.

When persisting data, we are faced with the problem of how to persist data to disk more efficiently. Later we found that Redis provides two solutions: RDB and AOF, which correspond to data snapshots and real-time command records respectively. When we do not have high requirements for data integrity, we can choose the RDB persistence solution. If you have high data integrity requirements, you can choose the AOF persistence solution.

But we also found that the AOF file keeps growing over time. The optimization we thought of was AOF rewrite, which slims the file down and reduces its size. Later, we found that we could combine the respective advantages of RDB and AOF: the "hybrid persistence" method applies both during an AOF rewrite, further reducing the size of the AOF file.

Later, we found that although data could be restored from persistence files, recovery itself takes time, which means business applications would still be affected. We optimized further and adopted a "multi-copy" solution that keeps multiple instances synchronized in real time: when one instance fails, another can be manually promoted to continue providing services.

But this still has a problem: promoting an instance requires manual intervention, and manual intervention takes time. We looked for a way to automate the process, so we introduced the "sentinel" cluster. The sentinels negotiate with one another to detect the faulty node and complete the switchover automatically, greatly reducing the impact on business applications.

Finally, we focused on how to support greater write traffic, so we introduced "sharded clusters" to solve this problem, allowing multiple Redis instances to share the write pressure and face greater traffic in the future. We can also add new instances and expand horizontally to further improve the performance of the cluster.

At this point, our Redis cluster can provide long-term stable and high-performance services for our business.

Here I have drawn a mind map to help you better understand the relationship between them and the evolution process.

Postscript

After reading this, I think you should have your own opinions on how to build a stable and high-performance Redis cluster.

In fact, the optimization ideas discussed in this article revolve around the core ideas of "architectural design":

  • High performance: read and write separation, sharded cluster

  • High availability: data persistence, multiple copies, automatic failover

  • Easy to expand: sharded cluster, horizontal expansion

When we talk about sentinel clusters and sharded clusters, this also involves knowledge related to "distributed systems":

  • Distributed Consensus: Sentinel Leader Election

  • Load balancing: sharded cluster data sharding, data routing

Of course, in addition to Redis, you can use this idea to think and optimize when building any data cluster to see how they do it.

For example, when you are using MySQL, you can think about the differences between MySQL and Redis. How does MySQL achieve high performance and high availability? In fact, the ideas are similar.

Distributed systems and data clusters are everywhere now. I hope that through this article you can understand how such software evolved step by step, what problems it encountered along the way, and what solutions and trade-offs its designers chose to solve them.

Only by understanding the principles and mastering the ability to analyze and solve problems can you, in future development or when learning other excellent software, quickly find the "key points", master it in the shortest time, and take advantage of it in practical applications.

In fact, this thinking process is also the essence of "architectural design". When doing software architecture design, you discover problems, analyze them, solve them, and evolve and upgrade your architecture step by step, finally reaching a balance between performance and reliability. Although new software emerges endlessly, the ideas of architectural design do not change. I hope you truly absorb these ideas, so that you can handle constant change with constant principles.


Origin: blog.csdn.net/asdf12388999/article/details/128834441