A few things about Redis | High concurrency and high availability

If you use Redis as a caching layer, you must consider two things: how to add machines so that Redis can handle high concurrency, and how to ensure that Redis does not stay down after a crash, i.e. high availability.

Redis high concurrency: a master-slave architecture with one master and multiple slaves is enough for most projects. A single master handles writes, at tens of thousands of QPS on one machine; multiple slaves handle reads, and together they can serve on the order of 100,000 QPS.

Besides high concurrency, Redis may also need to hold a large amount of data. In a one-master-multiple-slaves setup, every instance holds the complete data set, so a master with 10 GB of memory can only hold 10 GB of data. If your cache holds tens or hundreds of gigabytes, or even terabytes, you need Redis Cluster, which can also serve hundreds of thousands of reads and writes per second.

Redis high availability: on top of a master-slave architecture, add sentinels. If an instance goes down, the sentinels automatically perform a master-slave failover.

The details are described below.

1. How does redis handle 100,000+ read requests per second through read-write separation?

1. The relationship between the high concurrency of redis and the high concurrency of the entire system

For a system to handle high concurrency, the caching layer must do its job well so that fewer requests hit the database directly. High concurrency is troublesome to achieve at the database layer, partly because some operations require transactions, so it is very difficult to push a database to very high concurrency.

Redis alone does not provide the concurrency of the entire system, but as the core of a large-scale cache architecture it is a crucial part of supporting high concurrency.

To achieve high concurrency in a system, the cache middleware must itself support high concurrency first; only then, with a good overall cache architecture design (multi-level caching, hotspot caching), can the system truly support high concurrency.

2. The bottleneck that prevents redis from supporting high concurrency

The bottleneck that prevents Redis from supporting high concurrency is mainly the single-machine problem: with only a single Redis instance, there is an upper limit no matter how good the machine is.

3. How to support higher concurrency

A standalone Redis cannot support very high concurrency. To support more, separate reads from writes. A cache usually sees high read concurrency and relatively few writes, so a master-slave architecture with read-write separation fits well.

Configure one master machine to accept writes and multiple slave machines to serve reads. After the master receives data, it synchronizes it to the slaves. Since you can add more slave machines, the overall read concurrency can be increased.
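A minimal sketch of the slave side of this setup, as it might appear in a slave's redis.conf (the master address 10.0.0.1:6379 is a placeholder):

```conf
# On each slave (replica) node, point replication at the master.
# Older Redis versions use "slaveof"; Redis 5.0+ prefers "replicaof".
slaveof 10.0.0.1 6379

# Slaves serve reads only; writes are rejected so all writes go to the master.
slave-read-only yes
```

Clients then direct writes to the master and spread reads across the slaves.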

2. The significance of redis replication and master persistence for the safety of the master-slave architecture

1. Redis replication principle

Several slave nodes hang under one master node, and writes go to the master. After the master writes the data, it asynchronously synchronizes it to all slave nodes, keeping the data of all nodes consistent.

2. The core mechanism of redis replication

(1) Redis replicates data to slave nodes asynchronously, but starting from Redis 2.8, slave nodes periodically acknowledge the amount of data they have replicated.

(2) One master node can have multiple slave nodes.

(3) The slave node can also connect to other slave nodes.

(4) When the slave node performs replication, it will not block the normal work of the master node.

(5) While a slave node performs replication, it does not block its own queries; it serves requests using the old data. However, when replication completes, it must delete the old data and load the new data, and during that brief window it pauses serving requests.

(6) The slave node is mainly used for horizontal expansion and separation of read and write. The expanded slave node can improve throughput.

3. The security significance of master persistence to the master-slave architecture

With a master-slave architecture, persistence must be enabled on the master node. Do not rely on slave nodes alone as a hot backup of the master: if the master goes down without persistence, it restarts with an empty data set, and any slaves that then replicate from it will copy that empty data set, losing the data on every node.

Also keep multiple cold backups of the backup files, to guard against losing the RDB backup data if the whole machine fails.


3. Redis master-slave replication principle, resumable transfer, diskless replication, expired key processing

1. Principle of master-slave replication

① When starting a slave node, it will send a PSYNC command to the master node.

② If the slave node is reconnecting to a master it has replicated from before, the master copies only the missing data to the slave; if it is connecting to the master for the first time, a full resynchronization is triggered.

③ When a full resynchronization starts, the master forks a background process to generate an RDB snapshot file, while caching all newly received write commands from clients in memory.

④ The master node sends the generated RDB file to the slave node; the slave writes it to its local disk and then loads it into memory. The master then sends the write commands cached in memory to the slave node, and the slave replays them to catch up.

⑤ If the slave node is disconnected from the master node due to a network failure, it will automatically reconnect.

⑥ If the master finds multiple slave nodes reconnecting, it starts only one RDB save operation and serves all slaves with that single copy of the data.

2. Resumable transfer of master-slave replication

Starting from Redis 2.8, replication can resume from a breakpoint. If the network drops during master-slave replication, replication can continue from where it left off instead of starting from the beginning.

Principle:

The master node keeps a backlog in memory. Both master and slave record a replica offset and the master's run id, and the offset data lives in the backlog. If the network connection between master and slave breaks, the slave asks the master to continue replicating from the last replica offset. If that offset can no longer be found in the backlog, a full resynchronization is performed.
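The decision can be sketched as follows. This is illustrative pseudologic, not the real Redis source; the function name and parameters are assumptions for the example:

```python
# Sketch of how a master decides between partial and full resynchronization
# when a slave sends PSYNC <runid> <offset> after a reconnect.

def handle_psync(master_run_id, backlog_start, backlog_end,
                 slave_run_id, slave_offset):
    """Return the sync mode the master would choose."""
    # The run id must match: a restarted master has a new id and new data.
    if slave_run_id != master_run_id:
        return "FULLRESYNC"
    # The requested offset must still be inside the replication backlog.
    if backlog_start <= slave_offset <= backlog_end:
        return "CONTINUE"   # partial resync: send only the missing tail
    return "FULLRESYNC"     # offset fell out of the backlog: copy everything

# The slave was at offset 150 when the link dropped; backlog covers 100..200.
print(handle_psync("abc", 100, 200, "abc", 150))  # CONTINUE
print(handle_psync("abc", 100, 200, "abc", 50))   # FULLRESYNC
```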

3. Diskless replication

Diskless replication means the master creates the RDB directly in memory and sends it to the slaves over the network, without saving the data to its local disk first.

Setting method

Configure the repl-diskless-sync and repl-diskless-sync-delay parameters.

repl-diskless-sync: this parameter enables diskless replication.

repl-diskless-sync-delay: this parameter means waiting a certain period before starting replication, so that multiple reconnecting slave nodes can be served by the same transfer.
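In redis.conf this looks like the following (the 5-second delay is an example value):

```conf
# Generate the RDB in memory and stream it to slaves over the socket,
# instead of writing it to the master's disk first.
repl-diskless-sync yes

# Wait 5 seconds before starting the transfer, so that several slaves
# reconnecting at about the same time can share one RDB stream.
repl-diskless-sync-delay 5
```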

4. Expired key processing

A slave does not expire keys on its own; it waits for the master to expire them.

When the master expires or evicts a key, it sends a simulated del command to the slaves, and each slave deletes the key upon receiving it.

4. In-depth analysis of the complete flow and principle of redis replication

1. The complete process of copying

① The slave node starts with only the master node's information, i.e. the master's address configured via slaveOf in the redis.conf file; data replication has not started yet.

② The slave node runs an internal scheduled task that checks every second whether there is a new master node to replicate from; if one is found, it establishes a socket connection with the master.

③The slave node sends a ping command to the master node.

④ If the master has requirepass set, the slave node must send an auth command with the master's password for verification.

⑤The master node performs full replication for the first time and sends all data to the slave node.

⑥ The master node then continues receiving write commands and asynchronously replicates them to the slave node.

2. The mechanism of data synchronization

This refers to the first time the slave connects to the master, when it performs a full replication.

① Both master and slave maintain an offset

The master keeps accumulating its own offset, and the slave keeps accumulating its own offset too. The slave reports its offset to the master every second, and the master records the offset of each slave.

The offset is not specific to full replication; it mainly lets the master and slave each know where their data stands, so they can detect data inconsistency between them.

②backlog

The master node has a backlog in memory, which is 1M by default.

When the master node replicates data to the slave node, it will also synchronize the data in the backlog.

The backlog is mainly used for incremental replication when full replication is interrupted.

③master run id

You can view the master run id with the info server command.

Purpose: the slave uses it to uniquely identify its master.

Why not use host+ip: locating the master by host+ip is unreliable. If the master node restarts or its data changes, the slave must distinguish masters by their run ids: a different run id means a full resynchronization is needed.

If you need to restart redis without changing the run id, you can use the redis-cli debug reload command.

3. Full copy process and mechanism

① The master executes bgsave to generate an RDB file locally.

② The master node sends the RDB snapshot file to the slave node. If the transfer of the RDB file takes longer than 60 seconds (repl-timeout), replication on the slave node fails. You can raise this parameter if needed.

③ On a machine with a gigabit network card, about 100 MB per second is typically transferred, so a 6 GB file can easily exceed 60 seconds.

④ While generating the RDB file, the master node caches all newly received write commands in memory; after the slave node has saved the RDB file, those write commands are sent to the slave node as well.

⑤ Check the client-output-buffer-limit slave parameter, e.g. [client-output-buffer-limit slave 256mb 64mb 60]: if during replication the output buffer stays above 64 MB for 60 seconds, or exceeds 256 MB at once, replication stops and fails.

⑥ After the slave node receives the RDB file, it clears its own data, and then reloads the RDB file into its own memory. In this process, it provides external services based on the old data.

⑦ If the slave node has AOF enabled, it immediately runs BGREWRITEAOF. AOF rewriting, RDB generation, RDB transfer over the network, clearing the slave's old data, and the slave's AOF rewrite all take time; if the data being copied is between 4 GB and 6 GB, a full replication can easily take 1.5 to 2 minutes.
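The rough arithmetic behind the repl-timeout warning above, assuming the ~100 MB/s effective throughput of a gigabit NIC mentioned in ③:

```python
# Estimate how long an RDB transfer takes at a given network throughput.

def rdb_transfer_seconds(rdb_size_gb, bandwidth_mb_per_s=100):
    """Transfer time in seconds for an RDB of rdb_size_gb gigabytes."""
    return rdb_size_gb * 1024 / bandwidth_mb_per_s

# A 6 GB RDB at ~100 MB/s takes 61.44 s, just over the default 60 s repl-timeout.
print(rdb_transfer_seconds(6))
```

This is why either the RDB size must be kept down or repl-timeout raised on large data sets.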

4. Incremental replication process and mechanism

① If the network between master and slave drops during full replication, the slave reconnects to the master and triggers incremental replication.

②The master directly obtains some missing data from its own backlog and sends it to the slave node.

③The master obtains data from the backlog according to the offset in the psync sent by the slave.

5. Heartbeat

Both master and slave send heartbeat messages to each other.

By default, the master sends a heartbeat every 10 seconds, and the slave node sends one every 1 second.

6. Asynchronous replication

Each time the master receives a write command, it first writes the data locally, then asynchronously sends it to the slave nodes.

5. How to achieve 99.99% high availability under redis master-slave architecture?

1. What is 99.99% high availability?

High availability (English: high availability, abbreviated as HA), an IT term, refers to the ability of a system to perform its functions without interruption, and represents the degree of system availability. It is one of the criteria for system design. A high-availability system can run longer than the individual components that make up the system.

High availability is usually achieved by improving the fault tolerance of the system. Defining how a system is considered highly available often requires specific analysis based on the specific circumstances of each case.

The measurement compares the time the system is damaged or unavailable, and the time needed to go from inoperable back to operational, against the system's total operating time. The formula is:

A = MTBF / (MTBF + MDT)

where A is Availability, MTBF is Mean Time Between Failures, and MDT is Mean Down Time (mean time to repair).
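A quick worked example of the formula (the failure and repair numbers are illustrative, not from the article):

```python
# Availability A = MTBF / (MTBF + MDT): the fraction of time the system is up.

def availability(mtbf_hours, mdt_hours):
    return mtbf_hours / (mtbf_hours + mdt_hours)

# e.g. one failure per year (8760 h of uptime) with 5 minutes to recover:
a = availability(8760, 5 / 60)
print(f"{a:.6f}")  # roughly "five nines"
```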

Online systems and systems that perform critical tasks usually require their availability to reach 5 nines (99.999%).

2. Redis unavailability

Redis unavailability includes the unavailability of a single instance and the unavailability of the master-slave architecture.

Unavailable situation:

① The master node of the master-slave architecture goes down. Cached data can no longer be written, and since slaves wait for the master to expire keys, stale data on the slaves cannot expire either, so the cache becomes unusable.

② If it is a single instance, the redis process may die due to other reasons. Or the machine where redis is deployed is broken.

Consequences of unavailability: if the cache is unavailable, requests go directly to the database. If the influx of requests exceeds what the database can carry, the database goes down. If the cache problem is not fixed in time, the database will go down again soon after each restart because of the flood of requests, making the whole system unavailable.

3. How to achieve high availability

① Ensure that each redis has a backup.

② Ensure that after the current redis fails, you can quickly switch to the backup redis.

To solve this problem, the following sentinel mechanism is introduced.

6. Basics of the redis sentinel architecture

1. What is a sentinel?

Sentinel is a very important component in the redis cluster architecture. It mainly has the following functions:

①Cluster monitoring, responsible for monitoring whether the redis master and slave processes are working properly.

② Message notification: if a redis instance fails, the sentinel is responsible for sending an alarm notification message to the administrator.

③ Failover: if the master hangs, it automatically fails over to a slave.

④ Configuration center: when a failover occurs, notify clients of the new master address.

2. The core knowledge of the sentinel

① Sentinels are themselves distributed and need to run as a cluster, working together.

② During failover, declaring a master down requires the agreement of a certain number (quorum) of sentinels.

③ Even if some sentinels go down, the sentinel cluster can still work normally.

④ At least 3 sentinel instances are needed to ensure robustness.

⑤ Sentinel + redis master-slave cannot guarantee zero data loss; it only guarantees the high availability of the redis cluster.

⑥ For the sentinel + redis master-slave architecture, test and drill repeatedly before putting it into production.

3. Why can't a sentinel cluster work properly with only 2 nodes?

A sentinel cluster must deploy more than 2 nodes. If the cluster has only 2 sentinel instances, then quorum = 1 (the number of sentinels that must agree before a failover can be performed).

As shown in the figure, if master1 goes down at this time, failover can happen as long as one of sentinel 1 and sentinel 2 thinks master1 is down, and one of them will be elected to perform the failover.

At the same time, a majority (more than half of all sentinels in the cluster) must authorize the failover. With 2 sentinels the majority is 2, which means both sentinels must still be running for a failover to proceed.

But if the machine hosting the master and sentinel 1 goes down, only one sentinel remains. There is then no majority to authorize the failover: 1 is not more than half of 2, so the failover will not be performed.

4. Classic 3-node sentry cluster

Configuration: quorum = 2, majority = 2

If the machine where M1 runs goes down, two of the three sentinels remain. S2 and S3 can agree that the master is down and elect one of themselves to perform the failover.

Meanwhile, the majority of three sentinels is 2, so with two sentinels still running, the failover is allowed.

7. Data loss problem of redis sentinel active/standby switch: asynchronous replication, cluster split brain

1. Two scenarios of data loss

①Data loss caused by asynchronous replication

Because the data replication process from the master to the slave is asynchronous, some data may not have time to be copied to the slave. At this time, the master is down, and this part of the data is lost.

②Data loss caused by cluster split brain

What is split-brain: a machine hosting a master suddenly drops off the normal network and cannot reach the slave machines, yet the master process is actually still running.

At this time, the sentry may think that the master is down, and then start the election to switch other slaves to master.

At this time, there will be two masters in the cluster, which is the so-called split brain.

Although a slave has been promoted to master, clients may not have switched to the new master yet, and the data they continue to write to the old master may be lost.

Therefore, when the old master recovers, it is attached as a slave to the new master, its own data is cleared, and it replicates data from the new master again.

2. Solving the data loss caused by asynchronous replication and split brain

To solve this problem, you need to configure two parameters:

min-slaves-to-write 1 and min-slaves-max-lag 10:

This requires at least one slave to be replicating and synchronizing data, with a replication lag of no more than 10 seconds.

If the replication lag of every slave exceeds 10 seconds, the master stops accepting write requests.
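In redis.conf on the master, this looks like:

```conf
# Reject writes unless at least 1 slave is connected and its replication
# lag (time since its last ACK) is no more than 10 seconds.
# Newer Redis spells these min-replicas-to-write / min-replicas-max-lag.
min-slaves-to-write 1
min-slaves-max-lag 10
```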

①Reduce the data loss of asynchronous replication

With min-slaves-max-lag configured, once a slave's replication ack lags too long, the master assumes too much data would be lost if it went down, and rejects write requests. This keeps the amount of data lost because it was not yet synchronized to a slave within a controllable range when the master goes down.

②Reduce data loss due to split brain

If a master suffers a split brain and loses its connection to the other slaves, the two configurations above ensure that if it cannot keep sending data to the required number of slaves, and no slave has sent it an ack for more than 10 seconds, it directly rejects client write requests.

In this way, the old master after the split brain will not accept the new data from the client, and data loss is avoided.

The above configuration ensures that if the master loses connection with its slaves and no slave has acked for 10 seconds, new write requests are rejected.

Therefore, in a split-brain scenario, at most about 10 seconds of data is lost.

8. In-depth analysis of multiple core underlying principles of redis sentry (including slave election algorithm)

1.sdown and odown two states

Sdown is subjective downtime: a single sentinel believes a master is down.

Odown is objective downtime: a quorum of sentinels believe the master is down.

The condition for sdown is simple: if a sentinel's pings to a master go unanswered for more than the number of milliseconds specified by down-after-milliseconds, the sentinel subjectively considers the master down.

The conversion from sdown to odown is also simple: if, within the specified time, a sentinel learns that at least a quorum of sentinels also consider the master down, the master is objectively considered down.
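The sdown-to-odown promotion can be sketched as a simple vote count (an illustrative helper, not Redis internals):

```python
# A master becomes objectively down (odown) once the number of sentinels
# reporting it as subjectively down (sdown) reaches the configured quorum.

def is_odown(sdown_votes, quorum):
    """sdown_votes: sentinels currently reporting the master as sdown."""
    return sdown_votes >= quorum

print(is_odown(1, 2))  # False: only subjectively down so far
print(is_odown(2, 2))  # True: objectively down, failover may start
```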

2. Automatic discovery mechanism of the sentinel cluster

① Sentinels discover each other through the redis pub/sub system. Each sentinel publishes a message to the __sentinel__:hello channel, and the other sentinels consume this message and thus perceive each other's presence.

② Every two seconds, each sentinel publishes a message to the __sentinel__:hello channel of every master+slave it monitors, containing its own host, port and run id, as well as its monitoring configuration for that master.

③ Each sentinel also subscribes to the __sentinel__:hello channel of every master+slave it monitors, and thereby perceives the other sentinels that are monitoring the same master+slave.

④ Sentinels also exchange the master's monitoring configuration with each other and keep it synchronized.

3. Self-correction of slave configuration

Sentinels automatically correct some slave configurations. If a slave is a potential master candidate, the sentinel ensures it is replicating the current master's data; if a slave is connected to the wrong master, for example after a failover, the sentinel ensures it connects to the correct master.

4. Election algorithm

If a master is considered down and a majority of sentinels have authorized the failover, one sentinel performs the failover. It must first elect a slave, and the election takes the following into account:

①The length of time the slave disconnects from the master

② Slave priority

③The offset of slave copy data

④The run id of the slave

First, if a slave has been disconnected from the master for more than 10 times down-after-milliseconds, plus the length of time the master has been down, the slave is considered unsuitable for election as master.

That is: disconnection time > (down-after-milliseconds * 10 + milliseconds_since_master_is_in_SDOWN_state).

The remaining slaves are sorted according to the following regulations:

① First, sort by slave priority: the lower the slave-priority value, the higher the election priority.

② If priorities are equal, compare the replica offset: the slave that has replicated more data (a larger offset) has higher priority.

③ If those are also equal, choose the slave with the smallest run id.
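The filter and sort described above can be sketched as follows (an illustrative model with made-up field names, not Redis source):

```python
# Sketch of the slave election: filter out slaves disconnected too long,
# then sort by (priority asc, replica offset desc, run id asc).

def elect_slave(slaves, down_after_ms, master_sdown_ms):
    limit = down_after_ms * 10 + master_sdown_ms
    eligible = [s for s in slaves if s["disconnect_ms"] <= limit]
    if not eligible:
        return None
    eligible.sort(key=lambda s: (s["priority"], -s["offset"], s["run_id"]))
    return eligible[0]["run_id"]

slaves = [
    {"run_id": "a", "priority": 100, "offset": 500, "disconnect_ms": 1000},
    {"run_id": "b", "priority": 100, "offset": 800, "disconnect_ms": 1000},
    {"run_id": "c", "priority": 100, "offset": 800, "disconnect_ms": 999999},
]
# "b" wins: same priority as "a" but a larger offset; "c" was gone too long.
print(elect_slave(slaves, down_after_ms=30000, master_sdown_ms=60000))
```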

5.quorum and majority

Each time a failover is performed, a quorum of sentinels must first consider the master down; then one sentinel is elected to carry out the failover, and it must be authorized by a majority of sentinels before the switch can officially proceed.

If quorum < majority, for example 5 sentinels with majority 3 (more than half) and quorum set to 2, then 3 sentinels must authorize before the failover can be performed.

If quorum >= majority, then quorum sentinels must authorize the switch. For example, with 5 sentinels and quorum 5, all 5 sentinels must agree before the failover happens.
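The two cases above reduce to a single rule, sketched here (an illustrative helper, not Redis internals):

```python
# The number of sentinels that must authorize the failover is the majority
# when quorum < majority, otherwise it is the quorum itself: max(quorum, majority).

def required_authorizations(total_sentinels, quorum):
    majority = total_sentinels // 2 + 1
    return max(quorum, majority)

print(required_authorizations(total_sentinels=5, quorum=2))  # 3 (the majority)
print(required_authorizations(total_sentinels=5, quorum=5))  # 5 (all sentinels)
```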

6.configuration epoch

The sentry will monitor a set of redis master+slave and have corresponding monitoring configurations.

The sentinel performing the failover obtains a configuration epoch from the new master (the promoted slave). This is a version number, and it must be unique for each failover.

If the first elected sentinel fails to complete the failover, the other sentinels wait for the failover-timeout period and then take over to continue the switch, obtaining a new configuration epoch as the new version number.

7. configuration spread

After the sentinel completes the switch, it will update and generate the latest master configuration locally, and then synchronize it to other sentinels through the pub/sub message mechanism mentioned earlier.

The version number mentioned above matters here: all messages are published and subscribed on one channel, so after a sentinel completes a failover, the new master configuration carries a new version number.

The other sentinels update their master configuration according to the size of the version number.


Origin blog.csdn.net/Linuxhus/article/details/113139049