Redis - slice cluster

Table of contents

1. Slice cluster

1. What is a slice cluster?

2. How to save more data?

2.1 Horizontal expansion and vertical expansion

2.2 Advantages and disadvantages of horizontal expansion and vertical expansion

3. Slicing clusters face two major problems:

3.1 Horizontal expansion: the corresponding distribution relationship between data slices and instances

3.2 How does the client locate the data?

4. Summary of sliced clusters

2. Sliced cluster solutions: Codis and Redis Cluster

1. The overall structure and basic process of Codis

1.1 How Codis handles requests

2. Key technical principles of Codis

2.1 How data is distributed in the cluster

2.2 Example

2.3 The difference between Codis and Redis Cluster data distribution

3. Cluster expansion and data migration

3.1 Add codis server

3.2 Add codis proxy

4. Can the client interact with the cluster directly?

5. How to ensure cluster reliability?

5.1 Codis server guarantee reliability method

5.2 Codis proxy and Zookeeper reliability

5.3 codis dashboard and codis fe reliability

6. Suggestions for Slicing Cluster Scheme Selection

6.1 Difference between Codis and Redis Cluster

6.2 Two schemes in practical application

7. Summary of Codis and Redis Cluster

3. Communication overhead: the key factor limiting the scale of Redis Cluster

1. Why should the cluster size be limited?

2. Instance communication method and impact on cluster size

2.1 Gossip protocol

3. Impact of communication

3.1 Gossip message size

3.2 Inter-instance communication frequency

4. How to reduce the communication overhead between instances?

4.1 Reduce the size of the message transmitted by the instance

4.2 Reduce the frequency of sending messages between instances:

5. Summary of communication overhead:


1. Slice cluster

When the amount of data in Redis grows, should we add memory or add instances?

Suppose we run a Redis instance on a cloud host: how should we choose the host's memory capacity?

Suppose we use Redis to save 50 million key-value pairs, each about 512B in size, which is roughly 25GB of data in total (50 million × 512B ≈ 25GB).

Option 1: a large-memory cloud host. Choose a cloud host with 32GB of memory to deploy Redis: 32GB is enough to hold all the data, with about 7GB left over to keep the system running normally. At the same time, RDB is used to persist the data, so that it can be recovered from the RDB file after the Redis instance fails.

Result: Redis sometimes responds very slowly. Checking the latest_fork_usec metric with the INFO command (it records how long the most recent fork took) shows a very high value, close to the second level. This is related to Redis's persistence mechanism. When persisting with RDB, Redis forks a child process to do the work, and the time taken by fork is positively correlated with the amount of data in Redis; fork blocks the main thread while it runs, so the larger the data set, the longer the main thread is blocked. Therefore, when persisting 25GB of data with RDB, the fork that creates the background child process blocks the main thread, which makes Redis respond slowly.

Option 2: a Redis sliced cluster. Although setting up a sliced cluster is more work, it can hold a large amount of data with much less blocking of the Redis main thread.

If the 25GB of data is split evenly into 5 shares (an even split is not strictly required) and saved on 5 instances, each instance only needs to hold 5GB of data, as shown below:

In a sliced cluster, when an instance generates an RDB file for its 5GB of data, the data volume is much smaller, and forking the child process generally does not block the main thread for long. By using multiple instances to save slices of the data, we can store the full 25GB while avoiding the sudden response slowdown caused by fork blocking the main thread.

1. What is a slice cluster?

A sliced cluster, also called a sharded cluster, means starting multiple Redis instances to form a cluster and splitting the data into multiple shares according to certain rules, with each share saved by one instance.

2. How to save more data?

2.1 Horizontal expansion and vertical expansion

The case above used two methods: a large-memory cloud host and a sliced cluster. These correspond to the two general approaches to increasing the amount of data Redis can handle: vertical expansion (scale up) and horizontal expansion (scale out).

  • Vertical expansion: upgrade the resource configuration of a single Redis instance, including adding memory, adding disk capacity, and using a higher-spec CPU. As shown in the figure below, the original instance has 8GB of memory and a 50GB disk; after vertical expansion, the memory grows to 24GB and the disk to 150GB.
  • Horizontal expansion: increase the number of Redis instances. As shown in the figure below, one instance with 8GB of memory and a 50GB disk becomes three instances with the same configuration.

2.2 Advantages and disadvantages of horizontal expansion and vertical expansion

2.2.1 Vertical expansion

Pros: simple and straightforward to implement.

Potential problems:

The first problem: when using RDB to persist data, as the amount of data grows, the required memory also grows, and the main thread may block when forking the child process (as in the example above). However, if you do not need to persist the Redis data, vertical expansion is a good choice.

The second problem: vertical expansion is limited by hardware and cost. This is easy to understand: expanding memory from 32GB to 64GB is easy, but expanding to 1TB runs into hardware capacity and cost limits.

2.2.2 Horizontal expansion

Horizontal expansion is a more scalable solution. To save more data, you only need to add Redis instances, without worrying about the hardware and cost limits of a single instance. When facing millions or tens of millions of users, a scaled-out Redis sliced cluster is a very good choice.

3. Slicing clusters face two major problems:

After the data is sliced, how is it distributed among multiple instances?

How does the client determine which instance the data it wants to access resides on?

3.1 Horizontal expansion: the corresponding distribution relationship between data slices and instances

In a sliced cluster, data needs to be distributed across different instances. How should data map to instances?

Redis Cluster scheme

Before Redis 3.0, there was no official solution for sliced clusters. Starting from 3.0, the official Redis Cluster solution is provided to implement sliced clusters, and it specifies the rules for mapping data to instances.

3.1.1 What is Redis Cluster?

The Redis Cluster solution uses hash slots (Hash Slot) to handle the mapping relationship between data and instances. In the Redis Cluster scheme, a slicing cluster has a total of 16384 hash slots. These hash slots are similar to data partitions, and each key-value pair is mapped to a hash slot according to its key.

The mapping has two steps: first, compute a 16-bit value from the key of the key-value pair using the CRC16 algorithm; then take that value modulo 16384 to get a result in the range 0~16383, where each value corresponds to the hash slot with that number.
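
As a rough illustration, the sketch below computes the slot for a key in Python, assuming the CRC16-XMODEM variant that Redis Cluster uses (hash tags, the {...} rule, are ignored for brevity); it is not taken from the Redis source.

def crc16(data: bytes) -> int:
    # CRC16-XMODEM: polynomial 0x1021, initial value 0, no reflection
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc

def key_to_slot(key: str) -> int:
    # step 1: 16-bit CRC of the key; step 2: modulo 16384 gives the slot number
    return crc16(key.encode()) % 16384

print(key_to_slot("hello:key"))   # some slot number in 0~16383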

3.1.2 How are hash slots mapped to specific Redis instances?

When deploying Redis Cluster, you can create the cluster with the cluster creation command, and Redis will automatically distribute the slots evenly across the instances: with N instances in the cluster, each instance gets 16384/N slots.
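
For example, on Redis 5 or later the cluster (and the even slot split) can be created with redis-cli; earlier versions used the redis-trib.rb script instead. The addresses below are only illustrative:

redis-cli --cluster create 172.16.19.3:6379 172.16.19.4:6379 172.16.19.5:6379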

You can also use the cluster meet command to manually connect instances into a cluster, and then use the cluster addslots command to specify which hash slots belong to each instance.

For example, if the Redis instances in the cluster have different memory sizes and the hash slots are divided evenly among them, the instances with less memory will be under greater capacity pressure than those with more memory when saving the same number of key-value pairs. In this case, you can use the cluster addslots command to allocate hash slots manually according to each instance's resource configuration.

Suppose the sliced cluster in the diagram has 3 instances and, for illustration, only 5 hash slots. We can manually allocate the hash slots with the following commands: instance 1 holds hash slots 0 and 1, instance 2 holds hash slots 2 and 3, and instance 3 holds hash slot 4.

redis-cli -h 172.16.19.3 -p 6379 cluster addslots 0 1
redis-cli -h 172.16.19.4 -p 6379 cluster addslots 2 3
redis-cli -h 172.16.19.5 -p 6379 cluster addslots 4

While the cluster is running, the CRC16 values of key1 and key2 are computed and taken modulo the total number of hash slots (5 in this example), and the results map the keys to instance 1 and instance 3 respectively.

Note: When manually assigning hash slots, all 16384 slots need to be allocated, otherwise the Redis cluster cannot work normally .

In this way, a sliced cluster implements the mapping from data to hash slots and from hash slots to instances.

Even if the instance has the mapping information of the hash slot, how does the client know which instance the data to be accessed is on?

3.2 How does the client locate the data?

When locating a key-value pair, the hash slot it belongs to can be obtained by calculation, and the client can perform this calculation when it sends a request. However, to locate the instance, the client also needs to know which instance that hash slot lives on.

Generally speaking, after the client establishes a connection with the cluster instance, the instance will send the hash slot allocation information to the client. However, when the cluster is just created, each instance only knows which hash slots it has been assigned, but does not know the hash slot information owned by other instances.

Why can the client obtain all hash slot information when accessing any instance?

Each Redis instance sends its own hash slot information to the other instances it is connected to, spreading the hash slot allocation information. Once the instances are interconnected, every instance has the full mapping of all hash slots. When the client receives the hash slot information, it caches it locally; when it then requests a key-value pair, it first calculates the hash slot for the key and sends the request to the corresponding instance.

In a cluster, the correspondence between instances and hash slots is not static. There are two most common changes:

  • In the cluster, when instances are added or deleted, Redis needs to reassign hash slots;
  • For load balancing, Redis needs to redistribute hash slots across all instances.

Instances can also pass messages to each other to obtain the latest hash slot allocation information

The client cannot actively perceive these changes. This will lead to an inconsistency between the allocation information it caches and the latest allocation information, so what should I do?

Redis Cluster provides a redirection mechanism: when the client sends a read or write request to an instance that does not hold the corresponding data, the instance tells the client to send the command to another instance.

3.2.1 How to use the MOVED redirection command

How does the client learn the address of the new instance during redirection? In other words, if the client sends a request to an instance that does not hold the hash slot of the key, how does the cluster respond?

When the client sends an operation on a key-value pair to an instance that does not hold the hash slot the key maps to, the instance returns a MOVED response to the client, which contains the address of the new instance.

GET hello:key
(error) MOVED 13320 172.16.19.5:6379

The MOVED response means that hash slot 13320, where the requested key-value pair lives, is actually on the instance at 172.16.19.5. Returning MOVED effectively tells the client which new instance owns the hash slot, so the client can connect to 172.16.19.5 directly and send its requests there.

For example, suppose that due to load balancing the data in Slot 2 has been migrated from instance 2 to instance 3, but the client cache still records "Slot 2 is on instance 2", so the command is sent to instance 2. Instance 2 returns MOVED with the latest location of Slot 2 (instance 3); the client then resends the request to instance 3 and updates its local cache so that Slot 2 now maps to instance 3.

3.2.2 How to use the ASK command

There may be a situation where some key-value pairs belonging to a slot that is being migrated are still on the source node, while others are already on the target node.

For example: the client sends a request to instance 2, but at this time, only part of the data in Slot 2 has been migrated to instance 3, and some data has not been migrated. In the case where this migration is partially completed, the client will receive an ASK error message, as follows:

GET hello:key
(error) ASK 13320 172.16.19.5:6379

The ASK response means that hash slot 13320, which holds the requested key-value pair, is on the instance at 172.16.19.5, but the slot is being migrated. The client must first send an ASKING command to 172.16.19.5, which tells that instance to allow the command the client sends next; then the client sends the GET command to read the data.

The ASK response thus carries two pieces of information: first, the Slot's data is still being migrated; second, it gives the client the address of the instance that currently holds the requested data. The client then sends ASKING to instance 3, followed by the operation command.

3.2.3 Difference between MOVED command and ASK command

  • The MOVED response updates the hash slot allocation information in the client cache; ASK does not update the client cache, so if the client requests data in Slot 2 again, it still sends the request to instance 2.
  • ASK only lets the client send one request to the new instance, while MOVED modifies the local cache so that all subsequent commands go to the new instance (a minimal client-side sketch follows this list).
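
Below is a minimal client-side sketch of these two rules in Python. It assumes a hypothetical send_command(addr, *args) transport helper and a key_to_slot() function like the one shown earlier; real cluster clients handle connections, retries, and errors much more carefully.

slot_cache = {}                      # slot number -> "host:port" of the owning instance
default_addr = "172.16.19.3:6379"    # any known cluster node (example address)

def run(key, *command):
    addr = slot_cache.get(key_to_slot(key), default_addr)
    reply = send_command(addr, *command)           # hypothetical transport helper
    if isinstance(reply, str) and reply.startswith("MOVED"):
        _, slot, new_addr = reply.split()
        slot_cache[int(slot)] = new_addr           # MOVED: update the local slot cache
        return send_command(new_addr, *command)
    if isinstance(reply, str) and reply.startswith("ASK"):
        _, _, new_addr = reply.split()
        send_command(new_addr, "ASKING")           # ASK: ask once, do NOT update the cache
        return send_command(new_addr, *command)
    return reply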

4. Summary of sliced clusters

This part mainly described the advantages of sliced clusters for storing large amounts of data, the hash-slot-based data distribution mechanism, and how the client locates key-value pairs.

  • When dealing with large and growing amounts of data, vertical expansion by adding memory is simple and direct, but an oversized instance slows down persistence (the fork for RDB), and it is also limited by hardware and cost.
  • A Redis sliced cluster provides a horizontal expansion model: multiple instances are used, and each instance is assigned a set of hash slots. Data is mapped to a hash slot by the hash value of its key, and the hash slots distribute the data across instances. Scalability is good, and more data can be stored simply by adding instances.
  • When instances are added or removed, or data is redistributed for load balancing, the mapping between hash slots and instances changes. A client request may then receive a redirection error; the MOVED and ASK responses let the client obtain the latest information.
  • Before Redis 3.0, Redis did not officially provide a sliced cluster solution. However, the industry already had some sliced cluster solutions at that time, such as ShardedJedis based on client-side partitioning, and proxy-based Codis and Twemproxy. These solutions were applied earlier than the Redis Cluster scheme.

The Redis Cluster scheme assigns key-value pairs to different instances through hash slots. This requires computing a CRC on each key and then mapping it to a hash slot. What is the benefit? If a table directly recorded the correspondence between key-value pairs and instances (for example, key-value pair 1 is on instance 2 and key-value pair 2 is on instance 1), there would be no need to compute the key-to-slot mapping at all, just a table lookup. Why doesn't Redis do this?

1. The number of keys stored in the whole cluster is unpredictable. When the number of keys is very large, directly recording the instance mapping for every key makes the mapping table enormous; whether it is stored on the server or on the client, it takes up a lot of memory.

2. Redis Cluster uses a decentralized model (no proxy; the client connects directly to the servers). When a client accesses a key on a node that does not hold it, that node must be able to redirect the client to the correct node (the MOVED response). This requires nodes to exchange routing tables so that each node has the complete routing picture of the whole cluster. If what were stored were key-to-instance mappings, the information exchanged between nodes would become huge and consume too much network bandwidth, and even after the exchange each node would have to store the other nodes' routing tables, using far too much memory and wasting resources.

3. When the cluster expands, shrinks, or rebalances data, data migrates between nodes. During migration, the mapping of every key would have to be updated, so the maintenance cost is high.

4. Adding a layer of hash slots in the middle decouples the data from the nodes. The key is hashed, and we only care about which hash slot it maps to; the node is then found through the slot-to-node mapping table. Hashing consumes very little CPU, it makes the data distribution more even, and it keeps the mapping table small, which is easy for both the client and the server to store, so the information exchanged between nodes stays lightweight.

5. When the cluster is expanding, shrinking, and data balancing, the operations between nodes, such as data migration, are all performed with the hash slot as the basic unit, which simplifies the difficulty of node expansion and shrinkage, and facilitates the maintenance and management of the cluster .

Request routing, data migration

Redis uses the cluster approach to solve the performance bottleneck caused by a single node holding too much data and taking too many writes. Multiple nodes form a cluster, which improves performance and reliability, but cluster management issues follow. There are two core issues: request routing and data migration (expansion, shrinkage, and data rebalancing).

1. Request routing: Generally, the mapping table of hash slots is used to find the specified node, and then operate on this node.

Redis Cluster records the complete mapping at every node (so that a node can correct a client's mis-routed request), and also sends it to the client so the client caches a copy and can go straight to the right node; client and server cooperate to complete request routing. This requires the business to upgrade to a cluster-aware SDK that supports this client-server protocol interaction when it uses Redis Cluster.

Other Redis clustering solutions such as Twemproxy and Codis use a centralized model (adding a Proxy layer). The client operates the whole cluster through the Proxy, behind which any number of Redis instances can sit, and the Proxy layer maintains the routing and forwarding logic. Operating the Proxy feels like operating an ordinary Redis, so the client does not need a new SDK, whereas Redis Cluster puts this routing logic in the SDK. Of course, adding a Proxy layer also brings some performance loss.

2. Data migration: when the cluster's nodes can no longer support the business, more nodes are needed. Expansion means data must be migrated between nodes, and whether the migration process affects the business is one standard for judging how mature a cluster solution is.

Twemproxy does not support online expansion; it only solves request routing, and expansion requires stopping to redistribute data. Both Redis Cluster and Codis support online expansion (with no or very little impact on the business). The key question is: during data migration, when a client operates on a key that is being migrated, how does the cluster handle it while still returning the correct result?

Both Redis Cluster and Codis rely on cooperation between the server and the client/Proxy layer. During migration, the server redirects the client or Proxy to the new node for keys that are being migrated, so that the business still gets correct results when accessing those keys, though the redirection does increase access latency during this period. After migration completes, each Redis Cluster node updates its routing table and lets the client perceive the change and update its cache; Codis updates the routing table at the Proxy layer, and the client is unaware of the whole process.

In addition to accessing the correct node, abnormal conditions (migration timeout, migration failure) and performance issues (how to make data migration faster, how to deal with bigkey) need to be resolved during the data migration process. There are many details in this process.

The data migration of Redis Cluster is synchronous. Migrating a key will block the source node and the target node at the same time, and there will be performance problems during the migration process. Codis provides a solution to migrate data asynchronously, with faster migration speed and minimal impact on performance. Of course, the implementation solution is also more complicated.

2. Sliced cluster solutions: Codis and Redis Cluster

Redis Cluster is the official sliced cluster solution provided by Redis. But Codis, which was already widely used in the industry before Redis Cluster was released, also deserves attention.

1. The overall structure and basic process of Codis

The Codis cluster contains 4 types of key components.

  • codis server: This is a Redis instance that has undergone secondary development. Additional data structures are added to support data migration operations. It is mainly responsible for processing specific data read and write requests.
  • codis proxy: Receive client requests and forward them to codis server.
  • Zookeeper cluster: Save cluster metadata, such as data location information and codis proxy information.
  • codis dashboard and codis fe: together form a cluster management tool. Among them, codis dashboard is responsible for cluster management, including adding and deleting codis server, codis proxy and data migration. And codis fe is responsible for providing the web operation interface of the dashboard, which is convenient for us to manage the cluster directly on the web interface.

1.1 How Codis handles requests

First, for the cluster to receive and process requests, use the codis dashboard to set the access addresses of the codis server and codis proxy. Once set, the codis server and codis proxy start accepting connections.

Second, when the client wants to read or write data, it connects directly to the codis proxy. The codis proxy supports Redis's RESP protocol, so accessing the codis proxy is no different from accessing an ordinary Redis instance, and a client originally written for a single instance can easily connect to the Codis cluster.

Finally, when the codis proxy receives a request, it looks up the mapping between the requested data and the codis servers and forwards the request to the corresponding codis server for processing. When the codis server finishes, it returns the result to the codis proxy, and the proxy returns the data to the client.

2. Key technical principles of Codis

2.1 How data is distributed in the cluster

In a Codis cluster, which codis server a piece of data is saved on is determined by logical slot (Slot) mapping, in two steps.

First, the Codis cluster has 1024 Slots in total, numbered 0 to 1023. These Slots can be assigned to codis servers manually, with each server holding a subset, or the codis dashboard can allocate them automatically, for example by spreading the 1024 Slots evenly across all servers.

Second, when the client reads or writes data, it computes the CRC32 hash of the key and takes it modulo 1024; the result is the Slot number. Given the Slot-to-server assignment from the first step, we then know which server the data lives on.
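
A quick sketch of this second step in Python, assuming Codis's use of the standard CRC32 checksum (the slot-to-server lookup table here is only an example):

import zlib

NUM_SLOTS = 1024
slot_to_server = {0: "server1", 1: "server1", 2: "server2", 1022: "server8", 1023: "server8"}

def codis_slot(key: str) -> int:
    return zlib.crc32(key.encode()) % NUM_SLOTS    # CRC32(key) mod 1024

def locate(key: str) -> str:
    return slot_to_server[codis_slot(key)]         # which codis server holds the key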

2.2 Example

The figure below shows the mapping between data, Slots, and codis servers. Slots 0 and 1 are assigned to server 1, Slot 2 to server 2, and Slots 1022 and 1023 to server 8. When the client accesses key 1 and key 2, their CRC32 values modulo 1024 are 1 and 1022, so they are saved in Slot 1 and Slot 1022, which have been allocated to codis server 1 and server 8 respectively. The storage locations of key 1 and key 2 are therefore clear.

The key-to-Slot mapping is computed directly by the client via CRC32 before reading or writing, whereas the Slot-to-codis-server mapping comes from allocation, so it needs to be kept in a storage system; otherwise, if the cluster fails, the mapping would be lost.

The Slot-to-codis-server mapping is called the data routing table (routing table for short). After we allocate it on the codis dashboard, the dashboard sends the routing table to the codis proxy and also saves it in Zookeeper. The codis proxy caches the routing table locally, so when it receives a client request it can look up the local routing table and forward the request correctly.

In terms of the implementation method of data distribution, Codis and Redis Cluster are very similar, and both adopt the mechanism of mapping keys to slots and slots to instances

2.3 The difference between Codis and Redis Cluster data distribution

  1. Routing table in Codis: it is allocated and modified through the codis dashboard and saved in the Zookeeper cluster. Once data locations change (for example, instances are added or removed), the routing table is modified, the codis dashboard sends the updated routing table to the codis proxy, and the proxy forwards requests according to the latest routing information.

  2. In Redis Cluster, the data routing table is spread through communication between the instances, and eventually each instance keeps a copy. When routing information changes, it has to be propagated among all instances via network messages, so if the number of instances is large, more of the cluster's network resources are consumed.

3. Cluster expansion and data migration

Codis cluster expansion includes two aspects: adding codis server and adding codis proxy.

3.1 Add codis server

Two steps:

  1. Start a new codis server and add it to the cluster;
  2. Migrate some data to the new server.

3.1.1 Basic process of data migration

The Codis cluster migrates data at the granularity of a Slot, and data migration is an important mechanism.

  1. On the source server, Codis randomly selects a piece of data from the Slot to be migrated and sends it to the destination server.
  2. After the destination server confirms receipt of the data, it will return a confirmation message to the source server. At this time, the source server will locally delete the data that was just migrated.
  3. The first and second steps make up the migration of a single piece of data. Codis keeps repeating this process until all the data in the Slot being migrated has been moved (a simplified sketch follows this list).
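
A simplified sketch of this per-slot loop (not the Codis source; pick_random_entry, send_to_target, wait_for_ack and delete_local stand in for Codis's internal operations):

def migrate_slot(slot):
    while slot.has_keys():
        key, value = slot.pick_random_entry()   # step 1: pick one piece of data in the slot
        send_to_target(key, value)              # step 1: send it to the destination server
        wait_for_ack(key)                       # step 2: destination confirms receipt
        delete_local(key)                       # step 2: source deletes its local copy
        # step 3: repeat until the slot has no data left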

Codis implements two migration modes, namely synchronous migration and asynchronous migration

3.1.2 Synchronous Migration

Synchronous migration means that while data is being sent from the source server to the destination server, the source server is blocked and cannot handle new requests. This mode is easy to implement, but the migration involves several operations (serialization on the source server, network transmission, deserialization on the destination server, and deletion on the source server). If the data being migrated is a bigkey, the source server will be blocked for a long time and cannot serve user requests in time.

3.1.3 Asynchronous Migration

To keep data migration from blocking the source server, the second migration mode Codis implements is asynchronous migration.

Two key characteristics of asynchronous migration

The first feature is

After the source server sends the data to the destination server, it can go on processing other requests without waiting for the destination to finish executing the command. Once the destination server has received, deserialized, and saved the data locally, it sends an ACK message to the source server to indicate that the migration of that data is complete, and the source server then deletes its local copy.

During this process, the migrated data will be set as read-only, so the data on the source server will not be modified, and naturally there will be no problem of "inconsistency with the data on the destination server"

The second feature is

For a bigkey, asynchronous migration splits it into per-element instructions: each element of the bigkey is migrated with its own command, instead of serializing the whole bigkey and transferring it in one piece. Breaking it into parts avoids blocking the source server on serializing a large amount of data during bigkey migration.

If Codis fails while only part of a bigkey has been migrated, some of its elements will be on the source server and the rest on the destination server, which breaks the atomicity of the migration. Therefore, Codis sets a temporary expiration time on the bigkey elements written to the target server: if a failure occurs mid-migration, the key on the target server is deleted once it expires, so atomicity is not violated; when the migration completes normally, the temporary expiration time is removed.

Second feature example:

To migrate a List with 10,000 elements using asynchronous migration, the source server sends 10,000 RPUSH commands to the destination server, each inserting one element of the List; the destination server executes these 10,000 commands in order to complete the migration.
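
The snippet below illustrates the same "one command per element" idea with ordinary Redis commands and the redis-py client (assumed to be installed); it is not Codis's internal migration code, and the host addresses are just examples.

import redis

src = redis.Redis(host="172.16.19.3", port=6379)
dst = redis.Redis(host="172.16.19.5", port=6379)

def migrate_big_list(key, batch=500):
    start = 0
    while True:
        chunk = src.lrange(key, start, start + batch - 1)  # read a batch of elements
        if not chunk:
            break
        pipe = dst.pipeline()
        for element in chunk:
            pipe.rpush(key, element)                       # one RPUSH per element
        pipe.execute()                                     # pipeline the batch to cut round trips
        start += batch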

To improve migration efficiency, Codis allows multiple keys to be migrated at a time during asynchronous Slot migration. The number of keys per batch is set with the numkeys parameter of the asynchronous migration command SLOTSMGRTTAGSLOT-ASYNC.

3.2 Add codis proxy

In a Codis cluster, clients connect directly to the codis proxy, so when the number of clients grows, a single proxy cannot handle all the requests and more proxies are needed. Adding a proxy is straightforward: start it, then add it to the cluster through the codis dashboard.

The codis proxy's access information is saved in Zookeeper, so when a proxy is added, Zookeeper holds the latest access list. The client reads the proxy list from Zookeeper and can send requests to the newly added proxy, spreading client load across multiple proxies, as shown in the figure below.

4. Can the client interact with the cluster directly?

With a single Redis instance, any client that speaks the RESP protocol can interact with it and read and write data. With a sliced cluster, however, some behavior differs from a single instance. For example, data migration does not exist on a single instance, and during migration data access requests may be redirected (such as the MOVED redirection in Redis Cluster).

The client therefore has to add support for cluster-related command operations. If you originally used a single-instance client and want to scale out to a cluster, you need to switch to a new client, which is not particularly friendly to the compatibility of business applications.

The Codis cluster was designed with full consideration for compatibility with existing single-instance clients.

Codis uses codis proxy to directly connect to clients, and codis proxy is compatible with single-instance clients. The cluster-related management work (such as request forwarding, data migration, etc.) is done by components such as codis proxy and codis dashboard, without client participation.

When a business application uses the Codis cluster, the client does not need to be modified: the same client used for a single instance can be reused. The application can use the cluster to read and write a large volume of data while avoiding complex client-side logic, which keeps the business code stable and compatible.

5. How to ensure cluster reliability?

Reliability is a core requirement for real business applications. For a distributed system, reliability is related to the number of components in the system: the more components, the more potential points of failure. Unlike Redis Cluster, which contains only Redis instances, a Codis cluster contains 4 types of components.

Reliability assurance methods for different components of Codis.

5.1 Codis server guarantee reliability method

  1. The codis server is actually a Redis instance, but commands related to cluster operations have been added. Redis's master-slave replication mechanism and sentinel mechanism are both available on the codis server, so Codis uses a master-slave cluster to ensure the reliability of the codis server. To put it simply, Codis configures a slave library for each server and uses the sentinel mechanism for monitoring. When a failure occurs, the master-slave library can be switched to ensure the reliability of the server.
  2. In this configuration, each server becomes a server group, and each group is a server with one master and multiple slaves. The slots used for data distribution are also allocated according to the granularity of the group. At the same time, when codis proxy forwards the request, it also sends the write request to the main library of the corresponding group according to the corresponding relationship between the Slot and the group where the data is located, and sends the read request to the main library or the slave library in the group.

The figure below shows the Codis cluster architecture configured with server group. In the Codis cluster, we implement the master-slave switch of the codis server by deploying server group and sentinel cluster to improve cluster reliability.

5.2 Codis proxy and Zookeeper reliability

  1. In the Codis cluster design, the information used by the proxy (such as the routing table) comes from Zookeeper. The Zookeeper cluster stores data on multiple instances, and as long as more than half of the Zookeeper instances are working normally, the Zookeeper cluster can provide service and keep this data reliable.

  2. Therefore, the codis proxy stores the routing table in the Zookeeper cluster and can rely on Zookeeper's high reliability without any extra work. When a codis proxy fails, simply restart it; the restarted proxy obtains the routing table from the Zookeeper cluster via the codis dashboard and can then receive and forward client requests again. This design also reduces the development complexity of the Codis cluster itself.

5.3 codis dashboard and codis fe reliability

They mainly provide configuration management and manual operations for administrators, and their load is low, so no extra measures are needed to guarantee their reliability.

6. Suggestions for Slicing Cluster Scheme Selection

6.1 Difference between Codis and Redis Cluster

6.2 Two schemes in practical application


  1. From the perspective of stability and maturity, Codis was applied earlier and has mature production deployments in the industry. Although Codis introduces a proxy and Zookeeper, which add complexity to the cluster, the stateless design of the proxy and the stability of Zookeeper itself also help Codis run stably. Redis Cluster was released later than Codis and is relatively less mature. If you want a mature, stable solution, Codis is more suitable.

  2. From the perspective of client compatibility for business applications, a client written for a single instance can connect to the codis proxy directly, whereas a single-instance client that wants to connect to Redis Cluster needs new functionality to be developed. So if your business applications use many single-instance clients and you now want to adopt a sliced cluster, Codis is recommended, since it avoids modifying the clients in your business applications.

  3. From the perspective of using new Redis commands and features, the codis server is developed from open source Redis 3.2.8, so Codis does not support commands and data types added in later open source Redis versions. In addition, Codis does not implement all commands of the open source Redis version, such as BITOP, BLPOP, BRPOP, and transaction-related commands such as MULTI and EXEC. The list of unsupported commands is on the Codis official website, so check it before you use Codis. If you want the new features of the open source Redis versions, Redis Cluster is the suitable choice.

  4. From the perspective of data migration performance, Codis supports asynchronous migration, which affects the cluster's handling of normal requests less than synchronous migration. So if you migrate data frequently when using the cluster, Codis is the more suitable choice.

7. Summary of Codis and Redis Cluster

The Codis cluster includes four major components: codis server, codis proxy, Zookeeper, codis dashboard and codis fe.

  1. Codis proxy and codis server are responsible for processing data read and write requests. Among them, codis proxy connects with the client, receives the request, and forwards the request to the codis server, and the codis server is responsible for the specific processing of the request.
  2. codis dashboard and codis fe are responsible for cluster management, where codis dashboard performs management operations and codis fe provides a web management interface.
  3. The Zookeeper cluster is responsible for saving all metadata information of the cluster, including routing table, proxy instance information, etc. Here, there is something you need to pay attention to. In addition to using Zookeeper, Codis can also use etcd or the local file system to save metadata information.

A suggestion for using Codis: when multiple business lines need to use Codis, you can start multiple codis dashboards, with each dashboard managing a subset of the codis servers and taking care of the cluster management for one business line. In this way, a single Codis deployment can manage multiple business lines in isolation.

Assume that 80% of the key-value pairs stored in the Codis cluster are of Hash type, the number of elements in each Hash collection is 100,000 to 200,000, and the size of each collection element is 2KB. Do you think that migrating such a Hash collection data will affect the performance of Codis?

When Codis migrates data, its design keeps the impact of migration on performance small:

1. Asynchronous migration: The source node sends the migrated data to the target node and then returns, and then processes client requests. This stage will not block the source node for a long time. After the target node successfully loads the migrated data, it sends an ACK command to the source node to inform it that the migration is successful.

2. The source node releases the key asynchronously: After the source node receives the ACK from the target node, the operation of deleting the key in the source instance and releasing the key memory will be executed in the background thread without blocking the source instance. (Yes, Codis supported lazy-free earlier than Redis, but it was only used in data migration).

3. Serialized transmission of small objects: Small objects are still migrated in a serialized manner, saving network traffic.

4. Bigkey migration in batches: bigkey is split into commands, packaged and migrated in batches (using the advantages of Pipeline), to improve the migration speed.

5. Migrate multiple keys at one time: Send multiple keys for migration at one time to improve migration efficiency.

6. Migration flow control: The size of the buffer will be controlled during migration to avoid filling up the network bandwidth.

7. Guarantee the atomicity of bigkey migration (tolerate migration failure): a DEL command is sent to the target node before migration (retries keep this idempotent), then the bigkey is split into per-element commands with a temporary expiration time set (so a failed migration does not leave garbage data on the target node); after a successful migration, the real expiration time is set on the target node. Codis handles data migration better than Redis Cluster, and it also has a very friendly operations interface that makes it easy for DBAs to add and remove nodes, switch master and slave, and migrate data.

3. Communication overhead: the key factor limiting the scale of Redis Cluster

1. Why should the cluster size be limited?

The amount of data Redis Cluster can store and the throughput it can support are closely related to the number of instances in the cluster. Redis officially gives an upper limit on the scale of Redis Cluster: a cluster should run at most 1000 instances.

A key factor here is that the communication overhead between instances will increase as the instance scale increases. When the cluster exceeds a certain scale (such as 800 nodes), the cluster throughput will decrease instead. Therefore, the actual size of the cluster will be limited .

2. Instance communication method and impact on cluster size

When Redis Cluster is running, each instance will save the corresponding relationship between the Slot and the instance (that is, the Slot mapping table), as well as its own status information.

For every instance in the cluster to know the state of all other instances, instances communicate according to certain rules; these rules are the Gossip protocol.

2.1 Gossip protocol

The working principle of the Gossip protocol can be summarized in two points: detecting whether instances are online, and replying with a PONG message to the instance that sent the PING command.

  1. First, each instance will randomly select some instances from the cluster according to a certain frequency, and send PING messages to the selected instances to detect whether these instances are online and exchange status information with each other. The PING message encapsulates the status information of the instance sending the message itself, the status information of some other instances, and the Slot mapping table.
  2. Second, after an instance receives a PING message, it will send a PONG message to the instance that sent the PING message. PONG messages contain the same content as PING messages.

The Gossip protocol can ensure that after a period of time, each instance in the cluster can obtain the state information of all other instances.

In this way, even if events such as new node joining, node failure, and slot change occur, the cluster status can be synchronized on each instance through the transmission of PING and PONG messages

3. Impact of communication

When instances communicate with the Gossip protocol, the communication overhead is determined by the size of the messages and the frequency of communication: the larger and more frequent the messages, the greater the overhead. To make communication efficient, these are the two aspects to tune.

3.1 Gossip message size

The message body of the PING message sent by the Redis instance is composed of the clusterMsgDataGossip structure, which is defined as follows:

typedef struct {
    char nodename[CLUSTER_NAMELEN]; //40 bytes
    uint32_t ping_sent; //4 bytes
    uint32_t pong_received; //4 bytes
    char ip[NET_IP_STR_LEN]; //46 bytes
    uint16_t port; //2 bytes
    uint16_t cport; //2 bytes
    uint16_t flags; //2 bytes
    uint32_t notused1; //4 bytes
} clusterMsgDataGossip;

The values of CLUSTER_NAMELEN and NET_IP_STR_LEN are 40 and 46 respectively, meaning the nodename and ip byte arrays are 40 and 46 bytes long. Adding up the sizes of the other fields in the structure gives the size of one Gossip entry: 104 bytes.

When each instance sends a Gossip message, in addition to its own status information, it will also transmit the status information of one tenth of the cluster instances by default.

example

Therefore, in a cluster of 1000 instances, each PING message carries the status information of 100 instances, totalling 10,400 bytes; adding the sending instance's own information, the Gossip portion is about 10KB. To propagate the Slot mapping table among instances, the PING message also carries a Bitmap of 16,384 bits, where each bit corresponds to a Slot and a bit set to 1 means that Slot belongs to the current instance. Converting the Bitmap to bytes (2KB) and adding it to the roughly 10KB of instance status information gives a PING message size of about 12KB.

The PONG message has the same content as the PING message, and its size is about 12KB. After each instance sends a PING message, it will also receive a returned PONG message, and the total of the two messages is 24KB.
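
A quick back-of-the-envelope check of these numbers:

ENTRY_SIZE = 104                   # bytes per clusterMsgDataGossip entry (computed above)
instances = 1000
gossip_entries = instances // 10   # each PING carries state for ~1/10 of the cluster
slot_bitmap = 16384 // 8           # 16384 slots -> 2048-byte Bitmap

ping_bytes = gossip_entries * ENTRY_SIZE + slot_bitmap
print(ping_bytes)                  # 12448 bytes, roughly 12KB per PING
print(2 * ping_bytes)              # roughly 24KB for a PING plus its PONG reply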

In absolute terms, 24KB is not very large, but if a single request the instance normally handles is only a few KB, then the PING/PONG messages the instance transmits to keep the cluster state consistent are larger than a single business request. Moreover, every instance sends PING/PONG messages to other instances; as the cluster grows, the number of these heartbeat messages grows too, taking up part of the cluster's network bandwidth and reducing the throughput available for normal client requests.

3.2 Inter-instance communication frequency

After a Redis Cluster instance starts, by default it randomly selects 5 instances from its local instance list every second, picks from those 5 the one it has not communicated with for the longest time, and sends it a PING message. This is the basic way an instance periodically sends PING messages.

There is a problem here: the chosen instance is, after all, only the least recently contacted of 5 randomly sampled instances, so it is not guaranteed to be the least recently contacted instance in the whole cluster. Some instances may never be sent PING messages, so the cluster state they maintain becomes stale.

To avoid this, a Redis Cluster instance scans its local instance list every 100ms. If it finds an instance whose last PONG message was received more than half of cluster-node-timeout ago (cluster-node-timeout/2), it immediately sends that instance a PING message to refresh the cluster state information on it.
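
A simplified sketch of these two scheduling rules (not the actual clusterCron code from Redis; send_ping and the node fields are placeholders):

import random, time

def every_100ms(nodes, node_timeout):
    # rule 2: immediately PING any node whose last PONG is older than node_timeout / 2
    for n in nodes:
        if time.time() - n.last_pong > node_timeout / 2:
            send_ping(n)

def every_second(nodes):
    # rule 1: sample 5 nodes and PING the one heard from least recently
    sample = random.sample(nodes, min(5, len(nodes)))
    send_ping(min(sample, key=lambda n: n.last_pong))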

As the cluster grows, network delays between instances increase due to network congestion or traffic competition between different servers. If some instances fail to receive PONG messages from other instances in time, PING messages will be sent frequently between instances, which in turn adds extra overhead to cluster network communication.

The number of PING messages a single instance sends per second can be summarized as follows:

PING messages sent per second = 1 + 10 × (number of instances whose last PONG was received more than cluster-node-timeout/2 ago)

Here, 1 is the PING message an instance normally sends once per second, and 10 is the number of checks the instance performs per second; after each check, it sends a PING to every instance whose PONG message has timed out.

example

Suppose each 100-millisecond check finds 10 instances whose PONG messages have timed out; this instance then sends 101 PING messages per second, using about 1.2MB/s of bandwidth. If 30 instances in the cluster send messages at this rate, they consume about 36MB/s of bandwidth, crowding out the bandwidth the cluster needs to serve normal requests.
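
Reproducing that estimate:

PING_SIZE_KB = 12
timed_out = 10                          # instances whose PONG exceeded cluster-node-timeout/2
pings_per_second = 1 + 10 * timed_out   # = 101 messages per second
print(pings_per_second * PING_SIZE_KB)              # ~1212 KB/s, about 1.2MB/s per instance
print(30 * pings_per_second * PING_SIZE_KB / 1024)  # ~35.5 MB/s for 30 such instances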

4. How to reduce the communication overhead between instances?

4.1 Reduce the size of the message transmitted by the instance

To reduce communication overhead, in principle one could shrink the messages an instance transmits (PING/PONG messages and slot allocation information). However, cluster instances depend on PING/PONG messages and slot allocation information to keep the cluster state consistent; shrinking the messages would reduce the information exchanged between instances, which hurts cluster maintenance. So reducing message size is not a viable approach.

4.2 Reduce the frequency of sending messages between instances:

There are two frequencies at which messages are sent between instances.

  1. Each instance sends a PING message every 1 second. This frequency is not high. If the frequency is further reduced, the status of each instance in the cluster may not be able to be propagated in time.
  2. Each instance runs a check every 100 milliseconds and sends PING messages to nodes whose PONG has not been received for more than cluster-node-timeout/2. The 100-millisecond check interval is the standard frequency of the Redis instance's default periodic task, and we generally do not need to modify it.

That leaves the cluster-node-timeout configuration item as the one we can adjust.

The cluster-node-timeout configuration item defines the heartbeat timeout after which a cluster instance is judged to have failed; the default is 15 seconds. If cluster-node-timeout is relatively small, then in a large cluster PONG receive timeouts occur more frequently, causing instances to perform the "send a PING to every PONG-timed-out instance" operation 10 times per second.

Therefore, to keep excessive heartbeat messages from crowding out the cluster bandwidth, you can increase the cluster-node-timeout value, for example to 20 or 25 seconds. PONG receive timeouts are then alleviated, and a single instance no longer needs to send extra heartbeats 10 times per second.
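
For example (the value is in milliseconds, and 15000 is the default; the address below is illustrative), the setting can be changed in redis.conf or, where allowed in your deployment, at runtime:

# in redis.conf on each cluster node:
cluster-node-timeout 20000

# or at runtime:
redis-cli -h 172.16.19.3 -p 6379 config set cluster-node-timeout 20000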

Do not set cluster-node-timeout too large, however: if an instance really fails, the cluster has to wait cluster-node-timeout before detecting the failure, which prolongs the actual failure recovery time and affects normal use of the cluster services.

In order to verify whether adjusting the value of cluster-node-timeout can reduce the cluster network bandwidth occupied by heartbeat messages, it is suggested that before and after adjusting the value of cluster-node-timeout, you can use the tcpdump command to capture the network packets of heartbeat information sent by the instance.

After executing the following command, you can capture the heartbeat network packet sent by the instance on the 192.168.10.3 machine from port 16379, and save the content of the network packet to the r1.cap file:

tcpdump host 192.168.10.3 port 16379 -i <network interface name> -w /tmp/r1.cap

By analyzing the number and size of network packets, you can judge the bandwidth occupied by heartbeat messages before and after adjusting the value of cluster-node-timeout.

5. Summary of communication overhead:

Redis Cluster instances communicate with each other using the Gossip protocol. While Redis Cluster is running, each instance exchanges information via PING and PONG messages; these heartbeat messages carry the status of the current instance and of some other instances, as well as Slot allocation information. This mechanism lets every instance in Redis Cluster hold complete cluster state information.

However, as the cluster grows, so does the volume of inter-instance communication. Blindly expanding a Redis Cluster can therefore slow it down, because large-scale heartbeat traffic between instances takes bandwidth away from handling normal requests. Moreover, some instances may fail to receive PONG messages in time due to network congestion; each instance periodically (10 times per second) checks for this while running, and when it happens, immediately sends heartbeat messages to the instances whose PONG has timed out.

The larger the cluster size, the higher the probability of network congestion. Correspondingly, the higher the probability of PONG message timeout, which will lead to a large number of heartbeat messages in the cluster and affect the normal requests of the cluster service. You can reduce the bandwidth occupied by heartbeat messages by adjusting the cluster-node-timeout configuration item. However, in practical applications, if you do not particularly need a large-capacity cluster, it is recommended to control the size of the Redis Cluster to 400~500 instances.

Assuming a single instance can support 80,000 request operations per second (80,000 QPS), and each master instance is configured with one slave instance, then 400~500 instances can support 16 to 20 million QPS (200/250 master instances × 80,000 QPS = 16/20 million QPS). This throughput can meet the needs of many business applications.

If we use a method similar to Codis to save slot allocation information, and store the cluster instance status information and slot allocation information on a third-party storage system (such as Zookeeper), will this method have any impact on the cluster size?

Answer: Assuming that we use Zookeeper as a third-party storage system to store cluster instance status information and slot allocation information, then the instances only need to communicate and exchange information with Zookeeper, and there is no need to send a large number of heartbeat messages between instances to synchronize the cluster status. This practice reduces the amount of network traffic used for heartbeating between instances, which helps achieve large-scale clusters.

Also, network bandwidth can be focused on servicing client requests. However, in this case, when an instance obtains or updates cluster status information, it needs to interact with Zookeeper, and Zookeeper's network communication bandwidth requirements will increase. Therefore, when using this method, it is necessary to guarantee a certain network bandwidth for Zookeeper, so as to prevent Zookeeper from being unable to communicate with instances quickly due to bandwidth limitations.

Origin blog.csdn.net/qq_45656077/article/details/129702758