Distributed storage algorithms: using Redis to cache 100 to 200 million records

Interview question: 100 to 200 million records need to be cached. How would you design the storage?

Answer: A single machine cannot hold this much data, so it must be distributed storage. How can it be implemented with Redis?

Generally, there are three solutions in the industry:

Hash remainder partition

200 million records means 200 million key-value pairs. Since a single machine cannot hold them, they must be spread across multiple machines. Assume three machines form a cluster. For each read or write, the client computes hash(key) % N, where N is the number of machines, and the result decides which node the data is mapped to.
Advantages: Simple, direct and effective. You only need to estimate the data volume and plan the number of nodes, such as 3, 8, or 10, to cover the data for a period of time. The hash algorithm sends a fixed share of the requests to the same server, so each server handles (and maintains the data for) a fixed part of the requests, which gives load balancing plus divide and conquer.

Disadvantages: Expanding or shrinking the cluster is troublesome. Whether scaling out or in, every change in the number of nodes changes the mapping, which has to be recalculated. This is not a problem while the number of servers is fixed. But if elastic scaling is required or a node fails, the modulus in the formula changes: hash(key) % 3 becomes hash(key) % ?. The result of the remainder operation then changes for most keys, and the server chosen by the formula is no longer the one that holds the data.

For example, if one Redis machine goes down, the number of machines changes, the remainders are reshuffled, and most of the data has to be remapped.
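
To get a feel for how disruptive this reshuffle is, the sketch below counts how many keys change nodes when a three-node cluster grows to four. The `node_for` and `moved_fraction` helpers, the `user:<i>` key names and the use of `zlib.crc32` as the hash are illustrative assumptions, not something Redis does internally.

```python
import zlib


def node_for(key: str, n_nodes: int) -> int:
    """Hash remainder partition: hash(key) % N."""
    # zlib.crc32 is used only because it is deterministic across runs;
    # any stable hash function would do for this illustration.
    return zlib.crc32(key.encode("utf-8")) % n_nodes


def moved_fraction(keys, old_n: int, new_n: int) -> float:
    """Fraction of keys whose node changes when N goes from old_n to new_n."""
    moved = sum(1 for k in keys if node_for(k, old_n) != node_for(k, new_n))
    return moved / len(keys)


# Hypothetical key names, used only to have something to hash.
keys = [f"user:{i}" for i in range(100_000)]
fraction = moved_fraction(keys, 3, 4)
print(f"{fraction:.1%} of keys map to a different node after 3 -> 4")
```

With a well-mixed hash, a key keeps its node only when `h % 3 == h % 4`, which holds for 3 out of every 12 consecutive hash values, so roughly three quarters of the keys are expected to move.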

Consistent hash partition

**Consistent Hash Ring:** Consistent hashing needs a hash function that generates hash values. All possible hash values of this function form a full set, which is the hash space [0, 2^32-1]. This is a linear space, but the algorithm joins its two ends through logical control (0 = 2^32), so that it logically forms a ring.

It also uses the modulo method. The node modulo method introduced earlier takes the remainder by the number of nodes (servers), while the consistent hash algorithm takes the remainder by 2^32. Simply put, the consistent hash algorithm organizes the entire hash value space into a virtual ring. Assume the value space of a hash function H is 0 to 2^32-1 (that is, the hash value is a 32-bit unsigned integer). The space is organized clockwise: the point directly above the center of the ring represents 0, the first point to the right of 0 represents 1, and so on up to 2^32-1, which is the first point to the left of 0. In other words, 0 and 2^32-1 meet at the top of the ring. This ring of 2^32 points is called the hash ring.

Node mapping: map each IP node in the cluster to a position on the ring.

Hash each server. Specifically, you can use the server's IP address or host name as the hash input, so that each machine gets a position on the hash ring. Suppose there are four nodes, nodeA, nodeB, nodeC and nodeD. After computing hash(ip) for each IP address, each node occupies a position in the ring space.

To store a key-value pair, first compute hash(key) with the same hash function to determine the position of the data on the ring. From that position, "walk" clockwise along the ring; the first server encountered is the one the key belongs to, and the key-value pair is stored on that node.

For example, take four data objects: objectA, objectB, objectC and objectD. After hashing, each has a position in the ring space. According to the consistent hash algorithm, objectA is assigned to nodeA, objectB to nodeB, objectC to nodeC, and objectD to nodeD.
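
The node mapping and the clockwise walk can be sketched in a few lines. This is only an illustration under assumptions: the `HashRing` class, the `ring_hash` helper, the four IP addresses and the choice of MD5 (truncated to 32 bits) are all hypothetical, and with a real hash function the objects will not necessarily land on the nodes named in the example above.

```python
import bisect
import hashlib


def ring_hash(value: str) -> int:
    """Map a string to a position in [0, 2^32 - 1]."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    # The first 4 bytes of the digest give a 32-bit unsigned integer.
    return int.from_bytes(digest[:4], "big")


class HashRing:
    def __init__(self, nodes):
        # Node mapping: hash each node's IP to get its position on the ring.
        self._ring = sorted((ring_hash(n), n) for n in nodes)
        self._positions = [pos for pos, _ in self._ring]

    def node_for(self, key: str) -> str:
        """Walk clockwise from hash(key) to the first node encountered."""
        idx = bisect.bisect_left(self._positions, ring_hash(key))
        # Past the last node: wrap around to the first node on the ring.
        return self._ring[idx % len(self._ring)][1]


# Hypothetical IP addresses standing in for nodeA..nodeD.
nodes = ["192.168.0.1", "192.168.0.2", "192.168.0.3", "192.168.0.4"]
ring = HashRing(nodes)
for key in ["objectA", "objectB", "objectC", "objectD"]:
    print(key, "->", ring.node_for(key))
```

Keeping the node positions in a sorted list lets `bisect` find the next node clockwise in logarithmic time instead of scanning the whole ring.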


Advantages

Fault tolerance: Suppose nodeC goes down. Objects A, B and D are not affected, and only objectC is relocated to nodeD. In general, with consistent hashing, if a server becomes unavailable, the only data affected is the data between that server and the previous server on the ring (that is, the first server encountered when walking counterclockwise from it); everything else is unaffected. Put simply, if nodeC fails, only the data between nodeB and nodeC is affected, and it is transferred to nodeD.

Scalability: As the amount of data grows, a new node, nodeX, needs to be added. Suppose nodeX sits between nodeA and nodeB. Only the data between nodeA and nodeX is affected: it just needs to be re-entered onto nodeX. Unlike hash remainder partition, this does not cause all the data to be reshuffled.
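
Both properties can be checked with a small experiment. The sketch below is an assumption-laden illustration: `ring_hash`, `locate` and `moved_keys` are hypothetical helpers, node names are hashed directly instead of IP addresses, and because positions come from MD5 the clockwise successor of nodeC is not necessarily nodeD as in the example above.

```python
import bisect
import hashlib


def ring_hash(value: str) -> int:
    """Position on the ring: a 32-bit unsigned integer."""
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest()[:4], "big")


def locate(key: str, nodes) -> str:
    """First node found walking clockwise from hash(key)."""
    ring = sorted((ring_hash(n), n) for n in nodes)
    idx = bisect.bisect_left([pos for pos, _ in ring], ring_hash(key))
    return ring[idx % len(ring)][1]


def moved_keys(keys, before, after):
    """Map each relocated key to its (old node, new node) pair."""
    return {k: (locate(k, before), locate(k, after))
            for k in keys if locate(k, before) != locate(k, after)}


keys = [f"object{i}" for i in range(2000)]
nodes = ["nodeA", "nodeB", "nodeC", "nodeD"]

# Fault tolerance: nodeC goes down.
after_failure = moved_keys(keys, nodes, [n for n in nodes if n != "nodeC"])
print(len(after_failure), "keys moved, all previously on nodeC:",
      all(old == "nodeC" for old, _ in after_failure.values()))

# Scalability: nodeX joins.
after_join = moved_keys(keys, nodes, nodes + ["nodeX"])
print(len(after_join), "keys moved, all now on nodeX:",
      all(new == "nodeX" for _, new in after_join.values()))
```

In both cases every key that was not on the affected arc keeps its node, which is the contrast with the hash remainder approach.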


Disadvantages

Data skew problem in hash ring

When there are too few service nodes, the uneven distribution of nodes on the ring easily causes data skew in the consistent hash algorithm (most of the cached objects end up on one server). For example, with only two servers in the system, their positions on the ring may sit close together, so one of them owns most of the ring.
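
One way to see the skew is to measure how much of the ring each node owns. The following is a rough sketch with assumed names: `ring_hash`, `arc_shares` and the two server names are hypothetical, and it assumes at least two nodes with distinct positions.

```python
import hashlib

RING_SIZE = 2 ** 32


def ring_hash(value: str) -> int:
    """Position on the ring: a 32-bit unsigned integer."""
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest()[:4], "big")


def arc_shares(nodes):
    """Fraction of the ring each node owns: the arc from its predecessor to itself."""
    ring = sorted((ring_hash(n), n) for n in nodes)
    shares = {}
    for i, (pos, node) in enumerate(ring):
        prev_pos = ring[i - 1][0]  # for i == 0 this wraps to the last node
        shares[node] = ((pos - prev_pos) % RING_SIZE) / RING_SIZE
    return shares


# Hypothetical server names; a key hashed uniformly lands on a node
# with probability equal to that node's share of the ring.
shares = arc_shares(["server1", "server2"])
for node, share in shares.items():
    print(f"{node} owns {share:.1%} of the ring")
```

The two shares always add up to 1, but nothing forces them to be close to one half each, which is why a cluster with very few nodes can end up unbalanced.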


Summary

The goal is to migrate as little data as possible when the number of nodes changes.

Arrange all the storage nodes on a hash ring joined end to end. After its hash is computed, each key is stored on the nearest storage node clockwise. When a node joins or leaves, only the adjacent node clockwise from it on the hash ring is affected.

Advantages: Adding or removing a node only affects the adjacent node clockwise on the hash ring and has no impact on other nodes.

Disadvantages: The distribution of data depends on the positions of the nodes. Because the nodes are not evenly spaced on the hash ring, the data is not evenly distributed when stored.

Hash slot partition

1. Why it exists: to address the data skew problem of the consistent hash algorithm. The hash slot space is essentially an array: the array [0, 2^14-1] forms the hash slot space.

2. What it does: To distribute data evenly, another layer is added between the data and the nodes. This layer is called the hash slot (slot), and it manages the relationship between data and nodes. In effect, slots are placed on nodes, and data is placed in slots. Slots solve the granularity problem: they make the unit of management coarser, which makes data easier to move. Hashing solves the mapping problem: the hash value of the key is used to compute the slot, which makes data easy to distribute.

3. How many hash slots?

A cluster can only have 16384 slots, numbered 0-16383 (0 to 2^14-1). These slots are allocated among all the master nodes in the cluster, and there is no requirement on the allocation strategy: you can specify which slot numbers are assigned to which master node. The cluster records the mapping between nodes and slots. With that mapping settled, you hash the key and take the remainder modulo 16384; the remainder is the slot the key falls into: slot = CRC16(key) % 16384. Data is moved in units of slots. Because the number of slots is fixed, this is easier to handle, which solves the data movement problem.
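
The idea of moving data in units of slots can be modelled with a plain ownership table. This is a simplified sketch, not how Redis implements migration: `allocate_slots`, `migrate`, the node names and the "take an equal number of slots from each existing master" policy are all assumptions chosen for illustration.

```python
TOTAL_SLOTS = 16384


def allocate_slots(nodes):
    """Split slots 0..16383 into contiguous, near-equal ranges, one per node."""
    base, extra = divmod(TOTAL_SLOTS, len(nodes))
    owner = [None] * TOTAL_SLOTS  # owner[slot] -> node name
    start = 0
    for i, node in enumerate(nodes):
        size = base + (1 if i < extra else 0)
        for slot in range(start, start + size):
            owner[slot] = node
        start += size
    return owner


def migrate(owner, slots, target):
    """Reassign the given slots to target; keys move together with their slot."""
    for slot in slots:
        owner[slot] = target


masters = ["node1", "node2", "node3"]
owner = allocate_slots(masters)

# A fourth master joins: hand it about a quarter of the slots, taken
# evenly from the existing owners. No key needs to be rehashed, because
# slot = CRC16(key) % 16384 does not depend on the number of nodes.
per_donor = (TOTAL_SLOTS // 4) // len(masters)
for donor in masters:
    donated = [s for s in range(TOTAL_SLOTS) if owner[s] == donor][:per_donor]
    migrate(owner, donated, "node4")

counts = {n: owner.count(n) for n in masters + ["node4"]}
print(counts)
```

Only the slot-to-node table changes during the resize; the key-to-slot mapping stays fixed, which is what keeps the amount of moved data bounded.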

A Redis cluster has 16384 built-in hash slots, and Redis maps them to the nodes roughly equally according to the number of nodes. When a key-value pair needs to be placed in the cluster, Redis first computes the CRC16 checksum of the key and then takes the remainder modulo 16384, so each key corresponds to a hash slot numbered between 0 and 16383, and that slot is mapped to a node. In the article's example, keys A and B fall on node2 and key C falls on node3.
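
The key-to-slot-to-node lookup can be reproduced with the standard library. This is a minimal sketch under assumptions: the `SLOT_RANGES` table is a hypothetical even split across three masters, `key_slot` and `node_for` are illustrative names, and the sketch ignores Redis Cluster's hash-tag rule for keys containing `{...}`.

```python
from binascii import crc_hqx

TOTAL_SLOTS = 16384

# Hypothetical slot ranges for a three-master cluster (inclusive bounds).
SLOT_RANGES = {
    "node1": (0, 5460),
    "node2": (5461, 10922),
    "node3": (10923, 16383),
}


def key_slot(key: str) -> int:
    """slot = CRC16(key) % 16384."""
    # crc_hqx with an initial value of 0 computes the CRC-16 variant
    # with polynomial 0x1021 (often called XMODEM).
    return crc_hqx(key.encode("utf-8"), 0) % TOTAL_SLOTS


def node_for(key: str) -> str:
    """Find the node whose slot range contains the key's slot."""
    slot = key_slot(key)
    for node, (low, high) in SLOT_RANGES.items():
        if low <= slot <= high:
            return node
    raise ValueError(f"slot {slot} is not assigned to any node")


for key in ["A", "B", "C"]:
    print(key, "-> slot", key_slot(key), "->", node_for(key))
```

With these assumed ranges, keys A and B land on node2 and key C lands on node3, which matches the example in the text.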


Origin blog.csdn.net/weixin_49750432/article/details/133275166