Consistent hashing algorithm
In a distributed cache system, data should be spread evenly across the machines in the cache server cluster. The simplest approach is to compute a hash of the cached item's key and take that hash modulo the number of server nodes; the result identifies the node that stores the item. This algorithm is very simple and does distribute data evenly, but when a node is added or removed, most keys map to a different node, so most of the cached data is effectively invalidated.
Traditional modulo
For example, with 10 pieces of data (0 to 9) and 3 nodes, the modulo method gives:
- node a: 0,3,6,9
- node b: 1,4,7
- node c: 2,5,8
When a node is added, the data distribution changes to:
- node a: 0,4,8
- node b: 1,5,9
- node c: 2,6
- node d: 3,7
Summary: data items 3, 4, 5, 6, 7, 8, and 9 (7 of the 10) all need to be relocated when a single node is added, which is too costly.
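The relocation count above can be checked with a few lines of Python. This is only a sketch: it assumes, as the example does, that the data item's number serves directly as its hash value, and the helper name `assign_by_modulo` is made up for illustration.

```python
def assign_by_modulo(keys, nodes):
    """Map each integer key to a node name using key % number_of_nodes."""
    return {key: nodes[key % len(nodes)] for key in keys}


keys = range(10)
before = assign_by_modulo(keys, ["a", "b", "c"])
after = assign_by_modulo(keys, ["a", "b", "c", "d"])

# Keys whose owning node changed after node d was added.
moved = [key for key in keys if before[key] != after[key]]
print(moved)  # [3, 4, 5, 6, 7, 8, 9]
```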
Consistent hashing
The key difference is that both the nodes and the data are hashed into the same value space. The hash value of each piece of data is then compared with the node hash values, and the data is stored on the closest node whose hash value is not smaller than its own. This ensures that when nodes are added or removed, as little data as possible is affected. Take the same example again (using the ASCII code of a simple string as the hash key):
Calculate the hash value of each of the ten pieces of data:
- 0:192
- 1:196
- 2:200
- 3:204
- 4:208
- 5:212
- 6:216
- 7:220
- 8:224
- 9:228
There are three nodes; calculate their hash values as well:
- node a: 203
- node g: 209
- node z: 228
Now compare the two sets of hash values: each piece of data goes to the first node whose hash value is greater than or equal to its own. A data hash greater than 228 would wrap around to the first node, 203, so the whole hash space behaves like a ring. The resulting mapping:
- node a: 0,1,2
- node g: 3,4
- node z: 5,6,7,8,9
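The lookup rule can be sketched with a sorted list and a binary search. The hash values below are copied from the lists above rather than computed; the function names `build_ring` and `lookup` are hypothetical.

```python
from bisect import bisect_left

# Hash values taken from the example above.
DATA_HASHES = {0: 192, 1: 196, 2: 200, 3: 204, 4: 208,
               5: 212, 6: 216, 7: 220, 8: 224, 9: 228}
NODE_HASHES = {"a": 203, "g": 209, "z": 228}


def build_ring(node_hashes):
    """Return the nodes as a list of (hash, name) pairs sorted by hash."""
    return sorted((h, name) for name, h in node_hashes.items())


def lookup(ring, data_hash):
    """Return the first node whose hash is >= data_hash, wrapping around."""
    positions = [h for h, _ in ring]
    index = bisect_left(positions, data_hash)
    if index == len(ring):  # past the last node: wrap to the first one
        index = 0
    return ring[index][1]


ring = build_ring(NODE_HASHES)
mapping = {name: [] for name in NODE_HASHES}
for key, data_hash in DATA_HASHES.items():
    mapping[lookup(ring, data_hash)].append(key)

print(mapping)  # {'a': [0, 1, 2], 'g': [3, 4], 'z': [5, 6, 7, 8, 9]}
print(lookup(ring, 230))  # a
```

The last line shows the wrap-around: a hash larger than every node hash lands on the first node of the ring.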
Now add node n and calculate its hash value:
- node n: 216
The corresponding data is then migrated:
- node a: 0,1,2
- node g: 3,4
- node n: 5,6
- node z: 7,8,9
Only 5 and 6 need to be migrated, both from node z to node n.
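To confirm which data moves, the ownership of every item can be compared before and after node n joins. This is a minimal sketch using the example's hash values; the `owner` helper is an illustrative name, and the ring is rebuilt on every call purely for brevity.

```python
from bisect import bisect_left

# Reproduces the data hash values listed above: 192, 196, ..., 228.
DATA_HASHES = {key: 192 + 4 * key for key in range(10)}


def owner(node_hashes, data_hash):
    """Return the first node clockwise from data_hash on the ring."""
    ring = sorted((h, name) for name, h in node_hashes.items())
    index = bisect_left([h for h, _ in ring], data_hash)
    return ring[index % len(ring)][1]  # the modulo handles the wrap-around


old_nodes = {"a": 203, "g": 209, "z": 228}
new_nodes = {**old_nodes, "n": 216}

# For each key that changed owner, record (old owner, new owner).
migrated = {}
for key, data_hash in DATA_HASHES.items():
    old, new = owner(old_nodes, data_hash), owner(new_nodes, data_hash)
    if old != new:
        migrated[key] = (old, new)

print(migrated)  # {5: ('z', 'n'), 6: ('z', 'n')}
```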
In addition, with only three node hash values on the ring, the distribution can easily become unbalanced (in the first mapping above, node z holds five items while node g holds two). The concept of virtual nodes is introduced to address this: by appending an ID suffix to each node's name, or by some other means, each node computes n hash values, which are spread around the hash ring, and the hash value of each piece of data is compared against all of them.
Using this algorithm for data distribution can greatly reduce the scale of data migration when adding or removing nodes.
Virtual nodes
Because server nodes are placed on the ring by their hash values, they are sometimes spaced unevenly, which leads to uneven data distribution. Adding virtual nodes greatly increases the number of points on the hash ring, so the points belonging to each server are scattered more evenly and each server's share of the data is closer to equal.
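A ring with virtual nodes might be sketched as follows. Several details here are assumptions not taken from the text: MD5 as the hash function, the `"#i"` suffix format, 100 virtual nodes per server, and the `HashRing` class and its method names.

```python
import hashlib
from bisect import bisect_left
from collections import Counter


def hash_key(text):
    """Hash a string to a 32-bit ring position (MD5 is an arbitrary choice)."""
    digest = hashlib.md5(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


class HashRing:
    def __init__(self, nodes, replicas=100):
        self.replicas = replicas  # virtual nodes per real node (assumed value)
        self._ring = []           # sorted list of (position, real node name)
        self._positions = []
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        # Each virtual node is the real name plus an ID suffix, e.g. "a#17".
        for i in range(self.replicas):
            self._ring.append((hash_key(f"{node}#{i}"), node))
        self._ring.sort()
        self._positions = [pos for pos, _ in self._ring]

    def get_node(self, key):
        # First virtual node at or after the key's position, wrapping around.
        index = bisect_left(self._positions, hash_key(key))
        return self._ring[index % len(self._ring)][1]


ring = HashRing(["a", "g", "z"])
keys = [f"key-{i}" for i in range(1000)]
before = {key: ring.get_node(key) for key in keys}
print(Counter(before.values()))  # load per node; roughly balanced

ring.add_node("n")
after = {key: ring.get_node(key) for key in keys}
moved = [key for key in keys if before[key] != after[key]]
print(len(moved), "of", len(keys), "keys moved, all of them to node n")
```

With more virtual nodes per server the load tends to even out further, at the cost of a larger ring to store and search.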