Distributed Storage and Consistent Hashing

What is a hash?

A hash is the output of a hash algorithm (also known as a digest algorithm). Baidu Baike defines it as follows:

A hash algorithm maps binary values of arbitrary length to shorter, fixed-length binary values; this small binary value is called the hash value. The hash value is an effectively unique and extremely compact numerical representation of a piece of data.

1. This definition has several important points. First, the input is binary data of any length. In Java, any object can be represented this way (via serialization).

2. The output has a fixed length, which lets HashMap and similar structures perform bit operations on the high and low bits, and provides a uniform representation (a kind of contract).

3. The hash is an effectively unique value for the data, so a hashcode can serve as the basis for lookup (the basis for fast search).
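The three points above can be seen directly in Java. A minimal sketch (the string values are made up for illustration) showing that `hashCode()` turns inputs of any length into a fixed-length 32-bit int, and that equal data always yields an equal hash:

```java
// Sketch: Java's hashCode() maps arbitrary-length data to a fixed-length 32-bit int.
public class HashDemo {
    public static void main(String[] args) {
        String shortKey = "pig";
        String longKey = "a much longer arbitrary-length input string";
        // Both hash to a plain int, regardless of input length.
        System.out.println(shortKey.hashCode());
        System.out.println(longKey.hashCode());
        // Equal data always yields an equal hash value.
        System.out.println("pig".hashCode() == shortKey.hashCode());
    }
}
```

Note the converse does not hold: two different inputs may share a hash value (a collision), which is why "unique" above should be read as "effectively unique".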

Why hash?

Before answering why, let's first look at what would happen without it.

There is an interesting CSDN article about this. Suppose we have a bunch of pigs and want to find a particular one by its weight. Without the idea of hashing, we would compare pigs one by one; would you still do that with 1000 pigs? Introduce hashing: each pig's weight is hashed to a hashcode, and the hashcode is mapped to a pigsty. We then only need to compare the pigs inside one pigsty. The ideal situation is that every pigsty holds the same number of pigs. (Note: one pig per pigsty would be ideal for lookup, but the space cost is huge.)

(Example adapted from: http://blog.csdn.net/ok7758521ok/article/details/4003476)

Java adopts hashing in the same way. Taking the hashcode modulo the number of buckets (in practice this is done with bit operations, which perform better) naturally maps each key to a specific bucket.
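A minimal sketch of that bucket-selection trick (the key string is a made-up example; the bit-spreading step mirrors what Java's HashMap does internally):

```java
// Sketch of HashMap-style bucket selection using bit operations instead of %.
// When the bucket count is a power of two, (hash & (n - 1)) equals the
// non-negative remainder of hash / n, but avoids a division instruction.
public class BucketIndex {
    static int bucketFor(Object key, int numBuckets) {
        int h = key.hashCode();
        h ^= (h >>> 16);             // spread high bits into low bits, as HashMap does
        return h & (numBuckets - 1); // numBuckets must be a power of two
    }

    public static void main(String[] args) {
        // A pig weighing 42kg lands in one of 16 pigsties (buckets 0..15).
        System.out.println(bucketFor("pig-42kg", 16));
    }
}
```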

About distributed storage

When hashing meets distribution, a single machine's HashMap can no longer meet our key-value needs. What should we do? We need to spread the stored content across different physical machines, which requires a way to map keys to machines. Hashing comes to mind: treat each physical machine as a bucket, adopt the same idea as the HashMap implementation, and take the hash modulo the number of physical machines to map keys naturally to different machines.
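A minimal sketch of this naive placement scheme, under the assumption of 3 machines and some made-up keys:

```java
// Minimal sketch: naive distributed placement by taking hash modulo machine count.
import java.util.List;

public class NaivePlacement {
    static int machineFor(String key, int machineCount) {
        // Math.floorMod keeps the result non-negative even when hashCode() is negative.
        return Math.floorMod(key.hashCode(), machineCount);
    }

    public static void main(String[] args) {
        List<String> keys = List.of("user:1", "user:2", "user:3", "user:4");
        for (String k : keys) {
            System.out.println(k + " -> machine " + machineFor(k, 3));
        }
    }
}
```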

Ok, done: distribution can indeed be achieved this way. But now the question arises: what if one of the machines dies, or another machine is added? At that point, one of two things happens:

    1. If we make no changes, the data on the failed machine is lost, and a newly added machine is never used.

    2. If we rehash, the number of buckets changes and all values are remapped and stored again. This has two problems. First, at rehash time every key is remapped at once; under heavy concurrency this is catastrophic, since no request hits any cache and the backing servers risk being crushed. Second, the old data is still there but will never be used again, wasting storage space.
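How bad is the rehash case? A small sketch (machine counts and key range are made-up) that counts how many keys move when the cluster grows from 4 to 5 machines under naive modulo placement; with N to N+1 machines, roughly N/(N+1) of all keys remap:

```java
// Sketch: measure how many keys change machines when the machine count
// changes from 4 to 5 under naive modulo placement.
public class RehashStorm {
    static int machineFor(int key, int machines) {
        return Math.floorMod(Integer.hashCode(key), machines);
    }

    public static void main(String[] args) {
        int total = 10_000, moved = 0;
        for (int key = 0; key < total; key++) {
            if (machineFor(key, 4) != machineFor(key, 5)) moved++;
        }
        // Roughly 4/5 of all keys move to a different machine.
        System.out.printf("moved %d of %d keys%n", moved, total);
    }
}
```

Every one of those moved keys is a cache miss at rehash time, which is exactly the "catastrophic" scenario described above.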

So what can we do?

Introducing consistent hashing

Consistent hashing is a hash algorithm that, simply put, changes the existing key mappings as little as possible when a node (machine, IP) is removed or added, satisfying monotonicity as far as possible.

The hash ring

Every hash value has a fixed length, so the entire space of hash values can be arranged on a ring (why a ring is used will become clear later).

Mapping

The most important step of hashing is mapping objects to their corresponding buckets. Consistent hashing differs from the usual practice: instead of mapping keys directly to buckets, it maps keys and buckets separately, placing each at the position of its hash value on the ring.

 

        Mapping keys: each key is hashed and placed at the corresponding point on the ring.

        Mapping buckets: each bucket (cache node) is hashed the same way and placed on the ring.

Next comes the most important step: mapping each key to its corresponding bucket.

Bucket search

Now that both the caches and the objects have been mapped into the same hash value space by the same hash algorithm, the next thing to consider is how to map each object to a cache.

In this ring space, start from the object's hash value and move clockwise until you encounter a cache; that cache stores the object. Because the hash values of both the object and the cache are fixed, the chosen cache is unique and deterministic. And with that, haven't we found a way to map objects to caches?!

Continuing the above example (see Figure 3): by this method, object1 is stored on cache A, object2 and object3 on cache C, and object4 on cache B.
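The clockwise lookup described above can be sketched with a sorted map. This is a minimal illustration, not the article's exact implementation: the node names are the figure's, and using `hashCode()` as the ring hash is an assumption made for brevity (real systems typically use a stronger hash such as MD5):

```java
// A minimal consistent-hash ring: keys and caches are hashed onto the same
// ring, and a key is served by the first cache found moving clockwise.
import java.util.Map;
import java.util.TreeMap;

public class HashRing {
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    void addNode(String node)    { ring.put(node.hashCode(), node); }
    void removeNode(String node) { ring.remove(node.hashCode()); }

    String nodeFor(String key) {
        // Clockwise search: the first node at or after the key's hash,
        // wrapping around to the ring's smallest hash if none is found.
        Map.Entry<Integer, String> e = ring.ceilingEntry(key.hashCode());
        return (e != null ? e : ring.firstEntry()).getValue();
    }

    public static void main(String[] args) {
        HashRing ring = new HashRing();
        ring.addNode("cache A");
        ring.addNode("cache B");
        ring.addNode("cache C");
        System.out.println("object1 -> " + ring.nodeFor("object1"));
    }
}
```

`TreeMap.ceilingEntry` gives the clockwise successor in O(log n); the wrap-around to `firstEntry` is what makes the space a ring rather than a line.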

 

Benefits

Having described the consistent hashing algorithm at length, what does it actually buy us? Consider adding and removing nodes.

Adding a node

    

Suppose we add a new node D. Moving clockwise, object2, which was originally mapped to C, now maps to D, while object3 still maps to C. The addition affects only object2, that is, only the objects falling between B and D. Compared with the traditional method, this impact is very small.

Removing a node

 

Similarly to adding, a removal only affects the objects between A and B.

 

Virtual nodes (quoted in full from: http://my.oschina.net/jsan/blog/49702)

Another metric for evaluating a hash algorithm is balance, defined as follows:

Balance

Balance means that the hash results should be spread across all buffers as evenly as possible, so that all buffer space is used.

A hash algorithm does not guarantee absolute balance. When there are few caches, objects may not map evenly onto them. In the example above, with only cache A and cache C deployed, of the four objects cache A stores only object1 while cache C stores object2, object3, and object4: a very uneven distribution.

To address this, consistent hashing introduces the concept of the "virtual node", which can be defined as follows:

A "virtual node" is a replica of an actual node in the hash space. Each actual node corresponds to several "virtual nodes", and this number is called the "replica count". "Virtual nodes" are arranged in the hash space by their hash values.

Again taking the case of only cache A and cache C deployed, we saw in Figure 4 that the cache distribution is uneven. Now introduce virtual nodes and set the "replica count" to 2, which means there will be 4 "virtual nodes" in total: cache A1 and cache A2 represent cache A; cache C1 and cache C2 represent cache C. Assuming an ideal arrangement, see Figure 6.


Figure 6 The mapping relationship after the introduction of "virtual nodes"

 

At this point, the mapping between objects and "virtual nodes" is:

object1 -> cache A2; object2 -> cache A1; object3 -> cache C1; object4 -> cache C2

Therefore object1 and object2 are mapped to cache A, and object3 and object4 are mapped to cache C; balance is greatly improved.

After the introduction of "virtual nodes", the mapping relationship is transformed from {object -> node} into {object -> virtual node}. The mapping used when looking up an object's cache is shown in Figure 7.


Figure 7 Query object cache

 

The hash of a "virtual node" can be computed by appending a numeric suffix to the corresponding node's IP address. For example, suppose the IP address of cache A is 202.168.14.241.

Before introducing "virtual nodes", the hash of cache A is computed as:

Hash("202.168.14.241");

After introducing "virtual nodes", the hashes of the "virtual nodes" cache A1 and cache A2 are computed as:

Hash("202.168.14.241#1");  // cache A1

Hash("202.168.14.241#2");  // cache A2
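The suffix scheme above drops straight into the ring from earlier. A minimal sketch, assuming `hashCode()` as the ring hash and a second, hypothetical IP for cache C (the article only gives cache A's address):

```java
// Sketch: virtual nodes. Each physical node is hashed several times with a
// numeric suffix (e.g. "202.168.14.241#1"), so its replicas spread around
// the ring and smooth out the distribution.
import java.util.Map;
import java.util.TreeMap;

public class VirtualNodeRing {
    private final TreeMap<Integer, String> ring = new TreeMap<>();
    private final int replicas; // the "replica count"

    VirtualNodeRing(int replicas) { this.replicas = replicas; }

    void addNode(String node) {
        for (int i = 1; i <= replicas; i++) {
            // Each virtual node "node#i" maps back to its physical node.
            ring.put((node + "#" + i).hashCode(), node);
        }
    }

    String nodeFor(String key) {
        Map.Entry<Integer, String> e = ring.ceilingEntry(key.hashCode());
        return (e != null ? e : ring.firstEntry()).getValue();
    }

    public static void main(String[] args) {
        VirtualNodeRing ring = new VirtualNodeRing(2); // replica count = 2
        ring.addNode("202.168.14.241"); // cache A -> virtual nodes A#1, A#2
        ring.addNode("202.168.14.245"); // hypothetical IP standing in for cache C
        System.out.println("object1 -> " + ring.nodeFor("object1"));
    }
}
```

Because `nodeFor` resolves a virtual node back to its physical node, clients see only {object -> node}; the virtual layer is invisible to them.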


Origin: blog.csdn.net/zjc801/article/details/81563920