Android Interview: Common Hash Algorithms


The Hash topic is split into three posts; readers can jump to whichever one they need:

  1. Hash principles explained in detail
  2. Common Hash algorithms (this post)
  3. Hash interview questions

Writing these posts takes real effort; your likes and bookmarks keep me going, so please don't forget to like and bookmark ^_^!

This part mainly explains several Hash algorithms that are often used in actual development.

1. Consistent Hash Algorithm

The consistent hash algorithm is a distributed hash table (DHT) technique proposed at MIT in 1997. Its design goal was to solve hot-spot problems on the Internet, with an intent very similar to CARP. Consistent hashing fixes the problems caused by the simple hash algorithm used by CARP, allowing DHTs to be truly practical in P2P environments.

In short, the principle of consistent hashing is that removing or adding a cache node changes the existing key-to-node mapping as little as possible, satisfying the monotonicity requirement as far as possible.

Usage scenario

For example, given N cache servers (hereinafter simply "caches"), how do you map an object to one of the N caches? The obvious approach is something like hash(object) % N: compute the object's hash value and use the remainder to spread objects evenly across the N caches.

However, this simple remainder-based hashing does not work well in a distributed setting. Suppose a large amount of user data is stored across a distributed system using hash(object) % N, where N is the number of cache servers (nodes). This scheme fails the requirements of a distributed system, as the following cases show:

  1. If server a among the N cache servers fails and must be removed from the group, the number of cache servers becomes N-1 and the mapping formula becomes hash(object) % (N-1). This changes the mapping between almost all objects and cache servers.

  2. Similarly, if traffic grows and a cache server is added, the number of servers becomes N+1 and the formula becomes hash(object) % (N+1), which again invalidates almost all cached mappings.

  3. As hardware keeps getting more powerful, you may want newer nodes to take on more work, but hash(object) % N gives you no way to weight the allocation.

Cases 1 and 2 mean that almost the entire cache suddenly becomes invalid. That is a disaster for the servers: a flood of requests will hit the back-end storage directly.

In a distributed cluster, adding and removing machines, or automatically dropping a machine from the cluster after it fails, is the most basic function of cluster management. With the common hash(object) % N scheme, a large amount of existing data can no longer be found after a machine is added or removed, which seriously violates monotonicity. The consistent hash algorithm was introduced to solve exactly these problems. A quick sketch of how badly modulo hashing remaps keys is shown below.
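
As an illustration, the following minimal Java sketch (not from the original post; CRC32 and the key names are just stand-ins for whatever hash and keys a real system uses) counts how many of 100,000 keys land on a different server when one of ten servers is removed under hash(object) % N:

```java
import java.util.zip.CRC32;

/**
 * A minimal sketch showing how many keys are remapped when a node is
 * removed under simple modulo hashing, hash(object) % N.
 */
public class ModuloRemapDemo {
    // Hypothetical helper: hash a key to a non-negative long using CRC32.
    static long hash(String key) {
        CRC32 crc = new CRC32();
        crc.update(key.getBytes());
        return crc.getValue();
    }

    public static void main(String[] args) {
        int before = 10;          // N cache servers
        int after = 9;            // one server removed -> N - 1
        int keys = 100_000;
        int moved = 0;
        for (int i = 0; i < keys; i++) {
            String key = "object-" + i;
            if (hash(key) % before != hash(key) % after) {
                moved++;
            }
        }
        // Typically ~90% of keys land on a different server after the change.
        System.out.printf("moved %d of %d keys (%.1f%%)%n",
                moved, keys, 100.0 * moved / keys);
    }
}
```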

Consistent hash algorithm requirements

In a dynamically changing cache environment, a good consistent hash algorithm should satisfy the following properties:
1. Balance
Balance means that the hash results are spread across all buffers (caches) as evenly as possible, so that all of the buffer space gets used. Many hash algorithms can satisfy this condition.

2. Monotonicity
Monotonicity means that if some content has already been assigned to buffers by hashing and a new buffer is then added to the system, the hash should guarantee that previously assigned content maps either to its original buffer or to the new buffer, but never to a different buffer from the old buffer set.

Simple hash algorithms often fail monotonicity, for example the simplest linear hash x = (ax + b) mod P, where P is the total number of buffers. It is easy to see that when the buffer count changes (from P1 to P2), all the original hash results change, which violates monotonicity. A change of all hash results means that every mapping in the system has to be updated whenever the buffer space changes.

In a P2P system, a buffer change corresponds to a peer joining or leaving the system, which happens frequently and would cause enormous computation and data-transfer load. Monotonicity requires that the hash algorithm cope with this situation.

3. Dispersion (Spread)
In a distributed environment, a terminal may not see all the buffers but only a subset. When a terminal maps content to buffers through hashing, different terminals may see different buffer ranges and therefore compute inconsistent results: the same content gets mapped to different buffers by different terminals. This should obviously be avoided, because storing the same content in multiple buffers reduces the storage efficiency of the system.

The definition of dispersion is the severity of the occurrence of the above situation. A good hash algorithm should be able to avoid inconsistencies as much as possible, that is, minimize dispersion.

4. Load
The load problem is the dispersion problem seen from another angle. Since different terminals may map the same content to different buffers, a particular buffer may also be mapped to different content by different users. Like dispersion, this situation should be avoided, so a good hash algorithm should minimize the load on each buffer.

5. Smoothness
Smoothness means that a gradual change in the number of cache servers should result in an equally gradual change in which cached objects have to move.

Hash ring space

A common hash function maps keys into a space of 2^32 buckets, that is, the integer range 0 to 2^32-1. Now imagine joining this range head to tail into a closed ring, organized clockwise, with 0 and 2^32-1 meeting at the zero point, as shown below:
[Figure: the hash ring spanning 0 to 2^32-1, arranged clockwise]

1. First, map the data (objects) onto the ring with a hash function.
Suppose there are four objects object1, object2, object3, and object4. Compute each one's key with the chosen hash function and place it on the hash ring, as shown in the following figure:
Hash(object1)=key1;
Hash(object2)=key2;
Hash(object3)=key3;
Hash(object4)=key4;
[Figure: object1 through object4 mapped onto the hash ring]

2. Next, map the machines onto the ring with the same hash function.
When new machines are added to a distributed cluster that uses consistent hashing, the principle is to map each machine onto the same hash ring using the same hash algorithm as for the objects (typically hashing the machine's IP address or a unique alias).

In this ring space, start from the object's key and walk clockwise until you meet a cache node; the object is stored on that node. Because the hash values of both the object and the cache nodes are fixed, the target cache is unique and deterministic. In this way, every object is stored on the node closest to it in the clockwise direction.

Assuming that there are three machines NODE1, NODE2, and NODE3, the corresponding KEY value is obtained through the hash algorithm and mapped to the ring. The schematic diagram is as follows:
Hash(NODE1)=KEY1;
Hash(NODE2)=KEY2;
Hash(NODE3)=KEY3;
[Figure: NODE1, NODE2, and NODE3 mapped onto the hash ring together with the objects]
From the above figure we can see that the objects and the machines share the same hash space. Walking clockwise, object1 is stored on NODE1, object3 is stored on NODE2, and object2 and object4 are stored on NODE3.

In such a deployment environment, the hash ring will not change. Therefore, by calculating the hash value of the object, the corresponding machine can be quickly located, so that the true storage location of the object can be found.

Machine removal and addition

The biggest weakness of ordinary remainder hashing is that adding or removing machines invalidates a large number of object storage locations, badly violating monotonicity. Let's look at how consistent hashing handles the same changes.

1. Removal of a node (machine)
Take the cluster above as an example. If NODE2 fails and is removed, then under the clockwise migration rule object3 migrates to NODE3; only object3's mapping changes, and all other objects stay where they were, as shown in the figure below:
[Figure: the ring after NODE2 is removed; object3 migrates to NODE3]


2. Addition of a node (machine)
If a new node NODE4 is added to the cluster, its KEY4 is computed with the same hash algorithm and mapped onto the ring, as shown in the following figure:
[Figure: the ring after NODE4 is added; object2 migrates to NODE4]
Following the clockwise migration rule, object2 migrates to NODE4 while all other objects keep their original storage locations. From this analysis of node addition and removal, the consistent hash algorithm preserves monotonicity while keeping data migration to a minimum. Such an algorithm suits distributed clusters very well: it avoids migrating large amounts of data and reduces pressure on the servers.

Balance analysis

From the graphical analysis above, the consistent hash algorithm satisfies monotonicity and load balancing, as well as the dispersion of an ordinary hash algorithm, but that alone does not account for its wide adoption, because balance is still missing.

Let's look at how consistent hashing achieves balance. The hash function itself does not guarantee it: for example, if only NODE1 and NODE3 are deployed (NODE2 removed), object1 is stored on NODE1 while object2, object3, and object4 all land on NODE3, which is a very unbalanced state. To achieve balance as far as possible, consistent hashing introduces virtual nodes.

Virtual nodes
A virtual node is a replica of an actual node (machine) in the hash space; one actual node corresponds to several virtual nodes, and this number is called the replication count. Virtual nodes are placed in the hash space by their hash values.

Take the earlier case where only NODE1 and NODE3 are deployed (NODE2 removed), in which the objects were very unevenly distributed. Using a replication count of 2 (two virtual nodes per real node), there are now 4 virtual nodes on the hash ring, and the resulting object mapping is shown below:
[Figure: virtual nodes NODE1-1, NODE1-2, NODE3-1, NODE3-2 on the ring]

From the figure we can read off the mapping: object1->NODE1-1, object2->NODE1-2, object3->NODE3-2, object4->NODE3-1. With virtual nodes introduced, the distribution of objects is much more balanced.

So in actual operation, how does the real object query work? The conversion of objects from hash to virtual node to actual node is as follows:
[Figure: lookup flow from object hash to virtual node to actual node]

The hash of a "virtual node" can use the IP address of the corresponding real node plus a numeric suffix. For example, assume the IP address of NODE1 is 192.168.1.100. Before introducing virtual nodes, NODE1's position on the ring is computed as:

Hash("192.168.1.100");
After introducing virtual nodes, the hash values of the virtual nodes NODE1-1 and NODE1-2 are computed as:
Hash("192.168.1.100#1"); // NODE1-1
Hash("192.168.1.100#2"); // NODE1-2

Virtual nodes map back to actual nodes, and how many replicas each node gets can be customized. A minimal ring implementation with virtual nodes is sketched below.
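
The following is a minimal sketch of a consistent hash ring with virtual nodes, assuming CRC32 as the ring hash and the "ip#i" virtual-node naming described above; a production implementation would typically use a stronger hash and handle concurrency:

```java
import java.nio.charset.StandardCharsets;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.zip.CRC32;

/** A minimal consistent hash ring with virtual nodes (sketch only). */
public class ConsistentHashRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final int replicas;   // number of virtual nodes per real node

    public ConsistentHashRing(int replicas) {
        this.replicas = replicas;
    }

    private static long hash(String key) {
        CRC32 crc = new CRC32();
        crc.update(key.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();     // position on the 0..2^32-1 ring
    }

    public void addNode(String node) {
        for (int i = 1; i <= replicas; i++) {
            ring.put(hash(node + "#" + i), node);   // e.g. "192.168.1.100#1"
        }
    }

    public void removeNode(String node) {
        for (int i = 1; i <= replicas; i++) {
            ring.remove(hash(node + "#" + i));
        }
    }

    /** Walk clockwise from the object's hash to the first virtual node. */
    public String locate(String objectKey) {
        if (ring.isEmpty()) return null;
        SortedMap<Long, String> tail = ring.tailMap(hash(objectKey));
        Long k = tail.isEmpty() ? ring.firstKey() : tail.firstKey();
        return ring.get(k);
    }

    public static void main(String[] args) {
        ConsistentHashRing r = new ConsistentHashRing(2);
        r.addNode("192.168.1.100");   // NODE1
        r.addNode("192.168.1.102");   // NODE3
        System.out.println("object1 -> " + r.locate("object1"));
    }
}
```

Keeping the ring in a sorted map makes the clockwise lookup a single tailMap call, which is why a TreeMap (or an equivalent skip list) is the usual choice here.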

Summary

  1. The consistent hash algorithm only needs to relocate a small part of the data in the ring space for the increase or decrease of nodes, which has good fault tolerance and scalability.
  2. While maintaining the monotonicity of the consistent hashing algorithm, data migration is minimized. Such an algorithm is very suitable for distributed clusters, avoiding a large amount of data migration and reducing the pressure on the server.

2. SimHash algorithm

When comparing the similarity of web pages or text, the ideal hash function should produce the same or similar hash values for nearly identical input. In other words, the similarity of the hash values must directly reflect the similarity of the inputs, so traditional hashes such as MD5 cannot meet this need.

SimHash is a locality-sensitive hash used by Google to deduplicate massive amounts of text. It ultimately converts a text into a 64-bit fingerprint (signature); whether two texts are near-duplicates can then be judged by comparing the Hamming distance between their fingerprints.

The main idea is to reduce dimensionality, map high-dimensional feature vectors to low-dimensional feature vectors, and use the Hamming Distance of the two vectors to determine whether the article is repeated or highly similar.

Locality sensitivity
If two strings have a certain degree of similarity and that similarity is still preserved after hashing, the hash is called locality sensitive.

SimHash algorithm ideas

The simhash algorithm is divided into 5 steps: word segmentation, hashing, weighting, merging, and dimensionality reduction. The specific process is as follows:
1. Word segmentation
Segment the given sentence into effective features, then assign each feature a weight on a five-level scale from 1 to 5 (for a whole text, a feature can be a word and its weight can be the word's frequency).

For example, given the sentence "the author July of the CSDN blog Structure as Method, Algorithm as Way", segmentation yields "CSDN & blog & structure & of & method & algorithm & of & way & of & author & July", and each feature is then weighted: CSDN(4) blog(5) structure(3) of(1) method(2) algorithm(3) of(1) way(2) of(1) author(5) July(5).
The number in parentheses indicates how important the word is within the whole sentence; the larger the number, the more important the word.

2. Hash
Compute each feature's hash value with a hash function; the hash value is an n-bit signature made of binary 0s and 1s.
For example, suppose Hash("CSDN") is 100101 and Hash("blog") is 101011. In this way the strings become sequences of numbers.

The hash algorithm used for the features is up to you, but all features should hash to the same length; 64 bits is a common choice because it fits exactly into a long.

3. Weighting
Weight every feature based on its hash value, that is W = Hash * weight: where a bit is 1, it contributes +weight; where a bit is 0, it contributes -weight.

For example, weighting the hash value 100101 of "CSDN" gives W(CSDN) = 100101 * 4 = 4 -4 -4 4 -4 4, and weighting the hash value 101011 of "blog" gives W(blog) = 101011 * 5 = 5 -5 5 -5 5 5; the remaining features are handled the same way.

4. Merge
Accumulate the weighted vectors of all features, position by position, into a single sequence. Taking just the first two features as an example, adding "4 -4 -4 4 -4 4" (CSDN) and "5 -5 5 -5 5 5" (blog) gives "4+5, -4-5, -4+5, 4-5, -4+5, 4+5", that is "9 -9 1 -1 1 9".

5. Dimensionality reduction
For each position of the accumulated n-bit result, output 1 if the value is greater than 0 and 0 otherwise. This yields the sentence's SimHash value, and the similarity of different sentences can then be judged from the Hamming distance of their SimHash values.

For example, reducing the "9 -9 1 -1 1 9" computed above (positions greater than 0 become 1, less than 0 become 0) gives the bit string "1 0 1 0 1 1", which is the SimHash signature. A compact sketch of the whole procedure follows.
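
The five steps can be condensed into a short sketch. The per-feature 64-bit hash below (hash64) is a hypothetical stand-in based on String.hashCode(); a real system would use a stronger hash, and the word segmentation and weights are assumed to be supplied by the caller:

```java
import java.util.Map;

/** A minimal 64-bit SimHash sketch following the five steps above. */
public class SimHash {
    static long fingerprint(Map<String, Integer> weightedFeatures) {
        int[] v = new int[64];                       // merge vector
        for (Map.Entry<String, Integer> e : weightedFeatures.entrySet()) {
            long h = hash64(e.getKey());
            int w = e.getValue();
            for (int i = 0; i < 64; i++) {
                // weighting: a 1 bit contributes +weight, a 0 bit contributes -weight
                v[i] += ((h >>> i) & 1L) == 1L ? w : -w;
            }
        }
        long sig = 0L;
        for (int i = 0; i < 64; i++) {               // dimensionality reduction
            if (v[i] > 0) sig |= (1L << i);
        }
        return sig;
    }

    // Hypothetical 64-bit feature hash: spreads String.hashCode() over 64 bits.
    static long hash64(String s) {
        long h = s.hashCode();
        return h * 0x9E3779B97F4A7C15L;
    }

    public static void main(String[] args) {
        long a = fingerprint(Map.of("CSDN", 4, "blog", 5, "algorithm", 3));
        long b = fingerprint(Map.of("CSDN", 4, "blog", 5, "structure", 3));
        System.out.println("hamming = " + Long.bitCount(a ^ b));
    }
}
```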

Hamming distance

In information theory, the Hamming distance between two strings of equal length is the number of positions at which the corresponding characters differ. In other words, it is the number of substitutions needed to transform one string into the other. For example, the Hamming distance between 1011101 and 1001001 is 2.

The familiar string edit distance is the general form of which Hamming distance is a special case. By comparing the Hamming distances between documents' SimHash values, we obtain their similarity.

Hamming distance calculation

  1. Suppose two texts have SimHash(A) = 100111 and SimHash(B) = 101010.
  2. Their Hamming distance is hamming_distance(A, B) = count_1(A xor B) = count_1(001101) = 3.

Then check whether the Hamming distance between A and B is at most n; empirically n is usually 3, and a distance of 3 or less indicates the texts are very similar. In code this check is essentially a one-liner, as the sketch below shows.
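
A minimal sketch of that check on 64-bit signatures, using XOR plus a population count:

```java
/** Hamming distance between two 64-bit SimHash signatures. */
public class HammingDistance {
    static int distance(long simhashA, long simhashB) {
        // Count the 1 bits in the XOR of the two signatures.
        return Long.bitCount(simhashA ^ simhashB);
    }

    static boolean similar(long a, long b) {
        return distance(a, b) <= 3;   // empirical threshold n = 3
    }

    public static void main(String[] args) {
        // Matches the worked example above: 100111 xor 101010 = 001101 -> three 1 bits.
        System.out.println(distance(0b100111L, 0b101010L)); // prints 3
    }
}
```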

Hamming distance at scale
With large-scale data, comparing a query's 64-bit SimHash bit by bit against every stored text to find those within Hamming distance 3 consumes a lot of time and resources.

So how do we search a massive sample database for records within a Hamming distance of 3?

  1. One option is to enumerate, at query time, every variant of the query's 64-bit SimHash that differs by up to 3 bits.
  2. Another option is to pre-generate, for every sample SimHash in the library, all variants within 3 bits.
    The first is expensive in time and the second in space. Is there a solution that strikes a good balance between the two?

A better idea that balances the time and space complexity of the Hamming-distance search is:

  1. Split each 64-bit SimHash into four 16-bit blocks. If two SimHash values are similar (Hamming distance at most 3), then by the pigeonhole principle at least one block must be exactly identical.
  2. Use the identical block as an index key to retrieve candidates, then compute the full Hamming distance only for those candidates (see the sketch after this list).
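
A sketch of this indexing idea follows; the class and method names are illustrative only, and it assumes the 16-bit block split with the empirical threshold of 3:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Pigeonhole-based SimHash index: four inverted tables, one per 16-bit block. */
public class SimHashIndex {
    // One table per block position: block value -> signatures containing it.
    private final List<Map<Integer, List<Long>>> tables = new ArrayList<>();

    public SimHashIndex() {
        for (int i = 0; i < 4; i++) tables.add(new HashMap<>());
    }

    private static int block(long sig, int i) {
        return (int) ((sig >>> (i * 16)) & 0xFFFFL);   // i-th 16-bit block
    }

    public void add(long sig) {
        for (int i = 0; i < 4; i++) {
            tables.get(i).computeIfAbsent(block(sig, i), k -> new ArrayList<>()).add(sig);
        }
    }

    /** Return stored signatures within Hamming distance 3 of the query. */
    public List<Long> query(long sig) {
        List<Long> hits = new ArrayList<>();
        for (int i = 0; i < 4; i++) {
            // Only candidates sharing this block can be within distance 3.
            for (long candidate : tables.get(i).getOrDefault(block(sig, i), List.of())) {
                if (Long.bitCount(candidate ^ sig) <= 3 && !hits.contains(candidate)) {
                    hits.add(candidate);
                }
            }
        }
        return hits;
    }
}
```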

SimHash application

After each document's SimHash signature has been computed, calculate the Hamming distance between the two signatures.
Empirically, for 64-bit SimHash values, a Hamming distance within 3 indicates high similarity.

For example, to compare the content of multiple documents:

  1. Extract keywords from each document (word segmentation plus weight calculation) to obtain n (keyword, weight) pairs, i.e. the (feature, weight) pairs in the figure, recorded as feature_weight_pairs = [fw1, fw2, ..., fwn], where fwn = (feature_n, weight_n).

  2. Hash each feature in feature_weight_pairs, then accumulate the hash_weight_pairs column by column: where a bit is 1 add +weight, where it is 0 add -weight. This yields bits_count numbers; positions greater than 0 become 1 and the rest become 0.

  3. The result is a 64-bit fingerprint. To detect duplication, you only need to check whether the Hamming distance between two fingerprints is below n (empirically n = 3) to decide whether the two documents are similar.
    [Figure: the SimHash pipeline from (feature, weight) pairs to the 64-bit fingerprint]

    When two texts differ by only a single word, an ordinary hash produces two completely different results, whereas SimHash's locality sensitivity causes only a small part of the fingerprint to change.
    [Figure: ordinary hash vs. SimHash on two texts that differ by one word]

3. GeoHash algorithm

GeoHash is a spatial geocoding system invented by Gustavo Niemeyer that transforms the latitude and longitude of a geographic location into a short string of digits and letters. GeoHash is a hierarchical spatial data structure: it subdivides space into grid-shaped buckets following a Z-order curve, one of the so-called space-filling curves.

GeoHash is best understood as an algorithmic idea: it represents two-dimensional coordinate points as strings, so that all elements are laid out on a single line and points that are close in two dimensions usually map to one-dimensional points that are also close. Nearby target elements can then be found by comparing the similarity of GeoHash values.

The basic principle of GeoHash is to treat the earth as a two-dimensional plane and recursively decompose that plane into smaller sub-blocks; every point within a given latitude/longitude range of a sub-block shares the same code. The method is simple and crude, but it meets the needs of latitude/longitude search over modest amounts of data.

GeoHash usage example

The earth's latitude range is [-90, 90]. Take the coordinates of Shanghai Daning International Plaza, (121.458797, 31.280291), as an example.

1. Encoding the latitude
The latitude of Daning International Plaza is 31.280291, which can be approximated and encoded by the following algorithm:

  • 1) The interval [-90,90] is split into the left interval [-90,0) and the right interval [0,90]. Since 31.280291 falls in the right interval [0,90], record a 1;
  • 2) The interval [0,90] is then split into [0,45) and [45,90]. Since 31.280291 falls in the left interval [0,45), record a 0;
  • 3) Recursing in this way, 31.280291 always lies in some interval [a, b], and each iteration shrinks [a, b] closer and closer around 31.280291;
  • 4) Whenever the given latitude x (31.280291) falls in the left interval, record a 0; when it falls in the right interval, record a 1. As the algorithm proceeds it generates the sequence 1010110001, whose length depends on how many times the interval is divided. The iterations are traced in the table below:
bit   min        mid          max
1     -90.000    0.000        90.000
0     0.000      45.000       90.000
1     0.000      22.500       45.000
0     22.500     33.750       45.000
1     22.500     28.125       33.750
1     28.125     30.9375      33.750
0     30.9375    32.34375     33.750
0     30.9375    31.640625    32.34375
0     30.9375    31.2890625   31.640625
1     30.9375    31.11328125  31.2890625

2. Encoding the longitude
Similarly, the earth's longitude range is [-180,180], and the longitude 121.458797 is encoded the same way:

bit min mid max
1 -180 0 180
1 0 90 180
0 90 135 180
1 90 112.5 135
0 112.5 123.75 135
1 112.5 118.125 123.75
1 118.125 120.9375 123.75
0 120.9375 122.34375 123.75
0 120.9375 121.640625 122.34375
1 120.9375 121.2890625 121.640625

3. Combining the codes
From the calculations above, the latitude yields the code 10101 10001 and the longitude yields 11010 11001. The longitude bits are placed in the even positions and the latitude bits in the odd positions (numbering positions from 0, starting at the left), and interleaving the two codes produces the new string 11100 11001 11100 00011.

Finally, Base32-encode the result using the 32 characters 0-9 and b-z (with a, i, l, o removed): convert each 5-bit group of 11100 11001 11100 00011 to decimal, giving 28, 25, 28, 3, which corresponds to the code wtw3.

Decimal 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Base32  0 1 2 3 4 5 6 7 8 9 b  c  d  e  f  g
Decimal 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
Base32  h  j  k  m  n  p  q  r  s  t  u  v  w  x  y  z

Decoding works the same way in reverse, converting a code back into a latitude/longitude range. To find nearby points, compare their codes with wtw3: similar prefixes indicate nearby locations. A minimal encoder following these steps is sketched below.
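
The following Java sketch implements the bisection, interleaving, and Base32 steps above; the method names are illustrative and precision handling is simplified:

```java
/** A minimal GeoHash encoder: bisect lat/lon ranges, interleave bits, Base32-encode. */
public class GeoHashEncoder {
    private static final char[] BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz".toCharArray();

    static String encode(double lat, double lon, int precision) {
        double[] latRange = {-90.0, 90.0};
        double[] lonRange = {-180.0, 180.0};
        StringBuilder bits = new StringBuilder();
        boolean evenBit = true;                       // even positions come from longitude
        while (bits.length() < precision * 5) {
            double[] range = evenBit ? lonRange : latRange;
            double value = evenBit ? lon : lat;
            double mid = (range[0] + range[1]) / 2;
            if (value >= mid) { bits.append('1'); range[0] = mid; }   // right half
            else              { bits.append('0'); range[1] = mid; }   // left half
            evenBit = !evenBit;
        }
        StringBuilder geohash = new StringBuilder();
        for (int i = 0; i < bits.length(); i += 5) {  // 5 bits per Base32 character
            geohash.append(BASE32[Integer.parseInt(bits.substring(i, i + 5), 2)]);
        }
        return geohash.toString();
    }

    public static void main(String[] args) {
        // Shanghai Daning International Plaza, 4-character precision as in the text.
        System.out.println(encode(31.280291, 121.458797, 4));
    }
}
```

With a precision of 4 characters this reproduces the wtw3 code worked out by hand above.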

Advantages and disadvantages of GeoHash

Advantages:
1. Latitude and longitude are stored in a single field, so a search needs only one index, which is more efficient.
2. A code prefix denotes a larger area, which makes nearby search very convenient; in SQL, LIKE 'wm3yr3%' returns all nearby locations.
3. Coordinates can be blurred for privacy by choosing a coarser encoding precision.

Disadvantages: computing exact distances and sorting requires a second pass (run over the filtered results, which is actually quite fast).

Writing these posts takes real effort; your likes and bookmarks keep me going, so please don't forget to like and bookmark ^_^!

Related Links

  1. Hash principles explained in detail
  2. Hash interview questions
  3. Getting started with Android CameraX
  4. The difference between finish(), onBackPressed(), and onDestroy()


Origin blog.csdn.net/luo_boke/article/details/106756297