Android interview Hash algorithm questions


The Hash topic is split into three posts; readers can jump to the one that matches what they need:

  1. Development interview: detailed explanation of Hash principles
  2. Development interview: common Hash algorithms
  3. Development interview: Hash interview questions

Writing this blog takes effort; your likes and bookmarks keep me going, so please don't forget to like and bookmark ^_^!

1. Hash Top-K Search

Problem description

1) The search engine records, via log files, every query string a user searches for; each query string is 1–255 bytes long.
2) Suppose there are currently 10 million records. The repetition among these query strings is fairly high: although the total is 10 million, there are no more than 3 million distinct strings once duplicates are removed.
3) The more often a query string is repeated, the more users are searching for it and the more popular it is.

Requirement

Find the 10 most popular query strings; the memory used must not exceed 1 GB.

Analysis

There are two lines of attack: a sorting-based solution and a hash-table-based solution.

1. Sorting method
1) The first algorithm that comes to mind is sorting: first sort all the queries in the log, then traverse the sorted queries and count how many times each one appears.

2) But the question explicitly requires that memory not exceed 1 GB. With 10 million records of up to 255 bytes each, the data occupies 255 × 10,000,000 / (1024 × 1024 × 1024) ≈ 2.37 GB, which does not fit.

3) When the data is too large to load into memory, we can use an external sort, for example external merge sort, which has a good time complexity of O(n log n).

4) After sorting, we traverse the ordered query file, count the occurrences of each query, and write the counts back out to a file. This pass is O(n).

5) Overall time complexity: O(n) + O(n log n) = O(n log n).
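
For concreteness, a minimal sketch of the counting pass in Java, assuming the queries have already been externally sorted into a file (one query per line; the file name is only illustrative):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class SortedCount {
    // Counts consecutive duplicates in an already-sorted file of queries,
    // one query per line. Because the file is sorted, equal queries are adjacent.
    public static void main(String[] args) throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader("sorted_queries.txt"))) {
            String prev = null;
            long count = 0;
            String line;
            while ((line = in.readLine()) != null) {
                if (line.equals(prev)) {
                    count++;                       // same query as the previous line
                } else {
                    if (prev != null) {
                        System.out.println(prev + "\t" + count); // emit (query, count)
                    }
                    prev = line;
                    count = 1;
                }
            }
            if (prev != null) {
                System.out.println(prev + "\t" + count);         // flush the last group
            }
        }
    }
}
```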

2. Hash Table Method
For this kind of counting, there is a better option than sorting: a hash table.

Although the problem mentions 10 million queries, the repetition is high, so there are really only 3 million distinct queries, occupying roughly 2.37 GB × 0.3 ≈ 0.71 GB, which fits within the 1 GB memory limit. Lookups in a hash table are very fast, close to O(1).

So the algorithm is: maintain a hash table whose key is the query string and whose value is the number of occurrences of that query, and read the queries one by one. If a query is not yet in the table, insert it with a value of 1; if it is already there, increment its count by one. In this way we process the massive data set in O(n) time.
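
A minimal sketch of this counting step in Java; for illustration the queries are assumed to arrive as an in-memory list, whereas in practice they would be streamed from the log file:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class QueryCounter {
    // Builds the frequency table: key = query string, value = number of occurrences.
    public static Map<String, Integer> countQueries(List<String> queries) {
        Map<String, Integer> counts = new HashMap<>();
        for (String q : queries) {
            // merge() inserts 1 for a new key, otherwise adds 1 to the existing count
            counts.merge(q, 1, Integer::sum);
        }
        return counts;
    }
}
```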

Comparison: compared with the sorting method, the hash table improves the time complexity by an order, down to O(n). In addition, the hash table solution needs only a single I/O pass over the data, which also makes it easier to operate.

Supplement

Algorithm: partial sort
The problem only asks for the Top 10, so we do not need to sort all the queries. We only need to maintain an array of size 10, initialize it with the first 10 queries, sort it by count from largest to smallest, and then traverse the remaining records. Each record read is compared with the last (smallest) query in the array: if its count is smaller, we keep traversing; otherwise, the last element of the array is evicted and the current query is inserted in its sorted position.

Finally, when all the data has been traversed, the 10 queries in this array are the Top 10 we are looking for. It is not hard to see that the worst-case time complexity of this approach is N*K, where K is the size of the top list (10 here).
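
A rough sketch of this partial-sort idea, keeping the K most frequent (query, count) pairs in a descending array; the input map is assumed to be the frequency table produced by the hash-table step:

```java
import java.util.Map;

public class TopKArray {
    // Keeps the K most frequent (query, count) pairs in an array sorted by
    // descending count; each new entry is compared with the smallest kept one,
    // so the worst case per element is O(K).
    @SuppressWarnings("unchecked")
    public static Map.Entry<String, Integer>[] topK(Map<String, Integer> counts, int k) {
        Map.Entry<String, Integer>[] top = new Map.Entry[k];   // top[k-1] is the smallest kept
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (top[k - 1] != null && e.getValue() <= top[k - 1].getValue()) {
                continue;                       // not better than the current last place
            }
            int i = k - 1;
            while (i > 0 && (top[i - 1] == null || top[i - 1].getValue() < e.getValue())) {
                top[i] = top[i - 1];            // shift smaller entries down one slot
                i--;
            }
            top[i] = e;                         // insert in sorted position
        }
        return top;
    }
}
```

For this problem it would be called as topK(counts, 10) on the table built earlier.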

Algorithm: heap
In the previous algorithm, each insertion costs O(K), because elements are kept in a linear array and located by sequential comparison.

Note that the array is kept in order, so we can use binary search to locate the insertion position, reducing the search cost to O(log K). The problem that follows is data movement: elements still have to be shifted to make room, so the amount of data movement grows. Even so, this is an improvement over the sequential-comparison version.

Based on the above analysis, is there a data structure that can both find and move elements quickly? Yes: the heap.
With a heap we can find and adjust/move elements in logarithmic time. So the algorithm becomes: maintain a min-heap of size K (10 in this problem), traverse the 3 million distinct queries, and compare each one with the root element.

Compared with the array-based approaches, using a min-heap instead of an array reduces the cost of maintaining the top list from O(K) to O(log K) per element, so the final time complexity drops to N'·log K (where N' is the number of distinct queries).
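
A rough sketch of the heap-based version, using Java's PriorityQueue as the size-K min-heap; again the input map is assumed to come from the hash-table counting step:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

public class TopKHeap {
    // Keeps a min-heap of size K ordered by count; the root is always the
    // smallest count among the current top K, so each element costs O(log K).
    public static List<Map.Entry<String, Integer>> topK(Map<String, Integer> counts, int k) {
        PriorityQueue<Map.Entry<String, Integer>> heap =
                new PriorityQueue<>(k, Map.Entry.<String, Integer>comparingByValue());
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (heap.size() < k) {
                heap.offer(e);                           // heap not full yet: just add
            } else if (e.getValue() > heap.peek().getValue()) {
                heap.poll();                             // evict the current smallest
                heap.offer(e);
            }
        }
        // Drain the heap into a list sorted by descending count.
        List<Map.Entry<String, Integer>> result = new ArrayList<>(heap);
        result.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        return result;
    }
}
```

The comparison against peek() before inserting is what keeps the per-element cost at O(log K).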

Summary

The best solution is to use a hash table to count the occurrences of each query in O(N), and then use a heap to find the Top 10 in N'·O(log K), so the total time complexity is O(N) + O(N'·log K), where N = 10 million (all records) and N' = 3 million (distinct queries).

2. SimHash application

Problem description

How would you design an algorithm to compare the similarity of two articles?

Requirement

Use the fastest and most effective way to compare the similarity of two articles.

Analysis

The two most obvious solutions are:

  1. One solution is to segment (tokenize) the two articles separately to obtain feature vectors, then compute the distance between the feature vectors (Euclidean distance, Hamming distance, the cosine of the angle, etc.), and judge the similarity of the two articles by the size of that distance (a rough cosine-similarity sketch follows this list).
  2. The other solution is traditional hashing: generate a fingerprint for each web document with a hash function and compare the fingerprints.
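
For reference, a minimal sketch of the first option for just two documents, using naive whitespace tokenization and cosine similarity over term-frequency vectors (both choices are illustrative simplifications, not what a production system would use):

```java
import java.util.HashMap;
import java.util.Map;

public class CosineSimilarity {
    // Splits the text on whitespace and builds a term-frequency vector.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase().split("\\s+")) {
            tf.merge(token, 1, Integer::sum);
        }
        return tf;
    }

    // Cosine of the angle between the two term-frequency vectors:
    // 1.0 means identical word distributions, values near 0 mean little overlap.
    public static double cosine(String a, String b) {
        Map<String, Integer> ta = termFrequencies(a), tb = termFrequencies(b);
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : ta.entrySet()) {
            dot += e.getValue() * tb.getOrDefault(e.getKey(), 0);
            normA += e.getValue() * e.getValue();
        }
        for (int v : tb.values()) {
            normB += v * v;
        }
        if (normA == 0 || normB == 0) {
            return 0;                     // guard against empty input
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```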

Comparing the two solutions
+ Take the first method. It works fine if you only need to compare two articles, but with a huge amount of data, millions or even hundreds of millions of web pages whose similarity must be computed, would you still compute the distance or the cosine of the angle between every pair of pages? Presumably not.

+ The traditional cryptographic hash md5 used in the second scheme is designed to distribute its output as uniformly as possible, so even a slight change in the input produces a drastically different hash value. If only one word changes, the hash values are completely different and cannot be compared:
md5("the cat sat on the mat") = 415542861
md5("the cat sat on a mat") = 668720516

An ideal hash function for this task should generate identical or similar hash values for nearly identical input; in other words, the similarity of the hash values must directly reflect the similarity of the input content. Traditional hash functions such as md5 cannot satisfy this requirement.
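
As a sketch of how simhash fingerprints are typically compared (assuming each document has already been reduced to a 64-bit fingerprint; the near-duplicate threshold of 3 bits is a common convention, not something fixed by this post):

```java
public class SimhashCompare {
    // Hamming distance between two 64-bit simhash fingerprints:
    // the number of bit positions in which they differ.
    static int hammingDistance(long fp1, long fp2) {
        return Long.bitCount(fp1 ^ fp2);
    }

    // Documents are often treated as near-duplicates when the distance
    // is at most 3 bits (the threshold is tunable).
    static boolean nearDuplicate(long fp1, long fp2) {
        return hammingDistance(fp1, fp2) <= 3;
    }
}
```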

The simhash algorithm was proposed by Moses Charikar and was applied by Google in the paper "Detecting Near-Duplicates for Web Crawling" specifically to the task of deduplicating hundreds of millions of web pages.
Please see my blog post "Development interview: common Hash algorithms" for its specific usage and principles.

Summary

Simhash handles the near-duplicate judgment for web pages quickly and effectively.

Writing this blog takes effort; your likes and bookmarks keep me going, so please don't forget to like and bookmark ^_^!

Related Links

  1. Development interview: detailed explanation of Hash principles
  2. Development interview: common Hash algorithms
  3. Do you understand the relationship between ART, Dalvik and JVM?
  4. What do you know about hot updates?

Origin blog.csdn.net/luo_boke/article/details/106693341