Massive data problems

Reprinted from: https://blog.csdn.net/Better_JH/article/details/77197318

Question: Given a log file of more than 100G that stores IP addresses, design an algorithm to find the K IP addresses that occur most often.

1. Divide the 100G file into 100 small files. Use a string-to-integer hash function (such as BKDRHash) to convert each IP address into an integer key, then use index=key%100 to choose the file, so that every occurrence of the same IP lands in the same small file.
2. Read the 100 files into memory one at a time and count the occurrences of each IP in the file. Because an IP appears in only one small file, its count there is its count in the whole log. Sorted in descending order of count, the top K entries are the most frequent IPs of that file.
3. Build a min-heap from the top K IPs of the first file, keyed by occurrence count. Then read the second file, count its IPs, and compare the counts of its top K IPs with the top of the heap, which is the smallest count kept so far. If a count is larger, replace the top element and re-adjust the heap. Then read the third file, and so on. After the last file, the heap holds the K most frequent IPs.

Problem: Given two files, each with 10 billion URLs, and only 1G of memory, find the intersection of the two files. An exact algorithm and an approximate algorithm are given below.

Exact algorithm: hash segmentation

1. Divide each file into 1000 parts. Use a string-to-integer hash function (BKDRHash) to convert each URL into an integer key, then use index=key%1000 to choose the part. With the same function applied to both files, identical URLs always fall into parts with the same number (A0 and B0, A1 and B1, ...).
2. Load the two small files with the same number into memory in turn and compare them. For example, for A0 and B0, traverse A0 and store its URLs in a hash_map, then traverse B0. If a URL of B0 is in the hash_map, it exists in both A and B and can be saved to the result file.

Approximate algorithm: Bloom filter

1. Use several different string-to-integer hash functions to convert each URL into several integer keys, and set the bit at each of those positions in a bitmap. To find the intersection, insert the URLs of one file and then test the URLs of the other. When testing a URL, all of its bits must be set for it to count as present. Because bits set by different URLs can overlap, a missing bit proves that the URL is absent, but a URL whose bits are all set may still never have been inserted, so a "present" answer can be a false positive.
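
For the top-K IP question, the three steps can be sketched in Python roughly as follows. This is a minimal in-memory illustration, not a tuned implementation: in-memory lists stand in for the 100 small files, and the names bkdr_hash, partition and top_k_ips, the seed 131 and the 32-bit truncation are assumptions made for the example.

```python
import heapq
from collections import Counter


def bkdr_hash(text, seed=131):
    """BKDR string hash, truncated to 32 bits (truncation is an assumption)."""
    h = 0
    for ch in text:
        h = (h * seed + ord(ch)) & 0xFFFFFFFF
    return h


def partition(ips, parts):
    """Step 1: route each IP to bucket hash % parts.

    Each list stands in for one small file on disk.
    """
    buckets = [[] for _ in range(parts)]
    for ip in ips:
        buckets[bkdr_hash(ip) % parts].append(ip)
    return buckets


def top_k_ips(ips, k, parts=100):
    """Return the k most frequent IPs as (count, ip), most frequent first."""
    if k <= 0:
        return []
    heap = []  # min-heap of (count, ip); heap[0] holds the smallest count kept
    for bucket in partition(ips, parts):
        # Step 2: count one "file" at a time and take its top k.
        for ip, count in Counter(bucket).most_common(k):
            # Step 3: keep the k largest counts seen so far.
            if len(heap) < k:
                heapq.heappush(heap, (count, ip))
            elif count > heap[0][0]:
                heapq.heapreplace(heap, (count, ip))
    return sorted(heap, reverse=True)


log = ["10.0.0.1"] * 5 + ["10.0.0.2"] * 3 + ["10.0.0.3"] + ["10.0.0.4"] * 4
print(top_k_ips(log, 2, parts=4))  # [(5, '10.0.0.1'), (4, '10.0.0.4')]
```

The heap never grows beyond K entries, so memory use is bounded by one small file's counts plus K, regardless of how many files are processed.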
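
For the exact URL intersection, hash segmentation can be sketched in Python roughly as follows. The sketch assumes one URL per line in each input file. The function names, the file naming scheme (A0, B0, ...) and the tiny demo data are hypothetical, and a Python set plays the role of the hash_map.

```python
import os
import tempfile


def bkdr_hash(text, seed=131):
    """BKDR string hash, truncated to 32 bits (truncation is an assumption)."""
    h = 0
    for ch in text:
        h = (h * seed + ord(ch)) & 0xFFFFFFFF
    return h


def split_file(path, out_dir, prefix, parts):
    """Step 1: write each URL to out_dir/<prefix><index>, index = hash % parts."""
    # All part files stay open at once, so `parts` must fit within the
    # operating system's limit on open files.
    outs = [open(os.path.join(out_dir, f"{prefix}{i}"), "w", encoding="utf-8")
            for i in range(parts)]
    try:
        with open(path, encoding="utf-8") as src:
            for line in src:
                url = line.strip()
                if url:
                    outs[bkdr_hash(url) % parts].write(url + "\n")
    finally:
        for f in outs:
            f.close()


def intersect(path_a, path_b, result_path, parts):
    """Step 2: compare A_i with B_i pair by pair and save the common URLs."""
    with tempfile.TemporaryDirectory() as tmp, \
            open(result_path, "w", encoding="utf-8") as result:
        split_file(path_a, tmp, "A", parts)
        split_file(path_b, tmp, "B", parts)
        for i in range(parts):
            with open(os.path.join(tmp, f"A{i}"), encoding="utf-8") as fa:
                seen = {line.strip() for line in fa}
            with open(os.path.join(tmp, f"B{i}"), encoding="utf-8") as fb:
                for line in fb:
                    url = line.strip()
                    if url in seen:
                        result.write(url + "\n")
                        seen.discard(url)  # choice made here: report each URL once


def demo():
    """Usage example on two tiny hypothetical files; returns the common URLs."""
    with tempfile.TemporaryDirectory() as d:
        path_a, path_b, out = (os.path.join(d, n)
                               for n in ("a.txt", "b.txt", "out.txt"))
        with open(path_a, "w", encoding="utf-8") as f:
            f.write("http://x.example/1\nhttp://x.example/2\nhttp://x.example/3\n")
        with open(path_b, "w", encoding="utf-8") as f:
            f.write("http://x.example/3\nhttp://y.example/9\nhttp://x.example/1\n")
        intersect(path_a, path_b, out, parts=8)
        with open(out, encoding="utf-8") as f:
            return sorted(f.read().split())


print(demo())  # ['http://x.example/1', 'http://x.example/3']
```

Only one pair of small files is held in memory at a time, which is what lets the comparison fit in a fixed memory budget.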
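
For the approximate algorithm, a Bloom filter can be sketched in Python roughly as follows. The class and function names, the four BKDR seeds used as the "different" hash functions, and the bitmap size in the usage lines are all assumptions chosen for illustration. A real deployment would size the bitmap and the number of hash functions from the expected number of URLs and the acceptable false-positive rate.

```python
def bkdr_hash(text, seed):
    """BKDR string hash with a configurable seed, truncated to 32 bits."""
    h = 0
    for ch in text:
        h = (h * seed + ord(ch)) & 0xFFFFFFFF
    return h


class BloomFilter:
    # One seed per hash function; using four seeds is an assumption.
    SEEDS = (31, 131, 1313, 13131)

    def __init__(self, num_bits):
        self.num_bits = num_bits
        self.bits = bytearray((num_bits + 7) // 8)  # the bitmap

    def _positions(self, item):
        """Map one item to several bit positions, one per hash function."""
        return [bkdr_hash(item, seed) % self.num_bits for seed in self.SEEDS]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        """False means definitely absent; True may be a false positive."""
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def approximate_intersection(urls_a, urls_b, num_bits):
    """Insert the URLs of A, then keep the URLs of B that the filter accepts.

    The result contains every URL common to both inputs, and possibly a few
    URLs of B that are not in A.
    """
    bloom = BloomFilter(num_bits)
    for url in urls_a:
        bloom.add(url)
    return [url for url in urls_b if bloom.might_contain(url)]


a = ["http://x.example/1", "http://x.example/2"]
b = ["http://x.example/2", "http://y.example/9"]
common = approximate_intersection(a, b, num_bits=1 << 16)
print("http://x.example/2" in common)  # True: an inserted URL is never rejected
```

Compared with hash segmentation, the filter stores only bits rather than the URLs themselves, which is why it trades exactness for memory.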

