Mass data processing

1. Given a log file larger than 100 GB in which IP addresses are stored, design an algorithm to find the IP address that occurs most often.
A 100 GB log file obviously cannot be loaded into memory all at once, so the first step is to split the large file, while guaranteeing that every occurrence of the same IP lands in the same small file.
Suppose we split the large file into 1024 pieces, so that each piece averages about 100 MB. Hash each IP address and take the result modulo 1024; identical IPs are then mapped to the same file.
When processing each small file, build a hash map keyed by IP with the occurrence count as the value, then sort by count.
In this way we get the most frequent IP of each small file; finally, comparing these per-file winners gives the IP address with the most occurrences overall, as the sketch below illustrates.
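A minimal Python sketch of this two-phase approach; the bucket file names, the 1024-way split, and the use of Python's built-in hash are illustrative assumptions, not part of the original problem:

```python
from collections import Counter

NUM_BUCKETS = 1024

def split_log(log_path):
    # Phase 1: scatter IPs into small files so that every occurrence of
    # the same IP lands in the same bucket. (Opening 1024 files at once
    # assumes the OS file-descriptor limit allows it.)
    buckets = [open(f"bucket_{i}.txt", "w") for i in range(NUM_BUCKETS)]
    with open(log_path) as log:
        for line in log:
            ip = line.strip()
            buckets[hash(ip) % NUM_BUCKETS].write(ip + "\n")
    for f in buckets:
        f.close()

def most_frequent_ip():
    # Phase 2: each ~100 MB bucket fits in memory; count it with a hash
    # map and keep the best candidate seen across all buckets.
    best_ip, best_count = None, 0
    for i in range(NUM_BUCKETS):
        with open(f"bucket_{i}.txt") as f:
            counts = Counter(line.strip() for line in f)
        if counts:
            ip, count = counts.most_common(1)[0]
            if count > best_count:
                best_ip, best_count = ip, count
    return best_ip
```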

2. Similar to the problem above: how do we find the top-k IPs?
This problem asks for the top-k IPs, which calls for a heap. To find the k IPs with the largest occurrence counts, build a min-heap of size k, then feed in the candidates from the small files one by one: whenever a count is larger than the one at the top of the heap, it replaces the top. The k IPs left in the heap at the end are the top-k IP addresses by occurrence count.
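A minimal sketch of the min-heap step, assuming the per-IP counts have already been produced for each small file as in problem 1; heapq and the (count, ip) tuple layout are implementation choices:

```python
import heapq

def top_k_ips(counted_ips, k):
    # counted_ips yields (count, ip) pairs gathered from the small files.
    heap = []  # min-heap of at most k entries; smallest count on top
    for count, ip in counted_ips:
        if len(heap) < k:
            heapq.heappush(heap, (count, ip))
        elif count > heap[0][0]:
            heapq.heapreplace(heap, (count, ip))  # evict the current minimum
    # Whatever survives in the heap is the top-k by occurrence count.
    return [ip for count, ip in sorted(heap, reverse=True)]
```

Each candidate costs O(log k) heap work, so one pass over N candidates runs in O(N log k).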

3. Given two files, each containing 10 billion queries, and only 1 GB of memory, how do we find the intersection of the two files? Give an approximate algorithm and an exact algorithm.
Assuming each query averages 50 bytes, 10 billion queries occupy about 500 GB, far more than the available memory.
Exact algorithm:
Processing data this large again calls for hashing. Split each file into 1024 pieces using the same hash function, so that identical queries from both files map to pieces with the same number; each ~500 MB piece then fits in 1 GB of memory, and we can check every query from a piece of the second file against the corresponding piece of the first file.
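A sketch of the exact algorithm, assuming both files have already been scattered into pieces a_0..a_1023 and b_0..b_1023 with the same hash scheme as in problem 1:

```python
def intersect_piece(path_a, path_b):
    # Each ~500 MB piece fits comfortably in 1 GB of memory as a set.
    with open(path_a) as fa:
        seen = {line.strip() for line in fa}
    with open(path_b) as fb:
        return {line.strip() for line in fb if line.strip() in seen}

def intersection(out_path):
    # Identical queries share a piece number, so only matching pairs of
    # pieces need comparing; results are streamed to disk, not held in RAM.
    with open(out_path, "w") as out:
        for i in range(1024):
            for q in intersect_piece(f"a_{i}.txt", f"b_{i}.txt"):
                out.write(q + "\n")
```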
Approximate algorithm (a Bloom filter can be used when some error is tolerable):
1 GB of memory provides 1024*1024*1024*8 bits (about 8 billion), which is not enough to mark the presence or absence of 10 billion queries. So split the 10 billion queries into two equal halves and build two Bloom filters, one per half, with each query setting 5 bits. Then hash each query from the second file with the same 5 hash functions and check whether all 5 bits are set in either of the two Bloom filters.
Note: when a Bloom filter reports absence, the answer is exact; when it reports presence, there may be a false positive.
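A minimal Bloom filter sketch matching the description above: 5 bit positions per query, derived here from salted md5 digests (the hashing scheme and the bit-array size are illustrative assumptions):

```python
import hashlib

class Bloom:
    def __init__(self, num_bits):
        self.num_bits = num_bits
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, query):
        # Derive 5 pseudo-independent bit positions for the query.
        for salt in range(5):
            digest = hashlib.md5(f"{salt}:{query}".encode()).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add(self, query):
        for pos in self._positions(query):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def may_contain(self, query):
        # False is definitive; True can be a false positive.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(query))
```

For the problem above, the first file's queries would populate two such filters (one per half), and a query from the second file is kept as an intersection candidate if may_contain returns True on either filter.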

4. Given thousands of files, each between 1 KB and 100 MB in size, and given n words, design an algorithm that finds, for each word, all files that contain it. You only have 100 KB of memory.
Normally we would use a map to organize the words appearing in each file and then look up each query word against those maps, but with only 100 KB of memory that is clearly infeasible here. Instead, we can borrow the idea of an inverted index: build a map whose keys are the n query words and whose values are linked lists of pointers to files. Then read the words out of each file in turn and look each one up in the map; if the word is present, append a pointer to that file to its list.
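A sketch of the inverted-index idea; only the n query words and their file lists stay in memory, while file contents are streamed line by line (the whitespace tokenization is an assumption):

```python
def find_files(words, file_paths):
    index = {w: [] for w in words}  # word -> files that contain it
    for path in file_paths:
        recorded = set()  # words already credited to this file
        with open(path) as f:
            for line in f:
                for token in line.split():
                    if token in index and token not in recorded:
                        index[token].append(path)
                        recorded.add(token)
    return index
```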

5. There is a dictionary containing N English words. Given an arbitrary string, design an algorithm to find all English words that contain this string.
Traversing the English words one by one is extremely inefficient, so the words need to be organized before searching. A dictionary tree (trie) organizes these words well, after which the matching words can be traversed directly. This is a typical space-for-time trade-off, in contrast to the space-saving techniques above.
Note: readers unfamiliar with dictionary trees can refer to this blog: http://blog.csdn.net/jiutianhe/article/details/8076835
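A minimal dictionary-tree sketch. It answers the common prefix form of the query; matching an arbitrary substring can be handled by inserting every suffix of every word into the same structure:

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.is_word = False

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def words_with_prefix(self, prefix):
        # Walk down to the node for the prefix, then collect every word
        # below it with a depth-first traversal.
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        results, stack = [], [(node, prefix)]
        while stack:
            node, word = stack.pop()
            if node.is_word:
                results.append(word)
            for ch, child in node.children.items():
                stack.append((child, word + ch))
        return results
```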
