Classic big-data problem scenarios

1. Massive IP data: find the most frequent IP (limited number of distinct keys)

Problem:
  From a massive log file, extract the IP that made the most visits on a given day.
Approach: modulo partitioning (optional) + hash map. Because the number of distinct IPs is limited (at most 2^32), one option is to hash the IPs directly into memory and count them there.
Solution:
  Write the IPs out, one per line, to a large file. An IP is 32 bits, so there are at most 2^32 distinct IPs. Using the same mapping idea, e.g. hash(ip) % 1000, map the entire large file into 1000 small files; then find the most frequent IP in each small file (a hash_map can be used for the counting, after which the few most frequent entries are easy to pick out) together with its frequency. Finally, among those 1000 locally most frequent IPs, take the one with the largest frequency; that is the answer.
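A minimal Python sketch of this partition-and-count scheme, assuming the log has already been reduced to one IP per line; the file names and bucket count below are illustrative:

```python
from collections import Counter

NUM_BUCKETS = 1000

def partition_ips(path="ips.txt"):
    # Phase 1: send every IP to one of 1000 small bucket files by hash(ip) % 1000,
    # so all occurrences of a given IP end up in the same bucket.
    buckets = [open(f"ip_bucket_{i}.txt", "w") for i in range(NUM_BUCKETS)]
    try:
        with open(path) as f:
            for line in f:
                ip = line.strip()
                if ip:
                    buckets[hash(ip) % NUM_BUCKETS].write(ip + "\n")
    finally:
        for b in buckets:
            b.close()

def most_frequent_ip():
    # Phase 2: count each bucket independently (only one bucket in memory at a
    # time), keep the local winner, and take the best local winner overall.
    best_ip, best_count = None, 0
    for i in range(NUM_BUCKETS):
        counts = Counter()
        with open(f"ip_bucket_{i}.txt") as f:
            for line in f:
                counts[line.strip()] += 1
        if counts:
            ip, cnt = counts.most_common(1)[0]
            if cnt > best_count:
                best_ip, best_count = ip, cnt
    return best_ip, best_count
```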

2. Massive search logs: find the most frequent queries (unbounded number of distinct keys)

Question:
  A search engine records every query string a user submits in a log file; each query string is 1-255 bytes long. Assume there are 10 million records in total, but after deduplication there are no more than 3 million distinct query strings. Report the 10 most popular query strings, using no more than 1 GB of memory.

Approach: hash map / trie + heap.
1. First preprocess the mass of data: in O(N) time, build a hash table that counts each query string.
2. Then use a heap to extract the Top K; this costs O(N'·logK). With a heap we can find and adjust/move an element in O(logK) time, so maintaining a min-heap of size K (K = 10 here) gives a total time of O(N) + O(N'·logK), where N = 10,000,000 and N' = 3,000,000.
Alternatively, use a trie instead of the hash table, storing in each key's node the number of times that query string occurs (zero if it never occurs); then use a min-heap of 10 elements to rank the queries by their frequency of appearance.
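A minimal Python sketch of the hash-map + size-K min-heap variant, assuming the queries arrive as an iterable of lines (the names are illustrative):

```python
import heapq
from collections import Counter

def top_k_queries(lines, k=10):
    # O(N) counting pass with a hash map (Counter), N = total records.
    counts = Counter(line.strip() for line in lines)
    # heapq.nlargest keeps a min-heap of size k internally: O(N' log k),
    # where N' is the number of distinct query strings.
    return heapq.nlargest(k, counts.items(), key=lambda kv: kv[1])

# usage sketch:
# with open("queries.log") as f:          # hypothetical log file
#     print(top_k_queries(f, k=10))
```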

3. Top-N word frequencies in a large file

Question:
  There is a 1 GB file in which every line is a word; no word is longer than 16 bytes, and the memory limit is 1 MB. Return the 100 most frequent words.
Approach: modulo split into n parts (subdividing further if a part still exceeds memory) + hash map / trie + heap.
Solution:
  Read the file sequentially; for each word x, compute hash(x) % 5000 and append the word to one of 5000 small files (call them x0, x1, ..., x4999) according to that value. Each small file is then roughly 200 KB.
If any file still exceeds 1 MB, keep splitting it in the same way until no piece is larger than 1 MB. For each small file, count the frequency of every word it contains (a trie or hash_map can be used), then take the 100 most frequent words (a min-heap of 100 nodes works) and write those 100 words and their frequencies to a result file, yielding 5000 result files. The last step is to merge these 5000 files, much like a merge sort.
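A minimal Python sketch of the hash-partition + per-bucket counting + 100-node min-heap scheme; the input path, bucket file names, and counts are illustrative assumptions. Because partitioning is by hash(word), all occurrences of a word land in the same bucket, so the global top 100 must appear among the per-bucket top-100 candidates:

```python
import heapq
from collections import Counter

BUCKETS = 5000   # opening this many files at once may need a raised fd limit
TOP_N = 100

def split_into_buckets(path="words.txt"):
    outs = [open(f"bucket_{i}.txt", "w") for i in range(BUCKETS)]
    try:
        with open(path) as f:
            for line in f:
                w = line.strip()
                if w:
                    outs[hash(w) % BUCKETS].write(w + "\n")
    finally:
        for o in outs:
            o.close()

def top_words():
    candidates = []                       # (count, word) pairs from every bucket
    for i in range(BUCKETS):
        counts = Counter()
        with open(f"bucket_{i}.txt") as f:
            for line in f:
                counts[line.strip()] += 1
        candidates.extend((c, w) for w, c in counts.most_common(TOP_N))
    # Final merge step: the global top 100 must be among the per-bucket top 100.
    return heapq.nlargest(TOP_N, candidates)
```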

4. Multiple large files: sort queries by frequency

Question: There are 10 files of 1 GB each; every line of every file stores a user query, and queries may repeat across files. Sort the queries by their frequency.
Approach: modulo partitioning (optional) + hash_map / trie + heap / merge sort / MapReduce.
1. Read the 10 files sequentially and write each query into one of 10 new files according to hash(query) % 10. Each newly generated file is again about 1 GB (assuming the hash function is roughly uniform). Then, on a machine with about 2 GB of memory, use a hash_map(query, query_count) to count how many times each query appears in one of these files, sort the queries by count with a quick/heap/merge sort, and write the sorted queries with their query_count to a file. This yields 10 sorted files; finally, merge these 10 files with an external merge sort.
2. In general the total number of distinct queries is limited; it is only the number of repetitions that is large, so all distinct queries may fit in memory at once. In that case a trie or hash_map can count every query directly, after which a quick/heap/merge sort orders them by count.
3. Like scheme 1, but after the hash partitioning, hand the many files to a distributed framework (such as MapReduce) for processing, and merge the results at the end.
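A minimal Python sketch of scheme 1: hash-partition the queries, count and sort each partition independently, then k-way merge the sorted partition files. The partition count and file names are illustrative; because partitioning is by hash(query), no query appears in two partitions, so the merge is straightforward:

```python
import heapq
from collections import Counter

PARTS = 10

def partition(input_files):
    # hash(query) % 10 sends all copies of a query to the same partition file.
    outs = [open(f"part_{i}.txt", "w") for i in range(PARTS)]
    try:
        for path in input_files:
            with open(path) as f:
                for line in f:
                    q = line.strip()
                    if q:
                        outs[hash(q) % PARTS].write(q + "\n")
    finally:
        for o in outs:
            o.close()

def sort_partition(i):
    # Count one partition in memory, write "count<TAB>query" by descending count.
    counts = Counter()
    with open(f"part_{i}.txt") as f:
        for line in f:
            counts[line.strip()] += 1
    with open(f"sorted_{i}.txt", "w") as out:
        for q, c in counts.most_common():
            out.write(f"{c}\t{q}\n")

def merge_sorted(result="result.txt"):
    # External k-way merge: each partition file is already sorted by count.
    files = [open(f"sorted_{i}.txt") for i in range(PARTS)]
    try:
        with open(result, "w") as out:
            for line in heapq.merge(*files, key=lambda l: -int(l.split("\t", 1)[0])):
                out.write(line)
    finally:
        for f in files:
            f.close()
```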

5. Find the URLs common to two large files (is some error allowed?)

Question:
  Given two files a and b, each storing 5 billion URLs with each URL occupying 64 bytes, and a memory limit of 4 GB, find the URLs common to files a and b.
Approach: divide and conquer + hash_set, or a Bloom filter if some error rate is acceptable.
1. Divide and conquer:
  Each file is roughly 5G × 64 = 320 GB in size, far beyond the memory limit, so consider divide and conquer. Traverse file a, compute hash(url) % 1000 for each url, and append the url to one of 1000 small files (a0, ..., a999) according to that value, so that each small file is about 300 MB. Traverse file b in the same way (producing b0, ..., b999). After this, any url present in both files must land in a pair of small files with the same index (a0 vs b0, ..., a999 vs b999); small files with different indexes cannot share a url. To find the common urls within each of the 1000 pairs, load the urls of one small file into a hash_set, then check every url of the other small file against that hash_set; if it is there, it is a common url, so save it to an output file.
2. Bloom filter:
  If a certain error rate is allowed, a Bloom filter can be used: 4 GB of memory represents roughly 34 billion bits. Map the urls of one file into those 34 billion bits with the Bloom filter, then read the urls of the other file one by one and test them against the filter. Any url that tests positive should be a common url (note that there will be some false positives).
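A minimal Bloom-filter sketch in Python. The bit-array size, number of hash functions, and hashing scheme below are illustrative assumptions; a real implementation would size the filter from the target false-positive rate:

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive several bit positions from md5 of "i:item"; any family of
        # roughly independent hash functions would do here.
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

# usage sketch: add every url of file a, then probe with the urls of file b;
# a positive answer may be a false positive, a negative answer is always correct.
# bf = BloomFilter(num_bits=10**9)
# for url in read_urls("a"):          # hypothetical reader
#     bf.add(url)
# common = [url for url in read_urls("b") if url in bf]
```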

6. Find the non-repeated integers under a low memory limit

Problem:
  Find the integers that do not repeat among 250 million integers; note that memory is not large enough to hold all 250 million integers.
Approach: bitmap or divide and conquer.
1. 2-Bitmap:
   Use a 2-bitmap, allocating 2 bits per possible integer: 00 means not seen, 01 means seen once, 10 means seen multiple times, and 11 is unused. The total memory needed is 2^32 × 2 bits = 1 GB, which is acceptable. Scan the 250 million integers and update the corresponding bits in the bitmap: 00 becomes 01, 01 becomes 10, and 10 stays unchanged. After the scan, inspect the bitmap and output every integer whose bits are 01 (see the sketch after this list).
2. Divide and conquer:
  A partitioning method like the one in problem 1 can also be used: split the integers into small files, find the non-repeated integers within each small file and sort them, then merge the results, taking care to remove duplicate elements.
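A minimal 2-bitmap sketch for approach 1 in Python, shown over a small value range so it runs quickly; the real problem would use the full 2^32 range (1 GB of bitmap):

```python
def find_unique(numbers, value_range=2**16):
    # 2 bits per value: 00 = not seen, 01 = seen once, 10 = seen more than once.
    bitmap = bytearray((value_range * 2 + 7) // 8)

    def get(v):
        idx = v * 2                        # even offset, never straddles a byte
        return (bitmap[idx // 8] >> (idx % 8)) & 0b11

    def put(v, state):
        idx = v * 2
        byte, off = idx // 8, idx % 8
        bitmap[byte] = (bitmap[byte] & ~(0b11 << off)) | (state << off)

    for n in numbers:
        s = get(n)
        if s == 0b00:
            put(n, 0b01)                   # first occurrence
        elif s == 0b01:
            put(n, 0b10)                   # repeated; 10 stays 10 afterwards
    # Values whose two bits read 01 occurred exactly once.
    return [v for v in range(value_range) if get(v) == 0b01]

# e.g. find_unique([1, 2, 2, 3, 3, 3, 7]) -> [1, 7]
```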

7. Massive data: determine whether a given number is present

Problem: Given 4 billion non-repeating unsigned ints, not sorted, and then a number, how do you quickly determine whether that number is among the 4 billion?
Approach: quick sort + binary search, or a bitmap, or bit-by-bit filtering starting from the most significant bit.
1. Quick sort + binary search.
2. Bitmap (see the sketch after this list):
  Allocate 512 MB of memory and let one bit represent one unsigned int value. Read the 4 billion numbers and set the corresponding bits; then read the number to be queried and check its bit: 1 means the number is present, 0 means it is not.
3. Bit-by-bit comparison:
  Assume the 4 billion numbers are stored in a file as 32-bit binary values.
Split the 4 billion numbers into two groups by the most significant bit (0 or 1) and write the two groups to two files; one file then contains at most 2 billion numbers and the other at least 2 billion. Compare the most significant bit of the number you are looking for with the split and descend into the matching file. Split that file again by the next-highest bit into two files, one with at most 1 billion numbers and the other with at least 1 billion, compare the corresponding bit of the target number, and descend into the matching file. Continue in this way until the answer is found; the time complexity is O(log n).
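A minimal bitmap sketch for approach 2 in Python: one bit per possible unsigned int value, so the full array is 2^32 bits = 512 MB (the stream name below is a hypothetical placeholder):

```python
class BitSet:
    # One bit per possible value; num_bits=2**32 allocates the full 512 MB array.
    def __init__(self, num_bits=2**32):
        self.bits = bytearray(num_bits // 8)

    def add(self, n):
        self.bits[n // 8] |= 1 << (n % 8)

    def __contains__(self, n):
        return bool(self.bits[n // 8] & (1 << (n % 8)))

# usage sketch:
# s = BitSet()
# for n in stream_of_numbers():     # hypothetical generator of the 4 billion ints
#     s.add(n)
# print(123456789 in s)             # True if present, False otherwise
```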

8. Find the largest n numbers

Question: Among 1,000,000 numbers, find the largest 100.
1. Use a min-heap holding 100 elements (see the sketch after this list). Complexity is O(1,000,000 × log 100).
2. Use the partitioning idea of quick sort: after each partition, only recurse into the part larger than the pivot, until that part contains just over 100 elements; then sort it with a conventional sorting algorithm and take the top 100. Complexity is O(1,000,000 × 100).
3. Use a local-elimination method. Take the first 100 elements and sort them; call this sequence L. Then scan the remaining elements one by one: for each element x, compare it with the smallest of the 100 sorted elements; if x is larger than that minimum, delete the minimum and insert x into L in sorted position (insertion-sort style). Repeat until all elements have been scanned. Complexity is O(1,000,000 × 100).
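A minimal Python sketch of approach 1: keep a min-heap of the 100 largest values seen so far, replacing the heap minimum whenever a larger value arrives:

```python
import heapq

def top_n(numbers, n=100):
    heap = []                              # min-heap of the n largest seen so far
    for x in numbers:
        if len(heap) < n:
            heapq.heappush(heap, x)
        elif x > heap[0]:                  # bigger than the current n-th largest
            heapq.heapreplace(heap, x)     # pop the minimum, push x
    return sorted(heap, reverse=True)

# e.g. top_n(range(1_000_000)) -> [999999, 999998, ..., 999900]
```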
