Xi'an Shangxue Tang Exercise 09.10 | Java Programming Written Test Questions

1. Given two files a and b, each storing 5 billion URLs with each URL taking 64 bytes, and a memory limit of 4 GB, find the URLs common to files a and b.
Scheme 1: We can estimate the size of each file as 5 billion × 64 bytes = 320 GB, far greater than the 4 GB memory limit, so the files cannot be loaded into memory and processed in one pass. Consider a divide-and-conquer approach.
Traverse file a; for each URL compute hash(url) % 1000 and, according to that value, store the URL into one of 1000 small files (call them a0, a1, ..., a999). Each small file is then roughly 300 MB.
Traverse file b and distribute its URLs into 1000 small files (call them b0, b1, ..., b999) using the same hash. After this step, any URL common to both inputs must land in a corresponding pair of small files (ai and bi); small files that do not correspond cannot share a URL. So we only need to find the common URLs within each of the 1000 pairs of small files.
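As a rough illustration, here is a minimal Java sketch of the partitioning step, assuming plain-text inputs with one URL per line; the input names a.txt / b.txt and the a_i / b_i output naming are assumptions made for the example:

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

public class UrlPartitioner {
    private static final int BUCKETS = 1000;

    // Splits one large URL file into BUCKETS smaller files named <prefix>_0 .. <prefix>_999.
    public static void partition(String inputFile, String prefix) throws IOException {
        BufferedWriter[] writers = new BufferedWriter[BUCKETS];
        for (int i = 0; i < BUCKETS; i++) {
            writers[i] = new BufferedWriter(new FileWriter(prefix + "_" + i));
        }
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(inputFile), StandardCharsets.UTF_8))) {
            String url;
            while ((url = reader.readLine()) != null) {
                // The same hash on both files guarantees identical URLs land in buckets with the same index.
                int bucket = Math.floorMod(url.hashCode(), BUCKETS);
                writers[bucket].write(url);
                writers[bucket].newLine();
            }
        } finally {
            for (BufferedWriter w : writers) {
                w.close();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        partition("a.txt", "a");  // produces a_0 .. a_999
        partition("b.txt", "b");  // produces b_0 .. b_999
    }
}
```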
To find the common URLs within a pair of small files, load the URLs of one small file into a hash_set, then read the URLs of the other small file one by one and check whether each is in the hash_set just built. If it is, it is a common URL; append it to an output file.
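A sketch of that pair-wise intersection, assuming the a_i / b_i small files produced above, with a java.util.HashSet standing in for the hash_set:

```java
import java.io.*;
import java.util.HashSet;
import java.util.Set;

public class CommonUrls {
    // Writes URLs that appear in both smallFileA and smallFileB to the output writer.
    public static void intersect(String smallFileA, String smallFileB, BufferedWriter out)
            throws IOException {
        Set<String> seen = new HashSet<>();
        try (BufferedReader ra = new BufferedReader(new FileReader(smallFileA))) {
            String url;
            while ((url = ra.readLine()) != null) {
                seen.add(url);                       // load one side into memory
            }
        }
        try (BufferedReader rb = new BufferedReader(new FileReader(smallFileB))) {
            String url;
            while ((url = rb.readLine()) != null) {
                if (seen.contains(url)) {            // membership test against the other side
                    out.write(url);
                    out.newLine();
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        try (BufferedWriter out = new BufferedWriter(new FileWriter("common_urls.txt"))) {
            for (int i = 0; i < 1000; i++) {
                intersect("a_" + i, "b_" + i, out);  // only corresponding buckets can share URLs
            }
        }
    }
}
```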
Scheme 2: If a certain error rate is acceptable, a Bloom filter can be used: 4 GB of memory can represent roughly 34 billion bits. Map the URLs of one file onto these 34 billion bits with the Bloom filter, then read the URLs of the other file one by one and test them against the Bloom filter; any URL that tests positive should be a common URL (note that there will be some false positives).
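A sketch using Guava's BloomFilter, assuming Guava is on the classpath and the JVM heap is sized above 4 GB; the ~4% false-positive rate is an assumption chosen so the filter stays near the 34-billion-bit budget mentioned above:

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class BloomFilterIntersect {
    public static void main(String[] args) throws IOException {
        // 5 billion expected insertions at ~4% false positives keeps the bit array
        // around 34 billion bits, roughly the 4 GB budget described above.
        BloomFilter<String> filter = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8), 5_000_000_000L, 0.04);

        try (BufferedReader ra = new BufferedReader(new FileReader("a.txt"))) {
            String url;
            while ((url = ra.readLine()) != null) {
                filter.put(url);                     // map every URL of file a into the filter
            }
        }
        try (BufferedReader rb = new BufferedReader(new FileReader("b.txt"))) {
            String url;
            while ((url = rb.readLine()) != null) {
                if (filter.mightContain(url)) {      // "might": false positives are possible
                    System.out.println(url);         // probably a common URL
                }
            }
        }
    }
}
```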
2. There are 10 files, each 1 GB in size. Every line of every file stores a user query, and queries may be repeated across files. Sort the queries by frequency.
Scheme 1:
Read the 10 files sequentially and, according to hash(query) % 10, write each query into one of 10 new files (call them a0, a1, ..., a9). Each newly generated file is then also about 1 GB in size (assuming a reasonably uniform hash function).
Find a machine with about 2 GB of memory and use a hash_map(query, query_count) to count how many times each query occurs. Then sort by occurrence count using quick / heap / merge sort, and write the queries together with their query_count to a file. This yields 10 sorted files (call them b0, b1, ..., b9).
Finally, merge-sort these 10 files (a combination of external sorting and internal sorting).
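A minimal sketch of that final k-way merge in Java, assuming each sorted file uses a hypothetical "count<TAB>query" line format in descending order of count and is named sorted_0 ... sorted_9:

```java
import java.io.*;
import java.util.PriorityQueue;

public class SortedFileMerger {
    // One open reader plus its current line, ordered by the count at the start of the line.
    private static class Source {
        final BufferedReader reader;
        String line;
        Source(BufferedReader reader, String line) { this.reader = reader; this.line = line; }
        long count() { return Long.parseLong(line.substring(0, line.indexOf('\t'))); }
    }

    public static void main(String[] args) throws IOException {
        // Largest count first; each input file is already sorted in descending order of count.
        PriorityQueue<Source> heap =
                new PriorityQueue<>((a, b) -> Long.compare(b.count(), a.count()));
        for (int i = 0; i < 10; i++) {
            BufferedReader reader = new BufferedReader(new FileReader("sorted_" + i));
            String first = reader.readLine();
            if (first != null) heap.add(new Source(reader, first)); else reader.close();
        }
        try (BufferedWriter out = new BufferedWriter(new FileWriter("all_sorted.txt"))) {
            while (!heap.isEmpty()) {
                Source s = heap.poll();              // globally next-largest count
                out.write(s.line);
                out.newLine();
                s.line = s.reader.readLine();        // advance this source
                if (s.line != null) heap.add(s); else s.reader.close();
            }
        }
    }
}
```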
Scheme 2:
In general, the total number of distinct queries is limited; it is only the repetitions that are numerous, so it may well be possible to hold all distinct queries in memory at once. In that case we can count the occurrences of each query directly with a trie / hash_map, and then sort by occurrence count using quick / heap / merge sort.
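A minimal sketch of this in-memory approach, assuming the 10 input files are named queries_0.txt ... queries_9.txt (a HashMap stands in for the hash_map):

```java
import java.io.*;
import java.util.*;

public class InMemoryQuerySort {
    public static void main(String[] args) throws IOException {
        Map<String, Integer> counts = new HashMap<>();
        // Read all 10 original files; the distinct queries are assumed to fit in memory.
        for (int i = 0; i < 10; i++) {
            try (BufferedReader reader = new BufferedReader(new FileReader("queries_" + i + ".txt"))) {
                String query;
                while ((query = reader.readLine()) != null) {
                    counts.merge(query, 1, Integer::sum);   // count each occurrence
                }
            }
        }
        // Sort the distinct queries by descending occurrence count and print them.
        counts.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .forEach(e -> System.out.println(e.getValue() + "\t" + e.getKey()));
    }
}
```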
Scheme 3:
Similar to Scheme 1, except that after the hash partitioning is done, the resulting files are handed to a distributed architecture (such as MapReduce) for processing, and the results are merged at the end.
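For illustration only, a sketch of how the counting step could look as a Hadoop MapReduce job; the class names are assumptions, and sorting by frequency would typically be done in a second job:

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class QueryCount {
    // Emits (query, 1) for every input line; each line is one query.
    public static class QueryMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(value, ONE);
        }
    }

    // Sums the 1s for each query to produce its total occurrence count.
    public static class QueryReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }
}
```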
3. There is a 1 GB file in which every line is a word; no word is longer than 16 bytes, and the memory limit is 1 MB. Return the 100 words with the highest frequency.
Read the file sequentially; for each word x compute hash(x) % 5000 and store the word into one of 5000 small files (call them x0, x1, ..., x4999) according to that value. Each small file is then roughly 200 KB. If any small file exceeds 1 MB, keep splitting it in the same way until no small file is larger than 1 MB. For each small file, count the frequency of every word that appears in it (a trie / hash_map can be used), take the 100 words with the highest frequency (a min-heap of 100 nodes works well), and store those 100 words with their frequencies in a file; this produces 5000 result files. The final step is to merge these 5000 files (similar to a merge sort).
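A minimal sketch of the per-small-file step, assuming a hypothetical small file named x_0 with one word per line; a HashMap counts frequencies and a size-100 PriorityQueue plays the role of the min-heap:

```java
import java.io.*;
import java.util.*;

public class Top100Words {
    // Returns the 100 most frequent words of one small file as (word, count) entries.
    public static List<Map.Entry<String, Integer>> top100(String smallFile) throws IOException {
        Map<String, Integer> freq = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(new FileReader(smallFile))) {
            String word;
            while ((word = reader.readLine()) != null) {
                freq.merge(word, 1, Integer::sum);   // word frequency within this small file
            }
        }
        // Min-heap of at most 100 nodes keyed by count: the root is the weakest of the current top 100.
        PriorityQueue<Map.Entry<String, Integer>> heap =
                new PriorityQueue<>(Map.Entry.<String, Integer>comparingByValue());
        for (Map.Entry<String, Integer> e : freq.entrySet()) {
            heap.add(e);
            if (heap.size() > 100) {
                heap.poll();                         // evict the smallest so only the top 100 remain
            }
        }
        return new ArrayList<>(heap);
    }

    public static void main(String[] args) throws IOException {
        for (Map.Entry<String, Integer> e : top100("x_0")) {
            System.out.println(e.getValue() + "\t" + e.getKey());
        }
    }
}
```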
4. From massive log data, extract the IP that visited Baidu the most times on a given day.
An IP address has at most 2^32 ≈ 4 billion possible values, so the IPs cannot all be loaded into memory and processed at once.
Consider a divide-and-conquer approach: according to the value of Hash(IP) % 1024, store the massive IP log into 1024 small files, so that each small file contains at most about 4 million distinct IP addresses.
Why use the Hash(IP) % 1024 value? If we split the log directly without hashing, the same IP could end up in every small file, and that IP would not necessarily be the most frequent one within any single small file, so the final answer could be wrong. With Hash(IP) % 1024, all occurrences of a given IP are certain to land in the same small file; of course, different IPs may hash to the same value and share a small file.
For each small file, build a HashMap with the IP as key and its occurrence count as value, while keeping track of the IP with the highest count seen so far in that file (see the sketch below).
This yields the most frequent IP for each of the 1024 small files; then, using an ordinary sorting algorithm (or a simple scan) over these 1024 candidates, obtain the IP with the highest frequency overall.
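A minimal sketch of those two steps, assuming the partitioned log files are plain text with one IP per line and are named ip_0 ... ip_1023:

```java
import java.io.*;
import java.util.HashMap;
import java.util.Map;

public class TopIp {
    // Returns the (ip, count) pair with the highest count in one small log file, or null if it is empty.
    public static Map.Entry<String, Long> topOfFile(String smallFile) throws IOException {
        Map<String, Long> counts = new HashMap<>();
        Map.Entry<String, Long> best = null;
        try (BufferedReader reader = new BufferedReader(new FileReader(smallFile))) {
            String ip;
            while ((ip = reader.readLine()) != null) {
                long c = counts.merge(ip, 1L, Long::sum);
                if (best == null || c > best.getValue()) {
                    best = Map.entry(ip, c);          // track the current per-file maximum
                }
            }
        }
        return best;
    }

    public static void main(String[] args) throws IOException {
        Map.Entry<String, Long> overall = null;
        for (int i = 0; i < 1024; i++) {
            Map.Entry<String, Long> candidate = topOfFile("ip_" + i);
            if (candidate != null && (overall == null || candidate.getValue() > overall.getValue())) {
                overall = candidate;                  // compare the 1024 per-file winners
            }
        }
        if (overall != null) {
            System.out.println(overall.getKey() + " appeared " + overall.getValue() + " times");
        }
    }
}
```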

Source: blog.51cto.com/14512197/2437139