Big Data Optimization Solutions ---- Case Study

"I stumbled on a giant cow artificial intelligence course, could not help but share to everyone. Tutorial is not only a zero-based, user-friendly, and very humorous, like watching a fiction! Think too much bad, so to others. the point where you can jump to the tutorial.. "


Table of Contents

1. From a massive log, extract the IP that accessed Baidu the most times in one day.
2. A search engine records in a log file every query string the user submits; each query string is 1-255 bytes long.
3. A 1G file in which each line is a word; each word is at most 16 bytes; the memory limit is 1M. Return the 100 most frequent words.
4. There are 10 files, each 1G; every line of every file stores a user query, and queries may be repeated across files. Sort the queries by frequency.
5. Given two files a and b, each storing 5 billion URLs, each URL taking 64 bytes, with a memory limit of 4G, find the URLs common to files a and b.
6. Find the integers that occur only once among 250 million integers; note that memory cannot hold all 250 million integers.
7. Tencent interview question: there are 4 billion distinct unsigned ints, not sorted; given one more number, how do you quickly determine whether it is among those 4 billion?
8. How to find the most repeated number in a mass of data?
9. Tens of millions or hundreds of millions of data items (with duplicates); report the top N items by frequency.
10. A text file of about one million lines, one word per line; report the 10 most frequently occurring words, describe your approach, and analyze the time complexity.
11. Top-N statistics with limited memory
12. Multi-file sorting


1. From a massive log, extract the IP that accessed Baidu the most times in one day.

       First, extract all IPs that accessed Baidu on that day from the log and write them, one by one, into a large file. Note that an IP address is 32 bits, so there are at most 2^32 distinct IPs. Use a mapping approach, for example taking the IP modulo 1000, to split the large file into 1000 small files, then find the most frequent IP in each small file (a hash_map can be used for frequency counting) together with its count. Finally, among the 1000 per-file winners, pick the IP with the largest count; that is the answer. Stated more formally (credited to "Snow Eagle"), the algorithm is divide and conquer + hashing (a code sketch follows the steps below):
       (1) An IP address has at most 2^32 = 4G possible values, so the data cannot be loaded into memory all at once;
       (2) Apply the "divide and conquer" idea: compute Hash(IP) % 1024 for each IP and scatter the massive log into 1024 small files, so that each small file contains at most about 4M IP addresses;
       (3) For each small file, build a hash map with the IP as key and its number of occurrences as value, and record the IP with the largest count in that file;
       (4) This yields the most frequent IP of each of the 1024 small files; a conventional sort (or a single linear pass) over these 1024 candidates then gives the overall most frequent IP.
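As a rough illustration of the divide-and-conquer counting above, here is a minimal Python sketch. The file names, the bucket count of 1024, and the helper functions are assumptions for illustration, not part of the original text.

```python
import os
from collections import Counter

BUCKETS = 1024  # split the huge log into 1024 smaller files

def partition_ips(log_path, work_dir):
    """Scatter each IP into one of BUCKETS small files by hash(ip) % BUCKETS."""
    os.makedirs(work_dir, exist_ok=True)
    outs = [open(os.path.join(work_dir, f"part_{i}.txt"), "w") for i in range(BUCKETS)]
    with open(log_path) as f:
        for line in f:
            ip = line.strip()
            if ip:
                outs[hash(ip) % BUCKETS].write(ip + "\n")
    for out in outs:
        out.close()

def most_frequent_ip(log_path, work_dir="ip_parts"):
    """Return (ip, count) for the most frequent IP across the whole log."""
    partition_ips(log_path, work_dir)
    best_ip, best_count = None, 0
    for i in range(BUCKETS):
        counts = Counter()
        with open(os.path.join(work_dir, f"part_{i}.txt")) as f:
            for line in f:
                counts[line.strip()] += 1
        if counts:
            ip, cnt = counts.most_common(1)[0]  # per-file winner
            if cnt > best_count:
                best_ip, best_count = ip, cnt
    return best_ip, best_count
```

Opening all 1024 bucket files at once is a simplification for brevity; a real implementation would buffer writes or raise the open-file limit.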

2. A search engine records in a log file every query string the user submits; each query string is 1-255 bytes long.

       Assume there are 10 million records (these query strings have a fairly high repetition rate; although the total is 10 million, there are no more than 3 million distinct queries after removing duplicates; the more a query string is repeated, the more users issued it, and therefore the more popular it is). You are asked to report the 10 hottest query strings, with memory usage not exceeding 1G.
       This is the classic Top-K algorithm, also covered in the article "From start to finish: a thorough analysis of the hash table algorithm". The final algorithm: Step 1, preprocess the mass of data and complete the frequency statistics with a hash table in O(N) time (an earlier version said "sort"; corrected 2011.04.27).
       Step 2, use a heap data structure to identify the Top K, with time complexity N' * log K. That is, with a heap we can find and adjust/move an element in O(log K) time. So maintain a min-heap of size K (K = 10 for this problem), then traverse the 3,000,000 distinct queries and compare each against the root element. The final time complexity is therefore O(N) + N' * O(log K), with N = 10,000,000 and N' = 3,000,000. For more detail, please refer to the original article.
       Alternatively: use a trie whose nodes store the occurrence count of each query string (0 if the string is absent), then use a min-heap of 10 elements to extract the 10 most frequent queries.
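Here is a minimal sketch of Step 2, the size-K min-heap over precomputed counts; the counts table and the query source are stand-ins for the 3 million distinct queries described above.

```python
import heapq
from collections import Counter

def top_k_queries(query_iter, k=10):
    """Count queries with a hash table, then keep the k most frequent via a min-heap."""
    counts = Counter(query_iter)          # step 1: O(N) frequency statistics
    heap = []                             # min-heap of (count, query), size <= k
    for query, cnt in counts.items():     # step 2: N' * O(log k)
        if len(heap) < k:
            heapq.heappush(heap, (cnt, query))
        elif cnt > heap[0][0]:            # beats the current smallest of the top k
            heapq.heapreplace(heap, (cnt, query))
    return sorted(heap, reverse=True)     # hottest first

# usage sketch:
# print(top_k_queries(line.strip() for line in open("query_log.txt")))
```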

3. A 1G file in which each line is a word; each word is at most 16 bytes; the memory limit is 1M. Return the 100 most frequent words.

       Plan: read the file sequentially; for each word x, compute hash(x) % 5000 and store the word into one of 5000 small files (denoted x0, x1, ..., x4999) according to that value. Each small file will then be roughly 200k.
       If any of these files exceeds 1M, continue splitting it in the same way until no small file is larger than 1M (a sketch of this splitting step follows).
       For each small file, count the frequency of each word it contains (a trie or hash_map can be used), take the 100 most frequent words (a min-heap of 100 nodes can be used), and store these 100 words with their frequencies into a result file; this yields 5000 result files. The last step is to merge these 5000 files (a process similar to merge sort).
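A minimal sketch of the splitting step, including the re-split of any bucket that is still over the 1M limit. The input path, bucket names, and the salted hash are illustrative assumptions, and opening all bucket files at once is only for brevity.

```python
import os

LIMIT = 1 * 1024 * 1024   # 1M per-file memory limit

def split_file(path, out_dir, buckets=5000, depth=0):
    """Scatter words into `buckets` files by hash; re-split any bucket still over LIMIT."""
    os.makedirs(out_dir, exist_ok=True)
    outs = [open(os.path.join(out_dir, f"x{i}"), "a") for i in range(buckets)]
    with open(path) as f:
        for word in f:
            word = word.strip()
            if word:
                # salt the hash with `depth` so a re-split spreads words differently
                outs[hash((word, depth)) % buckets].write(word + "\n")
    for out in outs:
        out.close()
    final_parts = []
    for i in range(buckets):
        part = os.path.join(out_dir, f"x{i}")
        if os.path.getsize(part) > LIMIT:       # still too big: split again
            final_parts += split_file(part, part + "_split", buckets=100, depth=depth + 1)
        else:
            final_parts.append(part)
    return final_parts
```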

4. There are 10 files, each 1G; every line of every file stores a user query, and the queries in each file may be repeated. Sort the queries by frequency.

       This is again a typical Top-K / frequency-sorting problem. Solutions are as follows:
Scheme 1:
       Read the 10 files sequentially and write each query into one of 10 new files according to hash(query) % 10. Each newly generated file will also be about 1G in size (assuming the hash function is reasonably uniform). Find a machine with about 2G of memory and use a hash_map(query, query_count) to count the number of occurrences of each query in each new file. Sort by occurrence count using quick / heap / merge sort, and write the sorted queries with their query_count to a file. This yields 10 sorted files; finally, merge-sort these 10 files (external sorting combined with merging; a merge sketch follows after Scheme 3).
Scheme 2:
       In general the total number of distinct queries is limited; only the repetition count is high, so it may be possible to load all distinct queries into memory at once. In that case, use a trie / hash_map to count the occurrences of each query directly, then sort by occurrence count with quick / heap / merge sort.
Scheme 3:
       Similar to Scheme 1, except that after the hashing step the resulting files are processed by a distributed architecture (such as MapReduce) and the partial results are merged at the end.
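The final step of Scheme 1, merging 10 already-sorted count files, is a k-way merge. A minimal sketch, assuming each intermediate file holds tab-separated "count<TAB>query" lines sorted by count descending (the file naming is illustrative):

```python
import heapq

def merge_sorted_count_files(paths, out_path):
    """K-way merge of files whose lines are 'count<TAB>query', sorted by count descending."""
    def records(path):
        with open(path) as f:
            for line in f:
                cnt, query = line.rstrip("\n").split("\t", 1)
                yield (-int(cnt), query)   # negate so heapq.merge's ascending order = descending count
    streams = [records(p) for p in paths]
    with open(out_path, "w") as out:
        for neg_cnt, query in heapq.merge(*streams):
            out.write(f"{-neg_cnt}\t{query}\n")

# usage sketch:
# merge_sorted_count_files([f"sorted_{i}.txt" for i in range(10)], "all_sorted.txt")
```

Because the queries were partitioned by hash, each distinct query lands in exactly one intermediate file, so the per-file counts are already final and the merge only has to order them.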

5. Given two files a and b, each storing 5 billion URLs, each URL taking 64 bytes, with a memory limit of 4G, find the URLs common to files a and b.

Scheme 1:
       We can estimate the size of each file as 5G × 64 = 320G, far greater than the 4G memory limit, so it is impossible to load a file entirely into memory; consider divide and conquer instead.
       Traverse file a, compute hash(url) % 1000 for each url, and store the urls into 1000 small files (denoted a0, a1, ..., a999) according to that value; each small file is then about 300M. Traverse file b and store its urls into 1000 small files (denoted b0, b1, ..., b999) in the same way.
       After this treatment, any url common to both files can only appear in a corresponding pair of small files (a0 vs b0, a1 vs b1, ..., a999 vs b999); non-corresponding pairs cannot share a url. So we only need to find the common urls within each of the 1000 pairs. For each pair, load the urls of one small file into a hash_set, then read the urls of the other small file one by one and check whether each is in the hash_set just built; if so, it is a common url, and we save it to an output file.
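A minimal sketch of the per-pair intersection in Scheme 1, assuming the 2 × 1000 bucket files already exist under the names used above:

```python
def common_urls(pair_index, out):
    """Intersect one bucket pair a<i> / b<i> using an in-memory set for the a-side."""
    with open(f"a{pair_index}") as fa:
        seen = {line.strip() for line in fa}      # fits in memory: ~300M per bucket
    with open(f"b{pair_index}") as fb:
        for line in fb:
            url = line.strip()
            if url in seen:                       # common to both original files
                out.write(url + "\n")

with open("common_urls.txt", "w") as out:
    for i in range(1000):
        common_urls(i, out)
```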
Scheme 2:
       If a certain error rate is acceptable, a Bloom filter can be used: 4G of memory can represent roughly 34 billion bits. Map the urls of one file into these 34 billion bits with the Bloom filter, then read the urls of the other file one by one and check whether each is in the Bloom filter; if it is, the url should be a common url (note that there will be some false positives). Bloom filters will be described in detail later in this blog.
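A minimal Bloom-filter sketch, with the hash functions derived from blake2b under different salts; the sizes below are toy values, not the 34-billion-bit filter described above, and all names are illustrative.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # derive num_hashes bit positions from salted blake2b digests
        for salt in range(self.num_hashes):
            digest = hashlib.blake2b(item.encode(), salt=salt.to_bytes(8, "little")).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

# usage sketch: add every url from file a, then test each url from file b;
# answers may include false positives but never false negatives.
bf = BloomFilter(num_bits=10_000_000, num_hashes=7)
bf.add("http://example.com/x")
print("http://example.com/x" in bf)   # True
print("http://example.com/y" in bf)   # almost certainly False
```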

6. Find the integers that occur only once among 250 million integers; note that memory cannot hold all 250 million integers.

Scheme 1:
       Use a 2-bit bitmap (2 bits per integer: 00 means not present, 01 means it appears once, 10 means it appears multiple times, 11 is unused). The total memory required is 2^32 * 2 bits = 1 GB, which is acceptable. Scan the 250 million integers and update the corresponding bits in the bitmap: 00 becomes 01, 01 becomes 10, and 10 stays unchanged. After the scan, output every integer whose corresponding bits are 01.
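A minimal 2-bits-per-value bitmap sketch; a full 2^32-value table would take the 1 GB described above, so the demo range below is deliberately small.

```python
class TwoBitMap:
    """2 bits per value: 0 = absent, 1 = seen once, 2 = seen more than once."""
    def __init__(self, max_value):
        self.data = bytearray((max_value >> 2) + 1)   # 4 values per byte

    def _get(self, v):
        return (self.data[v >> 2] >> ((v & 3) * 2)) & 0b11

    def _set(self, v, state):
        byte, shift = v >> 2, (v & 3) * 2
        self.data[byte] = (self.data[byte] & ~(0b11 << shift)) | (state << shift)

    def add(self, v):
        state = self._get(v)
        if state < 2:                                  # 00 -> 01 -> 10, then stay at 10
            self._set(v, state + 1)

    def uniques(self, max_value):
        return [v for v in range(max_value + 1) if self._get(v) == 1]

bm = TwoBitMap(100)
for v in [3, 7, 7, 42, 42, 42, 9]:
    bm.add(v)
print(bm.uniques(100))   # -> [3, 9]
```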
Scheme 2:
       A method similar to the first problem can also be used: split the data into small files, find the non-repeated integers within each small file and sort them, then merge the results, taking care to remove duplicate elements.

7. Tencent interview question: there are 4 billion distinct unsigned ints, not sorted; given one more number, how do you quickly determine whether it is among those 4 billion?

       This is similar to problem 6; my first reaction was quick sort + binary search. Here are some better approaches:
        Scheme 1:
        Allocate 512MB of memory, with one bit representing each possible unsigned int value. Read the 4 billion numbers and set the corresponding bits; then read the number to be queried and check whether its bit is 1: 1 means it is present, 0 means it is not.
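A minimal sketch of Scheme 1's bit array; the full 2^32-bit (512 MB) array is feasible but heavy, so the demo below uses a reduced range and illustrative names.

```python
class BitSet:
    """One bit per possible unsigned 32-bit value (2^32 bits = 512 MB for the real thing)."""
    def __init__(self, num_values=2 ** 32):
        self.bits = bytearray(num_values // 8)

    def set(self, v):
        self.bits[v >> 3] |= 1 << (v & 7)

    def contains(self, v):
        return bool(self.bits[v >> 3] & (1 << (v & 7)))

# small demo with a reduced range; the real case would mark all 4 billion numbers,
# then answer each membership query in O(1)
bs = BitSet(num_values=1_000_000)
for n in (42, 99_999, 123_456):
    bs.set(n)
print(bs.contains(123_456), bs.contains(7))   # True False
```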
        Scheme 2 (credited to dizengrong):
       This problem is well described in "Programming Pearls", and we can borrow its idea. Since 2^32 is more than 4 billion, a given number may or may not be among the 4 billion. Represent each of the 4 billion numbers in 32-bit binary and assume they all start out in one file.
       Split the 4 billion numbers into two groups: those whose most significant bit is 0 and those whose most significant bit is 1, and write each group to its own file, so that one file contains at most 2 billion numbers and the other at least 2 billion (this is equivalent to halving the search space). Compare the most significant bit of the number to be found, pick the corresponding file, and repeat: split that file into two by the second most significant bit, so that one file contains at most 1 billion numbers and the other at least 1 billion (halving again), then pick the file matching the second most significant bit of the target, and so on.
       Proceeding this way, the number can be found with time complexity O(log n); this completes Scheme 2. A sketch of this bit-by-bit partitioning appears after this discussion.
       Attachment: a brief note on the bitmap method. Using a bitmap to decide whether an integer array contains duplicates is a common programming task. When the amount of data is large, we usually want to avoid scanning it several times, so a double (nested) loop is not desirable. The bitmap method suits this case: create a new array of length max + 1, where max is the largest element in the set, then scan the original array; whenever the value k is encountered, set position k of the new array to 1. For example, when a 5 is encountered, set the sixth element of the new array to 1; the next time a 5 is encountered, the sixth element is already 1, which shows the value duplicates earlier data. This approach of zero-initializing a new array and then setting positions to 1 resembles a bitmap, hence the name. Its worst-case number of operations is 2N. If the maximum value of the array is known in advance, the new array can be given a fixed length up front, and efficiency improves further.
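Here is a rough sketch of Scheme 2's bit-by-bit file partitioning, assumed to operate on a text file of decimal numbers; in practice the numbers would be binary-packed and the intermediate files very large, so treat this purely as an illustration of the halving idea.

```python
import os

def contains_by_bit_partition(path, target, bit=31):
    """Recursively split the file by the current bit of each number and descend
    into the half that matches the target's bit, like a binary search on disk."""
    if bit < 0:
        # all 32 bits matched; any remaining number equals the target
        with open(path) as f:
            return any(int(line) == target for line in f)
    zero_path, one_path = path + ".0", path + ".1"
    with open(path) as f, open(zero_path, "w") as f0, open(one_path, "w") as f1:
        for line in f:
            n = int(line)
            (f1 if (n >> bit) & 1 else f0).write(line)
    chosen = one_path if (target >> bit) & 1 else zero_path
    found = (os.path.getsize(chosen) > 0
             and contains_by_bit_partition(chosen, target, bit - 1))
    os.remove(zero_path)
    os.remove(one_path)
    return found
```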

8. How to find the most repeated number in a mass of data?

       Scheme 1: first hash the data, then use the modulo result to map it into small files; find the most repeated number in each small file and record its repetition count; finally, among the candidates obtained in the previous step, the one with the largest repetition count is the answer (refer to the preceding questions for details).

9. Tens of millions or hundreds of millions of data items (with duplicates); report the top N items by frequency.

       Scheme 1: tens of millions or hundreds of millions of items should fit in the memory of today's machines. So consider using a hash_map / binary search tree / red-black tree to count the occurrences, and then extract the N items with the highest counts using the heap mechanism mentioned in problem 2.

10. A text file of about one million lines, one word per line; report the 10 most frequently occurring words, describe your approach, and analyze the time complexity.

Scheme 1:
       This question is about time efficiency. Count the occurrences of each word with a trie; the time complexity is O(n * le), where le is the average word length. Then find the 10 most frequent words using the heap described in the earlier problems; the time complexity is O(n * lg 10). The total time complexity is therefore the larger of O(n * le) and O(n * lg 10).
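A minimal dict-based trie sketch for the counting step of Scheme 1; the "#count" terminal marker and the sample words are illustrative, and a production version might well prefer a flat hash map.

```python
import heapq

def count_with_trie(words):
    """Insert each word into a trie; each terminal node carries an occurrence count."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["#count"] = node.get("#count", 0) + 1   # terminal marker with frequency
    return root

def collect(node, prefix="", out=None):
    """Walk the trie and return (word, count) pairs."""
    if out is None:
        out = []
    for key, child in node.items():
        if key == "#count":
            out.append((prefix, child))
        else:
            collect(child, prefix + key, out)
    return out

words = ["apple", "app", "apple", "banana", "app", "app"]
counts = collect(count_with_trie(words))
print(heapq.nlargest(2, counts, key=lambda wc: wc[1]))   # [('app', 3), ('apple', 2)]
```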
Attached: find the largest 100 numbers among 1,000,000 (100w) numbers.
Scheme 2:
       As already mentioned in the earlier problems, this can be done with a min-heap of 100 elements. The complexity is O(100w * lg 100).
Scheme 3:
       Use the idea of quick sort: after each partition, only consider the part larger than the pivot; once the part larger than the pivot has only slightly more than 100 elements, sort it with a traditional sorting algorithm and take the top 100 (a quickselect-style sketch follows after Scheme 4). The complexity is O(100w * 100).
Scheme 4:
       Use a partial-elimination method: take the first 100 elements and sort them, calling the result sequence L. Then scan the remaining elements one by one, comparing each element x with the smallest of the 100 sorted elements; if x is larger than that smallest element, delete the smallest element and insert x into L using the idea of insertion sort. Repeat until all elements have been scanned. The complexity is O(100w * 100).
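A rough quickselect-style sketch for Scheme 3, repeatedly partitioning and recursing only into the side that still contains the boundary of the top 100 (the demo uses k = 5 on a small range):

```python
import random

def top_k_quickselect(nums, k=100):
    """Return the k largest values using quickselect-style partitioning.
    Only the side containing the k-th boundary is examined further."""
    nums = list(nums)
    lo, hi = 0, len(nums)
    target = max(len(nums) - k, 0)             # index of the k-th largest in sorted order
    while hi - lo > 1:
        pivot = nums[random.randrange(lo, hi)]
        left = [x for x in nums[lo:hi] if x < pivot]
        mid = [x for x in nums[lo:hi] if x == pivot]
        right = [x for x in nums[lo:hi] if x > pivot]
        nums[lo:hi] = left + mid + right
        if target < lo + len(left):
            hi = lo + len(left)
        elif target < lo + len(left) + len(mid):
            break                              # pivot block straddles the boundary
        else:
            lo = lo + len(left) + len(mid)
    return sorted(nums[target:], reverse=True) # the top k, largest first

print(top_k_quickselect(range(1000), k=5))     # [999, 998, 997, 996, 995]
```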

11. Top-N statistics with limited memory

Requirement
       There is a file of 1G size in which each line is a word; each word is at most 16 bytes; the memory limit is 1M. Return the 100 most frequent words.

Scheme
       Read the file sequentially; for each word x, compute hash(x) % 5000 and store it into one of 5000 small files (denoted x0, x1, ..., x4999) according to that value. Each file will then be roughly 200k. If any file exceeds 1M, continue splitting it in the same way until no small file is larger than 1M. For each small file, count the frequency of each word it contains (a trie or hash_map can be used), take the 100 most frequent words (a min-heap of 100 nodes can be used), and store these 100 words with their frequencies into a result file, yielding 5000 result files. The last step is to merge these 5000 files (similar to merge sort); a sketch of that final merge follows.
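Because the words were partitioned by hash, each distinct word lives in exactly one bucket, so its per-bucket count is already its global count; the final merge just picks the overall top 100 from the 5000 candidate lists. A minimal sketch, assuming each candidate file holds "word<TAB>count" lines (file names are illustrative):

```python
import heapq

def merge_top100(candidate_paths, k=100):
    """Pick the global top-k from per-bucket candidate files of 'word<TAB>count' lines."""
    def entries():
        for path in candidate_paths:
            with open(path) as f:
                for line in f:
                    word, cnt = line.rstrip("\n").split("\t")
                    yield int(cnt), word
    return heapq.nlargest(k, entries())        # [(count, word), ...] largest first

# usage sketch:
# top = merge_top100([f"top100_x{i}.txt" for i in range(5000)])
```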

12. Multi-file sorting

Requirement
       There are 10 files, each 1G; every line of every file stores a user query, and the queries in each file may be repeated. Sort the queries by frequency. This is again the typical Top-K problem.

Scheme 1:
       Read the 10 files sequentially and write each query into one of 10 new files according to hash(query) % 10. Each newly generated file is also about 1G in size (assuming the hash function is reasonably uniform). Find a machine with about 2G of memory and use a hash_map(query, query_count) to count the number of occurrences of each query. Sort by occurrence count using quick / heap / merge sort, and write the sorted queries with their query_count to a file. This yields 10 sorted files; finally, merge-sort these 10 files (external sorting combined with merging).

Scheme 2:
       In general the total number of distinct queries is limited; only the repetition count is high, so it may be possible to load all distinct queries into memory at once. In that case, use a trie / hash_map to count the occurrences of each query directly, then sort by occurrence count with quick / heap / merge sort.

Scheme 3:
       Similar to Scheme 1, except that after the hashing step the resulting files are handed to a distributed architecture (such as MapReduce) for processing, and the partial results are merged at the end; a toy sketch of that flow follows.
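A toy, single-process illustration of the MapReduce-style flow in Scheme 3 (map each query to (query, 1), shuffle by key, reduce by summing); a real deployment would use a framework such as Hadoop or Spark, and every name below is illustrative.

```python
from collections import defaultdict

def map_phase(lines):
    """map: emit (query, 1) for every query line."""
    for line in lines:
        query = line.strip()
        if query:
            yield query, 1

def shuffle(pairs):
    """shuffle: group all values by key, as the framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """reduce: sum the counts for each query, then sort by frequency."""
    counts = {query: sum(values) for query, values in grouped.items()}
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)

lines = ["foo", "bar", "foo", "baz", "foo", "bar"]
print(reduce_phase(shuffle(map_phase(lines))))   # [('foo', 3), ('bar', 2), ('baz', 1)]
```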

