[Big Data] Methods for processing massive data

 1. From massive log data, extract the IP that visited Baidu the most times on a given day.
  First, take the IPs out of that day's logs of visits to Baidu and write them one by one into a large file. Note that an IP is 32 bits, so there are at most 2^32 distinct IPs. We can use a mapping approach, for example taking the IP modulo 1000, to split the large file into 1000 small files, and then find the most frequent IP in each small file (a hash_map can be used for frequency counting) together with its frequency. Finally, among these 1000 candidate IPs, pick the one with the largest frequency; that is the answer.

  Or, elaborated as follows (by Snow Eagle):

  Algorithm idea: divide and conquer + hashing

  1. There are at most 2^32 ≈ 4G possible IP addresses, so they cannot all be loaded into memory at once;

  2. Consider adopting the "divide and conquer" idea: distribute the massive IP log into 1024 small files according to Hash(IP) % 1024. Each small file then contains at most 4M (2^22) distinct IP addresses;

  3. For each small file, build a hash map with the IP as the key and its number of occurrences as the value, and record the IP with the most occurrences in that file;

  4. This yields 1024 candidate IPs (the most frequent one from each small file); the overall most frequent IP can then be obtained from these candidates with a conventional sorting or simple maximum pass (see the sketch below).
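
  A minimal C++ sketch of this divide-and-conquer idea follows. The log file name (ip.log), the bucket file names, and the use of std::hash are assumptions for illustration, not part of the original problem; std::unordered_map plays the role of the hash_map mentioned above.

```cpp
// Sketch: partition a huge IP log into 1024 small files, then count each file.
#include <cstdio>
#include <fstream>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

int main() {
    const int kBuckets = 1024;

    // Pass 1: partition the big log into 1024 small files by hash(IP) % 1024.
    // (Opening 1024 files at once may hit the OS file-descriptor limit; a real
    // implementation might partition in batches.)
    {
        std::ifstream in("ip.log");
        std::vector<std::ofstream> buckets(kBuckets);
        for (int i = 0; i < kBuckets; ++i)
            buckets[i].open("ip_bucket_" + std::to_string(i) + ".txt");
        std::string ip;
        while (std::getline(in, ip))
            buckets[std::hash<std::string>{}(ip) % kBuckets] << ip << '\n';
    }

    // Pass 2: count each small file with a hash map and keep the overall best.
    std::string best_ip;
    long long best_count = 0;
    for (int i = 0; i < kBuckets; ++i) {
        std::ifstream in("ip_bucket_" + std::to_string(i) + ".txt");
        std::unordered_map<std::string, long long> freq;
        std::string ip;
        while (std::getline(in, ip)) ++freq[ip];
        for (const auto& kv : freq)
            if (kv.second > best_count) { best_ip = kv.first; best_count = kv.second; }
    }
    std::printf("most frequent IP: %s (%lld hits)\n", best_ip.c_str(), best_count);
}
```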

  2. A search engine records, via log files, every query string a user searches for; each query string is 1 to 255 bytes long.

  Suppose there are currently 10 million records (these query strings have a high degree of repetition: although the total is 10 million, there are no more than 3 million distinct strings after deduplication. The more a query string repeats, the more users searched for it, and thus the more popular it is). Please count the 10 most popular query strings; the memory used must not exceed 1G.

  The final algorithm given is:

  The first step is to preprocess this batch of massive data, using a hash table to complete the frequency statistics in O(N) time (previously written as sorting, hereby revised - July, 2011.04.27);

  The second step uses the heap data structure to find the Top K; the time complexity is O(N' log K).

  That is, with a heap we can look up and adjust/move elements in logarithmic time. So maintain a min-heap of size K (K = 10 for this problem), then traverse the 3 million distinct queries and compare each with the root element. Our final time complexity is therefore O(N) + O(N' log K), where N is 10 million and N' is 3 million. For more details, please refer to the original article.
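
  A minimal sketch of the two steps, assuming for simplicity that the query records are already available in a std::vector<std::string> (in the real setting they would be streamed from the log file); std::unordered_map stands in for the hash table and std::priority_queue for the size-K min-heap.

```cpp
// Sketch: O(N) hash counting followed by an O(N' log K) min-heap Top K pass.
#include <functional>
#include <iostream>
#include <queue>
#include <string>
#include <unordered_map>
#include <vector>

std::vector<std::pair<std::string, long long>>
topKQueries(const std::vector<std::string>& queries, std::size_t k) {
    // Step 1: O(N) frequency statistics with a hash table.
    std::unordered_map<std::string, long long> freq;
    for (const auto& q : queries) ++freq[q];

    // Step 2: a min-heap of size K ordered by count; each of the N' distinct
    // queries is compared with the root, giving O(N' log K) overall.
    using Item = std::pair<long long, std::string>;               // (count, query)
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> heap;
    for (const auto& kv : freq) {
        heap.emplace(kv.second, kv.first);
        if (heap.size() > k) heap.pop();                          // evict current minimum
    }

    std::vector<std::pair<std::string, long long>> result;
    while (!heap.empty()) {
        result.push_back({heap.top().second, heap.top().first});
        heap.pop();
    }
    return result;                                                // ascending by count
}

int main() {
    std::vector<std::string> queries = {"a", "b", "a", "c", "a", "b"};
    for (const auto& r : topKQueries(queries, 2))
        std::cout << r.first << " " << r.second << "\n";
}
```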

  Alternatively, use a trie in which each node's key field stores the number of occurrences of the corresponding query string (0 if it never appears as a query); finally, use a min-heap of 10 elements to rank the frequencies.
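
  A minimal sketch of such a counting trie, assuming query strings are plain byte strings; selecting the Top 10 would then reuse the min-heap from the previous sketch.

```cpp
// Sketch: a trie whose nodes carry an occurrence count for the string ending there.
#include <map>
#include <memory>
#include <string>

struct TrieNode {
    long long count = 0;                        // occurrences of the string ending here
    std::map<char, std::unique_ptr<TrieNode>> children;
};

void insert(TrieNode* root, const std::string& query) {
    TrieNode* node = root;
    for (char c : query) {
        auto& child = node->children[c];
        if (!child) child = std::make_unique<TrieNode>();
        node = child.get();
    }
    ++node->count;                              // one more occurrence of this query
}

long long lookup(const TrieNode* root, const std::string& query) {
    const TrieNode* node = root;
    for (char c : query) {
        auto it = node->children.find(c);
        if (it == node->children.end()) return 0;   // never appeared
        node = it->second.get();
    }
    return node->count;
}

int main() {
    TrieNode root;
    insert(&root, "hello");
    insert(&root, "hello");
    return lookup(&root, "hello") == 2 ? 0 : 1;     // simple self-check
}
```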

  3. There is a 1G file in which each line is a word; no word exceeds 16 bytes, and the memory limit is 1M. Return the 100 most frequent words.

  Scheme: read the file sequentially. For each word x, compute hash(x) % 5000 and write x into one of 5000 small files (denoted x0, x1, ..., x4999) according to that value. Each file is then about 200K on average.

  If some of the files exceed 1M, continue splitting them in a similar way until no small file exceeds 1M.

  For each small file, count the words it contains and their frequencies (a trie or hash_map can be used), and take out the 100 most frequent words (a min-heap with 100 nodes works), saving these 100 words and their frequencies to a result file; this produces another 5000 files. The last step is to merge these 5000 result files (similar to a merge sort); a sketch of this final merge is given below.
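
  A minimal sketch of that final merge, under the assumption that each partial result file stores lines of the form "word count" (the file names below are placeholders). Because the words were hash-partitioned, each word lives in exactly one small file, so its partial count is already its global count and the merge reduces to selecting the overall Top 100 with a 100-element min-heap.

```cpp
// Sketch: combine 5000 partial Top-100 result files into a global Top 100.
#include <fstream>
#include <functional>
#include <iostream>
#include <queue>
#include <string>
#include <vector>

int main() {
    using Item = std::pair<long long, std::string>;   // (count, word)
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> heap;  // min-heap of 100

    for (int i = 0; i < 5000; ++i) {
        std::ifstream in("top100_" + std::to_string(i) + ".txt");
        std::string word; long long count;
        while (in >> word >> count) {
            heap.emplace(count, word);
            if (heap.size() > 100) heap.pop();        // keep only the 100 largest counts
        }
    }
    while (!heap.empty()) {                           // prints in ascending count order
        std::cout << heap.top().second << " " << heap.top().first << "\n";
        heap.pop();
    }
}
```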

4. There are 10 files of 1G each; every line of every file stores a user query, and queries may repeat across files. You are asked to sort the queries by frequency.

  This is still a typical TOP K style problem; the solutions are as follows:

  Plan 1:

  Read the 10 files in sequence and write each query into one of another 10 files according to hash(query) % 10. Each newly generated file is also about 1G (assuming the hash function is roughly uniform).

  Find a machine with about 2G of memory and, for each of the 10 files in turn, use a hash_map(query, query_count) to count the occurrences of each query, then sort by occurrence count using quicksort/heapsort/merge sort and output the sorted queries with their query_count to a file. This yields 10 sorted files.

  Finally, merge these 10 sorted files (an internal sort combined with an external merge), as sketched below.
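
  A minimal sketch of that k-way merge, assuming each sorted file stores lines of the form "count query" in descending order of count and that queries contain no whitespace (the file names and line format are assumptions for illustration).

```cpp
// Sketch: k-way external merge of 10 sorted files with a priority queue.
#include <fstream>
#include <iostream>
#include <queue>
#include <string>
#include <tuple>
#include <vector>

int main() {
    const int kFiles = 10;
    std::vector<std::ifstream> files(kFiles);
    // Heap entries: (count, query, file index); the largest count is on top.
    using Entry = std::tuple<long long, std::string, int>;
    std::priority_queue<Entry> heap;

    auto readOne = [&](int i) {               // pull the next line of file i, if any
        long long count; std::string query;
        if (files[i] >> count >> query) heap.emplace(count, query, i);
    };

    for (int i = 0; i < kFiles; ++i) {
        files[i].open("sorted_" + std::to_string(i) + ".txt");
        readOne(i);
    }
    while (!heap.empty()) {                   // standard k-way merge: pop max, refill
        auto [count, query, idx] = heap.top();
        heap.pop();
        std::cout << count << " " << query << "\n";
        readOne(idx);
    }
}
```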

  Plan 2:

  In general, the total number of distinct queries is limited, but each repeats many times, so it may be possible to fit all distinct queries in memory at once. In that case we can directly count the occurrences of each query with a trie or hash_map, and then sort by count using quicksort/heapsort/merge sort.

  Plan 3:

  Similar to Plan 1, but after the data is hash-partitioned into multiple files, the files can be handed to multiple machines and processed with a distributed architecture (such as MapReduce), with a final merge at the end.

  5. Given two files a and b, each storing 5 billion URLs where each URL occupies 64 bytes, and a memory limit of 4G, find the URLs common to files a and b.

  Option 1: it can be estimated that each file is roughly 5 billion × 64 bytes ≈ 320G, far larger than the 4G memory limit, so it cannot be loaded entirely into memory. Consider a divide-and-conquer approach.

  Traverse file a, compute hash(url) % 1000 for each URL, and write the URLs into 1000 small files (denoted a0, a1, ..., a999) according to the obtained value. Each small file is then about 300M.

  Traverse file b and write its URLs into 1000 small files (denoted b0, b1, ..., b999) using the same hash as for a. After this, any URLs that could be identical must fall into corresponding small-file pairs (a0 vs b0, a1 vs b1, ..., a999 vs b999); non-corresponding small files cannot share a URL. So we only need to find the common URLs within each of the 1000 pairs of small files.

  To find the common URLs in a pair of small files, load the URLs of one small file into a hash_set, then traverse each URL of the other small file and check whether it is in that hash_set. If it is, it is a common URL and can be written to the output file. A minimal sketch follows.
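
  A minimal sketch for one pair of small files, with placeholder file names; std::unordered_set plays the role of the hash_set.

```cpp
// Sketch: intersect one small-file pair (a_i, b_i) via a hash set.
#include <fstream>
#include <string>
#include <unordered_set>

void commonUrls(const std::string& fileA, const std::string& fileB,
                const std::string& outFile) {
    std::unordered_set<std::string> seen;
    std::ifstream a(fileA);
    std::string url;
    while (std::getline(a, url)) seen.insert(url);   // a_i (~300M) fits in memory

    std::ifstream b(fileB);
    std::ofstream out(outFile, std::ios::app);
    while (std::getline(b, url))
        if (seen.count(url)) out << url << '\n';     // common URL
}

int main() { commonUrls("a_0.txt", "b_0.txt", "common_urls.txt"); }
```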

  Option 2: if a certain error rate is acceptable, a Bloom filter can be used: 4G of memory can represent roughly 34 billion bits. Map the URLs of one file into these 34 billion bits with the Bloom filter, then read the URLs of the other file one by one and test each against the Bloom filter. If a URL tests positive, it should be a common URL (note that there is a certain false-positive rate).
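
  A minimal Bloom filter sketch. The bit-array size, the number of probes, and the salted std::hash scheme below are illustrative assumptions, not tuned parameters for the 5-billion-URL setting.

```cpp
// Sketch: a tiny Bloom filter with k salted hash probes over a shared bit array.
#include <functional>
#include <iostream>
#include <string>
#include <vector>

class BloomFilter {
public:
    BloomFilter(std::size_t bits, int hashes) : bits_(bits, false), hashes_(hashes) {}

    void add(const std::string& s) {
        for (int i = 0; i < hashes_; ++i) bits_[index(s, i)] = true;
    }
    // False means definitely absent; true means "probably present"
    // (false positives are possible, false negatives are not).
    bool possiblyContains(const std::string& s) const {
        for (int i = 0; i < hashes_; ++i)
            if (!bits_[index(s, i)]) return false;
        return true;
    }

private:
    std::size_t index(const std::string& s, int salt) const {
        // Combine one string hash with a salt to simulate multiple hash functions.
        std::size_t h = std::hash<std::string>{}(s) ^ (0x9e3779b97f4a7c15ULL * (salt + 1));
        return h % bits_.size();
    }
    std::vector<bool> bits_;
    int hashes_;
};

int main() {
    BloomFilter bf(1 << 20, 7);                  // ~1M bits, 7 probes (demo sizes only)
    bf.add("http://example.com/a");
    std::cout << bf.possiblyContains("http://example.com/a") << "\n";  // 1
    std::cout << bf.possiblyContains("http://example.com/b") << "\n";  // very likely 0
}
```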

  Bloom filters will be explained in detail in a future post on this blog.

  6. Find the non-repeating integers among 250 million integers. Note that there is not enough memory to hold all 250 million integers.

  Option 1: use a 2-bitmap (allocate 2 bits per number: 00 means not present, 01 means seen once, 10 means seen multiple times, 11 is unused). This requires 2^32 × 2 bits = 1 GB of memory in total, which is acceptable. Then scan the 250 million integers and update the corresponding 2-bit entries: 00 becomes 01, 01 becomes 10, and 10 stays unchanged. After the scan, walk the bitmap and output every integer whose entry is 01.
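
  A minimal sketch of the 2-bitmap, with the input integers taken from a small in-memory vector for illustration (in the real problem they would be streamed from disk, and the final step would walk the whole bitmap rather than the candidate list).

```cpp
// Sketch: two bits per value (00 unseen, 01 seen once, 10 seen more than once).
#include <cstdint>
#include <iostream>
#include <vector>

int main() {
    // 2^32 values * 2 bits = 1 GB; this demo really allocates the full table.
    std::vector<uint8_t> table((1ULL << 32) * 2 / 8, 0);

    auto get = [&](uint32_t x) { return (table[x >> 2] >> ((x & 3) * 2)) & 3; };
    auto set = [&](uint32_t x, uint8_t v) {
        int shift = (x & 3) * 2;
        table[x >> 2] = static_cast<uint8_t>((table[x >> 2] & ~(3 << shift)) | (v << shift));
    };

    std::vector<uint32_t> numbers = {7, 42, 7, 1000000000u, 42, 7};
    for (uint32_t x : numbers) {
        uint8_t state = static_cast<uint8_t>(get(x));
        if (state == 0) set(x, 1);          // 00 -> 01 (seen once)
        else if (state == 1) set(x, 2);     // 01 -> 10 (seen multiple times)
        // state 2 stays unchanged
    }
    for (uint32_t x : numbers)              // demo: scan candidates instead of the full bitmap
        if (get(x) == 1) std::cout << x << " appears exactly once\n";
}
```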

  Option 2: a method similar to question 1 can also be used: partition the data into small files, find the non-repeating integers within each small file and sort them, then merge the results, taking care to remove duplicate elements.

  7. Tencent interview question: given 4 billion distinct unsigned int integers, unsorted, and then given another number, how do you quickly determine whether that number is among the 4 billion?

  Similar to question 6 above, my first reaction was quicksort + binary search. Here are other better ways:

  Option 1: allocate 512M of memory and let one bit represent one unsigned int value. Read in the 4 billion numbers and set the corresponding bits; then read in the number to be queried and check whether its corresponding bit is 1. If it is 1, the number exists; if it is 0, it does not.
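
  A minimal sketch of this bit-array approach; the demo data is an assumption, and the full 512 MB table is allocated as described (so running it needs that much RAM).

```cpp
// Sketch: one bit per unsigned int value, 2^32 bits = 512 MB.
#include <cstdint>
#include <iostream>
#include <vector>

int main() {
    std::vector<uint64_t> bits((1ULL << 32) / 64, 0);        // 512 MB of bits

    auto setBit  = [&](uint32_t x) { bits[x >> 6] |= (1ULL << (x & 63)); };
    auto testBit = [&](uint32_t x) { return (bits[x >> 6] >> (x & 63)) & 1ULL; };

    std::vector<uint32_t> data = {3, 123456789u, 4000000000u};  // stands in for the 4 billion numbers
    for (uint32_t x : data) setBit(x);

    uint32_t query = 123456789u;
    std::cout << (testBit(query) ? "exists" : "does not exist") << "\n";
}
```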

  dizengrong:

  Option 2: this problem is described well in "Programming Pearls"; the following line of thought can be used:

  Since 2^32 is more than 4 billion, a given number may or may not be among them.

  Here we represent each of the 4 billion numbers in 32-bit binary.

  Suppose these 4 billion numbers start out in one file.

  Then divide the 4 billion numbers into two categories:

  1. The highest bit is 0

  2. The highest bit is 1

  Write these two categories into two files; one of them contains at most 2 billion numbers and the other at least 2 billion (this is equivalent to halving the search space);

  Compare against the highest bit of the number to be searched, then descend into the corresponding file and continue searching.

  This file is then divided into two categories:

  1. The second highest bit is 0

  2. The second highest bit is 1

  Write these two categories into two files; one of them contains at most 1 billion numbers and the other at least 1 billion (again halving the search space);

  Compare against the second most significant bit of the number to be searched, then descend into the corresponding file and search again.

  …….

  Continuing by analogy, the number can be located; the search takes O(log n) rounds of partitioning (at most 32 for 32-bit integers), which finishes Option 2. A small in-memory illustration follows.
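
  A small in-memory illustration of the bit-partition idea (in the real problem each category would be written to its own file, and only the file matching the target's bit would be read in the next round).

```cpp
// Sketch: repeatedly keep only the category whose current bit matches the target.
#include <cstdint>
#include <iostream>
#include <vector>

bool containsByBitPartition(std::vector<uint32_t> candidates, uint32_t target) {
    for (int bit = 31; bit >= 0 && !candidates.empty(); --bit) {
        std::vector<uint32_t> next;
        uint32_t want = (target >> bit) & 1u;                   // target's bit at this level
        for (uint32_t x : candidates)
            if (((x >> bit) & 1u) == want) next.push_back(x);   // keep the matching category
        candidates.swap(next);                                  // roughly halves each round
    }
    return !candidates.empty();                                 // all 32 bits matched
}

int main() {
    std::vector<uint32_t> data = {5, 9, 4000000000u, 17};
    std::cout << containsByBitPartition(data, 9) << "\n";   // 1
    std::cout << containsByBitPartition(data, 10) << "\n";  // 0
}
```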

  Appendix: here is a brief introduction to the bitmap method:

  Use the bitmap method to determine whether there are duplicates in an integer array

  Determining whether a collection contains duplicates is a common programming task. When the collection is large, we usually want to minimize the number of scans, so the naive double-loop method is not desirable.

  The bitmap method suits this situation. The idea is: according to the largest element max in the collection, create a new array of length max+1 initialized to zeros, then scan the original array once; when an element is encountered, set the position it indexes in the new array to 1. For example, when 5 is encountered, set the sixth element of the new array to 1; the next time 5 appears and we try to set that position, we find the sixth element is already 1, which means this value duplicates an earlier one. This method of initializing a new array with zeros and then setting ones resembles the way bitmaps are processed, hence the name bitmap method. Its worst-case operation count is 2N. If the maximum value of the array is known in advance and the new array can be allocated up front, the efficiency can be doubled.

  Better ideas or methods are welcome; feel free to share them.

  8. How do you find the item that repeats the most times in massive data?

  Option 1: hash first and partition the data into small files by modulo, find the most repeated item in each small file and record its repetition count, then find the item with the largest count among the candidates obtained in the previous step (refer to the earlier questions for details).

  9. Given tens of millions or hundreds of millions of data items (with repetitions), count the N items that appear most frequently.

  Option 1: tens of millions or hundreds of millions of items should fit in the memory of a current machine, so consider using a hash_map / binary search tree / red-black tree to count occurrences, and then take out the top N most frequent items using the heap mechanism mentioned in question 2.

  10. Given a text file with about 10,000 lines and one word per line, count the 10 most frequent words. Give your approach and a time-complexity analysis.

  Option 1: this question is about time efficiency. Use a trie to count the occurrences of each word; the time complexity is O(n*le), where le is the average word length. Then find the 10 most frequent words, which can be done with a heap as in the previous question in O(n*lg10). The total time complexity is therefore the larger of O(n*le) and O(n*lg10).

  Appendix: find the largest 100 numbers among 1,000,000 (100w) numbers.

  Option 1: as mentioned in the previous questions, this can be done with a min-heap of 100 elements. The complexity is O(100w*lg100).

  Option 2: adopt the partitioning idea of quicksort: after each partition, only keep the part larger than the pivot, until that part has just over 100 elements; then sort it with a traditional sorting algorithm and take the first 100. The complexity is O(100w*100).

  Option 3: use partial elimination. Take the first 100 elements and sort them, calling the result sequence L. Then scan the remaining elements one at a time; compare each element x with the smallest of the sorted 100. If x is larger than the smallest, remove the smallest and insert x into L using the idea of insertion sort. Repeat until all elements have been scanned. The complexity is O(100w*100). A quickselect-style sketch of Option 2 is given below.
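
  A minimal sketch of the quicksort-partition idea from Option 2, using the standard library's quickselect (std::nth_element) on randomly generated demo data.

```cpp
// Sketch: place the 100 largest numbers at the front with quickselect, then sort them.
#include <algorithm>
#include <functional>
#include <iostream>
#include <random>
#include <vector>

int main() {
    std::vector<int> nums(1000000);
    std::mt19937 rng(42);
    for (int& x : nums) x = static_cast<int>(rng());

    // Partition so that the 100 largest elements occupy nums[0..99]
    // (unsorted among themselves), in expected linear time.
    std::nth_element(nums.begin(), nums.begin() + 100, nums.end(), std::greater<int>());
    std::sort(nums.begin(), nums.begin() + 100, std::greater<int>());  // sort just the top 100

    std::cout << "largest: " << nums[0] << ", 100th largest: " << nums[99] << "\n";
}
```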

This article is reproduced from:
http://support.huawei.com/huaweiconnect/enterprise/zh/thread-334009.html
