Classic big-data interview questions and answering techniques from 2018: how many have you seen?

1. From massive log data, extract the IP address that visited Baidu the most times on a given day.
Solution: First, extract the IPs that accessed Baidu that day from the log and write them one by one into a large file. Note that an IP address is 32 bits, so there are at most 2^32 distinct IPs. Then use a mapping, for example hash(IP) % 1000, to split the large file into 1000 small files, so that all occurrences of the same IP land in the same small file. In each small file, use a hash_map for frequency statistics and find the IP with the highest count, together with that count. Finally, among the 1000 per-file winners, pick the IP with the largest count; that is the required IP.
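A minimal Python sketch of this partition-then-count idea (the log format, file names, 1000-way split, and the use of collections.Counter in place of hash_map are illustrative assumptions):

```python
import os
from collections import Counter

NUM_BUCKETS = 1000  # split factor assumed from the text

def partition_ips(log_path, bucket_dir):
    """Scan the big log once and scatter IPs into small bucket files."""
    os.makedirs(bucket_dir, exist_ok=True)
    buckets = [open(os.path.join(bucket_dir, f"ip_{i}.txt"), "w")
               for i in range(NUM_BUCKETS)]
    with open(log_path) as log:
        for line in log:
            ip = line.strip()                      # assume one IP per line
            buckets[hash(ip) % NUM_BUCKETS].write(ip + "\n")
    for f in buckets:
        f.close()

def most_frequent_ip(bucket_dir):
    """Count each bucket in memory, then take the winner of the per-bucket winners."""
    best_ip, best_count = None, 0
    for i in range(NUM_BUCKETS):
        with open(os.path.join(bucket_dir, f"ip_{i}.txt")) as f:
            counts = Counter(line.strip() for line in f)
        if counts:
            ip, cnt = counts.most_common(1)[0]
            if cnt > best_count:
                best_ip, best_count = ip, cnt
    return best_ip, best_count
```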

2. A search engine logs every query string a user submits; each query string is 1-255 bytes long.
Suppose there are currently 10 million records. The query strings repeat heavily: although the total is 10 million, there are no more than 3 million distinct strings after deduplication. The more often a query string repeats, the more users issued it and the more popular it is. Count the 10 most popular query strings, using no more than 1 GB of memory.
Solution: Step 1: preprocess the data with a hash table, completing the frequency counting in O(N) time.
Step 2: use a heap to find the Top K. With a heap, finding and adjusting an element takes O(log K) time, so maintain a min-heap of size K (K = 10 here), traverse the 3 million distinct queries, and compare each one against the heap root. The overall time complexity is therefore O(N) + O(N' log K), where N is 10 million and N' is 3 million.
Alternatively, use a trie whose key field stores the occurrence count of each query string (0 if it never appears), and then rank the frequencies with a min-heap of 10 elements.
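A minimal sketch of the count-then-heap step, assuming the queries are available as an iterable of strings (a Python Counter stands in for the hash table or trie):

```python
import heapq
from collections import Counter

def top_k_queries(queries, k=10):
    """Count every query, then keep a size-k min-heap of the most frequent ones."""
    counts = Counter(queries)               # O(N) hash-table counting
    heap = []                               # min-heap of (count, query), size <= k
    for query, cnt in counts.items():       # N' distinct queries
        if len(heap) < k:
            heapq.heappush(heap, (cnt, query))
        elif cnt > heap[0][0]:              # beats the smallest of the current top k
            heapq.heapreplace(heap, (cnt, query))
    return sorted(heap, reverse=True)       # most popular first

# top_k_queries(["a", "b", "a", "c", "a", "b"], k=2)  ->  [(3, 'a'), (2, 'b')]
```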

3. A 1 GB file contains one word per line; each word is at most 16 bytes and the memory limit is 1 MB. Return the 100 most frequent words.
Solution: Read the file sequentially. For each word x, compute hash(x) % 5000 and append x to the corresponding one of 5000 small files (call them x0, x1, ..., x4999). Each file is then about 200 KB on average.
If any file exceeds 1 MB, keep splitting it in the same way until every piece is under 1 MB. For each small file, count the words it contains and their frequencies (a trie or hash_map works) and take the 100 most frequent words (a min-heap of 100 nodes works), saving those 100 words and their frequencies to a file. This yields another 5000 files; the last step is to merge them (similar to merge sort).

4. There are 10 files of 1 GB each; every line of every file is a user query, and queries may repeat across files. Sort the queries by frequency.
Solution: Option 1: Read the 10 files sequentially and write each query into one of 10 new files according to hash(query) % 10. Each newly generated file is also about 1 GB (assuming the hash function distributes evenly). On a machine with about 2 GB of memory, use a hash_map(query, query_count) to count the occurrences of each query in each new file in turn, sort its queries by occurrence count (quick sort, heap sort, or merge sort), and write the sorted queries with their query_count to a file. This produces 10 sorted files.
Finally, merge-sort these 10 files (internal sorting combined with external sorting).
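A minimal sketch of that final k-way merge, assuming each run file stores lines of the form count<TAB>query already sorted by descending count (the file format is an illustrative assumption):

```python
import heapq

def merge_sorted_runs(run_paths, out_path):
    """Stream-merge run files that are each sorted by descending count."""
    count_of = lambda line: int(line.split("\t", 1)[0])   # parse "count<TAB>query"
    files = [open(p) for p in run_paths]
    with open(out_path, "w") as out:
        # heapq.merge keeps only one pending line per input file in memory.
        for line in heapq.merge(*files, key=count_of, reverse=True):
            out.write(line)
    for f in files:
        f.close()
```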

Option 2: In general the number of distinct queries is limited while the repetition count is high, so all distinct queries may fit in memory at once. In that case, count the occurrences of each query directly with a trie or hash_map, then sort by occurrence count with quick sort, heap sort, or merge sort.

Option 3: Similar to Option 1, but after hashing into multiple files, the files can be handed to multiple machines for processing under a distributed architecture (such as MapReduce), with the results merged at the end.
5. Given two files a and b, each storing 5 billion URLs, with each URL occupying 64 bytes and a memory limit of 4 GB, find the URLs common to files a and b.
Solution: Option 1: Each file is roughly 5,000,000,000 × 64 bytes ≈ 320 GB, far larger than the 4 GB memory limit, so the files cannot be loaded into memory in full. Take a divide-and-conquer approach.
Read through file a, compute hash(url) % 1000 for each URL, and append the URL to one of 1000 small files (a0, a1, ..., a999) according to that value. Each small file is then roughly 300 MB.
Read through file b and scatter its URLs into 1000 small files (b0, b1, ..., b999) in the same way. After this, any URL that appears in both files must land in a corresponding pair of small files (a0 vs b0, a1 vs b1, ..., a999 vs b999); URLs in non-corresponding files cannot be equal. So it only remains to find the common URLs within each of the 1000 pairs.
For each pair, load the URLs of one small file into a hash_set, then traverse the URLs of the other file and check whether each is in that hash_set. If it is, it is a common URL and can be written to the output file.
Option 2: If a certain error rate is acceptable, a Bloom filter can be used: 4 GB of memory represents roughly 34 billion bits. Map the URLs of one file into those 34 billion bits with the Bloom filter, then read the URLs of the other file one by one and test each against the filter. Any URL that tests positive should be a common URL (note that there is a certain false-positive rate).
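A minimal Bloom-filter sketch under simplifying assumptions (a far smaller bit array than the 34 billion bits above, and k bit positions derived from salted SHA-256 digests, which is an illustrative choice rather than anything the article prescribes):

```python
import hashlib

class BloomFilter:
    """Probabilistic set: no false negatives, a tunable false-positive rate."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, item):
        return all(self.bits[pos >> 3] & (1 << (pos & 7))
                   for pos in self._positions(item))

# Usage sketch: add every URL from file a, then test each URL from file b;
# a hit means "probably common", a miss means "definitely not common".
# bf = BloomFilter(num_bits=10**8, num_hashes=7)   # sized down for experiments
```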

6. Find the integers that do not repeat among 250 million integers. Note that memory cannot hold all 250 million integers at once.
Solution: Option 1: Use a 2-bit bitmap (2 bits per possible value: 00 means not seen, 01 means seen once, 10 means seen more than once, 11 is unused). This needs 2^32 × 2 bits = 1 GB of memory, which is acceptable. Scan the 250 million integers and update the corresponding bits in the bitmap: 00 becomes 01, 01 becomes 10, and 10 stays unchanged. After the scan, read through the bitmap and output every integer whose bits are 01.
Option 2: A method similar to Question 1 can be used: split into small files, find the non-repeating integers within each small file and sort them, then merge, taking care to remove duplicate elements.
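A minimal 2-bit-bitmap sketch, assuming 32-bit non-negative integers; the full table for 2^32 values is 1 GB, so a smaller max_value can be passed while experimenting:

```python
def find_unique(numbers, max_value=2 ** 32):
    """Return integers that appear exactly once, using 2 bits of state per value."""
    table = bytearray((max_value + 3) // 4)        # 4 values per byte, 2 bits each
    for n in numbers:                              # assume 0 <= n < max_value
        idx, slot = divmod(n, 4)
        state = (table[idx] >> (2 * slot)) & 0b11
        if state == 0b00:                          # not seen -> seen once
            table[idx] |= 0b01 << (2 * slot)
        elif state == 0b01:                        # seen once -> seen many (01 -> 10)
            table[idx] ^= 0b11 << (2 * slot)
        # state 0b10: already "seen many", leave unchanged
    return [4 * idx + slot
            for idx, byte in enumerate(table)
            for slot in range(4)
            if (byte >> (2 * slot)) & 0b11 == 0b01]

# find_unique([7, 3, 7, 9, 3, 3], max_value=16)  ->  [9]
```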

7. Tencent interview question: Given 4 billion distinct unsigned int integers, unsorted, and then one more number, how do you quickly determine whether that number is among the 4 billion?
Solution: Option 1: Allocate 512 MB of memory and let each bit represent one unsigned int value. Read in the 4 billion numbers and set the corresponding bits, then read the number to be queried and check whether its bit is 1: a 1 means the number is present, a 0 means it is not.
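A minimal sketch of Option 1's bitmap, assuming 32-bit unsigned values (the full table is 2^32 bits = 512 MB; a smaller max_value keeps experiments light):

```python
class BitSet:
    """One bit per possible unsigned value: set a bit on insert, test it on query."""

    def __init__(self, max_value=2 ** 32):
        self.bits = bytearray((max_value + 7) // 8)    # 2**32 bits = 512 MB

    def add(self, n):
        self.bits[n >> 3] |= 1 << (n & 7)

    def __contains__(self, n):
        return bool(self.bits[n >> 3] & (1 << (n & 7)))

# Usage sketch:
# bs = BitSet(max_value=2 ** 20)        # sized down for experiments
# for n in (4, 8, 15, 16, 23, 42):      # stand-in for the 4 billion numbers
#     bs.add(n)
# 23 in bs   ->  True
# 24 in bs   ->  False
```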
Option 2 (from dizengrong): Since 2^32 is more than 4 billion, the given number may or may not be present. Write each of the 4 billion numbers in 32-bit binary, and assume the 4 billion numbers start out in one file.
Split the 4 billion numbers into two groups, those whose highest bit is 0 and those whose highest bit is 1, and write the two groups into two files; one file then holds at most 2 billion numbers and the other at least 2 billion (this halves the search space). Compare the highest bit of the number being searched for and descend into the corresponding file.
Split that file again by the second-highest bit (0 or 1) into two files; one holds at most 1 billion numbers and the other at least 1 billion (halving again). Compare the second-highest bit of the target number and descend into the corresponding file, and so on. The number can be located this way, with time complexity O(log n). That completes Option 2.

Attachment: a brief introduction to the bitmap method. Using a bitmap to determine whether an array contains duplicates is a common programming task. When the data set is large, we usually want to avoid multiple scans, so a double loop is not advisable.
The bitmap method fits this situation. Based on the largest element max in the collection, create a new array of length max + 1 initialized to zeros, then scan the original array once: when an element is encountered, set the corresponding position of the new array to 1. For example, when 5 is encountered, set position 5 (the sixth element) of the new array to 1; when 5 is encountered again and you try to set it, you find that position is already 1, which means the data must contain a duplicate. This zero-initialize-then-set-to-one technique resembles how bitmaps are handled, hence the name bitmap method. Its worst-case number of operations is 2N. If the maximum value of the array is known and the new array can be sized in advance, the efficiency can be doubled.
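A minimal sketch of this duplicate check, assuming non-negative integers (one whole byte per flag is used here for simplicity instead of a single bit):

```python
def has_duplicate(values, max_value=None):
    """Return True as soon as some value is seen twice, using one flag per value."""
    values = list(values)
    if max_value is None:
        max_value = max(values)            # one extra pass when the max is unknown
    seen = bytearray(max_value + 1)        # all flags start at 0
    for v in values:
        if seen[v]:
            return True                    # this position was already set: duplicate
        seen[v] = 1
    return False

# has_duplicate([3, 1, 4, 1, 5])  ->  True
# has_duplicate([3, 1, 4, 5])     ->  False
```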
8. How do you find the single most repeated item in massive data?
Solution: First hash the data and map it into small files by taking the hash value modulo the number of files; find the most repeated item in each small file and record its repetition count; then, among those per-file results, pick the item with the largest count (see the previous questions for details).

9. Given tens of millions or hundreds of millions of data items (with repetitions), count the N items that appear most frequently.
Solution: Tens of millions or hundreds of millions of items should fit in a current machine's memory, so count the occurrences with a hash_map, binary search tree, or red-black tree, then take out the top N most frequent items with the heap technique mentioned in Question 2.

10. A text file has about 10,000 lines, one word per line. Count the 10 most frequent words, and give your approach with a time-complexity analysis.
Solution: Option 1: This question is about time efficiency. Use a trie to count the occurrences of each word; the time complexity is O(n × le), where le is the average word length. Then find the 10 most frequent words with a heap, as in the previous questions, in O(n × log 10). The total time complexity is the larger of O(n × le) and O(n × log 10).
Appendix: find the largest 100 numbers among 1,000,000 numbers.
Option 1: As mentioned in the previous question, use a min-heap of 100 elements. The complexity is O(1,000,000 × log 100).
Option 2: Use the quick-sort partition idea: after each partition, only keep working on the part larger than the pivot; once the part larger than the pivot contains only a little more than 100 elements, sort it with a conventional sorting algorithm and take the first 100 (a sketch of this partition idea follows after Option 3). The complexity is O(1,000,000 × 100).

Option 3: Use partial elimination. Take the first 100 elements and sort them; call the result sequence L. Then scan the remaining elements one at a time, comparing each element x with the smallest of the 100 sorted elements: if x is larger, delete that smallest element and insert x into L using the idea of insertion sort. Repeat until all elements have been scanned. The complexity is O(1,000,000 × 100).
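A minimal sketch of the partition idea from Option 2: repeatedly partition around a pivot, keep only the side that still matters, and stop once the largest 100 are isolated (a pure-Python illustration under those assumptions, not a tuned implementation):

```python
import random

def top_k_largest(nums, k=100):
    """Isolate the k largest values by repeated 3-way partitioning, then sort them."""
    nums = list(nums)
    if k >= len(nums):
        return sorted(nums, reverse=True)
    lo, hi = 0, len(nums) - 1
    while lo < hi:
        pivot = nums[random.randint(lo, hi)]
        i, lt, gt = lo, lo, hi
        while i <= gt:                       # 3-way partition: >pivot | ==pivot | <pivot
            if nums[i] > pivot:
                nums[i], nums[lt] = nums[lt], nums[i]
                i += 1
                lt += 1
            elif nums[i] < pivot:
                nums[i], nums[gt] = nums[gt], nums[i]
                gt -= 1
            else:
                i += 1
        if k <= lt:                          # the k largest are all > pivot
            hi = lt - 1
        elif k <= gt + 1:                    # pivot copies complete the top k
            break
        else:                                # still need values from the "< pivot" side
            lo = gt + 1
    return sorted(nums[:k], reverse=True)

# top_k_largest([5, 1, 9, 3, 7, 9], k=3)  ->  [9, 9, 7]
```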
