Get all the massive data interview questions in one go

     Originally published in:

 

      It's raining again this weekend in Shenzhen, so let's talk about massive data today.

 

      Massive data questions are frequent visitors in BAT written tests and interviews, and TMD and other companies have followed suit. Similar problems do come up in real work as well.

      Massive data is hard to process quickly (the time dimension) and hard to load into memory all at once (the space dimension). For massive data, we must budget time and space carefully rather than resort to brute force.

 

       Common ideas for massive data processing are as follows:

       1. Hash divide and conquer: repartition data into buckets

       2. Hash map: O(1) lookup

       3. Hash map: frequency counting

       4. Bitmap: save space

       5. Bloom filter: save space

       6. Trie tree: save time

       7. File buckets: turn big files into small ones

       8. Heap: find the top K

       9. Quick sort partition: partial ordering

       10. External sort: multi-way merge

 

       In many scenarios, a combination of the above methods and tools is required. You may also use other tricks, such as XOR, or pure mathematics.

       Next, let's look directly at the massive data questions themselves.

 

Question 1

      Question: There are 10 billion uint32 elements, determine whether there are the same elements among them. Memory limit: 1K

 

      If you are not alert enough, you can hit a dead end on this question. Any programmer should be able to recall some common sense within 3 seconds, such as:

      a. A day has more than 80,000 seconds (86,400, to be exact)

      b. The maximum value of a uint32 is close to 4.3 billion (2^32 = 4,294,967,296)

 

     Therefore, by the pigeonhole principle, there must be duplicate elements among the 10 billion uint32 values: 10 billion elements cannot all be distinct when there are only about 4.3 billion possible uint32 values.
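The pigeonhole argument is pure arithmetic; as a two-line sanity check (the constant names are mine, not from the original):

```python
UINT32_VALUES = 2 ** 32           # 4,294,967,296 distinct uint32 values
NUM_ELEMENTS = 10_000_000_000     # 10 billion elements in the input

# Pigeonhole principle: more elements than possible distinct values,
# so at least two elements must be equal.
assert NUM_ELEMENTS > UINT32_VALUES
```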

 

 

Question 2

      Question: There are 4 billion uint32 elements, determine whether there are the same elements among them. Memory limit: 1.2G

 

     4 billion uint32 elements cannot be loaded directly into memory under this limit. Consider using 2 bitmaps, for a total memory usage of 1 GB. For background on bitmaps, see:

      Written interview questions: Find the intersection of integers (using bitmap)

     Specifically, create two bitmaps, bm1 and bm2, and use the pair (bm1[x], bm2[x]) to record four possible states for each element x:

bm1[x]   bm2[x]   meaning
  0        0      x appears 0 times
  0        1      x appears 1 time
  1        0      x appears 2 times
  1        1      x appears more than 2 times
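A minimal sketch of the two-bitmap (2 bits per value) counter described above. The class name `TwoBitCounter` and the small universe size are illustrative; for the real problem the universe would be 2^32, making each bitmap 512 MB.

```python
class TwoBitCounter:
    """Tracks per-value counts 0, 1, 2, or "many" using two bitmaps."""

    def __init__(self, universe_size):
        nbytes = (universe_size + 7) // 8
        self.bm1 = bytearray(nbytes)  # high bit of the 2-bit state
        self.bm2 = bytearray(nbytes)  # low bit of the 2-bit state

    def _get(self, bm, x):
        return (bm[x >> 3] >> (x & 7)) & 1

    def _set(self, bm, x):
        bm[x >> 3] |= 1 << (x & 7)

    def add(self, x):
        b1, b2 = self._get(self.bm1, x), self._get(self.bm2, x)
        if b1 == 0 and b2 == 0:        # state 00 -> 01: first occurrence
            self._set(self.bm2, x)
        elif b1 == 0 and b2 == 1:      # state 01 -> 10: second occurrence
            self.bm2[x >> 3] &= ~(1 << (x & 7)) & 0xFF
            self._set(self.bm1, x)
        elif b1 == 1 and b2 == 0:      # state 10 -> 11: "many"
            self._set(self.bm2, x)
        # state 11 ("many") is absorbing

    def count_state(self, x):
        """Returns 0, 1, 2, or 3 (3 = appeared more than twice)."""
        return (self._get(self.bm1, x) << 1) | self._get(self.bm2, x)

c = TwoBitCounter(1 << 16)
for v in [5, 7, 7, 9, 9, 9, 9]:
    c.add(v)
print(c.count_state(5), c.count_state(7), c.count_state(9))  # 1 2 3
```

For duplicate detection it suffices to notice, while adding, that some x is already in state 01 or beyond.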

       

 

Question 3     

      Question: There are 4 billion uint32 elements, determine whether there are the same elements among them. Memory limit: 0.6G

      

      Obviously, 2 bitmaps require 1 GB of memory, which does not meet the 0.6 GB limit. What to do? A single bitmap (512 MB) suffices. For background on bitmaps, see:

      Written interview questions: Find the intersection of integers (using bitmap)

     Specifically, read the 4 billion elements into the bitmap, marking each one; then count the number of set bits in the bitmap and compare that count with 4 billion. If the count is smaller, duplicate elements exist.
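A sketch of that single-bitmap check, with a toy universe standing in for 2^32 (the function name `has_duplicates` is mine):

```python
def has_duplicates(elements, universe_size):
    """Mark every element in a bitmap, then compare distinct vs. total."""
    bitmap = bytearray((universe_size + 7) // 8)  # 1 bit per possible value
    total = 0
    for x in elements:
        bitmap[x >> 3] |= 1 << (x & 7)            # mark value x
        total += 1
    # Count set bits = number of distinct values seen.
    distinct = sum(bin(byte).count("1") for byte in bitmap)
    return distinct < total   # fewer distinct values than elements => dupes

print(has_duplicates([1, 2, 3], 16))      # False
print(has_duplicates([1, 2, 2, 3], 16))   # True
```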

 

 

Question 4   

      Question: There are 4 billion uint32 elements, try to remove duplicates. Memory limit 0.6G

   

      Obviously, you can read the elements directly into a bitmap, which deduplicates them as a side effect; the memory consumption is 512 MB. For background on bitmaps, see:

      Written interview questions: Find the intersection of integers (using bitmap)

 

 

Question 5

      Question: There are 4 billion unique uint32 elements, try to sort them. Memory limit 0.6G

 

     Since the elements are distinct, they can be read directly into a bitmap; traversing the bitmap in order then yields the sorted sequence. Memory consumption is 512 MB. For background on bitmaps, see:

      Written interview questions: Find the intersection of integers (using bitmap)
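A sketch of bitmap sorting for distinct values, again with a toy universe in place of 2^32 (for the real problem you would walk the 512 MB bitmap byte by byte):

```python
def bitmap_sort(elements, universe_size):
    """Set one bit per distinct element, then emit set positions in order."""
    bitmap = bytearray((universe_size + 7) // 8)
    for x in elements:
        bitmap[x >> 3] |= 1 << (x & 7)
    # Walking the bitmap from 0 upward visits values in sorted order.
    return [x for x in range(universe_size)
            if bitmap[x >> 3] & (1 << (x & 7))]

print(bitmap_sort([9, 3, 7, 1], 16))  # [1, 3, 7, 9]
```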

 

 

Question 6

      Question: There are 4 billion uint32 elements, try sorting. Memory limit 0.6G

 

     Elements may have duplicate values, so a plain bitmap is not suitable. Consider bucket sort instead; buckets are a natural tool for massive data processing. See:

      Huashan Lunjian: bucket sort

 

     Of course, if you know multi-way merge sort, you can also use it here. Multi-way merging likewise relies on the idea of buckets, but the runs must then be combined by merging, so multi-way merge sort is less direct than the bucket sort approach above.
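A rough sketch of the range-based bucketing idea: split the uint32 value range into disjoint buckets, sort each bucket in memory, and concatenate. On disk each bucket would be a small file; here plain lists stand in for the files, and the names are mine.

```python
def bucket_sort(elements, num_buckets=16, universe=1 << 32):
    """Scatter by value range, sort each small bucket, concatenate."""
    width = universe // num_buckets
    buckets = [[] for _ in range(num_buckets)]
    for x in elements:                  # pass 1: scatter into range buckets
        buckets[x // width].append(x)
    result = []
    for b in buckets:                   # pass 2: each bucket fits in memory
        result.extend(sorted(b))        # buckets are already range-ordered
    return result

data = [4_000_000_000, 7, 3_000_000_000, 7, 42]
print(bucket_sort(data))  # [7, 7, 42, 3000000000, 4000000000]
```

Because the bucket ranges are disjoint and ordered, no merge step between buckets is needed, which is exactly why this beats multi-way merging here.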

 

 

Question 7

      Question: There are 4 billion unique uint32 elements, find top K. The memory limit is 0.6G

 

      Since the elements are distinct, they can be read directly into a bitmap, which effectively sorts them, so the top K fall out easily. For background on bitmaps, see:

     Written interview questions: Find the intersection of integers (using bitmap)

 

 

Question 8

      Question: There are 4 billion uint32 elements, ask for top K. The memory limit is 0.6G

 

      Elements may be repeated, so a plain bitmap is not suitable. A heap fits this situation well; see:

      Written interview question: find the top K (using a heap)
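A minimal sketch of the streaming top-K heap the linked article describes: keep a min-heap of size K, so memory stays O(K) no matter how many elements flow past (the function name `top_k` is mine).

```python
import heapq

def top_k(stream, k):
    """Return the k largest elements seen in the stream, largest first."""
    heap = []                             # min-heap of the k best so far
    for x in stream:
        if len(heap) < k:
            heapq.heappush(heap, x)
        elif x > heap[0]:                 # beats the current k-th largest
            heapq.heapreplace(heap, x)    # pop smallest, push x
    return sorted(heap, reverse=True)

print(top_k([5, 1, 9, 9, 3, 7, 2], 3))  # [9, 9, 7]
```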

      

 

Question 9

      Question: There are 4 billion unique uint32 elements, find the median. Memory limit 0.6G

       

      Since the elements are distinct, they can be read directly into a bitmap, which effectively sorts them, and then finding the median is easy. Memory consumption is 512 MB. For background on bitmaps, see:

      Written interview questions: Find the intersection of integers (using bitmap)

 

 

Question 10

      Question: There are 4 billion uint32 elements, find the median. Memory limit 0.6G

       

     Elements may be repeated, so a plain bitmap is not suitable. Consider using buckets instead; see:

     Written interview questions: find the median of massive data
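A sketch of the bucket approach to the median: one pass builds a histogram over value ranges; prefix sums then locate the bucket holding the median, which a second pass can examine in detail. Names and toy sizes are mine.

```python
def median_bucket(elements, num_buckets=16, universe=1 << 32):
    """Locate which range bucket holds the median and its rank inside it."""
    width = universe // num_buckets
    counts = [0] * num_buckets
    n = 0
    for x in elements:                  # pass 1: histogram over ranges
        counts[x // width] += 1
        n += 1
    target = n // 2                     # index of the (lower) median
    seen = 0
    for i, c in enumerate(counts):      # find the bucket containing it
        if seen + c > target:
            return i, target - seen     # (bucket index, rank inside bucket)
        seen += c

data = [10, 20, 30, 40, 50]
print(median_bucket(data, num_buckets=4, universe=64))  # (1, 1)
```

A second pass over the file then needs to keep only the elements of that one bucket in memory to pick out the exact median.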

 

 

Question 11

      Question: There are 4 billion uint32 elements. Determine whether n exists in these 4 billion elements. Memory limit 0.6G

       

     Use a bitmap directly: read the 4 billion elements into the bitmap, then check whether bitmap[n] is 1. Memory consumption is 512 MB. For background on bitmaps, see:

     Written interview questions: Find the intersection of integers (using bitmap)

 

 

 

Question 12

     Question: A file contains 4 billion URLs. A certain false-positive rate is allowed; determine whether a given URL is among them. Memory limit 0.6G

  

     The Bloom filter was born for exactly this kind of scenario; see:

     Determine whether the element exists in the set (using Bloom Filter)
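A minimal Bloom filter sketch. The hash scheme (slicing one SHA-1 digest into k 4-byte indexes) is an illustrative choice of mine, not from the linked article; real deployments size num_bits and num_hashes from the target false-positive rate.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.m = num_bits
        self.k = num_hashes               # must satisfy 4 * k <= 20 (SHA-1)
        self.bits = bytearray((num_bits + 7) // 8)

    def _indexes(self, item):
        """Derive k bit positions from one SHA-1 digest of the item."""
        digest = hashlib.sha1(item.encode()).digest()
        for i in range(self.k):
            chunk = digest[4 * i:4 * i + 4]
            yield int.from_bytes(chunk, "big") % self.m

    def add(self, item):
        for idx in self._indexes(item):
            self.bits[idx >> 3] |= 1 << (idx & 7)

    def might_contain(self, item):
        """False means definitely absent; True may be a false positive."""
        return all(self.bits[idx >> 3] & (1 << (idx & 7))
                   for idx in self._indexes(item))

bf = BloomFilter(num_bits=1 << 20, num_hashes=4)
bf.add("http://example.com/a")
print(bf.might_contain("http://example.com/a"))  # True
print(bf.might_contain("http://example.com/b"))  # almost certainly False
```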

 

 

Question 13

      Question: File A has 4 billion uint32 elements, and File B has 40,000 uint32 integers. Find the intersection. Memory limit 0.6G

  

       For this question, use a bitmap directly: read file A's elements into the bitmap, then test each of B's 40,000 integers against it. See:

      Written interview questions: Find the intersection of integers (using bitmap)

 

 

Question 14

      Question: File A has 4 billion uint32 elements, and File B has 40,000 uint32 integers. Find the intersection. Memory limit 10M

  

     Because the available memory is very small, a bitmap (512 MB) cannot be used. Consider hash divide and conquer to turn the big file into small ones; see the approach to question 15 below.

 

 

Question 15

      Question: A file has 4 billion URLs and B file has 40,000 URLs. Find the intersection. Memory limit 100M

  

     Obviously the big files need to be turned into small ones, so use the hash divide and conquer idea; see:

     Written interview question: find the intersection of urls (using hash divide and conquer)
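A sketch of hash divide and conquer for the URL intersection: both files are scattered into sub-buckets by hash(url) % N, so a URL common to A and B always lands in the same bucket pair, and each pair is small enough to intersect in memory. Lists stand in for the bucket files, and all names are mine.

```python
def hash_partition(urls, num_buckets):
    """Scatter URLs into buckets; equal URLs always share a bucket."""
    buckets = [[] for _ in range(num_buckets)]
    for url in urls:
        buckets[hash(url) % num_buckets].append(url)
    return buckets

def intersect(a_urls, b_urls, num_buckets=4):
    a_buckets = hash_partition(a_urls, num_buckets)
    b_buckets = hash_partition(b_urls, num_buckets)
    result = set()
    for a_b, b_b in zip(a_buckets, b_buckets):  # one small pair at a time
        result |= set(a_b) & set(b_b)
    return result

a = ["u1", "u2", "u3", "u4"]
b = ["u3", "u4", "u5"]
print(sorted(intersect(a, b)))  # ['u3', 'u4']
```

On disk, `hash_partition` would write each bucket to its own file, and only one (A bucket, B bucket) pair would be loaded into memory at a time.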

 

 

Question 16

      Question: A file contains a large number of source IPs from hacker attacks; find the most frequent IP.

  

      For this problem, first count the frequency of each IP, then pick the most frequent one, as follows:

      Step1: Use hash divide and conquer so that identical IPs land in the same small file bucket.

      Step2: Within each small file bucket, count occurrences with a hash map and record that bucket's most frequent IP.

      Step3: Compare the winners of all the buckets to find the globally most frequent IP.
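The three steps above can be sketched as follows, with in-memory lists standing in for the small file buckets (function and variable names are mine):

```python
from collections import Counter

def most_frequent_ip(ips, num_buckets=8):
    # Step 1: the same IP always hashes into the same bucket ("small file")
    buckets = [[] for _ in range(num_buckets)]
    for ip in ips:
        buckets[hash(ip) % num_buckets].append(ip)
    # Step 2: count within each bucket and keep its local winner
    local_winners = []
    for bucket in buckets:
        if bucket:
            ip, freq = Counter(bucket).most_common(1)[0]
            local_winners.append((freq, ip))
    # Step 3: the global winner is the best of the local winners
    return max(local_winners)[1]

ips = ["1.2.3.4", "5.6.7.8", "1.2.3.4", "9.9.9.9", "1.2.3.4"]
print(most_frequent_ip(ips))  # 1.2.3.4
```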

 

 

Question 17

      Question: 100 files on a certain server record the source IPs of a hacker's accesses; find the 10 most frequent IPs.

  

     First, use hash divide and conquer to route identical IPs from the 100 files into the same bucket file, then apply the same method as question 16 above, keeping the top 10 instead of a single winner.

 

 

Question 18

      Question: A file has 10,000 lines in total, one word per line; find the 10 most frequent words.

  

     With only 10,000 lines, all the data fits comfortably in memory, so it can be processed directly: use a hash map to count frequencies, then sort (or use a small heap) to get the top 10.

 

      Alternatively, a trie tree can do the job; see:

      Implementation of search engine prompt function (using trie tree)

      Note that besides counting, the trie tree can also be used for lookup, prefix matching, deduplication, and sorting, which makes it very useful.
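A minimal trie sketch for counting word frequencies; the node layout (a dict of children plus a count at the terminal node) is an illustrative choice, not from the linked article.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.count = 0    # times the word ending at this node appeared

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:                  # walk/create one node per char
            node = node.children.setdefault(ch, TrieNode())
        node.count += 1                  # word ends here: bump its count

    def frequency(self, word):
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return 0                 # prefix absent => count 0
        return node.count

t = Trie()
for w in ["the", "cat", "the", "then"]:
    t.insert(w)
print(t.frequency("the"), t.frequency("then"), t.frequency("dog"))  # 2 1 0
```

The top 10 then falls out by collecting (count, word) pairs during a traversal and keeping the 10 best, e.g. with a small heap.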

     

 

 

 


Origin blog.csdn.net/stpeace/article/details/109543892