Interviewer: You have three years of experience, so how can you have no grasp of massive data processing?

As a big data development engineer, I often run into massive data processing questions in job interviews.

Zhang Gong is a programmer. Recently, he went to a well-known Internet company for an interview. The interviewer asked this question:

There are two files, a and b, each storing 5 billion URLs. Each URL occupies 64 bytes, and the memory limit is 4 GB. How do you find the URLs that appear in both files?

Zhang Gong could not answer for a while, and the interviewer said: "You have been doing big data development for more than three years. This question shouldn't trouble you."

Hearing this, Zhang Gong was embarrassed.

In fact, an interviewer who asks this question wants to examine just two points:

  • Whether the applicant understands the divide and conquer strategy.

  • Whether the applicant can apply the divide and conquer strategy to solve problems in different scenarios.

Before solving this problem, let's review the divide and conquer strategy.

What is the divide and conquer strategy?

  1. Divide or reduce the original problem into smaller sub-problems

  2. Solve each sub-problem recursively or iteratively (solve independently)

  3. Synthesize the solutions of the sub-problems to get the solution of the original problem

Note:

  1. Each sub-problem has the same form as the original problem, only on a smaller scale

  2. Sub-problems can be solved independently of each other

  3. The sub-problem can be solved directly when the recursion stops (the sub-problem is small enough, we can have a direct solution algorithm)
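The three steps and the stopping condition can be seen in a small, familiar example. The sketch below uses merge sort, which is not part of the interview question and is chosen here only as an illustration; the function name `merge_sort` is hypothetical.

```python
def merge_sort(items):
    """Sort a list using the three divide-and-conquer steps."""
    # Stop condition: a list of 0 or 1 items can be solved directly.
    if len(items) <= 1:
        return list(items)

    # 1. Divide: split into two smaller problems of the same form.
    mid = len(items) // 2

    # 2. Solve: each half is sorted independently of the other.
    left = merge_sort(items[:mid])
    right = merge_sort(items[mid:])

    # 3. Combine: merge the two sorted halves into one sorted list.
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged
```

The URL question follows the same pattern, except that the "divide" step is done with a hash function and the pieces live in files rather than in memory.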

Now that we understand the divide and conquer strategy, let's go back to the interviewer's question.

Each URL occupies 64 bytes, so 5 billion URLs occupy about 320 GB:

5,000,000,000 × 64 B = 320,000,000,000 B = 320 GB
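The same arithmetic can be checked in a few lines. This is only a sketch: the constant names are made up for illustration, and it assumes decimal units (1 GB = 10^9 bytes).

```python
import math

URL_COUNT = 5_000_000_000          # URLs per file, from the question
URL_SIZE_BYTES = 64                # bytes per URL, from the question
MEMORY_LIMIT_BYTES = 4 * 10**9     # 4 GB, decimal units assumed

total_bytes = URL_COUNT * URL_SIZE_BYTES

# Lower bound on the number of pieces: even if a piece could fill memory
# completely, the file would need at least this many pieces.
min_partitions = math.ceil(total_bytes / MEMORY_LIMIT_BYTES)

print(total_bytes / 10**9)   # size of one file in GB
print(min_partitions)        # smallest possible number of pieces
```

The lower bound is only theoretical, because an in-memory collection of strings needs noticeably more space than the raw bytes on disk. A practical split therefore uses many more pieces than the minimum.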

Because memory is limited to 4 GB, we cannot load all the URLs into memory at once. Following the divide and conquer strategy, we can split the URLs in each file into many small files according to some characteristic, so that no small file exceeds 4 GB and each one can be read into memory for processing.

Solution

First traverse file a. For each URL, compute hash(URL) % 1000 and write the URL to one of the small files a0, a1, a2, ..., a999 according to the result. Why 1000 rather than 2000 or 3000? The number is chosen from the memory limit and the size of the file being split. Dividing 320 GB into 1000 parts gives small files of about 320 MB each, assuming the hash function spreads the URLs evenly, and a file of that size fits comfortably in 4 GB of memory.

Next, we traverse file b in the same way, using the same hash function, and write its URLs to the small files b0, b1, b2, ..., b999.
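The partitioning step for either file might look like the sketch below. It assumes one URL per line in the input file, and the names `bucket_index` and `partition_file` are hypothetical. It uses `zlib.crc32` instead of Python's built-in `hash()`, because `hash()` on strings is randomized per process by default, and files a and b must be split with exactly the same function.

```python
import os
import zlib


def bucket_index(url, num_buckets=1000):
    """Map a URL to a bucket number in the range [0, num_buckets)."""
    # crc32 returns the same value on every run, unlike str hash().
    return zlib.crc32(url.encode("utf-8")) % num_buckets


def partition_file(src_path, out_dir, prefix, num_buckets=1000):
    """Stream src_path line by line and write each URL to its bucket file."""
    os.makedirs(out_dir, exist_ok=True)
    # Assumption: the OS allows num_buckets files to be open at once.
    # If the open-file limit is lower, split in several passes instead.
    outputs = [
        open(os.path.join(out_dir, f"{prefix}{i}"), "w", encoding="utf-8")
        for i in range(num_buckets)
    ]
    try:
        with open(src_path, encoding="utf-8") as src:
            for line in src:          # only one line is in memory at a time
                url = line.strip()
                if url:
                    outputs[bucket_index(url, num_buckets)].write(url + "\n")
    finally:
        for f in outputs:
            f.close()
```

Calling `partition_file("a.txt", "buckets", "a")` and then `partition_file("b.txt", "buckets", "b")` would produce a0 to a999 and b0 to b999 in the same directory; the file and directory names here are placeholders.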

After this processing, any URL that appears in both files must land in a pair of small files with the same index: a0 corresponds to b0, a1 corresponds to b1, ..., a999 corresponds to b999. This is because identical URLs always produce the same hash value. Small files with different indexes cannot contain the same URL.

So we only need to find the common URLs within each of these 1000 pairs of small files.

For each i in [0, 999], traverse ai and store its URLs in a HashSet. Then traverse each URL in bi and check whether it exists in the HashSet. If it does, it is one of the common URLs we are looking for, and we can save it to a separate output file.
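The per-pair comparison can be sketched as follows, with Python's built-in `set` playing the role of the HashSet. The function names are hypothetical, and the code assumes the small files are named a0, b0, a1, b1, and so on, with one URL per line, all in one directory.

```python
import os


def common_urls_in_pair(a_path, b_path):
    """Yield each URL that appears in both small files, once."""
    # Load the whole a_i file into a set; it is small enough for memory.
    with open(a_path, encoding="utf-8") as fa:
        seen = {line.strip() for line in fa if line.strip()}
    # Stream b_i and test membership in the set.
    with open(b_path, encoding="utf-8") as fb:
        for line in fb:
            url = line.strip()
            if url in seen:
                seen.discard(url)   # avoid reporting duplicates in b_i twice
                yield url


def write_common_urls(bucket_dir, out_path, num_buckets=1000):
    """Process every (a_i, b_i) pair and return the number of URLs written."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for i in range(num_buckets):
            a_path = os.path.join(bucket_dir, f"a{i}")
            b_path = os.path.join(bucket_dir, f"b{i}")
            for url in common_urls_in_pair(a_path, b_path):
                out.write(url + "\n")
                count += 1
    return count
```

Only one set is alive at a time, so peak memory is bounded by the largest single small file plus the overhead of the set, not by the size of the original files.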

Summary

  • Divide and conquer: split each file by taking the hash of each URL modulo the number of small files;

  • Use a HashSet on each pair of small files to find the common URLs.

The divide and conquer strategy is worth summarizing and practicing in daily work, so that we can find the gaps in our understanding and keep improving our knowledge.

Given the author's limited expertise, this article inevitably has shortcomings. Corrections and criticism are welcome.

The approach in this article trades time for space. If you have a better solution, feel free to share it!


Origin blog.csdn.net/X8i0Bev/article/details/107852645