Summary of big data processing interview questions

Summarized from various books found online. The following questions are from The Beauty of Data Structures and Algorithms:

  1. Suppose we have 100,000 URL access log entries. How do we sort the URLs by their number of visits?
    Traverse the 100,000 entries, using the URL as the key and the number of visits as the value, and store them in a hash table, recording the maximum visit count K along the way; this pass is O(N).
    Note: the key is the URL, and hash(key) gives the index of the slot in the hash array where the value is stored. Collisions are possible, in which case the next suitable slot is found by open addressing (probing) or by chaining (a linked list). Once hash(key) is computed and any collision resolved, the value is placed in that slot.
    If K is not very large, bucket sort can be used, and the overall time complexity stays O(N). If K is very large (for example, greater than 100,000), use quicksort instead, with O(N log N) complexity.
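A minimal sketch of this approach, assuming the log is an iterable of lines with one URL per line (the function name and the bucket-sort cutoff are illustrative choices):

```python
from collections import Counter

def sort_urls_by_visits(log_lines):
    """Count visits per URL in one O(N) pass, then sort URLs by visit count."""
    counts = Counter()                 # hash table: URL -> number of visits
    for line in log_lines:
        counts[line.strip()] += 1      # assumes one URL per log line

    max_visits = max(counts.values(), default=0)   # the maximum visit count K

    if max_visits <= 100_000:
        # K is small: bucket sort on the visit count, O(N + K)
        buckets = [[] for _ in range(max_visits + 1)]
        for url, k in counts.items():
            buckets[k].append(url)
        return [url for bucket in buckets for url in bucket]   # ascending by visits

    # K is large: fall back to a comparison sort, O(N log N)
    return sorted(counts, key=counts.get)
```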
  2. There are two string arrays, each containing about 100,000 strings. How do we quickly find the strings that appear in both arrays?
    Build a hash table from the first array, with each string as the key and its number of occurrences as the value. Then iterate over the second array, looking each string up in the hash table; if the value is greater than zero, that string appears in both arrays. Time complexity is O(N).
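A short sketch of this lookup; reporting each common string only once is an added assumption:

```python
def common_strings(first, second):
    """Find the strings that appear in both arrays via a hash table over the first."""
    counts = {}
    for s in first:
        counts[s] = counts.get(s, 0) + 1    # key: string, value: occurrence count

    result = []
    for s in second:
        if counts.get(s, 0) > 0:            # present in the first array
            result.append(s)
            counts[s] = 0                   # assumption: report each common string once
    return result
```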
  3. Suppose there are 100,000 headhunters on Liepin.com. Each headhunter accumulates points by completing tasks (such as posting jobs) and can spend points to download resumes. As an engineer at Liepin, how would you store the 100,000 headhunter IDs and their point totals in memory so that the following operations are supported:
  1. Quickly look up, delete, and update a headhunter's points by headhunter ID;
  2. Find the list of headhunter IDs whose points fall within a given interval;
  3. Find the list of headhunter IDs ranked xth to yth in ascending order of points.

1) => Store the IDs in a hash table, so a headhunter can be looked up in O(1);
2) => Store the points in a skip list, which supports interval queries;
3) => Still open. Is a heap necessary? A segment tree or a binary indexed tree? A sketch covering all three operations follows below.
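A minimal sketch of one way to support all three operations, assuming integer points and string IDs; Python's bisect over a sorted list stands in for the skip list (an indexable skip list, or a balanced tree augmented with subtree sizes, would give the same interface with O(log N) updates):

```python
import bisect

class HeadhunterIndex:
    """Hash table for ID lookups plus a sorted list (stand-in for a skip list)."""

    def __init__(self):
        self.points_by_id = {}    # hash table: headhunter ID -> points, O(1) operations
        self.by_points = []       # sorted (points, ID) pairs, stand-in for a skip list

    def upsert(self, hid, points):
        """Insert a new headhunter or update an existing one's points."""
        old = self.points_by_id.get(hid)
        if old is not None:
            self.by_points.pop(bisect.bisect_left(self.by_points, (old, hid)))
        self.points_by_id[hid] = points
        bisect.insort(self.by_points, (points, hid))

    def delete(self, hid):
        """Remove a headhunter by ID."""
        old = self.points_by_id.pop(hid)
        self.by_points.pop(bisect.bisect_left(self.by_points, (old, hid)))

    def ids_with_points_between(self, lo, hi):
        """Operation 2: IDs whose points fall in [lo, hi] (assumes integer points)."""
        i = bisect.bisect_left(self.by_points, (lo,))
        j = bisect.bisect_left(self.by_points, (hi + 1,))
        return [hid for _, hid in self.by_points[i:j]]

    def ids_by_rank(self, x, y):
        """Operation 3: IDs ranked xth to yth (1-based) in ascending order of points."""
        return [hid for _, hid in self.by_points[x - 1:y]]
```

In this sketch operation 3 falls out of the sorted order for free by slicing on rank; a skip list augmented with span counts (or a binary indexed tree over point values) provides the same rank query without keeping a flat array.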

  4. How to count the number of occurrences of "search keywords"?
    Suppose we have a 1 TB log file that records users' search keywords, and we want to quickly count how many times each keyword was searched. What should we do?
    Let's analyze it. There are two difficulties. First, the search log is far too large to fit in the memory of a single machine. Second, if only one machine were used to process such a huge amount of data, it would take a very long time.
    To address both difficulties, we can first shard the data and then process it on multiple machines to improve speed. The idea is this: to speed up processing, we use n machines in parallel. We read each search keyword in turn from the log file, compute its hash value with a hash function, and take the result modulo n; the final value is the number of the machine that keyword should be assigned to.
    In this way, keywords with the same hash value are assigned to the same machine; that is, the same search keyword always ends up on the same machine. Each machine counts the occurrences of its own keywords, and the per-machine results are finally combined to produce the overall result.
    In fact, this is also the basic design idea behind MapReduce.
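A toy, single-process sketch of this sharding idea; the number of machines, the log format (one keyword per line), and the use of Python's built-in hash() are all assumptions, and the "machines" are simulated as in-process shards:

```python
from collections import Counter

N_MACHINES = 3  # hypothetical number of machines n

def shard_of(keyword, n=N_MACHINES):
    """Compute the hash value of the keyword, then take the modulus with n."""
    # Python's built-in hash() is stable within one process, which is enough here;
    # a real system would use a stable hash such as MD5.
    return hash(keyword) % n

def map_phase(log_lines, n=N_MACHINES):
    """Assign each keyword to a 'machine'; identical keywords land on the same shard."""
    shards = [[] for _ in range(n)]
    for line in log_lines:
        keyword = line.strip()
        shards[shard_of(keyword, n)].append(keyword)
    return shards

def reduce_phase(shards):
    """Each 'machine' counts its own shard; merging the counts gives the final result."""
    total = Counter()
    for shard in shards:
        total.update(shard)   # per-machine counting, then combined
    return total
```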

5. How to quickly determine whether a picture is in the gallery?
How do we quickly determine whether a picture is in the gallery? We discussed this example in the previous section; do you remember it? The method introduced there was to compute a unique identifier (an information digest) for each picture and then build a hash table.
Suppose our gallery now holds 100 million images. Obviously, building the hash table on a single machine will not work: a single machine's memory is limited, and a hash table over 100 million images far exceeds that limit.
Here, too, we can shard the data and process it on multiple machines. We prepare n machines, so that each machine maintains only the part of the hash table corresponding to some of the pictures. Each time we read a picture from the gallery, we compute its unique identifier and take it modulo the number of machines n; the result is the number of the machine the picture is assigned to, and we send the picture's unique identifier and file path to that machine, which adds them to its hash table.
When we want to determine whether a picture is in the gallery, we compute its unique identifier with the same hash algorithm and take it modulo the number of machines n. If the result is k, we look the picture up in the hash table built on machine number k.
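A small sketch of this routing rule, using the MD5 digest from the estimate below as the unique identifier (reading the picture as raw bytes is an assumption):

```python
import hashlib

def machine_for_image(image_bytes, n_machines):
    """Route a picture to a machine: MD5 digest as the unique identifier, modulo n."""
    digest = hashlib.md5(image_bytes).digest()           # 128-bit identifier (16 bytes)
    return int.from_bytes(digest, "big") % n_machines    # machine number k
```

Both building the gallery index and checking membership call the same function, so a given picture always maps to the same machine's hash table.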
Now, let's estimate how many machines are needed to build hash tables for these 100 million pictures.
Each entry in the hash table contains two pieces of information: the hash value and the path of the image file. Suppose we compute the hash value with MD5; it is 128 bits long, i.e., 16 bytes. The upper limit on the file path length is 256 bytes, and we can assume an average of 128 bytes. If we use chaining to resolve collisions, we also need to store a pointer, which occupies another 8 bytes. So each entry in the hash table occupies about 152 bytes (this is only an estimate, not an exact figure).
Assuming each machine has 2 GB of memory and the hash table's load factor is 0.75, one machine can index about 10 million pictures (2 GB * 0.75 / 152). So indexing 100 million images takes roughly ten machines. In engineering, this kind of estimation is very important: it gives us a rough idea in advance of the resources and budget required and helps us evaluate the feasibility of the solution.
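The back-of-the-envelope estimate as a worked calculation, using the same rough assumptions as above:

```python
ENTRY_BYTES    = 16 + 128 + 8     # MD5 digest + average path + chaining pointer = 152 bytes
MACHINE_MEMORY = 2 * 1024**3      # 2 GB of memory per machine
LOAD_FACTOR    = 0.75

images_per_machine = MACHINE_MEMORY * LOAD_FACTOR / ENTRY_BYTES   # ~10.6 million images
machines_needed = 100_000_000 / images_per_machine                # ~9.4, i.e. about ten machines
```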
In fact, multi-machine distributed processing is the general way to handle this kind of massive-data problem. With this idea of sharding, we can break through the limits of a single machine's memory, CPU, and other resources.

Source: blog.csdn.net/roufoo/article/details/131279087