Methods for processing hundreds of millions of data records

1. Massive log data: extract the IP that visited Baidu the most on a given day.

The number of distinct IPs is limited to at most 2^32, so you can consider using a hash map to hold the IPs directly in memory and then do the statistics.

Let me describe this plan in more detail. First, take the IPs out of that day's logs of accesses to Baidu and write them, one by one, into a large file. Note that an IP is 32 bits, so there are at most 2^32 distinct IPs. Then use a mapping, for example taking each IP modulo 1000, to split the large file into 1000 small files; in each small file find the IP with the highest frequency (a hash_map works well for the frequency statistics) together with that frequency. Finally, among those 1000 candidate IPs, pick the one with the largest frequency; that is the answer.
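
As a rough illustration of the plan above, here is a minimal Python sketch (the bucket count of 1000 comes from the description; the file names, function names, and the use of Python's built-in hash are illustrative assumptions, not part of the original):

    # Sketch: partition IPs from a large log into 1000 small files by hash,
    # then count frequencies in each small file and keep the overall winner.
    from collections import Counter

    NUM_BUCKETS = 1000   # with 1000 open handles you may need to raise the OS file-descriptor limit

    def partition_ips(log_path):
        # Equal IPs always hash to the same bucket, so each IP's full count
        # ends up inside a single small file.
        buckets = [open("ip_bucket_%d.txt" % i, "w") for i in range(NUM_BUCKETS)]
        with open(log_path) as log:
            for line in log:
                ip = line.strip()
                buckets[hash(ip) % NUM_BUCKETS].write(ip + "\n")
        for f in buckets:
            f.close()

    def most_frequent_ip():
        best_ip, best_count = None, 0
        for i in range(NUM_BUCKETS):
            with open("ip_bucket_%d.txt" % i) as f:
                counts = Counter(line.strip() for line in f)   # hash_map-style frequency count
            if counts:
                ip, cnt = counts.most_common(1)[0]
                if cnt > best_count:
                    best_ip, best_count = ip, cnt
        return best_ip, best_count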

2. A search engine records, via log files, every query string a user searches for; each query string is 1-255 bytes long.

Suppose there are currently 10 million records (these query strings repeat heavily: although the total is 10 million, there are no more than 3 million distinct ones after removing duplicates; the more often a query string repeats, the more users issued it and the more popular it is). Count the 10 most popular query strings, using no more than 1 GB of memory.

This is the classic Top K problem. The algorithm is: first, preprocess this batch of data and build the frequency statistics with a hash table in O(N) time; second, use a heap to find the Top K, which costs N'·log K, since a heap lets us find and adjust/move elements in logarithmic time. So maintain a min-heap of size K (10 in this question), then traverse the 3 million distinct queries and compare each against the heap's root. The final time complexity is therefore O(N) + N'·O(log K), where N is 10 million and N' is 3 million. Alternatively, use a trie whose terminal nodes store the number of occurrences of each query string (initially 0), and then use a min-heap of 10 elements to pick the queries with the highest frequencies.
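
A minimal sketch of the counting plus size-10 min-heap step described above (the function name and the assumption that the queries arrive as an iterable of strings are mine):

    # Count query frequencies with a hash table in O(N), then keep the top K
    # with a min-heap of size K in O(N' log K).
    import heapq
    from collections import Counter

    def top_k_queries(queries, k=10):
        freq = Counter(queries)            # step 1: hash-table statistics
        heap = []                          # min-heap of (count, query), size <= k
        for query, count in freq.items():
            if len(heap) < k:
                heapq.heappush(heap, (count, query))
            elif count > heap[0][0]:       # beats the smallest of the current top k
                heapq.heapreplace(heap, (count, query))
        return sorted(heap, reverse=True)  # (count, query) pairs, most popular first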

3. There is a 1 GB file in which every line is a word; no word exceeds 16 bytes, and the memory limit is 1 MB. Return the 100 most frequent words.

Solution: read the file sequentially; for each word x compute hash(x) % 5000 and append x to the corresponding one of 5000 small files (denoted x0, x1, ..., x4999). Each file is then about 200 KB on average.
If some files exceed 1 MB, keep splitting them by the same method until every small file is under 1 MB. For each small file, count the words it contains and their frequencies (a trie or hash_map can be used), take out the 100 most frequent words (a min-heap of 100 nodes can be used), and write those 100 words and their frequencies to a new file; this yields another 5000 files. The final step is to merge these 5000 files, similar to merge sort.

4. There are 10 files of 1 GB each; every line of every file stores a user query, and queries may repeat across files. Sort the queries by their frequency.

This is a typical frequency-counting / Top K setting. Solutions:
Solution 1: read the 10 files in sequence and write each query to one of another 10 files according to hash(query) % 10. Each newly generated file is then about 1 GB (assuming the hash function is roughly uniform). On a machine with about 2 GB of memory, use a hash_map(query, query_count) to count how many times each query appears in one file, sort by count using quick/heap/merge sort, and write the sorted (query, query_count) pairs to a file. This yields 10 sorted files, which are finally merged together (inner sorting combined with external sorting), as sketched below.
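
A minimal sketch of the final merge step of solution 1 (the "count<TAB>query" line format and the file names are illustrative assumptions; each input file is assumed to be already sorted by count in descending order):

    # k-way merge of the 10 pre-sorted result files; heapq.merge streams its
    # inputs, so memory use stays small.
    import heapq

    def merge_sorted_results(paths, out_path):
        files = [open(p) for p in paths]
        def count_of(line):
            return int(line.split("\t", 1)[0])
        with open(out_path, "w") as out:
            for line in heapq.merge(*files, key=count_of, reverse=True):
                out.write(line)
        for f in files:
            f.close()

    # merge_sorted_results(["part_%d.txt" % i for i in range(10)], "sorted_queries.txt")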

Solution 2: generally the total number of distinct queries is limited and the repetition is high, so it may be possible to hold all distinct queries in memory at once. In that case we can use a trie or hash_map to count the occurrences of each query directly, and then sort by count with quick/heap/merge sort.

Solution 3: similar to solution 1, but after the data has been hashed into multiple files, those files can be handed to multiple machines for processing with a distributed framework (such as MapReduce), and the results merged at the end.

5. Given two files a and b, each storing 5 billion URLs of 64 bytes each, with a memory limit of 4 GB, find the URLs common to files a and b.

Solution 1: each file is roughly 5 billion × 64 bytes = 320 GB, far larger than the 4 GB memory limit, so it cannot be loaded into memory in one go; consider a divide-and-conquer approach.
Traverse file a; for each URL compute hash(url) % 1000 and append the URL to the corresponding one of 1000 small files (denoted a0, a1, ..., a999). Each small file is then about 300 MB.
Traverse file b and split its URLs into 1000 small files (denoted b0, b1, ..., b999) with the same hash. After this, any URL present in both files must land in a corresponding pair of small files (a0 vs b0, a1 vs b1, ..., a999 vs b999); non-corresponding pairs cannot share a URL. So we only need to find the common URLs within each of the 1000 pairs.
To find the common URLs in one pair, load the URLs of one small file into a hash_set, then traverse the URLs of the other file and check each against that hash_set; every hit is a common URL and can be written to the output file.
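
A minimal sketch of solution 1's pairwise hash_set step for one pair of small files (the function and file-path names are illustrative):

    # Find URLs common to one pair of small files a_i / b_i by loading one of
    # them into a set (the hash_set mentioned above) and streaming the other.
    def common_urls(path_a, path_b, out_path):
        with open(path_a) as fa:
            seen = {line.strip() for line in fa}       # ~300 MB of URLs, fits in memory
        with open(path_b) as fb, open(out_path, "a") as out:
            for line in fb:
                url = line.strip()
                if url in seen:                        # present in both files
                    out.write(url + "\n")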

Solution 2: if a certain error rate is acceptable, a Bloom filter can be used. 4 GB of memory can represent roughly 34 billion bits. Map the URLs of one file onto those 34 billion bits with the Bloom filter, then read the URLs of the other file one by one and test each against the filter; any URL that tests positive should be a common URL (note that there will be some false positives).

6. Find the integers that appear only once among 250 million integers. Note that memory cannot hold all 250 million integers.

Solution 1: use a 2-bit bitmap (2 bits per possible value: 00 means not present, 01 means appears once, 10 means appears multiple times, 11 is unused). The total memory needed is 2^32 × 2 bits = 1 GB, which is acceptable. Scan the 250 million integers and update the corresponding 2-bit entries: 00 becomes 01, 01 becomes 10, and 10 stays unchanged. After the scan, look through the bitmap and output every integer whose entry is 01.
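
A minimal sketch of such a 2-bit bitmap (the class and method names are mine; instantiating it over the full 2^32 range really does allocate about 1 GB, so a smaller universe can be passed when experimenting):

    # 2 bits per value: 00 = never seen, 01 = seen once, 10 = seen more than once.
    class TwoBitMap:
        def __init__(self, universe=2**32):
            self.bits = bytearray(universe // 4)      # 4 values per byte -> 1 GB for 2^32

        def _get(self, x):
            byte, slot = divmod(x, 4)
            return (self.bits[byte] >> (slot * 2)) & 0b11

        def _set(self, x, v):
            byte, slot = divmod(x, 4)
            self.bits[byte] &= ~(0b11 << (slot * 2)) & 0xFF   # clear the 2 bits
            self.bits[byte] |= v << (slot * 2)

        def add(self, x):
            state = self._get(x)
            if state < 2:                             # 00 -> 01, 01 -> 10, 10 stays 10
                self._set(x, state + 1)

        def unique_values(self):
            # Yield every value whose entry is exactly 01 (appeared once).
            # A full scan of the universe: slow, but fine for a sketch.
            for x in range(len(self.bits) * 4):
                if self._get(x) == 1:
                    yield x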

Solution 2: alternatively, split the data into small files as in the earlier questions, find the non-repeated integers within each small file and sort them, then merge the results, taking care to remove duplicate elements.

7. You are given 4 billion unsorted, non-repeating unsigned int values, and then one more number. How do you quickly determine whether that number is among the 4 billion?

Similar to question 6 above. A first reaction is quicksort plus binary search; here are some better methods:

Solution 1: allocate 512 MB of memory and let one bit represent one unsigned int value. Read in the 4 billion numbers and set the corresponding bits, then read the number to be queried and check its bit: 1 means it exists, 0 means it does not.

Solution 2: this problem is described well in "Programming Pearls". The idea: since 2^32 is more than 4 billion, a given number may or may not be present. Represent each of the 4 billion numbers in 32-bit binary, and assume they initially sit in one file.
Split the 4 billion numbers into two groups, those whose highest bit is 0 and those whose highest bit is 1, writing each group to its own file; one file then holds at most 2 billion numbers and the other at least 2 billion (roughly a halving). Compare the highest bit of the number you are looking for, descend into the matching file, and split that file again by the second-highest bit.
Again one file holds at most 1 billion numbers and the other at least 1 billion; compare the next bit of the target and descend into the matching file. Continuing in this way, the answer is found in O(log n) time, which completes solution 2.
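
A minimal in-memory sketch of this bit-partition idea (on disk each step would write a smaller file; here the "files" are plain Python lists, and the function name is mine):

    # Decide whether `target` is among `numbers` by repeatedly keeping only the
    # numbers that agree with `target` on the current highest bit.
    def contains(numbers, target, bits=32):
        candidates = numbers
        for bit in range(bits - 1, -1, -1):
            wanted = (target >> bit) & 1
            candidates = [x for x in candidates if (x >> bit) & 1 == wanted]
            if not candidates:
                return False                  # no number shares this bit prefix
        return True                           # all bits matched, so target is present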

Appendix: a brief introduction to the bitmap method. The bitmap method can be used to check whether an integer array contains duplicates. Detecting duplicates in a collection is a common programming task; when the collection is large we usually want to avoid multiple scans, so the naive double loop is not advisable.
The bitmap method suits this situation. The approach: based on the largest element max in the collection, create a new array of length max + 1, then scan the original array and, for each value encountered, set the corresponding position in the new array to 1 (encountering 5 sets position 5, i.e. the sixth element). The next time you encounter 5 and go to set its bit, you find that position is already 1, which proves the data contains a duplicate. This technique of marking positions in a fresh zero-initialized array is called the bitmap method. Its worst-case number of operations is 2N. If the maximum value of the array is known in advance, the new array's length can be fixed up front, which roughly doubles the efficiency.

8. How do you find the item that repeats the most among massive data?

Solution 1: first hash the data and, by taking the hash modulo some number, map it into small files; find the most repeated item in each small file and record its count, then among those per-file winners pick the one with the most repetitions (see the previous questions for details).

9. Tens of millions or hundreds of millions of data items (with repetitions): count the N items that appear most frequently.

Solution 1: tens of millions or hundreds of millions of items can fit in the memory of a current machine, so consider using a hash_map / binary search tree / red-black tree to count occurrences, and then take the N items with the most occurrences using the heap technique mentioned in question 2.

10. A text file has about 10,000 lines, one word per line. Count the 10 most frequent words; give your approach and a time-complexity analysis.

This question is about time efficiency. Use a trie to count the occurrences of each word; the time complexity is O(n * le), where le is the average word length. Then find the 10 most frequent words, which can be done with a heap as in the previous question in O(n * lg 10). The total time complexity is therefore the larger of O(n * le) and O(n * lg 10).
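
A minimal trie-with-counts sketch (the node layout and helper names are mine):

    # Each terminal node stores the occurrence count of its word; counting all
    # words is O(n * le), where le is the average word length.
    class TrieNode:
        __slots__ = ("children", "count")
        def __init__(self):
            self.children = {}
            self.count = 0

    def count_words(words):
        root = TrieNode()
        for word in words:
            node = root
            for ch in word:
                node = node.children.setdefault(ch, TrieNode())
            node.count += 1                   # terminal node keeps the frequency
        return root

    def all_counts(node, prefix=""):
        if node.count:
            yield prefix, node.count
        for ch, child in node.children.items():
            yield from all_counts(child, prefix + ch)

    # Top 10 by frequency, combining the trie with a heap:
    # import heapq
    # heapq.nlargest(10, all_counts(count_words(words)), key=lambda p: p[1])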

Appendix: find the largest 100 numbers among 1,000,000 (100w) numbers.

Solution 1: as mentioned in the previous question, use a min-heap of 100 elements. The complexity is O(100w * lg 100).

Solution 2: use the quicksort partitioning idea: after each partition, only consider the part larger than the pivot; once the part larger than the pivot has come down to about 100 elements, sort it with a conventional sorting algorithm and take the first 100. The complexity is O(100w * 100). (See the partition sketch after option 3 below.)

Option 3: use partial elimination. Take the first 100 elements and sort them; call this sequence L. Then scan the remaining elements one by one, comparing each element x with the smallest of the 100 sorted elements; if x is larger, delete that smallest element and insert x into L using the insertion-sort idea. Repeat until all elements have been scanned. The complexity is O(100w * 100).
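
A minimal sketch of solution 2's quickselect-style partitioning (random pivots and the function name are my own choices):

    # Each round keeps only the side that can still contain the largest n; once
    # the part above the pivot is small enough, an ordinary sort finishes the job.
    import random

    def top_n(nums, n=100):
        nums = list(nums)
        if len(nums) <= n:
            return sorted(nums, reverse=True)
        pivot = nums[random.randrange(len(nums))]
        greater = [x for x in nums if x > pivot]
        if len(greater) >= n:
            return top_n(greater, n)                       # answer lies entirely above the pivot
        equal = [x for x in nums if x == pivot]
        if len(greater) + len(equal) >= n:
            return sorted(greater + equal, reverse=True)[:n]
        less = [x for x in nums if x < pivot]
        taken = len(greater) + len(equal)
        return sorted(greater + equal, reverse=True) + top_n(less, n - taken)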

==================================

1.Bloom filter

Scope of application : implementing a data dictionary, de-duplicating data, or computing set intersections

Basic principles and key points :
The principle is very simple: a bit array plus k independent hash functions. To insert a value, set to 1 the bits at the positions given by the hash functions; during lookup, if all of the corresponding bits are 1, the element is considered present. Obviously this process does not guarantee a 100% correct result, and it does not support deleting an inserted keyword, because clearing that keyword's positions may affect other keywords. A simple improvement is the counting Bloom filter, which replaces the bit array with an array of counters and thus supports deletion.
Another important question is how to determine the size m of the bit array and the number of hash functions k from the number of input elements n. The error rate is minimized when the number of hash functions k = (ln 2) * (m / n). For the error rate to be no greater than E, m must be at least n * lg(1/E) to represent any set of n elements; but m should be larger than that, because at least half of the bit array should remain 0, which gives m >= n * lg(1/E) * lg(e), roughly 1.44 times n * lg(1/E) (lg denotes the base-2 logarithm).
For example, if we assume that the error rate is 0.01, then m should be approximately 13 times n. Then k is about 8.
Note that m and n have different units: m is measured in bits while n is the number of (distinct) elements. Since a single element is usually many bits long, using a Bloom filter usually saves memory.
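
A minimal sketch that applies these sizing formulas and builds a toy Bloom filter (the double-hashing scheme based on md5 and all names here are illustrative assumptions, not a prescribed implementation):

    # Size the filter from n (elements) and E (target error rate) using
    # m ~= 1.44 * n * lg(1/E) bits and k ~= (ln 2) * m / n hash functions.
    import math
    import hashlib

    def bloom_parameters(n, error_rate):
        m = int(math.ceil(1.44 * n * math.log2(1.0 / error_rate)))   # bits
        k = max(1, int(round(math.log(2) * m / n)))                  # hash functions
        return m, k

    class BloomFilter:
        def __init__(self, n, error_rate):
            self.m, self.k = bloom_parameters(n, error_rate)
            self.bits = bytearray((self.m + 7) // 8)

        def _positions(self, item):
            digest = hashlib.md5(item.encode()).digest()
            h1 = int.from_bytes(digest[:8], "big")
            h2 = int.from_bytes(digest[8:], "big") or 1
            for i in range(self.k):                  # k hashes via double hashing
                yield (h1 + i * h2) % self.m

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def __contains__(self, item):
            # May give false positives, never false negatives.
            return all(self.bits[pos // 8] >> (pos % 8) & 1 for pos in self._positions(item))

    # Usage: bf = BloomFilter(1_000_000, 0.01); bf.add("http://example.com")
    # "http://example.com" in bf   -> True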

Extension :
A Bloom filter maps the elements of a set into a bit array and uses whether all k mapped bits are 1 (k being the number of hash functions) to indicate whether an element is in the set. A counting Bloom filter (CBF) expands each bit of the bit array into a counter, thereby supporting deletion of elements. A spectral Bloom filter (SBF) associates the counters with the occurrence counts of set elements; SBF uses the minimum value among an element's counters to approximate its frequency.

Problem example :
Given two files A and B, each storing 5 billion URLs of 64 bytes each, with a 4 GB memory limit, find the URLs common to files A and B. What if there are three or even n files?
Doing the arithmetic for this question: 4 GB = 2^32 bytes, about 4 billion bytes, i.e. about 34 billion bits. With n = 5 billion and an error rate of 0.01, roughly 65 billion bits would be needed; the available 34 billion is not far off, so the error rate simply rises somewhat. In addition, if these URLs correspond one-to-one with IPs, they can be converted to IPs, which makes things much simpler.

2.Hashing

Scope of application : a basic data structure for quick lookup and deletion; usually the total amount of data can fit in memory

Basic principles and key points :
Choice of hash function: strings, integers, permutations and so on each have their own suitable hashing methods.
Collision handling: one approach is open hashing, also called the chaining (zipper) method; the other is closed hashing, also called open addressing.

Extension :
The d in d-left hashing stands for multiple; let's first simplify the problem and look at 2-left hashing. 2-left hashing divides a hash table into two halves of equal length, called T1 and T2, each equipped with its own hash function, h1 and h2. When storing a new key, both hash functions are evaluated at the same time, giving two candidate addresses h1[key] and h2[key]. Check the bucket at h1[key] in T1 and the bucket at h2[key] in T2, see which position already holds more (collided) keys, and store the new key in the less loaded position. If both sides are equal, for example both positions are empty or both hold one key, the new key is stored in the left sub-table T1; this is where the "left" in 2-left comes from. When looking up a key, you must hash twice and check both positions.
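
A minimal 2-left hashing sketch (the bucket count, the tuple-based hash seeds, and all names are illustrative):

    # Two sub-tables T1/T2 with their own hash functions; a new key goes to the
    # less loaded of its two candidate buckets, ties going to the left table T1.
    class TwoLeftHash:
        def __init__(self, buckets_per_side=1024):
            self.size = buckets_per_side
            self.t1 = [[] for _ in range(buckets_per_side)]
            self.t2 = [[] for _ in range(buckets_per_side)]

        def _buckets(self, key):
            h1 = hash(("left", key)) % self.size
            h2 = hash(("right", key)) % self.size
            return self.t1[h1], self.t2[h2]

        def insert(self, key, value):
            b1, b2 = self._buckets(key)
            target = b1 if len(b1) <= len(b2) else b2   # less loaded wins, T1 on ties
            target.append((key, value))

        def lookup(self, key):
            b1, b2 = self._buckets(key)
            for k, v in b1 + b2:                        # both positions must be checked
                if k == key:
                    return v
            return None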

Examples of problems :
1) Massive log data, extract the IP that visits Baidu the most on a certain day.
The number of IPs is still limited, up to 2^32, so you can consider using hash to store IPs directly into memory, and then perform statistics.

3.bit-map

Scope of application : quickly looking up, de-duplicating, and deleting data; generally the data range is within about 10 times the range of an int

Basic principles and key points :
Use a bit array to mark whether each element is present, for example for 8-digit phone numbers

Extension :
A Bloom filter can be seen as an extension of the bit-map.

Examples of problems :
1) A certain file is known to contain some telephone numbers, each 8 digits long; count the number of distinct numbers.
8 digits means at most 99,999,999 values, so about 99 million bits, i.e. roughly 12 MB of memory, is enough.
2) Find the number of non-repeated integers among 250 million integers, and the memory space is not enough to accommodate these 250 million integers.
Extend the bit-map by using 2 bits per number: 00 means not seen, 01 means seen once, and 10 means seen two or more times. Alternatively, instead of 2-bit entries, two separate bit-maps can be used to simulate such a 2-bit-map.

4. Heap

Scope of application : finding the top n of massive data where n is relatively small, so that the heap fits in memory

Basic principles and key points :
Use a max-heap to find the smallest n elements and a min-heap to find the largest n. For example, to find the smallest n, compare each current element with the largest element of the max-heap; if it is smaller, replace that largest element. The n elements left in the heap at the end are the smallest n. This suits a large data volume with a relatively small n: a single scan yields all of the top n elements, which is very efficient.

Expansion :
A double heap, a max-heap combined with a min-heap, can be used to maintain the median.
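
A minimal running-median sketch using this double-heap idea (Python's heapq is a min-heap, so the lower half stores negated values; the class name is mine):

    import heapq

    class RunningMedian:
        def __init__(self):
            self.lower = []   # max-heap of the smaller half (stored negated)
            self.upper = []   # min-heap of the larger half

        def add(self, x):
            if not self.lower or x <= -self.lower[0]:
                heapq.heappush(self.lower, -x)
            else:
                heapq.heappush(self.upper, x)
            # Rebalance so the two halves differ in size by at most one.
            if len(self.lower) > len(self.upper) + 1:
                heapq.heappush(self.upper, -heapq.heappop(self.lower))
            elif len(self.upper) > len(self.lower):
                heapq.heappush(self.lower, -heapq.heappop(self.upper))

        def median(self):
            if len(self.lower) > len(self.upper):
                return -self.lower[0]
            return (-self.lower[0] + self.upper[0]) / 2.0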

Problem example :
1) Find the largest 100 numbers among 1,000,000 numbers: just use a min-heap of 100 elements.

5. Double-layer bucket division ---- essentially the divide-and-conquer idea; the key is the technique of dividing!

Scope of application : the k-th largest, median, non-repeated or repeated numbers

Basic principles and key points :
Because the range of the elements is too large for a direct addressing table, the range is narrowed step by step through multiple rounds of division until it finally reaches an acceptable size. The reduction can be applied several times; the double layer is just one example.

Examples of problems :
1) Find the number of non-repeated integers among 250 million integers, and the memory space is not enough to accommodate these 250 million integers.
This is a bit like the pigeonhole principle. There are 2^32 possible integers, so we can divide these 2^32 values into 2^8 regions (for example, one file per region), distribute the data into the different regions, and then solve each region directly with a bitmap. In other words, as long as there is enough disk space, the problem is easily solved.
2) Find the median of 500 million ints.
This example is even more direct than the one above. First divide the int range into 2^16 regions and, in one pass, count how many values fall into each region. From those counts we can determine which region contains the median and what rank the median holds within that region. A second scan then only counts the numbers that fall into that region.
In fact, even for int64 rather than int, three such rounds of division bring the range down to an acceptable size: first divide the int64 range into 2^24 regions and determine which region holds the target rank, then divide that region into 2^20 sub-regions and determine the right sub-region; the numbers inside a sub-region number only about 2^20, so a direct addressing table can be used for the statistics.
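
A minimal sketch of the two-pass region-counting idea from example 2 (it assumes 32-bit unsigned values, and that read_stream is a callable returning a fresh iterator over the data each time it is called, since the data is scanned twice):

    def median_by_regions(read_stream):
        # Pass 1: histogram over the high 16 bits.
        counts = [0] * (1 << 16)
        total = 0
        for x in read_stream():
            counts[x >> 16] += 1
            total += 1
        target = total // 2                  # 0-based rank of the median
        running = 0
        for region, c in enumerate(counts):
            if running + c > target:
                break                        # the median falls inside this region
            running += c
        # Pass 2: exact counts of the low 16 bits, but only inside that region.
        inner = [0] * (1 << 16)
        for x in read_stream():
            if x >> 16 == region:
                inner[x & 0xFFFF] += 1
        rank = target - running              # rank of the median within the region
        for low, c in enumerate(inner):
            if rank < c:
                return (region << 16) | low
            rank -= c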

6. Database Index

Scope of application : inserting, deleting, updating, and querying large amounts of data

Basic principles and key points : use database design and implementation techniques to handle the insertion, deletion, update, and querying of massive amounts of data.

7. Inverted index

Scope of application : search engine, keyword query

Basic principles and key points :
Why is it called an inverted index? It is an indexing method used, under full-text search, to store the mapping from a word to its locations in a document or a group of documents.
Taking English as an example, here is the text to be indexed:
T0 = "it is what it is", T1 = "what is it", T2 = "it is a banana"
We get the following inverted index:
"a": {2}, "banana": {2}, "is": {0, 1, 2}, "it": {0, 1, 2}, "what": {0, 1}
A query for "what", "is" and "it" then corresponds to the intersection of the sets.
A forward index, by contrast, was developed to store the list of words of each document; its queries suit workloads that frequently walk every document in order and verify every word within a checked document. In the forward index the document occupies the central position: each document points to the sequence of index terms it contains, i.e. the document points to the words it contains, whereas in the inverted index the words point to the documents that contain them. The inverse relationship is easy to see.
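
A minimal sketch that builds the inverted index for the three example texts above and answers the query "what is it" by set intersection:

    from collections import defaultdict

    docs = {0: "it is what it is", 1: "what is it", 2: "it is a banana"}

    index = defaultdict(set)                 # word -> set of document ids
    for doc_id, text in docs.items():
        for word in text.split():
            index[word].add(doc_id)

    # index["a"] == {2}, index["banana"] == {2}, index["is"] == {0, 1, 2}, ...
    result = index["what"] & index["is"] & index["it"]   # documents containing all three words
    print(sorted(result))                    # [0, 1]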

Problem example : document retrieval systems, querying which documents contain a certain word, such as keyword search over common academic papers.

8. External sorting

Scope of application : sorting and de-duplication of big data

Basic principles and key points : the merge phase of external sorting; replacement selection and the loser tree; the optimal merge tree

Examples of problems :
1). There is a file with a size of 1G, each line in it is a word, the size of the word does not exceed 16 bytes, and the memory limit is 1M. Return the 100 most frequent words.
This data has an obvious characteristic: words are up to 16 bytes, but with only 1 MB of memory a hash table is too small to hold the counts, so external sorting can be used instead, with the memory serving as an input buffer.
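
A minimal external-sort sketch under a small memory budget (the chunk size, temp-file handling, and function name are illustrative; a real implementation would also tune buffer sizes and the merge fan-in):

    # Read the input in chunks that fit in memory, sort each chunk into a
    # temporary "run" file, then k-way merge the runs with heapq.merge.
    import heapq
    import tempfile

    def external_sort(in_path, out_path, max_lines_in_memory=50_000):
        runs = []
        with open(in_path) as f:
            while True:
                chunk = [line for _, line in zip(range(max_lines_in_memory), f)]
                if not chunk:
                    break
                chunk.sort()
                run = tempfile.TemporaryFile("w+")   # one sorted run on disk
                run.writelines(chunk)
                run.seek(0)
                runs.append(run)
        with open(out_path, "w") as out:
            out.writelines(heapq.merge(*runs))       # streaming merge of all runs
        for run in runs:
            run.close()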

9.trie tree

Scope of application : large amounts of data with many repetitions, but where the number of distinct items is small enough to fit in memory

Basic principles and key points : Implementation method, representation of node children

Expansion : compressed implementations are possible.

Examples of problems :
1). There are 10 files, each file is 1G, each line of each file stores the user's query, and the query of each file may be repeated. You are required to sort by the frequency of the query.
2). There are 10 million strings, some of which are duplicates; you need to remove all the duplicates and keep only the strings that are not duplicated. How would you design and implement it?
3). Search for popular queries: The repetition of the query string is relatively high. Although the total number is 10 million, if the repetition is removed, it will not exceed 3 million, and each will not exceed 255 bytes.

10. Distributed processing: MapReduce

Scope of application : large amounts of data, but the number of distinct items is small enough to fit in memory

Basic principles and key points : hand over data to different machines for processing, data division, and result reduction.

Problem example :
1). The canonical example application of MapReduce is counting the appearances of each different word in a set of documents.
2). Massive data is distributed across 100 computers; find a way to efficiently compute the TOP 10 of this batch of data.
3). There are N machines in total, each holding N numbers. Each machine can store at most O(N) numbers and operate on them. How do you find the median of the N^2 numbers?

Analysis of classic problems
Tens of millions or billions of data items (with repetitions): count the top N items by number of occurrences. There are two cases: the data can be read into memory at once, or it cannot.
Available ideas: trie tree + heap, database index, partitioned subset statistics, hashing, distributed computation, approximate statistics, external sorting.
Whether the data "can be read into memory at once" should actually refer to the volume of data after removing duplicates. If the deduplicated data fits in memory, we can build a dictionary for it, for example with a map, hashmap, or trie, and then do the counting directly. Of course, while updating each item's count we could also use a heap to maintain the top N items seen so far, but that increases the maintenance cost; it is usually more efficient to compute the full statistics first and then find the top N.
If the data cannot fit in memory, on the one hand we can consider whether the dictionary method above can be adapted to this situation; the change would be to store the dictionary on disk rather than in memory, along the lines of how a database stores it.
Of course, a better way is to use distributed computing, essentially a map-reduce process. First, partition the data into different ranges by value, or by the value of a hash such as md5, and send each range to a different machine; ideally each partition can be read into memory at once, so that each machine is responsible for one range of values. This is effectively the map step. After obtaining the results, each machine only needs to report its own top N items by occurrence, and then the overall top N is selected from all of those; this is effectively the reduce step.
You might be tempted to simply divide the data evenly across machines and process each part; that does not yield the correct answer, because one item may be spread evenly over different machines while another may be entirely concentrated on a single machine, and there may also be items with equal counts. For example, suppose we want the top 100 by occurrence and we spread 10 million items over 10 machines, taking each machine's local top 100. After merging, we cannot guarantee that we have found the real 100th: the true 100th item might occur 10,000 times overall but be split across the 10 machines so that each sees only 1,000 of it, while other items that rank ahead of it locally, say with 1,001 occurrences each, sit entirely on single machines; the item with 10,000 occurrences would then be eliminated. Even if we let every machine report its local top 1,000 and merge those, errors remain, because there may be many items with around 1,001 occurrences clustered on single machines. Therefore the data must not be divided across machines arbitrarily; instead, it must be mapped to machines by hash value so that each machine processes one range of values.
The external-sorting approach consumes a lot of I/O and is not very efficient. The distributed method above can also be used on a single machine: divide the full data set into several sub-files by value range, process them one by one, and finally merge the words and their frequencies, which is exactly a merge step of external sorting.
In addition, approximate computation can be considered: drawing on the properties of natural language, we can use only the words that occur most in practice as the dictionary, so that it can fit in memory.

Origin blog.csdn.net/qq_32727095/article/details/113742570