Mass Data Processing (3) - Summary of Basic Methods of Mass Data Processing

Source: https://blog.csdn.net/lili0710432/article/details/48142791

A collection of methods for processing massive data

 

So-called massive data processing refers to problems where the amount of data is so large that it cannot be loaded into memory at once and cannot be solved quickly within a reasonable amount of time. This article summarizes, building on earlier work, the methods for solving such problems. So what are the solutions?
To reduce time complexity, we can use clever algorithms built on appropriate data structures, such as Bloom filters, hash tables, bit-maps, heaps, databases, inverted indexes, or trie trees. To reduce space complexity, we can use divide and conquer with hash mapping.

The basic methods of massive data processing can be summarized as follows:

  1. Divide and conquer / hash mapping + hash counting + heap/quick/merge sort;
  2. Double-layer bucket division;
  3. Bloom filter/Bitmap;
  4. Trie tree/database/inverted index;
  5. External sorting;
  6. Hadoop/MapReduce for distributed processing.

Prerequisite basic knowledge: 
1 byte = 8 bits. 
An int is generally 4 bytes, i.e. 32 bits. 
2^32 = 4G ≈ 4.29 billion. 
1G = 2^30 ≈ 1.07 billion.


1 Divide and conquer + hash mapping + quick/merge/heap sort

Question 1

Given two files a and b, each storing 5 billion URLs where each URL occupies 64 bytes, and a memory limit of 4 GB, find the URLs common to files a and b.

Analysis: 5 billion * 64 bytes = 320 GB per file, far beyond the 4 GB memory limit. 
Algorithm idea 1: hash decomposition + divide and conquer + merge

  1. Traverse file a, compute hash(url) % 1024 for each URL, and store the URL into one of 1024 small files (a0~a1023) according to that value; do the same for file b to obtain b0~b1023. Each small file is then about 300 MB. If the hash results are very skewed and some file ai is too large, ai can be hashed a second time into ai0~ai1023.
  2. In this way, the URLs of both files are hashed into 1024 pairs of small files, and the pairs can then be compared separately: a0 vs b0, ..., a1023 vs b1023. To find the URLs common to each pair of small files, store the URLs of one small file in a hash_map, then traverse each URL of the other small file and check whether it is in the hash_map just built; if it is, it is a common URL and can be written to an output file.
  3. Finally, merge the common URLs found in the 1024 pairs of files (a sketch of the whole procedure follows this list).
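
The following is a minimal Python sketch of this procedure, under the assumption that the two inputs are text files with one URL per line; the partition file names (a_part_i, b_part_i) are made up for illustration, and a Python set plays the role of the hash_map:

```python
BUCKETS = 1024  # in practice, mind the OS limit on simultaneously open files

def partition(path, prefix):
    """Hash-partition a file of URLs (one per line) into BUCKETS small files."""
    outs = [open(f"{prefix}{i}", "w") for i in range(BUCKETS)]
    with open(path) as f:
        for line in f:
            url = line.strip()
            # Within one process Python's string hash is consistent, so equal
            # URLs from both files land in the same bucket index.
            outs[hash(url) % BUCKETS].write(url + "\n")
    for out in outs:
        out.close()

def common_urls(a_path, b_path):
    """Compare matching partitions: load one side into a set, stream the other."""
    partition(a_path, "a_part_")
    partition(b_path, "b_part_")
    common = []
    for i in range(BUCKETS):
        with open(f"a_part_{i}") as fa:
            seen = {line.strip() for line in fa}   # one ~300 MB partition fits in memory
        with open(f"b_part_{i}") as fb:
            common.extend(url for url in (line.strip() for line in fb) if url in seen)
    return common
```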

Question 2

There are 10 files, each 1 GB in size. Every line of every file stores a user query, and the queries in each file may be repeated. Sort the queries by their frequency. 
Solution 1: hash decomposition + divide and conquer + merge

  1. Read the 10 files a0~a9 sequentially, and write each query to one of 10 new files (denoted b0~b9) according to hash(query) % 10. Each newly generated file is then also about 1 GB in size (assuming the hash function is reasonably uniform).
  2. Find a machine with about 2 GB of memory, and use a hash_map(query, query_count) to count the occurrences of each query in each file in turn. Sort the queries by occurrence count using quick/heap/merge sort, and output the sorted queries with their query_count to a file. This yields 10 sorted files c0~c9.
  3. Merge-sort the 10 files c0~c9 (a combination of internal and external sorting). Each pass, take m records from each of c0~c9 into memory, merge these 10m records, and append the merged output to a result file d. Whenever the m records taken from some ci are exhausted, load the next m records from ci into memory, and continue until all records of all ci files have been merged (a sketch of this k-way merge appears after Solution 2).
Solution 2: Trie tree. If the total number of distinct queries is limited but the number of repetitions is relatively large, it may be possible to fit all distinct queries in memory at once. Under that assumption, a trie tree or hash_map can be used to count the occurrences of each query directly, and the counts can then be sorted with quick/heap/merge sort.
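
As a sketch of the k-way merge in step 3 of Solution 1, assume the ten intermediate files c0~c9 contain lines of the form 'count<TAB>query', already sorted by count in descending order (this line format is an assumption made here for illustration). Python's heapq.merge then streams the merge without loading everything into memory:

```python
import heapq

def merge_sorted_count_files(paths, out_path):
    """K-way merge of files whose lines look like 'count<TAB>query' and are
    already sorted by count in descending order within each file."""
    files = [open(p) for p in paths]
    try:
        merged = heapq.merge(*files,
                             key=lambda line: int(line.split("\t", 1)[0]),
                             reverse=True)
        with open(out_path, "w") as out:
            for line in merged:
                out.write(line)
    finally:
        for f in files:
            f.close()
```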

Question 3

There is a 1 GB file in which each line is a word; no word exceeds 16 bytes, and the memory limit is 1 MB. Return the 100 most frequent words. 
Similar question: how to find the item with the most repetitions in massive data?

Solution idea : hash decomposition + divide and conquer + merge

  1. Read the file sequentially; for each word x, compute hash(x) % 4096 and store x into one of 4096 small files accordingly. Each file is then about 250 KB. If some files exceed 1 MB, continue dividing them by hashing until no small file obtained by decomposition exceeds 1 MB.
  2. For each small file, count the words it contains and their frequencies (a trie tree or hash_map can be used), take out the 100 most frequent words (a min-heap with 100 nodes can be used), and save these 100 words and their frequencies to a file. This yields another 4096 files.
  3. The next step is to merge these 4096 files, similar to merge sort (a sketch of the per-file counting step follows this list).
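
A minimal sketch of the counting step above, assuming each small file fits in memory and holds one word per line; heapq.nlargest plays the role of the 100-node min-heap:

```python
import heapq
from collections import Counter

def top_100_words(small_file_path):
    """Count word frequencies in one small file and return the 100 most
    frequent (word, count) pairs."""
    with open(small_file_path) as f:
        counts = Counter(word.strip() for word in f)
    # heapq.nlargest maintains a bounded min-heap of 100 entries internally,
    # matching the "min-heap with 100 nodes" described above.
    return heapq.nlargest(100, counts.items(), key=lambda item: item[1])
```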

Question 4


Massive log data: extract the IP with the most visits to Baidu on a certain day.

Solution idea: hash decomposition + divide and conquer + merge

  1. Take out the IPs from the log of visits to Baidu on that day and write them one by one into a large file. Note that an IP is 32 bits, so there are at most 2^32 distinct IPs. Using the same hash mapping method, for example taking each IP modulo 1024, the large file can be mapped into 1024 small files.
  2. Then find the most frequent IP in each small file (hash_map can be used for frequency counting, and then the most frequent entry picked out) together with its frequency.
  3. Finally, among the 1024 per-file winners, find the IP with the largest frequency; that is the answer (a sketch follows this list).
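
A minimal sketch of steps 2 and 3, assuming the 1024 small files already exist and hold one IP per line (bucket_paths is a made-up name for the list of their paths):

```python
from collections import Counter

def most_frequent_ip(bucket_paths):
    """Per small file: count IPs and keep the local winner; then take the global winner."""
    best_ip, best_count = None, 0
    for path in bucket_paths:
        with open(path) as f:
            counts = Counter(line.strip() for line in f)
        if counts:
            ip, count = counts.most_common(1)[0]
            if count > best_count:
                best_ip, best_count = ip, count
    return best_ip, best_count
```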

Question 5

Massive data is distributed across 100 computers; find a way to efficiently compute the TOP 10 of this batch of data.

Solution idea: divide and conquer + merge. 
Note that TOP 10 here means the 10 largest or 10 smallest values. If the TOP 10 by frequency is wanted, the data should be hash-partitioned and counted first.

  1. Find the TOP 10 on each computer using a heap of 10 elements (to find the 10 smallest, use a max-heap; to find the 10 largest, use a min-heap). For example, to find the 10 largest, first take the first 10 elements and build a min-heap from them. Then scan the remaining data, comparing each element with the top of the heap; if it is larger than the heap top, replace the top with this element and re-heapify. In the end, the 10 elements in the heap are the TOP 10 largest.
  2. After finding the TOP 10 on each computer, combine the TOP 10 lists from the 100 computers, a total of 1,000 values, and apply the same method to find the overall TOP 10 (a sketch follows this list).
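
A minimal sketch of both steps, assuming each machine's data can be iterated as a sequence of numbers; the function names are made up for illustration:

```python
import heapq

def local_top10(numbers):
    """On one machine: keep a min-heap of at most 10 elements while scanning."""
    heap = []
    for x in numbers:
        if len(heap) < 10:
            heapq.heappush(heap, x)
        elif x > heap[0]:              # larger than the smallest of the current top 10
            heapq.heapreplace(heap, x)
    return heap

def global_top10(per_machine_top10s):
    """Combine the 100 local TOP 10 lists (about 1,000 candidates) and reduce again."""
    candidates = [x for top10 in per_machine_top10s for x in top10]
    return heapq.nlargest(10, candidates)
```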

Question 6

Find the integers that appear only once among 250 million integers; the memory is not enough to hold all 250 million integers.

Solution 1: hash decomposition + divide and conquer + merge 
Hash the 250 million int values into 1024 small files a0~a1023; if any small file is still larger than memory, apply another level of hashing to it. Read each small file into memory, find the values that appear only once, and output them to b0~b1023. Finally, merge the results.

Solution 2: 2-Bitmap 
If about 1 GB of memory is available, use a 2-Bitmap: allocate 2 bits for each number, where 00 means not present, 01 means seen once, 10 means seen multiple times, and 11 is unused. The total memory required is 2^32 * 2 bits = 1 GB. Scan the 250 million integers and update the corresponding bits in the bitmap: 00 becomes 01, 01 becomes 10, and 10 stays unchanged. After the scan, walk the bitmap and output the integers whose bits are 01. 
Note that if you are looking for duplicates instead, a 1-bit bitmap is enough: the first occurrence flips the bit from 0 to 1, and if the bit is already 1 on a later occurrence, the value is a duplicate and can be output.
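
A minimal sketch of the 2-Bitmap approach, using a Python bytearray with 2 bits per value. With the default value_range of 2^32 the bytearray really does occupy 1 GB, so pass a smaller range when experimenting; the final scan over the whole range is written for clarity, not speed:

```python
def integers_seen_once(numbers, value_range=2**32):
    """2-Bitmap: 2 bits per value (00 unseen, 01 seen once, 10 seen 2+ times).
    For the full 32-bit range the bytearray takes 2^32 * 2 / 8 bytes = 1 GB."""
    bitmap = bytearray(value_range // 4)          # 4 values per byte
    for x in numbers:
        byte, slot = divmod(x, 4)
        shift = slot * 2
        state = (bitmap[byte] >> shift) & 0b11
        if state == 0b00:
            bitmap[byte] |= 0b01 << shift         # 00 -> 01
        elif state == 0b01:
            bitmap[byte] ^= 0b11 << shift         # 01 -> 10 (flip both bits)
        # state 0b10: already seen at least twice, leave unchanged
    for x in range(value_range):
        byte, slot = divmod(x, 4)
        if (bitmap[byte] >> (slot * 2)) & 0b11 == 0b01:
            yield x                               # appeared exactly once
```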

Question 7

There are N machines in total, and each machine stores N numbers. Each machine can store and operate on at most O(N) numbers. How do you find the median of the N^2 numbers?

Solution 1: hash decomposition + sorting

  1. Divide the value range into N segments in ascending order and hash the numbers into those segments. Suppose the data are 32-bit unsigned ints, so the range is 2^32. In principle, the first machine stores numbers in the range 0 ~ (2^32)/N, and the i-th machine stores numbers in the range (2^32)*(i-1)/N ~ (2^32)*i/N. The hashing step scans the N numbers on each machine and sends each number to the machine responsible for its segment: numbers in the first segment go to the first machine, numbers in the second segment go to the second machine, ..., and numbers in the N-th segment go to the N-th machine. Note that, ideally, each machine should end up storing O(N) numbers.
  2. Then count the numbers stored on each machine in turn and accumulate these counts until we reach the machine k whose cumulative count is the first to be greater than or equal to (N^2)/2, while the cumulative count up to machine k-1 is less than (N^2)/2; call that smaller cumulative count x. The median is then on machine k, at position (N^2)/2 - x. Sort the numbers on machine k and take the ((N^2)/2 - x)-th one; that is the median. The overall complexity is O(N^2) (a sketch of the counting step follows this list).
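
A minimal sketch of the counting step, where counts[i] is assumed to be the number of values that ended up on machine i after the range partition (a made-up variable for illustration):

```python
def locate_median_machine(counts, total):
    """counts[i] = how many numbers machine i holds after the range partition
    (machine 0 owns the smallest value segment); total = N * N.
    Returns (k, r): the median is the r-th smallest number on machine k."""
    target = total // 2            # the (N^2)/2-th number, as in the text
    accumulated = 0                # this is the "x" from the description
    for k, c in enumerate(counts):
        if accumulated + c >= target:
            return k, target - accumulated
        accumulated += c
    raise ValueError("counts sum to less than total // 2")
```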

Solution idea 2: divide and conquer + merge 
First sort the numbers on each machine. Then, using the idea of merge sort, merge the numbers on the N machines to obtain the final ordering, and take the (N^2)/2-th number; that is the median. The complexity is O(N^2 * lg(N^2)).

2 Trie tree + red-black tree + hash_map

Here, the trie tree, red-black tree, or hash_map can be regarded as concrete tools for implementing the counting step of the divide and conquer approach described in Part 1.

Question 1

Tens of millions or hundreds of millions of data items (with repetitions); find the N items that appear most frequently.

Solution : red-black tree + heap sort

  1. If the data are tens of millions or hundreds of millions of ints, a current machine with 4 GB of memory may be able to hold them. So consider using a hash_map, binary search tree, or red-black tree to count the number of repetitions.
  2. Then take out the top N most frequent items, using a min-heap of N elements to find the N items with the highest frequency.

Question 2

There are 10 million strings, some of which are duplicates. Remove all the duplicates and keep the strings that have no duplicates. How would you design and implement this?

Solution: trie tree. 
A trie tree is well suited to this problem, and a hash_map should also work.

Question 3

A text file has about 10,000 lines, one word per line. Find the 10 most frequently occurring words; give your approach and a time complexity analysis.

Solution: trie tree + heap sort 
This question is about time efficiency. 
1. Use a trie tree to count the number of occurrences of each word; the time complexity is O(n*len), where len is the average word length. 
2. Then find the 10 most frequent words, which can be done with a heap as in the previous questions; the time complexity is O(n*lg10). 
The total time complexity is the larger of O(n*len) and O(n*lg10). A trie sketch follows.
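
A minimal sketch of this approach, using a dict-based trie with a special key for the occurrence count, and heapq.nlargest as the 10-element heap of step 2:

```python
import heapq

END = "#count"                      # special key marking the end of a word

def build_trie(words):
    """Insert each word into a dict-based trie, counting occurrences at its last node."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = node.get(END, 0) + 1
    return root

def word_counts(node, prefix=""):
    """Walk the trie and yield (word, count) pairs."""
    if END in node:
        yield prefix, node[END]
    for ch, child in node.items():
        if ch != END:
            yield from word_counts(child, prefix + ch)

def top10_words(words):
    return heapq.nlargest(10, word_counts(build_trie(words)), key=lambda wc: wc[1])
```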

Question 4

A search engine records, through its log files, all the query strings used in every search; each query string is 1-255 bytes long. Suppose there are currently 10 million records. The repetition rate of these query strings is fairly high: although the total is 10 million, there are no more than 3 million distinct query strings after deduplication. The higher the repetition count of a query string, the more users queried it and the more popular it is. Report the 10 most popular query strings, using no more than 1 GB of memory.

Solution idea: trie tree + heap sort 
Use a trie tree whose key field stores the number of times each query string appears (0 if it has not appeared). Finally, use a min-heap of 10 elements to sort the query strings by frequency of occurrence.

3 BitMap or Bloom Filter

3.1 BitMap

A BitMap is, simply put, a structure that uses a single bit set to 1 or 0 to mark whether a given state exists. It supports fast lookup, duplicate detection, and deletion, and is generally suitable when the range of values to handle is less than 8*2^32; beyond that the bitmap itself exceeds 4 GB and the memory cost becomes significant.

Question 1

A file is known to contain some telephone numbers, each an 8-digit number. Count the number of distinct numbers.

Solution idea: bitmap 
An 8-digit number is at most 99,999,999, so about 100M bits are needed, which is a little under 12 MB of memory. We map each number from 0 to 99,999,999 to one bit, so roughly 100M bits ≈ 12 MB are enough; in this way, a mere 12 MB or so of memory represents all 8-digit phone numbers (a sketch follows).
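
A minimal sketch, assuming the phone numbers are supplied as integers in the range 0 to 99,999,999:

```python
def count_distinct_phone_numbers(numbers):
    """One bit per possible 8-digit number: about 10^8 bits, roughly 12 MB."""
    bitmap = bytearray(10**8 // 8 + 1)
    distinct = 0
    for n in numbers:                    # n is an int in [0, 99999999]
        byte, bit = divmod(n, 8)
        mask = 1 << bit
        if not bitmap[byte] & mask:      # first time this number appears
            bitmap[byte] |= mask
            distinct += 1
    return distinct
```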

Question 2

Find the number of non-repeating integers among 250 million integers; the memory is not enough to hold all 250 million integers.

Solution idea: a 2-bit map, or two bitmaps. 
Extend the bit-map by using 2 bits to represent each number: 00 means not seen, 01 means seen once, 10 means seen twice or more, and 11 is unused for now. 
While traversing the numbers, if the corresponding bits are 00, set them to 01; if they are 01, set them to 10; if they are 10, leave them unchanged. The memory required is 2^32 / 8 * 2 bytes = 1 GB. 
Alternatively, instead of using 2 bits per value, two separate bit-maps can simulate the 2-bit map; the principle is exactly the same.

3.2 Bloom filter

A Bloom filter can be regarded as an extension of the bit-map. 
For details, see july's CSDN article Bloom Filter 详解.
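
Since this subsection only points to a reference, the following is a minimal Bloom filter sketch of this summary's own (not code from the referenced article): k bit positions derived from one MD5 digest are set per element, and membership tests may return false positives but never false negatives. The default sizes are arbitrary:

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=8 * 1024 * 1024, num_hashes=7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Double hashing: derive k positions from two halves of an MD5 digest.
        digest = hashlib.md5(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:], "big")
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```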

4 Hadoop+MapReduce

For reference, see july's CSDN articles: 
MapReduce的初步理解 
Hadoop框架与MapReduce模式
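
To make the MapReduce idea concrete, here is a toy word count written as plain Python functions; it sketches the map / shuffle / reduce programming model only and does not use the Hadoop API:

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one line of input."""
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    """Group intermediate pairs by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reducer: sum the partial counts for one word."""
    return key, sum(values)

def word_count(lines):
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```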
