2019 autumn recruitment review notes: massive data processing (compiled)

1. From massive log data, extract the IP that visited Baidu the most times on a given day.

source

Algorithm: divide and conquer + hash
1. An IP address has at most 2^32 = 4G possible values, so the data cannot be loaded into memory and processed all at once;
2. Apply the "divide and conquer" idea: use Hash(IP) % 1024 to distribute the massive IP log into 1024 small files. Each small file then contains at most 4M distinct IP addresses;
3. For each small file, build a hash map with the IP as the key and its number of occurrences as the value, and keep track of the IP that has appeared most often so far;
4. This yields the most frequent IP of each of the 1024 small files; an ordinary sort (or a single pass for the maximum) over these 1024 candidates then gives the most frequent IP overall.
Note that the same IP is always mapped by the hash to the same small file, so its full count is never split across files. This is why the method is guaranteed to find the IP with the highest frequency.
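
The four steps above can be sketched in Python as follows. This is only a minimal illustration under some assumptions: the log has already been reduced to one IP string per record, `zlib.crc32` stands in for the unspecified Hash function, and the names `partition_ips`, `top_ip_in_file`, and `most_frequent_ip` are hypothetical helpers, not part of the original notes.

```python
import os
import tempfile
import zlib
from collections import Counter

NUM_BUCKETS = 1024  # bucket count used in the steps above


def partition_ips(ips, bucket_dir, num_buckets=NUM_BUCKETS):
    """Step 2: append each IP to the small file chosen by hash(IP) % num_buckets."""
    files = {}
    try:
        for ip in ips:
            # crc32 is an assumed stand-in for Hash(IP); any stable hash works.
            idx = zlib.crc32(ip.encode("utf-8")) % num_buckets
            if idx not in files:
                # Note: 1024 open files may exceed the OS file-descriptor limit.
                path = os.path.join(bucket_dir, f"bucket_{idx}.txt")
                files[idx] = open(path, "w")
            files[idx].write(ip + "\n")
    finally:
        for f in files.values():
            f.close()
    return sorted(files)  # indices of the buckets that received data


def top_ip_in_file(path):
    """Step 3: count occurrences in one small file and return (ip, count) of the top IP."""
    counts = Counter()
    with open(path) as f:
        for line in f:
            counts[line.strip()] += 1
    return counts.most_common(1)[0]


def most_frequent_ip(ips, num_buckets=NUM_BUCKETS):
    """Step 4: compare the per-file winners to get the overall winner."""
    with tempfile.TemporaryDirectory() as bucket_dir:
        used = partition_ips(ips, bucket_dir, num_buckets)
        winners = [
            top_ip_in_file(os.path.join(bucket_dir, f"bucket_{i}.txt"))
            for i in used
        ]
    return max(winners, key=lambda pair: pair[1])
```

Because only one small file is counted at a time, the in-memory hash map never holds more than the distinct IPs of a single bucket.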

2. There are 100 GB of integer data on disk and 1 GB of memory. Find all the non-repeated (unique) integers in the 100 GB of data. What is the minimum amount of auxiliary memory required?

Source: Tencent interview

Solution 1: Bitmap

An int has 2^32 possible values in total, so we allocate 2^32 bits and let each bit represent one number: 0 means the number has not appeared, 1 means it has appeared. For example, when 4 appears we set bit 4 to 1; when 10000 appears we set bit 10000 to 1. These 2^32 bits occupy 2^32 / 8 = 2^29 B. Since 2^10 is about 10^3, 2^30 is about 10^9, so 2^29 B = 2^-1 * 2^30 B, which is about 0.5 * 10^9 B, that is 512 MB (roughly 500 MB). Note that one bit per number only records whether it appeared; telling a single occurrence from repeated ones needs more than one bit per number.
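
As a rough sketch of the one-bit-per-number idea, the Python below stores the bits in a `bytearray`. The class name `Bitmap` and the helper `bitmap_bytes` are hypothetical names chosen for illustration, and the example uses a small value range so that it runs instantly; the full 2^32 range would allocate 512 MB.

```python
class Bitmap:
    """One bit per possible value: 0 = not seen, 1 = seen."""

    def __init__(self, num_values):
        # Round up to whole bytes: 8 values are packed into each byte.
        self.bits = bytearray((num_values + 7) // 8)

    def set(self, n):
        """Mark value n as seen by setting bit n to 1."""
        self.bits[n >> 3] |= 1 << (n & 7)

    def test(self, n):
        """Return True if value n has been marked as seen."""
        return bool(self.bits[n >> 3] & (1 << (n & 7)))


def bitmap_bytes(num_values):
    """Memory needed, in bytes, for a one-bit-per-value bitmap."""
    return (num_values + 7) // 8


bm = Bitmap(20000)
bm.set(4)
bm.set(10000)
```

Here `bitmap_bytes(2**32)` evaluates to 2^29 bytes, matching the arithmetic above.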

Extension

If we want to record how many times each number occurs, we can use two bits per number to distinguish four cases: 00 means the number has not appeared, 01 means it appeared once, 10 means it appeared twice, and 11 means it appeared more than twice. Scan the 10 billion numbers and look at the corresponding position in the bitmap: if the state is 00, 01, or 10, increment it by one; if it is 11, leave it unchanged. Finally, scan the bitmap again and output the numbers whose state matches the question: 01 for numbers that appear exactly once, 10 for numbers that appear exactly twice. The memory needed is 2^32 * 2 bits = 8 Gbit = 1 GB.

So how do we locate a number in this layout? Suppose we read the number 64. Its slot starts at bit 64 * 2 = 128, so bits 128 and 129 (counting from 0) together record its state. Other numbers work the same way. Each time we read a number, we go to its slot and read the two bits first: if the state is already 01 and the number arrives again, we change it to 10. At the end we scan the whole bitmap: for every slot in the state we are looking for (for example 10), dividing its starting bit index by 2 gives back the number.
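
The two-bit scheme can be sketched as below. This is an illustrative sketch, not the original authors' code: the names `TwoBitCounter` and `numbers_with_state` are hypothetical, slots are laid out zero-based with number n at bits 2n and 2n+1, and a small value range is used so the example runs quickly.

```python
class TwoBitCounter:
    """Two bits per value: 00 = absent, 01 = once, 10 = twice, 11 = more than twice."""

    def __init__(self, num_values):
        # 2 bits per value, so 4 values are packed into each byte.
        self.bits = bytearray((2 * num_values + 7) // 8)

    def get(self, n):
        """Read the 2-bit state of value n (bits 2n and 2n+1)."""
        shift = (n & 3) * 2
        return (self.bits[n >> 2] >> shift) & 0b11

    def add(self, n):
        """Record one more occurrence of n, saturating at state 11."""
        state = self.get(n)
        if state < 0b11:
            shift = (n & 3) * 2
            byte = self.bits[n >> 2]
            byte &= ~(0b11 << shift) & 0xFF   # clear the old 2-bit state
            byte |= (state + 1) << shift      # write the incremented state
            self.bits[n >> 2] = byte


def numbers_with_state(counter, num_values, state):
    """Final scan: return every value whose slot holds the given state."""
    return [n for n in range(num_values) if counter.get(n) == state]


counter = TwoBitCounter(100)
for value in [64, 7, 64, 3, 7, 7, 7, 99]:
    counter.add(value)

unique_values = numbers_with_state(counter, 100, 0b01)
seen_twice = numbers_with_state(counter, 100, 0b10)
```

For the full 32-bit range the `bytearray` would hold 2^33 bits, the 1 GB computed above.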
 
Compiled from:
https://blog.csdn.net/v_JULY_v/article/details/6279498
https://blog.csdn.net/IT_YUAN/article/details/8106573
https://blog.csdn.net/v_july_v/article/details/7382693

Origin www.cnblogs.com/greatLong/p/11462610.html