Hash partitioning

Hash partitioning problem:

Given a log file larger than 100 GB that stores IP addresses, design an algorithm to find the IP address that occurs most often. Under the same conditions, how would you find the top K IP addresses? How could you do this directly with Linux system commands?

Answer:

Hash partitioning splits a large file into several small files using a hash function, so that identical values are always assigned to the same small file. For example, a large file storing 10 billion integers can be split into 100 smaller files: take each number modulo 100 and store all numbers with the same remainder in the same file. Identical numbers always give the same remainder modulo 100, so every copy of a number ends up in the same file.
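The modulo split can be sketched in a few lines of Python. This is only an illustration: it uses in-memory lists in place of the 100 small files, and the name `partition_by_modulo` is made up for the example.

```python
from collections import defaultdict

def partition_by_modulo(numbers, num_buckets=100):
    """Group numbers by their remainder modulo num_buckets."""
    buckets = defaultdict(list)
    for n in numbers:
        # Equal numbers always have equal remainders, so they share a bucket.
        buckets[n % num_buckets].append(n)
    return buckets

buckets = partition_by_modulo([5, 105, 205, 7, 107, 42])
```

Here 5, 105 and 205 all land in bucket 5. On real data each bucket would be a separate file on disk, small enough to process in memory.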

1. Find the IP address that appears most frequently:

Hash bucketing for the most frequent IP:
• Split the 100 GB file into 1000 parts, mapping each IP address to a file by (IP converted to an integer) % 1000
• Find the most frequent IP address in each file, then merge the per-file results to get the overall most frequent one

Hash bucketing for the top K:
• Use hash bucketing to distribute the data across different files
• Compute the top K within each file
• Merge the per-file top K lists into the final result
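The bucketing steps above might look roughly like this in Python. It is a sketch under several assumptions: every log line contains exactly one IPv4 address, the helper names (`bucket_index`, `split_into_buckets`, `most_frequent_ip`) are hypothetical, and the demo uses 4 buckets rather than 1000, since keeping 1000 files open at once can exceed the open-file limit on some systems.

```python
import ipaddress
import os
import tempfile
from collections import Counter
from contextlib import ExitStack

def bucket_index(ip, num_buckets):
    """Map an IPv4 address to a bucket: (IP as integer) % num_buckets."""
    return int(ipaddress.IPv4Address(ip)) % num_buckets

def split_into_buckets(lines, workdir, num_buckets):
    """Write each IP to the bucket file chosen by bucket_index."""
    paths = [os.path.join(workdir, f"bucket_{i}.txt") for i in range(num_buckets)]
    with ExitStack() as stack:
        files = [stack.enter_context(open(p, "w")) for p in paths]
        for line in lines:
            ip = line.strip()
            if ip:
                files[bucket_index(ip, num_buckets)].write(ip + "\n")
    return paths

def most_frequent_ip(paths):
    """Count each bucket on its own and keep the best (ip, count) seen."""
    best_ip, best_count = None, 0
    for path in paths:
        with open(path) as f:
            counts = Counter(line.strip() for line in f)
        if counts:
            ip, count = counts.most_common(1)[0]
            if count > best_count:
                best_ip, best_count = ip, count
    return best_ip, best_count

sample = ["10.0.0.1", "10.0.0.2", "10.0.0.1", "192.168.1.5", "10.0.0.1", "10.0.0.2"]
with tempfile.TemporaryDirectory() as workdir:
    paths = split_into_buckets(sample, workdir, num_buckets=4)
    result = most_frequent_ip(paths)
```

Because all occurrences of an IP fall into one bucket, the count found inside that bucket is its true global count, so comparing the per-bucket winners is enough.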

2. How to find the top K IP addresses:

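Build a min-heap of K entries keyed by occurrence count, then compare each remaining count against the heap's root: if the count is larger, it replaces the root.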
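A minimal sketch of the min-heap comparison, using Python's standard `heapq` module. It assumes the per-IP counts are already available as a dictionary; the function name `top_k_ips` is hypothetical.

```python
import heapq

def top_k_ips(ip_counts, k):
    """Return the k (ip, count) pairs with the largest counts, highest first."""
    if k <= 0:
        return []
    heap = []  # min-heap of (count, ip); heap[0] holds the smallest kept count
    for ip, count in ip_counts.items():
        if len(heap) < k:
            heapq.heappush(heap, (count, ip))
        elif count > heap[0][0]:
            # The new count beats the smallest kept one, so swap it in.
            heapq.heapreplace(heap, (count, ip))
    return [(ip, count) for count, ip in sorted(heap, reverse=True)]

counts = {"10.0.0.1": 3, "10.0.0.2": 2, "192.168.1.5": 1, "172.16.0.9": 5}
top2 = top_k_ips(counts, 2)
```

The heap never holds more than K entries, so memory stays proportional to K no matter how many distinct IP addresses are scanned.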

3. Using Linux system commands:

Assuming K = 10:
sort log_file | uniq -c | sort -nr -k1,1 | head -10


Origin blog.csdn.net/weixin_43264873/article/details/102980358