21 Finding a specific record in a 5G data file (with only 50M of memory)

1 Steps
1) Split the file. First I need a 5G log file to work with. To keep every record intact, the file is always split on line boundaries. I first generate a 1G file, then append it 5 times in a loop to get the 5G file. You can also build the 5G file directly from a 4k file, but on a slow CPU that takes a very long time, and my machine is one of those. The script that generates the 5G file is given below. The 1G file has to be generated first from a small 4k file; the code is the same, only the loop count changes.

#!/bin/bash

BASE_LOG_PATH=~/MyLinux/BigFileSplit/message.log ## 1G file
RES_LOG_PATH=~/MyLinux/BigFileSplit/result.log   ## target file

for i in $(seq 1 5); do
    cat "$BASE_LOG_PATH" >> "$RES_LOG_PATH"
    echo "www.example.com|10.32.185.95|-[28/Oct/2014:12:34:39 +0800]|" >> "$RES_LOG_PATH"    ## the record to search for
done
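
The "same code, different loop count" idea can be sketched as one small helper. This is only an illustration: the function name repeat_file and the file small.log are assumptions, not part of the original scripts.

```shell
#!/bin/bash
# repeat_file SRC DST COUNT: append SRC to DST COUNT times.
# (repeat_file is a hypothetical helper name used for illustration.)
repeat_file() {
    local src=$1 dst=$2 count=$3 i
    for (( i = 0; i < count; i++ )); do
        cat "$src" >> "$dst"
    done
}

# Hypothetical usage: a 4k file appended 262144 times gives 1G
# (4096 bytes x 262144 = 1073741824 bytes). small.log is an assumed name.
# repeat_file ~/MyLinux/BigFileSplit/small.log ~/MyLinux/BigFileSplit/message.log 262144
```

Appending with >> means an existing destination file keeps growing, so remove the old file before running the generation again.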

2) Check the total number of lines and the size of the generated file with wc -l result.log and ll -h. The file has 182,273,288 lines, more than 180 million, and is about 4.6G, which is treated as 5G here. Use these two numbers to work out how many lines go into each 10M chunk. That is:
5G = 5120M holds 182273288 lines; splitting it into 10M chunks gives 5120 / 10 = 512 chunks.
182273288 / 512 ≈ 356002, so 10M is approximately 356002 lines.
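
The same calculation can be done in the shell with integer arithmetic. A minimal sketch, where lines_per_chunk is a hypothetical helper name and the sizes are given in M:

```shell
#!/bin/bash
# lines_per_chunk TOTAL_LINES TOTAL_MB CHUNK_MB
# (hypothetical helper: integer estimate of lines per chunk)
lines_per_chunk() {
    local total_lines=$1 total_mb=$2 chunk_mb=$3
    local chunks=$(( total_mb / chunk_mb ))   # 5120 / 10 = 512 chunks
    echo $(( total_lines / chunks ))          # integer division, remainder dropped
}

lines_per_chunk 182273288 5120 10   # prints 356002
```

Because the division drops the remainder, split produces one extra, very small chunk for the leftover lines.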

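Then split the file using that line count:
split -l 356002 result.log test //Split the file; the chunks are named testaa, testab, ...
ll -h //You can see that each chunk is about 9.2M, a little under 10M because the whole file is really 4.6G.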
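
Before searching, it can be worth confirming that the split lost no lines. A rough sketch, assuming nothing else in the directory starts with the chosen prefix; split_and_check is a hypothetical helper name:

```shell
#!/bin/bash
# split_and_check FILE LINES PREFIX
# Splits FILE into chunks of LINES lines each and checks that no line was lost.
# (hypothetical helper; PREFIX must not match any unrelated file)
split_and_check() {
    local file=$1 lines=$2 prefix=$3
    split -l "$lines" "$file" "$prefix" || return 1
    local before after
    before=$(wc -l < "$file")
    after=$(cat "$prefix"* | wc -l)
    # The two counts are equal only if every line landed in exactly one chunk
    [ "$before" -eq "$after" ]
}

# Hypothetical usage with the numbers from this article:
# split_and_check result.log 356002 test && echo "split OK"
```

The check streams the chunks through wc, so it stays within a small memory budget even for the full 5G file.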

3) Write a script to find the target data.

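#!/bin/bash
name=~/MyLinux/BigFileSplit/test   ## prefix that was passed to split
## split names its output testaa, testab, ..., so match the two-letter suffix
for ename in "$name"??; do
    ## -H prints the chunk name, -n the line number inside that chunk
    grep -Hn "www.example.com" "$ename"
done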
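
The line numbers printed for each hit are relative to the chunk, not to result.log. A sketch of mapping them back to line numbers in the original file, assuming the chunks were made by split -l with a fixed line count and that the shell lists them in the order split wrote them; search_chunks is a hypothetical helper name:

```shell
#!/bin/bash
# search_chunks PREFIX LINES_PER_CHUNK PATTERN
# Prints "<line number in the original file>:<matching line>" for every hit.
# (hypothetical helper; assumes chunks come from `split -l LINES_PER_CHUNK`)
search_chunks() {
    local prefix=$1 per_chunk=$2 pattern=$3
    local index=0 chunk
    for chunk in "$prefix"*; do
        [ -f "$chunk" ] || continue
        # grep -n prefixes each hit with its line number inside the chunk;
        # -F treats the pattern as a fixed string, not a regular expression
        grep -n -F -- "$pattern" "$chunk" | while IFS=: read -r local_no text; do
            echo "$(( index * per_chunk + local_no )):$text"
        done
        index=$(( index + 1 ))
    done
}

# Hypothetical usage with the numbers from this article:
# search_chunks ~/MyLinux/BigFileSplit/test 356002 "www.example.com"
```

Only one chunk is read at a time and grep processes it line by line, so memory use stays far below the 50M limit.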

Origin blog.csdn.net/weixin_44517656/article/details/107932554