Using multiple CPU cores to accelerate Linux commands - awk, sed, bzip2, grep, wc

Have you ever needed to process a very large amount of data (hundreds of GB)? Or to search through it, or run other operations on it that cannot be parallelized? Data folks, I'm talking to you. You may have a CPU with 4 or more cores, but our usual tools, such as grep, bzip2, wc, awk, sed and so on, are single-threaded and will only ever use one CPU core. To borrow the words of the cartoon character Cartman, "How do I use all these cores?" To make Linux commands use all the CPU cores, we need the GNU Parallel command, which lets all of our cores do magical map-reduce work on a single machine, with the help of the rarely used --pipe parameter (also called --spreadstdin). With it, your load is spread evenly across the CPUs.

BZIP2
bzip2 compresses better than gzip, but it is very slow! Don't despair, we have a solution for this problem. The old way:
cat bigfile.bin | bzip2 --best > compressedfile.bz2
Now this:
cat bigfile.bin | parallel --pipe --recend '' -k bzip2 --best > compressedfile.bz2
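The result is still an ordinary .bz2 file: each block is compressed on its own and the streams are concatenated, and bzip2 is documented to decompress concatenated streams correctly. As a quick sanity-check sketch (assuming the original bigfile.bin is still around), you can compare the round trip against the original:
bzip2 -dc compressedfile.bz2 | cmp - bigfile.bin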
Especially for bzip2, GNU parallel is blazingly fast on a multi-core CPU; before you know it, it is done.

GREP
If you have an enormous text file, you probably used to do this:
grep pattern bigfile.txt
Now you can:
cat bigfile.txt | parallel --pipe grep 'pattern'
Or this:
cat bigfile.txt | parallel --block 10M --pipe grep 'pattern'
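If you don't want parallel to grab every core, its standard -j option caps the number of simultaneous jobs; a small variation of the command above, limiting it to 4 jobs, might look like this:
cat bigfile.txt | parallel -j4 --block 10M --pipe grep 'pattern'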
The --block 10M parameter used above hands each job roughly 10 MB of input at a time; you can use it to tune how much data each CPU core gets per chunk.

AWK
The following is an example of using awk to sum the numbers in a very large data file. The usual way:
cat rands20M.txt | awk '{s+=$1} END {print s}'
Now this:
cat rands20M.txt | parallel --pipe awk \'{s+=\$1} END {print s}\' | awk '{s+=$1} END {print s}'
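If the backslashes put you off, GNU parallel's -q (--quote) option tells it to quote the command's arguments for the shell itself, so the awk program can be written the usual way; a roughly equivalent sketch using that documented flag:
cat rands20M.txt | parallel --pipe -q awk '{s+=$1} END {print s}' | awk '{s+=$1} END {print s}'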
This one looks a bit convoluted: parallel's --pipe parameter splits the output of cat into blocks and dispatches each block to its own awk call, producing many partial sums. Those partial sums are then piped into a second awk, which adds them up into the final result. The backslashes in the first awk are the escaping GNU parallel needs when the command is passed to it that way.

WC
Want the fastest way to count the number of lines in a file? The traditional way:
wc -l bigfile.txt
Now you should do:
cat bigfile.txt | parallel --pipe wc -l | awk '{s+=$1} END {print s}'
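To see what this buys you on your own machine, you can time both versions with the ordinary shell time keyword, for example:
time wc -l bigfile.txt
time (cat bigfile.txt | parallel --pipe wc -l | awk '{s+=$1} END {print s}')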
Very clever: parallel is first used to 'map' a large number of wc -l calls, producing partial counts, which are then sent down the pipeline to awk to be summed up.

SED
Want to do a huge number of replacements in a huge file with sed? The usual way:
sed s^old^new^g bigfile.txt
Now you can:
cat bigfile.txt | parallel --pipe sed s^old^new^g
Then you can use a pipe to store the output in a file of your choice.
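For example (newfile.txt is just a placeholder name here; the -k / --keep-order flag, the same one used in the bzip2 example above, keeps the replaced lines in their original order):
cat bigfile.txt | parallel -k --pipe sed s^old^new^g > newfile.txt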

Reproduced from: https://my.oschina.net/766/blog/211138

Origin: blog.csdn.net/weixin_33676492/article/details/91547157