Use all your CPU cores to speed up your Linux commands

When dealing with big data, we naturally reach for parallelism to speed things up. Modern CPUs have multiple cores and threads, yet many everyday commands are single-threaded and cannot run in parallel: grep, bzip2, wc, awk, sed and the like each use only one CPU core. To make these Linux commands use all the cores, we need GNU Parallel. Let's look at how below.

We all know that grep, bzip2, wc, awk, sed, etc. are single-threaded and can only use one CPU core. So how do we put the other cores to work?

To make Linux commands use all CPU cores, we need the GNU Parallel command, which lets all of our CPU cores perform a little map-reduce magic on a single machine. The trick is the rarely used --pipe parameter (also called --spreadstdin): it splits standard input into chunks and hands one chunk to each job, so the load is spread evenly across your CPUs.
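
To get a feel for how --pipe carves up standard input, you can run a small experiment (a sketch, assuming GNU Parallel is installed; the numbers are arbitrary):

# split 10 million generated lines into chunks (about 1 MB each by default)
# and run one wc -l per chunk -- each output line is the size of one chunk
seq 10000000 | parallel --pipe wc -l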

bzip2

bzip2 is a better compression tool than gzip, but it's slow! Don't worry, we have a solution to this problem.

Previous practice:

cat bigfile.bin | bzip2 --best > compressedfile.bz2

Now like this:

cat bigfile.bin | parallel --pipe --recend '' -k bzip2 --best > compressedfile.bz2

GNU Parallel gives bzip2 in particular a huge speedup on multi-core CPUs; before you know it, the job is done.
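
Since each chunk is compressed independently and the results are concatenated (-k keeps them in input order), it is worth a quick sanity check that the archive still decompresses cleanly. For example:

# test the integrity of the compressed file without writing anything to disk
bzip2 -t compressedfile.bz2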

GREP

If you have a very large text file, previously you might do:

grep pattern bigfile.txt

Now you can do this:

cat bigfile.txt | parallel --pipe grep 'pattern'

or this:

cat bigfile.txt | parallel --block 10M --pipe grep 'pattern'

The second form uses the --block 10M parameter, which means each grep job is handed a chunk of roughly 10 MB of input -- you can use this parameter to tune how much data each CPU core processes at a time.
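
When the input is a regular file rather than a stream, recent versions of GNU Parallel also offer --pipepart, which lets each job read its block directly from the file instead of funnelling everything through cat. A sketch, assuming your installed version supports --pipepart:

# read ~10 MB blocks straight from bigfile.txt, one grep per block
parallel --pipepart -a bigfile.txt --block 10M grep 'pattern'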

AWK

The following example uses awk to sum a column of a very large data file.

General usage:

cat rands20M.txt | awk '{s+=$1} END {print s}'

Now like this:

cat rands20M.txt | parallel --pipe awk \'{s+=\$1} END {print s}\' | awk '{s+=$1} END {print s}'

This one is a bit more involved: the --pipe parameter splits cat's output into chunks and hands each chunk to its own awk invocation, so many partial sums are computed in parallel. Those partial sums are then piped into a second awk, which adds them up and prints the final result. The three backslashes in the first awk escape the quotes and the $ so that the shell GNU Parallel spawns does not interpret them.
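
If the backslash escaping feels error-prone, GNU Parallel's -q (--quote) option quotes the command for you, so the inner awk program can be written normally. A sketch that should be equivalent to the command above:

cat rands20M.txt | parallel --pipe -q awk '{s+=$1} END {print s}' | awk '{s+=$1} END {print s}'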

WC

Want to count the number of lines in a file as fast as possible?

Traditional approach:

wc -l bigfile.txt

Now you should:

cat bigfile.txt | parallel --pipe wc -l | awk '{s+=$1} END {print s}'

Quite ingenious: parallel first 'maps' the input into many wc -l sub-computations, one per chunk, and their results are finally piped into awk, which sums them up.
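
The same map-then-sum pattern works for other counters too, for example counting words instead of lines (a sketch; since --pipe splits on newlines by default, no word is cut in half at a chunk boundary):

cat bigfile.txt | parallel --pipe wc -w | awk '{s+=$1} END {print s}'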

SED

Want to use the sed command to do a lot of substitutions in a huge file?

Conventional practice:

sed s^old^new^g bigfile.txt

Now you can:

cat bigfile.txt | parallel --pipe sed s^old^new^g

...and then you can redirect the output to a file of your choice.
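
A minimal sketch (newfile.txt is just an illustrative name; -k keeps the chunks in their original order, as in the bzip2 example above):

cat bigfile.txt | parallel --pipe -k sed s^old^new^g > newfile.txt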

 

Origin: blog.csdn.net/yaxuan88521/article/details/130543553