An Introduction to xz Utils, a Multi-threaded, High-Compression-Ratio Tool for Linux/Unix

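On Linux and Unix systems, the common compression tools are gzip, bzip2, and xz, and tar has a matching option for each: -z, -j, and -J respectively. In terms of how well they compress, the ranking is roughly gzip < bzip2 < xz: xz produces the smallest files, but it is also the slowest of the three. A long-standing complaint about all of them is that a single file is compressed by a single thread, so CPU utilization stays low and even a high-end server cannot go any faster. xz 5.2.0 and later solve this for compression: they can compress with multiple threads, while decompression remains single-threaded. This is especially helpful for very large single files such as database export files, because it keeps the CPU busy and shortens compression time while preserving a very high compression ratio. In my tests, an Oracle export file compressed to about 1% of its original size and a DB2 export file to about 2%, which makes the tool both fast and practical.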

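To use multi-threaded compression you first need to install xz 5.2.0 or later. The latest version at the time of writing is 5.2.4. Using the latest version is recommended, because earlier versions have some memory-control problems that can cause compression to fail.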
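A quick way to check whether the installed xz is new enough is to compare its version against 5.2.0. The following is a minimal sketch, assuming the first line of `xz --version` ends with the version number (for example `xz (XZ Utils) 5.2.4`); the helper name `version_ge` is made up for this illustration.

```shell
# Succeeds if dotted version $1 is greater than or equal to version $2.
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        na = split(a, x, "."); nb = split(b, y, ".")
        n = (na > nb) ? na : nb
        for (i = 1; i <= n; i++) {
            # Missing components count as 0, so "5.2" equals "5.2.0".
            if (x[i] + 0 > y[i] + 0) exit 0
            if (x[i] + 0 < y[i] + 0) exit 1
        }
        exit 0
    }'
}

# Assumption: the version is the last field on the first output line.
if command -v xz >/dev/null 2>&1; then
    installed=$(xz --version | awk 'NR == 1 { print $NF }')
    if version_ge "$installed" 5.2.0; then
        echo "xz $installed: multi-threaded compression available"
    else
        echo "xz $installed: too old for -T/--threads"
    fi
else
    echo "xz not found"
fi
```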

Compressing a file is straightforward; just use the -z option, for example:

xz -z ./haha.txt

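This creates haha.txt.xz in the current directory (by default xz removes the original file once compression succeeds). To decompress, use the -d option instead.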
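To make the compress and decompress steps concrete, here is a small round-trip sketch. It additionally uses -k (keep the input file) and -t (test archive integrity), which are standard xz options but are not discussed in this article; the function name `xz_roundtrip` is hypothetical.

```shell
# Compress FILE with xz while keeping the original, then verify that
# decompressing the result reproduces the input byte for byte.
xz_roundtrip() {
    file=$1
    xz -z -k "$file" || return 1            # creates "$file.xz"; -k keeps "$file"
    xz -t "$file.xz" || return 1            # integrity check of the archive
    xz -d -c "$file.xz" | cmp -s - "$file"  # decompress to stdout and compare
}

# Example call (file name taken from the article):
#   xz_roundtrip ./haha.txt && echo "round trip OK"
```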

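To choose the output file name yourself, use the -c option, which writes the compressed data to standard output, and redirect it: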

xz -c ./haha.txt > gaga.xz

To compress a tar archive as it is being created, pipe the output of tar into xz:

tar -cf - smit.log | xz -c > haha.tar.xz
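The reverse direction works the same way: decompress to standard output and let tar read from the pipe. The sketch below packs a directory and then lists the archive to confirm that it can be read back; the function name `pack_and_list` is made up for illustration.

```shell
# pack_and_list DIR ARCHIVE
# Pack DIR into ARCHIVE (.tar.xz) through a pipe, then list the contents
# by decompressing to stdout and feeding the stream back into tar.
pack_and_list() {
    dir=$1
    archive=$2
    tar -cf - "$dir" | xz -c > "$archive" || return 1
    xz -dc "$archive" | tar -tf -
}

# To extract instead of listing:
#   xz -dc haha.tar.xz | tar -xf -
```

With a tar that supports the -J option mentioned at the start of this article, `tar -cJf haha.tar.xz smit.log` should produce an equivalent archive in one step.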

To use multiple threads, pass the -T or --threads option. The following command compresses with 4 threads:

xz -T 4 -z ./haha.txt

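Note that -T only sets the maximum number of threads; the number actually used may be lower. If you specify 0, xz uses as many threads as there are CPU cores. In general, the more threads, the higher the CPU utilization.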

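Multi-threaded compression works by splitting the input into blocks as it is read, letting different threads compress different blocks, and then concatenating the results. The size of each block is written into its block header. The default block size depends on the compression level: three times the LZMA2 dictionary size, or 1 MiB if that is larger. See the official documentation: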

-T threads, --threads=threads

Specify the number of worker threads to use. Setting threads to a special value 0 makes xz use as many threads as there are CPU cores on the system. The actual number of threads can be less than threads if the input file is not big enough for threading with the given settings or if using more threads would exceed the memory usage limit.

Currently the only threading method is to split the input into blocks and compress them independently from each other. The default block size depends on the compression level and can be overridden with the --block-size=size option.

Threaded decompression hasn't been implemented yet. It will only work on files that contain multiple blocks with size information in block headers. All files compressed in multi-threaded mode meet this condition, but files

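If you only want single-threaded compression, simply leave out the -T option. Keep in mind that, according to the documentation, files created in single-threaded mode carry no block size information, so a future version of xz will not be able to decompress them with multiple threads.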
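The block layout of a compressed file can be inspected after the fact. The sketch below assumes that `xz --robot --list` prints a tab-separated line starting with `file` whose third column is the total number of blocks, as described in the xz man page; the helper name `xz_block_count` is hypothetical.

```shell
# Print the number of blocks in an .xz file, taken from the "file" line
# of the machine-readable list output (third column).
xz_block_count() {
    xz --robot --list "$1" | awk '$1 == "file" { print $3 }'
}

# Example: compress with 4 threads and 1 MiB blocks, keep the input (-k),
# then count the blocks in the result.
#   xz -T 4 --block-size=1MiB -k ./haha.txt
#   xz_block_count ./haha.txt.xz
```

A file larger than the block size should report more than one block when it was compressed this way.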

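You can set the block size yourself; the recommended value is 2-4 times the dictionary size. The larger the blocks, the more memory is used, and the more threads, the more memory is used as well. So if compression fails with an out-of-memory error, reduce both the thread count and the block size.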

--block-size=size

When compressing to the .xz format, split the input data into blocks of size bytes. The blocks are compressed independently from each other, which helps with multi-threading and makes limited random-access decompression possible. This option is typically used to override the default block size in multi-threaded mode, but this option can be used in single-threaded mode too.

In multi-threaded mode about three times size bytes will be allocated in each thread for buffering input and output. The default size is three times the LZMA2 dictionary size or 1 MiB, whichever is more. Typically a good value is 2-4 times the size of the LZMA2 dictionary or at least 1 MiB. Using size less than the LZMA2 dictionary size is waste of RAM because then the LZMA2 dictionary buffer will never get fully used. The sizes of the blocks are stored in the block headers, which a future version of xz will use for multi-threaded decompression.

In single-threaded mode no block splitting is done by default. Setting this option doesn't affect memory usage. No size information is stored in block headers, thus files created in single-threaded mode won't be identical to files created in multi-threaded mode. The lack of size information also means that a future version of xz won't be able to decompress the files in multi-threaded mode.
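The rule quoted above (about three times the block size per thread for input and output buffers) gives a rough lower bound on memory use. The sketch below applies it to the thread and block-size combinations used in the tests in this article; it ignores the memory the encoder itself needs, so the real usage will be higher. The function name `buffer_mib` is made up for illustration.

```shell
# buffer_mib THREADS BLOCK_MIB
# Approximate buffer memory in MiB: threads * 3 * block size.
# Encoder (LZMA2 dictionary etc.) memory comes on top of this.
buffer_mib() {
    echo $(( $1 * 3 * $2 ))
}

buffer_mib 8 32     # prints 768
buffer_mib 16 8     # prints 384
buffer_mib 16 128   # prints 6144
```

By this estimate, doubling the threads while quartering the block size roughly halves the buffer memory, whereas 16 threads with 128 MiB blocks need about 6 GiB for buffers alone.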

--block-list=sizes

When compressing to the .xz format, start a new block after the given intervals of uncompressed data.

The uncompressed sizes of the blocks are specified as a comma-separated list. Omitting a size (two or more consecutive commas) is a shorthand to use the size of the previous block.

If the input file is bigger than the sum of sizes, the last value in sizes is repeated until the end of the file. A special value of 0 may be used as the last value to indicate that the rest of the file should be encoded as a single block.

If one specifies sizes that exceed the encoder's block size (either the default value in threaded mode or the value specified with --block-size=size), the encoder will create additional blocks while keeping the boundaries specified in sizes. For example, if one specifies --block-size=10MiB --block-list=5MiB,10MiB,8MiB,12MiB,24MiB and the input file is 80 MiB, one will get 11 blocks: 5, 10, 8, 10, 2, 10, 10, 4, 10, 10, and 1 MiB.
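The 11-block result in the quoted example can be reproduced with a few lines of shell arithmetic. This is only a simplified model of the rule as described in the documentation, not how xz implements it: all sizes are whole MiB, the special value 0 and the empty-entry shorthand are not handled, and the function name `split_blocks` is hypothetical.

```shell
# split_blocks BLOCK_SIZE INPUT_SIZE SIZE...   (all values in MiB, all > 0)
# Prints the resulting block sizes on one line.
split_blocks() {
    [ "$#" -ge 3 ] || return 1
    block=$1
    remaining=$2
    shift 2
    seg=0
    out=""
    while [ "$remaining" -gt 0 ]; do
        # Take the next list entry; once the list is exhausted,
        # the last entry keeps being reused.
        if [ "$#" -gt 0 ]; then seg=$1; shift; fi
        [ "$seg" -gt 0 ] || return 1
        # The final segment is cut off at the end of the input.
        if [ "$seg" -gt "$remaining" ]; then cur=$remaining; else cur=$seg; fi
        remaining=$((remaining - cur))
        # Split the segment further wherever it exceeds the block size.
        while [ "$cur" -gt 0 ]; do
            if [ "$cur" -gt "$block" ]; then piece=$block; else piece=$cur; fi
            out="$out $piece"
            cur=$((cur - piece))
        done
    done
    echo $out
}

split_blocks 10 80 5 10 8 12 24   # prints: 5 10 8 10 2 10 10 4 10 10 1
```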

In multi-threaded mode the sizes of the blocks are stored in the block headers. This isn't done in single-threaded mode, so the encoded output won't be identical to that of the multi-threaded mode.

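Choosing the thread count and block size takes some experimentation. To find good values, try several combinations and compare the results.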
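One way to run such experiments is a small loop that compresses the same file with several thread and block-size combinations and reports the output size and elapsed time. This is a rough sketch, not a rigorous benchmark: it assumes `date +%s` is available, measures whole seconds only, and writes to a scratch file whose name (`FILE.bench.xz`) is made up here.

```shell
# bench_xz FILE THREADS:BLOCKSIZE...
# Example: bench_xz ./export.dat 8:32MiB 16:8MiB
bench_xz() {
    file=$1
    shift
    for combo in "$@"; do
        threads=${combo%%:*}      # part before the colon
        bsize=${combo#*:}         # part after the colon
        out="$file.bench.xz"      # scratch output, removed after each run
        start=$(date +%s)
        xz -T "$threads" --block-size="$bsize" -c "$file" > "$out" \
            || { rm -f "$out"; return 1; }
        end=$(date +%s)
        bytes=$(wc -c < "$out")
        rm -f "$out"
        echo "threads=$threads block=$bsize bytes=$((bytes)) seconds=$((end - start))"
    done
}
```

Because the function stops at the first failing run, an out-of-memory error from xz shows up immediately instead of being hidden in the results.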

The following uses an IBM AIX server (POWER CPU) as an example to show how efficient the compression is.

The same DB2 export file, about 50 GB in size, was compressed twice with multiple threads:

xz --threads=8 --block-size=32MiB -c ./DEMODB.0.db2inst1.NODE0000.CATN0000.20180718161210.001 > ./20180718161210.xz

xz --threads=16 --block-size=8MiB -c ./DEMODB.0.db2inst1.NODE0000.CATN0000.20180718161210.001 > ./20180718161210.xz

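The 8-thread run produced 1926093900 bytes and the 16-thread run 2003832484 bytes. The original size is 51284066304 bytes, so the compressed files are about 3.756% and 3.907% of the original size, respectively.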
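The percentages follow directly from the byte counts above. A minimal sketch of the calculation, using awk for the floating-point division (the function name `ratio_pct` is made up for illustration):

```shell
# ratio_pct COMPRESSED_BYTES ORIGINAL_BYTES
# Prints the compressed size as a percentage of the original size.
ratio_pct() {
    awk -v c="$1" -v o="$2" 'BEGIN { printf "%.3f\n", c / o * 100 }'
}

ratio_pct 1926093900 51284066304   # 8 threads, 32 MiB blocks: 3.756
ratio_pct 2003832484 51284066304   # 16 threads, 8 MiB blocks: 3.907
```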

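In the example above, the second command uses twice as many threads as the first but, because of memory constraints, a block size only a quarter as large. The CPU usage of the two runs is shown below, with 8 threads on the left and 16 threads on the right.

[Figure: CPU utilization during compression; 8 threads (left), 16 threads (right)]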

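As can be seen, with 8 threads CPU utilization was between 50% and 60% and compressing the 50 GB file took about 30 minutes; with 16 threads CPU utilization was between 70% and 80% and compression took about 19 minutes, cutting the time by more than a third. So by adjusting the thread count and block size you can find the parameters best suited to a given file. Note, however, that the 16-thread run also used smaller blocks and therefore more of them, and every block carries its own header, so its output is slightly larger than that of the 8-thread run.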

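Thread count and block size cannot be increased without limit. It is best to base the thread count on the number of CPU cores; even if you ask for more, xz will not necessarily create that many threads. When the thread count or block size is too large, Linux systems generally cope without trouble, but AIX often reports insufficient memory even though about 90% of memory is still free. The older the xz version, the more likely this is to happen on AIX. For example, after raising the block size: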

bash-4.3# xz --threads=16 --block-size=128MiB -c ./DEMODB.0.db2inst1.NODE0000.CATN0000.20180718161210.001 > ./20180910.xz
xz: ./DEMODB.0.db2inst1.NODE0000.CATN0000.20180718161210.001: There is not enough memory available now.

So xz is better suited to Linux, where memory problems like this are much less likely.

The documentation is available through man, or online at: https://www.systutorials.com/docs/linux/man/1-xz/
