Some knowledge about the Burrows-Wheeler Transform and Lempel-Ziv parsing

In brief, data compression eliminates redundancy in data: it finds repeating patterns and encodes them compactly.

I. Burrows-Wheeler Transform

1. Overview

In 1994, Michael Burrows and David Wheeler published the Burrows-Wheeler Transform (BWT), which is named after them. When reading "Universal lossless data compression algorithm", I was struck by how precisely it describes the algorithm: more than half of its content is about BWT and how to improve and optimize it.

I used to think that the Burrows-Wheeler Transform was a compression algorithm, but after reading a few blog posts I now agree that BWT is better described as a data transformation algorithm, on top of which better compressors can be built. Data transformed by BWT is easier to compress and to search.


After the BWT, identical characters tend to end up next to each other, which makes the data easier to compress and search.

2. Diagram


BWT proceeds in four steps: append an end marker, form the cyclic rotations, sort them into a matrix, and output the last column.

① Take the input string ababc and append an end marker to get ababc$. The marker $ is defined to be smaller than every other character.


② Next, form every cyclic rotation of the string. You can think of ababc$ as a circle and rotate it one position at a time. Then sort the rotations lexicographically to build the matrix M, so that the characters in column F (the first column) are in ascending ASCII order.


③ The last column of the resulting matrix M is the output, column L.
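
Steps ① to ③ can be sketched in a few lines of Python. This is a naive illustration, not an optimized implementation: the function name bwt and its marker parameter are assumptions made for this example, and it assumes the input only contains characters that sort after $ in ASCII (such as letters and digits).

```python
def bwt(text: str, marker: str = "$") -> str:
    """Naive Burrows-Wheeler Transform: build all rotations, sort, take column L."""
    if marker in text:
        raise ValueError("the end marker must not occur in the input")
    s = text + marker                      # step 1: append the end marker
    # Step 2: every cyclic rotation of s; rotation i starts at position i.
    rotations = [s[i:] + s[:i] for i in range(len(s))]
    # Sorting lexicographically gives the matrix M. '$' (ASCII 36) sorts
    # before letters and digits, so it acts as the smallest character here.
    matrix = sorted(rotations)
    # Step 3: column L is the last character of each sorted row.
    return "".join(row[-1] for row in matrix)


print(bwt("ababc"))   # c$baab
```

For ababc the sorted rows are $ababc, ababc$, abc$ab, babc$a, bc$aba and c$abab, so column L reads c$baab. Building every rotation costs quadratic memory, which is fine for a demonstration but not for large inputs.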

II. Lempel-Ziv Parsing

1. Overview

Personally, I find the LZ family of algorithms easier to understand than the BWT.

The Lempel-Ziv algorithm was first introduced by two pioneers, Abraham Lempel and Jacob Ziv, in the paper "A Universal Algorithm for Sequential Data Compression". Like the Burrows-Wheeler algorithm, Lempel-Ziv is named after its inventors.

There are two versions of the Lempel-Ziv algorithm, named after the years they appeared: LZ77 (1977) and LZ78 (1978). Many excellent variants, such as DEFLATE, LZX and LZMA, have been derived from them.


There is an interesting detail here. If you look closely, you will find that LZ77, the earlier algorithm, has more variants than LZ78. Is that because LZ77 has simply been in use for longer? No. It is because LZW, a variant of LZ78 published in 1984, was patented by Sperry, and the patent holder began to pursue software vendors that used the GIF format without a license. After that, the popularity of LZ78 gradually declined. Although the LZW patent issues have since subsided and many LZW variants have emerged, LZW is now commonly used only in GIF compression, and LZ77 remains dominant.

Although there are many variants of the Lempel-Ziv algorithm, they all share a common idea: if a text is not uniformly random, that is, if its letters are not all equally likely to appear, then substrings that have already appeared are more likely to appear again than substrings that have not been seen. For example, everyday language is full of stock phrases such as "hello" and "how are you" (in the original Chinese, "how are you" literally contains the word "hello"). Because the longer phrase contains the string "hello", we can assign "hello" a shorter binary code and use that code in place of "hello" inside "how are you", which shortens the encoding.

This may still sound abstract, so let's walk through an example of LZ78 encoding.

2. LZ78

The LZ78 algorithm works by building a dictionary of substrings that appear in the text.

1. Diagram

The algorithm has two cases:

  1. If the current character does not start any phrase in the dictionary, add the character to the dictionary as a new phrase.
  2. If it does, find the longest dictionary phrase that matches the text starting at the current character. That phrase plus the first character after it is added to the dictionary as a new phrase.
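
The two cases can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the function name lz78_encode, the (index, character) output pairs with index 0 meaning "no earlier phrase", and the handling of a leftover match at the end of the input are all assumptions made for this example.

```python
def lz78_encode(text: str) -> list[tuple[int, str]]:
    """Parse text into LZ78 (index, char) pairs; index 0 is the empty prefix."""
    dictionary: dict[str, int] = {}        # phrase -> 1-based index
    output: list[tuple[int, str]] = []
    phrase = ""                            # longest dictionary match so far
    for ch in text:
        if phrase + ch in dictionary:
            # Case 2: the match can be extended, so keep going.
            phrase += ch
        else:
            # Emit (index of the longest match, first unmatched character)
            # and store the new phrase. Case 1 is this branch with phrase == "".
            output.append((dictionary.get(phrase, 0), ch))
            dictionary[phrase + ch] = len(dictionary) + 1
            phrase = ""
    if phrase:
        # Assumed convention: input ended inside a known phrase, so emit
        # its index with an empty character.
        output.append((dictionary[phrase], ""))
    return output


print(lz78_encode("AABABBBABA"))
# [(0, 'A'), (1, 'B'), (2, 'B'), (0, 'B'), (2, 'A')]
```

The loop reads each character exactly once and only looks at phrases it has already stored, so encoding can start before the whole input is available.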

The most direct way to explain the algorithm is to draw it out step by step.

As an example, suppose we have the string AABABBBABA and compress it with the LZ78 algorithm.

① Start from the left with the shortest phrase that has not appeared yet. Here it is A, which goes into the dictionary.


② Next, consider the rest of the string. A has been seen before, so the longest match is A. Take the character that follows the match, B, and put AB in the dictionary.


③ Consider the rest of the string. A has been seen before, so keep matching with the next character, B. AB is also in the dictionary, so keep going. The next character, B, gives ABB, which is not in the dictionary, so ABB is added to it.


④ Consider the rest of the string. Its first character is B, which matches no phrase in the dictionary, so B is added to the dictionary.

⑤ In the same way, the remaining characters match the longest phrase AB, which goes into the dictionary together with the next character, A, as ABA.

A at index 1 matches no earlier phrase, so it is encoded as 0A. AB at index 2 contains the phrase A, so the index of A can replace the substring A, and AB is encoded as 1B. Similarly, the longest earlier phrase inside ABB at index 3 is AB, so the index of AB replaces that substring and ABB is encoded as 2B. B at index 4 matches no earlier phrase, so its prefix is the empty string Ø and its code is 0B. ABA at index 5 is encoded as 2A.
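
To check that these codes carry enough information, here is a small Python sketch that rebuilds the dictionary phrases from the codes alone. The function names lz78_phrases and lz78_decode, and the representation of a code such as 2B as the pair (2, "B"), are assumptions made for this illustration.

```python
def lz78_phrases(pairs: list[tuple[int, str]]) -> list[str]:
    """Rebuild the dictionary phrases from (index, char) codes, in index order."""
    phrases = [""]                          # index 0 is the empty phrase Ø
    for index, ch in pairs:
        # Each new phrase is an earlier phrase plus one extra character.
        phrases.append(phrases[index] + ch)
    return phrases[1:]                      # drop the empty phrase


def lz78_decode(pairs: list[tuple[int, str]]) -> str:
    """Concatenate the rebuilt phrases to recover the original text."""
    return "".join(lz78_phrases(pairs))


codes = [(0, "A"), (1, "B"), (2, "B"), (0, "B"), (2, "A")]   # 0A 1B 2B 0B 2A
print(lz78_phrases(codes))   # ['A', 'AB', 'ABB', 'B', 'ABA']
print(lz78_decode(codes))    # AABABBBABA
```

The decoder never needs the dictionary to be transmitted: it grows the same phrase list, in the same order, as the encoder did.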

At this point, we have used the dictionary to encode the original string as a simpler sequence of codes. All that remains is to assign bit values to A and B to get the final encoded binary string. Here we assume A=0 and B=1.

  • The LZ78 algorithm builds its dictionary dynamically, traversing the data only once, which means it doesn't have to receive the entire document before starting encoding.

Overview - The Hitchhiker’s Guide to Compression (go-compression.github.io)

Origin blog.csdn.net/qq_21484461/article/details/123415702