Notes on the Burrows-Wheeler Transform and Lempel-Ziv Parsing
In brief, data compression eliminates redundancy in data: it works by finding repeating patterns and encoding them compactly.
I. Burrows-Wheeler Transform
1 Overview
In 1994, Michael Burrows and David Wheeler invented the Burrows-Wheeler Transform (BWT), which is named after them. When reading "Universal lossless data compression algorithm", I was struck by how precisely the article describes the algorithm: more than half of its content is about BWT and how to improve and optimize it.
I used to think of the Burrows-Wheeler Transform as a compression algorithm, but after reading some blogs I now agree that BWT is better described as a data transformation algorithm, and that better compressors can be built on top of it. Data transformed by BWT is easier to compress and search: after the transform, many repeated characters end up next to each other.
2. Diagram
BWT is a process of adding a marker, rotating cyclically, building a sorted matrix, and outputting the result.
① Here we take the input string ababc and append a marker to it to get ababc$. The marker $ is defined to be smaller than all other characters.
② Next, we rotate the marked string cyclically. You can treat ababc$ as a circle and rotate it one position at a time to obtain every rotation. The rotations are then sorted lexicographically, so that the characters in column F (the first column) appear in ascending ASCII order.
③ The sorted rotations form the matrix M, and its last column is the output, column L.
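The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not an optimized implementation: the function name `bwt` and the `sentinel` parameter are my own, and it builds every rotation explicitly, which practical tools avoid by using suffix arrays.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Naive Burrows-Wheeler Transform (illustrative sketch only)."""
    if sentinel in text:
        raise ValueError("the sentinel must not occur in the input")
    # Step 1: append the marker. '$' sorts before letters and digits in ASCII.
    marked = text + sentinel
    # Step 2: build every cyclic rotation, then sort them lexicographically.
    rotations = [marked[i:] + marked[:i] for i in range(len(marked))]
    rotations.sort()
    # Step 3: the last column (L) of the sorted matrix M is the output.
    return "".join(rotation[-1] for rotation in rotations)


print(bwt("ababc"))  # c$baab
```

For ababc the sorted rotations are $ababc, ababc$, abc$ab, babc$a, bc$aba and c$abab, so the last column reads c$baab.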
II. Lempel-Ziv Parsing
1 Overview
Personally, I find the LZ family of algorithms easier to understand than the algorithm above.
The Lempel-Ziv algorithm was first introduced by two pioneers of the field, Abraham Lempel and Jacob Ziv, in the paper "A Universal Algorithm for Sequential Data Compression". Like the Burrows-Wheeler Transform, Lempel-Ziv is named after its inventors.
There are two versions of the Lempel-Ziv algorithm, LZ77 and LZ78, named after the years they were introduced (1977 and 1978). Many excellent variants, such as DEFLATE, LZX and LZMA, have been derived from them.
There is an interesting detail here. If you look closely, you will find that LZ77 has more variants than LZ78. Is that simply because LZ77 came first and has been in use longer? No. It is because LZW, a variant of LZ78 published in 1984, was covered by a patent held by Sperry, and the patent holder began to pursue software vendors for using the GIF format without a license. After that, the popularity of LZ78 gradually declined. Although the LZW patent issues have since subsided and many LZW variants have emerged, LZW is now commonly used only in GIF compression, and LZ77 remains the dominant family.
Although there are many variants of the Lempel-Ziv algorithm, they all share a common idea: if some text is not uniformly random, that is, if not all letters are equally likely to appear, then substrings that have already appeared are more likely to appear again than substrings that have not been seen. For example, everyday language is full of stock phrases such as "hello". When a longer phrase contains the substring "hello", we can replace that occurrence with a shorter binary code that refers back to it, which shortens the encoding.
This may still sound abstract, so let's walk through an example of LZ78 encoding.
2. LZ78
The LZ78 algorithm works by building a dictionary of substrings that appear in the text.
1. Diagram
The algorithm has two cases:
- If the current character does not appear in the dictionary, add it to the dictionary as a new phrase.
- If the current character appears in the dictionary, find the longest dictionary phrase that matches the text starting at the current character. That phrase, extended by the first character after the match, is added to the dictionary as a new phrase.
The most direct way to explain the algorithm is to walk through an example.
As an example: suppose we have the string AABABBBABA and we compress it using the LZ78 algorithm.
① Start from the left with the shortest phrase that has never appeared. Here it is A, which goes into the dictionary.
② Next, consider the remaining string. Since A has been seen before, the longest match is A. Take it together with the next character and put AB into the dictionary.
③ Consider the rest of the string. A has been seen before, so continue matching with the next character, B. The longest match is now AB. The character after it no longer extends the match, so AB is taken together with that character and ABB is entered into the dictionary.
④ Consider the remaining string. The first character is B, which matches no phrase in the dictionary, so B is entered into the dictionary.
⑤ In the same way, for the remaining characters the longest match is AB, and it is entered into the dictionary together with the next character as ABA.
Each dictionary phrase is then encoded as the index of its longest previously seen prefix followed by its last character. The phrase at index 1, A, has no earlier match, so its code is 0A. The phrase at index 2, AB, contains the phrase A, so the index of A can replace the substring A, and AB is encoded as 1B. Similarly, the longest earlier match inside the phrase at index 3, ABB, is AB, so the index of AB replaces that substring and ABB is encoded as 2B. The phrase at index 4, B, matches no earlier phrase (the empty set Ø), so its code is 0B. The phrase at index 5, ABA, is encoded as 2A.
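The dictionary and codes from this walkthrough can be reproduced with a short Python sketch. The function name `lz78_table` and the row layout are my own choices for illustration. The sketch only tabulates phrases that create a new dictionary entry, so a trailing match at the very end of the input (which adds no entry) is not listed.

```python
def lz78_table(text: str) -> list[tuple[int, str, str]]:
    """Return (index, phrase, code) rows for an LZ78 parse (illustrative sketch)."""
    index_of: dict[str, int] = {}  # phrase -> dictionary index, starting at 1
    rows = []
    phrase = ""  # longest match found so far; "" stands for index 0
    for ch in text:
        candidate = phrase + ch
        if candidate in index_of:
            # Still inside a known phrase: keep extending the match.
            phrase = candidate
            continue
        # New phrase: code = index of the matched prefix + the next character.
        code = f"{index_of.get(phrase, 0)}{ch}"
        index = len(index_of) + 1
        index_of[candidate] = index
        rows.append((index, candidate, code))
        phrase = ""
    return rows


for index, phrase, code in lz78_table("AABABBBABA"):
    print(index, phrase, code)
```

For AABABBBABA this prints the five phrases A, AB, ABB, B and ABA with the codes 0A, 1B, 2B, 0B and 2A.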
At this point, we have used the dictionary to encode the original string as a shorter sequence of codes. To get the final encoded binary string, we only need to assign binary values to A and B. Here we assume A=0 and B=1.
- The LZ78 algorithm builds its dictionary dynamically, traversing the data only once, which means it doesn't have to receive the entire document before starting encoding.
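To make the single-pass property concrete, here is a hedged sketch of an encoder written as a Python generator. The names `lz78_stream` and `source` are hypothetical, and emitting an empty character for a match that is cut off by the end of the input is my own convention, not something the article specifies. The small `source` helper records how many characters have been read so we can observe that the first code is emitted before the rest of the input is consumed.

```python
from typing import Iterable, Iterator


def lz78_stream(chars: Iterable[str]) -> Iterator[tuple[int, str]]:
    """Yield (prefix_index, next_char) pairs while reading the input once."""
    index_of: dict[str, int] = {}
    phrase = ""
    for ch in chars:
        candidate = phrase + ch
        if candidate in index_of:
            phrase = candidate
            continue
        # Emit the code as soon as a new phrase is complete.
        yield index_of.get(phrase, 0), ch
        index_of[candidate] = len(index_of) + 1
        phrase = ""
    if phrase:
        # Assumption: a match cut off by end of input is emitted with no character.
        yield index_of[phrase], ""


consumed: list[str] = []


def source(text: str) -> Iterator[str]:
    """Hypothetical input stream that records each character as it is read."""
    for ch in text:
        consumed.append(ch)
        yield ch


encoder = lz78_stream(source("AABABBBABA"))
first = next(encoder)            # the first code is available immediately
consumed_at_first = len(consumed)
rest = list(encoder)             # the remaining codes, produced as input arrives
```

After the first `next` call only one character has been read from the stream, yet the first code is already available, which is the behavior the bullet above describes.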
Overview - The Hitchhiker’s Guide to Compression (go-compression.github.io)