Simple version of file compression based on Huffman tree

Resource download address : https://download.csdn.net/download/sheziqiong/88358849

1. The concept of data compression

Data compression is a technique that reduces the volume of data without losing useful information, so as to save storage space and improve the efficiency of transmission, storage, and processing; put another way, it reorganizes data according to a particular algorithm to reduce redundancy and the space the data occupies.

2. Why compression is needed

1. It reduces the amount of data to be stored, saving storage space.

2. It increases data transmission speed, reduces bandwidth usage, and improves communication efficiency.

3. It offers a degree of protection for data, enhancing its security during transmission.

3. Classification of compression

Lossy compression

Lossy compression exploits human insensitivity to certain frequency components of images or sound, allowing some information to be lost during compression. Although the original data cannot be fully restored, the lost portion has little effect on how the original content is understood, and in exchange the compression ratio is much higher. That is, data reconstructed from the compressed form differs from the original, but it does not change how people understand the information the original data expressed.

Lossless compression

The data in the file is reorganized according to a specific encoding format, and the compressed file can be restored to an exact copy of the source file without affecting its content; for digital images, no image detail is lost.

4. History of ZIP compression

In 1977, two Israelis, Jacob Ziv and Abraham Lempel, published the paper "A Universal Algorithm for Sequential Data Compression". "Universal" means that the algorithm places no restriction on the type of data being compressed. This algorithm laid the foundation of most lossless data compression used today, and in honor of the two scientists it was named LZ77. A year later they proposed a similar algorithm called LZ78. The ZIP algorithm evolved from the idea of LZ77, but ZIP keeps compressing the LZ77-encoded result until it becomes difficult to compress further. Many variants are based on LZ77 and LZ78, almost all named starting with LZ: LZW, LZO, LZMA, LZSS, LZR, LZB, LZH, LZC, LZT, LZMW, LZJ, LZFG, and so on.

The author of ZIP, Phil Katz, is regarded as a tragic legend of the open-source world. Katz was a brilliant programmer who made his name in the DOS era. Network speeds then were very slow: when Katz first went online it was not yet 1990, the WWW did not really exist, and there were no browsers; being online mostly meant typing commands like a system administrator in order to chat or visit forums. Transferring an uncompressed file was painfully slow, so compression mattered enormously in that era. At the time a commercial company sold a compression tool called ARC, which made online life faster but had to be paid for. Katz found this objectionable and wrote PKARC, a compression tool that was not only free but also ARC-compatible, and netizens flocked to it. The ARC company, naturally unhappy, took Katz to court, claiming its intellectual property was infringed. Katz's answer was to design a better algorithm of his own that surpassed ARC: reportedly in about two weeks he produced PKZIP, which was not only free but this time open source, with the code published outright. Because the algorithm was different, no intellectual property was involved, and ZIP became hugely popular; yet Phil Katz never made a penny from it. He remained in poverty, struggled with heavy drinking, and died in a motel in 2000. The hero passed away, but the spirit lives on: open a ZIP file in a hex editor such as UltraEdit today and the first two bytes are the ASCII codes of the two characters "PK".

5. Principle of GZIP compression algorithm

The GZIP compression algorithm works in two stages. The first stage uses an improved LZ77 algorithm to compress repeated phrases in the context; the second stage applies the Huffman coding idea, byte by byte, to the output of the first stage, thereby achieving efficient compression and storage of the data.

5.1 LZ77 compression algorithm

LZ77 is a dictionary-based algorithm that encodes long strings (also called phrases) into short tokens, replacing phrases in the dictionary with small tokens to achieve compression.

LZ77 maintains its dictionary with a lookahead (forward) buffer and a sliding window. It first loads part of the data into the lookahead buffer; once a phrase has passed through the lookahead buffer, it moves into the sliding window and becomes part of the dictionary. As the window slides, the contents of the dictionary are continually updated; in other words, the algorithm updates the dictionary and performs compression at the same time.

Assume the dictionary has been established. Each time a character is read, scan back through the dictionary for a repeated phrase. If one is found, it is encoded as a phrase token with three parts: the offset within the sliding window (from its start to the character where the match begins), the number of symbols matched, and the first symbol in the lookahead buffer after the match ends, i.e. (offset, length, nextChar). When no match is found, the unmatched symbol is encoded as a symbol token, which contains only the symbol itself, uncompressed. Once n symbols have been encoded and their tokens generated, those n symbols are removed from one end of the sliding window and replaced by the same number of symbols from the lookahead buffer; the lookahead buffer is then refilled.

The following figure shows the compression process of LZ77, assuming that the sliding window size is 8 bytes and the forward buffer size is 4 bytes:

During decompression the tokens are decoded back into characters, which are copied into a sliding window. Whenever a phrase token is encountered, the decoder looks up the corresponding offset in the sliding window and copies the phrase of the specified length found there. Whenever a symbol token is encountered, it emits the symbol stored in the token, as the sketch below illustrates.
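To make the token handling concrete, here is a minimal decoding sketch, assuming tokens are stored as (offset, length, nextChar) triples with length == 0 standing in for a pure symbol token, and with the offset counted back from the end of the output; these layout choices are illustrative, not the format of any particular tool:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Illustrative token layout: a phrase token carries (offset, length, nextChar);
// a pure symbol token is modeled here as length == 0 with the symbol in nextChar.
struct Token {
    std::size_t offset;  // distance back from the end of the decoded output
    std::size_t length;  // number of matched symbols (0 for a symbol token)
    char nextChar;       // first unmatched symbol after the match
};

std::string lz77Decode(const std::vector<Token>& tokens) {
    std::string out;
    for (const Token& t : tokens) {
        // Copy `length` symbols starting `offset` back in the sliding window.
        // Copy byte by byte so overlapping matches (offset < length) stay correct.
        std::size_t start = out.size() - t.offset;
        for (std::size_t i = 0; i < t.length; ++i)
            out.push_back(out[start + i]);
        out.push_back(t.nextChar);  // emit the explicit trailing symbol
    }
    return out;
}
```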

5.2 LZ77 ideas in GZIP

The GZIP algorithm also uses the LZ77 idea, but with improvements, mainly to the phrase token: a match is represented by a "length + distance" pair alone, and the search for a match is performed in the search buffer, which acts as the dictionary.

Note: the search buffer holds the data that has already been scanned and entered into the dictionary; the lookahead buffer holds the data that is still to be compressed.

The first byte in the current lookahead buffer is the character "r". The task is to find, in the search buffer, the longest match for a string of consecutive characters in the lookahead buffer beginning with "r". For example, "re" appears in the green part, and so does "re " (with a trailing space); the latter is one character longer, so the latter is chosen.

There are four issues to deal with now:

1. How many characters should the string starting from "r" in the lookahead buffer consist of? Is there a rule, or does any length count as long as a match can be found?

2. How to efficiently find matching strings in the search buffer?

3. How to find the longest match?

4. What should I do if a match cannot be found?

5.2.1 How many characters does the matching string consist of?

1. If a length-distance pair were used to replace a single character, it would be better not to replace it at all.

2. If two consecutive characters are replaced by a length-distance pair, the replacement gains nothing: two numbers are needed either way (characters are themselves just numbers).

3. Hence the prerequisite for replacement: the string must consist of at least 3 consecutive characters in the lookahead buffer and must start from the lookahead buffer's first byte.

Therefore the string in the lookahead buffer must consist of at least 3 consecutive characters. That is, only when a string of at least 3 consecutive characters in the lookahead buffer finds a match in the search buffer can it be replaced with a "length + distance" pair, as sketched below.
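In code, this rule is just a constant and a comparison. MIN_MATCH mirrors the constant of the same name in the gzip source; the helper around it is hypothetical:

```cpp
constexpr int MIN_MATCH = 3;   // shortest run worth replacing, as argued above

// Hypothetical decision helper: 1- and 2-character matches cost as much to
// encode as the (length, distance) pair itself, so they stay as literals.
bool worthReplacing(int matchLength) {
    return matchLength >= MIN_MATCH;
}
```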

5.2.2 How to efficiently find matching strings

A string becomes eligible to search for a match in the search buffer only once it consists of at least three consecutive characters. To speed up the lookup, compression uses a dictionary implemented as a hash table. The "dictionary" is one contiguous block of memory holding 64K entries, divided into two halves of WSIZE entries each, as shown in the figure below.

The pointer head = prev + WSIZE, where prev points to the start of the dictionary's memory. Since the memory is contiguous, prev and head can be treated as two arrays, prev[] and head[].

Compression computes a hash value from three consecutive characters, and that hash value is an index into the head[] array. The hash depends on which three characters they are and on their order. There are 32768 possible hash values, in the range [0, 32767], far fewer than the number of possible 3-byte strings, so a hash value cannot correspond to a single string; collisions are inevitable. The prev[] array exists to resolve those collisions, and it also helps find the longest matching string.

1. Dictionary construction

Building the dictionary means inserting strings into it. Compression proceeds byte by byte. Each time a byte is processed, the search buffer grows by one byte (once it has reached 32KB, it instead slides one byte toward the lookahead buffer), the lookahead buffer shrinks by one byte, and the first character of the lookahead buffer changes. Every time this first byte changes, a hash value ins_h is computed from the string formed by the first three consecutive bytes of the current lookahead buffer. ins_h is used as an index into head[], and head[ins_h] records the position at which that string occurs; a string's position is the position within the current window of its first character (that is, of the first byte of the lookahead buffer). This is the process of inserting a string into the dictionary; the hash computation is sketched below.
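A minimal sketch of the hash computation, with constant names (HASH_BITS, H_SHIFT, HASH_MASK) following the gzip source; treat it as an illustration of the idea rather than a verbatim excerpt:

```cpp
#include <cstdint>

// Constants mirroring the gzip source: HASH_BITS = 15 gives 32768 buckets,
// matching the 0..32767 hash range described above.
constexpr unsigned HASH_BITS = 15;
constexpr unsigned HASH_SIZE = 1u << HASH_BITS;   // 32768
constexpr unsigned HASH_MASK = HASH_SIZE - 1;
constexpr unsigned H_SHIFT   = 5;                 // (HASH_BITS + 3 - 1) / 3

// Rolling hash over three consecutive bytes: feeding the next byte into the
// previous hash yields the hash of the new 3-byte string, so each new first
// byte of the lookahead buffer costs one update rather than three.
inline unsigned updateHash(unsigned h, std::uint8_t c) {
    return ((h << H_SHIFT) ^ c) & HASH_MASK;
}
```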

2. Find the matching string

Note: inserting a string into the dictionary actually means inserting the string's starting position. A hash hit is only a preliminary match: it locates a position that agrees on the hash, and that position is then used to look for a longer match, which is the real goal.

5.2.3 How to find the longest match

1. What to do if there is a dictionary conflict during insertion?

Insertion uses head[ins_h] to record the position at which the current string occurs, ins_h being that string's hash value and the index into head[]. But what if head[ins_h] is not empty when the current string's position is about to be written into it? For example:

Assume strstart is the position in the window of the first character of the lookahead buffer, and the array element prev[strstart] holds the position of the previous string with the same hash. Insertion then proceeds as follows:

Note: strstart increases steadily, so each assignment to prev[strstart] is guaranteed not to overwrite old content; prev[strstart] is always brand new before each assignment.

This forms a "chain": strings that share a hash value but sit at different positions are linked together through the prev[] array's indices and element values.

Note: before compression, the head[] array is initialized to 0, while the prev[] array is not; prev[] is initialized dynamically. head[x] and prev[x] therefore both use 0 to mean "empty", i.e. no matching string. But the values themselves denote string positions, and the string starting at character 0 of the window also has position 0, creating an ambiguity. The fix is simple and crude: the first string in the window simply never participates in matching.

2. What should I do if the first character position in the lookahead buffer is greater than 32K?

As compression proceeds byte by byte, strstart, the position of the first character of the lookahead buffer, keeps advancing and will inevitably exceed 32K (the sliding window is 64K in size). Once strstart exceeds 32K, using it directly as an index into prev[] would be wrong. The source code's solution is to use the bitwise AND of strstart and WMASK as the index of prev[], i.e. prev[strstart & WMASK] instead of prev[strstart], where WMASK = WSIZE - 1 = 32767. This ensures that over the whole range of strstart, the prev[] index stays in range and corresponds to strstart, as the sketch below shows.
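Combining the hash with the head[]/prev[] bookkeeping, the insertion described above might look like the following sketch (reusing updateHash from the earlier sketch; WMASK = WSIZE - 1 as in the gzip source):

```cpp
constexpr unsigned WSIZE = 32768;
constexpr unsigned WMASK = WSIZE - 1;   // 32767: keeps prev[] indices in range

// Sketch of inserting the string at position strstart into the dictionary.
// 0 doubles as the "empty" marker, which is why position 0 of the window
// is never allowed to take part in matching.
unsigned insertString(unsigned ins_h, unsigned strstart,
                      unsigned short head[], unsigned short prev[],
                      std::uint8_t thirdByte) {
    ins_h = updateHash(ins_h, thirdByte);    // hash of the new 3-byte string
    prev[strstart & WMASK] = head[ins_h];    // chain: remember the previous
                                             // position that had this hash
    head[ins_h] = (unsigned short)strstart;  // newest position becomes the head
    return ins_h;
}
```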

3. What if the matching chain forms a loop? When strstart exceeds 32K, the matching chain can loop back on itself, producing an endless traversal.

The source code solves this with max_chain_length (a maximum chain length of 256), the maximum number of nodes to visit when walking the matching chain. The smaller max_chain_length is, the fewer matching nodes are examined and the faster compression runs, but the compression ratio may be lower, as in the sketch below.
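An illustrative skeleton of longest_match, reusing WMASK from the sketch above; the bounds checks and the MAX_DIST test of the real gzip routine are omitted for brevity:

```cpp
#include <cstdint>

constexpr int MAX_MATCH = 258;   // longest match length that will be encoded

// Walk the prev[] chain from the most recent position with the same hash,
// visit at most max_chain_length nodes, and keep the longest extension found.
int longestMatch(const std::uint8_t* window, unsigned strstart,
                 unsigned short cur_match, const unsigned short prev[],
                 unsigned max_chain_length, unsigned* match_start) {
    int best_len = 0;
    while (cur_match != 0 && max_chain_length-- > 0) {
        int len = 0;                 // compare candidate and lookahead bytes
        while (len < MAX_MATCH &&
               window[cur_match + len] == window[strstart + len])
            ++len;
        if (len > best_len) {        // remember the longest match so far
            best_len = len;
            *match_start = cur_match;
        }
        cur_match = prev[cur_match & WMASK];   // follow the chain backwards
    }
    return best_len;
}
```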

4. Lazy matching

The source code uses two mechanisms to find the longest match: lazy matching and longest_match. longest_match finds the longest match of the current string, while lazy matching compares the longest matches found at successive positions to decide which to keep.

So-called lazy matching means that although the current string has found the match that suits it best, the algorithm is in no hurry to replace it with a length-distance pair. Instead it records the match length and match position in two variables, prev_length and prev_match, and then moves strstart one character toward the lookahead buffer. At the new strstart the current string has changed, and the best match for this new string is sought in turn; suppose one is found, giving a new match position and length. This is where lazy matching earns its keep: the new match length is compared with prev_length. If the new length is greater, it replaces prev_length, prev_match is updated with the new match position, and the character at position strstart - 1 is treated as an unmatched literal; strstart then moves one more character toward the lookahead buffer and the same processing continues. That is the lazy step. If instead the previous match length prev_length is at least as large as the new one, lazy matching stops: the previous match's length and position form a length-distance pair and the replacement is performed. After lazy matching stops, prev_length and prev_match are reinitialized and strstart moves to the position just past the replaced string, ready for a new round of lazy matching. Note that the strings formed by the characters inside the replaced span must still be inserted into the dictionary. A condensed sketch follows.
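A condensed, illustrative sketch of that loop. The matcher and emitters are passed in as callables so that only the prev_length/prev_match bookkeeping is shown; the real deflate loop also refills buffers, re-inserts skipped strings into the dictionary, and flushes a final pending item:

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>

void lazyDeflate(std::size_t inputLen, const std::uint8_t* window,
                 std::function<int(std::size_t, std::size_t&)> longestMatchAt,
                 std::function<void(std::uint8_t)> emitLiteral,
                 std::function<void(int, std::size_t)> emitPair) {
    const int MIN_MATCH = 3;
    int prev_length = 0;          // best match length found at strstart - 1
    std::size_t prev_match = 0;   // where that previous match starts
    std::size_t strstart = 1;     // position 0 never takes part in matching

    while (strstart < inputLen) {
        std::size_t match_start = 0;
        int match_length = longestMatchAt(strstart, match_start);

        if (prev_length >= MIN_MATCH && prev_length >= match_length) {
            // The previous position matched at least as well: emit it as a
            // (length, distance) pair and jump just past the replaced string.
            emitPair(prev_length, (strstart - 1) - prev_match);
            strstart += prev_length - 1;
            prev_length = 0;
        } else {
            // The new match is strictly better (or there is none yet): demote
            // the byte at strstart - 1 to a literal and defer the decision.
            emitLiteral(window[strstart - 1]);
            prev_length = match_length;
            prev_match = match_start;
            ++strstart;
        }
    }
}
```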

5. longest_match

longest_match traverses the matching chain and finds the longest matching substring on that chain. Of course, the longest substring is not necessarily optimal; it must be combined with lazy matching to find the best one.

Notice:

1. The traversal of matching nodes is not unbounded: there is a limit on the maximum number of chain nodes visited.

2. The match length is not unbounded: the maximum match length is MAX_MATCH (258).

3. The match distance is not unbounded: it cannot exceed MAX_DIST.

4. The longest match is found by longest_match and lazy matching working together: the former finds the longest match at the current strstart, and the latter picks, among the longest matches at several consecutive strstart positions, the one with the greatest length as the final, true longest match.

5.2.4 What should I do if the longest match cannot be found?

How are length-distance pairs distinguished from source characters in the compressed output? For example:

Note: distance pairs are not stored in any expanded, self-describing form in the compressed data. Hence the question: here 23 is a source character rather than a distance pair, while (12, 16) is a distance pair; how can the two be told apart?

Solution: mark each item with 1 bit, for example 1 for a distance pair and 0 for a source character.

5.2.5 Decompression

During decompression, the data is parsed byte by byte:

1. If the corresponding flag bit is 0, the current byte is a source byte and is written directly to the file.

2. If the corresponding flag bit is 1, it marks a distance pair: read out the pair, then reproduce the matched string from the pair's starting position and length (a helper for the flag test follows).
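A minimal helper for the flag test, assuming flags are packed eight per byte and consumed from the highest bit downward (the project's exact packing is an assumption here):

```cpp
#include <cstddef>
#include <cstdint>

// One flag bit per decoded item, packed MSB-first: returns true when the
// item at `index` is a (length, distance) pair, false for a literal byte.
bool isPairFlag(const std::uint8_t* flags, std::size_t index) {
    return (flags[index / 8] >> (7 - index % 8)) & 1;
}
```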

5.3 The Huffman idea in GZIP

After the LZ77-style pass has compressed the repeated phrases in the source data, repetition at the phrase level has been dealt with, but that does not mean the result is optimal; there may still be plenty of repetition at the byte level. For example: "BCDCDDBDDCADCBDC"

One byte occupies 8 bits. If a code shorter than 8 bits can be found for each byte, and the corresponding bytes of the source file are rewritten with those codes, the file becomes smaller still. So how are such codes found?

5.3.1 Static equal-length encoding

The code length is the same for every character, for example: A = 00, B = 01, C = 10, D = 11.

Compressing the source data above with this equal-length code gives: 01101110111101111110001110011110. The compressed result occupies only 4 bytes (16 characters × 2 bits), a reasonably good compression ratio.

5.3.2 Dynamic unequal length encoding

Each character's code is chosen according to the character's own situation (its frequency), for example: D = 0, C = 11, B = 101, A = 100.

Compressing the source data with this unequal-length code gives: 10111011001010011100011101011. After compression the last byte is not fully used; 3 bits remain. Clearly, dynamic unequal-length coding achieves a better compression ratio than equal-length coding (29 bits versus 32). So how is a dynamic unequal-length code obtained?

5.3.3 Huffman coding

1. The Huffman tree

Take the path length from the root of a binary tree to each leaf, multiply it by that leaf's weight, and sum over all leaves: the result is the tree's weighted path length, WPL.

The weighted path lengths of the above four trees are:

WPLa = 1×2 + 3×2 + 5×2 + 7×2 = 32

WPLb = 1×2 + 3×3 + 5×3 + 7×1 = 33

WPLc = 7×3 + 5×3 + 3×2 + 1×1 = 43

WPLd = 1×3 + 3×3 + 5×2 + 7×1 = 29

The binary tree with the smallest weighted path length is called a Huffman tree.

2. Constructing a Huffman tree

1. From the given n weights {w1, w2, w3, …, wn}, construct a forest of n binary trees F = {T1, T2, T3, …, Tn}, where each binary tree Ti has only a root node with weight wi, its left and right children being empty.

2. Repeat the following steps until only one tree remains in F:

Select the two binary trees in F whose root weights are smallest and make them the left and right subtrees of a new binary tree, whose root weight is the sum of the root weights of its two subtrees. Delete those two binary trees from F and add the new binary tree to F, as in the sketch below.
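A direct rendering of these two steps in C++, using a min-heap as the forest F; the Node layout and names are illustrative:

```cpp
#include <cstddef>
#include <queue>
#include <utility>
#include <vector>

// Node of the Huffman tree; `ch` is meaningful only at leaves.
struct Node {
    std::size_t weight;
    unsigned char ch;
    Node* left = nullptr;
    Node* right = nullptr;
};

// Orders the min-heap by weight so the two smallest trees surface first.
struct HeavierFirst {
    bool operator()(const Node* a, const Node* b) const {
        return a->weight > b->weight;
    }
};

// Builds the tree exactly as the two steps above describe. Ownership and
// cleanup are omitted to keep the sketch short.
Node* buildHuffmanTree(const std::vector<std::pair<unsigned char, std::size_t>>& freqs) {
    std::priority_queue<Node*, std::vector<Node*>, HeavierFirst> forest;  // F
    for (auto [c, w] : freqs)
        forest.push(new Node{w, c});              // n single-node trees
    while (forest.size() > 1) {
        Node* a = forest.top(); forest.pop();     // two smallest root weights
        Node* b = forest.top(); forest.pop();
        forest.push(new Node{a->weight + b->weight, 0, a, b});  // merged tree
    }
    return forest.empty() ? nullptr : forest.top();
}
```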

3. Obtaining Huffman codes

Question: what if, among unequal-length codes, one code is the prefix of another? Can this happen? With Huffman codes it cannot: every character sits at a leaf of the Huffman tree, so the path to one character's node can never pass through another's, and no code is a prefix of another.

5.3.4 Using Huffman coding to compress the source file

1. Count the number of occurrences of each character in the source file.

2. Build a Huffman tree using each character's occurrence count as its weight.

3. Obtain each character's Huffman code from the Huffman tree.

4. Read the source file, rewrite each of its characters with the obtained Huffman code, and write the results into the compressed file until the end of the file, as sketched below.
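Step 3 is a preorder walk of the tree, appending 0 for a left branch and 1 for a right branch until a leaf is reached; a sketch building on the Node type above:

```cpp
#include <array>
#include <string>

// Preorder walk: the accumulated 0/1 path at each leaf is that character's
// Huffman code. (A file containing a single distinct character would need a
// special case, ignored here.)
void collectCodes(const Node* n, std::string& path,
                  std::array<std::string, 256>& codes) {
    if (!n) return;
    if (!n->left && !n->right) {   // leaf: path is the finished code
        codes[n->ch] = path;
        return;
    }
    path.push_back('0'); collectCodes(n->left,  path, codes); path.pop_back();
    path.push_back('1'); collectCodes(n->right, path, codes); path.pop_back();
}
```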

5.3.5 Compressed file format

Is it possible to save only the compressed data in a compressed file?

The answer is no, because then there would be no way to decompress. For example, 10111011001010011100011101011 on its own cannot be decoded. So besides the compressed data, the compressed file must also store the information needed for decompression:

1. The suffix (extension) of the source file

2. The total number of lines of (character, count) pairs

3. Each character and its occurrence count (for simplicity, one character per line)

4. The compressed data (see the example layout below)
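One plausible on-disk layout consistent with this list, using the character counts of the earlier example "BCDCDDBDDCADCBDC"; the separators and ordering here are assumptions, not the project's exact format:

```text
txt                       <- suffix of the source file
4                         <- total number of (character, count) lines
A,1
B,3
C,5
D,7
<compressed bit stream>   <- the Huffman-coded data
```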

5.3.6 Decompression

1. Read the source file's suffix from the compressed file

2. Read the total number of (character, count) lines from the compressed file

3. Read each character and its occurrence count

4. Rebuild the Huffman tree

5. Decompress

Read the compressed data from the compressed file one byte at a time. For each byte obtained, walk the Huffman tree from the root according to the byte's bits: if a bit is 0, move to the current node's left child; otherwise move to the right child. When a leaf node is reached, one character has been successfully decoded. Continue this process until all the data has been parsed, as in the sketch below.
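A sketch of that loop, reusing the Node type from the construction sketch; how the decoder knows when to stop (a stored bit count or character count) is abstracted into the bitCount parameter:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Walk the Huffman tree bit by bit: 0 goes left, 1 goes right; reaching a
// leaf emits one character and restarts from the root.
std::string huffmanDecode(const Node* root, const std::uint8_t* data,
                          std::size_t bitCount) {
    std::string out;
    const Node* cur = root;
    for (std::size_t i = 0; i < bitCount; ++i) {
        int bit = (data[i / 8] >> (7 - i % 8)) & 1;   // MSB-first bit order
        cur = bit ? cur->right : cur->left;
        if (!cur->left && !cur->right) {              // leaf: one character done
            out.push_back(static_cast<char>(cur->ch));
            cur = root;
        }
    }
    return out;
}
```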
