Huffman Codec

Foreword: It's time for the school's annual data structure course project again. I can't keep up with my schoolmates' enthusiastic "inquiries" about what I did for mine, so I'm posting my previous data structure course project here as a summary.

Huffman Codec

The topic I chose for my course project was a Huffman codec. It is similar to the compression software we use every day: it can compress a large file into a smaller one.

Design goals

The main purpose of the data structure course project is to help students further master the methods and steps of application system design. Through system analysis, system design, programming and debugging, and writing experimental reports, students learn to use typical data structures flexibly and understand their application in software development more deeply, which further improves their ability to analyze and solve problems and raises their level of programming.

Design content

Huffman codec

Using Huffman coding for information communication can greatly improve channel utilization, shorten transmission time, and reduce transmission costs. Implementing a Huffman encoder/decoder requires an encoding system that pre-encodes the data to be transmitted at the sending end and decodes (recovers) the transmitted data at the receiving end.
Specific features include:

  1. Build a Huffman tree: Read in a file (xxx.source), count the frequency of each character in the file, and use these frequencies as weights to build a Huffman tree.
  2. Encoding: Use the Huffman tree to obtain the Huffman code of each character, encode the text, then output the encoding result and store it in a file (xxx.code).
  3. Decoding: Use the Huffman tree to decode the code in the file (xxx.code), output the decoding result, and store it in a file (xxx.decode).
  4. Compress and decompress files using bit manipulation. (optional)
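
The first two features can be sketched roughly as follows. This is a minimal illustration rather than the project's actual code: the class name `HuffmanSketch`, the `Node` type, and the method names are all made up for this example, and it works on bytes instead of characters.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical sketch: every name here is illustrative, not taken from the project.
final class HuffmanSketch {
    static final class Node {
        final Byte value;      // null for internal nodes
        final long weight;     // number of occurrences below this node
        final Node left;
        final Node right;

        Node(Byte value, long weight, Node left, Node right) {
            this.value = value;
            this.weight = weight;
            this.left = left;
            this.right = right;
        }

        boolean isLeaf() {
            return left == null && right == null;
        }
    }

    // Count how often each byte value occurs.
    static Map<Byte, Long> countFrequencies(byte[] data) {
        Map<Byte, Long> freq = new HashMap<>();
        for (byte b : data) {
            freq.merge(b, 1L, Long::sum);
        }
        return freq;
    }

    // Repeatedly merge the two lightest nodes until one root remains.
    static Node buildTree(Map<Byte, Long> freq) {
        if (freq.isEmpty()) {
            throw new IllegalArgumentException("no symbols to encode");
        }
        PriorityQueue<Node> queue =
                new PriorityQueue<Node>((a, b) -> Long.compare(a.weight, b.weight));
        for (Map.Entry<Byte, Long> e : freq.entrySet()) {
            queue.add(new Node(e.getKey(), e.getValue(), null, null));
        }
        while (queue.size() > 1) {
            Node first = queue.poll();
            Node second = queue.poll();
            queue.add(new Node(null, first.weight + second.weight, first, second));
        }
        return queue.poll();
    }

    // Walk the tree: going left appends '0', going right appends '1'.
    static Map<Byte, String> buildCodes(Node root) {
        Map<Byte, String> codes = new HashMap<>();
        if (root.isLeaf()) {
            codes.put(root.value, "0"); // a single distinct symbol still needs one bit
            return codes;
        }
        collect(root, "", codes);
        return codes;
    }

    private static void collect(Node node, String prefix, Map<Byte, String> codes) {
        if (node.isLeaf()) {
            codes.put(node.value, prefix);
            return;
        }
        collect(node.left, prefix + "0", codes);
        collect(node.right, prefix + "1", codes);
    }
}
```

For the input "aaaabbc", this sketch gives 'a' a 1-bit code and gives 'b' and 'c' 2-bit codes, because the most frequent symbol ends up closest to the root.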

Outline design

Compression and decompression interface:
This is a simple graphical interface written in JavaFX. It lets the user choose whether to compress or decompress, and then compresses or decompresses the selected file.

Compression function

The compression function builds a Huffman tree from the elements in the file, weighted by the number of times each element appears, then traverses the tree to form the Huffman codes, and then packs the resulting Huffman code bitwise (every 8 bits form one byte). Finally, it writes the bit-packed code, the elements of the Huffman tree, the weights corresponding to those elements, and the name and format of the file into the compressed file, ready for the next decompression.

Decompression function

The decompression process mirrors compression. It reads the compressed file and the information stored in it during compression. It first restores the Huffman code string from the byte array that was read, then rebuilds the Huffman tree from the stored tree information, traverses the tree to regenerate the set of Huffman codes, walks the Huffman code string to recover the character array, and restores the original byte array from the character array. Finally, writing out that byte array recreates the file as it was before compression.
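
Both functions depend on the compressed file carrying enough metadata to rebuild the tree. The post does not show the project's real file layout, so the sketch below assumes one: file name, then (symbol, weight) pairs, then the number of valid bits, then the packed bytes. The class `ContainerSketch` and its field order are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical container layout; the project's real file format may differ.
final class ContainerSketch {
    static final class Parsed {
        String fileName;
        Map<Byte, Long> weights = new LinkedHashMap<>();
        long bitLength;   // number of meaningful bits in 'packed'
        byte[] packed;
    }

    // Serialize the metadata followed by the bit-packed Huffman code.
    static byte[] write(String fileName, Map<Byte, Long> weights,
                        long bitLength, byte[] packed) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(fileName);
            out.writeInt(weights.size());
            for (Map.Entry<Byte, Long> e : weights.entrySet()) {
                out.writeByte(e.getKey());
                out.writeLong(e.getValue());
            }
            out.writeLong(bitLength);
            out.writeInt(packed.length);
            out.write(packed);
        }
        return buffer.toByteArray();
    }

    // Read the fields back in exactly the order they were written.
    static Parsed read(byte[] container) throws IOException {
        Parsed parsed = new Parsed();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(container))) {
            parsed.fileName = in.readUTF();
            int symbols = in.readInt();
            for (int i = 0; i < symbols; i++) {
                byte value = in.readByte();
                parsed.weights.put(value, in.readLong());
            }
            parsed.bitLength = in.readLong();
            parsed.packed = new byte[in.readInt()];
            in.readFully(parsed.packed);
        }
        return parsed;
    }
}
```

Storing the bit length explicitly is a design choice of this sketch: it lets the reader ignore the padding bits in the last byte instead of guessing how many there are.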

Functional block diagram

[Images: functional block diagrams (not included)]

Detailed functional description of each module

Compression

  1. Read the file to be compressed: read the selected file as a byte array, convert the data into a character array, and return the character array.
  2. Count character occurrences: count the number of occurrences of each character in the character array read from the file and store the counts in a Map.
  3. Create the Huffman tree: build a Huffman tree from the character counts.
  4. Traverse the Huffman tree: once the tree is built, traverse it to obtain the code of each leaf node and store these codes in a collection.
  5. Form the Huffman code: walk the character array, look up the leaf code of each character, concatenate the results, and return the code string.
  6. Compress the Huffman code: pack the Huffman code string bitwise into a byte array and return the byte array.
  7. Save the compressed Huffman code to a file: store the returned byte array in the compressed file.

Decompression

  1. Read the compressed file: read the file to be decompressed and return the Huffman tree information, the Huffman-encoded byte array, and the original file name and format.
  2. Restore the Huffman code: convert the Huffman-encoded byte array back into a string, that is, restore the bytes to a binary string.
  3. Recreate the Huffman tree: rebuild the Huffman tree from the leaf and weight information read from the file.
  4. Traverse the Huffman tree: traverse the tree to get the code of each leaf, walk the Huffman code string to recover the original character array, then convert the character array into a byte array and return it.
  5. Restore the original file data: using the returned byte array and the original file name and format read earlier, recreate the file and write the data into it to restore the original file.
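
The tree-walking part of decompression can be sketched as below. Note the swap: the project rebuilds the tree from the stored weights, while this simplified sketch rebuilds a decoding tree directly from a code table, which avoids having to reproduce the exact tie-breaking of the original build. `TreeDecodeSketch` and its members are hypothetical names.

```java
import java.io.ByteArrayOutputStream;
import java.util.Map;

// Hypothetical sketch: decodes by walking a tree rebuilt from a code table.
final class TreeDecodeSketch {
    static final class Node {
        Node zero;    // child reached by a '0' bit
        Node one;     // child reached by a '1' bit
        Byte value;   // set only on leaves
    }

    // Insert every code into a binary tree, e.g. {a=1, b=01, c=00}.
    static Node treeFromCodes(Map<Byte, String> codes) {
        Node root = new Node();
        for (Map.Entry<Byte, String> e : codes.entrySet()) {
            Node node = root;
            for (char bit : e.getValue().toCharArray()) {
                if (bit == '0') {
                    if (node.zero == null) node.zero = new Node();
                    node = node.zero;
                } else {
                    if (node.one == null) node.one = new Node();
                    node = node.one;
                }
            }
            node.value = e.getKey();
        }
        return root;
    }

    // Follow one edge per bit; emit a byte and restart at the root on each leaf.
    static byte[] decode(String bits, Node root) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Node node = root;
        for (int i = 0; i < bits.length(); i++) {
            node = bits.charAt(i) == '0' ? node.zero : node.one;
            if (node == null) {
                throw new IllegalArgumentException("invalid bit sequence at index " + i);
            }
            if (node.value != null) {
                out.write(node.value);
                node = root;
            }
        }
        if (node != root) {
            throw new IllegalArgumentException("trailing bits do not form a complete code");
        }
        return out.toByteArray();
    }
}
```

With the code table {a=1, b=01, c=00}, the bit string "1101001" decodes to "aabca". Walking the tree costs one step per bit, whereas looking up growing substrings in a map creates a new string for every candidate prefix.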

Detailed design

Function call relationship

[Image: function call relationship diagram (not included)]

Data flow chart of each function

[Images: data flow charts of each function (not included)]

Key design and coding

Bitwise compression of Huffman codes

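public static byte[] codeZip(String code) {
    // One output byte per 8 bits of code, rounded up.
    int len;
    if (code.length() % 8 == 0) {
        len = code.length() / 8;
    } else {
        len = code.length() / 8 + 1;
    }
    byte[] bytes = new byte[len];

    int index = 0;
    for (int i = 0; i < code.length(); i += 8) {
        // Take the next 8 bits, or whatever remains for the final chunk.
        String strByte;
        if (i + 8 > code.length()) {
            strByte = code.substring(i);
        } else {
            strByte = code.substring(i, i + 8);
        }
        // Integer.parseInt(strByte, 2) parses the binary string strByte into its decimal value,
        // e.g. "1010" becomes 10.
        // Note: a final chunk shorter than 8 bits is stored as its numeric value,
        // so its leading zeros are not preserved and the decoder has to restore them.
        bytes[index] = (byte) Integer.parseInt(strByte, 2);
        index++;
    }
    return bytes;
}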
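
As a possible alternative to storing the final chunk as a number, the sketch below packs the bits most significant bit first and relies on the caller remembering the bit length, so leading zeros in the last byte survive a round trip. `BitPackingSketch`, `pack`, and `unpack` are hypothetical names, and this is not the project's implementation.

```java
// Hypothetical sketch: MSB-first bit packing with an explicit bit length.
final class BitPackingSketch {
    // Pack a string of '0'/'1' characters into bytes.
    // The final byte is padded with zero bits on the right.
    static byte[] pack(String bits) {
        byte[] out = new byte[(bits.length() + 7) / 8];
        for (int i = 0; i < bits.length(); i++) {
            if (bits.charAt(i) == '1') {
                out[i / 8] |= 0x80 >>> (i % 8);
            }
        }
        return out;
    }

    // Read back exactly bitLength bits, ignoring any padding in the last byte.
    static String unpack(byte[] packed, int bitLength) {
        StringBuilder sb = new StringBuilder(bitLength);
        for (int i = 0; i < bitLength; i++) {
            int bit = (packed[i / 8] >> (7 - i % 8)) & 1;
            sb.append(bit == 1 ? '1' : '0');
        }
        return sb.toString();
    }
}
```

For example, `pack("0010100011")` yields the two bytes 0x28 and 0xC0, and `unpack` with a bit length of 10 returns the original ten characters, leading zeros included.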

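Restoring the Huffman code to a character array

// Requires java.util.ArrayList, java.util.List and java.util.Map.
// Despite its name, this method decodes: it turns the code string back into characters.
// map is keyed by Huffman code and returns the character for that code.
public static char[] enCode(String code, Map<String, Character> map) {
    List<Character> list = new ArrayList<>();
    for (int i = 0; i < code.length(); ) {
        int count = 1;
        boolean flag = true;
        Character b = null;
        String key = null;
        // Grow the candidate key one bit at a time until it matches a code.
        while (flag) {
            if (i + count <= code.length()) {
                key = code.substring(i, i + count);
            } else {
                // If the end is reached and key still has no match, pad key with a leading zero.
                // Note: if the padded key never matches any code, this loop does not terminate.
                key = "0" + key;
                System.out.println(key);
            }
            b = map.get(key);
            if (b == null) {
                ++count;
            } else {
                flag = false;
            }
        }
        list.add(b);
        i += count;
    }
    // Copy the decoded characters into a primitive array.
    char[] chars = new char[list.size()];
    for (int i = 0; i < chars.length; i++) {
        chars[i] = list.get(i);
    }
    return chars;
}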
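
A stricter variant of the same map-based lookup is sketched below. It assumes the code string contains exactly the valid bits (no padding), so leftover bits are treated as an error instead of being padded with zeros. `MapDecodeSketch` is a hypothetical name and this is only one way the lookup could be written.

```java
import java.util.Map;

// Hypothetical sketch: map-based decoding that fails fast on leftover bits.
final class MapDecodeSketch {
    static String decode(String bits, Map<String, Character> codeToChar) {
        StringBuilder out = new StringBuilder();
        StringBuilder key = new StringBuilder();
        for (int i = 0; i < bits.length(); i++) {
            // Extend the candidate code by one bit and check for a match.
            key.append(bits.charAt(i));
            Character c = codeToChar.get(key.toString());
            if (c != null) {
                out.append(c.charValue());
                key.setLength(0); // start collecting the next code
            }
        }
        if (key.length() != 0) {
            throw new IllegalArgumentException("dangling bits: " + key);
        }
        return out.toString();
    }
}
```

Because Huffman codes are prefix-free, the first match found while extending the key is always the right one, so a single left-to-right pass is enough.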

Test data and running results

Normal test data and running results

Compressing and decompressing a txt file

Before compression, after compression, and after decompression: [images not included]

Console output: [image not included]

The console prints the Huffman tree node information, the Huffman codes, the Huffman code length, the compression ratio, and so on.

Compressing and decompressing an mp4 file

Before compression, after compression, and after decompression: [images not included]

Console output: [image not included]

The console prints the Huffman tree node information, the Huffman codes, the Huffman code length, the compression ratio, and so on.

Abnormal test data and running results

Compressing a large file

[image not included]

Error:

[image not included]

The error occurs because the Java virtual machine runs out of heap memory. I have to grumble at myself here: I wrote this in a hurry and held the whole encoding in a String, so files that are too large cannot be compressed...

Compressing an empty file

[image not included]

Error:

[image not included]

The error occurs because there is no check for empty files (I assumed nobody would compress an empty file... well, fine, somebody will).
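
A guard of the following kind could be run before compression starts. It is only a sketch: `EmptyFileGuardSketch` and `hasContent` are hypothetical names, and what the program should do with an empty file (skip it, warn the user, or write a header-only archive) is left open.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch: check for content before building a Huffman tree.
final class EmptyFileGuardSketch {
    // Returns true only for an existing regular file with at least one byte.
    static boolean hasContent(Path file) throws IOException {
        return Files.isRegularFile(file) && Files.size(file) > 0;
    }
}
```

Calling `hasContent` first means the tree-building code never sees an empty frequency table.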

Debugging situation, design skills and experience

Suggested improvements

The current Huffman codec basically implements the functions required by the experiment, but it still has shortcomings and room for improvement, as follows:

  1. Add a check for empty files, so that the program no longer reports an error when compressing an empty file;
  2. The Java virtual machine's out-of-memory error when compressing large files can be solved with IO streams and a size limit. Each time 10 MB of data has been read, that data is processed while the file is still being read, the processed data is written to the compressed file, and reading then continues with the next 10 MB in the same way, that is, segmented reading and writing. This greatly reduces the virtual machine's memory usage and can speed up its data processing.
  3. The JavaFX interface still needs polishing. A progress bar could show which step of compression is in progress, giving a more intuitive view of compression and decompression. After all, not everyone can see the console output while the software is running.
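
The segmented reading from item 2 might look like the sketch below for the counting pass. It is an illustration under assumptions: `ChunkedCountSketch` is a hypothetical name, the 10 MB constant mirrors the limit suggested above, and only frequency counting is shown.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch: count byte frequencies one chunk at a time.
final class ChunkedCountSketch {
    static final int CHUNK_SIZE = 10 * 1024 * 1024; // 10 MB per read

    // Only one buffer of chunkSize bytes is held in memory at any time.
    static long[] countFrequencies(InputStream in, int chunkSize) throws IOException {
        long[] freq = new long[256];
        byte[] buffer = new byte[chunkSize];
        int read;
        while ((read = in.read(buffer)) != -1) {
            for (int i = 0; i < read; i++) {
                freq[buffer[i] & 0xFF]++; // mask to index 0..255
            }
        }
        return freq;
    }
}
```

A full streaming compressor would need a second pass over the file: the codes are only known after all frequencies have been counted, so the data has to be read again to encode it chunk by chunk.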

Lessons learned

I learned a lot from this course project. I put the Huffman tree knowledge from class into practice and built a program, and I now know how to create a Huffman tree in code and how to turn the tree into Huffman codes once it is created. Before the project, I thought that converting a file into Huffman codes and storing them would already compress it, but when I wrote the program I found that this actually made the file bigger. To really compress, the Huffman code also has to be packed bitwise: every 8 characters of the "01" code string are packed into one byte. In theory, when the data is highly repetitive, the file can shrink by a factor of 7 to 8. And in order to decompress, the Huffman tree information must be recorded in the file so that the tree can be rebuilt.


Finally

The project address is as follows:
Github address: https://github.com/guanchanglong/HuffmanEncoder/tree/master
If you read the code, please consider giving it a star ^_^. Thank you very much.

PS: You can also go to my personal blog to see more content
Personal blog address: Xiaoguan classmate's blog


Origin blog.csdn.net/weixin_45784666/article/details/122141228