Compressing and decompressing binary files with Huffman coding

Basic introduction to Huffman tree

1. Given n weights as n leaf nodes, construct a binary tree. If the weighted path length (WPL) of the tree is the minimum possible, the tree is called an optimal binary tree, also known as a Huffman tree.

2. The Huffman tree is the binary tree with the shortest weighted path length; nodes with larger weights lie closer to the root.

Several important concepts and examples of Huffman tree

1. Path and path length: In a tree, the route from a node down to one of its children or descendants is called a path. The number of branches on the path is called the path length. If the root node is defined to be on layer 1, the path length from the root to a node on layer L is L-1.

2. Node weight and weighted path length: If a node in the tree is assigned a value that carries some meaning, that value is called the weight of the node. The weighted path length of a node is the product of the path length from the root to that node and the node's weight.

3. Weighted path length of the tree: the sum of the weighted path lengths of all leaf nodes, denoted WPL (weighted path length). The WPL is smallest when nodes with larger weights are placed closer to the root.

4. The binary tree with the smallest WPL is the Huffman tree.
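
To make the definition concrete, here is a minimal sketch that computes the WPL of two small hand-built trees holding the same leaves. The class name WplDemo, its nested Node type, and the sample weights 1, 3 and 6 are illustrative assumptions and are not part of the code later in this article.

```java
// Illustrative sketch: WplDemo and its Node type are hypothetical names.
class WplDemo {
    static class Node {
        final int weight;
        final Node left, right;
        Node(int weight) { this(weight, null, null); }
        Node(int weight, Node left, Node right) {
            this.weight = weight;
            this.left = left;
            this.right = right;
        }
        boolean isLeaf() { return left == null && right == null; }
    }

    // Sum of (weight * path length from the root) over all leaves.
    // pathLength is the number of branches between the root and node.
    static int wpl(Node node, int pathLength) {
        if (node == null) return 0;
        if (node.isLeaf()) return node.weight * pathLength;
        return wpl(node.left, pathLength + 1) + wpl(node.right, pathLength + 1);
    }

    public static void main(String[] args) {
        // Leaves 1 and 3 sit two branches below the root, leaf 6 one branch below.
        Node good = new Node(10, new Node(4, new Node(1), new Node(3)), new Node(6));
        // Same leaves, but the heaviest one is pushed away from the root.
        Node bad = new Node(10, new Node(9, new Node(6), new Node(3)), new Node(1));
        System.out.println(wpl(good, 0)); // 1*2 + 3*2 + 6*1 = 14
        System.out.println(wpl(bad, 0));  // 6*2 + 3*2 + 1*1 = 19
    }
}
```

The first tree keeps the heaviest leaf next to the root and therefore has the smaller WPL.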

The basic idea of building a Huffman tree

Given the sequence {13, 7, 8, 3, 29, 6, 1}, build the corresponding Huffman tree.

Steps to form a Huffman tree:

1. Sort the data from smallest to largest. Each value is a node, and each node can be regarded as the simplest possible binary tree.

2. Take out the two binary trees whose root nodes have the smallest weights.

3. Combine them into a new binary tree whose root weight is the sum of the two root weights.

4. Put the new binary tree back into the sequence, sort again by root weight, and repeat steps 1-2-3-4 until all the data in the sequence has been processed. The result is a Huffman tree.

Diagram

{13, 7, 8, 3, 29, 6, 1}  

1. Sort {1, 3, 6, 7, 8, 13, 29}

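2. Take out 1 and 3, combine them into a new tree with root weight 4, and sort again: {4, 6, 7, 8, 13, 29}

And so on, until a single tree remains.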
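
The full sequence of merges for this example can be traced with a short sketch. MergeTrace is a hypothetical helper that only tracks root weights (it does not build the tree itself), so treat it as an illustration of the four steps rather than as the implementation that follows.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative sketch: MergeTrace is a hypothetical helper name.
class MergeTrace {
    // Returns one line per merge, e.g. "1 + 3 = 4 -> [4, 6, 7, 8, 13, 29]".
    static List<String> trace(int[] weights) {
        List<Integer> roots = new ArrayList<>();
        for (int w : weights) roots.add(w);
        List<String> steps = new ArrayList<>();
        while (roots.size() > 1) {
            Collections.sort(roots);   // step 1: sort ascending
            int a = roots.remove(0);   // step 2: take the two smallest roots
            int b = roots.remove(0);
            roots.add(a + b);          // step 3: the new root weight is their sum
            Collections.sort(roots);   // step 4: sort again
            steps.add(a + " + " + b + " = " + (a + b) + " -> " + roots);
        }
        return steps;
    }

    public static void main(String[] args) {
        for (String step : trace(new int[]{13, 7, 8, 3, 29, 6, 1})) {
            System.out.println(step);
        }
        // 1 + 3 = 4 -> [4, 6, 7, 8, 13, 29]
        // 4 + 6 = 10 -> [7, 8, 10, 13, 29]
        // 7 + 8 = 15 -> [10, 13, 15, 29]
        // 10 + 13 = 23 -> [15, 23, 29]
        // 15 + 23 = 38 -> [29, 38]
        // 29 + 38 = 67 -> [67]
    }
}
```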

Code

Build node

class Node implements Comparable<Node>{
    public int value;
    public Node left;
    public Node right;

    public Node(int value){
        this.value = value;
    }

    public void preOrder(){
        System.out.println(this);
        if(this.left != null)
            this.left.preOrder();
        if(this.right != null)
            this.right.preOrder();
    }

    @Override
    public String toString() {
        return "Node{" +
                "value=" + value +
                '}';
    }

    @Override
    public int compareTo(Node o) {
        // Integer.compare avoids the overflow that subtraction can cause
        return Integer.compare(this.value, o.value);
    }
}

Create a Huffman tree

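// Requires: import java.util.ArrayList; import java.util.Collections;
public static Node createHuffmanTree(int[] arr){
    // Wrap every weight in a single-node tree
    ArrayList<Node> nodes = new ArrayList<>();
    for (int i = 0; i < arr.length; i++) {
        nodes.add(new Node(arr[i]));
    }
    while(nodes.size() > 1){
        // Sort ascending so the two smallest roots come first
        Collections.sort(nodes);
        Node left = nodes.get(0);
        Node right = nodes.get(1);
        // The parent's weight is the sum of its two children
        Node parent = new Node(left.value + right.value);
        parent.left = left;
        parent.right = right;
        nodes.add(parent);
        nodes.remove(left);
        nodes.remove(right);
    }
    // The single remaining node is the root of the Huffman tree
    return nodes.get(0);
}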
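
Sorting the whole list on every pass is simple but does more work than necessary. As an alternative sketch (not the implementation used in this article), a java.util.PriorityQueue can hand out the two smallest roots directly. PqHuffman and its nested Node class are hypothetical names introduced for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative sketch: PqHuffman is a hypothetical name.
class PqHuffman {
    static class Node {
        final int value;
        Node left, right;
        Node(int value) { this.value = value; }
    }

    static Node build(int[] weights) {
        // Min-heap ordered by node weight
        PriorityQueue<Node> queue =
                new PriorityQueue<>((a, b) -> Integer.compare(a.value, b.value));
        for (int w : weights) queue.add(new Node(w));
        while (queue.size() > 1) {
            Node left = queue.poll();   // smallest root
            Node right = queue.poll();  // second smallest root
            Node parent = new Node(left.value + right.value);
            parent.left = left;
            parent.right = right;
            queue.add(parent);
        }
        return queue.poll(); // null when weights is empty
    }

    // Collects the weights in pre-order (node, left subtree, right subtree).
    static void preOrder(Node node, List<Integer> out) {
        if (node == null) return;
        out.add(node.value);
        preOrder(node.left, out);
        preOrder(node.right, out);
    }

    public static void main(String[] args) {
        List<Integer> order = new ArrayList<>();
        preOrder(build(new int[]{13, 7, 8, 3, 29, 6, 1}), order);
        System.out.println(order); // [67, 29, 38, 15, 7, 8, 23, 10, 4, 1, 3, 6, 13]
    }
}
```

For this input no two roots ever have the same weight, so the tree shape does not depend on how ties are broken.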

Huffman coding

1. Huffman coding (Huffman Coding) is an encoding method and a classic algorithm.

2. Huffman coding is one of the classic applications of Huffman trees in telecommunications.

3. Huffman coding is widely used for data file compression. It typically saves between 20% and 90% of the original size, depending on the data.

4. Huffman code is a kind of variable-length code (VLC). Huffman proposed the method in 1952, and it is known as optimal coding.

Information Processing in the Communication Field 1-Fixed-Length Coding

i like like like java do you like a java // 40 characters in total (including spaces)

105 32 108 105 107 101 32 108 105 107 101 32 108 105 107 101 32 106 97 118 97 32 100 111 32 121 111 117 32 108 105 107 101 32 97 32 106 97 118 97 // the corresponding ASCII codes

01101001 00100000 01101100 01101001 01101011 01100101 00100000 01101100 01101001 01101011 01100101 00100000 01101100 01101001 01101011 01100101 00100000 01101010 01100001 01110110 01100001 00100000 01100100 01101111 00100000 01111001 01101111 01110101 00100000 01101100 01101001 01101011 01100101 00100000 01100001 00100000 01101010 01100001 01110110 01100001 // the corresponding binary

Transmitted in binary, the total length is 359 (including the separating spaces; the data itself is 40 × 8 = 320 bits)
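
As a rough check of these numbers, the following sketch renders the sentence as fixed-width 8-bit groups and counts the result. FixedLengthDemo is a hypothetical name, and the padding logic assumes plain ASCII input.

```java
// Illustrative sketch: FixedLengthDemo is a hypothetical name.
class FixedLengthDemo {
    // Renders every character as 8 binary digits, separated by single spaces.
    static String toBinary(String text) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (sb.length() > 0) sb.append(' ');
            String bits = Integer.toBinaryString(c);
            // Left-pad with zeros to a fixed width of 8 (assumes ASCII input)
            for (int i = bits.length(); i < 8; i++) sb.append('0');
            sb.append(bits);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String binary = toBinary("i like like like java do you like a java");
        System.out.println(binary);
        System.out.println(binary.length());                  // 359, separators included
        System.out.println(binary.replace(" ", "").length()); // 320 data bits
    }
}
```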

Information Processing in the Communication Field 2-Variable Length Coding

i like like like java do you like a java // 40 characters in total (including spaces)

d:1 y:1 u:1 j:2 v:2 o:2 l:4 k:4 e:4 i:5 a:5 (space):9 // number of occurrences of each character

0=(space), 1=a, 10=i, 11=e, 100=k, 101=l, 110=o, 111=v, 1000=j, 1001=u, 1010=y, 1011=d

Note: Codes are assigned according to how often each character appears: the more occurrences, the shorter the code. For example, the space appears 9 times, so its code is 0, and so on.

According to the encoding specified for each character above, when we transmit the data "i like like like java do you like a java", the encoding is 10010110100...

This encoding is ambiguous when decoding: 1 (a) is a prefix of 10 (i), so the receiver cannot tell where one character ends and the next begins. To decode unambiguously, no character's code may be a prefix of another character's code. A code that meets this requirement is called a prefix code.
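
The prefix requirement is easy to test mechanically. The sketch below checks the variable-length table from this section and, for comparison, the Huffman table given in the next section. PrefixCheck and isPrefixFree are hypothetical names chosen for this example.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch: PrefixCheck is a hypothetical name.
class PrefixCheck {
    // Returns true when no code is a prefix of another code in the list.
    static boolean isPrefixFree(List<String> codes) {
        for (int i = 0; i < codes.size(); i++) {
            for (int j = 0; j < codes.size(); j++) {
                if (i != j && codes.get(j).startsWith(codes.get(i))) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Codes of the naive variable-length scheme
        List<String> naive = Arrays.asList("0", "1", "10", "11", "100", "101",
                "110", "111", "1000", "1001", "1010", "1011");
        // Codes of the Huffman table used in this article
        List<String> huffman = Arrays.asList("1000", "10010", "100110", "100111",
                "101", "110", "1110", "1111", "0000", "0001", "001", "01");
        System.out.println(isPrefixFree(naive));   // false
        System.out.println(isPrefixFree(huffman)); // true
    }
}
```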

Information Processing in the Communication Field 3-Huffman Coding

i like like like java do you like a java // 40 characters in total (including spaces)

d:1 y:1 u:1 j:2 v:2 o:2 l:4 k:4 e:4 i:5 a:5 (space):9 // number of occurrences of each character

Build a Huffman tree from these counts, using the number of occurrences as the weight.

// Assign a code to each character according to the Huffman tree:
// a branch to the left is 0, a branch to the right is 1. The codes are as follows:

o: 1000   u: 10010   d: 100110   y: 100111   i: 101   a: 110   k: 1110   e: 1111   j: 0000   v: 0001   l: 001   (space): 01

With the Huffman codes above, the string "i like like like java do you like a java" is encoded as follows (note that this compression is lossless):

1010100110111101111010011011110111101001101111011110100001100001110011001101000011001111000100100100110111101111011100100001100001110

Length is: 133

Notes:

1. The original length is 359 and the compressed length is 133, a reduction of (359-133) / 359 ≈ 63%. Counting only the 320 data bits, without the separating spaces, the reduction is (320-133) / 320 ≈ 58%.

2. This code is a prefix code: no character's code is a prefix of another character's code, so decoding is unambiguous.
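
The length of 133 can be reproduced by applying the code table to the sentence. This is a minimal sketch in which the table is typed in by hand from the listing above; TableEncodeDemo is a hypothetical name.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: TableEncodeDemo is a hypothetical name.
class TableEncodeDemo {
    // The code table from the text, keyed by character.
    static Map<Character, String> table() {
        Map<Character, String> t = new HashMap<>();
        t.put('o', "1000");   t.put('u', "10010");
        t.put('d', "100110"); t.put('y', "100111");
        t.put('i', "101");    t.put('a', "110");
        t.put('k', "1110");   t.put('e', "1111");
        t.put('j', "0000");   t.put('v', "0001");
        t.put('l', "001");    t.put(' ', "01");
        return t;
    }

    // Concatenates the code of every character in text.
    static String encode(String text, Map<Character, String> table) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toCharArray()) {
            sb.append(table.get(c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String bits = encode("i like like like java do you like a java", table());
        System.out.println(bits);
        System.out.println(bits.length()); // 133
    }
}
```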

Note that the Huffman tree may differ depending on how nodes with equal weights are ordered during sorting, so the resulting Huffman codes are not necessarily identical. The WPL, however, is the same and is still the minimum. For example, if each newly generated binary tree is always placed last among the binary trees with the same weight, a different binary tree results.

Huffman coding code implementation

Build node

class Node implements Comparable<Node>{
    public Byte data;
    public int weight;
    public Node left;
    public Node right;

    public Node(Byte data,int weight){
        this.data = data;
        this.weight = weight;
    }

    public void preOrder(){
        System.out.println(this);
        if (this.left != null)
            this.left.preOrder();
        if(this.right != null)
            this.right.preOrder();
    }

    @Override
    public String toString() {
        return "Node{" +
                "data=" + data +
                ", weight=" + weight +
                '}';
    }

    @Override
    public int compareTo(Node o) {
        return this.weight - o.weight;
    }
}

1. Convert each data into a node and store it in the List collection

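// Requires: import java.util.ArrayList; import java.util.HashMap; import java.util.Map;
private static ArrayList<Node> getList(byte[] bytes){
    ArrayList<Node> nodes = new ArrayList<>();
    // Count how many times each byte value occurs
    Map<Byte,Integer> map = new HashMap<>();
    for(byte b : bytes){
        Integer count = map.get(b);
        if(count == null){
            map.put(b, 1);
        }else{
            map.put(b, count + 1);
        }
    }
    // Traverse the map and turn every (byte, count) pair into a leaf node
    for(Map.Entry<Byte,Integer> entry : map.entrySet()){
        nodes.add(new Node(entry.getKey(), entry.getValue()));
    }
    return nodes;
}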
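
The counting loop can also be written with Map.merge, which covers the "first occurrence" and "seen before" cases in one call. This is only an alternative sketch; ByteCountDemo and countBytes are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: ByteCountDemo is a hypothetical name.
class ByteCountDemo {
    static Map<Byte, Integer> countBytes(byte[] bytes) {
        Map<Byte, Integer> counts = new HashMap<>();
        for (byte b : bytes) {
            // merge stores 1 for a new key, otherwise adds 1 to the stored count
            counts.merge(b, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        byte[] bytes = "i like like like java do you like a java"
                .getBytes(StandardCharsets.US_ASCII);
        Map<Byte, Integer> counts = countBytes(bytes);
        System.out.println(counts.size());          // 12 distinct byte values
        System.out.println(counts.get((byte) ' ')); // 9
        System.out.println(counts.get((byte) 'a')); // 5
    }
}
```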

2. Obtain the Huffman tree according to the List collection

private static Node createHuffmanTree(List<Node> nodes){
        while(nodes.size() > 1){
            Collections.sort(nodes);
            Node leftNode = nodes.get(0);
            Node rightNode = nodes.get(1);
            Node parent = new Node(null,leftNode.weight + rightNode.weight);
            parent.left = leftNode;
            parent.right = rightNode;
            nodes.remove(leftNode);
            nodes.remove(rightNode);
            nodes.add(parent);
        }
        return nodes.get(0);
    }

3. Obtain the Huffman code table from the Huffman tree (method overloading is used here to simplify parameter passing)

static Map<Byte,String> huffmanCodes = new HashMap<>();
static StringBuilder stringBuilder = new StringBuilder();

private static Map<Byte,String> getCodes(Node root){
    if (root == null)
        return null;
    // Discard codes left over from a previous call
    huffmanCodes.clear();
    getCodes(root, "", stringBuilder);
    return huffmanCodes;
}

// Derive the Huffman codes from the Huffman tree.
// code is the branch taken to reach node ("0" = left, "1" = right);
// stringBuilder holds the path from the root to the parent of node.
private static void getCodes(Node node, String code, StringBuilder stringBuilder){
    StringBuilder stringBuilder1 = new StringBuilder(stringBuilder);
    stringBuilder1.append(code);
    if(node.data == null) { // non-leaf node
        // recurse to the left
        getCodes(node.left, "0", stringBuilder1);
        // recurse to the right
        getCodes(node.right, "1", stringBuilder1);
    }else { // leaf node: the complete path is the code of this byte
        huffmanCodes.put(node.data, stringBuilder1.toString());
    }
}
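
The same idea can be written without static state by passing the result map down the recursion. The sketch below is self-contained, works on characters instead of bytes, and uses a PriorityQueue to build the tree; CodeTableDemo and all of its members are hypothetical names. Because every Huffman tree for the same weights has the same WPL, the total encoded length does not depend on how ties are broken.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

// Illustrative sketch: CodeTableDemo is a hypothetical name.
class CodeTableDemo {
    static class Node {
        final Character symbol; // null for internal nodes
        final int weight;
        Node left, right;
        Node(Character symbol, int weight) { this.symbol = symbol; this.weight = weight; }
    }

    static Node buildTree(Map<Character, Integer> freq) {
        PriorityQueue<Node> queue =
                new PriorityQueue<>((a, b) -> Integer.compare(a.weight, b.weight));
        for (Map.Entry<Character, Integer> e : freq.entrySet()) {
            queue.add(new Node(e.getKey(), e.getValue()));
        }
        while (queue.size() > 1) {
            Node left = queue.poll();
            Node right = queue.poll();
            Node parent = new Node(null, left.weight + right.weight);
            parent.left = left;
            parent.right = right;
            queue.add(parent);
        }
        return queue.poll();
    }

    // Walks the tree, appending "0" for a left branch and "1" for a right branch.
    static void collect(Node node, String path, Map<Character, String> codes) {
        if (node == null) return;
        if (node.symbol != null) {       // leaf: path is the complete code
            codes.put(node.symbol, path);
            return;
        }
        collect(node.left, path + "0", codes);
        collect(node.right, path + "1", codes);
    }

    // Number of bits needed to encode text with its own Huffman code.
    static int encodedLength(String text) {
        Map<Character, Integer> freq = new HashMap<>();
        for (char c : text.toCharArray()) freq.merge(c, 1, Integer::sum);
        Map<Character, String> codes = new HashMap<>();
        collect(buildTree(freq), "", codes);
        int bits = 0;
        for (char c : text.toCharArray()) bits += codes.get(c).length();
        return bits;
    }

    public static void main(String[] args) {
        System.out.println(encodedLength("i like like like java do you like a java")); // 133
    }
}
```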

4. Compress the data according to the original byte array and Huffman coding table

private static byte[] zip(byte[] bytes, Map<Byte, String> huffmanCodes) {
    // Concatenate the code of every input byte into one long bit string
    StringBuilder builder = new StringBuilder();
    for(byte b : bytes){
        builder.append(huffmanCodes.get(b));
    }
    // Number of bytes needed, rounded up to a whole byte
    int length = (builder.length() + 7) / 8;
    byte[] by = new byte[length];
    int index = 0;
    for (int i = 0; i < builder.length(); i += 8) {
        // The last group may hold fewer than 8 bits
        int end = Math.min(i + 8, builder.length());
        String str = builder.substring(i, end);
        by[index++] = (byte) Integer.parseInt(str, 2);
    }
    return by;
}
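
One caveat of packing bits with Integer.parseInt is that leading zeros in a final group shorter than 8 bits are not recoverable from the byte alone (for example, "00101" and "101" both become 5). A common remedy is to store the total bit count alongside the bytes. The sketch below shows that idea; it is not the format used in this article, it left-aligns the final bits inside the last byte, and BitPackDemo, pack and unpack are hypothetical names.

```java
// Illustrative sketch: BitPackDemo is a hypothetical name.
class BitPackDemo {
    // Packs a string of '0'/'1' characters into bytes, most significant bit first.
    static byte[] pack(String bits) {
        byte[] out = new byte[(bits.length() + 7) / 8];
        for (int i = 0; i < bits.length(); i++) {
            if (bits.charAt(i) == '1') {
                out[i / 8] |= 0x80 >> (i % 8);
            }
        }
        return out;
    }

    // Restores exactly bitCount bits, so padding in the last byte is ignored.
    static String unpack(byte[] data, int bitCount) {
        StringBuilder sb = new StringBuilder(bitCount);
        for (int i = 0; i < bitCount; i++) {
            int bit = (data[i / 8] >> (7 - i % 8)) & 1;
            sb.append(bit);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String bits = "1010100000101"; // 13 bits; the last group is "00101"
        byte[] packed = pack(bits);
        System.out.println(packed[0]);                    // -88
        System.out.println(unpack(packed, bits.length())); // 1010100000101
    }
}
```

With this layout the decoder needs two values: the byte array and the number of valid bits.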

5. For convenience, wrap the compression steps in a single method

private static byte[] huffmanZip(byte[] bytes){
    // Build the list of leaf nodes
    ArrayList<Node> nodes = getList(bytes);
    // Build the Huffman tree
    Node root = createHuffmanTree(nodes);
    // Build the Huffman code table
    Map<Byte,String> huffmanCodes = getCodes(root);
    // Compress the data
    byte[] huffmanCodesBytes = zip(bytes, huffmanCodes);
    return huffmanCodesBytes;
}

The compressed Huffman byte array is [-88, -65, -56, -65, -56, -65, -55, 77, -57, 6, -24, -14, -117, -4, -60, -90, 28]

Decoding (decompression)

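6. Write a method that converts a byte to a binary string. A non-negative value must be padded with leading zeros to a full 8 bits, and a negative value must be truncated to the low 8 bits of its two's complement representation. The last byte is converted directly without padding, because it may hold fewer than 8 bits. Note that any leading zeros in that final group are lost this way, so a robust format would also store the total number of bits.

/**
 * Converts a byte to its binary string representation.
 * @param flag true to pad the result to 8 bits; false for the last byte, which is not padded
 * @param b the byte to convert
 * @return the binary string for b
 */
private static String byteToBitString(boolean flag, byte b){
    int temp = b;
    if(flag){
        temp |= 256; // set bit 8 so that non-negative values keep their leading zeros
    }
    String str = Integer.toBinaryString(temp);
    // Negative bytes are sign-extended to 32 bits, so keep only the low 8 bits
    if(flag || temp < 0){
        return str.substring(str.length() - 8);
    }
    return str;
}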
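
The need for both padding and truncation comes from how Integer.toBinaryString treats a byte: the byte is widened to an int first, so negative values print as 32 digits and small positive values print without leading zeros. The sketch below shows an always-8-bit conversion for comparison; ByteBitsDemo and toBits are hypothetical names.

```java
// Illustrative sketch: ByteBitsDemo is a hypothetical name.
class ByteBitsDemo {
    // Always returns exactly 8 characters, for negative and non-negative bytes alike.
    static String toBits(byte b) {
        // b & 0xFF drops the sign extension; | 0x100 adds a guard bit so leading zeros survive
        return Integer.toBinaryString((b & 0xFF) | 0x100).substring(1);
    }

    public static void main(String[] args) {
        System.out.println(Integer.toBinaryString((byte) -88).length()); // 32
        System.out.println(Integer.toBinaryString((byte) 28));           // 11100
        System.out.println(toBits((byte) -88));                          // 10101000
        System.out.println(toBits((byte) 28));                           // 00011100
    }
}
```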

7. Decompression: first convert the Huffman-coded byte array into one bit string, then invert the Huffman code table to obtain the decoding table, and finally decode the bit string against that table, collecting the result in a List.

/**
 * Decompresses data.
 * @param huffmanCodesBytes the Huffman-encoded byte array
 * @param huffmanCodes the Huffman code table
 * @return the byte array as it was before encoding
 */
public static byte[] deCode(byte[] huffmanCodesBytes, Map<Byte,String> huffmanCodes){
    // Rebuild the bit string; only the last byte is left unpadded
    StringBuilder builder = new StringBuilder();
    for (int i = 0; i < huffmanCodesBytes.length; i++) {
        boolean flag = i != huffmanCodesBytes.length - 1;
        builder.append(byteToBitString(flag, huffmanCodesBytes[i]));
    }
    String str = builder.toString();
    // Invert the code table: code -> byte
    Map<String,Byte> map = new HashMap<>();
    for (Map.Entry<Byte,String> entry : huffmanCodes.entrySet()){
        map.put(entry.getValue(), entry.getKey());
    }
    ArrayList<Byte> list = new ArrayList<>();
    int start = 0;
    while(start < str.length()){
        // Extend the window until it matches a code
        int end = start + 1;
        while(end < str.length() && !map.containsKey(str.substring(start, end))){
            end++;
        }
        Byte value = map.get(str.substring(start, end));
        if(value == null){
            throw new IllegalArgumentException("Bit string does not match the code table");
        }
        list.add(value);
        start = end;
    }
    byte[] b = new byte[list.size()];
    for (int i = 0; i < b.length; i++) {
        b[i] = list.get(i);
    }
    return b;
}
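
Because the code is a prefix code, decoding can also proceed one bit at a time: as soon as the bits read so far match a code, that code is complete. The sketch below illustrates this on characters with a hand-typed subset of the code table from earlier in the article; TableDecodeDemo and decode are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: TableDecodeDemo is a hypothetical name.
class TableDecodeDemo {
    static String decode(String bits, Map<String, Character> codeToChar) {
        StringBuilder out = new StringBuilder();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < bits.length(); i++) {
            current.append(bits.charAt(i));
            Character c = codeToChar.get(current.toString());
            if (c != null) {          // a complete code has been read
                out.append(c);
                current.setLength(0); // start collecting the next code
            }
        }
        if (current.length() > 0) {
            throw new IllegalArgumentException("Trailing bits do not form a complete code");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, Character> codeToChar = new HashMap<>();
        codeToChar.put("101", 'i');
        codeToChar.put("01", ' ');
        codeToChar.put("001", 'l');
        codeToChar.put("1110", 'k');
        codeToChar.put("1111", 'e');
        System.out.println(decode("1010100110111101111", codeToChar)); // i like
    }
}
```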

Test

public static void main(String[] args) {
        String constant = "i like like like java do you like a java";
        System.out.println(constant);
        byte[] constantBytes = constant.getBytes();
        byte[] huffmanCodesBytes = huffmanZip(constantBytes);
        Map<Byte, String> huffmanCodes = getCodes(createHuffmanTree(getList(constantBytes)));
        byte[] sourceBytes = deCode(huffmanCodesBytes, huffmanCodes);
        System.out.println(new String(sourceBytes));
    }

You can see that the string is exactly the same before encoding and after decoding, which completes the whole compression and decompression process.

Practice

Implement the compression and decompression of binary files

Compression

// Requires: import java.io.*; import java.nio.file.Files; import java.nio.file.Paths;
public static void zipFile(String srcFile, String dstFile){
    try {
        // Read the whole source file; unlike available() followed by a single read(),
        // this is guaranteed to return every byte
        byte[] b = Files.readAllBytes(Paths.get(srcFile));
        byte[] huffmanBytes = huffmanZip(b);
        // try-with-resources closes both streams, even if an exception is thrown
        try (FileOutputStream os = new FileOutputStream(dstFile);
             ObjectOutputStream oos = new ObjectOutputStream(os)) {
            // Write the compressed data first, then the code table needed to decode it
            oos.writeObject(huffmanBytes);
            oos.writeObject(huffmanCodes);
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
}

Decompression

// Requires: import java.io.*; import java.util.Map;
@SuppressWarnings("unchecked")
public static void unZipFile(String zipFile, String dstFile){
    // try-with-resources closes the streams, even if an exception is thrown
    try (FileInputStream is = new FileInputStream(zipFile);
         ObjectInputStream ois = new ObjectInputStream(is)) {
        // Read the objects in the same order in which they were written
        byte[] huffmanBytes = (byte[]) ois.readObject();
        Map<Byte,String> huffmanCodes = (Map<Byte,String>) ois.readObject();
        byte[] bytes = deCode(huffmanBytes, huffmanCodes);
        try (FileOutputStream os = new FileOutputStream(dstFile)) {
            os.write(bytes);
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
}
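
The file format here is simply two serialized Java objects in a fixed order: the compressed byte array, then the code table. The sketch below demonstrates that ordering with in-memory streams instead of files; ObjectStreamDemo, write and read are hypothetical names, and the two-entry code table is made-up sample data.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: ObjectStreamDemo is a hypothetical name.
class ObjectStreamDemo {
    static byte[] write(byte[] data, Map<Byte, String> codes) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(buffer)) {
            oos.writeObject(data);  // first object: compressed bytes
            oos.writeObject(codes); // second object: code table
        }
        return buffer.toByteArray();
    }

    @SuppressWarnings("unchecked")
    static Object[] read(byte[] stored) throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois =
                     new ObjectInputStream(new ByteArrayInputStream(stored))) {
            // Objects come back in the order in which they were written
            byte[] data = (byte[]) ois.readObject();
            Map<Byte, String> codes = (Map<Byte, String>) ois.readObject();
            return new Object[]{data, codes};
        }
    }

    public static void main(String[] args) throws Exception {
        Map<Byte, String> codes = new HashMap<>();
        codes.put((byte) 'a', "0");
        codes.put((byte) 'b', "1");
        Object[] restored = read(write(new byte[]{-88, 28}, codes));
        System.out.println(Arrays.toString((byte[]) restored[0])); // [-88, 28]
        System.out.println(codes.equals(restored[1]));             // true
    }
}
```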

Test

        // Test file compression
        /*String srcFile = "d://src.bmp";
        String dstFile = "d://src.zip";
        zipFile(srcFile,dstFile);
        System.out.println("Compression succeeded!");*/

        // Test file decompression
        String zipFile = "d://src.zip";
        String dstFile = "d://src1.bmp";
        unZipFile(zipFile,dstFile);
        System.out.println("Decompression succeeded!");

Picture comparison before and after compression


Origin blog.csdn.net/qq_45796208/article/details/111631837