Data Compression - Huffman Tree and Huffman Compression

The idea of Huffman compression: use fewer bits to represent characters that occur frequently and more bits to represent characters that occur rarely. This reduces the total number of bits needed to represent the string.

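Premise: no character's code may be a prefix of another character's code (the code must be prefix-free), so that the bit stream can be decoded without delimiters. Building the codes from a Huffman tree guarantees this, because every character sits in a leaf and no leaf lies on the path to another leaf.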
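To make the premise concrete, here is a minimal sketch that tests whether a set of codewords is prefix-free. The class name PrefixFreeCheck and the method isPrefixFree are hypothetical helpers for illustration only; they are not part of the Huffman code shown in this article.

```java
import java.util.List;

// Hypothetical helper (not from the article): checks the prefix-free property.
final class PrefixFreeCheck {
    // Returns true if no codeword is a prefix of a different codeword in the list.
    static boolean isPrefixFree(List<String> codes) {
        for (int i = 0; i < codes.size(); i++)
            for (int j = 0; j < codes.size(); j++)
                if (i != j && codes.get(j).startsWith(codes.get(i)))
                    return false;
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isPrefixFree(List.of("0", "10", "110", "111"))); // true
        System.out.println(isPrefixFree(List.of("0", "01", "11")));         // false: "0" is a prefix of "01"
    }
}
```

With the second code set, the bits 01 could be the single codeword "01" or the start of "0" followed by something else, which is exactly the ambiguity the premise rules out.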

Construct a Huffman tree:

First define the node class of the Huffman tree:

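private static class Node implements Comparable<Node> {
    private final char ch;           // character stored in a leaf (unused in internal nodes)
    private final int freq;          // total frequency of the subtree rooted at this node
    private final Node left, right;  // children; both are null for a leaf

    Node(char ch, int freq, Node left, Node right) {
        this.ch    = ch;
        this.freq  = freq;
        this.left  = left;
        this.right = right;
    }
    private boolean isLeaf() { return (left == null) && (right == null); }
    // order nodes by frequency so that a min-priority queue returns the rarest first
    public int compareTo(Node that) { return this.freq - that.freq; }
}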

Then build the Huffman tree:

Huffman compression is a two-pass algorithm: it scans the target string twice. The first pass counts how often each character occurs, and the tree is built from these counts. The second pass compresses the string using the compilation table (the character-to-code lookup table) generated from the tree.
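The first pass can be sketched as follows. This is an illustration under assumptions rather than the article's own code: the class name FrequencyCount is hypothetical, and the alphabet size R = 256 (extended ASCII) is assumed because the article uses R without defining it.

```java
// Hypothetical sketch of the first pass: count character frequencies.
final class FrequencyCount {
    static final int R = 256; // assumed alphabet size (extended ASCII)

    // Returns an array where freq[c] is the number of occurrences of character c.
    // Characters with a code point of R or more are rejected, since they have no slot.
    static int[] countFrequencies(char[] input) {
        int[] freq = new int[R];
        for (char c : input) {
            if (c >= R) throw new IllegalArgumentException("character outside alphabet: " + (int) c);
            freq[c]++;
        }
        return freq;
    }

    public static void main(String[] args) {
        int[] freq = countFrequencies("ABRACADABRA".toCharArray());
        System.out.println(freq['A'] + " " + freq['B'] + " " + freq['C']); // 5 2 1
    }
}
```

The resulting array has the same shape as the freq argument that buildTrie expects.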

The construction works on a forest of trees. Initially, create a separate node for each character (each can be seen as a tree with only one node). Then find the two nodes with the smallest frequencies and create a new node with these two nodes as its children; the frequency of the new node is the sum of the frequencies of its two children. Each such merge reduces the number of trees in the forest by one. Keep repeating this until only one tree remains. The path from the root to a leaf (0 for a left branch, 1 for a right branch) is then the Huffman code of the character stored in that leaf.

private static Node buildTrie(int[] freq) {
    // start with one single-node tree per character that actually occurs
    MinPQ<Node> pq = new MinPQ<Node>();
    for (char i = 0; i < R; i++)
        if (freq[i] > 0)
            pq.insert(new Node(i, freq[i], null, null));

    // repeatedly merge the two least frequent trees until one tree remains
    while (pq.size() > 1) {
        Node left  = pq.delMin();
        Node right = pq.delMin();
        Node parent = new Node('\0', left.freq + right.freq, left, right);
        pq.insert(parent);
    }
    return pq.delMin();
}
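The snippet above relies on MinPQ and the constant R, which are defined elsewhere. As a self-contained sketch of the same merge loop, the version below swaps in java.util.PriorityQueue from the standard library; the class name HuffmanTreeSketch and the helper countLeaves are hypothetical and exist only for this illustration.

```java
import java.util.PriorityQueue;

// Hypothetical stand-alone variant of buildTrie using java.util.PriorityQueue.
final class HuffmanTreeSketch {
    static final class Node implements Comparable<Node> {
        final char ch;
        final int freq;
        final Node left, right;

        Node(char ch, int freq, Node left, Node right) {
            this.ch = ch; this.freq = freq; this.left = left; this.right = right;
        }
        boolean isLeaf() { return left == null && right == null; }
        public int compareTo(Node that) { return Integer.compare(this.freq, that.freq); }
    }

    // Builds the tree from a frequency array indexed by character code.
    static Node buildTrie(int[] freq) {
        PriorityQueue<Node> pq = new PriorityQueue<>();
        for (int c = 0; c < freq.length; c++)
            if (freq[c] > 0) pq.add(new Node((char) c, freq[c], null, null));
        if (pq.isEmpty()) throw new IllegalArgumentException("no characters to encode");

        // Merge the two least frequent trees until a single tree remains.
        while (pq.size() > 1) {
            Node left = pq.poll();
            Node right = pq.poll();
            pq.add(new Node('\0', left.freq + right.freq, left, right));
        }
        return pq.poll();
    }

    // Counts the leaves (distinct characters) below x.
    static int countLeaves(Node x) {
        return x.isLeaf() ? 1 : countLeaves(x.left) + countLeaves(x.right);
    }

    public static void main(String[] args) {
        int[] freq = new int[256];
        for (char c : "ABRACADABRA".toCharArray()) freq[c]++;
        Node root = buildTrie(freq);
        System.out.println(root.freq);         // 11: the root's frequency is the input length
        System.out.println(countLeaves(root)); // 5: one leaf per distinct character
    }
}
```

Ties between equal frequencies may be broken differently by different priority queues, so the exact shape of the tree can vary, but the root frequency and the number of leaves do not.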

Decoding operation:

Decode the bit stream by walking the Huffman tree: starting at the root, move down according to the input bits (go to the left child on a 0 and to the right child on a 1). On reaching a leaf, output the character stored in that leaf and return to the root.

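public static void expand() {
    Node root = readTrie();              // rebuild the Huffman tree from the input (readTrie is not shown here)
    int length = BinaryStdIn.readInt();  // number of characters to decode

    for (int i = 0; i < length; i++) {
        Node x = root;
        // follow the bits down to a leaf: 1 goes right, 0 goes left
        while (!x.isLeaf()) {
            boolean bit = BinaryStdIn.readBoolean();
            if (bit) x = x.right;
            else     x = x.left;
        }
        BinaryStdOut.write(x.ch);
    }
    BinaryStdOut.close();
}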
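To try the decoding walk without binary I/O, the following sketch decodes from a String of '0' and '1' characters instead of BinaryStdIn. The class DecodeSketch, the method decode, and the hand-built tree in sampleTrie (codes A = 0, B = 10, C = 11) are assumptions made for this example only.

```java
// Hypothetical sketch: decode a string of '0'/'1' characters by walking the tree.
final class DecodeSketch {
    static final class Node {
        final char ch;
        final Node left, right;
        Node(char ch, Node left, Node right) { this.ch = ch; this.left = left; this.right = right; }
        boolean isLeaf() { return left == null && right == null; }
    }

    // Decodes exactly `length` characters from `bits`, restarting at the root after each leaf.
    static String decode(Node root, String bits, int length) {
        StringBuilder out = new StringBuilder();
        int pos = 0;
        for (int i = 0; i < length; i++) {
            Node x = root;
            while (!x.isLeaf())
                x = (bits.charAt(pos++) == '1') ? x.right : x.left;
            out.append(x.ch);
        }
        return out.toString();
    }

    // Hand-built example tree: A = 0, B = 10, C = 11.
    static Node sampleTrie() {
        Node a = new Node('A', null, null);
        Node b = new Node('B', null, null);
        Node c = new Node('C', null, null);
        return new Node('\0', a, new Node('\0', b, c));
    }

    public static void main(String[] args) {
        System.out.println(decode(sampleTrie(), "010110", 4)); // ABCA
    }
}
```

The bits split as 0 | 10 | 11 | 0 with no separators needed, which is the prefix-free property at work.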

Compression operation:

Compression is driven by the compilation table. From the Huffman tree, build a table that associates each character with the bit string given by its root-to-leaf path, then scan the target string. For each character read, look up its bit string in the table and write it to the output.

Build the compilation table:

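private static void buildCode(String[] st, Node x, String s) {
    if (!x.isLeaf()) {
        // s is the path from the root so far: append 0 going left, 1 going right
        buildCode(st, x.left,  s + '0');
        buildCode(st, x.right, s + '1');
    }
    else {
        st[x.ch] = s;  // the complete path is the code of the leaf's character
    }
}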
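As a small worked example of the table construction, the sketch below runs the same recursion on a hand-built three-leaf tree. The class CodeTableSketch, the method sampleTable, and the tree itself are hypothetical and chosen only to keep the result easy to check by hand.

```java
// Hypothetical sketch: build the code table for a small hand-built tree.
final class CodeTableSketch {
    static final class Node {
        final char ch;
        final Node left, right;
        Node(char ch, Node left, Node right) { this.ch = ch; this.left = left; this.right = right; }
        boolean isLeaf() { return left == null && right == null; }
    }

    // Same recursion as buildCode in the text: s accumulates the path from the root.
    static void buildCode(String[] st, Node x, String s) {
        if (x.isLeaf()) {
            st[x.ch] = s;
            return;
        }
        buildCode(st, x.left, s + '0');
        buildCode(st, x.right, s + '1');
    }

    // Tree shape: root -> (A, internal -> (B, C)).
    static String[] sampleTable() {
        Node a = new Node('A', null, null);
        Node b = new Node('B', null, null);
        Node c = new Node('C', null, null);
        Node root = new Node('\0', a, new Node('\0', b, c));
        String[] st = new String[256]; // indexed by character code, assuming extended ASCII
        buildCode(st, root, "");
        return st;
    }

    public static void main(String[] args) {
        String[] st = sampleTable();
        System.out.println("A=" + st['A'] + " B=" + st['B'] + " C=" + st['C']); // A=0 B=10 C=11
    }
}
```

Entries for characters that do not appear in the tree stay null, so a lookup for such a character has to be treated as an error.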

Compression using compilation table:

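// input is the char[] holding the text to compress; st is the table filled by buildCode
for (int i = 0; i < input.length; i++) {
    String code = st[input[i]];
    // write the code one bit at a time: '0' as false, '1' as true
    for (int j = 0; j < code.length(); j++) {
        if (code.charAt(j) == '0')
            BinaryStdOut.write(false);
        else if (code.charAt(j) == '1')
            BinaryStdOut.write(true);
        else
            throw new IllegalStateException("Illegal character in code");
    }
}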
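The lookup loop can also be tried without BinaryStdOut by collecting the bits in a String. The sketch below does that; the class EncodeSketch, the method encode, and the hand-filled table in sampleTable (A = 0, B = 10, C = 11) are assumptions for illustration, not the article's code.

```java
// Hypothetical sketch: encode text into a string of '0'/'1' characters via table lookup.
final class EncodeSketch {
    // Concatenates the code of every input character; rejects characters without a code.
    static String encode(String[] st, char[] input) {
        StringBuilder bits = new StringBuilder();
        for (char c : input) {
            String code = (c < st.length) ? st[c] : null;
            if (code == null) throw new IllegalArgumentException("no code for character: " + c);
            bits.append(code);
        }
        return bits.toString();
    }

    // Hand-filled example table: A = 0, B = 10, C = 11.
    static String[] sampleTable() {
        String[] st = new String[256];
        st['A'] = "0";
        st['B'] = "10";
        st['C'] = "11";
        return st;
    }

    public static void main(String[] args) {
        System.out.println(encode(sampleTable(), "ABCA".toCharArray())); // 010110
    }
}
```

Four 8-bit characters (32 bits) become 6 bits here, not counting the cost of storing the tree alongside the data.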

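For any prefix code, the length of the encoded bit string is equal to the weighted external path length of the corresponding tree: the sum, over all leaves, of the character's frequency multiplied by the depth of its leaf.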
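This identity can be checked on a small case. The sketch below computes the weighted external path length of a hand-built tree for the text ABCA (A occurs twice at depth 1, B and C once each at depth 2); the class PathLengthSketch and its methods are hypothetical names used only for this illustration.

```java
// Hypothetical sketch: weighted external path length of a small tree.
final class PathLengthSketch {
    static final class Node {
        final char ch;
        final int freq;
        final Node left, right;
        Node(char ch, int freq, Node left, Node right) {
            this.ch = ch; this.freq = freq; this.left = left; this.right = right;
        }
        boolean isLeaf() { return left == null && right == null; }
    }

    // Sum over all leaves of (frequency * depth); depth of the root is 0.
    static int weightedExternalPathLength(Node x, int depth) {
        if (x.isLeaf()) return x.freq * depth;
        return weightedExternalPathLength(x.left, depth + 1)
             + weightedExternalPathLength(x.right, depth + 1);
    }

    // Tree for the text "ABCA": A (freq 2) at depth 1, B and C (freq 1 each) at depth 2.
    static Node sampleTrie() {
        Node a = new Node('A', 2, null, null);
        Node b = new Node('B', 1, null, null);
        Node c = new Node('C', 1, null, null);
        Node bc = new Node('\0', 2, b, c);
        return new Node('\0', 4, a, bc);
    }

    public static void main(String[] args) {
        System.out.println(weightedExternalPathLength(sampleTrie(), 0)); // 6
    }
}
```

With the codes A = 0, B = 10, C = 11 read off this tree, ABCA encodes to 010110, which is 6 bits long: 2·1 + 1·2 + 1·2 = 6.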

Given a set of r symbols and their frequencies, the prefix code constructed by the Huffman algorithm is optimal: no other prefix code for those frequencies produces a shorter encoded bit string.
