Comic: What the hell is "Huffman coding"?

Author | 小灰

Source | Programmer Xiaohui (ID: chengxuyuanxiaohui)

In the last issue, we introduced a special data structure "Huffman tree", also known as the optimal binary tree. Friends who have not seen it can click the link below:

Comic: What is "Huffman Tree"?

So, what is the use of this data structure? We will reveal the answer today.

How do computer systems store information?

A computer is not a person. It does not understand Chinese or English, let alone pictures and videos. The only things it recognizes are 0 (low voltage level) and 1 (high voltage level).

Therefore, all the text, images, audio, and video we see on the computer are stored and transmitted in binary.

In a narrow sense, converting all kinds of information that humans can understand into a binary form that can be recognized by computers is called encoding.

There are many ways to encode, and the encoding method we are most familiar with is ASCII.

In ASCII, each character is represented by a specific binary number, normally stored in 8 bits. For example, the letter A is stored as 01000001.
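As a small illustration, the sketch below prints the 8-bit pattern of a few characters when each is stored in one byte. The class name AsciiDemo and the helper toBits are made up for this example, and the code assumes its input is a plain ASCII character. Strictly speaking, ASCII defines 7-bit codes, so the leading bit is always 0.

```java
class AsciiDemo {
    // Hypothetical helper: returns the 8-bit binary string of an ASCII character.
    static String toBits(char c) {
        String bits = Integer.toBinaryString(c);
        // Left-pad with zeros so that every code is 8 digits long
        return String.format("%8s", bits).replace(' ', '0');
    }

    public static void main(String[] args) {
        for (char c : new char[]{'A', 'B', 'a', '0'}) {
            System.out.println(c + " -> " + toBits(c));
        }
    }
}
```

Every character gets the same number of bits, no matter how often it appears.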

ASCII is an equal-length encoding: the code of every character has the same length.

Equal-length encoding is simple, but is it always the best choice? Let's look at an example:

Suppose a piece of information contains only the 6 characters A, B, C, D, E, and F. With equal-length encoding, we can give each character a binary code of length 3:

A: 000, B: 001, C: 010, D: 011, E: 100, F: 101

In this way, the message "ABEFCDAED" can be encoded as the binary string "000 001 100 101 010 011 000 100 011", with a total length of 27 bits.
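The equal-length scheme can be sketched in a few lines. The 3-bit table below is reconstructed from the encoded example in the text; FixedLengthDemo and encode are names invented for this illustration, and the code assumes the message contains only the letters A to F.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class FixedLengthDemo {
    // 3-bit code table, reconstructed from the encoded example in the text
    static final Map<Character, String> CODES = new LinkedHashMap<>();
    static {
        CODES.put('A', "000");
        CODES.put('B', "001");
        CODES.put('C', "010");
        CODES.put('D', "011");
        CODES.put('E', "100");
        CODES.put('F', "101");
    }

    // Replaces each character with its code (assumes only A-F occur)
    static String encode(String text) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toCharArray()) {
            sb.append(CODES.get(c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String bits = encode("ABEFCDAED");
        // Prints: 000001100101010011000100011 (27 bits)
        System.out.println(bits + " (" + bits.length() + " bits)");
    }
}
```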

But is this encoding method the optimal design? What happens if we give different characters codes of different lengths? For example:

A: 0, B: 00, C: 01, D: 1, E: 10, F: 11

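In this way, the message "ABEFCDAED" can be encoded as the binary string "0 00 10 11 01 1 0 10 1", with a total length of only 14 bits. The shorter length comes at a price, however: because the code for A (0) is a prefix of the code for B (00), the bits "000" could be read as AAA, AB, or BA, so the message can no longer be decoded unambiguously.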
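One way to check whether such a variable-length table can be decoded unambiguously is to count how many character sequences encode to the same bits. The sketch below does this for the table read off the example (A=0, B=00, C=01, D=1, E=10, F=11). AmbiguityDemo and countDecodings are names made up for this illustration.

```java
import java.util.Arrays;
import java.util.List;

class AmbiguityDemo {
    // Variable-length codes from the example: A=0, B=00, C=01, D=1, E=10, F=11
    static final List<String> CODES = Arrays.asList("0", "00", "01", "1", "10", "11");

    // Counts how many different code sequences produce exactly this bit string
    static long countDecodings(String bits) {
        long[] ways = new long[bits.length() + 1];
        ways[0] = 1; // the empty string has exactly one decoding
        for (int end = 1; end <= bits.length(); end++) {
            for (String code : CODES) {
                int start = end - code.length();
                // Does this code occupy the positions start..end-1?
                if (start >= 0 && bits.startsWith(code, start)) {
                    ways[end] += ways[start];
                }
            }
        }
        return ways[bits.length()];
    }

    public static void main(String[] args) {
        // Prints 3: the bits "000" can mean AAA, AB, or BA
        System.out.println(countDecodings("000"));
    }
}
```

Any count greater than 1 means the receiver cannot tell which message was sent.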

Huffman coding, like the Huffman tree, was invented by David Huffman of MIT. This coding method achieves two important goals:

1. No character's code is a prefix of any other character's code.

2. The total length of the encoded information is minimal.
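The first goal is easy to test for any code table. The following is a minimal sketch; PrefixCheckDemo and isPrefixFree are hypothetical names, and the two tables are the equal-length and variable-length examples used earlier.

```java
import java.util.Arrays;
import java.util.List;

class PrefixCheckDemo {
    // Returns true when no code in the list is a prefix of a different code
    static boolean isPrefixFree(List<String> codes) {
        for (int i = 0; i < codes.size(); i++) {
            for (int j = 0; j < codes.size(); j++) {
                if (i != j && codes.get(j).startsWith(codes.get(i))) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> fixed = Arrays.asList("000", "001", "010", "011", "100", "101");
        List<String> variable = Arrays.asList("0", "00", "01", "1", "10", "11");
        System.out.println(isPrefixFree(fixed));    // true
        System.out.println(isPrefixFree(variable)); // false: "0" is a prefix of "00"
    }
}
```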

What does the Huffman code generation process look like? Let us look at the following example:

Suppose a piece of information contains only the 6 characters A, B, C, D, E, and F, which occur 2, 3, 7, 9, 18, and 25 times respectively. How should we design the corresponding codes?

We can treat these 6 characters as 6 leaf nodes, use the number of occurrences of each character as the weight of its node, and generate a Huffman tree:

What is the significance of this?

Each internal node of the Huffman tree has two branches, left and right, and each binary digit has two states, 0 and 1. We can map one onto the other. What happens if we treat the left branch of each node as 0 and the right branch as 1?

In this way, the path from the root node of the Huffman tree to each leaf node corresponds to a binary code:

The binary code generated by the Huffman tree in the above process is the Huffman code.

Now, we face two key issues:

First of all, can the generated codes be ambiguous because of the prefix problem? The answer is no.

Because each character corresponds to a leaf node of the Huffman tree, the path from the root to one leaf never passes through another leaf. No path is contained in another, so none of the resulting binary codes can be a prefix of another.
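The practical payoff of this property is that a decoder can read the bits from left to right and emit a character as soon as the bits read so far match a code. The sketch below assumes the code table obtained for the weights 2, 3, 7, 9, 18, 25 when the smaller node always becomes the left child, left branches are 0, and right branches are 1. A tree drawn differently would give different but equally valid codes. PrefixDecodeDemo and decode are names invented for this example.

```java
import java.util.HashMap;
import java.util.Map;

class PrefixDecodeDemo {
    // Assumed Huffman code table (smaller node on the left, left = 0, right = 1)
    static final Map<String, Character> TABLE = new HashMap<>();
    static {
        TABLE.put("11100", 'A');
        TABLE.put("11101", 'B');
        TABLE.put("1111", 'C');
        TABLE.put("110", 'D');
        TABLE.put("10", 'E');
        TABLE.put("0", 'F');
    }

    // Reads bits left to right; emits a character as soon as a code matches
    static String decode(String bits) {
        StringBuilder out = new StringBuilder();
        StringBuilder current = new StringBuilder();
        for (char bit : bits.toCharArray()) {
            current.append(bit);
            Character symbol = TABLE.get(current.toString());
            if (symbol != null) {
                out.append(symbol);
                current.setLength(0); // start collecting the next code
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The codes of A, B, E, F, C, D, A, E, D written one after another
        String bits = "11100" + "11101" + "10" + "0" + "1111" + "110" + "11100" + "10" + "110";
        System.out.println(decode(bits)); // ABEFCDAED
    }
}
```

No separators are needed between codes, because a match can never be the beginning of a longer code.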

Secondly, can the code generated in this way guarantee a minimum total length? The answer is yes.

The key property of the Huffman tree is that the sum of (weight × path length) over all leaf nodes is minimal.

In the scenario of information encoding, the weight of the leaf node corresponds to the frequency of occurrence of the character, and the path length of the node corresponds to the encoding length of the character.

The sum of (frequency × code length) over all characters is therefore minimal, which means that the total code length is minimal.
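To make the sum concrete, the sketch below computes it for the example frequencies 2, 3, 7, 9, 18, 25. The Huffman code lengths 5, 5, 4, 3, 2, 1 are the leaf depths obtained by building the tree for these weights by hand, so treat them as an assumption of this example. CodeLengthDemo and totalBits are hypothetical names.

```java
class CodeLengthDemo {
    // Sum of (frequency x code length) over all characters
    static int totalBits(int[] frequencies, int[] codeLengths) {
        int total = 0;
        for (int i = 0; i < frequencies.length; i++) {
            total += frequencies[i] * codeLengths[i];
        }
        return total;
    }

    public static void main(String[] args) {
        int[] frequencies = {2, 3, 7, 9, 18, 25};   // A, B, C, D, E, F
        int[] huffmanLengths = {5, 5, 4, 3, 2, 1};  // assumed leaf depths in the Huffman tree
        int[] fixedLengths = {3, 3, 3, 3, 3, 3};    // equal-length 3-bit codes

        System.out.println(totalBits(frequencies, huffmanLengths)); // 141
        System.out.println(totalBits(frequencies, fixedLengths));   // 192
    }
}
```

The rare characters A and B get the longest codes, yet the total is well below that of the 3-bit equal-length scheme.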

import java.util.PriorityQueue;
import java.util.Queue;

public class HuffmanCode {

    private Node root;
    private Node[] nodes;

    // Build the Huffman tree
    public void createHuffmanTree(int[] weights) {
        // Priority queue that helps build the Huffman tree
        Queue<Node> nodeQueue = new PriorityQueue<>();
        nodes = new Node[weights.length];

        // Build the forest and initialize the nodes array
        for (int i = 0; i < weights.length; i++) {
            nodes[i] = new Node(weights[i]);
            nodeQueue.add(nodes[i]);
        }

        // Main loop: ends when only one node is left in the queue
        while (nodeQueue.size() > 1) {
            // Take the two nodes with the smallest weights from the queue
            Node left = nodeQueue.poll();
            Node right = nodeQueue.poll();
            // Create a new node as the parent of the two nodes
            Node parent = new Node(left.weight + right.weight, left, right);
            nodeQueue.add(parent);
        }
        root = nodeQueue.poll();
    }

    // Given a character's index, return the corresponding Huffman code
    public String convertHuffmanCode(int index) {
        return nodes[index].code;
    }

    // Recursively fill in the binary code of each node
    public void encode(Node node, String code) {
        if (node == null) {
            return;
        }
        node.code = code;
        encode(node.lChild, node.code + "0");
        encode(node.rChild, node.code + "1");
    }

    public static class Node implements Comparable<Node> {
        int weight;
        // Binary code corresponding to this node
        String code;
        Node lChild;
        Node rChild;

        public Node(int weight) {
            this.weight = weight;
        }

        public Node(int weight, Node lChild, Node rChild) {
            this.weight = weight;
            this.lChild = lChild;
            this.rChild = rChild;
        }

        // Order nodes by weight so that the priority queue returns the lightest first
        @Override
        public int compareTo(Node o) {
            return Integer.compare(this.weight, o.weight);
        }
    }

    public static void main(String[] args) {
        char[] chars = {'A', 'B', 'C', 'D', 'E', 'F'};
        int[] weights = {2, 3, 7, 9, 18, 25};
        HuffmanCode huffmanCode = new HuffmanCode();
        huffmanCode.createHuffmanTree(weights);
        huffmanCode.encode(huffmanCode.root, "");
        for (int i = 0; i < chars.length; i++) {
            System.out.println(chars[i] + ":" + huffmanCode.convertHuffmanCode(i));
        }
    }
}

In this code, the Node class has a new field, code, which records the binary code corresponding to the node.

After the Huffman tree is constructed, you can recursively fill the code value of each node from the root node down.