DSAA Supplement: Huffman Coding (Part 2)

1. Why do we need encoding?

  • In data communication we always want the total length of a transmitted message to be as short as possible. In practice, characters occur with different frequencies; for example, A, B, and C are used far more often than X, Y, and Z. It is therefore natural, when designing a code, to give short codewords to frequently used characters and long codewords to rarely used ones, so as to optimize the encoding of the whole message.
  • To make such a variable-length code a prefix code (i.e., no character's codeword may be a prefix of any other character's codeword), we can build a coding binary tree whose leaves are the characters of the alphabet. To obtain the shortest transmitted message, each character's frequency is assigned to its leaf as a weight: the lower the frequency, the smaller the weight, and the smaller the weight, the deeper the leaf. Thus rare characters get long codewords and frequent characters get short ones, and the minimum weighted path length of this tree is exactly the shortest possible length of the transmitted message. The problem of finding the shortest message length is therefore transformed into the problem of building a Huffman tree whose leaves are all the characters of the alphabet, weighted by their frequencies. Using a Huffman tree to design a binary prefix code both satisfies the prefix condition and guarantees the shortest total encoded length.

  Consider this scenario carefully: to minimize the total length of the transmitted message, we want each character's codeword length to be determined by how often it is used; this is what yields the shortest total message. The WPL (the weighted path length of a tree, defined as the sum of the weighted path lengths of all its leaf nodes) therefore represents the shortest total length of the transmitted (encoded) message for a given input. Understanding this point is important, because the Huffman tree is merely the strategy for achieving that goal.
  From this we can infer two properties (a small worked example follows the list below):

  • The larger a character's weight, the shorter its path, i.e., the closer it is to the root.
  • The smaller a character's weight, the longer its path, i.e., the farther it is from the root.
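
  For instance (an illustrative example of my own, with made-up frequencies): take four characters A, B, C, D with weights 5, 4, 2, 1. Every Huffman tree for these weights places them at depths 1, 2, 3, 3, so

  $$\mathrm{WPL} = 5\cdot 1 + 4\cdot 2 + 2\cdot 3 + 1\cdot 3 = 22,$$

  i.e., 22 bits to transmit the whole 12-character message, versus 12 × 2 = 24 bits if every character were given a fixed 2-bit codeword.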

2. Basic Definitions

  • Huffman coding assigns codes to characters such that the length of the code depends on the relative frequency or weight of the corresponding character. (This echoes the discussion above.)
    • Huffman codes are of variable length and prefix-free (no code is a prefix of any other). Any prefix-free binary code can be visualized as a binary tree with the encoded characters stored at the leaves. (Each character, being a leaf, has a code that can be read directly off the Huffman tree.)
  • Huffman coding tree or Huffman tree is a full binary tree in which each leaf of the tree corresponds to a letter in the given alphabet.
  • WPL: Define the weighted path length of a leaf to be its weight times its depth (restated as a formula after this list).
    • The Huffman tree is the binary tree with minimum external path weight, i.e., the one with the minimum sum of weighted path lengths for the given set of leaves. So the goal is to build a tree with the minimum external path weight.
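
  Putting the two definitions together in formula form: if leaf $i$ has weight $w_i$ and depth $l_i$, the weighted path length of the tree is

  $$\mathrm{WPL} = \sum_{i=1}^{n} w_i\, l_i,$$

  and the Huffman tree is the binary tree that minimizes this sum for the given set of n leaves.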

3. Huffman tree building

  • Prepare a collection of n initial Huffman trees, each of which is a single leaf node.
  • Put the n trees onto a priority queue organized by weight (frequency).
  • Remove the first two trees (the ones with lowest weight). Join these two trees to create a new tree whose root has the two trees as children, and whose weight is the sum of the weights of the two children trees.
  • Put this new tree into the priority queue.
  • Repeat the previous two steps (remove-and-merge, then re-insert) until all of the partial Huffman trees have been combined into one.

  The original article gives the code implementation directly; here I express it in pseudocode (a runnable C++ sketch of my own follows the pseudocode):

    for each leaf in array:                         // push one single-node tree per character
        p_queue_push(leaf)
    while p_queue_size() != 1:                      // repeat until a single tree remains
        p_queue_push(tree_merge(p_queue_pop(), p_queue_pop()))
    ans = p_queue_pop()                             // the Huffman tree
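
  To make the sketch concrete, here is a minimal, self-contained C++ version (my own illustration, not the code from the original article; Node, buildHuffman, and the left = 0 / right = 1 printing convention are assumptions of this sketch):

    // Build a Huffman tree with a min-priority queue and print each
    // character's code (left edge = 0, right edge = 1).
    #include <iostream>
    #include <queue>
    #include <string>
    #include <utility>
    #include <vector>

    struct Node {
        long long weight;                 // character frequency, or sum for internal nodes
        char ch;                          // meaningful only for leaves
        Node* left;
        Node* right;
        Node(long long w, char c, Node* l = nullptr, Node* r = nullptr)
            : weight(w), ch(c), left(l), right(r) {}
    };

    // Comparator that makes std::priority_queue behave as a min-heap on weight.
    struct ByWeight {
        bool operator()(const Node* a, const Node* b) const { return a->weight > b->weight; }
    };

    Node* buildHuffman(const std::vector<std::pair<char, long long>>& freq) {
        std::priority_queue<Node*, std::vector<Node*>, ByWeight> pq;
        for (const auto& p : freq) pq.push(new Node(p.second, p.first));  // one leaf per character
        while (pq.size() > 1) {                                           // greedy merge
            Node* a = pq.top(); pq.pop();                                 // two lowest-weight trees
            Node* b = pq.top(); pq.pop();
            pq.push(new Node(a->weight + b->weight, '\0', a, b));         // merged tree
        }
        return pq.top();                                                  // the Huffman tree
    }

    // The path from the root to a leaf spells out that character's codeword.
    void printCodes(const Node* t, const std::string& code) {
        if (t == nullptr) return;
        if (t->left == nullptr && t->right == nullptr) {
            std::cout << t->ch << ": " << code << "\n";
            return;
        }
        printCodes(t->left, code + "0");
        printCodes(t->right, code + "1");
    }

    int main() {
        // Hypothetical frequencies, chosen only to illustrate the algorithm.
        std::vector<std::pair<char, long long>> freq = {{'A', 5}, {'B', 4}, {'C', 2}, {'D', 1}};
        printCodes(buildHuffman(freq), "");
        return 0;                         // nodes are intentionally not freed in this sketch
    }

  With these frequencies the greedy merges are 1+2 = 3, 3+4 = 7, 7+5 = 12, so the code lengths come out as 1, 2, 3, 3 (the exact bits depend on which child becomes left or right).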

  Why does this procedure yield the optimal binary tree? The cited reference offers an explanation:

  • It’s a greedy algorithm: at each iteration, the algorithm makes a “greedy” decision to merge the two subtrees with least weight. Does it give the desired result?
    • Lemma: Let x and y be the two least frequent characters. There is an optimal code tree in which x and y are siblings whose depth is at least as great as that of any other leaf node in the tree (a sketch of the underlying exchange argument follows this list).
    • Theorem: Huffman codes are optimal prefix-free binary codes (The greedy algorithm builds the Huffman tree with the minimum external path weight for a given set of letters).
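
  A brief exchange argument behind the lemma (a standard sketch, not spelled out in the cited text): if a leaf $x$ with weight $w_x$ sits at depth $d_x$ and another leaf $a$ with weight $w_a \ge w_x$ sits at depth $d_a \ge d_x$, swapping the two leaves changes the WPL by

  $$(w_x - w_a)(d_a - d_x) \le 0,$$

  so the swap never increases the WPL. The two least frequent characters can therefore always be pushed down to the deepest level (as siblings) without losing optimality, which is exactly what each greedy merge does.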

4. Final Remarks

  With the material above you can already handle the typical multiple-choice questions on Huffman coding: computing a tree's WPL, computing a single node's weighted path length, and encoding or decoding by hand. As for why the construction above is valid, it can be understood through the second point of "Why do we need encoding?" at the beginning: because each step merges the two trees whose roots have the smallest weights, low-weight leaves end up deeper and high-weight leaves end up shallower, which overall minimizes the tree's WPL. A small encoding/decoding example follows.
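
  For example (an illustration of my own, continuing the hypothetical frequencies A:5, B:4, C:2, D:1 used earlier): one optimal code table is A = 0, B = 10, C = 110, D = 111 (the exact bits depend on tie-breaking and on which child is labeled 0 or 1, but the lengths 1, 2, 3, 3 are forced). Encoding the message ABAD then gives 0 10 0 111 = 0100111, i.e., 7 bits instead of the 8 bits a fixed 2-bit code would need; decoding simply walks the tree from the root, following 0/1 edges and restarting at the root each time a leaf is reached.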

References

[1] Baidu Baike: 哈夫曼树 (Huffman tree)
[2] Huffman coding

Reposted from blog.csdn.net/LoveStackover/article/details/80559226