【Data Structure and Algorithm】Huffman Tree and Huffman Coding

A detailed explanation of the Huffman tree and of the principles and methods for constructing Huffman codes, together with code implementations.

1. Basic concepts of the Huffman tree

Path: the sequence of branches leading from one node to another in the tree forms a path between the two nodes.

Path length: the number of branches on the path between two nodes.

Path length of the tree: the sum of the path lengths from the root to every node in the tree, written TL.

 

Weight: a value with a specific meaning assigned to a node in the tree (the meaning depends on the application of the tree). For example, in a decision tree over exam-score ranges, 5% could be the proportion of people whose scores fall in the corresponding range.

Weighted path length of a node: the product of the path length from the root to that node and the node's weight.

Weighted path length of the tree: the sum of the weighted path lengths of all leaf nodes in the tree, denoted WPL.
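
Written as a formula: if the tree has n leaves, and leaf k has weight wk and path length lk from the root, then WPL = w1·l1 + w2·l2 + ... + wn·ln.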


Huffman tree: the optimal tree, i.e. the tree with the smallest weighted path length (WPL).

"The shortest weighted path length" is the result of comparison among trees with "same degree", so it is called the optimal binary tree and the optimal ternary tree.

Huffman tree: the optimal binary tree, i.e. the binary tree with the smallest weighted path length (WPL). Because the algorithm for constructing such a tree was proposed by Huffman in 1952, the tree is called the Huffman tree and the corresponding algorithm the Huffman algorithm.

2. Huffman tree construction algorithm

Huffman algorithm (the method for constructing a Huffman tree):

(1) From the n given weights (W1, W2, ..., Wn), construct a forest of n binary trees F = (T1, T2, ..., Tn), where each Ti consists of a single root node with weight Wi and empty left and right subtrees.

Mnemonic: the initial forest is nothing but roots.

(2) In F, select the two trees whose root nodes have the smallest weights and use them as the left and right subtrees of a new binary tree; set the weight of the new root to the sum of the weights of the roots of the two subtrees.

Mnemonic: pick the two smallest to build a new tree.

(3) Delete the two selected trees from F and add the newly constructed binary tree to the forest.

Mnemonic: delete the two smallest, add the newcomer.

(4) Repeat (2) and (3) until there is only one tree in the forest, which is the Huffman tree.

Mnemonic: repeat (2) and (3) until a single root remains.
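
For example, given the four weights 1, 3, 5 and 7: first merge 1 and 3 into a new root of weight 4; then merge 4 and 5 into a root of weight 9; finally merge 9 and 7 into the root of weight 16. The leaves with weights 7, 5, 3 and 1 end up at depths 1, 2, 3 and 3, so WPL = 7×1 + 5×2 + 3×3 + 1×3 = 29, and the tree contains 2×4 - 1 = 7 nodes in total.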

 

Summary

1. In the Huffman algorithm, initially there are n binary trees, which need to be merged n-1 times to finally form a Huffman tree.

2. The n-1 merges generate n-1 new nodes, all of which are branch nodes with exactly two children.

It follows that the Huffman tree has n + (n-1) = 2n-1 nodes in total, and none of its nodes has degree 1.

3. Code implementation for building a Huffman tree

3.1 Node structure of the Huffman tree

When building a Huffman tree, it is first necessary to determine the composition of the nodes in the tree.

Because constructing the Huffman tree starts from the leaf nodes and repeatedly creates new parent nodes until the root is reached, each node should record its parent. When the tree is used, however, traversal starts from the root, so each node also needs to record its left and right children. Since the tree is stored in an array, these are recorded as array indices rather than pointers.

// node structure of the Huffman tree
typedef struct {
    int weight; // node weight
    int parent, left, right; // array indices of the parent, left child and right child
}HTNode, *HuffmanTree;

 

3.2 Finding the two minimum-weight nodes

When building a Huffman tree, each round must select the two nodes with the smallest weights and combine them into a new binary tree.

The idea for finding the two nodes with the smallest weights is: starting from the beginning of the array, first find two nodes that have no parent (meaning they have not yet been used to build a tree), then compare each subsequent parentless node against these two. Two cases need to be handled:

  • If the new node's weight is smaller than the smaller of the two, it becomes the new minimum and the original larger node is discarded;
  • If its weight lies between the two current values, it replaces the larger of the two;
// HT is the array that stores the Huffman tree, end is the index of the last node currently stored in HT,
// s1 and s2 return the array positions of the two minimum-weight nodes
void Select(HuffmanTree HT, int end, int *s1, int *s2)
{
    int min1, min2;
    // array traversal starts at index 1 (index 0 is unused)
    int i = 1;
    // find the first node that has not yet been used to build a tree (parent == 0)
    while(i <= end && HT[i].parent != 0){
        i++;
    }
    min1 = HT[i].weight;
    *s1 = i;
   
    i++;
    // find the next unused node
    while(i <= end && HT[i].parent != 0){
        i++;
    }
    // compare the two nodes found: min1 holds the smaller weight, min2 the larger
    if(HT[i].weight < min1){
        min2 = min1;
        *s2 = *s1;
        min1 = HT[i].weight;
        *s1 = i;
    }else{
        min2 = HT[i].weight;
        *s2 = i;
    }
    // compare the two candidates with every remaining node that is not yet part of a tree
    for(int j=i+1; j <= end; j++)
    {
        // skip nodes that already have a parent
        if(HT[j].parent != 0){
            continue;
        }
        // smaller than the current minimum: min1 shifts to min2, the new node becomes the minimum
        if(HT[j].weight < min1){
            min2 = min1;
            min1 = HT[j].weight;
            *s2 = *s1;
            *s1 = j;
        }
        // between the two current values: replace the larger one
        else if(HT[j].weight >= min1 && HT[j].weight < min2){
            min2 = HT[j].weight;
            *s2 = j;
        }
    }
}

3.3 Construction algorithm implementation

// HT is the address of the array that will store the Huffman tree, w is the array of node weights, n is the number of leaf nodes
void CreateHuffmanTree(HuffmanTree *HT, int *w, int n)
{
    if(n<=1) return; // with a single weight there is nothing to build (the only character would simply be coded as 0)
    int m = 2*n-1; // total number of nodes in the Huffman tree; the n given weights are the leaves
    *HT = (HuffmanTree) malloc((m+1) * sizeof(HTNode)); // index 0 of the array is left unused
    HuffmanTree p = *HT;
    // initialize the leaf nodes (array indices 1 .. n)
    for(int i = 1; i <= n; i++)
    {
        (p+i)->weight = *(w+i-1);
        (p+i)->parent = 0;
        (p+i)->left = 0;
        (p+i)->right = 0;
    }
    // starting at array index n+1, initialize the non-leaf nodes
    for(int i = n+1; i <= m; i++)
    {
        (p+i)->weight = 0;
        (p+i)->parent = 0;
        (p+i)->left = 0;
        (p+i)->right = 0;
    }
    // build the Huffman tree: each round merges the two smallest parentless nodes
    for(int i = n+1; i <= m; i++)
    {
        int s1, s2;
        Select(*HT, i-1, &s1, &s2); // find the two minimum-weight parentless nodes among indices 1 .. i-1
        (*HT)[s1].parent = (*HT)[s2].parent = i;
        (*HT)[i].left = s1;
        (*HT)[i].right = s2;
        (*HT)[i].weight = (*HT)[s1].weight + (*HT)[s2].weight;
    }
}
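
As a quick sanity check, a minimal driver for the functions above might look like the following sketch (the weights {1, 3, 5, 7} are assumed purely for illustration):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int w[] = {1, 3, 5, 7}; // example weights, chosen only for illustration
    int n = 4;
    HuffmanTree HT;
    CreateHuffmanTree(&HT, w, n);
    // print every node of the array: its weight plus the indices of parent, left child and right child
    for (int i = 1; i <= 2*n - 1; i++) {
        printf("node %d: weight=%d parent=%d left=%d right=%d\n",
               i, HT[i].weight, HT[i].parent, HT[i].left, HT[i].right);
    }
    free(HT);
    return 0;
}

With these weights, node 7 should come out as the root with weight 16 and parent 0.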

4. Huffman coding

4.1 Basic concepts of Huffman coding

In telecommunication, the characters to be transmitted must be converted into binary strings.

If the code is designed with variable lengths, so that the characters that appear more frequently in the message receive shorter codes, then the total length of the resulting binary string can be reduced.

Key point: when designing variable-length codes, the code of one character must never be a prefix of the code of another character. Such a code is called a prefix code.

Question: What kind of prefix code can make the total length of the message the shortest?

Huffman coding method:

1. Compute the probability with which each character of the character set appears in the message (the higher the probability, the shorter the code should be).

2. Use the characteristic of the Huffman tree that leaves with larger weights lie closer to the root: take each character's probability as its weight and build a Huffman tree. Characters with higher probability then have shorter paths.

3. Mark 0 or 1 on each branch of the Huffman tree:

The left branch of the node is marked 0, and the right branch is marked 1

Concatenate the labels along the path from the root to each leaf to obtain the code of the character represented by that leaf.
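
For example, for four characters a, b, c and d with weights 7, 5, 3 and 1, one possible assignment (the exact 0/1 labels depend on how ties and left/right placement are resolved) is a = 0, b = 11, c = 101, d = 100: the most frequent character gets the shortest code, and no code is a prefix of another.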

 Two questions:

1. Why is a Huffman code guaranteed to be a prefix code?

Since no leaf is an ancestor of another leaf, no leaf's code can be a prefix of another leaf's code. (All characters are leaf nodes, so the path from the root to one character never passes through another character.)

2. Why does Huffman coding guarantee the shortest total encoded length?

Because the Huffman tree has the smallest weighted path length, the total length of the encoded message is the shortest.

Property 1: A Huffman code is a prefix code.

Property 2: A Huffman code is an optimal prefix code.

4.2 Code implementation of Huffman coding

There are two ways for a program to derive the Huffman codes:

  1. Walk from each leaf node up to the root, recording the branch labels passed along the way in reverse. For example, to find the Huffman code of character c in Figure 3, walk from node c up to the root; the labels encountered are 0 1 1, so the Huffman code of c is 1 1 0 (output in reverse order).
  2. Walk from the root node down to each leaf node, recording the branch labels passed along the way. For example, to find the Huffman code of character c in Figure 3, start from the root node; the sequence obtained is 1 1 0.

The implementation code using method 1 is:

typedef char **HuffmanCode; // dynamic array of code strings, one per character

// HT is the Huffman tree, HC returns the array of Huffman code strings, n is the number of leaf nodes
void HuffmanCoding(HuffmanTree HT, HuffmanCode *HC, int n){
    *HC = (HuffmanCode) malloc((n+1) * sizeof(char *));
    char *cd = (char *)malloc(n*sizeof(char)); // working buffer for one node's Huffman code
    cd[n-1] = '\0'; // string terminator
   
    for(int i=1; i<=n; i++){
        // starting from the leaf, the code is obtained in reverse order, so the buffer is filled from the back
        int start = n-1;
        // position of the current node in the array
        int c = i;
        // position of the current node's parent in the array
        int j = HT[i].parent;
        // climb until the root is reached
        while(j != 0){
            // left child of the parent: branch label 0; otherwise it is the right child: branch label 1
            if(HT[j].left == c)
                cd[--start] = '0';
            else
                cd[--start] = '1';
            // treat the parent as the new child and keep moving toward the root
            c = j;
            j = HT[j].parent;
        }
        // after the loop, the code of this node starts at index start of cd
        (*HC)[i] = (char *)malloc((n-start)*sizeof(char));
        strcpy((*HC)[i], &cd[start]);
    }
    // the cd buffer allocated with malloc must be released manually
    free(cd);
}
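
Combining this with the construction code from Section 3, a minimal driver might print the codes as in the sketch below (the weights and the characters 'a' to 'd' mapped to them are assumptions made only for this example):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    int w[] = {1, 3, 5, 7};     // example weights, assumed for illustration
    const char *chars = "abcd"; // hypothetical characters matching the weights one-to-one
    int n = 4;
    HuffmanTree HT;
    HuffmanCode HC;

    CreateHuffmanTree(&HT, w, n); // build the Huffman tree from the weights
    HuffmanCoding(HT, &HC, n);    // derive the Huffman code of every leaf

    for (int i = 1; i <= n; i++) {
        printf("%c (weight %d): %s\n", chars[i-1], w[i-1], HC[i]);
    }

    // release all dynamically allocated memory
    for (int i = 1; i <= n; i++) free(HC[i]);
    free(HC);
    free(HT);
    return 0;
}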

The implementation code using method 2 is:

// HT is the Huffman tree, HC returns the array of Huffman code strings, n is the number of leaf nodes
void HuffmanCoding(HuffmanTree HT, HuffmanCode *HC, int n){
    *HC = (HuffmanCode) malloc((n+1) * sizeof(char *));
    int m=2*n-1;
    int p=m;
    int cdlen=0;
    char *cd = (char *)malloc(n*sizeof(char));
    // reuse each node's weight field to count how many times the node has been visited; initialize all counters to 0
    for (int i=1; i<=m; i++) {
        HT[i].weight=0;
    }
    // p starts at m, i.e. at the root; the traversal ends when p becomes 0
    while (p) {
        // the current node has not been visited yet
        if (HT[p].weight==0) {
            HT[p].weight=1; // mark it as visited once
            // if there is a left child, descend to it and record branch label 0
            if (HT[p].left!=0) {
                p=HT[p].left;
                cd[cdlen++]='0';
            }
            // no left child and no right child: the node is a leaf, record its Huffman code
            else if(HT[p].right==0){
                (*HC)[p]=(char*)malloc((cdlen+1)*sizeof(char));
                cd[cdlen]='\0';
                strcpy((*HC)[p], cd);
            }
        }
        // weight 1 means the node was visited once, i.e. control has returned from its left subtree
        else if(HT[p].weight==1){
            HT[p].weight=2; // mark it as visited twice
            // if there is a right child, descend to it and record branch label 1
            if (HT[p].right!=0) {
                p=HT[p].right;
                cd[cdlen++]='1';
            }
        }
        // visited twice: both subtrees are finished, return to the parent
        else{
            HT[p].weight=0;
            p=HT[p].parent;
            --cdlen;
        }
    }
    free(cd); // release the working buffer allocated with malloc
}
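
Note that this second version repurposes the weight field of every node as a visit counter, so the original weights are overwritten while the codes are produced; if the weights are still needed afterwards, they have to be saved before calling the function, whereas the first version leaves the tree untouched.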


Origin blog.csdn.net/freestep96/article/details/125969271