[DSA] Tree-Huffman Tree Detailed Explanation (3)

What is a Huffman tree

  • The definition from Baidu Encyclopedia
    Given N weights as N leaf nodes, construct a binary tree. If the weighted path length of the tree is the minimum possible, the tree is called an optimal binary tree, also known as a Huffman tree. In other words, the Huffman tree is the tree with the shortest weighted path length, and leaves with larger weights sit closer to the root.
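The weighted path length in the definition can be written out explicitly. As a sketch, with w_i the weight of leaf i and l_i its depth (the number of edges between the root and that leaf); these symbols are my own notation, not part of the quoted definition:

```latex
\mathrm{WPL} = \sum_{i=1}^{N} w_i \, l_i
```

For three leaves with weights 1, 2 and 4, joining 1 and 2 first puts them at depth 2 and leaves 4 at depth 1, so WPL = 1*2 + 2*2 + 4*1 = 10. Joining 2 and 4 first instead gives 1*1 + 2*2 + 4*2 = 13, so the first tree is the better of the two.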

This definition is rigorous but rather academic, and it is not easy to grasp at first reading. So let's skip the theory for now, look directly at how a Huffman tree is built, and understand the abstract concept through a concrete example.

How to build a Huffman tree

Below, we use an example of Huffman coding to understand the Huffman tree.

  • Background problem
    An English article needs to be sent from A to B, and the encoded message should be as short as possible. There are 26 English letters; counting upper and lower case separately gives 52 symbols, so a fixed-length code needs 6 bits per letter (2^5 = 32 < 52 < 2^6 = 64). If the article has 10,000 letters, the encoded length is 10000 * 6 = 60,000 bits. But letters do not occur equally often in a text, which suggests using codes of different lengths for letters of different frequencies: the most frequent letters get the shortest codes and the least frequent letters get the longest codes. This minimizes the total length of the encoded article and improves transmission efficiency.

Take the following character frequency table as an example:

letter     A   B   C   D   E   F   G   H
frequency  80  30  20  75  40  8   55  60

From this table we expect A, the most frequent letter, to get one of the shortest codes and F, the least frequent, one of the longest. The frequencies here are made up by me for the example; real letter frequencies have been measured and are a classic tool in cryptography (you can refer to letter frequency).

  • Construction
    Step 1: Select the two letters with the lowest frequencies, F (8) and C (20), and make them the children of a new node: the smaller one is the left child and the larger one is the right child. The new node's weight is the sum of the two frequencies (28), and it goes back into the frequency table in place of F and C.
    Step 2: Repeat this operation until only one node is left.
    The table now looks as follows. The two smallest weights are FC (28) and B (30); since 28 is less than 30, FC takes the left child's position.

letter     A   B   D   E   FC  G   H
frequency  80  30  75  40  28  55  60

After merging FC and B into FCB (58), the two smallest weights are 40 and 55 (E and G):

letter     A   D   E   FCB  G   H
frequency  80  75  40  58   55  60

After merging E and G into EG (95), the two smallest weights are 58 and 60 (FCB and H):

letter     A   D   FCB  EG  H
frequency  80  75  58   95  60

After merging FCB and H into FCBH (118), the two smallest weights are 75 and 80 (D and A):

letter     A   D   FCBH  EG
frequency  80  75  118   95

After merging D and A into AD (155), the two smallest weights are 95 and 118 (EG and FCBH):

letter     AD   FCBH  EG
frequency  155  118   95

After merging EG and FCBH into FCBHEG (213), only two nodes remain:

letter     AD   FCBHEG
frequency  155  213

Merging the last two nodes, AD (155) and FCBHEG (213), gives the root with weight 368, and the construction of the Huffman tree is complete. So how is the encoding mentioned above obtained? Mark every left branch with 0 and every right branch with 1. The sequence of 0s and 1s along the path from the root down to the leaf holding a letter is the Huffman code of that letter. The following table lists the final codes of all letters:

letter code
A 01
B 1101
C 11001
D 00
E 100
F 11000
G 101
H 111

For example, to send the three letters ABC, the code is 01 1101 11001, a total of 11 bits; with the fixed 6-bit encoding from the beginning, the length would be 18 bits.

Huffman tree code implementation

When constructing a Huffman tree, each round has to pick the two nodes with the smallest weights among the nodes not yet used, and then join them under a new parent node.
The idea for finding these two nodes is: scan the tree array from the beginning and take the first two nodes that have no parent (meaning they have not been used to build a subtree yet) as the current candidates. Then compare every later node without a parent against them in turn. There are two cases to consider:

  • If its weight is smaller than the smaller of the two candidates, it becomes the new smallest node, the old smallest becomes the second smallest, and the old larger candidate is dropped;
  • If its weight lies between the weights of the two candidates, it replaces the larger candidate;

Huffman tree node data structure

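#include <stdio.h>
#include <stdlib.h>

// Huffman tree node
typedef int Type;

typedef struct HuffmanNode_
{
    Type  weight; // node weight
    Type  parent, left, right; // array indices of the parent, left child and right child (0 = none)

}Node, *HuffmanTree;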
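// Select the two nodes with the smallest weights.
// HT is the array holding the Huffman tree, end is the last index in HT to search,
// and pos1 / pos2 receive the indices of the smallest and second smallest nodes without a parent.
// Note: the name select may clash with the POSIX select() that some system headers declare;
// rename the function if your compiler reports conflicting types.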

void select(HuffmanTree HT, int *pos1, int *pos2, int end)
{
    int min1 = 0, min2 = 0;

    int i = 1; // index 0 is not used, because a parent index of 0 means "no parent"

    // find the first node that has not been joined into a subtree yet
    while (i <= end && HT[i].parent != 0)
    {
        i++;
    }
    min1 = HT[i].weight;
    *pos1 = i;

    i++;
    // find the second node that has not been joined into a subtree yet
    while (i <= end && HT[i].parent != 0)
    {
        i++;
    }

    min2 = HT[i].weight;
    if (min2 < min1)
    {
        min2 = min1;
        *pos2 = *pos1;
        min1 = HT[i].weight;
        *pos1 = i;
    }
    else
    {
        *pos2 = i;
    }

    // compare the two candidates with every remaining node without a parent to end up with the two smallest
    for (int j = i+1; j <= end; ++j)
    {
        // a node that already has a parent has been used, so skip it
        if (HT[j].parent != 0)
        {
            continue;
        }

        // smaller than min1: the old min1 becomes min2 and the new node becomes min1
        if (HT[j].weight < min1)
        {
            min2 = min1;
            min1 = HT[j].weight;
            *pos2 = *pos1;
            *pos1 = j;
        }
        else if (HT[j].weight < min2)
        {
            // at least min1 but smaller than min2 (this also covers a weight equal to min1)
            min2 = HT[j].weight;
            *pos2 = j;
        }
    }
}
// Allocate and initialize the array that stores the Huffman tree.
// weight is the array of leaf weights and node_num is the number of leaves.

HuffmanTree init_huffman_tree(Type *weight, int node_num)
{
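    if (node_num <= 1)
    {
        // with a single node there is nothing to build: its code is simply 0
        return NULL;
    }

    int tree_node_num = node_num * 2 - 1; // n leaves plus n-1 internal nodes
    HuffmanTree p = (HuffmanTree)malloc((tree_node_num+1) * sizeof(Node)); // +1 because index 0 is not used
    if (NULL == p)
    {
        return NULL;
    }

    // initialize every node in the array
    for (int i = 1; i <= tree_node_num; ++i)
    {
        if (i <= node_num)
        {
            (p+i)->weight = *(weight+i-1); // leaf i takes weight[i-1], since index 0 is not used
        }
        else
        {
            (p+i)->weight = 0;
        }

        (p+i)->parent = 0;
        (p+i)->left = 0;
        (p+i)->right = 0;
    }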

    return p;
}

void close_huffman_tree(HuffmanTree HT)
{
    if (HT)
    {
        free(HT);
        // HT is a local copy of the pointer, so the caller's pointer is not changed here
    }
}

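// Build the complete Huffman tree in the array prepared by init_huffman_tree
void create_huffman_tree(HuffmanTree HT, int node_num)
{
    if (NULL == HT || node_num <= 1)
    {
        return;
    }

    int tree_node_num = node_num * 2 - 1; // n leaves plus n-1 internal nodes
    for (int i = node_num + 1; i <= tree_node_num; ++i)
    {
        int pos1 = -1, pos2 = -1;
        // find the two nodes without a parent that have the smallest weights
        select(HT, &pos1, &pos2, i-1);
        printf("two smallest nodes now [%d %d]\n", HT[pos1].weight, HT[pos2].weight);
        // parent/child relations are stored as array indices
        HT[pos1].parent = HT[pos2].parent = i; // node i becomes the parent of the nodes at pos1 and pos2
        HT[i].left = pos1;  // the smaller node is the left child
        HT[i].right = pos2; // the larger node is the right child
        HT[i].weight = HT[pos1].weight + HT[pos2].weight; // parent weight is the sum of both children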
    }
}
  • Test code
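void print(HuffmanTree HT, int node_num)
{
    if (NULL == HT)
    {
        printf("the array is empty\n");
        return;
    }

    int tree_node_num = node_num * 2 - 1;

    for (int i = 1; i <= tree_node_num; ++i)
    {
        // the root has parent index 0, which is not a real node, so print 0 for it
        int parent_weight = HT[i].parent != 0 ? HT[HT[i].parent].weight : 0;
        printf("%d parent: %d left child: %d right child: %d\n", HT[i].weight, parent_weight, HT[i].left, HT[i].right);
    }

}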

int main(int argc, char const *argv[])
{
    
    Type weight[8] = {80, 30, 20, 75, 40, 8, 55, 60};

    int node_num = sizeof(weight) / sizeof(Type);

    HuffmanTree HT = init_huffman_tree(weight, node_num);

    create_huffman_tree(HT, node_num);

    print(HT, node_num);

    close_huffman_tree(HT);

    return 0;
}

Why design a Huffman tree

  • The Huffman tree is mainly used for Huffman coding. It builds the code from symbol frequencies so that high-frequency data gets short codes and low-frequency data gets long codes.
  • Huffman coding does not suit every scenario; it pays off most when the symbol frequencies differ considerably.

Origin blog.csdn.net/jobbofhe/article/details/102502565