Huffman tree and Huffman coding (priority queue)

Author: Wu Jianquan

Affiliation: Chongqing University of Science and Technology

Problem description:

A Huffman tree, also known as an optimal binary tree, is a binary tree with the minimum weighted path length. The weighted path length (WPL) of a tree is the sum, over all leaf nodes, of each leaf's weight multiplied by its path length to the root (if the root is at level 0, the path length from a leaf to the root equals the leaf's level number). It is written WPL = W1×L1 + W2×L2 + W3×L3 + ... + Wn×Ln, where the n weights Wi (i = 1, 2, ..., n) are assigned to the n leaf nodes of a binary tree and Li (i = 1, 2, ..., n) is the path length of the corresponding leaf. It can be proved that a Huffman tree has the smallest WPL among all binary trees with those leaf weights.
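
As a quick illustration of the WPL formula, here is a minimal C++ sketch. The function name weightedPathLength is made up for this example, and the two vectors are assumed to have the same length.

```cpp
#include <cstddef>
#include <vector>

// Weighted path length: the sum of weights[i] * depths[i] over all leaves.
// Assumption for this sketch: both vectors have the same length.
long long weightedPathLength(const std::vector<int>& weights,
                             const std::vector<int>& depths)
{
    long long total = 0;
    for (std::size_t i = 0; i < weights.size(); ++i)
        total += static_cast<long long>(weights[i]) * depths[i];
    return total;
}
```

With the weights 15 19 10 6 38 12 and the code lengths 3 3 4 4 1 3 taken from the sample at the end of the problem statement, the function returns 240, which matches the sample output.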

In data communication, the text to be transmitted must be converted into a binary string, with different arrangements of 0 and 1 representing the characters. For example, suppose the message to be transmitted is "AFTER DATA EAR ARE ART AREA". The character set used here is "A, E, R, T, F, D", and the number of occurrences of each letter is {8, 4, 5, 3, 1, 1}. We now need to design a code for these letters. To distinguish 6 letters, the simplest binary encoding is a fixed-length code of 3 bits. You can use 000, 001, 010, 011, 100, and 101 to encode "A, E, R, T, F, D" respectively; the receiver then decodes the message three bits at a time. Clearly the code length depends on the number of distinct characters in the message: if 26 different characters may appear, the fixed code length is 5. However, when transmitting a message it is always desirable to keep the total length as short as possible. In practice, characters occur with different frequencies. For example, A, B, and C may be used much more often than some other letters, so it is natural to give short codes to frequently used characters and long codes to rarely used ones, which shortens the encoding of the whole message.
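
The fixed-length arithmetic above can be checked with a small C++ sketch. The helper names fixedCodeWidth and fixedLengthTotal are hypothetical, and spaces are assumed not to be transmitted, since the character set in the text contains only the six letters.

```cpp
#include <cstddef>
#include <set>
#include <string>

// Smallest number of bits b such that 2^b >= symbolCount.
int fixedCodeWidth(std::size_t symbolCount)
{
    int bits = 0;
    while ((std::size_t{1} << bits) < symbolCount) ++bits;
    return bits;
}

// Total number of bits needed to send `message` with a fixed-width code.
// Assumption for this sketch: spaces are skipped, not encoded.
std::size_t fixedLengthTotal(const std::string& message)
{
    std::set<char> alphabet;   // distinct characters seen so far
    std::size_t letters = 0;   // number of characters to transmit
    for (char c : message) {
        if (c == ' ') continue;
        alphabet.insert(c);
        ++letters;
    }
    return letters * fixedCodeWidth(alphabet.size());
}
```

Here fixedCodeWidth(6) is 3 and fixedCodeWidth(26) is 5, as stated in the text, and the example message has 22 letters, so it needs 22 × 3 = 66 bits with the fixed-length code.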

To make a variable-length code a prefix code (that is, no character's code may be a prefix of another character's code), each character in the character set can be used as a leaf node of a binary encoding tree. To minimize the length of the transmitted message, the frequency of each character is assigned to its leaf as the node's weight. The less frequently a character is used, the smaller its weight; the smaller the weight, the deeper the leaf sits in the tree, so low-frequency characters get longer codes and high-frequency characters get shorter codes. The minimum weighted path length of this tree is then exactly the shortest possible length of the encoded message. The problem of finding the shortest encoding of a message is therefore transformed into the problem of building a Huffman tree whose leaves are the characters of the character set and whose weights are the character frequencies. A binary prefix code designed from a Huffman tree both satisfies the prefix condition and guarantees that the total length of the encoded message is minimal.
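
The following C++ sketch shows how the minimum total encoded length can be computed without building the tree explicitly. It relies on the standard greedy construction (repeatedly merge the two smallest weights) and on the fact that the WPL equals the sum of the weights of all merged nodes. The function name huffmanTotalLength is invented for this example.

```cpp
#include <functional>
#include <queue>
#include <vector>

// Minimum total encoded length for the given frequencies:
// repeatedly merge the two smallest weights and add up every merged weight.
long long huffmanTotalLength(const std::vector<int>& freq)
{
    // Min-heap of weights.
    std::priority_queue<long long, std::vector<long long>,
                        std::greater<long long>> pq;
    for (int f : freq) pq.push(f);

    long long total = 0;
    while (pq.size() > 1) {
        long long a = pq.top(); pq.pop();
        long long b = pq.top(); pq.pop();
        total += a + b;      // weight of the new internal node
        pq.push(a + b);
    }
    return total;            // 0 when there are fewer than two symbols
}
```

For the letter counts {8, 4, 5, 3, 1, 1} of the example message, this gives 51 bits, compared with 22 × 3 = 66 bits for the 3-bit fixed-length code.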

This problem asks you to read from the keyboard the symbols used in a message and their frequencies of occurrence, then build a Huffman tree and output the Huffman codes.

Note:
In order to ensure a unique Huffman tree, this question stipulates that when constructing the Huffman tree, the weight of the left child node is not greater than the weight of the right child node. If the weights are equal, the node that is dequeued first in the priority queue is selected as the left child. When encoding, the left branch takes "0" and the right branch takes "1".

Input format:

The input has 3 lines.
Line 1: Number of symbols n (2 to 20).

Line 2: A string without spaces, giving the symbol table. The symbols are all single lowercase English letters and appear consecutively starting from 'a'. That is, if n is 2, the symbol table is ab; if n is 6, the symbol table is abcdef; and so on.

Line 3: The frequency of occurrence of each symbol (given as an integer: the frequency multiplied by 100), separated by spaces.

Output format:

First output the weighted path length of the constructed Huffman tree.
Next, n lines are output, each line is a character and the Huffman code corresponding to the character. Characters are output in dictionary order.

Characters and Huffman codes are separated by colons.

For example:

a:10

b:110

Input example:

Give a set of inputs here.

6
abcdef
15 19 10 6 38 12

Output sample:

The corresponding output is given here.

240
a:101
b:111
c:1101
d:1100
e:0
f:100

Tip:
For the sample data above, the Huffman tree built according to the problem's rules is shown in a figure (image not reproduced here).

Code length limit 16 KB

Time limit 400 ms

Memory limit 64 MB
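
To see why the prefix property matters, here is a small C++ sketch that encodes and decodes text with the code table from the sample output. The names kSampleCodes, encode, and decode are made up for this illustration and are not part of the problem or the solution.

```cpp
#include <map>
#include <string>

// Code table copied from the sample output of the problem.
const std::map<char, std::string> kSampleCodes = {
    {'a', "101"}, {'b', "111"}, {'c', "1101"},
    {'d', "1100"}, {'e', "0"}, {'f', "100"}};

// Concatenate the codes of all characters (throws if a character is unknown).
std::string encode(const std::string& text)
{
    std::string bits;
    for (char c : text) bits += kSampleCodes.at(c);
    return bits;
}

// Grow a candidate code bit by bit. Because no code is a prefix of another,
// the first match is always the right one. Trailing bits that do not form
// a complete code are silently dropped in this sketch.
std::string decode(const std::string& bits)
{
    std::map<std::string, char> reverse;
    for (const auto& kv : kSampleCodes) reverse[kv.second] = kv.first;

    std::string text, current;
    for (char b : bits) {
        current += b;
        auto it = reverse.find(current);
        if (it != reverse.end()) {
            text += it->second;
            current.clear();
        }
    }
    return text;
}
```

For instance, encode("face") yields 10010111010 (100, 101, 1101, 0), and decode turns that bit string back into face without any separators between the codes.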

Analysis: I spent a long time debugging this problem. The overall idea is to simulate the construction of a Huffman tree using a priority queue and structs. The idea itself is not too difficult, but there are two major pitfalls:

1: When overloading the comparison operator, handle the case of equal weights first by comparing the characters, and only then compare the weights.

2: When the two dequeued nodes have equal weights, swap them, so that the node dequeued second becomes the left child and the node dequeued first becomes the right child (this is particularly tricky because the problem does not explain it clearly; the natural reading is that the first node is the left child and the second is the right child).
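
Before the full solution, a small sketch of what the two pitfalls mean in practice. It uses the same comparison logic as the solution, reduced to a hypothetical struct Item with a helper popOrder (both names invented here), and reports the order in which std::priority_queue releases the elements.

```cpp
#include <queue>
#include <string>
#include <vector>

struct Item {
    int w;    // weight
    char op;  // symbol, used only to break ties
    // std::priority_queue is a max-heap: the "largest" element comes out first.
    bool operator<(const Item& a) const
    {
        if (a.w == w) return op < a.op;  // equal weights: larger symbol comes out first
        return w > a.w;                  // otherwise: smaller weight comes out first
    }
};

// Returns the symbols in the order the priority queue releases them.
std::string popOrder(const std::vector<Item>& items)
{
    std::priority_queue<Item> q;
    for (const Item& it : items) q.push(it);

    std::string order;
    while (!q.empty()) {
        order += q.top().op;
        q.pop();
    }
    return order;
}
```

With the items (5, 'a'), (5, 'b'), (3, 'c'), the order is c, b, a: the smallest weight comes out first, and among equal weights the larger character comes out first. That is why the solution swaps the two nodes on a tie: after the swap, the smaller character ends up as the left child.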

#include <iostream>
#include <queue>
#include <map>
#include <string>
#include <vector>
#include <utility>

using namespace std;

const int N = 110;

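struct node
{
    int id, w;
    char op;
    // priority_queue is a max-heap, so "less" here means "dequeued later".
    bool operator < (const node &a) const
    {
        if (a.w == w) return op < a.op; // equal weights: larger character is dequeued first
        return w > a.w;                 // otherwise: smaller weight is dequeued first
    }
};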

struct node1
{
    int id, w;
    char op;
    int l, r, p;
}h[N];

int n;
string s;
map<char, string>mp;

void init()
{
    // init() runs before n is read, so reset every slot of the array.
    for (int i = 0; i < N; i ++ )
        h[i].p = h[i].l = h[i].r = -1;
}

int main()
{
    init();
    cin >> n >> s;
    priority_queue<node, vector<node> >q;
    for (int i = 0; i < n; i ++ )
    {
        int x;
        cin >> x;
        q.push({i, x, s[i]});
    }
    int sum = 0;
    for (int i = n; i < 2 * n - 1; i ++ )
    {
        auto x = q.top();
        q.pop();
        auto y = q.top();
        q.pop();
        if(x.w == y.w) swap(x, y); // tie: the node dequeued second becomes the left child
        h[i].l = x.id, h[i].r = y.id;
        h[x.id].p = h[y.id].p = i;
        h[i].id = i;
        h[i].w = x.w + y.w;
        q.push({i, h[i].w, '-'});
        sum += h[i].w;
    }
    cout << sum << endl;
    
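    h[n * 2 - 2].p = -1; // the root (the last node created) has no parent
    for (int i = 0; i < n; i ++ )
    {
        // Walk from leaf i up to the root, prepending one bit per edge.
        string s1 = "";
        int pre = i;
        int pp = h[i].p;
        while (pp != -1)
        {
            if (h[pp].l == pre) s1 = "0" + s1; // came up from the left child
            else s1 = "1" + s1;                // came up from the right child
            pre = pp;
            pp = h[pp].p;
        }
        mp[s[i]] = s1;
    }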
    for (auto it : mp)
        cout << it.first << ':' << it.second << endl;
    
    return 0;
}
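
The solution above builds each code by walking from a leaf up to the root. An alternative, shown here only as a sketch and not part of the original solution, is to walk down from the root and hand each leaf the path that led to it. The names TreeNode, assignCodes, and sampleCodes are hypothetical, and the tree is hard-coded to the one the solution builds for the sample input (leaves 0..5 are a..f, internal nodes 6..10 in merge order).

```cpp
#include <string>
#include <vector>

struct TreeNode {
    int l, r;  // child indices, -1 for a leaf
};

// Walk down from node u, using '0' for a left edge and '1' for a right edge.
// Assumption: every internal node has two children (true for a Huffman tree).
void assignCodes(const std::vector<TreeNode>& t, int u,
                 std::string& path, std::vector<std::string>& codes)
{
    if (t[u].l == -1 && t[u].r == -1) {
        codes[u] = path;  // reached a leaf: the path is its code
        return;
    }
    path.push_back('0');
    assignCodes(t, t[u].l, path, codes);
    path.back() = '1';
    assignCodes(t, t[u].r, path, codes);
    path.pop_back();      // restore the path for the caller
}

// Codes for the sample input 15 19 10 6 38 12, indexed 0..5 for a..f.
std::vector<std::string> sampleCodes()
{
    std::vector<TreeNode> t = {
        {-1, -1}, {-1, -1}, {-1, -1}, {-1, -1}, {-1, -1}, {-1, -1},
        {3, 2},   // 6: d + c = 16
        {5, 0},   // 7: f + a = 27
        {6, 1},   // 8: 16 + b = 35
        {7, 8},   // 9: 27 + 35 = 62
        {4, 9}};  // 10: e + 62 = 100 (root)
    std::vector<std::string> codes(t.size());
    std::string path;
    assignCodes(t, 10, path, codes);
    codes.resize(6);  // keep only the leaves a..f
    return codes;
}
```

Each edge is visited once, so the codes are produced in a single traversal; the result for the sample tree is the same table as in the sample output (a:101, b:111, c:1101, d:1100, e:0, f:100).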

Origin blog.csdn.net/qq_52331221/article/details/127909309