Huffman coding (C++ implementation)


1. Introduction

In the previous article, the concept of the Huffman tree and its implementation were introduced.

What are Huffman trees used for? They are used to create Huffman coding (a binary encoding scheme).

Huffman coding is an encoding method that can be used for data compression. For example, you can imagine this kind of compression being used in software such as WinRAR or WinZip. The construction of a Huffman code requires a Huffman tree.

2. Fixed-length encoding

Huffman coding addresses the problem of compressing information transmitted between two communicating parties: expressing the same content while transmitting the smallest possible amount of data.

Imagine that there is a piece of text content "AFDBCFBDEFDF" that needs to be sent to others through the network.

Generally speaking, it is most convenient to transmit information as binary 0s and 1s (representing two distinct signals), so consider how to encode the text before transmitting it.

This text involves only the 6 letters A, B, C, D, E, and F. A 3-bit binary number can represent 8 different letters (from 000 to 111), so using 3 bits per letter is more than enough to encode the 6 letters in this text.

As shown in the figure below, note that this is a fixed-length encoding:

[Figure: fixed-length code table — A: 000, B: 001, C: 010, D: 011, E: 100, F: 101]

When transmitting the text "AFDBCFBDEFDF", the data actually sent is the encoded string 000101011001010101001011100101011101.

By prior agreement between the two parties, the receiver can restore the original text from the binary string by splitting it into 3-bit groups and decoding each group. But if there is a lot of text to transmit, the encoded string will also be very long, which means a lot of data has to be sent.
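
To make this concrete, here is a minimal sketch (not part of the article's HFMTree code) of fixed-length encoding and decoding in C++, assuming the 3-bit table shown in the figure above. Running it prints the 36-character string above and then recovers "AFDBCFBDEFDF".

#include <iostream>
#include <string>
#include <map>
using namespace std;

//A minimal sketch of fixed-length (3-bit) encoding and decoding,
//assuming the code table shown in the figure above.
int main()
{
	map<char, string> table = { {'A',"000"}, {'B',"001"}, {'C',"010"},
	                            {'D',"011"}, {'E',"100"}, {'F',"101"} };
	string text = "AFDBCFBDEFDF";

	//encode: replace every letter with its 3-bit code
	string bits;
	for (char ch : text)
		bits += table[ch];
	cout << bits << " (" << bits.size() << " bits)" << endl;

	//decode: cut the bit string into 3-bit groups and look each one up
	map<string, char> reverseTable;
	for (auto& kv : table)
		reverseTable[kv.second] = kv.first;
	string decoded;
	for (size_t i = 0; i + 3 <= bits.size(); i += 3)
		decoded += reverseTable[bits.substr(i, 3)];
	cout << decoded << endl; //AFDBCFBDEFDF
	return 0;
}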

In real data transmission, whether English letters or Chinese characters are being sent, the symbols do not occur with equal frequency. For example, among the 26 English letters, "E, A, T, I, N, O" occur noticeably more often than the others, and in Chinese, characters such as "you, ren, de, gong, zai, yi, shi" occur more often than other characters.

3. Huffman coding

For the 6 letters in the text "AFDBCFBDEFDF" transmitted earlier, we can roughly estimate (or simply assume) their frequencies of occurrence. Taking the total as 100%, a rough estimate is A: 12%, B: 15%, C: 9%, D: 24%, E: 8%, F: 32%.

With this estimate, we can redesign the encoding scheme using a Huffman tree: treat the six letters A, B, C, D, E, and F as leaf nodes, and use their percentages (with the percent sign removed), 12, 15, 9, 24, 8, and 32, as the leaf weights. From these a Huffman tree can be constructed.

As shown below:

[Figure: Huffman tree built from the weights 12, 15, 9, 24, 8, 32]
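
The construction itself is just the greedy merging described in the previous article. The following minimal sketch (independent of the HFMTree class used later in this article) traces the merge order for these six weights with a std::priority_queue:

#include <iostream>
#include <queue>
#include <vector>
#include <functional>
using namespace std;

//A minimal sketch: repeatedly merge the two smallest weights,
//which is exactly how the Huffman tree for A..F is built.
int main()
{
	priority_queue<int, vector<int>, greater<int>> pq; //min-heap of weights
	int weights[] = { 12, 15, 9, 24, 8, 32 };
	for (int w : weights)
		pq.push(w);

	while (pq.size() > 1)
	{
		int a = pq.top(); pq.pop();
		int b = pq.top(); pq.pop();
		cout << a << " + " << b << " = " << a + b << endl;
		pq.push(a + b);
	}
	return 0;
}
//Expected merge order: 8+9=17, 12+15=27, 17+24=41, 27+32=59, 41+59=100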

For the Huffman tree shown above, label each segment of every left branch with 0 and each segment of every right branch with 1. In other words, starting from the root node, going left represents a binary 0 and going right represents a binary 1. In this way, the scheme of representing letters with binary strings can be mapped onto the tree, as shown below:

[Figure: the same Huffman tree with each left branch labeled 0 and each right branch labeled 1]

As can be seen from the figure, starting from the root node, the path to node D passes through the bits 00, and the path to node E passes through the bits 010, and so on. In other words, the code of D is 00, the code of E is 010, and so forth.

The binary codes corresponding to the leaf nodes of the Huffman tree are shown in the figure below (these binary codes are the letters' Huffman codes).

[Figure: Huffman code table — A: 110, B: 111, C: 011, D: 00, E: 010, F: 10]

As the figure shows, the letters that occur most frequently get the shortest codes, which saves transmission. This is therefore a variable-length encoding: different letters are encoded with binary strings of different lengths.

Ultimately, for the text "AFDBCFBDEFDF", the content actually transmitted is the encoded string 11010001110111011100010100010.

You can compare it with the original binary characters that need to be transmitted:

  • Raw binary string: 000101011001010101001011100101011101 (36 characters)
  • New binary string: 11010001110111011100010100010 (29 characters)

This means less data needs to be transmitted; in other words, the data has been compressed, saving roughly 19% of storage or transmission cost (7 of 36 bits). Obviously, the more text there is to transmit, the more substantial the savings.
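
Here is a small sketch that reproduces this comparison, using the code table from the figure above. (The exact bit patterns depend on how the tree is drawn and labeled, but with these weights the code lengths, and therefore the 29-bit total, come out the same.)

#include <iostream>
#include <string>
#include <map>
using namespace std;

//A minimal sketch: encode the same text with the variable-length
//(Huffman) table read off the figure above and compare the sizes.
int main()
{
	map<char, string> huffman = { {'A',"110"}, {'B',"111"}, {'C',"011"},
	                              {'D',"00"},  {'E',"010"}, {'F',"10"} };
	string text = "AFDBCFBDEFDF";

	string bits;
	for (char ch : text)
		bits += huffman[ch];

	cout << bits << endl; //prints the 29-bit string shown above
	cout << "fixed-length: " << text.size() * 3 << " bits, "
	     << "Huffman: " << bits.size() << " bits" << endl;
	return 0;
}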

4. Huffman decoding

So how does the receiver decode the real text from the new binary string? Because the encoding contains only 0s and 1s and is variable-length, decoding can easily go wrong due to ambiguity. Therefore, a variable-length encoding must be designed so that no letter's code is a prefix of any other letter's code.

For example, the binary code of the letter F is 10, so no other letter's code may start with 10. Looking at the figures above and below, the codes of the letters A, B, C, D, and E indeed do not start with 10.

[Figure: Huffman code table (repeated for reference)]

A concept is involved here: prefix encoding

  • If, in an encoding scheme, no code is a prefix (leftmost substring) of any other code, the scheme is called a prefix encoding.

Huffman coding is a kind of prefix coding.

For example, suppose the binary code of the letter C were 101 instead of 011, which violates the rule because it starts with 10. Then if the transmitted content were 10110110, the receiver might decode it as CCF or as FAA, which creates ambiguity and confusion.

With the encoding shown in the figure above, when the new binary string is received, it can only be decoded as "AFDBCFBDEFDF" and never as anything else. Of course, to guarantee that the receiver can decode what the sender transmits, the receiver must hold the same code table shown in the figure above.
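
Because no code is a prefix of another, the receiver can decode greedily, one bit at a time. Below is a minimal decoding sketch using the same code table as above (again, only a sketch, not the article's HFMTree implementation):

#include <iostream>
#include <string>
#include <map>
using namespace std;

//A minimal decoding sketch: read bits one at a time and emit a letter
//as soon as the accumulated bits match an entry in the code table.
//The prefix property guarantees the match is never ambiguous.
int main()
{
	map<string, char> table = { {"110",'A'}, {"111",'B'}, {"011",'C'},
	                            {"00",'D'},  {"010",'E'}, {"10",'F'} };
	string bits = "11010001110111011100010100010";

	string decoded, buffer;
	for (char bit : bits)
	{
		buffer += bit;
		auto it = table.find(buffer);
		if (it != table.end())
		{
			decoded += it->second;
			buffer.clear();
		}
	}
	cout << decoded << endl; //AFDBCFBDEFDF
	return 0;
}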

5. Coding characteristics

Huffman coding is an encoding scheme that determines the code of each character in a character set by constructing a Huffman tree. It is a prefix encoding, which guarantees that the decoded content is unique.

  • Each character in the character set is used as a leaf node, and the frequency of occurrence of each character is used as the weight of the node to construct a Huffman tree.
  • Mark each segment in the left branch path of the Huffman tree with 0, and mark each segment in the right branch path with 1. Of course, the left branch can be marked with 1 and the right branch can be marked with 0, as long as both communicating parties make consistent agreements during encoding and decoding.
  • The 0s and 1s along the path from the root of the Huffman tree to a leaf node are concatenated to form that character's Huffman code.

Because Huffman trees are not unique, Huffman codes are not unique either. When constructing a Huffman tree, some textbooks require that the weight of the left child not be greater than the weight of the right child, or at least that the relationship between the left and right child weights be applied consistently.

In other words, either ensure that every left child's weight is less than or equal to its right sibling's weight, or ensure that every right child's weight is less than or equal to its left sibling's weight; what matters is keeping the convention consistent throughout the tree.

6. Code implementation

The implementation of Huffman coding can be added directly to the HFMTree class from the previous article; all that is needed is one extra public member function.
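
The function below relies on the node layout of that class. As a reminder, and as an assumption (the exact definition from the previous article is not repeated here, and may differ in details), each node stored in the m_data array looks roughly like this:

//Assumed node layout (HFMTreeNode is a hypothetical name; the field names
//weight, parent and lchild are the ones used by the function below)
struct HFMTreeNode
{
	int weight;  //node weight (frequency)
	int parent;  //index of the parent node in m_data, -1 for the root
	int lchild;  //index of the left child, -1 if none
	int rchild;  //index of the right child, -1 if none
};
//The HFMTree class keeps these nodes in an array m_data and records the
//number of nodes currently in use in m_length.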

//Generate the Huffman code of one node
bool CreateHFMCode(int idx) //idx is the index of the node in the array that stores the Huffman tree
{
	//When this function is called, m_length should already equal the total number of nodes
	//in the Huffman tree, so the number of leaf nodes can be computed like this:
	int leafNodeCount = (m_length + 1) / 2;

	if (idx < 0 || idx >= leafNodeCount)
	{
		//Huffman codes are only generated for leaf nodes
		return false;
	}
	string result = ""; //holds the Huffman code being built
	int curridx = idx;
	int tmpparent = m_data[curridx].parent;
	while (tmpparent != -1) //walk upward from the leaf toward the root
	{
		if (m_data[tmpparent].lchild == curridx)
		{
			//prepend 0
			result = "0" + result;
		}
		else
		{
			//prepend 1
			result = "1" + result;
		}
		curridx = tmpparent;
		tmpparent = m_data[curridx].parent;
	} //end while
	cout << "The Huffman code of the node at index [" << idx << "] with weight " << m_data[idx].weight << " is " << result << endl;
	return true;
}

Add a test example to the main function:

//Huffman coding test
int main()
{
	int weigh[] = { 12, 15, 9, 24, 8, 32 };
	int sz = sizeof(weigh) / sizeof(weigh[0]);

	//Pass in: the number of elements in the weight list and the address of the weight list
	HFMTree hfmt(sz, weigh);

	hfmt.CreateHFMTree(); //build the Huffman tree
	hfmt.preOrder(hfmt.GetLength() - 1); //traverse the Huffman tree; the argument is the index of the root node (the last valid position in the array)

	//generate the Huffman codes
	cout << "--------------" << endl;
	for (int i = 0; i < sz; ++i)
		hfmt.CreateHFMCode(i);

	return 0;
}

The test results are as follows:

[Figure: program output — the pre-order traversal of the tree followed by the Huffman code of each leaf node]

Please note that this result is not exactly the same as the Huffman codes shown in the earlier figure. This is because the tree built by the program strictly follows the rule "the weight of the left child is not greater than the weight of the right child", whereas the tree in the earlier figure was not drawn according to this rule (for example, when the nodes 24 and 17 were merged, and when 32 and 27 were merged).
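
For readers who do not have the previous article at hand, here is a minimal, self-contained sketch of what such an HFMTree class might look like. It is not the original class, only an approximation that compiles with the test code above; its CreateHFMTree follows the "left child weight is not greater than right child weight" rule just mentioned.

#include <iostream>
#include <string>
using namespace std;

//A minimal, array-based sketch of an HFMTree class with the interface used
//by the test code above. The original class may be implemented differently.
class HFMTree
{
public:
	HFMTree(int leafCount, int* weights)
	{
		m_capacity = 2 * leafCount - 1; //a Huffman tree with n leaves has 2n-1 nodes
		m_length = leafCount;           //the leaf nodes are filled in first
		m_data = new Node[m_capacity];
		for (int i = 0; i < leafCount; ++i)
			m_data[i].weight = weights[i];
	}
	~HFMTree() { delete[] m_data; }

	//repeatedly merge the two smallest parentless nodes;
	//the smaller of the two becomes the left child
	void CreateHFMTree()
	{
		while (m_length < m_capacity)
		{
			int min1 = SelectMin(-1);
			int min2 = SelectMin(min1);
			m_data[m_length].weight = m_data[min1].weight + m_data[min2].weight;
			m_data[m_length].lchild = min1;
			m_data[m_length].rchild = min2;
			m_data[min1].parent = m_length;
			m_data[min2].parent = m_length;
			++m_length;
		}
	}

	int GetLength() const { return m_length; }

	void preOrder(int idx) //pre-order traversal printing node weights
	{
		if (idx == -1) return;
		cout << m_data[idx].weight << " ";
		preOrder(m_data[idx].lchild);
		preOrder(m_data[idx].rchild);
	}

	//generate the Huffman code of the leaf node at index idx by walking
	//from the leaf up to the root (same logic as shown earlier)
	bool CreateHFMCode(int idx)
	{
		int leafNodeCount = (m_length + 1) / 2;
		if (idx < 0 || idx >= leafNodeCount)
			return false; //only leaf nodes have a Huffman code
		string result;
		int curridx = idx;
		int tmpparent = m_data[curridx].parent;
		while (tmpparent != -1)
		{
			result = (m_data[tmpparent].lchild == curridx ? "0" : "1") + result;
			curridx = tmpparent;
			tmpparent = m_data[curridx].parent;
		}
		cout << "node " << idx << " (weight " << m_data[idx].weight << "): " << result << endl;
		return true;
	}

private:
	struct Node
	{
		int weight = 0;
		int parent = -1;
		int lchild = -1;
		int rchild = -1;
	};

	//index of the smallest-weight node that has no parent yet, skipping 'exclude'
	int SelectMin(int exclude)
	{
		int best = -1;
		for (int i = 0; i < m_length; ++i)
		{
			if (i == exclude || m_data[i].parent != -1)
				continue;
			if (best == -1 || m_data[i].weight < m_data[best].weight)
				best = i;
		}
		return best;
	}

	Node* m_data;
	int m_length;
	int m_capacity;
};

Combined with the main function shown above, this sketch should build the tree and print a Huffman code for each of the six leaves.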

7. Summary

Huffman trees are used to create Huffman codes. Huffman coding is an encoding method that can be used for data compression, and constructing a Huffman code requires building a Huffman tree.

Food for thought: the 26 English letters are used with different frequencies, and frequency tables for them are easy to find with a search engine. If you apply Huffman coding to these 26 letters, how much can the data be compressed compared with a fixed-length encoding?

Origin blog.csdn.net/m0_63325890/article/details/132504461