Huffman tree (implemented in C)

Basic concept of Huffman tree

Before studying the Huffman tree, you need to know the following basic terms:
1. What is a path?

In a tree, the sequence of branches leading from one node to another node reachable from it is called a path.

[Figure: a binary tree rooted at A]
As shown in the figure, the path from root node A to leaf node I is A->C->F->I.

2. What is the path length?

The number of edges a path passes through is called the path length of that path.

[Figure: the path A->C->F->I, which passes through 3 edges]
As shown in the figure, the path passes through 3 edges, so its path length is 3.

3. What is the weighted path length of a node?

If a node in the tree is assigned a value with a certain meaning, the value is called the weight of the node. The product of the path length from the root node to the node and the weight of the node is called the weighted path length of the node.

[Figure: the same tree with node weights marked]
As shown in the figure, the weighted path length of leaf node I is 3 × 3 = 9.

4. What is the weighted path length of a tree?

The weighted path length of the tree is specified as the sum of the weighted path lengths of all leaf nodes, denoted as WPL.

[Figure: a binary tree whose leaf nodes carry the weights 2, 6, 1, 3, 2]
As shown in the figure, the weighted path length of this binary tree is WPL = 2 × 2 + 2 × 6 + 3 × 1 + 3 × 3 + 2 × 2 = 32.

5. What is a Huffman tree?

Given n weights as n leaf nodes, construct a binary tree. If the weighted path length of the tree reaches the minimum, the binary tree is called a Huffman tree, also known as an optimal binary tree.

According to the calculation rule for the weighted path length of a tree, it is not hard to see that the weighted path length depends on how the leaf nodes are distributed.
Even two binary trees with the same structure can have different weighted path lengths if their leaf nodes are distributed differently.
[Figure: two binary trees with the same structure but different leaf distributions]
So how can we minimize the weighted path length of a binary tree?
According to the calculation rule, leaf nodes with large weights should be kept as close to the root as possible, and leaf nodes with small weights as far from the root as possible; this makes the weighted path length of the binary tree as small as possible.

Construction of Huffman tree

Build ideas

A very simple and easy-to-follow algorithm for constructing a Huffman tree is given below:
1. In the initial state there are n nodes, whose weights are the n given numbers; regard them as n trees, each consisting of only a root node.
2. Merge the two trees whose root weights are the smallest, creating a new parent node for the two of them; its weight is the sum of the two root weights. The number of trees is thus reduced by one.
3. Repeat step 2 until only one tree remains; that tree is the Huffman tree.

For example, suppose we are given the 5 numbers 1, 2, 2, 3, 6 and asked to build a Huffman tree.
Animation demonstration:
[Figure: animation of the whole construction process]
1. Initial state: There are 5 trees with only root nodes.
[Figure: 5 single-node trees with weights 1, 2, 2, 3, 6]
2. Merge two trees with weights 1 and 2 to generate the parent node of the two trees, and the weight of the parent node is 3.
[Figure: after merging the trees with weights 1 and 2]
3. Merge two trees with weights 2 and 3 to generate the parent node of the two trees, and the weight of the parent node is 5.
[Figure: after merging the trees with weights 2 and 3]
4. Merge two trees with weights 3 and 5 to generate the parent node of the two trees, and the weight of the parent node is 8.
[Figure: after merging the trees with weights 3 and 5]
5. Merge the two trees whose weights are 6 and 8, and generate the parent node of the two trees. The weight of the parent node is 14.
[Figure: after merging the trees with weights 6 and 8]
6. At this time, there is only one tree left, and this tree is the Huffman tree.
[Figure: the resulting Huffman tree]
Observing this Huffman tree, we can also see that it contains no node of degree 1: since two trees are merged at each step, a node of degree 1 can never appear.
From this we can deduce that if a Huffman tree is built from n numbers, it contains 2n - 1 nodes in total, because in any binary tree the number of leaf nodes (degree 0) is always exactly 1 more than the number of nodes of degree 2.

Proof:
  Let the numbers of nodes with degree 0, 1, and 2 be n0, n1, and n2 respectively.
  Then the total number of nodes of the binary tree is n = n0 + n1 + n2.
  Let the number of edges of the binary tree be B; then B = n2 × 2 + n1 × 1.
  Since every node except the root hangs on exactly one edge, we also have B = n - 1.
  Therefore n2 × 2 + n1 = n - 1, that is, n = n2 × 2 + n1 + 1.
  Hence n0 + n1 + n2 = n2 × 2 + n1 + 1, which gives n0 = n2 + 1.
  In a Huffman tree built from n numbers we have n0 = n and n1 = 0, so n2 = n - 1 and the total number of nodes is n + (n - 1) = 2n - 1.

To sum up: building a Huffman tree means repeatedly selecting the two smallest elements and merging them, until only one element remains.
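
As a small aside (not part of the original implementation below): a well-known property of this greedy process is that the sum of the weights of all merged (internal) nodes equals the weighted path length of the resulting tree. A minimal C sketch that simulates the merging on a plain array can therefore compute the WPL directly; the function name HuffWPL and the simple selection loop are my own choices here:

#include <stdio.h>

// Repeatedly merge the two smallest weights; the running sum of the merge
// results equals the WPL of the Huffman tree built from these weights.
double HuffWPL(double w[], int n)
{
	double wpl = 0;
	while (n > 1)
	{
		int a = 0, b = 1; // indices of the two smallest weights, w[a] <= w[b]
		if (w[b] < w[a]) { int t = a; a = b; b = t; }
		for (int i = 2; i < n; i++)
		{
			if (w[i] < w[a]) { b = a; a = i; }
			else if (w[i] < w[b]) { b = i; }
		}
		double merged = w[a] + w[b];
		wpl += merged;
		w[a] = merged;   // keep the merged weight in slot a
		w[b] = w[n - 1]; // remove slot b by moving the last element into it
		n--;
	}
	return wpl;
}

int main(void)
{
	double w[] = { 1, 2, 2, 3, 6 };
	printf("WPL = %.0f\n", HuffWPL(w, 5)); // prints WPL = 30 for the example above
	return 0;
}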

Code

In the code implementation, the type definition of a single node is as follows:

typedef double DataType; // data type of a node's weight

typedef struct HTNode // information of a single node
{
	DataType weight; // weight of the node
	int parent; // subscript of the parent node
	int lc, rc; // subscripts of the left and right children
}*HuffmanTree;

When the code is implemented, we use an array to store the basic information (weight, parent node, left child and right child) of each node in the constructed Huffman tree. The basic layout of the array is as follows:
[Figure: layout of the node array]
Let's take "building a Huffman tree with numbers 7, 5, 4, and 2" as an example. The basic implementation steps of the code are as follows:

The first stage:
The total number of nodes of the Huffman tree to be built is 2 × 4 - 1 = 7, but the array we create can hold the information of 8 nodes, because no node information is stored at subscript 0 of the array; the specific reason is given later.
We first assign the numbers 7, 5, 4, and 2 used to construct the Huffman tree to the weight fields at subscripts 1 to 4 of the array in turn, and initialize the rest of the information to 0.
[Figure: the array after initialization]
The second stage:
From the elements with subscripts 1-4 in the array, select two nodes with the smallest weight and a parent node of 0 (representing that they have no parent node yet), and generate their parent nodes:
 1. The weight of the node whose subscript is 5 is equal to the sum of the weights of the two selected nodes.
 2. The parent node of the two selected nodes is the node with subscript 5.
 3. The left child of the node with subscript 5 is the node with the smaller weight among the two selected nodes, and the other is its right child.
[Figure: the array after the first merge]
Then, from the elements with subscripts 1-5 in the array, select two nodes with the smallest weight and whose parent node is 0, and generate their parent nodes.
[Figure: the array after the second merge]
Continue to select two nodes with the smallest weight and parent node 0 from the elements with subscripts 1-6 in the array, and generate their parent nodes.
[Figure: the array after the third merge]
At this point, except for the element with subscript 0, all elements in the array have their own node information, and the Huffman tree has been constructed.
[Figure: the completed array]
According to the array information, we can draw the constructed Huffman tree:
[Figure: the Huffman tree drawn from the array]
Observing the data in the array, we can see that the left and right children of the nodes with weights 7, 5, 4, and 2 are all 0, that is, they have no children, because they are leaf nodes. In addition, the node whose parent field is 0 is the root node of the constructed Huffman tree.

Now let's explain why the element at subscript 0 of the array stores no node information.
 In the array, the left and right children of a leaf node are 0, and the parent of the root node is 0. If node information were stored at subscript 0, we could no longer tell whether a node whose children are 0 is a leaf node or a node whose children are the node at subscript 0, and we could not tell which node is the root of the Huffman tree.
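
Given that convention, identifying leaves and the root from the array is straightforward. Here is a minimal sketch of two helper functions (the names IsLeaf and FindRoot are my own, assuming the HTNode array described above):

// A node is a leaf if it has no children; since subscript 0 is reserved,
// a child index of 0 can only mean "no child".
int IsLeaf(HuffmanTree HT, int i)
{
	return HT[i].lc == 0 && HT[i].rc == 0;
}

// The root is the only node whose parent field is 0.
int FindRoot(HuffmanTree HT, int m) // m is the total number of nodes (2n-1)
{
	for (int i = 1; i <= m; i++)
		if (HT[i].parent == 0)
			return i;
	return 0; // not found (empty tree)
}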

The code is as follows:

// In the subscript range 1..n, find the indices of the two unmerged nodes (parent == 0) with the smallest weights; s1 gets the smaller weight
void Select(HuffmanTree& HT, int n, int& s1, int& s2)
{
	int min;
	// find the first minimum
	for (int i = 1; i <= n; i++)
	{
		if (HT[i].parent == 0)
		{
			min = i;
			break;
		}
	}
	for (int i = min + 1; i <= n; i++)
	{
		if (HT[i].parent == 0 && HT[i].weight < HT[min].weight)
			min = i;
	}
	s1 = min; // the first minimum goes to s1
	// find the second minimum
	for (int i = 1; i <= n; i++)
	{
		if (HT[i].parent == 0 && i != s1)
		{
			min = i;
			break;
		}
	}
	for (int i = min + 1; i <= n; i++)
	{
		if (HT[i].parent == 0 && HT[i].weight < HT[min].weight && i != s1)
			min = i;
	}
	s2 = min; // the second minimum goes to s2
}

// Build the Huffman tree
void CreateHuff(HuffmanTree& HT, DataType* w, int n)
{
	int m = 2 * n - 1; // total number of nodes in the Huffman tree
	HT = (HuffmanTree)calloc(m + 1, sizeof(HTNode)); // allocate m+1 HTNodes, because the one at subscript 0 stores no data
	for (int i = 1; i <= n; i++)
	{
		HT[i].weight = w[i - 1]; // assign the weights to the n leaf nodes
	}
	for (int i = n + 1; i <= m; i++) // build the Huffman tree
	{
		// select the two nodes s1 and s2 with the smallest weights and create their parent node
		int s1, s2;
		Select(HT, i - 1, s1, s2); // in the subscript range 1..i-1, find the indices of the two smallest weights; s1 gets the smaller one
		HT[i].weight = HT[s1].weight + HT[s2].weight; // the weight of i is the sum of the weights of s1 and s2
		HT[s1].parent = i; // the parent of s1 is i
		HT[s2].parent = i; // the parent of s2 is i
		HT[i].lc = s1; // the left child of i is s1
		HT[i].rc = s2; // the right child of i is s2
	}
	// print the relationships between the nodes of the Huffman tree
	printf("The Huffman tree is:\n");
	printf("Index  Weight   Parent   LChild   RChild\n");
	printf("0\n");
	for (int i = 1; i <= m; i++)
	{
		printf("%-4d   %-6.2lf   %-6d   %-6d   %-6d\n", i, HT[i].weight, HT[i].parent, HT[i].lc, HT[i].rc);
	}
	printf("\n");
}

Note: to avoid using double pointers, the function parameters are passed by C++ reference, so this code must be compiled as C++.
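
If you would rather keep the code as plain C, the same effect can be achieved with a pointer-to-pointer out-parameter instead of a reference. Below is a minimal sketch of the pattern (InitHuff is a hypothetical helper that only allocates the array and assigns the leaf weights, not the full construction); note that plain C also requires sizeof(struct HTNode) rather than sizeof(HTNode):

#include <stdlib.h>

typedef double DataType;
typedef struct HTNode { DataType weight; int parent; int lc, rc; } *HuffmanTree;

// Plain-C style: the caller passes the address of its HuffmanTree variable
// and the function writes the allocated array back through the pointer.
void InitHuff(HuffmanTree* HT, DataType* w, int n)
{
	int m = 2 * n - 1;
	*HT = (HuffmanTree)calloc(m + 1, sizeof(struct HTNode));
	for (int i = 1; i <= n; i++)
		(*HT)[i].weight = w[i - 1];
}

/* usage:
	HuffmanTree HT;
	InitHuff(&HT, w, n); // instead of CreateHuff(HT, w, n) with a reference
*/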

Generation of Huffman codes

Code Generation Ideas

For any binary tree, label all of its branches: every left branch is labeled 0 and every right branch is labeled 1.
Let's take the Huffman tree constructed from 7, 5, 4, and 2 as an example.
[Figure: the Huffman tree built from 7, 5, 4, 2 with its branches labeled 0 and 1]
Then, for any node of the tree, a binary number is uniquely determined by the path from the root node to that node.

For a leaf node of the Huffman tree, the number determined by the path from the root node to that leaf node is the Huffman code of the leaf node.

For example, in the figure above:
The Huffman code of leaf node 7 is: 0
The Huffman code of leaf node 5 is: 10
The Huffman code of leaf node 4 is: 111
The Huffman code of leaf node 2 is: 110
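
As a quick sanity check (a throwaway sketch of my own, with the weights and the codes above hard-coded), the total weighted code length of these codes equals the weighted path length of the tree, 35:

#include <stdio.h>
#include <string.h>

int main(void)
{
	double weight[] = { 7, 5, 4, 2 };
	const char* code[] = { "0", "10", "111", "110" };
	double total = 0;
	for (int i = 0; i < 4; i++)
		total += weight[i] * strlen(code[i]); // weight times code length
	printf("weighted code length = %.0f\n", total); // 7*1 + 5*2 + 4*3 + 2*3 = 35
	return 0;
}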

Code

We first need to answer the following question:
for a Huffman tree built from n values, how long can a generated Huffman code be at most?
 Since a Huffman code is determined by the path from the root to a leaf node, the leaf with the longest code must be the leaf farthest from the root, and a leaf is farthest from the root when the generated Huffman tree is a skewed binary tree.
[Figure: a skewed binary tree]
The code of the leaf node on the last level of such a skewed binary tree is the longest; its length equals the path length from the root to that leaf, which is n - 1.

Therefore a string that must be able to hold any Huffman code generated from n values needs a length of n, because one extra character is needed to store the string terminator '\0'.

Let's take the Huffman tree constructed by numbers 7, 5, 4, and 2 as an example. The basic implementation steps of Huffman code generation are as follows:

The first stage:
Because there are 4 values, we allocate an auxiliary buffer of size 4 and set its last position to '\0'; this buffer temporarily holds the Huffman code currently being generated.
[Figure: the auxiliary buffer]
To store the Huffman codes of these 4 values, we create a character-pointer array with 5 elements, each of type char*. The basic layout of the character-pointer array is as follows:
[Figure: layout of the character-pointer array]
Note: to keep the subscripts consistent with the array produced when constructing the Huffman tree, the element at subscript 0 of the character-pointer array does not store valid data.

The second stage:
Use the constructed Huffman tree to generate the Huffman codes of the four values. The code of a single value is generated as follows (see the HuffCoding function below):
 1. Determine the relationship between the value's node and its parent. Move the start pointer one position toward the front of the buffer; if the node is the left child of its parent, write '0' at the position start now points to, and if it is the right child, write '1' there.
 2. Then judge, in the same way, the relationship between the parent and the parent's parent, and so on, until the node being examined is the root of the Huffman tree; at that point the Huffman code of the value has been generated.
 3. Copy the string starting at position start into the corresponding slot of the character-pointer array.

Here we take generating the Huffman code of the value 5 as an example.
[Figure: generating the code of the value 5]
Note: before generating the Huffman code of each value, point the start pointer back at the '\0'.

Following this method, after the Huffman codes of 7, 5, 4, and 2 have been generated in turn, the basic layout of the character-pointer array is as follows:
[Figure: the character-pointer array holding the four codes]
At this point all the Huffman codes have been generated.

The code is as follows:

typedef char **HuffmanCode;

// Generate the Huffman codes
void HuffCoding(HuffmanTree& HT, HuffmanCode& HC, int n)
{
	HC = (HuffmanCode)malloc(sizeof(char*)*(n + 1)); // allocate n+1 slots, because the slot at subscript 0 is unused
	char* code = (char*)malloc(sizeof(char)*n); // working buffer; a code is at most n characters long (n-1 digits plus the terminating '\0')
	code[n - 1] = '\0'; // the last position of the buffer is '\0'
	for (int i = 1; i <= n; i++)
	{
		int start = n - 1; // before generating each code, point start back at the '\0'
		int c = i; // the code of the i-th value is being generated
		int p = HT[c].parent; // the parent of this node
		while (p) // stop when the parent is 0, i.e. when c is already the root
		{
			if (HT[p].lc == c) // if the node is the left child of its parent the digit is 0, otherwise 1
				code[--start] = '0';
			else
				code[--start] = '1';
			c = p; // continue encoding upwards
			p = HT[c].parent; // the parent of c
		}
		HC[i] = (char*)malloc(sizeof(char)*(n - start)); // allocate memory for storing this code
		strcpy(HC[i], &code[start]); // copy the code into the corresponding slot of the pointer array
	}
	free(code); // release the working buffer
}

Note: to avoid using double pointers, the function parameters are passed by C++ reference, so this code must be compiled as C++.
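
For completeness (this is my own addition, not part of the original article), decoding can reuse the same node array: start from the root, follow the left child on '0' and the right child on '1', and output a value each time a leaf is reached. A minimal sketch, assuming the array built by CreateHuff above (so the root is the last node, at subscript 2n-1) and the same standard headers:

// Decode a string of '0'/'1' characters by walking the tree from the root.
void HuffDecoding(HuffmanTree HT, int n, const char* bits)
{
	int root = 2 * n - 1; // the last node created by CreateHuff is the root
	int cur = root;
	for (int i = 0; bits[i] != '\0'; i++)
	{
		cur = (bits[i] == '0') ? HT[cur].lc : HT[cur].rc;
		if (HT[cur].lc == 0 && HT[cur].rc == 0) // reached a leaf: one value decoded
		{
			printf("%.2lf ", HT[cur].weight);
			cur = root; // restart from the root for the next code
		}
	}
	printf("\n");
}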

Full code display and code testing

Full source code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef double DataType; // data type of a node's weight

typedef struct HTNode // information of a single node
{
	DataType weight; // weight of the node
	int parent; // subscript of the parent node
	int lc, rc; // subscripts of the left and right children
}*HuffmanTree;

typedef char **HuffmanCode; // type of the character-pointer array that stores the generated codes

// In the subscript range 1..n, find the indices of the two unmerged nodes (parent == 0) with the smallest weights; s1 gets the smaller weight
void Select(HuffmanTree& HT, int n, int& s1, int& s2)
{
	int min;
	// find the first minimum
	for (int i = 1; i <= n; i++)
	{
		if (HT[i].parent == 0)
		{
			min = i;
			break;
		}
	}
	for (int i = min + 1; i <= n; i++)
	{
		if (HT[i].parent == 0 && HT[i].weight < HT[min].weight)
			min = i;
	}
	s1 = min; // the first minimum goes to s1
	// find the second minimum
	for (int i = 1; i <= n; i++)
	{
		if (HT[i].parent == 0 && i != s1)
		{
			min = i;
			break;
		}
	}
	for (int i = min + 1; i <= n; i++)
	{
		if (HT[i].parent == 0 && HT[i].weight < HT[min].weight && i != s1)
			min = i;
	}
	s2 = min; // the second minimum goes to s2
}

// Build the Huffman tree
void CreateHuff(HuffmanTree& HT, DataType* w, int n)
{
	int m = 2 * n - 1; // total number of nodes in the Huffman tree
	HT = (HuffmanTree)calloc(m + 1, sizeof(HTNode)); // allocate m+1 HTNodes, because the one at subscript 0 stores no data
	for (int i = 1; i <= n; i++)
	{
		HT[i].weight = w[i - 1]; // assign the weights to the n leaf nodes
	}
	for (int i = n + 1; i <= m; i++) // build the Huffman tree
	{
		// select the two nodes s1 and s2 with the smallest weights and create their parent node
		int s1, s2;
		Select(HT, i - 1, s1, s2); // in the subscript range 1..i-1, find the indices of the two smallest weights; s1 gets the smaller one
		HT[i].weight = HT[s1].weight + HT[s2].weight; // the weight of i is the sum of the weights of s1 and s2
		HT[s1].parent = i; // the parent of s1 is i
		HT[s2].parent = i; // the parent of s2 is i
		HT[i].lc = s1; // the left child of i is s1
		HT[i].rc = s2; // the right child of i is s2
	}
	// print the relationships between the nodes of the Huffman tree
	printf("The Huffman tree is:\n");
	printf("Index  Weight   Parent   LChild   RChild\n");
	printf("0\n");
	for (int i = 1; i <= m; i++)
	{
		printf("%-4d   %-6.2lf   %-6d   %-6d   %-6d\n", i, HT[i].weight, HT[i].parent, HT[i].lc, HT[i].rc);
	}
	printf("\n");
}

// Generate the Huffman codes
void HuffCoding(HuffmanTree& HT, HuffmanCode& HC, int n)
{
	HC = (HuffmanCode)malloc(sizeof(char*)*(n + 1)); // allocate n+1 slots, because the slot at subscript 0 is unused
	char* code = (char*)malloc(sizeof(char)*n); // working buffer; a code is at most n characters long (n-1 digits plus the terminating '\0')
	code[n - 1] = '\0'; // the last position of the buffer is '\0'
	for (int i = 1; i <= n; i++)
	{
		int start = n - 1; // before generating each code, point start back at the '\0'
		int c = i; // the code of the i-th value is being generated
		int p = HT[c].parent; // the parent of this node
		while (p) // stop when the parent is 0, i.e. when c is already the root
		{
			if (HT[p].lc == c) // if the node is the left child of its parent the digit is 0, otherwise 1
				code[--start] = '0';
			else
				code[--start] = '1';
			c = p; // continue encoding upwards
			p = HT[c].parent; // the parent of c
		}
		HC[i] = (char*)malloc(sizeof(char)*(n - start)); // allocate memory for storing this code
		strcpy(HC[i], &code[start]); // copy the code into the corresponding slot of the pointer array
	}
	free(code); // release the working buffer
}

// main function
int main()
{
	int n = 0;
	printf("Enter the number of values: ");
	scanf("%d", &n);
	DataType* w = (DataType*)malloc(sizeof(DataType)*n);
	if (w == NULL)
	{
		printf("malloc fail\n");
		exit(-1);
	}
	printf("Enter the values: ");
	for (int i = 0; i < n; i++)
	{
		scanf("%lf", &w[i]);
	}
	HuffmanTree HT;
	CreateHuff(HT, w, n); // build the Huffman tree

	HuffmanCode HC;
	HuffCoding(HT, HC, n); // generate the Huffman codes

	for (int i = 1; i <= n; i++) // print the Huffman codes
	{
		printf("The code of %.2lf is: %s\n", HT[i].weight, HC[i]);
	}
	free(w);
	return 0;
}
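
To try the functions without interactive input, a fixed-input driver for the 7, 5, 4, 2 walkthrough could look like the sketch below (my own addition; call it from main, or temporarily rename it to main, since the file above already defines one):

// Build the tree for the weights 7, 5, 4, 2 and print their Huffman codes.
void RunExample(void)
{
	DataType w[] = { 7, 5, 4, 2 };
	int n = 4; // number of weights
	HuffmanTree HT;
	HuffmanCode HC;
	CreateHuff(HT, w, n);  // build the Huffman tree (also prints the node table)
	HuffCoding(HT, HC, n); // generate the Huffman codes
	for (int i = 1; i <= n; i++)
		printf("%.2lf -> %s\n", HT[i].weight, HC[i]);
}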

We test it with the following problem:

It is known that only eight kinds of characters can appear in a certain system of a communication network, and their probabilities are 0.05, 0.29, 0.07, 0.08, 0.14, 0.23, 0.03, and 0.11. Design Huffman codes for them.

Running result:
[Figure: program output]

Origin blog.csdn.net/chenlong_cxy/article/details/117929139