C language implementation of Huffman coding

This article presents the principle of Huffman coding with a worked example, followed by a C implementation and its simulation results.

1. Huffman coding principle and examples

Huffman coding is a form of source coding whose goal is to minimize the average code length, thereby using the channel as efficiently as possible.
For example, suppose a message is a sequence drawn from five characters a, b, c, d, e with occurrence probabilities 0.12, 0.40, 0.15, 0.08, 0.25 respectively. Two encodings (mappings) are given below:

character   probability   encoding method 1   encoding method 2
a           0.12          000                 000
b           0.40          001                 11
c           0.15          010                 01
d           0.08          011                 001
e           0.25          100                 10

In encoding method 1, no 3-bit codeword is a prefix of another, so it is clearly a prefix code: decoding simply takes 3 bits at a time and maps each group to one character.
Encoding method 2 is also a prefix code, but because its codewords have different lengths this is harder to see at a glance. A binary tree makes it apparent (any prefix code can be represented by a binary tree):
[Figure: binary tree for encoding method 1]

[Figure: binary tree for encoding method 2]

Thus a prefix code can be viewed as a set of paths in a binary tree: label each left branch 0 and each right branch 1, and attach the characters to the leaf nodes. The sequence of 0s and 1s on the path from the root to a leaf is the codeword for that leaf's character.
In encoding method 1 every codeword has length 3, so the average code length is 3. For encoding method 2 the average is 0.12 × 3 + 0.40 × 2 + 0.15 × 2 + 0.08 × 3 + 0.25 × 2 = 2.2. Encoding method 2 is clearly more efficient.

The optimal prefix code can be obtained by using the Huffman algorithm.

First, select the two characters with the lowest probabilities from the given set; in the example above these are a and d. Construct a parent node for them (call it x) whose probability is the sum of the probabilities of a and d and whose children are a and d. Then apply the same step recursively to the set formed by the remaining characters together with the new node, until a single tree remains. Traversing this binary tree yields the optimal prefix code.

The construction proceeds in the following order:

[Figure: construction order]

This procedure yields the optimal encoding:

character   probability   Huffman code
a           0.12          1111
b           0.40          0
c           0.15          110
d           0.08          1110
e           0.25          10

The average code length works out to 0.12 × 4 + 0.40 × 1 + 0.15 × 3 + 0.08 × 4 + 0.25 × 2 = 2.15.

2. C language implementation of Huffman coding

The symbols are A through P, 16 symbols in total, with the following occurrence probabilities:

symbol probability
A 0.06
B 0.12
C 0.15
D 0.05
E 0.06
F 0.02
G 0.07
H 0.03
I 0.13
J 0.09
K 0.07
L 0.06
M 0.02
N 0.02
O 0.01
P 0.04

1. Initialization

First, read in the initial information and set up the node array. Instead of pointer variables, array indices serve as node addresses.

typedef struct
{
	double weight;
	int lchild;
	int rchild;
	int parent;
} Huffman;	// Huffman tree node definition

Each structure instance represents one binary tree node; the node array H holds node_num of them. For a full binary tree built from symbol_num leaves, the total number of nodes is 2 × symbol_num − 1. For ease of modification, the symbol count, node count, and probability table are defined up front:

#define symbol_num 16
#define node_num (2 * (symbol_num) - 1)
double symbol_P[symbol_num] = {
	0.06, 0.12, 0.15, 0.05, 0.06, 0.02,
	0.07, 0.03, 0.13, 0.09, 0.07, 0.06, 0.02, 0.02, 0.01, 0.04};

During initialization the weights of the first symbol_num (16) nodes are the corresponding probabilities. For the convenience of later sorting, the working values of the remaining (internal) nodes are set to 1 so they sort to the end. Parent and child indices are initialized to -1, following the usual data structure convention. To recover symbol indices after sorting, a list array stores the sorted order of node numbers. Finally, to avoid destroying the original probability table, the sort operates on a working copy symbol_Ptemp.

for (int i = 0; i < node_num; i++)
{
	if (i < symbol_num)
		symbol_Ptemp[i] = symbol_P[i];
	else
		symbol_Ptemp[i] = 1;
}

for (int i = 0; i < node_num; i++)
{
	list[i] = i;
}
for (int i = 0; i < node_num; i++)
{
	H[i].parent = -1;
	H[i].lchild = -1;
	H[i].rchild = -1;

	if (i < symbol_num)
	{
		H[i].weight = symbol_P[i];
		code_len[i] = 0;	// set for convenient output later; explained below
	}
	else
		H[i].weight = 0;
}

2. Bubble sort

Each round, the working array must be re-sorted so the two smallest weights can be taken out; the node numbers of the two minima are recorded as the result. The sorted order also has to be tracked, which is exactly what the list array is for: it is permuted alongside the probabilities. The sort is a standard ascending bubble sort; note that elements are swapped only on a strict greater-than comparison, so equal values keep their relative order.

int i, j, min0, min1;
double temp;
int temp1;
for (i = 0; i < node_num - 1; i++)
{
	for (j = 0; j < node_num - 1 - i; j++)
	{
		if (symbol_Ptemp[j] > symbol_Ptemp[j + 1])
		{
			temp = symbol_Ptemp[j];
			symbol_Ptemp[j] = symbol_Ptemp[j + 1];
			symbol_Ptemp[j + 1] = temp;

			temp1 = list[j];
			list[j] = list[j + 1];
			list[j + 1] = temp1;
		}
	}
}
min0 = list[0];
min1 = list[1];

3. Construct nodes

The logic here is simple: given the node numbers of the two minima, construct a parent node for them. Once a parent is built, its two children must no longer participate in comparisons. After sorting, the two minima sit at positions 0 and 1 of the working array, so assigning 1.0 to those positions pushes them past every active weight in the next ascending sort, effectively retiring them.

H[i].weight = H[min0].weight + H[min1].weight;
symbol_Ptemp[i] = H[i].weight;
H[i].lchild = min0;
H[i].rchild = min1;
H[min0].parent = i;
H[min1].parent = i;
symbol_Ptemp[0] = 1.0;
symbol_Ptemp[1] = 1.0;
Bubble();

Iterating this step for every parent node, from index symbol_num up to the root at index node_num - 1, completes the tree.

3. Simulation results

Note that the codes produced by this experiment are not unique; they depend on the shape of the tree, which can vary with tie-breaking. The resulting average code length, however, is always the same and minimal.

[Figure: simulation results]

4. Simulation code

The output function is somewhat involved and incidental to the simulation itself, so it is not explained in detail here.

// Huffman Coding
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#define symbol_num 16
#define node_num (2 * (symbol_num) - 1)

double symbol_P[symbol_num] = {
	0.06, 0.12, 0.15, 0.05, 0.06, 0.02,
	0.07, 0.03, 0.13, 0.09, 0.07, 0.06, 0.02, 0.02, 0.01, 0.04};	// probability table
double symbol_Ptemp[node_num];	// working copy for sorting; new nodes are appended at the end
int list[node_num];				// node number at each sorted position
int min1, min0;					// node numbers of the two smallest weights
char code[symbol_num];			// codeword buffer for one symbol
int code_len[symbol_num];		// code length of each symbol

typedef struct
{
	double weight;
	int lchild;
	int rchild;
	int parent;
} Huffman;						// Huffman tree node definition

void Initsymbol_P();				// initialize the working probability array
void InitHT(Huffman H[node_num]);	// initialize the nodes
void Bubble();						// bubble sort
void Nodeproduce();					// build the Huffman tree
void OutputTree();

Huffman H[node_num];

int main()
{
	Initsymbol_P();
	Bubble();
	InitHT(H);
	Nodeproduce();
	OutputTree();

	return 0;
}

void Initsymbol_P()
{
	for (int i = 0; i < node_num; i++)
	{
		if (i < symbol_num)
			symbol_Ptemp[i] = symbol_P[i];
		else
			symbol_Ptemp[i] = 1;	// internal nodes sort last until they receive a weight
	}

	for (int i = 0; i < node_num; i++)
	{
		list[i] = i;
	}
}

void InitHT(Huffman H[node_num])	// initialize the Huffman tree
{
	for (int i = 0; i < node_num; i++)
	{
		H[i].parent = -1;
		H[i].lchild = -1;
		H[i].rchild = -1;

		if (i < symbol_num)
		{
			H[i].weight = symbol_P[i];
			code_len[i] = 0;
		}
		else
			H[i].weight = 0;
	}
}

void Bubble()
{
	int i, j;
	double temp;
	int temp1;
	for (i = 0; i < node_num - 1; i++)
	{
		for (j = 0; j < node_num - 1 - i; j++)
		{
			if (symbol_Ptemp[j] > symbol_Ptemp[j + 1])	// strict >: equal values keep their order
			{
				temp = symbol_Ptemp[j];
				symbol_Ptemp[j] = symbol_Ptemp[j + 1];
				symbol_Ptemp[j + 1] = temp;

				temp1 = list[j];
				list[j] = list[j + 1];
				list[j + 1] = temp1;
			}
		}
	}
	min0 = list[0];
	min1 = list[1];
}

void Nodeproduce()
{
	for (int i = symbol_num; i < node_num; i++)
	{
		H[i].weight = H[min0].weight + H[min1].weight;
		symbol_Ptemp[i] = H[i].weight;	// the new node enters the sorting pool
		H[i].lchild = min0;
		H[i].rchild = min1;
		H[min0].parent = i;
		H[min1].parent = i;
		symbol_Ptemp[0] = 1.0;	// retire the two extracted minima
		symbol_Ptemp[1] = 1.0;
		Bubble();
	}
}

void OutputTree()
{
	for (int i = 0; i < node_num; i++)
	{
		printf("%d, %f, %d, %d, %d\n", i, H[i].weight, H[i].lchild, H[i].rchild, H[i].parent);
	}

	for (int i = 0; i < symbol_num; i++)
	{
		int temp_i = i;
		int temp_iup = H[temp_i].parent;
		while (temp_iup != -1)		// first pass: measure the code length
		{
			code_len[i]++;
			temp_i = temp_iup;
			temp_iup = H[temp_i].parent;
		}
		code[code_len[i]] = '\0';
		temp_i = i;
		temp_iup = H[temp_i].parent;
		int j = 1;
		while (temp_iup != -1)		// second pass: fill the codeword from the back
		{
			if (H[temp_iup].lchild == temp_i)
			{
				code[code_len[i] - j] = '0';
			}
			if (H[temp_iup].rchild == temp_i)
			{
				code[code_len[i] - j] = '1';
			}
			temp_i = temp_iup;
			temp_iup = H[temp_i].parent;
			j++;
		}
		printf("'%c': Pro = %f   Huffman Code length: %d  ", i + 'A', H[i].weight, code_len[i]);
		printf("Huffman Code -> %s \n", code);
	}
}


Origin blog.csdn.net/Koroti/article/details/108585954