Implementation of Huffman Coding

Although Huffman coding seems to be rarely used in ACM contests, it often appears in computer science books as a basic algorithm.

My own understanding of Huffman coding is limited to its use in the coding field: it can improve the efficiency of data transmission, or be used to compress files. This may not be entirely accurate, as I have not checked it in detail.

Huffman codes can be obtained by building a Huffman tree.


Example


Let's use a simple example to briefly describe what Huffman coding is and what benefit it brings.

Scenario: Region X needs to send some text to Region Y. The two places are connected by cable (or by telegraph), and the message must be transmitted as the shortest possible binary stream. The message is: ABACDAAB

You can see that this message contains 4 A's, 2 B's, 1 C, and 1 D.

1. With an ordinary fixed-length binary encoding, A: 00, B: 01, C: 10, D: 11, the message becomes the binary stream 0001001011000001, a total of 16 bits.

2. With Huffman coding, one possible assignment is A: 1, B: 01, C: 000, D: 001; the message becomes the binary stream 10110000011101, a total of 14 bits, 2 bits fewer than the ordinary encoding.

At this point, we might conclude that Huffman coding simply means: the higher a character's frequency, the shorter its code?
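The two bit counts above can be checked with a few lines of C++. This is only a minimal sketch: the function name `encode` and the table names `kFixed` and `kHuffman` are made up for illustration and do not appear in the program later in this post.

```cpp
#include <map>
#include <string>

// Concatenate the code word of every character in `message`.
// `table` maps each character to its bit string (hypothetical helper,
// not part of the original program).
std::string encode(const std::string &message,
                   const std::map<char, std::string> &table)
{
    std::string bits;
    for (char c : message)
        bits += table.at(c); // throws std::out_of_range for an unknown character
    return bits;
}

// The two code tables from the example above.
const std::map<char, std::string> kFixed = {
    {'A', "00"}, {'B', "01"}, {'C', "10"}, {'D', "11"}};
const std::map<char, std::string> kHuffman = {
    {'A', "1"}, {'B', "01"}, {'C', "000"}, {'D', "001"}};
```

Calling `encode("ABACDAAB", kFixed).size()` gives 16 and `encode("ABACDAAB", kHuffman).size()` gives 14, matching the hand count.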


Let's try assigning codes like this: A: 0, B: 1, C: 01, D: 10. The binary stream is 0100110001, which is only 10 bits, fewer than Huffman!

This seems to make sense, but it ignores that Huffman coding is used for transmitting and compressing information. When we give the receiver the encoding rules and the binary stream, the receiver has to restore the stream to the original message. The ordinary encoding and the Huffman encoding can both be restored without trouble, but the third one cannot: is 0100... ABAA... or CAA...?

So a Huffman code must be a prefix code, and it is also an optimal prefix code. (Definition of a prefix code: within a character set, the code of no character is a prefix of the code of any other character.)

A Huffman code is therefore a code that achieves both of these points: it is a prefix code, and among prefix codes it gives the shortest total length for the given character frequencies.
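The prefix property can be tested mechanically. The sketch below is one straightforward way to do it, comparing every pair of code words; the name `isPrefixFree` is an assumption made for this illustration.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Returns true when no code word is a prefix of another code word.
// Two identical code words also fail the test.
bool isPrefixFree(const std::vector<std::string> &codes)
{
    for (std::size_t i = 0; i < codes.size(); i++)
        for (std::size_t j = 0; j < codes.size(); j++)
        {
            if (i == j)
                continue;
            // codes[i] is a prefix of codes[j] when codes[j] starts with it
            if (codes[j].compare(0, codes[i].size(), codes[i]) == 0)
                return false;
        }
    return true;
}
```

For the three tables in the example, `{"00","01","10","11"}` and `{"1","01","000","001"}` pass, while `{"0","1","01","10"}` fails because 0 is a prefix of 01.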


Solution


Using the same example, let's see how to find the Huffman code of each character in ABACDAAB by constructing a Huffman tree:

Step 1: Count the number of occurrences of each character and use it as the weight of that character's node;

Step 2: Repeatedly select the two nodes with the smallest weights, merge them under a new parent node whose weight is their sum, and put the parent back; repeat until only one node is left. Then label every left branch 0 and every right branch 1; the code of a character is the sequence of labels on the path from the root to its leaf.
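The merging in step 2 can be tried out on the weights alone, before any tree is built. Each merge adds one bit to the code of every character below the two merged nodes, so adding up the merged weights gives the total length of the encoded message. The following is a minimal sketch of that idea; `huffmanTotalBits` is a hypothetical name, and the function only computes the length, not the codes.

```cpp
#include <functional>
#include <queue>
#include <vector>

// Total number of bits of the encoded message, given the character weights.
// Note: with a single distinct character this returns 0, whereas the program
// below assigns that character the 1-bit code "0".
int huffmanTotalBits(const std::vector<int> &weights)
{
    // std::greater turns the priority queue into a min-heap
    std::priority_queue<int, std::vector<int>, std::greater<int> > heap(
        weights.begin(), weights.end());
    int total = 0;
    while (heap.size() > 1)
    {
        int a = heap.top(); heap.pop(); // smallest weight
        int b = heap.top(); heap.pop(); // next smallest weight
        total += a + b;                 // every character below gets one more bit
        heap.push(a + b);               // put the merged node back
    }
    return total;
}
```

For ABACDAAB the weights are 4, 2, 1, 1: the merges are 1+1=2, 2+2=4 and 4+4=8, so the function returns 2+4+8=14, the same 14 bits as in the example.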

The specific process is shown in the following figure:



Implementation of the code


In step 1, the number of occurrences of each letter can be obtained by traversing the text once and counting into a hash array.

Step 2 can be implemented by building the tree as a linked binary tree, where each node holds pointers to its left and right children.

In step 2, each time the two nodes with the smallest weights are popped from the set and merged under one parent node, the parent must be put back. With an ordinary queue we would have to add code that re-sorts after every insertion, and building our own wheel would take a while. Fortunately, the C++ standard library provides a "priority queue" that meets our needs. (In a priority queue, every element has a priority; when an element is accessed, the one with the highest priority is dequeued first.)
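As a small sketch of this counting step, assuming (like the program below) that only the uppercase letters A to Z are of interest; the name `countLetters` is made up for this example:

```cpp
#include <string>
#include <vector>

// Count how often each uppercase letter occurs in `text`.
// Index 0 is 'A', index 25 is 'Z'. Other characters are skipped,
// so they can never index outside the array.
std::vector<int> countLetters(const std::string &text)
{
    std::vector<int> count(26, 0);
    for (char c : text)
        if (c >= 'A' && c <= 'Z')
            count[c - 'A']++;
    return count;
}
```

For ABACDAAB this yields 4 at index 0, 2 at index 1, and 1 at indices 2 and 3.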
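By default, std::priority_queue hands out the largest element according to operator<, so the comparison has to be reversed to get the smallest weight first. The sketch below shows this on a tiny struct; `Item` and `popOrder` are hypothetical names used only here.

```cpp
#include <queue>
#include <string>
#include <vector>

struct Item
{
    char ch; // character
    int val; // weight
};

// Reversed comparison: an item with a smaller weight counts as "larger",
// so the priority queue pops it first.
bool operator<(const Item &a, const Item &b)
{
    return a.val > b.val;
}

// Push all items, then return the characters in the order they are popped.
std::string popOrder(const std::vector<Item> &items)
{
    std::priority_queue<Item> q;
    for (const Item &it : items)
        q.push(it);
    std::string order;
    while (!q.empty())
    {
        order += q.top().ch;
        q.pop();
    }
    return order;
}
```

With the items A (weight 4), B (weight 2) and C (weight 1), the characters come out in the order C, B, A. Items with equal weights may come out in either order.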


Let's take a look at the specific code:

#include<stdio.h>
#include<stdlib.h>
#include<string.h>
#include<queue>
using namespace std;

typedef struct node{
	char ch; //Store the character represented by this node, only used by leaf nodes
	int val; //record the weight of the node
	struct node *self,*left,*right; //Three pointers, used to record their own address, the address of the left child and the address of the right child respectively
	friend bool operator <(const node &a,const node &b) //Overload the operator to define the comparison used by the priority queue
	{
		return a.val>b.val; //Reversed on purpose: the node with the smallest weight gets the highest priority
	}
}node;

priority_queue<node> p; //Define the priority queue
char res[30]; //Used to record Huffman encoding
void dfs(node *root,int level) //Print characters and corresponding Huffman codes
{
	if(root->left==root->right) //The address of the left child of the leaf node must be equal to the address of the right child, and both must be NULL; the leaf node records characters
	{
			if(level==0) //The text contains only one distinct character (e.g. "AAAA"), so the root itself is a leaf
		{
			res[0]='0';
			level++;
		}
		res[level]='\0'; //Character array ends with '\0'
		printf("%c=>%s\n",root->ch,res);
	}
	else
	{
		res[level]='0'; //Left branch is 0
		dfs(root->left,level+1);
		res[level]='1'; //The right branch is 1
		dfs(root->right,level+1);
	}
}
void huffman(int *hash) //Build Huffman tree
{
	node *root,fir,sec;
	for(int i=0;i<26;i++) //The program can only process information strings that are all uppercase English characters, so there are only 26 hashes
	{
		if(!hash[i]) //The corresponding letter does not appear in the text
			continue;
		root=(node *)malloc(sizeof(node)); //Allocate a node
		root->self=root; //Record your own address so that the parent node can connect to itself
		root->left=root->right=NULL; //This node is a leaf node, and the left and right child addresses are both NULL
		root->ch='A'+i; //Record the character represented by the node
		root->val=hash[i]; //Record the weight of the character
		p.push(*root); //Push the node into the priority queue
	}
    //The following loop simulates the tree building process: each time, the two smallest nodes are taken out, merged, and the merged node is pushed back into the queue
    //When only 1 node remains in the queue, the Huffman tree is complete
	while(p.size()>1)
	{
		fir=p.top();p.pop(); //Remove the smallest node
		sec=p.top();p.pop(); //Remove the next smallest node
		root=(node *)malloc(sizeof(node)); //Build a new node as the parent node of fir and sec
		root->self=root; //Record your own address to facilitate the connection of the parent node of the node
		root->left=fir.self; //Record the address of the left child node
		root->right=sec.self; //Record the address of the right child node
		root->val=fir.val+sec.val;//The weight of the node is the sum of the weights of the two children
		p.push(*root); //Push the new node into the queue
	}
	fir=p.top();p.pop(); //Pop up the root node of the Huffman tree
	dfs(fir.self,0); //Output the character recorded by the leaf node and the corresponding Huffman code
}
int main()
{
	char text[100];
	int hash[30];
	memset(hash,0,sizeof(hash)); //Hash array initialization is all 0
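	if(scanf("%99s",text)!=1) //Read in the information string text; the width limit keeps it inside the buffer
		return 0;
	int total=0; //Number of valid characters counted
	for(int i=0;text[i]!='\0';i++)//Get the number of occurrences of each character by hash
	{
		if(text[i]<'A'||text[i]>'Z') //Skip anything that is not an uppercase letter, so hash is never indexed out of range
			continue;
		hash[text[i]-'A']++; //The program only handles uppercase letters
		total++;
	}
	if(total>0) //Build the tree only if at least one letter was counted
		huffman(hash);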
	return 0;
}


Question: Why does each node need a pointer to record its own address?

Because when a node is merged, its address has to be stored in a child pointer of the parent node. What we push into the queue each time is a node variable (a copy), not the node's address, so the node needs a field that records its own address. Then what if we change the definition of the priority queue so that it stores addresses instead?

priority_queue<struct node *> p;
p.push(root);

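Not as written: operator< cannot be overloaded when both operands are pointers, so this queue would order the nodes by their addresses instead of by their weights. A priority queue of pointers does work, however, if a comparison function object is passed as the third template parameter of priority_queue. For details, please refer to the blog of "liuzhanchen1987". Thanks to this blogger for sharing.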
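For completeness, here is a minimal sketch of a priority queue that stores node addresses and orders them with a comparison function object passed as the third template parameter of priority_queue. It is an alternative to the program above, not a description of it: the names `PNode`, `ByWeight`, `NodeQueue` and `buildTree` are invented for this sketch, and like the original program it never frees the nodes.

```cpp
#include <queue>
#include <vector>

struct PNode // hypothetical name, to avoid clashing with `node` above
{
    char ch;             // character, only meaningful for leaves
    int val;             // weight
    PNode *left, *right; // children, both null for leaves
};

// Function object used by the queue instead of operator<.
struct ByWeight
{
    bool operator()(const PNode *a, const PNode *b) const
    {
        return a->val > b->val; // smaller weight means higher priority
    }
};

typedef std::priority_queue<PNode *, std::vector<PNode *>, ByWeight> NodeQueue;

// Build a Huffman tree from 26 letter counts ('A'..'Z').
// Returns the root, or a null pointer if every count is 0.
PNode *buildTree(const int *count)
{
    NodeQueue q;
    for (int i = 0; i < 26; i++)
        if (count[i])
            q.push(new PNode{char('A' + i), count[i], nullptr, nullptr});
    if (q.empty())
        return nullptr;
    while (q.size() > 1)
    {
        PNode *fir = q.top(); q.pop(); // smallest node
        PNode *sec = q.top(); q.pop(); // next smallest node
        q.push(new PNode{'\0', fir->val + sec->val, fir, sec});
    }
    return q.top();
}
```

Because the queue holds the addresses themselves, the `self` field is no longer needed in this variant.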


Demo program


Earlier, I used VC6 to write two small programs that use Huffman coding: one encodes and sends a message, the other decodes and restores it.
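The source of those two programs is not shown in this post. As a rough idea of what the decoding side has to do, here is a minimal sketch that restores a message from a bit string and a code table; the names `decode` and `byCode` are assumptions made for this illustration, and the sketch relies on the table being a prefix code.

```cpp
#include <map>
#include <string>

// Decode `bits` using `table`, which maps each character to its code word.
// The table must be a prefix code, otherwise the result is not well defined.
std::string decode(const std::string &bits,
                   const std::map<char, std::string> &table)
{
    // Invert the table: code word -> character
    std::map<std::string, char> byCode;
    for (const auto &entry : table)
        byCode[entry.second] = entry.first;

    std::string text, current;
    for (char bit : bits)
    {
        current += bit;
        std::map<std::string, char>::const_iterator it = byCode.find(current);
        if (it != byCode.end()) // a complete code word has been read
        {
            text += it->second;
            current.clear();
        }
    }
    return text; // bits left over in `current` formed no complete code word
}
```

With the Huffman table from the example (A: 1, B: 01, C: 000, D: 001), decoding 10110000011101 gives back ABACDAAB.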

If needed, you can "click to download" them. The programs can only transmit uppercase English letters, digits, and a small amount of punctuation; to transmit more characters, you can enlarge the hash array.

Only the source code and screenshots of the programs running are provided, sorry ^_^

