POJ 1521: Entropy

Problem description

An entropy encoder is a data encoding method that achieves lossless data compression by encoding a message with "wasted" or "extra" information removed.
Consider the text "AAAAABCD". Encoding it in ASCII requires 64 bits. Since the character "A" occurs more frequently, could we do better by encoding it with fewer bits? An optimal encoding is "A" as "0", "B" as "10", "C" as "110", and "D" as "111". (This is obviously not the only optimal encoding, because the codes for B, C, and D can be freely interchanged without increasing the size of the final encoded message.) With this encoding, the message takes only 13 bits, "0000010110111", for a compression ratio of 4.9 to 1 (that is, each bit in the final encoded message represents as much information as 4.9 bits in the original encoding).
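To make the 13-bit figure concrete, here is a minimal sketch that applies the code table from the statement to the sample text. The helper names `encodeWithTable` and `sampleTable` are hypothetical and are not part of the original solution; a C++11 compiler is assumed.

```cpp
#include <map>
#include <string>

// Encode a message with a fixed code table. Characters missing from the
// table are skipped (a simplification chosen for this sketch).
std::string encodeWithTable(const std::string& text,
                            const std::map<char, std::string>& table) {
    std::string bits;
    for (char c : text) {
        std::map<char, std::string>::const_iterator it = table.find(c);
        if (it != table.end()) bits += it->second;
    }
    return bits;
}

// The code table given in the problem statement for "AAAAABCD".
std::map<char, std::string> sampleTable() {
    std::map<char, std::string> table;
    table['A'] = "0";
    table['B'] = "10";
    table['C'] = "110";
    table['D'] = "111";
    return table;
}
```

Calling `encodeWithTable("AAAAABCD", sampleTable())` yields the bit string "0000010110111": 5 bits for the five A's plus 2 + 3 + 3 bits for B, C, and D, which is 13 bits in total.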

Input

The input file will contain a list of text strings, one per line. The text string will only contain uppercase alphanumeric characters and underscores (used to replace spaces). The end of the input will be signaled by a line containing only the word "END" as a text string.

Output

For each text string in the input, output the bit length of the 8-bit ASCII encoding, the bit length of an optimal prefix-free variable-length encoding, and the compression ratio accurate to one decimal place.

Sample input

AAAAABCD
THE_CAT_IN_THE_HAT
END

Sample output

64 13 4.9
144 51 2.8

Problem-solving ideas

This is a classic Huffman tree problem. The first output is the number of bits needed to store the string in ASCII, that is, length * 8. The second output is the number of bits needed to store the string with a variable-length binary code using as few bits as possible, which calls for Huffman coding. We only need to count how many times each distinct character occurs and use those counts as weights; the optimal encoded length is then the sum of the weights of all newly generated (merged) nodes. The last output is the first output divided by the second.
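The merging process described above can be traced with a small sketch. The function `mergeSums` is a hypothetical helper (not from the original post) that records the weight of every newly generated node; adding those weights gives the encoded length when the string has at least two distinct characters. A C++11 compiler is assumed.

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// Repeatedly merge the two smallest weights and record each merged sum.
// Returns the weights of the newly generated nodes in merge order.
std::vector<int> mergeSums(const std::vector<int>& weights) {
    std::priority_queue<int, std::vector<int>, std::greater<int> > heap;
    for (std::size_t i = 0; i < weights.size(); ++i) heap.push(weights[i]);

    std::vector<int> sums;
    while (heap.size() > 1) {
        int a = heap.top(); heap.pop();   // smallest remaining weight
        int b = heap.top(); heap.pop();   // second smallest
        sums.push_back(a + b);
        heap.push(a + b);                 // the merged node goes back in
    }
    return sums;
}
```

For "THE_CAT_IN_THE_HAT" the character counts are T=4, _=4, H=3, E=2, A=2, C=1, I=1, N=1. Passing those weights gives the merged sums 2, 3, 4, 6, 8, 10, 18, which add up to 51, matching the sample output. For "AAAAABCD" (weights 5, 1, 1, 1) the sums are 2, 3, 8, which add up to 13.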

Code

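#include <cstdio>
#include <cstring>
#include <functional>
#include <iostream>
#include <queue>
#include <string>
#include <vector>
using namespace std;
// One counter per character code. The input may contain digits and '_',
// so indexing a 40-element array by s[i] - 'A' could go out of bounds.
int arr[256];
string s;
int main()
{
	while (cin >> s && s != "END")
	{
		// Min-heap of character frequencies
		priority_queue<int, vector<int>, greater<int> > Q;
		int len = s.size();
		for (int i = 0; i < len; i++)
		{
			arr[(unsigned char)s[i]]++;
		}
		cout << len * 8 << " ";
		for (int i = 0; i < 256; i++)
		{
			if (arr[i])
			{
				Q.push(arr[i]);
			}
		}
		/*
		* Method one
		*/
		//int total = 0;
		//if (Q.size() == 1) {
		//	total = len;
		//}
		//while (Q.size() > 1)
		//{
		//	int sum = 0;
		//	sum += Q.top();
		//	Q.pop();
		//	sum += Q.top();
		//	Q.pop();
		//	Q.push(sum);
		//	//cout << sum << endl;
		//	total += sum;
		//}
		// Method two: the final merge always equals len, so start from len
		// and stop merging once two nodes remain.
		int total = len;
		while (Q.size() > 2)
		{
			int sum = 0;
			sum += Q.top();
			Q.pop();
			sum += Q.top();
			Q.pop();
			Q.push(sum);
			//cout << sum << endl;
			total += sum;
		}
		cout << total << " ";
		printf("%.1f\n", (double)len * 8 / total);
		// Reset the counters for the next line of input
		memset(arr, 0, sizeof(arr));
	}
	return 0;
}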

Code analysis

There are two ways to accumulate the total. The key case is a string made of a single kind of character, such as AAAAAAAAAAAAA. With method one, if there is no Q.size() == 1 check, the loop never runs and total stays 0. That clearly contradicts the problem, since each character still needs one bit, so the extra check is required. Strings with more than one kind of character are handled as usual.
Method two initializes total directly to the length of the string, which is exactly the value produced by the last merge when building the Huffman tree: sum of all weights == number of characters == result of the final merge. The loop therefore stops one merge early, and the case where all characters are the same needs no special handling.
Still, I recommend the first method. It is easier to understand, but you have to be careful about the special case.
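The two approaches can be compared side by side in a short sketch. The function names `bitsMethodOne`, `bitsMethodTwo`, and `buildHeap` are hypothetical, and both functions take the character frequencies directly instead of reading a string, which is a simplification for illustration.

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

typedef std::priority_queue<int, std::vector<int>, std::greater<int> > MinHeap;

// Build a min-heap from the weights and report their sum (the string length).
MinHeap buildHeap(const std::vector<int>& weights, int& len) {
    MinHeap heap;
    len = 0;
    for (std::size_t i = 0; i < weights.size(); ++i) {
        heap.push(weights[i]);
        len += weights[i];
    }
    return heap;
}

// Method one: add up every merge, with a special case for one distinct symbol.
int bitsMethodOne(const std::vector<int>& weights) {
    int len = 0;
    MinHeap heap = buildHeap(weights, len);
    if (heap.size() == 1) return len;   // one bit per character
    int total = 0;
    while (heap.size() > 1) {
        int a = heap.top(); heap.pop();
        int b = heap.top(); heap.pop();
        heap.push(a + b);
        total += a + b;
    }
    return total;
}

// Method two: start from len (the value of the final merge) and stop one merge early.
int bitsMethodTwo(const std::vector<int>& weights) {
    int len = 0;
    MinHeap heap = buildHeap(weights, len);
    int total = len;
    while (heap.size() > 2) {
        int a = heap.top(); heap.pop();
        int b = heap.top(); heap.pop();
        heap.push(a + b);
        total += a + b;
    }
    return total;
}
```

For a string of thirteen A's (weights {13}) both functions return 13, and for the weights of "AAAAABCD" ({5, 1, 1, 1}) both return 13 as well. Removing the `heap.size() == 1` check from `bitsMethodOne` would make it return 0 for the single-character case.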

Gains

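Map each character to an array index, for example by subtracting a base character: 'A' maps to arr[0] = arr[str[i] - 'A'] (when str[i] == 'A'), and 'B' maps to arr[1] = arr[str[i] - 'A'] (when str[i] == 'B'). This offset only works for characters at or above the base; a digit such as '0' would give a negative index.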
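Because the input may also contain digits and underscores, one way to keep the mapping safe is to index by the character code itself. The sketch below assumes a hypothetical helper `countFrequencies` and a C++11 compiler; it is one possible approach, not the original author's code.

```cpp
#include <string>
#include <vector>

// Count occurrences of each character, indexed by its unsigned char value,
// so digits, uppercase letters, and '_' all land on valid indices.
std::vector<int> countFrequencies(const std::string& text) {
    std::vector<int> freq(256, 0);
    for (unsigned char c : text) ++freq[c];
    return freq;
}
```

With this table, `countFrequencies("A1_A")['A']` is 2 and the entries for '1' and '_' are each 1, whereas '1' - 'A' would be -16 under the offset scheme.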

In contests it is usually better to use C-style input and output (scanf/printf), which tends to be faster than iostream with its default settings.

How to use a priority queue (priority_queue with greater<int>) as a min-heap.

Handling special cases: method one must check whether there is more than one kind of character, while method two does not need to.