卜若's Code Notes, Algorithm Series, Case Study 4: Huffman Coding and the Total Encoded Length

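1. The problem gives you a string such as:

AAAAABCD

Encoded in 8-bit ASCII, this string takes 64 bits.

How many bits does it take if Huffman coding is used instead?

Original problem:

Problem 1: Entropy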

Description

An entropy encoder is a data encoding method that achieves lossless data compression by encoding a message with wasted or extra information removed. In other words, entropy encoding removes information that was not necessary in the first place to accurately encode the message. A high degree of entropy implies a message with a great deal of wasted information; English text encoded in ASCII is an example of a message type that has very high entropy. Already compressed messages, such as JPEG graphics or ZIP archives, have very little entropy and do not benefit from further attempts at entropy encoding.

English text encoded in ASCII has a high degree of entropy because all characters are encoded using the same number of bits, eight. It is a known fact that the letters E, L, N, R, S and T occur at a considerably higher frequency than do most other letters in English text. If a way could be found to encode just these letters with four bits, then the new encoding would be smaller, would contain all the original information, and would have less entropy. ASCII uses a fixed number of bits for a reason, however: it's easy, since one is always dealing with a fixed number of bits to represent each possible glyph or character. How would an encoding scheme that used four bits for the above letters be able to distinguish between the four-bit codes and eight-bit codes? This seemingly difficult problem is solved using what is known as a prefix-free variable-length encoding.

In such an encoding, any number of bits can be used to represent any glyph, and glyphs not present in the message are simply not encoded. However, in order to be able to recover the information, no bit pattern that encodes a glyph is allowed to be the prefix of any other encoding bit pattern. This allows the encoded bitstream to be read bit by bit, and whenever a set of bits is encountered that represents a glyph, that glyph can be decoded. If the prefix-free constraint was not enforced, then such a decoding would be impossible.

Consider the text AAAAABCD. Using ASCII, encoding this would require 64 bits. If, instead, we encode A with the bit pattern 00, B with 01, C with 10, and D with 11 then we can encode this text in only 16 bits; the resulting bit pattern would be 0000000000011011. This is still a fixed-length encoding, however; we're using two bits per glyph instead of eight. Since the glyph A occurs with greater frequency, could we do better by encoding it with fewer bits? In fact we can, but in order to maintain a prefix-free encoding, some of the other bit patterns will become longer than two bits. An optimal encoding is to encode A with 0, B with 10, C with 110, and D with 111. (This is clearly not the only optimal encoding, as it is obvious that the encodings for B, C and D could be interchanged freely for any given encoding without increasing the size of the final encoded message.) Using this encoding, the message encodes in only 13 bits to 0000010110111, a compression ratio of 4.9 to 1 (that is, each bit in the final encoded message represents as much information as did 4.9 bits in the original encoding). Read through this bit pattern from left to right and you'll see that the prefix-free encoding makes it simple to decode this into the original text even though the codes have varying bit lengths.
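
To make the left-to-right decoding concrete, here is a minimal sketch in Java. It is not part of the original problem; the class name PrefixDecodeDemo and its helper methods are made up for illustration. It reads the bit string one bit at a time and emits a glyph as soon as the accumulated bits match an entry in the code table, which is unambiguous only because the table is prefix-free.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo class: decodes a bit string with a prefix-free code table.
class PrefixDecodeDemo {

    // Reads bits left to right; emits a glyph whenever the pending bits match a code.
    static String decode(String bits, Map<String, Character> table) {
        StringBuilder out = new StringBuilder();
        StringBuilder pending = new StringBuilder();
        for (int i = 0; i < bits.length(); i++) {
            pending.append(bits.charAt(i));
            Character glyph = table.get(pending.toString());
            if (glyph != null) {
                out.append(glyph);
                pending.setLength(0); // start collecting the next code
            }
        }
        if (pending.length() != 0) {
            throw new IllegalArgumentException("trailing bits do not form a complete code");
        }
        return out.toString();
    }

    // The code table quoted in the problem statement: A=0, B=10, C=110, D=111.
    static Map<String, Character> exampleTable() {
        Map<String, Character> table = new HashMap<>();
        table.put("0", 'A');
        table.put("10", 'B');
        table.put("110", 'C');
        table.put("111", 'D');
        return table;
    }

    public static void main(String[] args) {
        // Prints AAAAABCD
        System.out.println(decode("0000010110111", exampleTable()));
    }
}
```

If some code were a prefix of another (say A=0 and B=01), the lookup would fire on the shorter code first and the decoder could never produce the longer one.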

As a second example, consider the text THE CAT IN THE HAT. In this text, the letter T and the space character both occur with the highest frequency, so they will clearly have the shortest encoding bit patterns in an optimal encoding. The letters C, I and N only occur once, however, so they will have the longest codes.

There are many possible sets of prefix-free variable-length bit patterns that would yield the optimal encoding, that is, that would allow the text to be encoded in the fewest number of bits. One such optimal encoding is to encode spaces with 00, A with 100, C with 1110, E with 1111, H with 110, I with 1010, N with 1011 and T with 01. The optimal encoding therefore requires only 51 bits compared to the 144 that would be necessary to encode the message with 8-bit ASCII encoding, a compression ratio of 2.8 to 1.
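
The 51-bit figure can be checked by summing the code length of every character in the text. The sketch below does exactly that with the code table quoted above; the class name EncodedLengthDemo and its methods are illustrative assumptions, not something the problem defines.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo class: measures the encoded size of a text under a given code table.
class EncodedLengthDemo {

    // Sums the bit length of the code assigned to each character of the text.
    static int encodedBits(String text, Map<Character, String> codes) {
        int bits = 0;
        for (char c : text.toCharArray()) {
            String code = codes.get(c);
            if (code == null) {
                throw new IllegalArgumentException("no code for '" + c + "'");
            }
            bits += code.length();
        }
        return bits;
    }

    // The code table quoted in the problem statement for THE CAT IN THE HAT.
    static Map<Character, String> catCodes() {
        Map<Character, String> codes = new HashMap<>();
        codes.put(' ', "00");
        codes.put('A', "100");
        codes.put('C', "1110");
        codes.put('E', "1111");
        codes.put('H', "110");
        codes.put('I', "1010");
        codes.put('N', "1011");
        codes.put('T', "01");
        return codes;
    }

    public static void main(String[] args) {
        String text = "THE CAT IN THE HAT";
        // Prints: 144 51
        System.out.println(text.length() * 8 + " " + encodedBits(text, catCodes()));
    }
}
```

Written out by frequency, the sum is 4·2 (space) + 4·2 (T) + 3·3 (H) + 2·3 (A) + 2·4 (E) + 4 + 4 + 4 (C, I, N) = 51.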


Input

The input file will contain a list of text strings, one per line. The text strings will consist only of uppercase alphanumeric characters and underscores (which are used in place of spaces). The end of the input will be signalled by a line containing only the word END as the text string. This line should not be processed.

Output

For each text string in the input, output the length in bits of the 8-bit ASCII encoding, the length in bits of an optimal prefix-free variable-length encoding, and the compression ratio accurate to one decimal point.


Sample Input

AAAAABCD

THE_CAT_IN_THE_HAT

END

Sample Output

64 13 4.9

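144 51 2.8

Requirements:

  1. Describe the idea behind the algorithm and the steps of the solution.
  2. Work through the sample by hand.
  3. Provide the source program.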

2. Huffman coding

Code:

package com.company;

import java.util.HashMap;
import java.util.PriorityQueue;

public class HFM {
    // Min-heap of node weights; the weights alone are enough to compute the total length.
    public PriorityQueue<Integer> MQ;

    public Integer[] originData;
    public HashMap<Character, Integer> counterMap;

    /**
     * Initialize the data: count how often each character occurs to get the leaf weights.
     * @param chars the characters of the input text
     */
    public void initData(char[] chars) {
        counterMap = new HashMap<>();
        for (char c : chars) {
            counterMap.merge(c, 1, Integer::sum);
        }
        originData = counterMap.values().toArray(new Integer[0]);
    }

    /**
     * Build the Huffman tree implicitly by repeatedly merging the two smallest weights.
     * Every merge adds one bit to each character below the merged node, so the sum of
     * all merged weights is the total number of bits in the encoded message.
     */
    public void hfmTree() {
        MQ = new PriorityQueue<>();
        for (Integer weight : originData) {
            MQ.offer(weight);
        }

        int totalBits = 0;
        if (MQ.size() == 1) {
            // Only one distinct character: it still needs a 1-bit code per occurrence.
            totalBits = MQ.poll();
        }
        while (MQ.size() > 1) {
            // Pop the smallest weight
            int p0 = MQ.poll();
            // Pop the second-smallest weight
            int p1 = MQ.poll();
            int merged = p0 + p1;
            totalBits += merged;
            MQ.offer(merged);
        }
        System.out.println(totalBits);
    }

    public static void main(String[] args) {
        HFM hfm = new HFM();

        //char[] characters = new char[]{'A','A','A','A','A','B','C','D'};
        //THE_CAT_IN_THE_HAT
        char[] characters = new char[]{'T','H','E','_','C','A','T','_'
                ,'I','N','_','T','H','E','_','H','A','T'};
        hfm.initData(characters);

        // Prints 51 for THE_CAT_IN_THE_HAT
        hfm.hfmTree();
    }
}
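
The program above prints only the Huffman bit count for one hard-coded string. As a sketch of how the full input and output format of the problem could be handled, the version below reads lines from standard input until END and prints the ASCII length, the optimal length, and the ratio to one decimal place. The class name EntropySolver and its methods are assumptions made for illustration, and so is the convention that a line with a single distinct character costs one bit per character.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical solver class: one output line per input line, stopping at END.
class EntropySolver {

    // Total bits of an optimal prefix-free encoding of the text.
    static int huffmanBits(String text) {
        Map<Character, Integer> freq = new HashMap<>();
        for (char c : text.toCharArray()) {
            freq.merge(c, 1, Integer::sum);
        }
        if (freq.size() == 1) {
            // Assumption: a lone glyph still needs a 1-bit code per occurrence.
            return text.length();
        }
        PriorityQueue<Integer> heap = new PriorityQueue<>(freq.values());
        int bits = 0;
        while (heap.size() > 1) {
            // Merging the two lightest nodes adds one bit to every character below them.
            int merged = heap.poll() + heap.poll();
            bits += merged;
            heap.offer(merged);
        }
        return bits;
    }

    // Formats one output line: ASCII bits, Huffman bits, ratio with one decimal.
    static String report(String text) {
        int asciiBits = text.length() * 8;
        int huffBits = huffmanBits(text);
        return String.format(Locale.ROOT, "%d %d %.1f",
                asciiBits, huffBits, (double) asciiBits / huffBits);
    }

    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null && !line.equals("END")) {
            if (line.isEmpty()) {
                continue; // skip blank lines rather than divide by zero
            }
            System.out.println(report(line));
        }
    }
}
```

Working the first sample by hand: the weights are 5, 1, 1, 1. Merging 1+1 gives 2, merging 1+2 gives 3, merging 3+5 gives 8, and 2+3+8 = 13 bits, so the line reads 64 13 4.9.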
Reposted from blog.csdn.net/qq_37080133/article/details/103537840