Data Structures and Algorithms, Part 9 (Huffman Encoding/Decoding and Breadth-First Search)

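》Huffman encoding
The last binary-tree example in the previous part mentioned Huffman trees. I find the topic hard to grasp at first, so rather than reinvent the wheel I will point to a concise and clear write-up: http://blog.csdn.net/jinixin/article/details/52142352. That article implements it in C, but the concepts are the same.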

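Aside: after reading that article you should be able to answer the following. Why can Huffman coding compress data? What is an optimal prefix code? How is a Huffman tree constructed?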

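》Huffman decoding

That article does not cover decoding. As an example, take the string abcbcad, Huffman-encode it, and then decode it. We walk through the whole process by hand:
1) Count character frequencies and use them as weights
a-2 b-2 c-2 d-1
2) Build the Huffman tree. The number after each dash is the node's weight, and ? marks an internal node that holds no character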

            ?-7
         0/    \1
        ?-3    ?-4
     0/   \1  0/   \1
   d-1   a-2  b-2  c-2

3) Encode the string abcbcad using the Huffman tree

a    b    c     b     c     a     d
01   10   11    10    11    01    00

So the string abcbcad encodes to: 01101110110100

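4) Decode 01101110110100 using the same tree

Aside: decoding simply turns the bit string back into abcbcad, and the only tool we have is still the tree. Start at the root and read 0, which leads to node ?-3. That node holds no character, so continue from ?-3 and read 1, which leads to node a-2. It holds the character a, so the search ends and 01 decodes to a. That completes one decoding step, and the next step starts again from the root.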
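
The 14-bit length of the encoded string above can be cross-checked without building any codes: the total encoded length equals the sum of the weights of all internal nodes created during construction (3 + 4 + 7 = 14 here). The following is a minimal sketch of that check and is not part of the original code; the names `HuffmanCost` and `encodedBits` are made up for illustration.

```java
import java.util.PriorityQueue;

class HuffmanCost {
    // Returns the total number of bits a Huffman code needs for the given
    // symbol frequencies, by summing the weight of every merged (internal) node.
    static int encodedBits(int[] weights) {
        PriorityQueue<Integer> queue = new PriorityQueue<>();
        for (int w : weights) {
            queue.add(w);
        }
        int total = 0;
        while (queue.size() > 1) {
            int merged = queue.poll() + queue.poll(); // the two smallest weights
            total += merged;                          // weight of the new internal node
            queue.add(merged);
        }
        return total;
    }

    public static void main(String[] args) {
        // Frequencies of a, b, c, d in "abcbcad"
        System.out.println(encodedBits(new int[]{2, 2, 2, 1})); // prints 14
    }
}
```

With a single symbol the method returns 0, which reflects that a one-leaf tree assigns the empty code.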

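With that understood, let's look at how to implement Huffman encoding and decoding in Java.
If you have tried encoding and then decoding by hand, you will have noticed that the process breaks down as follows:
1) Build the Huffman tree
2) Encode with the tree (iterate over the string; the path from the root to each character's node is its code)
3) Decode with the tree (iterate over the bits; if the node reached holds no character, continue from that node with the next bit until a node holding a character is reached, output that character, and start over from the root)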
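
Step 3 can be sketched on its own, using the hand-built tree from the walkthrough rather than a tree produced by the construction algorithm. This is only an illustration under that assumption; the class `HandDecode` and its nested node type `N` are hypothetical names, and the sketch assumes the tree has at least two leaves and the bit string is well formed.

```java
class HandDecode {
    // Minimal node: leaves carry a character, internal nodes carry null.
    static class N {
        final Character key;
        final N left, right;
        N(Character key, N left, N right) {
            this.key = key;
            this.left = left;
            this.right = right;
        }
    }

    // The tree from the worked example: 0 = left branch, 1 = right branch.
    static N exampleTree() {
        N left = new N(null, new N('d', null, null), new N('a', null, null));
        N right = new N(null, new N('b', null, null), new N('c', null, null));
        return new N(null, left, right);
    }

    // Follows one branch per bit; on reaching a leaf, emits its character
    // and restarts from the root.
    static String decode(N root, String bits) {
        StringBuilder out = new StringBuilder();
        N current = root;
        for (int i = 0; i < bits.length(); i++) {
            current = (bits.charAt(i) == '0') ? current.left : current.right;
            if (current.key != null) {
                out.append(current.key);
                current = root;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode(exampleTree(), "01101110110100")); // prints abcbcad
    }
}
```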

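》Choosing data structures:

Analysis 1: We obviously need nodes and a tree. Before building the tree we must compute the weights, here the number of occurrences of each character, and each weight is stored in its node. To make counting easy, the counts are first collected in a map, with the character as key and its weight as value.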
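
The counting step in Analysis 1 can be sketched in a few lines. This is an illustration, not the article's code: the class name `FrequencyCount` is made up, and a `TreeMap` is used here (the article's code uses `HashMap`) purely so that the printed order is predictable.

```java
import java.util.Map;
import java.util.TreeMap;

class FrequencyCount {
    // Counts how often each character occurs; the count is the node weight.
    static Map<Character, Integer> count(String input) {
        Map<Character, Integer> weights = new TreeMap<>(); // sorted keys, for stable printing
        for (char c : input.toCharArray()) {
            weights.merge(c, 1, Integer::sum); // insert 1, or add 1 to the existing count
        }
        return weights;
    }

    public static void main(String[] args) {
        System.out.println(count("abcbcad")); // prints {a=2, b=2, c=2, d=1}
    }
}
```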

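Analysis 2: Each step of the construction must pick the two nodes with the smallest weights. For convenience we use the PriorityQueue covered earlier: every insertion triggers a siftUp that restores the heap order, so poll always returns the smallest element. That fits this scenario well, because whenever we insert the root of a newly built subtree the queue puts it in the right place. Sorting by hand at each step would also work.

Analysis 3: Elements of a priority queue must be orderable, so a plain Node is not enough. Node must also implement the Comparable interface so that nodes can be compared by weight.

Analysis 4: Once the tree is built, encoding a character means finding its path. Mimicking the hand process, we first locate the node for that character (a Huffman tree is not ordered by key, so this requires traversing the tree), then follow parent links up to the root. The path read in reverse is the code (a stack handles the reversal). Because we need parent links, Node stores its parent in addition to its left and right children.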

》Putting it together, the Java code is as follows:
1) The Node class

class Node implements Comparable<Node>{
    private Character key;  //the character; null for internal nodes
    private int weight;   //weight (number of occurrences)
    private Node leftChild,rightChild;
    private Node parent;

    public Node() {
    }

    public Node(Character key, int weight) {
        super();
        this.key = key;
        this.weight = weight;
    }
    public Character getKey() {
        return key;
    }
    public void setKey(Character key) {
        this.key = key;
    }
    public int getWeight() {
        return weight;
    }
    public void setWeight(int weight) {
        this.weight = weight;
    }
    public Node getLeftChild() {
        return leftChild;
    }
    public void setLeftChild(Node leftChild) {
        this.leftChild = leftChild;
    }
    public Node getRightChild() {
        return rightChild;
    }
    public void setRightChild(Node rightChild) {
        this.rightChild = rightChild;
    }
    public Node getParent() {
        return parent;
    }
    public void setParent(Node parent) {
        this.parent = parent;
    }

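    //Comparison used by PriorityQueue: order nodes by weight
    @Override
    public int compareTo(Node o) {
        //Integer.compare avoids the overflow that subtraction can cause with large weights
        return Integer.compare(this.weight, o.weight);
    }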

    //Override toString
    @Override
    public String toString() {
        // Print in the form key-weight
        return this.key+"-"+this.weight;
    }
}

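2) The HuffmanTree class

import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.PriorityQueue;
import java.util.Queue;
import java.util.Set;
import java.util.Stack;

class HuffmanTree{
    private Node root;
    private String input;   //input string
    private String code;   //encoded string

    public HuffmanTree() {
        super();
    }
    public Node getRoot() {
        return root;
    }
    public void setRoot(Node root) {
        this.root = root;
    }

    //Method outline; the bodies are given in sections 2.1 to 2.4:
    //2.1 build the tree from a string:      Node createHuffmanTree(String input)
    //2.2 look up the code of a character:   String getCodeByKey(char key)
    //2.3 encode the input string:           String code()
    //2.4 decode the encoded string:         String decode()
}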

2.1) Building the tree from a string

/**
     * Builds the Huffman tree.
     * @param input   the string to encode
     * @return  the root of the tree, or null for an empty string
     */
    public Node createHuffmanTree(String input){
        this.input = input;

        //Edge cases: an empty string, or a string of length 1
        int size = input.length();
        if(size==0){
            root = null;
            return root;
        }else if(size==1){
            root = new Node(input.charAt(0),1);
            return root;
        }

        //1. One pass over the input to compute the weights, stored in a map
        HashMap<Character, Integer> map = new HashMap<>();
        for(int i=0;i<input.length();i++){
            char key = input.charAt(i);
            Integer weight = map.get(key);
            if(weight==null){
                weight=1;
            }else{
                weight++;
            }
            map.put(key,weight);
        }
        //2. Iterate over the map and insert every entry into the queue
        PriorityQueue<Node> nodeQueue= new PriorityQueue<>();
        Set<Character> keySet = map.keySet();
        Iterator<Character> iterator = keySet.iterator();
        while(iterator.hasNext()){
            //read the entry
            char key = iterator.next();
            int weight = map.get(key);
            //create the node
            Node n = new Node(key,weight);
            //insert it into the queue
            nodeQueue.add(n);
        }
        //[s-1, z-1, v-2, c-2, a-3, n-2, o-2, f-3, x-2]
        System.out.println(nodeQueue);

        //3. Repeatedly take the two lowest-weight nodes and merge them into a subtree
        while(nodeQueue.size()>1){
            //take the two smallest
            Node left = nodeQueue.poll();
            Node right = nodeQueue.poll();
            //build a Huffman subtree
            Node childTree = new Node();
            childTree.setWeight(left.getWeight()+right.getWeight());
            childTree.setLeftChild(left);
            childTree.setRightChild(right);
            left.setParent(childTree);
            right.setParent(childTree);
            //insert the subtree's root back into the queue
            nodeQueue.add(childTree);
        }
        //The last node left in the queue is the root. This also covers an input such as
        //"aaa" with a single distinct character, where the loop above never runs.
        //Note: in that case the root is a leaf, every code is the empty string,
        //and decode() cannot recover the input.
        root = nodeQueue.poll();
        return root;
    }

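P.S. Note the line System.out.println(nodeQueue); the value in the comment above it is from one of my experiments. Why is the priority queue's content not printed in sorted order, as promised? PriorityQueue is backed by a binary heap, and printing it shows the heap's array layout; only poll returns the elements in ascending order. If this is unclear, see the explanation in my earlier part on stacks and queues.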
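
The difference between printing a PriorityQueue and polling it can be seen with plain integers. A minimal sketch follows; the class `QueueOrderDemo` and the helper `drain` are illustrative names, and the weights are arbitrary sample values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

class QueueOrderDemo {
    // Polls until the queue is empty and returns the elements in poll order.
    static List<Integer> drain(PriorityQueue<Integer> queue) {
        List<Integer> out = new ArrayList<>();
        while (!queue.isEmpty()) {
            out.add(queue.poll());
        }
        return out;
    }

    public static void main(String[] args) {
        PriorityQueue<Integer> queue = new PriorityQueue<>();
        for (int w : new int[]{3, 2, 2, 1, 1, 3, 2, 2, 2}) {
            queue.add(w);
        }
        System.out.println(queue);        // heap layout, not necessarily sorted
        System.out.println(drain(queue)); // prints [1, 1, 2, 2, 2, 2, 2, 3, 3]
    }
}
```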

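2.2) Looking up the code of a character

/**
     * Looks up the code for a key:
     * 1. find the node holding the key
     * 2. follow parent links upward to derive the path
     * 
     * A Huffman tree is not ordered by key, so binary-search-tree lookup does not
     * apply and the nodes must be visited one by one.
     * A queue is used to visit the nodes level by level (the root first, then its
     * children from left to right), which is breadth-first search.
     * If a node's key does not match, its children are queued and compared in turn.
     * @param key the character to look up
     * @return the code as a string of '0' and '1', or null if the key is not in the tree
     */
    public String getCodeByKey(char key){
        //search the tree for the node
        Node result = null;
        Queue<Node> queue = new LinkedList<>();
        if(root!=null){
            queue.add(root);
        }
        while(!queue.isEmpty()){
            Node current = queue.poll();
            if(current.getKey()!=null && key==current.getKey()){
                result = current;
                break;
            }else{
                if(current.getLeftChild()!=null){
                    queue.add(current.getLeftChild());
                }
                if(current.getRightChild()!=null){
                    queue.add(current.getRightChild());
                }
            }
        }
        if(result==null){
            return null;   //the key is not in the tree
        }
        //walk back up through the parent links
        Node current = result;
        Stack<String> stack = new Stack<>();
        while(current.getParent()!=null){
            //push 0 if the current node is its parent's left child, 1 if it is the right child
            if(current == current.getParent().getLeftChild()){
                stack.push("0");
            }else{
                stack.push("1");
            }
            //move up one level
            current = current.getParent();
        }

        //popping the stack yields the code from root to leaf
        StringBuilder sb = new StringBuilder();
        while(!stack.isEmpty()){
            sb.append(stack.pop());
        }
        return sb.toString();
    }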

2.3) Encoding the string

    //encode
    public String code(){
        StringBuilder sb = new StringBuilder();
        for(int i=0;i<input.length();i++){
            char key = input.charAt(i);
            String keyCode = getCodeByKey(key);
            System.out.println(key+" is encoded as: "+keyCode);
            sb.append(keyCode);
        }
        this.code = sb.toString();
        return code;
    }
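
Note that code() runs a fresh tree search for every character of the input. A common alternative, shown here as a sketch and not as the article's method, is to build a character-to-code table once with a depth-first walk and then encode by table lookup. The names `CodeTable`, `N`, `fill`, and `build` are hypothetical, and the tree is the hand-built one from the walkthrough.

```java
import java.util.HashMap;
import java.util.Map;

class CodeTable {
    // Minimal node: leaves carry a character, internal nodes carry null.
    static class N {
        final Character key;
        final N left, right;
        N(Character key, N left, N right) {
            this.key = key;
            this.left = left;
            this.right = right;
        }
    }

    // The tree from the worked example: 0 = left branch, 1 = right branch.
    static N exampleTree() {
        N left = new N(null, new N('d', null, null), new N('a', null, null));
        N right = new N(null, new N('b', null, null), new N('c', null, null));
        return new N(null, left, right);
    }

    // Depth-first walk: the path taken so far is the code prefix.
    static void fill(N node, String prefix, Map<Character, String> table) {
        if (node == null) {
            return;
        }
        if (node.key != null) {          // leaf: the prefix is the complete code
            table.put(node.key, prefix);
            return;
        }
        fill(node.left, prefix + "0", table);
        fill(node.right, prefix + "1", table);
    }

    static Map<Character, String> build(N root) {
        Map<Character, String> table = new HashMap<>();
        fill(root, "", table);
        return table;
    }

    // Assumes every character of the input occurs in the tree.
    static String encode(N root, String input) {
        Map<Character, String> table = build(root);
        StringBuilder sb = new StringBuilder();
        for (char c : input.toCharArray()) {
            sb.append(table.get(c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode(exampleTree(), "abcbcad")); // prints 01101110110100
    }
}
```

With the table, the tree is walked once and parent links are no longer needed for encoding.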

2.4) Decoding

    //decode
    public String decode(){
        StringBuilder sb = new StringBuilder();
        //put every bit of the encoded string into a queue
        Queue<Character> queue = new LinkedList<>();
        for(int i=0;i<code.length();i++){
            queue.add(code.charAt(i));
        }
        while(!queue.isEmpty()){
            char c = queue.poll();
            System.out.print("Read: ");
            //each character is decoded starting from the root
            Node current = root;
            while(current.getKey()==null){
                System.out.print(c);
                if(c=='0'){  //left
                    current = current.getLeftChild();
                }else{  //right
                    current = current.getRightChild();
                }
                //still on an internal node: read the next bit
                if(current.getKey()==null){
                    if(queue.isEmpty()){
                        throw new IllegalStateException("encoded string ends in the middle of a code");
                    }
                    c = queue.poll();
                }
            }
            System.out.println(" decoded as: "+current.getKey());
            sb.append(current.getKey());
        }
        return sb.toString();
    }

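2.5) Test results

String: abcbcad
Weights stored in the priority queue: [d-1, b-2, c-2, a-2]
Encoding result:
a is encoded as: 01
b is encoded as: 11
c is encoded as: 10
b is encoded as: 11
c is encoded as: 10
a is encoded as: 01
d is encoded as: 00
01111011100100
Decoding result:
Read: 01 decoded as: a
Read: 11 decoded as: b
Read: 10 decoded as: c
Read: 11 decoded as: b
Read: 10 decoded as: c
Read: 01 decoded as: a
Read: 00 decoded as: d
abcbcad

Note that b and c receive 11 and 10 here, the reverse of the hand-built example. Nodes with equal weights can be merged in either order, so several Huffman trees exist for the same input; both encodings are 14 bits long.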
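
Because the exact codes depend on how ties are broken, a more robust check than comparing bit strings is a round trip: decoding the encoded string must give back the input. Below is a compact, self-contained sketch of such a check. It is not the article's classes; `RoundTrip` and its members are illustrative names, it uses a code table instead of parent links, and it assumes the input contains at least two distinct characters.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

class RoundTrip {
    static class N {
        final Character key;   // null for internal nodes
        final int weight;
        final N left, right;
        N(Character key, int weight, N left, N right) {
            this.key = key;
            this.weight = weight;
            this.left = left;
            this.right = right;
        }
    }

    // Builds a Huffman tree by repeatedly merging the two lightest nodes.
    static N build(String input) {
        Map<Character, Integer> freq = new HashMap<>();
        for (char c : input.toCharArray()) {
            freq.merge(c, 1, Integer::sum);
        }
        PriorityQueue<N> queue = new PriorityQueue<>((x, y) -> Integer.compare(x.weight, y.weight));
        for (Map.Entry<Character, Integer> e : freq.entrySet()) {
            queue.add(new N(e.getKey(), e.getValue(), null, null));
        }
        while (queue.size() > 1) {
            N l = queue.poll();
            N r = queue.poll();
            queue.add(new N(null, l.weight + r.weight, l, r));
        }
        return queue.poll();
    }

    // Records the root-to-leaf path of every character.
    static void fill(N n, String prefix, Map<Character, String> table) {
        if (n == null) return;
        if (n.key != null) { table.put(n.key, prefix); return; }
        fill(n.left, prefix + "0", table);
        fill(n.right, prefix + "1", table);
    }

    static String encode(N root, String input) {
        Map<Character, String> table = new HashMap<>();
        fill(root, "", table);
        StringBuilder sb = new StringBuilder();
        for (char c : input.toCharArray()) sb.append(table.get(c));
        return sb.toString();
    }

    static String decode(N root, String bits) {
        StringBuilder out = new StringBuilder();
        N current = root;
        for (int i = 0; i < bits.length(); i++) {
            current = (bits.charAt(i) == '0') ? current.left : current.right;
            if (current.key != null) {   // leaf reached: emit and restart at the root
                out.append(current.key);
                current = root;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String input = "abcbcad";
        N root = build(input);
        String bits = encode(root, input);
        System.out.println(bits.length());                    // prints 14
        System.out.println(decode(root, bits).equals(input)); // prints true
    }
}
```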

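Summary: this completes Huffman encoding and decoding. The most involved step is looking up a character's code, which uses breadth-first search. With a queue, breadth-first search merges the tree's many branches into a single queue whose nodes are then processed one by one. The idea is very useful and will come up again in graph algorithms.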
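
The queue-based traversal described in the summary can be isolated into a small level-order walk. The sketch below runs it on the hand-built tree from the walkthrough; `LevelOrder`, `N`, and `bfs` are illustrative names, not part of the article's classes.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

class LevelOrder {
    static class N {
        final String label;
        final N left, right;
        N(String label, N left, N right) {
            this.label = label;
            this.left = left;
            this.right = right;
        }
    }

    // The tree from the worked example, labelled key-weight.
    static N exampleTree() {
        N left = new N("?-3", new N("d-1", null, null), new N("a-2", null, null));
        N right = new N("?-4", new N("b-2", null, null), new N("c-2", null, null));
        return new N("?-7", left, right);
    }

    // Breadth-first search: nodes wait in one queue and are visited level by level.
    static List<String> bfs(N root) {
        List<String> visited = new ArrayList<>();
        if (root == null) {
            return visited;
        }
        Queue<N> queue = new ArrayDeque<>();
        queue.add(root);
        while (!queue.isEmpty()) {
            N current = queue.poll();
            visited.add(current.label);
            if (current.left != null) queue.add(current.left);
            if (current.right != null) queue.add(current.right);
        }
        return visited;
    }

    public static void main(String[] args) {
        System.out.println(bfs(exampleTree())); // prints [?-7, ?-3, ?-4, d-1, a-2, b-2, c-2]
    }
}
```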
