Data structures in Java, binary trees (6): data compression with Huffman coding

Basic introduction:
Huffman coding is a variable-length coding (VLC) scheme built on the Huffman tree. It is widely used for compressing data files, and its compression rate is usually between 20% and 90%.

  • 1. Information processing in communications, method 1: fixed-length encoding
    Take the sentence "i like like like java do you like a java". It contains 40 characters including spaces, and each character has a corresponding ASCII code.
    These ASCII codes are then converted to binary. The total length of the resulting binary text is 359, so sending this sentence with fixed-length encoding requires a length of 359.
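As a rough illustration of fixed-length encoding, the sketch below renders every character of the sentence as exactly 8 binary digits. The class name `FixedLengthDemo` and the helper `toFixedLengthBits` are made up for this example. The separator is an assumption: with a single space between the 40 groups, the text is 40 × 8 + 39 = 359 characters long, which matches the length quoted above.

```java
import java.nio.charset.StandardCharsets;

class FixedLengthDemo {
    // Render each byte as exactly 8 binary digits, joined by the given separator.
    static String toFixedLengthBits(String text, String separator) {
        byte[] bytes = text.getBytes(StandardCharsets.US_ASCII);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) {
                sb.append(separator);
            }
            String bits = Integer.toBinaryString(bytes[i] & 0xFF);
            // Left-pad with zeros so every character takes the same width.
            for (int pad = bits.length(); pad < 8; pad++) {
                sb.append('0');
            }
            sb.append(bits);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String str = "i like like like java do you like a java";
        System.out.println(toFixedLengthBits(str, " ").length()); // 359
        System.out.println(toFixedLengthBits(str, "").length());  // 320
    }
}
```

Without separators the same sentence takes 320 bits, which is the figure to compare against when judging how much a variable-length code saves.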

  • 2. Information processing in communications, method 2: variable-length coding
    Taking the same sentence, we count how many times each character occurs:
    d:1, y:1, u:1, j:2, v:2, o:2, l:4, k:4, e:4, i:5, a:5, (space):9
    We assign codes by frequency: the more often a character occurs, the shorter its code. For example, the space occurs 9 times, so its code is 0.
    This gives:
    0=(space), 1=a, 10=i, 11=e, 100=k, 101=l, 110=o, 111=v, 1000=j, 1001=u, 1010=y, 1011=d

With the codes defined by these rules, the transmitted sentence becomes:
10010110100... (the rest is omitted)
This code has a problem: it is ambiguous. For example, a leading 1 can be decoded as a, or it can be read together with the next bit as 10, which is i. This conflict has to be resolved somehow.
(The code of one character must not be a prefix of the code of any other character. A code that meets this requirement is called a prefix code, and it never produces ambiguous matches. Huffman coding solves this problem.)
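The prefix requirement can be checked mechanically. The following is a minimal sketch, with the hypothetical names `PrefixCheck` and `isPrefixFree`, that tests whether any code in a table is a prefix of another. Applied to the frequency-ordered table from method 2, it reports the conflict described above.

```java
import java.util.Arrays;
import java.util.List;

class PrefixCheck {
    // Returns true when no code is a prefix of a different code in the list.
    static boolean isPrefixFree(List<String> codes) {
        for (int i = 0; i < codes.size(); i++) {
            for (int j = 0; j < codes.size(); j++) {
                if (i != j && codes.get(j).startsWith(codes.get(i))) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The table from method 2: "1" is a prefix of "10", "11", "100", ...
        List<String> naive = Arrays.asList("0", "1", "10", "11", "100", "101",
                "110", "111", "1000", "1001", "1010", "1011");
        System.out.println(isPrefixFree(naive)); // false

        // A small prefix code for three symbols.
        System.out.println(isPrefixFree(Arrays.asList("0", "10", "11"))); // true
    }
}
```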

3. Analysis of the Huffman coding principle:

1) Taking the same sentence again, we count how many times each character occurs:
d:1, y:1, u:1, j:2, v:2, o:2, l:4, k:4, e:4, i:5, a:5, (space):9

2) Next, we build a Huffman tree from these occurrence counts, using each count as the weight of its node.
(How to build a Huffman tree was explained in Chapter 5, so the construction is not walked through again here.)

The constructed Huffman tree is shown in the figure:
[Figure: the constructed Huffman tree]
3) Using this Huffman tree, assign a code to each character: a step to the left is 0 and a step to the right is 1. The codes are:
o:1000, u:10010, d:100110, y:100111, i:101, a:110, k:1110, e:1111, j:0000, v:0001, l:001, (space):01

4) With the Huffman codes above, the string is encoded as follows (lossless compression):
[Figure: the Huffman-encoded bit string]
This code is a prefix code: no character's code is a prefix of another character's code, so decoding is never ambiguous. It resolves the prefix conflict seen in method 2. The encoded length is 133.
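To see both properties at once, the sketch below applies the code table from step 3 to the sentence and then decodes the result again. The names `PrefixCodec`, `table`, `encode`, and `decode` are hypothetical, and the sketch works on characters rather than bytes for brevity. Because no code is a prefix of another, the decoder can emit a character as soon as the bits read so far match a code.

```java
import java.util.HashMap;
import java.util.Map;

class PrefixCodec {
    // Code table copied from step 3 above.
    static Map<Character, String> table() {
        Map<Character, String> t = new HashMap<>();
        t.put('o', "1000");   t.put('u', "10010");
        t.put('d', "100110"); t.put('y', "100111");
        t.put('i', "101");    t.put('a', "110");
        t.put('k', "1110");   t.put('e', "1111");
        t.put('j', "0000");   t.put('v', "0001");
        t.put('l', "001");    t.put(' ', "01");
        return t;
    }

    // Concatenate the code of every character.
    static String encode(String text, Map<Character, String> table) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toCharArray()) {
            sb.append(table.get(c));
        }
        return sb.toString();
    }

    // Read bits one at a time and emit a character whenever a full code is matched.
    static String decode(String bits, Map<Character, String> table) {
        Map<String, Character> reverse = new HashMap<>();
        for (Map.Entry<Character, String> e : table.entrySet()) {
            reverse.put(e.getValue(), e.getKey());
        }
        StringBuilder out = new StringBuilder();
        StringBuilder current = new StringBuilder();
        for (char bit : bits.toCharArray()) {
            current.append(bit);
            Character c = reverse.get(current.toString());
            if (c != null) {
                out.append(c);
                current.setLength(0);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String str = "i like like like java do you like a java";
        String bits = encode(str, table());
        System.out.println(bits.length());                     // 133
        System.out.println(decode(bits, table()).equals(str)); // true
    }
}
```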

Note:
Depending on how the nodes are sorted (for example, when several nodes in the tree have the same weight), the Huffman tree may come out differently, and so may the Huffman codes. The WPL (the weighted path length of the tree) is the same in every case, and it is always the minimum.
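The minimum WPL can be computed without looking at the tree shape at all, because the WPL of a Huffman tree equals the sum of the weights of all internal nodes created while merging. The sketch below uses that fact with a `java.util.PriorityQueue`; the names `WplDemo` and `minimalWpl` are made up for this example, and a priority queue is used here in place of repeated sorting.

```java
import java.util.PriorityQueue;

class WplDemo {
    // Repeatedly merge the two smallest weights; every merge adds one internal node.
    static int minimalWpl(int[] weights) {
        PriorityQueue<Integer> queue = new PriorityQueue<>();
        for (int w : weights) {
            queue.add(w);
        }
        int wpl = 0;
        while (queue.size() > 1) {
            int merged = queue.poll() + queue.poll();
            wpl += merged; // weight of the new internal node
            queue.add(merged);
        }
        return wpl;
    }

    public static void main(String[] args) {
        // Counts for d, y, u, j, v, o, l, k, e, i, a, (space)
        int[] counts = {1, 1, 1, 2, 2, 2, 4, 4, 4, 5, 5, 9};
        System.out.println(minimalWpl(counts)); // 133
    }
}
```

The result agrees with the encoded length of 133 given above, whichever way ties between equal weights are broken.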

Code example:
1) Create a Node class with data (the ASCII code of a character), weight (the number of times the character appears), left, and right.
2) Get the byte[] array for "i like like like java do you like a java".
3) Write a method that puts the Node objects used to build the Huffman tree into a List, in the form [Node[data=97, weight=5], Node[data=32, weight=9], ...].
4) Build the Huffman tree from the List.

To summarize, the steps are:
1. Convert the string to a byte array with getBytes().
2. Use getNodes(bytes) to turn the byte array into the leaf nodes of the Huffman tree.
3. Use createHuffmanTree(nodes) to build the Huffman tree from the leaf nodes.
4. Use getCodes(HuffmanRoot) to build the Huffman code table for the string from the tree.
5. Use zipHuff(bytes, huffmanCodes) to convert the string to a binary string with the code table, and then pack that binary string into a byte array.
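Step 5 is the least obvious one, so here is a minimal sketch of just the packing part, under the hypothetical names `BitPacking` and `pack`. It cuts a string of '0' and '1' characters into chunks of 8 and converts each chunk with `Integer.parseInt(chunk, 2)`; the cast to `byte` wraps values above 127 into negative numbers.

```java
import java.util.Arrays;

class BitPacking {
    // Pack a string of '0'/'1' characters into bytes, 8 bits per byte.
    static byte[] pack(String bits) {
        byte[] out = new byte[(bits.length() + 7) / 8]; // round up
        int index = 0;
        for (int i = 0; i < bits.length(); i += 8) {
            // The last chunk may hold fewer than 8 bits.
            String chunk = bits.substring(i, Math.min(i + 8, bits.length()));
            out[index++] = (byte) Integer.parseInt(chunk, 2);
        }
        return out;
    }

    public static void main(String[] args) {
        // "10101000" is 168, which wraps to -88 as a signed byte; "11" is 3.
        System.out.println(Arrays.toString(pack("1010100011"))); // [-88, 3]
    }
}
```

One consequence of this layout is that a short final chunk loses its leading zeros ("011" and "11" both become 3), so a decoder would need to know the original bit length to restore the last byte exactly.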

Reference code :

package Tree06;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HuffmanCode {

	public static void main(String[] args) {
		String str="i like like like java do you like a java";
		byte[] contentBytes=str.getBytes();
		System.out.println("Length before compression: "+contentBytes.length);
		byte[] huffmanZip = huffmanZip(contentBytes);
		System.out.println("Byte array after compression: "+Arrays.toString(huffmanZip));
	}
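	/**
	 * Top-level method: compresses the given bytes with Huffman coding.
	 */
	private static byte[] huffmanZip(byte[] bytes) {
		// Build the leaf nodes of the Huffman tree from the input bytes
		List<Node> nodes=getNodes(bytes);
		// Build the Huffman tree from the nodes; returns the root of the tree
		Node HuffmanRoot=createHuffmanTree(nodes);
		// Fill the huffmanCodes table with the Huffman code of each byte
		getCodes(HuffmanRoot);
		// Use the code table to convert the input into a packed byte array
		byte[] zipHuff = zipHuff(bytes,huffmanCodes);
		return zipHuff;
	}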
	
	
	/**
	 * Encodes the original bytes with the Huffman code table and packs the
	 * resulting bit string into a byte array, 8 bits per byte.
	 * @param bytes the bytes of the original string (contentBytes)
	 * @param huffmanCodes the generated Huffman code table
	 * @return the Huffman-encoded bytes
	 */
	public static byte[] zipHuff(byte[] bytes,Map<Byte,String>huffmanCodes) {
		// 1. Use huffmanCodes to turn bytes into a string of '0' and '1'
		StringBuilder stringBuilder=new StringBuilder();
		for(byte b:bytes) {
			stringBuilder.append(huffmanCodes.get(b));
		}
		// 2. Length of the result: one byte per 8 bits, rounded up
		int len=(stringBuilder.length()+7)/8;
		byte[] huffmanCodeBytes=new byte[len];
		int index=0;
		for(int i=0;i<stringBuilder.length();i+=8) {
			String strByte;
			if(i+8>stringBuilder.length()) {
				// The last chunk may be shorter than 8 bits
				strByte=stringBuilder.substring(i);
			}else {
				strByte=stringBuilder.substring(i,i+8);
			}
			// Convert strByte to a byte and store it in huffmanCodeBytes
			huffmanCodeBytes[index]=(byte)Integer.parseInt(strByte, 2);
			index++;
		}
		return huffmanCodeBytes;
	}
	
	
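	// Huffman code table:
	// 1. Stored as Map<Byte,String>: ASCII code of a character -> its Huffman code
	// 2. A StringBuilder accumulates the path to each leaf node
	static Map<Byte,String> huffmanCodes=new HashMap<Byte,String>();
	static StringBuilder stringBuilder=new StringBuilder();

	/**
	 * Overload of getCodes that starts from the root of the Huffman tree.
	 * @param root root of the Huffman tree
	 * @return the filled Huffman code table, or null if the tree is empty
	 */
	private static Map<Byte,String> getCodes(Node root){
		if(root==null) {
			return null;
		}
		if(root.data!=null) {
			// Only one distinct byte in the input: the root is itself a leaf, so give it the code "0"
			huffmanCodes.put(root.data, "0");
			return huffmanCodes;
		}
		// Process the left subtree of root
		getCodes(root.left,"0",stringBuilder);
		// Process the right subtree
		getCodes(root.right,"1",stringBuilder);
		return huffmanCodes;
	}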
	
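	/**
	 * Collects the Huffman codes of all leaves below node into huffmanCodes.
	 * @param node the current node
	 * @param code the step taken to reach node: "0" for left, "1" for right
	 * @param stringBuilder the path from the root to the parent of node
	 */
	private static void getCodes(Node node,String code,StringBuilder stringBuilder) {
		StringBuilder stringBuilder2=new StringBuilder(stringBuilder);
		stringBuilder2.append(code);
		if(node!=null) {
			if(node.data==null) {
				// Internal node: recurse into both children
				getCodes(node.left,"0",stringBuilder2);
				getCodes(node.right,"1",stringBuilder2);
			}else {
				// Leaf node: record its complete path as its code
				huffmanCodes.put(node.data, stringBuilder2.toString());
			}
		}
	}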
	// Pre-order traversal
	public static void frontShow(Node root) {
		if(root!=null) {
			root.frontShow();
		}else {
			System.out.println("The tree is empty");
		}
	}
	
	private static List<Node> getNodes(byte[] bytes){
		// Create the List
		ArrayList<Node> nodes=new ArrayList<Node>();
		// Count how many times each byte occurs
		Map<Byte,Integer> counts=new HashMap<>();
		for(byte b:bytes) {
			Integer count=counts.get(b);
			if(count==null) {
				counts.put(b,1);
			}else {
				counts.put(b, count+1);
			}
		}

		// Turn every key-value pair into a Node and add it to nodes
		for(Map.Entry<Byte,Integer> entry:counts.entrySet()) {
			nodes.add(new Node(entry.getKey(),entry.getValue()));
		}
		return nodes;
	}
	// Build the Huffman tree from the List
	private static Node createHuffmanTree(List<Node> nodes) {
		while(nodes.size()>1) {
			// Sort in ascending order of weight
			Collections.sort(nodes);
			// Take the two trees with the smallest weights
			Node leftNode=nodes.get(0);
			Node rightNode = nodes.get(1);
			// Create a new parent node (no data, only a weight)
			Node parent=new Node(null,leftNode.weight+rightNode.weight);
			parent.left=leftNode;
			parent.right=rightNode;

			// Remove the two trees that have been merged
			nodes.remove(leftNode);
			nodes.remove(rightNode);
			// Add the new tree
			nodes.add(parent);
		}

		// The single remaining node is the root
		return nodes.get(0);
	}
	
	
}

// Node of the Huffman tree
class Node implements Comparable<Node>{
	Byte data;// the ASCII code of the character; null for internal nodes
	int weight;// weight: the number of times the character occurs
	Node left;
	Node right;
	public Node(Byte data, int weight) {
		super();
		this.data = data;
		this.weight = weight;
	}
	@Override
	public int compareTo(Node o) {
		// Ascending order by weight; Integer.compare avoids overflow of a subtraction
		return Integer.compare(this.weight, o.weight);
	}
	@Override
	public String toString() {
		return "Node [data=" + data + ", weight=" + weight + "]";
	}

	// Pre-order traversal
	public void frontShow() {
		System.out.println(this);
		if(this.left!=null) {
			this.left.frontShow();
		}
		if(this.right!=null) {
			this.right.frontShow();
		}
	}
}


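Output result:
The program first prints the uncompressed length, 40, and then the compressed byte array. Because the encoded bit string has 133 bits, the array contains 17 bytes (133 / 8, rounded up).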

Origin blog.csdn.net/qq_45273552/article/details/109129499