Haskll Lesson：Huffman编码实现文本压缩

微信公众号：牛顿一号
欢迎关注我，一起学习，一起进步!

一、Huffman编码理论

学习资料

https://web.stanford.edu/class/archive/cs/cs106b/cs106b.1126/handouts/220%20Huffman%20Encoding.pdf

基本内容概述：huffman编码是一种变长编码方式，向我们熟悉的ascii码是定长编码的一种，冗余信息较多。变长编码的基本思路是高频率出现的内容做短编码，低频率内容长编码。

针对文本编码考虑话，要求编码方式必须是无损编码，而且压缩率要求能接近理论值。

变长编码通过前缀码实现，即每个编码都不是任意其他编码的前缀，这样保证解码的时候没有歧义。

二、哈夫曼编码的Haskell实现

思路分析：首先通过前面的分析知道了Haskell如何定义代数数据类型，这里我们通过递归的方式定义了一个Huffman树，Huffman树其实是一棵二叉树，树的每个节点代表一个字符。路径表征编码，从根节点开始，向左子树走一步编码成0，往右子树走一步编码成1，知道到达叶子结点。

每个节点包含两个参数：一是权重，二是字符。通过贪心算法，每次将权重最小的几点提取出来进行融合，生成一个新的节点，新的节点用字符@表示（可以替换成任意字符，只要文本中不会出现就行）。

算法流程：

初始化每个节点为只拥有根节点的二叉树，代码中称为Root Tree。
选择权重最小的两颗树进行融合，生成一棵新的Root Tree。
重复2直到只剩下一颗Root Tree。

2.1 字符串处理

定义一些针对huffman编码需要的字符串处理函数。

module Ascii
  (
    getAsciiWeight,
    getHuffmanWeight,
    splitWithChar,
    getAsciiWeightTuple
  ) where

import Data.Char (ord)

getCharCount :: Char -> [Char] -> Int
getCharCount c str = foldl (foo c) 0 str
  where foo = \c1 count c2 ->
                if c1 == c2
                then count+1
                else count

-- [Int] caintains weight for each character
-- 获取每个字符权重
getAsciiWeight :: [Char] -> [Int]
getAsciiWeight str = foldr foo [x | x<-[1..128], let x = 0] str
  where foo c cntList = take (ord c) cntList ++ [cntList!!(ord c)+1] ++ drop (ord c + 1) cntList

-- add ascii info compared with getAsciiWeight
-- 在一个tuple中保存字符及其权重
getAsciiWeightTuple :: [Char] -> [(Int, Char)]
getAsciiWeightTuple str = filter ((/= 0) . fst)
  $foldr foo [(x, y) | x<-[0..127], y<-['a'], let x = 0] str
    where foo c cntList = take (ord c) cntList ++ [(fst (cntList!!(ord c)) +1, c)] ++ drop (ord c + 1) cntList

-- change weight into huffman mode
-- 将（字符，权重）转换成权重+字符构成的字符串，譬如“2a 234b”的形式
showHuffPattern :: [(Int, Char)] -> [Char]
showHuffPattern weightPairs = foldl foo "" weightPairs
  where foo str pair = if fst pair /= 0
                       then str ++ show (fst pair) ++ [snd pair] ++ " "
                       else str

-- 用来根据任意字符切割字符串
splitWithChar :: Char -> String -> [String]
splitWithChar c str = if str /= ""
                      then [splitedStr] ++ splitWithChar c (drop (length splitedStr + 1) str)
                      else []
  where splitedStr = takeWhile (/=c) str

-- "abcd" -> "1a 1b 1c 1d "
getHuffmanWeight :: [Char] -> [Char]
getHuffmanWeight = showHuffPattern . getAsciiWeightTuple

2.2 定义Huffman二叉树

module BTree
  (
    BTree(..),
    getNodeWeight,
    getNodeChar,
    getNodePath,
    walkHuffmanTree
  ) where

data BTree = Node {
  -- string info for huffman code
  -- 三项分别表示权重、字符和路径（即编码）
  nodeInfo :: (Int, Char, String),
  leftTree :: BTree,
  rightTree :: BTree
                  } | Empty deriving (Show, Eq)

-- 获取节点权重
getNodeWeight :: BTree -> Int
getNodeWeight tree = case nodeInfo tree of
                       (weight, _, _) -> weight

-- 获取节点字符
getNodeChar :: BTree -> Char
getNodeChar tree = case nodeInfo tree of
                     (_, ch, _) -> ch

-- 获取节点路径（编码）
getNodePath :: BTree -> String
getNodePath tree = case nodeInfo tree of
                     (_, _, path) -> path

-- 获取最左叶子节点
walkOnce :: BTree -> [BTree]
walkOnce Empty = []
walkOnce (Node c leftTree rightTree) =
  (Node c leftTree rightTree) : walkOnce leftTree

-- 获取所有叶子节点
walkLeaf :: BTree -> [(Int, Char, String)]
walkLeaf (Node c Empty Empty) = [c]
walkLeaf (Node c leftTree rightTree) =
  (walkLeaf leftTree) ++ (walkLeaf rightTree)

-- 遍历左右叶子节点，同时保存每个叶子节点的路径（即编码）
-- key: 生成了每个字符对应的huffman编码
walkHuffmanTree :: String -> BTree -> [(Char, String)]
walkHuffmanTree curPath Empty= []
walkHuffmanTree curPath (Node (_, ch, _) Empty Empty) = [(ch, curPath)]
walkHuffmanTree curPath (Node info leftTree rightTree)=
  (walkHuffmanTree (curPath++"0") leftTree) ++ (walkHuffmanTree (curPath++"1") rightTree)

-- 获取二叉树深度
getTreeDepth :: BTree -> Int
getTreeDepth Empty = 0
getTreeDepth (Node c leftTree rightTree) =
  max (getTreeDepth leftTree) (getTreeDepth rightTree) + 1
    where max a b = if a >= b
                       then a
                       else b

2.3 主程序

负责IO，进行字符串格式处理，并且根据节点权重进行融合产生一个huffman编码所需的二叉树。

import Data.Char (digitToInt, ord)
import Ascii
import BTree

type Tree = BTree

-- 根据节点权重对Root Tree进行快速排序
sortRootTree :: [Tree] -> [Tree]
sortRootTree [] = []
sortRootTree (x:xs) = sortRootTree [x1 | x1<-xs, getNodeWeight x1 < getNodeWeight x] ++ [x] ++ sortRootTree [x2 | x2<-xs, getNodeWeight x2 > getNodeWeight x]

-- 融合当前权重最小的两个节点
resort :: Tree -> [Tree] -> [Tree]
resort newTree oldTrees = [x1 | x1<-oldTrees, getNodeWeight x1 < getNodeWeight newTree] ++ [newTree] ++ [x2 | x2<-oldTrees, getNodeWeight x2 >= getNodeWeight newTree]

-- generate huffman tree from sorted Root Trees
-- 生成Huffman树
genHuffmanTree :: [Tree] -> Tree
genHuffmanTree (x1:x2:xs) = genHuffmanTree (resort (Node ((getNodeWeight x1 + getNodeWeight x2), '@', "") x1 x2) xs)
genHuffmanTree a = a!!0

-- info is depicited as "1a" "2b" and so on
-- 根据字符统计信息创建相应节点
createNodes :: [(Int, Char)] -> [Tree]
createNodes ((weight, ch):nodesInfo) = (Node (weight, ch, "") Empty Empty) : createNodes nodesInfo
createNodes [] = []

{-compress text file-}
-- main
main = do
  content <- readFile "usb.c"
  --print (getAsciiWeightTuple content)
  --print ((genHuffmanTree . createNodes . getAsciiWeightTuple) content)
  writeFile "result.txt" $show ((walkHuffmanTree "" . genHuffmanTree . sortRootTree . createNodes . getAsciiWeightTuple) content)

三、运行结果与分析

我从内核代码中找了一份usb.c，并对其进行huffman编码。结果如下

生成的编码表

[('i',"0000"),('s',"0001"),('r',"0010"),('>',"0011000"),(':',"00110010"),('G',"001100110"),('Z',"001100111000"),('K',"001100111001"),('3',"001100111010"),('\\',"0011001110110"),('J',"0011001110111"),('2',"00110011110"),('+',"00110011111"),('R',"00110100"),('M',"00110101"),('S',"00110110"),('|',"001101110"),('X',"001101111"),('O',"00111000"),('Y',"0011100100"),('\'',"0011100101"),('!',"001110011"),('D',"00111010"),('T',"00111011"),(']',"00111100000"),('[',"00111100001"),('9',"00111100010"),('5',"0011110001100"),('~',"00111100011010"),('6',"00111100011011"),('W',"001111000111"),('@',"001111001"),('U',"00111101"),(',',"0011111"),(' ',"010"),('\t',"01100"),('c',"01101"),('t',"0111"),('p',"100000"),('k',"10000100"),('C',"100001010"),('F',"1000010110"),('1',"1000010111"),('z',"1000011000"),('<',"1000011001"),('&',"100001101"),('N',"10000111"),('a',"10001"),('_',"10010"),('d',"10011"),('o',"10100"),('v',"101010"),('w',"10101100"),('0',"101011010"),('j',"10101101100"),('H',"10101101101"),('"',"1010110111"),('L',"10101110"),('/',"10101111"),('h',"101100"),('*',"101101"),('\n',"10111"),('E',"11000000"),('=',"11000001"),(';',"1100001"),('-',"1100010"),('}',"110001100"),('{',"110001101"),('I',"110001110"),('#',"1100011110"),('x',"1100011111"),('g',"1100100"),('y',"11001010"),('.',"11001011"),('l',"110011"),('u',"11010"),('n',"11011"),('e',"1110"),('V',"11110000000"),('%',"11110000001"),('?',"111100000100"),('4',"111100000101"),('q',"11110000011"),('P',"111100001"),('B',"111100010"),('A',"111100011"),('m',"1111001"),(')',"1111010"),('(',"1111011"),('f',"111110"),('b',"111111")]

二进制编码

0000000 f5 6d 57 ab 4c 50 1d 61 fd c2 bf de 16 ba fa 85
0000010 ff b4 ed 6a bb 5a bd 43 bd 14 aa 04 53 04 93 c6
0000020 55 07 7b a1 b8 0b 55 71 ce a8 d0 79 c4 23 1e 75
0000030 b5 7a 87 7a 29 54 09 a6 08 26 8d 8b b9 97 26 ee
0000040 7d a8 01 ca df cd ae d0 79 c4 23 1e 8d 98 57 ab
0000050 56 e8 5d ad de a1 5e 0a 55 82 29 82 49 e3 7a bc
0000060 33 5d 44 31 8b b3 42 e7 11 8f 78 d4 d5 ea 1d ea
0000070 a5 50 25 98 22 98 34 2e 66 74 93 42 53 7a 38 2d
0000080 b6 27 78 a3 d0 79 c4 23 1e 75 b5 7a 87 7a 29 54
0000090 09 a6 08 26 8d 8b eb 1c 14 da 0c 37 99 15 3a 8f
00000a0 78 c4 23 bd f7 d5 88 97 ed 91 22 d9 06 fc ac 2f
00000b0 ba 57 57 ab 77 a8 97 42 95 60 8a 60 d2 b8 58 e2
00000c0 ce 53 e2 5a 7b 8e 81 98 57 ab 56 ad ae 56 ef 50
00000d0 2f 85 2a c1 14 c1 a4 71 71 c5 0a b2 1e 51 a9 d9
00000e0 37 cf 62 5e ad 5a b5 46 cc ab 55 eb 41 bb 5a bd
00000f0 43 bd 14 aa 04 53 04 93 c6 c5 c9 64 92 29 22 cc
0000100 0a 55 9e c1 72 d8 13 be c6 6d 3b 2d e6 d5 aa 55
0000110 ab ab 95 24 bd 17 fe 29 5f 05 f6 12 64 3d a3 db
0000120 06 f6 44 b6 89 7b 72 a8 7f 4a 8f 73 7c 8a b9 4f
0000130 8b 05 6c e3 47 af ae 56 ef 50 2f 85 2a c1 14 c1
0000140 a4 71 31 a3 9b 88 39 a9 c4 66 54 db 48 fe 8c 5b
0000150 cc ab 55 8b 79 23 e6 d5 aa c5 5c 5d 6d 57 2b 96
0000160 9b 2f e2 65 ad 9d 8f e4 cd 89 d9 c3 ba 8a 79 2a
0000170 7b 9e 9a ee 6a c5 f6 90 8b fd a8 0e ec 6d bc d1
0000180 b8 7c db e1 83 23 26 31 7b 58 37 62 de 74 ad ae
0000190 b6 ab 55 38 07 f7 00 67 71 37 80 02 d4 2e ae 68
00001a0 7d c5 79 9e 52 54 26 a8 8e 14 5d 71 9e f9 a2 e8
00001b0 cd 91 bc 01 14 60 57 ab da 2c 74 45 d9 32 cf 67
...

比特位顺序为低位在前，高位在后。

例如0xf5，对应b10101111，查表可知正好对应一个’/’。
后面跟着0x6d -> b10110110，查表可知其中的b101101对应一个’*’。
接下来0x57 -> b11101010和前面的最后两位拼起来之后得到b1011101010，查表可知b10111对应’\n’，b010对应’ ‘。
紧接着0xab -> b11010101，和前面的b10拼起来就是b1011010101，解出b101101->’*’，b010->’ ‘。
接着是0x4c -> b00110010，得到b100110010，解出b10011->’d’，b0010->’r’

解码下来就是

/*
 * dr

对应了usb.c中开头的注释，顺藤摸瓜，按照这个顺序查表即可对usb.c实现完整的解码。