Implementing a Huffman File Compression and Decompression Codec in C++

Foreword

Huffman coding is a greedy algorithm that encodes characters as binary strings. It is widely used and is one of the most intuitive approaches to file compression. This article describes how to implement file compression and decompression with Huffman coding, and gives the code.

Huffman coding concept

A Huffman tree, also known as an optimal binary tree, is the binary tree with the minimum weighted path length. The encoding constructed from a Huffman tree is called Huffman coding.

Huffman coding is an encoding derived from a Huffman tree, and its codes normally consist of the characters "0" and "1". Once the Huffman tree exists, producing the codes is simple: traverse the tree downward from the root, appending "0" to a string when the next node is a left child and "1" when it is a right child. The traversal stops when the current node is a leaf, and the string accumulated so far is the code for the character stored in that leaf.
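The traversal rule can be tried on a tiny hand-built tree before looking at the full implementation. The following is only a sketch: DemoNode, makeLeaf, makeParent, collectCodes and demoCodes are hypothetical names introduced for illustration and are not part of the article's code.

```cpp
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical minimal node type for illustration (not the article's Node).
struct DemoNode {
    char symbol;                      // meaningful only for leaves
    std::unique_ptr<DemoNode> left;   // edge labelled '0'
    std::unique_ptr<DemoNode> right;  // edge labelled '1'
};

std::unique_ptr<DemoNode> makeLeaf(char symbol) {
    auto n = std::make_unique<DemoNode>();
    n->symbol = symbol;
    return n;
}

std::unique_ptr<DemoNode> makeParent(std::unique_ptr<DemoNode> l,
                                     std::unique_ptr<DemoNode> r) {
    auto n = std::make_unique<DemoNode>();
    n->left = std::move(l);
    n->right = std::move(r);
    return n;
}

// Walk down from the root, appending '0' for a left step and '1' for a
// right step; the path accumulated at a leaf is that leaf's code.
void collectCodes(const DemoNode* node, const std::string& path,
                  std::map<char, std::string>& codes) {
    if (!node) return;
    if (!node->left && !node->right) {
        codes[node->symbol] = path;
        return;
    }
    collectCodes(node->left.get(), path + '0', codes);
    collectCodes(node->right.get(), path + '1', codes);
}

// Builds the tree  root -> (0: inner -> (0: 'c', 1: 'b'), 1: 'a')
// and returns the codes obtained by traversing it.
std::map<char, std::string> demoCodes() {
    auto root = makeParent(makeParent(makeLeaf('c'), makeLeaf('b')),
                           makeLeaf('a'));
    std::map<char, std::string> codes;
    collectCodes(root.get(), "", codes);
    return codes;
}
```

For this three-leaf tree, demoCodes() maps 'a' to "1", 'b' to "01" and 'c' to "00".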

Huffman tree implementation

The tree is built following the greedy idea: characters that occur more frequently get shorter codes, and characters that occur less frequently get longer codes. The following example walks through the construction of a Huffman tree. Each row of the table below lists a character and its frequency of occurrence, which is all the information needed to build the tree.

Character	Frequency	Code	Total bits
a	500	1	500
b	250	01	500
c	120	001	360
d	60	0001	240
e	30	00001	150
f	20	00000	100

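Treat each character as a node and put all nodes, with their frequencies, into a priority queue. Each time, take the two nodes with the lowest frequencies, a and b, from the queue (the one with the lowest frequency becomes the left subtree), then create a new node R whose frequency is the sum of the two nodes' frequencies, and make R the parent of a and b. Finally, put R back into the priority queue. Repeat this process until only one element remains in the queue; that element is the root of the Huffman tree.

From the analysis above, Huffman coding needs a total of 500 + 500 + 360 + 240 + 150 + 100 = 1850 bits. If the same example were compressed with a fixed-length code, the implementation would be simpler: 6 characters need 3 bits each, and decompression just reads 3 bits at a time and translates them into the corresponding character, for example 000, 001, 010, 011, 100, 101 for a, b, c, d, e, f. The total would then be (500 + 250 + 120 + 60 + 30 + 20) * 3 = 2940 bits. The comparison is clear: Huffman coding needs far fewer bits than the fixed-length code, and the compression ratio here is 1850 / 2940 ≈ 63%. The compression ratio of Huffman coding is typically between 20% and 90%.

The code below is a simple implementation of the Huffman tree built on the standard library priority queue std::priority_queue. The constructor takes afMap as its argument, huffmanCode is the public method that produces the codes, and the resulting Huffman codes are written into codeMap. This is the core code for building the Huffman tree. To make debugging easier I also implemented a function that prints the structure of the binary tree; I will not paste that code here, and anyone interested can download it from the GitHub repository given at the end of the article.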
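The two totals can be checked with a few lines of code. This is a sketch rather than part of the article's implementation, and huffmanTotalBits and fixedLengthTotalBits are hypothetical helper names. It relies on the fact that each merge of two nodes adds one bit to every character beneath the new node, so the total encoded length equals the sum of the weights of all merged nodes.

```cpp
#include <functional>
#include <queue>
#include <vector>

// Total encoded length, in bits, of a Huffman code for the given frequencies.
// Every merge of the two smallest weights lengthens the code of each symbol
// under the new node by one bit, so the total is the sum of merged weights.
long huffmanTotalBits(const std::vector<long>& freqs) {
    std::priority_queue<long, std::vector<long>, std::greater<long>>
        pq(freqs.begin(), freqs.end());   // min-heap on frequency
    long total = 0;
    while (pq.size() > 1) {
        long a = pq.top(); pq.pop();
        long b = pq.top(); pq.pop();
        total += a + b;
        pq.push(a + b);
    }
    return total;
}

// Total length when every symbol uses the same number of bits.
long fixedLengthTotalBits(const std::vector<long>& freqs, int bitsPerSymbol) {
    long sum = 0;
    for (long f : freqs) sum += f;
    return sum * bitsPerSymbol;
}
```

With the frequencies from the table, {500, 250, 120, 60, 30, 20}, huffmanTotalBits returns 1850 and fixedLengthTotalBits with 3 bits per symbol returns 2940.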

#include <map>
#include <queue>
#include <string>

using uchar = unsigned char;

struct Node {
    uchar c;        
    int freq;
    Node *left;
    Node *right;
    Node(uchar _c, int f, Node *l = nullptr, Node *r = nullptr)
        : c(_c), freq(f), left(l), right(r) {}
    // Reversed on purpose: std::priority_queue is a max-heap by default,
    // so comparing with > puts the lowest frequency on top.
    bool operator<(const Node &node) const {
        return freq > node.freq;
    }
};

class huffTree {
public:
    huffTree(const std::map<uchar, int>& afMap) {
        for (auto i : afMap) {
            Node n(i.first, i.second);
            q.push(n);
        }
        _makehuffTree();
    }
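    ~huffTree() {
        if (q.empty()) return;          // nothing was built for an empty map
        Node node = q.top();            // the root lives in the queue; its children are heap-allocated
        _deleteTree(node.left);
        _deleteTree(node.right);
    }
    void huffmanCode(std::map<uchar, std::string>& codeMap) {
        if (q.empty()) return;          // no characters, no codes
        Node node = q.top(); 
        std::string prefix;
        _huffmanCode(&node, prefix, codeMap);
    }
    // Returns a copy of the root node; the decoder calls this to walk the tree.
    // The queue must not be empty.
    Node getHuffTree() const {
        return q.top();
    }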
private:    
    static bool _isLeaf(Node* n) {
        return n->left == nullptr && n->right == nullptr;
    }
    void _deleteTree(Node* n) {
        if (!n) return ;
        _deleteTree(n->left);
        _deleteTree(n->right);
        delete n;
    }
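    void _makehuffTree() {
        while (q.size() > 1) {          // also terminates when the queue is empty
            Node *left = new Node(q.top()); q.pop();    // lowest frequency becomes the left child
            Node *right = new Node(q.top()); q.pop();
            Node node('R', left->freq + right->freq, left, right);  // 'R' is a placeholder for internal nodes
            q.push(node);
        }
    }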
    void _huffmanCode(Node *root, std::string& prefix, 
                      std::map<uchar, std::string>& codeMap) {
        std::string tmp = prefix;
        if (root->left != nullptr) {
            prefix += '0';
            if (_isLeaf(root->left)) {
                codeMap[root->left->c] = prefix;
            } else {
                _huffmanCode(root->left, prefix, codeMap);
            }
        }
        if (root->right != nullptr) {
            prefix = tmp;
            prefix += '1';
            if (_isLeaf(root->right)) {
                codeMap[root->right->c] = prefix;
            } else {
                 _huffmanCode(root->right, prefix, codeMap);
            }
        }
    }
private:
    std::priority_queue<Node> q;
};

File compression implementation

First, here is the common header shared by file compression and by the file decompression described later:

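#include <cstdio>
#include <fstream>
#include <map>
#include <string>
#include <utility>

// Read the bit at position index (0 is the most significant bit): false if the bit is 0, true otherwise
#define GET_BYTE(vbyte, index) (((vbyte) & (1 << ((index) ^ 7))) != 0)
// Set the bit at position index to 1
#define SET_BYTE(vbyte, index) ((vbyte) |= (1 << ((index) ^ 7)))
// Clear the bit at position index to 0
#define CLR_BYTE(vbyte, index) ((vbyte) &= (~(1 << ((index) ^ 7))))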

using uchar = unsigned char;

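struct fileHead {
    char flag[4];                // magic flag at the start of a compressed file ("even" in this implementation)
    uchar alphaVariety;            // number of distinct characters
    uchar lastValidBit;            // number of valid bits in the last byte
    char unused[10];            // reserved space
};                                // this struct occupies 16 bytes in total

struct alphaFreq {
    uchar alpha;                // the character; uchar because the file may contain Chinese text
    int freq;                    // how often the character occurs
    alphaFreq() : alpha(0), freq(0) {}
    alphaFreq(const std::pair<const uchar, int>& x) 
      : alpha(x.first), freq(x.second) {}    
};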
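The expression (index) ^ 7 in the macros above equals 7 - index for an index from 0 to 7, so index 0 addresses the most significant bit and codes are packed from the high end of each byte. The sketch below shows the same mapping with small functions instead of macros; getBit, setBit and packBits are hypothetical names used only for this illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Same bit mapping as the macros: index 0 is the most significant bit,
// because index ^ 7 equals 7 - index for index in [0, 7].
inline bool getBit(std::uint8_t byte, int index) {
    return (byte & (1u << (index ^ 7))) != 0;
}

inline void setBit(std::uint8_t& byte, int index) {
    byte |= static_cast<std::uint8_t>(1u << (index ^ 7));
}

// Pack a string of '0'/'1' characters into bytes, first character in the
// highest bit; unused bits of the last byte stay 0.
std::vector<std::uint8_t> packBits(const std::string& bits) {
    std::vector<std::uint8_t> out((bits.size() + 7) / 8, 0);
    for (std::size_t i = 0; i < bits.size(); ++i) {
        if (bits[i] == '1') {
            setBit(out[i / 8], static_cast<int>(i % 8));
        }
    }
    return out;
}
```

For example, packBits("101000001") yields the two bytes 0xA0 and 0x80: the first eight bits fill the first byte from the top, and the ninth bit lands in the highest bit of the second byte.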

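The file compression code follows. The process is fairly simple and not hard to understand. First, read the file and count how many times each character occurs; the implementation accumulates the counts in a std::map keyed by character. Next, use the counts in _afMap to build the Huffman tree and obtain the Huffman code of each character, _codeMap. Finally, write the data to the compressed file. This starts with the file header, that is, the contents of the fileHead struct, which decompression needs for format validation. Then the characters and frequencies in _afMap are written to the file one after another; decompression uses them to rebuild the Huffman tree for decoding. After that, each character of the source file is read in turn and its Huffman code is written to the file. That completes the compression. The code is not very difficult, so it carries few comments.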

class huffEncode {
public:
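    bool encode(const char* srcFilename, const char* destFilename) {
        if (!_getAlphaFreq(srcFilename)) return false;
        // alphaVariety is a single byte and a one-character tree yields no codes,
        // so only inputs with 2 to 255 distinct byte values are handled here.
        if (_afMap.size() < 2 || _afMap.size() > 255) {
            printf("unsupported input! filename: %s\n", srcFilename);
            return false;
        }
        huffTree htree(_afMap);
        htree.huffmanCode(_codeMap);
        return _encode(srcFilename, destFilename);
    }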
private:
    int _getLastValidBit() {
        int sum = 0;
        for (auto it : _codeMap) {
            sum += it.second.size() * _afMap.at(it.first);
            sum &= 0xFF;
        }
        sum &= 0x7;
        return sum == 0 ? 8 : sum;
    }
    bool _getAlphaFreq(const char* filename) {
        uchar ch;
        std::ifstream is(filename, std::ios::binary);
        if (!is.is_open()) {
            printf("read file failed! filename: %s", filename);
            return false;
        }
        is.read((char*)&ch, sizeof(uchar));
        while (!is.eof()) {
            _afMap[ch]++;
            is.read((char*)&ch, sizeof(uchar));
        }
        is.close();
        return true;
    }
    bool _encode(const char* srcFilename, const char* destFilename) {
        uchar ch;
        uchar value = 0;    // bit buffer; zero-initialised so the padding bits of the last byte are defined
        int bitIndex = 0;
        fileHead filehead = {'e', 'v', 'e', 'n'};
        filehead.alphaVariety = (uchar) _afMap.size();
        filehead.lastValidBit = _getLastValidBit();

        std::ifstream is(srcFilename, std::ios::binary);
        if (!is.is_open()) {
            printf("read file failed! filename: %s", srcFilename);
            return false;
        }
        std::ofstream io(destFilename, std::ios::binary);
        if (!io.is_open()) {
            printf("create file failed! filename: %s\n", destFilename);
            return false;
        }

        io.write((char*)&filehead, sizeof(fileHead));
        for (auto i : _afMap) {
            alphaFreq af(i);
            io.write((char*)&af, sizeof(alphaFreq));
        }

        is.read((char*)&ch, sizeof(uchar));
        while (!is.eof()) {
            std::string code = _codeMap.at(ch);
            for (auto c : code) {
                if ('0' == c) {
                    CLR_BYTE(value, bitIndex);
                } else {
                    SET_BYTE(value, bitIndex);
                }
                ++bitIndex;
                if (bitIndex >= 8) {
                    bitIndex = 0;
                    io.write((char*)&value, sizeof(uchar));
                }
            } 
            is.read((char*)&ch, sizeof(uchar));
        }

        if (bitIndex) {
            io.write((char*)&value, sizeof(uchar));
        }
        is.close();
        io.close();
        return true;
    }
private:
    std::map<uchar, int> _afMap;
    std::map<uchar, std::string> _codeMap;
};
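The lastValidBit value written into the header is the total number of encoded bits modulo 8, with 0 mapped to 8 because in that case the last byte is completely full. A minimal standalone sketch of that calculation is shown here; lastValidBits is a hypothetical free function, not the member _getLastValidBit from the class above, and it uses long counters as an assumption to keep the arithmetic simple.

```cpp
#include <map>
#include <string>

// Number of meaningful bits in the final byte of the packed stream:
// total bits modulo 8, with 0 mapped to 8 (the last byte is full).
int lastValidBits(const std::map<unsigned char, long>& freq,
                  const std::map<unsigned char, std::string>& codes) {
    long totalBits = 0;
    for (const auto& kv : codes) {
        // code length times the number of occurrences of that character
        totalBits += static_cast<long>(kv.second.size()) * freq.at(kv.first);
    }
    int rem = static_cast<int>(totalBits % 8);
    return rem == 0 ? 8 : rem;
}
```

For the example table earlier in the article the total is 1850 bits, and 1850 mod 8 is 2, so only the first 2 bits of the final byte carry data.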

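File decompression implementation

File decompression is simply the decoding side of Huffman coding. It is slightly more complex than compression, but it comes down to translating the encoded file back into the original characters according to the Huffman code and writing those characters to a file. In more detail: first read the file header and check that the file is in the compressed format described above (here, that the four characters of flag are "even"); if not, return an error. Then, based on the header field alphaVariety (the number of distinct characters), read the characters and their frequencies one by one and store them in _afMap. Build the Huffman tree from it to obtain the Huffman code of each character, _codeMap, and iterate over _codeMap to build the decoder _decodeMap, keyed by code, so that a character can later be looked up by its code.

Next, read the rest of the compressed file one byte, that is 8 bits, at a time. Get the root of the Huffman tree and point a tree node pointer pNode at it. Then read the bits one by one: when a bit is 0 the pointer moves to the left subtree, when it is 1 the pointer moves to the right subtree, and the bit is appended to the std::string code, until a leaf node is reached. Using code as the key, look up the corresponding character in _decodeMap and write it to the new file. Then clear code, point pNode back at the root, and repeat until the whole file has been read. Note that the last byte of the file is handled differently from this description: it needs special treatment based on lastValidBit, the number of valid bits in the last byte, stored in the file header.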

class huffDecode {
public:
    huffDecode() : _fileHead(nullptr), _htree(nullptr) {
        _fileHead = new fileHead();
    }
    ~huffDecode() {
        delete _fileHead;   // deleting a null pointer is a no-op, so no check is needed
        delete _htree;
    }
private:
    static bool _isLeaf(Node* n) {
        return n->left == nullptr && n->right == nullptr;
    }
    long _getFileSize(const char* strFileName) {
        std::ifstream in(strFileName);
        if (!in.is_open()) return 0;

        in.seekg(0, std::ios_base::end);
        std::streampos sp = in.tellg();
        in.close();
        return sp;
    }
    bool _getAlphaFreq(const char* filename) {
        std::ifstream is(filename, std::ios::binary);
        if (!is) {
            printf("read file failed! filename: %s", filename);
            return false;
        }
        
        is.read((char*)_fileHead, sizeof(fileHead));
        if (!(_fileHead->flag[0] == 'e' && 
              _fileHead->flag[1] == 'v' &&
              _fileHead->flag[2] == 'e' &&
              _fileHead->flag[3] == 'n')) {
            printf("not support this file format! filename: %s\n", filename);
            return false;
        }
        for (int i = 0; i < static_cast<int>(_fileHead->alphaVariety); ++i) {
            alphaFreq af;
            is.read((char*)&af, sizeof(af));
            _afMap.insert(std::pair<char, int>(af.alpha, af.freq));
        }
        is.close();
        return true;
    }
    bool _decode(const char* srcFilename, 
                 const char* destFilename) {
        long fileSize = _getFileSize(srcFilename);
        // The encoded data starts right after the header and the frequency table.
        long dataStart = static_cast<long>(
            sizeof(fileHead) + sizeof(alphaFreq) * _fileHead->alphaVariety);
        
        std::ifstream is(srcFilename, std::ios::binary);
        if (!is) {
            printf("read file failed! filename: %s\n", srcFilename);
            return false;
        }
        is.seekg(dataStart);

        std::ofstream io(destFilename, std::ios::binary);
        if (!io) {
            printf("create file failed! filename: %s\n", destFilename);
            return false;
        }

        Node node = _htree->getHuffTree();  // copy of the root; the children still belong to _htree
        Node* pNode = &node;

        uchar value = 0;
        std::string code;
        for (long pos = dataStart; pos < fileSize; ++pos) {
            if (!is.read((char*)&value, sizeof(uchar))) return false;
            // Only the first lastValidBit bits of the final byte carry data.
            int validBits = (pos == fileSize - 1) ? _fileHead->lastValidBit : 8;
            for (int index = 0; index < validBits; ++index) {
                if (GET_BYTE(value, index)) {
                    code += '1';
                    pNode = pNode->right;
                } else {
                    code += '0';
                    pNode = pNode->left;
                }
                if (pNode == nullptr) return false;  // corrupt data: no such path in the tree
                if (_isLeaf(pNode)) {
                    uchar alpha = _decodeMap[code];
                    io.write((char*)&alpha, sizeof(uchar));
                    code.clear();
                    pNode = &node;          // restart from the root for the next character
                }
            }
        }
        
        is.close();
        io.close();
        return true;
    }
public:
    bool decode(const char* srcFilename, const char* destFilename) {
        if (!_getAlphaFreq(srcFilename)) return false;
        if (_afMap.size() < 2) return false;    // a tree with fewer than two leaves has no codes to decode
        _htree = new huffTree(_afMap);
        // _htree->watch();    // debug helper that prints the tree; not listed in this article
        _htree->huffmanCode(_codeMap);

        for (auto it : _codeMap) {
            _decodeMap.insert(std::pair<std::string, uchar>(it.second, it.first));
        }

        return _decode(srcFilename, destFilename);
    }
private:
    fileHead *_fileHead;
    huffTree *_htree;
    std::map<uchar, int> _afMap;
    std::map<uchar, std::string> _codeMap;
    std::map<std::string, uchar> _decodeMap;
};
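The role of the code-to-character lookup can be seen in isolation with a string of '0' and '1' characters instead of a packed file. This sketch uses only the table lookup and no tree walk, which is a simplification of the decoder above; decodeBits is a hypothetical function name. It works because Huffman codes are prefix-free: no code is the beginning of another, so the first match is the only possible one.

```cpp
#include <map>
#include <string>

// Decode a string of '0'/'1' characters with a code -> character table.
// Bits are accumulated until they match a table entry; since the codes are
// prefix-free, that first match is unambiguous.
std::string decodeBits(const std::string& bits,
                       const std::map<std::string, char>& table) {
    std::string out;
    std::string code;
    for (char bit : bits) {
        code += bit;
        auto it = table.find(code);
        if (it != table.end()) {
            out += it->second;
            code.clear();       // start collecting the next code
        }
    }
    return out;                 // trailing bits that match no code are ignored
}
```

With the table {"1" -> 'a', "01" -> 'b', "001" -> 'c'}, the input "1010011" splits as 1 | 01 | 001 | 1 and decodes to "abca".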

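Summary

The principle behind compressing and decompressing files with Huffman coding is not difficult, but it draws on a fair amount of programming knowledge: priority queues, bit operations, full binary trees, containers, file operations and so on, and implementing it elegantly is not easy. I was not satisfied with the C++ implementations I found online, so I decided to write my own. I am fairly happy with the result, but given my limited experience it certainly has some problems; if you find any, please leave a comment so we can discuss them. I think the process is very good practice and makes an ideal small project: it deepens your understanding of bit operations, the C++ standard library, binary trees and file operations, and it also exercises object-oriented thinking. One more thing I must not forget: the main ideas of my implementation come from this article, whose author implemented it in C, and very elegantly: https://blog.csdn.net/weixin_38214171/article/details/81626498.

Finally, the source code of the implementation: https://github.com/evenleo/huffman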