Python - Huffman Tree: implementation and application

Table of contents

1. Introduction

2. Huffman Tree theory

   1. Definition

   2. Structure

   3. Construction

3. Huffman Tree implementation

   1. Generate Huffman tree

   2. Encoding Huffman codes

   3. Decoding Huffman codes

   4. Huffman tree encoding and decoding practice

4. Summary


1. Introduction

In the previous Word2vec article, it was pointed out that computing the Softmax over all N words in the vocabulary at every step is very expensive, and that this can be optimized with negative sampling or hierarchical Softmax. Hierarchical Softmax relies on the Huffman tree. Below is a brief introduction to the Huffman tree and its Python implementation.

2. Huffman Tree theory

1. Definition

Given N weighted objects as N leaf nodes, construct a binary tree. If the weighted path length of the tree is minimal, the binary tree is called an optimal binary tree, also known as a Huffman tree. The Huffman tree is the tree with the shortest weighted path length, and nodes with larger weights sit closer to the root.

2. Structure

The Huffman tree is also called the optimal tree. From the definition we can extract several key terms:

- path

A path is the sequence of branches traversed from one node in the tree to another. As shown in the figure above, we label the left and right branches of each node 0 and 1 respectively. Taking node d as an example, its path is 110. This path encoding is also called the Huffman code, and it uniquely identifies the path from the root node to the corresponding node. In Word2vec, each leaf node can represent a word in the vocabulary.

- length

The total number of branches on the path is called the path length. Taking node d as an example, its code 110 means the root reaches d in three steps, so the path length is 3.

- Weight

The N weighted objects in the definition can be understood as each binary-tree Node having a freq field that records its weight, i.e. its frequency. In Word2vec, for example, each word is a leaf node and its word frequency serves as the weight.

- node weighted path length

Take node d as an example: its Huffman code is 110, the path length is 3, and the weight is 1, so the weighted path length is 3 × 1 = 3.

- tree weighted path length

Summing the weighted path lengths of all leaf nodes gives the weighted path length of the tree. Assuming there are n leaf nodes:

WPL = \sum_{i=1}^{n} l_i w_i = l_1 w_1 + l_2 w_2 + \cdots + l_n w_n
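As a quick check, the formula can be evaluated on a small hand-built tree. This is just a sketch; the minimal Node class here is a stand-in with the same fields as the one defined in the implementation section:

```python
class Node:
    def __init__(self, freq, char=None):
        self.left = None
        self.right = None
        self.freq = freq
        self.char = char

def weighted_path_length(node, depth=0):
    # Leaf node: contributes path length * weight.
    if node.left is None and node.right is None:
        return depth * node.freq
    # Internal node: sum the contributions of both subtrees.
    return (weighted_path_length(node.left, depth + 1)
            + weighted_path_length(node.right, depth + 1))

# A tiny tree: leaves of weight 1 and 2 at depth 2, weight 3 at depth 1.
root = Node(6)
inner = Node(3)
inner.left, inner.right = Node(1, 'a'), Node(2, 'b')
root.left, root.right = inner, Node(3, 'c')

print(weighted_path_length(root))  # 2*1 + 2*2 + 1*3 = 9
```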

3. Construction

S stands for String (the character) and F stands for Freq (the frequency). The definition of the Huffman tree requires the minimum weighted path length for the tree, so the lower a character's frequency, the longer its path, and the higher its frequency, the shorter its path, which makes ∑ W × L as small as possible.

- sorting and merging

First sort the character list by frequency and pop the two smallest Nodes to form a new NewNode. The left subtree of NewNode is the Node with the smaller weight, the right subtree is the Node with the larger weight, and NewNode's weight is the sum of the weights of its two subtrees.

- loop merge

Add the newly merged NewNode back to the list of remaining Nodes and repeat the sort-and-merge step of the previous paragraph, until only one Node remains in the list. That Node is the root, and the Huffman tree is complete.

- generate examples

P1 [1,1,2,3,5,5]: B-1 and F-1 have the smallest weights and are merged first into a new node with weight 1+1=2

P2 [2,2,3,5,5]: New-2 and C-2 are now the two smallest nodes; they are merged into a new node with weight 4

P3 [3,4,5,5]: A-3 and New-4 form a new node with weight 7

P4 [5,5,7]: D-5 and E-5 form a new node New-10

P5 [7,10]: the last two nodes form New-17, with the smaller node (7) on the left and the larger node (10) on the right

P6 [17]: the list now has only one element, and the iteration is complete
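The merge sequence P1–P6 above can be verified with a short sketch that works on the weights alone (the character labels A–F are omitted):

```python
weights = [3, 1, 2, 5, 5, 1]  # A-3, B-1, C-2, D-5, E-5, F-1

merged = []
while len(weights) > 1:
    weights.sort()
    left = weights.pop(0)   # smaller weight -> left subtree
    right = weights.pop(0)  # larger weight -> right subtree
    weights.append(left + right)
    merged.append(left + right)

print(merged)      # [2, 4, 7, 10, 17] -- the new node weights from P1..P5
print(weights[0])  # 17, the root weight
```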

3. Huffman Tree implementation

class Node:
    def __init__(self, freq, char=None):
        self.left = None
        self.right = None
        self.freq = freq
        self.char = char

    def __lt__(self, other):
        return self.freq < other.freq

First define the Node class for the binary-tree nodes, adding a freq field for the occurrence frequency of the corresponding character char. The __lt__ method lets Nodes be compared, and therefore sorted, by frequency.

1. Generate Huffman tree

def create_tree(nodes):
    while len(nodes) > 1:
        nodes.sort()
        left_node = nodes.pop(0)
        right_node = nodes.pop(0)
        new_node = Node(left_node.freq + right_node.freq)
        new_node.left = left_node
        new_node.right = right_node
        nodes.append(new_node)

    return nodes[0]

The loop condition is that the length of nodes is greater than 1, so construction finishes when only one Node is left; compare this with the example constructed above. A min-heap can be used here to avoid the cost of re-sorting the list on every iteration.
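A min-heap version of create_tree, as a sketch of that optimization (heapq relies on the __lt__ defined on Node, and each merge step drops from a full sort to O(log n)):

```python
import heapq

class Node:
    def __init__(self, freq, char=None):
        self.left = None
        self.right = None
        self.freq = freq
        self.char = char

    def __lt__(self, other):
        return self.freq < other.freq

def create_tree_heap(nodes):
    heapq.heapify(nodes)  # O(n) once, instead of sorting every iteration
    while len(nodes) > 1:
        left_node = heapq.heappop(nodes)   # smallest frequency
        right_node = heapq.heappop(nodes)  # second smallest
        new_node = Node(left_node.freq + right_node.freq)
        new_node.left = left_node
        new_node.right = right_node
        heapq.heappush(nodes, new_node)
    return nodes[0]

# The example from the construction section: weights [3, 1, 2, 5, 5, 1].
nodes = [Node(f, c) for c, f in
         [('a', 3), ('b', 1), ('c', 2), ('d', 5), ('e', 5), ('f', 1)]]
root = create_tree_heap(nodes)
print(root.freq)  # 17
```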

2. Encoding Huffman codes

def huffman_encoding(text):
    # 1. Count character frequencies
    freq_dict = {}
    for char in text:
        if char in freq_dict:
            freq_dict[char] += 1
        else:
            freq_dict[char] = 1

    # 2. Build the Huffman tree
    nodes = [Node(freq_dict[char], char) for char in freq_dict]
    root = create_tree(nodes)

    # 3. Recursively collect the Huffman code of each char
    huffman_codes = {}
    traverse(root, "", huffman_codes)

    # 4. Encode the original text
    encoded_text = ""
    for char in text:
        encoded_text += huffman_codes[char]

    return encoded_text, huffman_codes

Given a text, first count the character frequencies, then construct the Nodes array and create the Huffman tree, and finally obtain each char's Huffman code recursively. The logic of the traverse function is as follows:

def traverse(node, code, huffman_codes):
    if node.char is not None:
        huffman_codes[node.char] = code
    else:
        traverse(node.left, code + '0', huffman_codes)
        traverse(node.right, code + '1', huffman_codes)

Only nodes that carry a char, i.e. the leaf nodes, save a Huffman code into the map; the merged internal nodes merely pass the recursion down.

3. Decoding Huffman codes

def huffman_decoding(encoded_text, huffman_codes):
    decoded_text = ""
    code = ""
    for bit in encoded_text:
        code += bit
        for char in huffman_codes:
            if huffman_codes[char] == code:
                decoded_text += char
                code = ""
                break

    return decoded_text

Bits are appended to code one at a time; whenever code matches an entry in the previously built char → huffman_code map, the decoded character is emitted, code is reset, and decoding continues.
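The inner loop over huffman_codes can be avoided by inverting the map once. A sketch: since Huffman codes are prefix-free, a single dictionary lookup per accumulated code is unambiguous.

```python
def huffman_decoding_fast(encoded_text, huffman_codes):
    # Invert the char -> code map once; Huffman codes are prefix-free,
    # so each accumulated code matches at most one character.
    code_to_char = {code: char for char, code in huffman_codes.items()}
    decoded_text = ""
    code = ""
    for bit in encoded_text:
        code += bit
        if code in code_to_char:
            decoded_text += code_to_char[code]
            code = ""
    return decoded_text

codes = {'a': '0', 'b': '10', 'c': '11'}
print(huffman_decoding_fast('01011', codes))  # abc
```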

4. Huffman tree encoding and decoding practice

if __name__ == '__main__':
    text = "Hello BitDDD"

    encoded_text, huffman_codes = huffman_encoding(text.lower())
    decoded_text = huffman_decoding(encoded_text, huffman_codes)

    print("Original text:", text)
    print("Encoded text:", encoded_text)
    print("Huffman codes:", huffman_codes)
    print("Decoded text:", decoded_text)

- Original text

('Original text:', 'Hello BitDDD')

- Encoded text

('Encoded text:', '0001110101101001110011011111100010101')

- Huffman codes

('Huffman codes:', {' ': '1100', 'b': '1101', 'e': '1110', 'd': '01', 'i': '1111', 'h': '000', 'l': '101', 'o': '001', 't': '100'})

- Decoded text

('Decoded text:', 'hello bitddd')

4. Summary

The basic concept and construction of the Huffman tree were introduced above. Here is a brief look at how the Huffman tree is applied in word2vec's hierarchical softmax; for more details, see: Gensim Word2Vec practice.

- Optimize the number of calculations

By building a Huffman tree over the N words in the vocabulary, computing a word's probability only requires following its path in the binary tree, i.e. a number of steps equal to its Huffman code length, so the cost per word drops from O(N) to O(log N).

- path reduction

Since the Huffman tree is built from word frequencies, high-frequency words sit higher in the tree and have shorter paths, so they require even less computation.


Origin: blog.csdn.net/BIT_666/article/details/129894355