Information Theory II: Optimal Binary Tree and Huffman Coding

At a glance:


  1. "Noise" and "Signal to Noise Ratio" of JSON

  2. The theoretical upper limit on noise; reverse Polish notation

  3. Information Theory and Compression Technology: String vs. Byte String

  4. Optimal binary tree, HTTP/2.0

  5. Huffman coding

  6. MessagePack

  7. A Huffman tree for MessagePack

  8. Prefix vs. separator

  9. MessagePack defects, host-environment bugs

  10. The limit of serialization: two basic axioms

  11. UTF-8 extreme compression

  12. Rational numbers: variable-length type shifting

  13. Dictionary compression method

  14. Tail truncation

  15. Ultra Pack and the principle of trading space for time

  16. V8 engine metaphysics (What's happening under the hood)


"I  don't talk about technology here, only thinking. "

A bit of rambling before the main content.

Originally, this ppt was meant to be presented at the company's FEConf, but the coronavirus outbreak at the beginning of the year shelved it. It is said that in the spring of 16XX a terrible plague broke out around London, and Newton, isolated at home, produced a string of top-tier results including the binomial theorem and calculus, setting off humanity's first explosion of theoretical physics...

Theoretical physics has made little progress in nearly 100 years; today's applications still rest on relativity and quantum mechanics from the beginning of the last century. So I, an ordinary corporate drone, encouraged by the "Theoretical Reserve Emergency" campaign on Weibo and in Moments, also tried to use the time spent isolating at home to dabble in science, especially theoretical research detached from any immediate application.

My chosen direction is Shannon's information theory (prompted by an evolution series on Bilibili, which claims that evolution itself is grounded in information theory?). I want a better understanding of information, entropy, and life, and to reduce my own information entropy through study. For the relationship between information and entropy, see the previous article "Information and Entropy [Part 1] Life feeds on information". This article is the second in the series; its topic is information compression in information theory. It is the ppt prepared for last year's company talk, rewritten in full as an article, and the content returns to computer software.

01

JSON noise


Theme: Looking for the limits of serialization.

What is serialization? Serialization losslessly converts multi-dimensional data into one-dimensional, linear data; it is a form of encoding. The reason for going one-dimensional is better storage and transmission. For background on serialization, refer to this article.

Information theory regards "encoding" as the process of converting information from one form or format into another. Character encoding, for example, converts information in text form into byte form; base64 encoding, conversely, converts bytes back into text. "Decoding" is the reverse process of encoding.

Term explained: encoding

JSON is exactly such a popular serialization format. It supports four basic types (numbers, strings, booleans, and null) and two composite types (lists and dictionaries). It is very easy to use and of course has its shortcomings: JSON's "signal-to-noise ratio" is very low, though exactly how low is hard to say.

In information theory's view, data = information + noise. Applied to a text-based serialization format, the formula becomes: JSON = type bits + information bits. The information bits are the effective information carried by the json; the type bits are all the remaining noise, including double quotes, commas, square brackets, curly braces, and so on.

The noise in json is compressible, and some optimizations are obvious at a glance: remove the double quotes around the "key" of each key-value pair, and replace true and false with the single letters t and f. Neither change introduces any ambiguity.
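
As a toy illustration of those two optimizations (a quick-and-dirty regex transform of my own, not a real format; it would mangle string values that happen to contain the words true or false, so a real implementation would parse first):

// Strip the quotes around simple object keys and shorten the boolean literals.
function denoise(json: string): string {
  return json
    .replace(/"([A-Za-z_][A-Za-z0-9_]*)"\s*:/g, "$1:") // unquote keys
    .replace(/\btrue\b/g, "t")                         // shorter boolean literals
    .replace(/\bfalse\b/g, "f");
}

console.log(denoise('{"visible":true,"count":12}')); // {visible:t,count:12}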

There is also a more advanced trick: replace all the brackets and braces in json with a reverse Polish (postfix) notation. This too is an effective way to reduce volume.
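
A sketch of that idea (my own illustrative scheme, not a standard one): emit values first, then a container marker that records how many children it consumes, so closing brackets are no longer needed.

// Postfix ("reverse Polish") flattening: children come first, the container marker last.
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

function toPostfix(value: Json, out: string[] = []): string[] {
  if (Array.isArray(value)) {
    value.forEach(v => toPostfix(v, out));
    out.push(`list#${value.length}`);                    // replaces "[" and "]"
  } else if (value !== null && typeof value === "object") {
    for (const [k, v] of Object.entries(value)) { out.push(k); toPostfix(v, out); }
    out.push(`map#${Object.keys(value).length}`);        // replaces "{" and "}"
  } else {
    out.push(String(value));
  }
  return out;
}

console.log(toPostfix({ name: "earth", tags: [1, 2] }).join(" "));
// name earth tags 1 2 list#2 map#2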

In addition, numbers are stored as decimal character strings, which not only adds a lot of noise but also costs extra conversion time.
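
A quick size comparison (runnable in any JavaScript/TypeScript environment) makes the point:

// The same value as decimal JSON text versus as a raw IEEE-754 double.
const asText = JSON.stringify(Math.PI);        // "3.141592653589793" -> 17 ASCII characters
const asBinary = new Float64Array([Math.PI]);  // 8 bytes, and no string parsing on the way back
console.log(asText.length, asBinary.byteLength); // 17 8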

If you look carefully, you can find plenty of redundant data in json that can be compressed further. But where is the limit of this compression? There must be one; json cannot be compressed indefinitely.

02

Is there a limit to data compression?

A Spanish developer named González designed a json compression algorithm, also text-based, which is said to compress deeply nested json down to 55% of its size. For example, given json like this:

{
    "type": "world",
    "name": "earth",
    "children": [
        {
            "type": "continent",
            "name": "America",
            "children": [
                {
                    "type": "country",
                    "name": "Chile",
                    "children": [
                        {
                            "type": "commune",
                            "name": "Antofagasta"
                        }
                    ]
                }
            ]
        },
        {
            "type": "continent",
            "name": "Europe"
        }
    ]
}

Running it through this "compression algorithm" produces the following string:

type|world|name|earth|children|continent|America|country|Chile|commune|Antofagasta|Europe^^^$0|1|2|3|4|@$0|5|2|6|4|@$0|7|2|8|4|@$0|9|2|A]]]]]|$0|5|2|B]]]

And the compression ratio is impressive, even beating MessagePack, which will be discussed later. But on closer study it turns out to exploit habits in how people use json. For example, people often effectively use typed arrays (lists whose elements share a type), constraining the attributes of the objects in a list much as json-schema does. González collects the key names that appear over and over, such as name, id, and children, stores each of them only once, and thereby removes the duplication and shrinks the output.

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/TypedArray

Introduction to TypedArray
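
The key-name trick described above can be sketched as follows (my own illustration of the general idea, not the actual algorithm):

// Collect the repeated key names once, then store every object as a plain row of values.
function packKeys(rows: Record<string, unknown>[]) {
  const keys = [...new Set(rows.flatMap(r => Object.keys(r)))];
  const data = rows.map(r => keys.map(k => r[k]));
  return { keys, data };   // each key name appears once instead of once per object
}

console.log(packKeys([
  { type: "continent", name: "America" },
  { type: "continent", name: "Europe" },
]));
// -> keys: ["type", "name"], data: [["continent", "America"], ["continent", "Europe"]]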

But a compression algorithm that relies on habit does not last, because habits change, so it is not something to recommend in general. In specific scenarios, or within a fixed team, it can still be useful.

But we cannot keep switching formats: use this one today, think of a "better" compression algorithm tomorrow, and change again. How much better is "better"? How do we quantify it, and what is the upper limit? These are the questions information theory cares about. To find that limit we have to study the "fundamental" format of information, the binary format, rather than a text format layered on top of a binary format. Improvements and optimizations inside a text format can never reach the limit of information compression. The core contradiction in the evolution of serialization formats is therefore: text format vs. digital (binary) format.

In short, the text format is a "lazy", transitional format of its era. Because text is simple and extensible, many early serialization formats, http, and even parts of the ip protocol family still retain it. That is especially true of http/1.X, which carries more than 90% of Internet traffic. Since both the header and the body (json) of http/1.1 are text, http has hit a performance bottleneck: in time, serializing and parsing http text is wasteful; in space, the large number of fields in the http header wastes bytes.

03


Information Theory and Compression Technology

But starting with http/2.0 (hereafter h2), this unhealthy trend began to change. h2 uses different compression schemes for the header and the body to improve efficiency. For static headers it uses fixed-length encoding: each commonly used header, such as content-type: text/html, is assigned a fixed index. For dynamic headers, whose names and values are user-defined, h2 uses a variable-length encoding of the ASCII text: Huffman coding.
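
Roughly how the fixed-index side works, written out as data: a small excerpt of HPACK's static header table (RFC 7541, Appendix A; the real table has 61 entries).

// A common header is sent as a small index instead of its full ASCII text.
const staticTableExcerpt: Record<number, [name: string, value?: string]> = {
  2:  [":method", "GET"],
  8:  [":status", "200"],
  31: ["content-type"],   // only the name is indexed; the value is Huffman-coded separately
};
// e.g. ":method: GET" travels as index 2 (a single byte) rather than 12 characters.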

So far the header part of http has been fully digitized, while the body remains user-defined and usually keeps the json text format. The w3c urges everyone to use alternatives to json, without saying directly which ones, leaving us free to choose a binary serialization format.

Note: the process of turning a text format into a binary format is called "digitization" here, because the binary format is more like a "digital format".

04


Optimal binary tree

Since the w3c recommends Huffman-coded binary serialization formats, it is worth understanding the data structure behind Huffman coding: the optimal binary tree.

The most straightforward way to design a code is fixed-length encoding: every type/character gets the same length, like ASCII's fixed 8 bits. As shown in the figure, encoding 5 characters with a fixed-length code requires at least 3 bits each. In the figure, each leaf of a full binary tree of depth 3 is a character, but the remaining 3 leaves are wasted because they are unused.

At this point the length of the most frequently used character can be cut from 3 bits to 1. The idea is to sacrifice meaningless code space in order to shorten the codes that carry meaning. Assigning a 1-bit code to the most frequent character yields an optimal binary tree.
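
To put illustrative numbers on it (the frequencies here are made up): suppose the five characters appear with frequencies 0.6, 0.1, 0.1, 0.1 and 0.1. Fixed-length coding spends 3 bits on every character. Giving the dominant character the 1-bit code 0 forces the other four onto 3-bit codes (100, 101, 110, 111), so the average cost becomes 0.6 × 1 + 4 × (0.1 × 3) = 1.8 bits per character, a 40% saving over the fixed 3-bit code.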

05


Huffman coding

This "optimal binary tree coding" is actually a synonym for "Huffman coding". Huffman coding is a variable-length coding, that is , the length of each coding object is different, but any arrangement and combination will not cause ambiguity . But the cost of switching from fixed-length encoding to variable-length encoding is: increased volume (loss of total depth of leaves). Therefore, when the frequency of use of all objects is constant (or the frequency is unpredictable), it is more efficient to use fixed-length coding, which means " a certain perimeter and the largest square area ".

Of course, the Huffman tree itself is built from the usage frequency of each object, merging from the leaves up toward the root, which is what makes the binary tree optimal; the details of the algorithm are omitted here.
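
For readers curious about the omitted construction, here is a minimal sketch (the frequencies in the usage line are made up):

// Minimal Huffman-tree construction: repeatedly merge the two least-frequent
// nodes until one root remains, then read each character's code off its root-to-leaf path.
type HuffNode = { freq: number; char?: string; left?: HuffNode; right?: HuffNode };

function huffmanCodes(freqs: Record<string, number>): Record<string, string> {
  let nodes: HuffNode[] = Object.entries(freqs).map(([char, freq]) => ({ char, freq }));
  while (nodes.length > 1) {
    nodes.sort((a, b) => a.freq - b.freq);      // a real implementation would use a min-heap
    const [left, right, ...rest] = nodes;       // the two least-frequent nodes
    nodes = [...rest, { freq: left.freq + right.freq, left, right }];
  }
  const codes: Record<string, string> = {};
  const walk = (n: HuffNode, code: string) => {
    if (n.char !== undefined) codes[n.char] = code || "0"; // lone-symbol edge case
    if (n.left) walk(n.left, code + "0");
    if (n.right) walk(n.right, code + "1");
  };
  walk(nodes[0], "");
  return codes;
}

console.log(huffmanCodes({ a: 60, b: 10, c: 10, d: 10, e: 10 }));
// "a" (the most frequent) ends up with a 1-bit code; b, c, d and e get 3-bit codes.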

<To be continued>


Preview of the next episode: "Looking for the Limits of Serialization"
