Information Theory III: Looking for the Limits of Serialization

List of all chapters in this series:


  1. "Noise" and "Signal to Noise Ratio" of JSON

  2. The theoretical upper limit of the amount of noise, inverse Polish expression

  3. Information Theory and Compression Technology: String vs. Byte String

  4. Best binary tree, FPS/2.0

  5. Huffman coding

  6. Message Pack

  7. Huffman tree for Message Pack

  8. Prefix VS separator

  9. Message Pack defect, host environment bug

  10. The limit of serialization, two basic axioms

  11. UTF-8 extreme compression

  12. Rational number: variable length type shift

  13. Dictionary compression method

  14. Tail mutilation

  15. Ultra Pack and the principle of space-time replacement

  16. V8 engine metaphysics (What's happening under the hood)


This is the third article in the [Strange Knowledge] series, continuing on from Chapters 1–5, "Optimal Binary Trees and Huffman Coding," above. This installment starts from Chapter 6.

La la la.


06


Message Pack

Message Pack, hereinafter msp or msgPack, is a popular binary serialization format that is based on Huffman coding and compatible with JSON.

msp is JSON-compatible because it supports all of JSON's data types (4 primitive types and 2 composite types). On top of that, msp has types of its own: a pure binary format (also called a byte string), a datetime format, custom reserved types, and variable-length primitive types.

msp is Huffman-based in the sense that every data type in msp is an object to be coded: each type is assigned its own prefix code on the tree.

The variable-length primitive types include variable-length numbers, variable-length strings, and variable-length byte strings. Variable-length means that "smaller" data is stored in the smallest possible space; for example, a non-negative integer up to 127 occupies only 1 byte.
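This is easy to verify with the open-source Python msgpack package (one assumption here: that it is installed, e.g. via pip install msgpack):

```python
import msgpack

packed = msgpack.packb(127)     # the largest "positive fixint"
print(packed, len(packed))      # b'\x7f' 1 -- type and value share a single byte

packed = msgpack.packb(128)     # one past the fixint range
print(packed, len(packed))      # b'\xcc\x80' 2 -- uint8 now needs a type byte
```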

msp also supports "seamless streaming." Streaming JSON, by contrast, is a hassle: there are many "streaming" workarounds, and the most common one, ndjson, spends a newline character to separate each JSON document. Because msp uses a prefix to bound each element's length, no separator or terminator is needed, and consecutive msp objects can be concatenated seamlessly.
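A sketch of that seamlessness with the same Python msgpack package: two objects packed back to back, with the boundaries recovered purely from the length-bearing prefixes.

```python
import msgpack

# Two independent objects concatenated with nothing in between.
stream = msgpack.packb({"a": 1}) + msgpack.packb([1, 2, 3])

# The streaming Unpacker finds each object's end from its prefix alone.
unpacker = msgpack.Unpacker()
unpacker.feed(stream)
for obj in unpacker:
    print(obj)   # {'a': 1}, then [1, 2, 3]
```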

For example:

In the demo in the figure, a 29-byte JSON object shrinks to 20 bytes after msp compression. The highlighted bytes/characters in the figure carry the effective information; the remaining gray parts are noise, and information / noise = signal-to-noise ratio. Clearly msp has the higher signal-to-noise ratio and the smaller size. Of course, this way of computing the signal-to-noise ratio is not rigorous; a real calculation should also weigh factors such as how often each type is used, but one example is enough to make the point.
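The figure isn't reproduced here, but the size gap is easy to check on any small object; the exact byte counts depend on the payload, which below is one I made up for illustration:

```python
import json
import msgpack

obj = {"id": 42, "name": "msp", "ok": True}

as_json = json.dumps(obj, separators=(",", ":")).encode("utf-8")
as_msp = msgpack.packb(obj)

print(len(as_json), len(as_msp))  # 32 18 -- quotes, colons, braces become compact prefixes
```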

For a comprehensive comparison of msp and JSON, see the article "MessagePack: The Format Most Likely to Replace JSON." Its conclusion: msp is, in theory, smaller, faster, and richer than JSON.

07


Huffman tree for Message Pack

Having sung msp's praises, let's look at its specification.

Arranging all the data types msp supports into a tree by their prefix codes yields the Huffman tree in the figure above. Because the tree is large, I take the "110" prefix node as the dividing point and split the msp Huffman tree into two parts, "before 110" and "after 110": before 110 sit the common types, with prefixes 1 to 4 bits long; after 110 sit the relatively "unpopular" types, with 8-bit prefixes.
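These two families are visible in the first byte of real encodings. A sketch, printing the first byte of each packed value in binary:

```python
import msgpack

for value in [0, -1, "hi", [1], {"k": 1}, None, 3.14]:
    first = msgpack.packb(value)[0]
    print(f"{first:08b}  {value!r}")

# 00000000  0         -- positive fixint, prefix 0
# 11111111  -1        -- negative fixint, prefix 111
# 10100010  'hi'      -- fixstr, prefix 101
# 10010001  [1]       -- fixarray, prefix 1001
# 10000001  {'k': 1}  -- fixmap, prefix 1000
# 11000000  None      -- nil: an 8-bit code in the "after 110" family
# 11001011  3.14      -- float64: likewise behind 110
```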

08


Prefix vs. separator

The benchmark in the figure was run on the Python platform. Why Python rather than JS will be explained at the end of the article ε=ε=ε=┏(゜ロ゜;)┛.

As the figure shows, under Python 3 and for the same information content, msp is 16.2% smaller than JSON and decodes much faster; only encoding takes longer. Overall, msp beats JSON.

But why is msp slower to encode? My personal guess: JSON divides elements with separators, while msp divides them with prefixes. The advantage of a prefix is decoding speed: the prefix carries the length of the next element, so the decoder can "jump" through the data, unlike JSON, which must scan character by character until it hits a separator or terminator.

Encoding and decoding, however, are a pair of inverse processes: as decoding gets faster, encoding naturally gets slower, and that tradeoff can't be escaped. For a separator-style serialization format, encoding is a single streaming pass with no pauses, but a prefix-style format must compute each element's length after writing it and then insert that length in front of the element, which naturally costs more time.

This is why msp encodes more slowly than JSON.
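To make the tradeoff concrete, here is a toy length-prefix codec set against a separator codec. This illustrates the principle only; it is not msp's actual wire format:

```python
# Toy formats: items are short byte strings; the separator variant assumes
# no b"," inside an item, the prefixed variant assumes items under 256 bytes.

def encode_separated(items: list[bytes]) -> bytes:
    return b",".join(items)            # one streaming pass, no lengths computed

def decode_separated(buf: bytes) -> list[bytes]:
    return buf.split(b",")             # must inspect every byte for separators

def encode_prefixed(items: list[bytes]) -> bytes:
    out = bytearray()
    for item in items:
        out.append(len(item))          # extra encode-time work: measure, prepend
        out += item
    return bytes(out)

def decode_prefixed(buf: bytes) -> list[bytes]:
    items, i = [], 0
    while i < len(buf):
        n = buf[i]                     # the prefix says how far to jump
        items.append(buf[i + 1 : i + 1 + n])
        i += 1 + n                     # skip the item without scanning it
    return items

data = [b"alpha", b"beta", b"gamma"]
assert decode_separated(encode_separated(data)) == data
assert decode_prefixed(encode_prefixed(data)) == data
```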

09


Defects of MsgPack

Although the "signal to noise ratio" of msp is not known, the naked eye can see that msp also has some defects. For example, the Huffman tree of msp needs to be optimized. Remember the tree "after 110" before. The prefix length of the 32 data types on that tree is completely symmetrical. Common sense tells us that the more tidy things are, the lower the performance, the more the Huffman tree. "Tidy" means that the variable length coding has not been well designed. After all, it is impossible for every type of use frequency to be the same.

msp also has too many reserved types: 9 of them, counting the extension types and the "never used" type. So many reserved types are pointless; after all, reserved types are used very rarely.

msp's ecosystem is imperfect too. Although open-source codecs exist for dozens of languages, without standard-library support it is hard for msp to gain official recognition.

In short, msp leaves room for further compression. Where is the limit of compression? No one knows.

10


The limits of serialization

From the early text formats to the later binary serialization formats, we have been hunting for the limit of serialization. Where is that limit? We can't hunt blindly; we first have to define it. So I propose two principles as the basic axioms of the serialization limit; judge for yourself whether they are reasonable:

  1. Principle 1: Any byte string is meaningful

  2. Principle 2: Different byte strings have different meanings

What do these two principles mean? For Principle 1: suppose you are handed a keyboard with only 0 and 1 keys, you mash out an arbitrary byte string, and the result is fed to a decoder. If decoding always succeeds, the encoding format satisfies Principle 1; if it can ever raise an error, the format violates Principle 1.

Clearly JSON, msp, and even UTF-8 all violate Principle 1, while ASCII (in its 8-bit, extended form) obeys it, because every one of the 256 one-byte values represents a character. In fact, most variable-length encoding formats violate Principle 1.
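Principle 1 can even be tested mechanically: throw random byte strings at each decoder and see whether it ever rejects one. A sketch (trial count and blob length picked arbitrarily):

```python
import os
import json
import msgpack

def survives(decode, trials=10_000):
    """Fraction of random byte strings the decoder accepts without error."""
    ok = 0
    for _ in range(trials):
        blob = os.urandom(8)
        try:
            decode(blob)
            ok += 1
        except Exception:
            pass
    return ok / trials

def unpack_all(blob):
    # Drain a streaming unpacker so a trailing partial object doesn't count as an error.
    u = msgpack.Unpacker()
    u.feed(blob)
    return list(u)

print(survives(lambda b: b.decode("latin-1")))  # 1.0 -- all 256 byte values are characters
print(survives(lambda b: b.decode("utf-8")))    # far below 1.0
print(survives(json.loads))                     # nearly 0.0
print(survives(unpack_all))                     # below 1.0 (0xc1, bad utf-8 in str payloads, ...)
```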

Principle 2 means that n distinct byte strings carry n distinct meanings. If two different byte strings express the same meaning, then from the standpoint of information theory that is waste.
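msp itself illustrates the waste Principle 2 forbids: the integer 1 has several legal byte strings, and decoders accept them all (encoders just happen to emit the shortest).

```python
import msgpack

# Three different byte strings that all decode to the integer 1.
for blob in (b"\x01",           # positive fixint
             b"\xcc\x01",       # uint8, with an explicit type byte
             b"\xcd\x00\x01"):  # uint16, with two payload bytes
    print(blob, "->", msgpack.unpackb(blob))
```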

Together, the two principles guarantee that the data volume is squeezed to the limit, with no regard for encoding/decoding speed. Since this series cares only about space, questions of time complexity are left out.

In short, whenever a serialization (encoding) format satisfies Principles 1 and 2, we call it the (spatial) limit of serialization.

And so the stock of strange knowledge grows again.

11


UTF-8 extreme compression

To reach the compression limit of serialization, we will analyze each data type in turn, starting with the simplest: the string.

UTF-8 is a character encoding everyone knows, and it is a variable-length encoding. Its Huffman table is shown in the figure above. At present UTF-8 characters run from 1 to 4 bytes, each length with its own prefix, but two prefixes are special:

  1. The continuation-byte prefix (10)

  2. The reserved prefix (11111)

The continuation prefix 10 marks every byte of a multi-byte character except the first; single-byte characters have none. Although the 10 prefix buys validation and backward indexing, from an information-theoretic standpoint it is redundant noise that isn't strictly needed. For the detailed argument, see the article "Is This a Design Flaw in UTF-8 Character Encoding?"
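The 10 prefix is plain to see if you dump a multi-byte character bit by bit; take the euro sign, which UTF-8 spreads over three bytes:

```python
ch = "€"                          # U+20AC
for b in ch.encode("utf-8"):
    print(f"{b:08b}")

# 11100010   <- lead byte: 1110 says "3 bytes total", then 4 payload bits
# 10000010   <- continuation byte: fixed 10 prefix + 6 payload bits
# 10101100   <- continuation byte: fixed 10 prefix + 6 payload bits
# Payload: 0010 000010 101100 = 0x20AC; 4 of the 24 stored bits are
# the fixed 10 prefixes -- the "redundant noise" discussed above.
```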

The reserved prefix 11111 is held for characters that may appear in the future, chiefly ones longer than 4 bytes.

Both 10 and 11111 violate Principle 1, because either prefix showing up in the wrong position makes UTF-8 parsing fail outright.

What makes these two prefixes special is that they live on UTF-8's Huffman tree yet stand for no concrete encoded object, as the figure below shows:

The two prefixes marked in red in the figure are the ones that violate Principle 1. What if those two leaves are pruned from the tree? Pruned, it becomes minUTF8, the second, simpler Huffman tree in the figure. minUTF8 is a compressed version of UTF-8: the useless prefixes are gone and storage cost drops substantially.

The name minUTF-8 is one I picked casually; it stands only for a possible encoding scheme and may never see practical use.
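The figure isn't shown here, but the arithmetic behind the saving is easy to redo. Under one possible reading of minUTF8 (my assumption: lead bytes keep their length-announcing prefixes, continuation bytes drop 10 and carry all 8 bits), the payload capacity at each encoded length grows like this:

```python
# Payload bits left in the lead byte after its length prefix (1- to 4-byte forms).
lead_payload = {1: 7, 2: 5, 3: 4, 4: 3}

for total, lead_bits in lead_payload.items():
    cont = total - 1
    utf8 = lead_bits + 6 * cont   # UTF-8: continuation bytes keep 6 payload bits
    mini = lead_bits + 8 * cont   # hypothetical minUTF8: all 8 bits are payload
    print(f"{total} byte(s): {utf8:2d} -> {mini:2d} payload bits")

# 1 byte(s):  7 ->  7 payload bits
# 2 byte(s): 11 -> 13 payload bits
# 3 byte(s): 16 -> 20 payload bits
# 4 byte(s): 21 -> 27 payload bits
```

Every continuation byte gains 2 bits under this reading, so many code points would drop to a shorter form, which is where the storage saving would come from.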

<To be continued>

Next up: "Hosts, the Space-Time Tradeoff, and V8 Metaphysics"


Source: blog.csdn.net/github_38885296/article/details/104853356