The original reverse-precision algorithm: the ultimate encoding for decimals

In the last issue, I gave you an early look at the Zipack format and its "more, faster, smaller" goals: "more" means more features, "faster" means faster parsing, and "smaller" means a smaller encoded size. What readers are probably most curious about, though, is the principle underneath Zipack; after all, it "arrogantly" claims to have better encodings than UTF-8 and IEEE floating-point numbers. This issue explains in detail how Zipack, at its lowest level, replaces classic IEEE floating point with its own original encoding for decimals (numbers with a fractional part): the "reverse-precision algorithm".

At present, the mainstream encoding for decimals is of course IEEE floating point. Its strengths and weaknesses were discussed in the previous issue, "Design Defects of IEEE Floating-Point Numbers". Here is a summary of that issue's conclusions:

IEEE floating point is a classic fixed-length floating-point encoding. It is compatible with integers and contains many excellent ideas, such as removing redundancy from the significand through the implicit leading 1, and letting the binary point float from the left end, which reduces the absolute value of the exponent. If this is unfamiliar, refer to the previous article.

But the reason Zipack rejects IEEE floating point is that it violates the "one-to-one mapping" principle from information theory. For example, it distinguishes +0 from -0, and a whole range of bit patterns all represent NaN. Such flaws are not tolerated in Zipack, where every type is a one-to-one mapping. In other words, any random stream of bits can be parsed into a valid Zipack object.

Hmm, the abbreviated name of the algorithm has an unfortunate ring of "schizophrenia" and "antisocial personality" to it. Let's not dwell on that.
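The redundancy in IEEE floating point described above (+0 versus -0, and the many NaN patterns) can be observed directly. The following is a small sketch using Python's standard `struct` module on 64-bit doubles; the helper names `bits64` and `from_bits64` are made up for this illustration.

```python
import math
import struct


def bits64(x: float) -> str:
    """Return the IEEE 754 binary64 pattern of x as 16 hex digits (big-endian)."""
    return struct.pack(">d", x).hex()


def from_bits64(hex_digits: str) -> float:
    """Interpret 16 hex digits as an IEEE 754 binary64 value."""
    return struct.unpack(">d", bytes.fromhex(hex_digits))[0]


# Two different bit patterns, one numeric value: +0 and -0 compare equal.
print(bits64(0.0), bits64(-0.0), 0.0 == -0.0)

# Many bit patterns, one meaning: an all-ones exponent with any nonzero
# fraction field is a NaN. These two patterns differ only in the last bit.
nan_a = from_bits64("7ff8000000000000")
nan_b = from_bits64("7ff8000000000001")
print(math.isnan(nan_a), math.isnan(nan_b))

# Count of binary64 patterns that decode to NaN:
# 2 sign values * (2**52 - 1) nonzero fraction fields.
NAN_PATTERNS = 2 * (2**52 - 1)
print(NAN_PATTERNS)
```

Every one of those NaN patterns, plus the second zero, is a code point that carries no additional information, which is the kind of waste a one-to-one encoding avoids.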

So how does the reverse-precision algorithm (also called reverse-precision encoding) work? One more piece of background is needed first: the VLQ offset natural number, covered in [how, Zipack is complicated]. The principle will not be repeated here; it is enough to know that it is a "one-to-one mapping" encoding of natural numbers that is variable-length and unbounded. A VLQ offset natural number can represent the binary form of any natural number (0, 1, 2, ...). The idea of the reverse-precision algorithm is to express a decimal as two natural numbers: one for the integer part and one for the fractional part.
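For readers who want something concrete, here is a minimal sketch of one way a one-to-one, variable-length natural number code can work. It is a generic bijective base-128 VLQ and is an assumption made for illustration: the exact byte layout in the Zipack specification may differ, and the names `vlq_encode` and `vlq_decode` are hypothetical.

```python
def vlq_encode(n: int) -> bytes:
    """Encode a natural number as a bijective base-128 VLQ (assumed layout)."""
    if n < 0:
        raise ValueError("natural numbers only")
    out = [n & 0x7F]                 # last byte: continuation bit clear
    n >>= 7
    while n > 0:
        n -= 1                       # the offset: skip values that shorter codes already cover
        out.append(0x80 | (n & 0x7F))  # earlier bytes: continuation bit set
        n >>= 7
    return bytes(reversed(out))


def vlq_decode(data: bytes, pos: int = 0) -> tuple[int, int]:
    """Decode one VLQ starting at pos; return (value, position after it)."""
    byte = data[pos]
    pos += 1
    n = byte & 0x7F
    while byte & 0x80:               # continuation bit set: another byte follows
        byte = data[pos]
        pos += 1
        n = ((n + 1) << 7) | (byte & 0x7F)
    return n, pos


for value in (0, 3, 127, 128, 16511, 16512):
    print(value, vlq_encode(value).hex(" "))
```

Without the `n -= 1` offset, the bytes `80 00` would be a second spelling of 0. With it, `80 00` means 128, so each natural number has exactly one byte string and each complete byte string decodes to exactly one number.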

Zipack's "Number Family" contains five types of real numbers: small natural numbers, positive integers, negative integers, positive decimals, and negative decimals. The five types are complementary, meaning there is no overlap between them. In theory, given unlimited bytes, they can represent any number with a finite binary expansion, at any magnitude and any precision. Only the positive and negative decimals involve the reverse-precision algorithm. Since the two are completely symmetrical, we only need to consider the unsigned (positive) case.

Imagine each unsigned decimal written out as a string, so that the point divides it into two parts: the integer part and the fractional part. The left part is simply a natural number, which can be represented by a VLQ. The right part can also be represented uniquely by a natural number, but it takes a trick. First, elementary school mathematics tells us that trailing zeros after the point are meaningless, so once they are trimmed the last digit of the fractional part must be 1. Second, kindergarten mathematics tells us that the highest digit of a positive integer must be 1. So by reversing the fractional part, we map it one-to-one onto the positive integers. (All of this is in binary.)
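The reversal trick can be checked with a few lines of Python. This is only an illustrative sketch; the function names `frac_bits_to_int` and `int_to_frac_bits` are invented here and operate on strings of binary digits.

```python
def frac_bits_to_int(frac_bits: str) -> int:
    """Map a trimmed binary fractional part (must end in '1') to a positive integer."""
    if not frac_bits or set(frac_bits) - {"0", "1"} or frac_bits[-1] != "1":
        raise ValueError("expected binary digits ending in '1'")
    # After reversal the string starts with '1', so no information is lost
    # when it is read as an integer.
    return int(frac_bits[::-1], 2)


def int_to_frac_bits(k: int) -> str:
    """Inverse mapping: positive integer back to the fractional digits."""
    if k < 1:
        raise ValueError("positive integers only")
    return bin(k)[2:][::-1]


# Walking through the positive integers enumerates every trimmed
# fractional part exactly once.
for k in range(1, 9):
    print(k, "0." + int_to_frac_bits(k))
```

Counting upward from 1 produces 0.1, 0.01, 0.11, 0.001, and so on, with no fractional part skipped and none repeated.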

VLQ offset natural numbers start from 0, while positive integers start from 1, so we simply shift by one unit: subtract 1 before encoding. The following example shows how to use the reverse-precision algorithm to encode the binary number 110.0101.

  1. trim: remove the meaningless leading and trailing "0" digits

  2. split: split 110.0101 at the point into 110 and 0101

  3. encode: encode the left part, 110, as a VLQ natural number, denoted A

  4. reverse: reverse the right part, 0101, to get 1010

  5. offset: 1010 - 1 = 1001

  6. encode: encode 1001 as a VLQ natural number, denoted B

  7. concat: join A and B with nothing in between, and output AB

That is the reverse-precision algorithm: seven steps, simple and easy to understand. It is better than IEEE floating point because it achieves a one-to-one mapping: there is no ambiguity, no redundancy, and no upper limit (a property inherited from VLQ).
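The seven steps can be sketched in Python as follows. This is an illustration under assumptions, not Zipack's reference implementation: the input is a string of binary digits, `encode_unsigned_binary` is a made-up name, and `vlq_encode` is a generic bijective base-128 VLQ standing in for the VLQ offset natural number, whose exact byte layout is defined by the Zipack specification.

```python
def vlq_encode(n: int) -> bytes:
    """Bijective base-128 VLQ (assumed stand-in for Zipack's VLQ natural number)."""
    out = [n & 0x7F]
    n >>= 7
    while n > 0:
        n -= 1
        out.append(0x80 | (n & 0x7F))
        n >>= 7
    return bytes(reversed(out))


def encode_unsigned_binary(text: str) -> bytes:
    """Encode an unsigned binary literal such as '110.0101' as A followed by B."""
    # 2. split at the point
    int_bits, _, frac_bits = text.partition(".")
    if set(int_bits + frac_bits) - {"0", "1"}:
        raise ValueError("not a binary literal")
    # 1. trim meaningless zeros at both ends
    int_bits = int_bits.lstrip("0")
    frac_bits = frac_bits.rstrip("0")
    if not frac_bits:
        raise ValueError("no fractional part: this belongs to an integer type")
    # 3. encode the integer part as A
    a = vlq_encode(int(int_bits, 2) if int_bits else 0)
    # 4. reverse the fractional part (it now starts with '1')
    reversed_bits = frac_bits[::-1]
    # 5. offset by one so the range starts at 0
    b_value = int(reversed_bits, 2) - 1
    # 6. encode the result as B
    b = vlq_encode(b_value)
    # 7. concatenate
    return a + b


print(encode_unsigned_binary("110.0101").hex(" "))  # 06 09 with the VLQ assumed here
```

Because of the trim step, "0110.010100" produces the same bytes as "110.0101", so two spellings of the same value cannot end up with different encodings.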

In the example above, once we have AB we still need to add a prefix to form a complete Zipack object: the prefix for positive decimals is 0xF2, and the prefix for negative decimals is 0xF3. These prefixes are usually one byte long and indicate the type of the object that follows. You can try out the compression efficiency of the reverse-precision algorithm on the Zipack official website:

In the example shown in the figure, the base-10 number -0.125 is first converted to binary, -0.001, and then serialized as Zipack's negative decimal type: [F3 00 03]. Here F3 marks a negative decimal, 00 is the integer part (0), and 03 is the fractional part: 001 reversed is 100, which is 4, and subtracting 1 gives 3.
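Reading the bytes back is just the same steps in reverse. The sketch below decodes [F3 00 03] into an exact fraction. The prefixes 0xF2 and 0xF3 come from the text above; everything else is an assumption for illustration: `decode_decimal` is a hypothetical name, and `vlq_decode` assumes a generic bijective base-128 VLQ rather than the exact layout in the Zipack specification.

```python
from fractions import Fraction

POSITIVE_DECIMAL = 0xF2  # prefix values as given in the article
NEGATIVE_DECIMAL = 0xF3


def vlq_decode(data: bytes, pos: int) -> tuple[int, int]:
    """Decode one bijective base-128 VLQ (assumed layout); return (value, next position)."""
    byte = data[pos]
    pos += 1
    n = byte & 0x7F
    while byte & 0x80:
        byte = data[pos]
        pos += 1
        n = ((n + 1) << 7) | (byte & 0x7F)
    return n, pos


def decode_decimal(data: bytes) -> Fraction:
    """Decode a prefixed positive or negative decimal object into an exact Fraction."""
    prefix = data[0]
    if prefix not in (POSITIVE_DECIMAL, NEGATIVE_DECIMAL):
        raise ValueError("not a decimal object")
    a, pos = vlq_decode(data, 1)        # integer part
    b, pos = vlq_decode(data, pos)      # offset, reversed fractional part
    reversed_bits = bin(b + 1)[2:]      # undo the offset; the leading bit is 1
    frac_bits = reversed_bits[::-1]     # undo the reversal
    value = a + Fraction(int(frac_bits, 2), 2 ** len(frac_bits))
    return -value if prefix == NEGATIVE_DECIMAL else value


print(decode_decimal(bytes([0xF3, 0x00, 0x03])))  # -1/8
```

Using `Fraction` rather than `float` keeps the result exact, which matters because the encoding itself has no precision limit.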

The reverse-precision algorithm is just one of the core pieces of Zipack. All of the core ideas and design concepts are recorded in Zipack's specification documents, and together they are what let Zipack beat JSON by a wide margin. Zipack is still in its promotion stage, and talented contributors like you are urgently needed. Gitee repository: https://gitee.com/zipack/spec

Origin blog.csdn.net/github_38885296/article/details/107041513