Talk about the original intention of Zipack format

This issue mainly talks about my original intention and plan for designing the Zipack format. Many places in the article directly quote the Zipack design draft written at the beginning of the year.


What is the serialization format

The serialization format is a linear arrangement of binary data for storage and transmission. The serialization format is used to exchange common data formats on different platforms. For example, JSON is a popular serialization format.

 

Performance evaluation standard of serialization format

The performance of a serialization format can be evaluated from the perspective of time and space.

  • Time: the speed of encoding and decoding in serialized format.

  • Space: The volume of the serialized format under the same amount of information.

 

Why develop a serialization format

Designing private communication protocols/data formats is very common in large enterprises. For example, Google uses private routing protocols to replace TCP/IP, and domestic BAT uses private RPC communication protocols to replace HTTP.

The Internet industry can be roughly classified into two categories: platform-oriented and content-oriented. As the volume of any platform increases, the internals inevitably tend to become bloated. For this reason, it is particularly important to consider new protocols or formats from the bottom.

Zipack intends to learn from the messagepack design ideas (not based on), refer to the prefix format of the TCP/IP routing protocol, and design a set of formats suitable for enterprises and Internet platforms.

Evaluating the quality of a product is usually evaluated from the perspective of function, performance, and security, and the serialization format is no exception.

  • Function: Compared with the popular JSON format, Zipack can support new data types, such as bytearray, timestamp, extension (extension type)...

  • Performance: Zipack will use a variety of data compression algorithms. In terms of space, it is expected to save 20% to 40% of the volume compared to JSON; in terms of time, compared to text format JSON, binary format Zipack can encode and decode faster

  • Security: The internal communication of the platform uses a private format, which increases the difficulty of hacking to a certain extent (such as crawlers)

Possible limitations of Zipack

  • The speed of Zipack on the v8 engine (JavaScript engine) may be slower than the native JSON (the reason is that the speed close to the hardware cannot be obtained on the v8 engine)

  • Need to introduce Zipack dependency when using

  • Visual read and write Zipack plug-ins need to be developed separately, such as browser debugging and IDE editing of Zipack files

 

Application scenarios

  • Memory cache

  • letter of agreement

  • Configuration file

  • ......

 

Huffman prefix encoding (optimal binary tree)

Zipack will absorb various excellent coding ideas in the world, including JSON, message pack, protoBuf, etc., and combine them into a set of original compression formats.

In general, Zipack uses Huffman to encode all data types. Each type is assigned a prefix. The prefix length ranges from 1 to 8. See the Zipack tree diagram at the bottom for details.

Due to hardware limitations, whether it is the type prefix , length segment, or content payload , they are all integral multiples of bytes. The prefix and payload of less than 1 byte together form a whole byte.

Two principles of information coding: no ambiguity, no redundancy

Information theory requires a one-to-one correspondence between encoded values ​​(serialized binary values) and actual meanings in order to compress the information to a minimum. There are two types of situations that break the one-to-one correspondence:

  • Ambiguity: The same code has multiple different meanings

  • Redundancy: multiple codes correspond to the same meaning

Each data type of Zipack avoids these two situations.

Two modes of scanning termination signal: prefix VS rest

When the scanner (decoder) scans from left to right on a piece of serialized data, when scanning to a certain "sub-element/object/character/value", when it ends is a key point, usually there are two ways To hint when to stop.

  • Prefix: store the length of the child element in the prefix.

  • Rest: A "rest" at the end is used to prompt the scanner. It can be a termination character or a termination byte.

The former way of writing the length in the prefix is ​​very common in binary protocol formats, such as many IP sub-protocols and binary serialization formats; the latter way of terminating by "rests" is common in massive text formats and In the ancient text-based communication protocol, even the decoding of DNA used "terminators" to separate peptide chains.

Zipack will use a combination of two termination modes. The VLQ to be discussed next belongs to the latter kind of "rest".

 

VLQ variable length coding (Variable-length quantity)

VLQ technology is a popular "variable length" encoding scheme. When a value is scanned (big-endian) from left to right in a serialized message, if the highest bit of each byte is 1, the value is still Subsequent bytes, until the byte with the highest bit of 0 is scanned.

VLQ saves space while retaining a certain degree of scalability. VLQ coding often appears in Zipack's design.

 

VLQ natural number [key point], VLQ natural shift

VLQ natural numbers refer to binary natural numbers stored on the basis of VLQ coding. The actual value of a multi-byte VLQ natural number is equal to its face value plus an offset value, which is equal to the maximum number of bytes in the previous level plus one, which is the minimum value of this level.

The reason for the offset is that in the natural state, different real number lengths share a part of the real number space. For example, a 3-byte real number contains the entire space of 2 bytes. For example, 00 00 01 and 00 01 are both 1.

Therefore, the actual value of the real number of each length must be added to the sum of all the spaces of the shorter length. For example, 00 01 represents 1, and 00 00 01 represents 257 (255+2).

The VLQ integers of different numbers of bytes and the corresponding actual values ​​have the following relationship:

Number of bytes

Integer space

min

max

1 2^7 0 -1+2^7
2 2^14 2^7 -1+2^7+2^14
3 2^21 2^7+2^14 -1+2^7+2^14+2^21
n 2 ^ 7n 2^7+2^14+...+2^7(n-1)

-1 + 2 ^ 7 + 2 ^ 14 + ... + 2 ^ 7n


Among them, each min is equal to max+1 of the previous line.

min represents the meaning of "all 0s" in several 7-bit groups in this integer space, and max represents the meaning of "all 1s" in this integer space.

" VLQ offset natural number " is the basis of Zipack. Regardless of the type of real number or the number of various objects, there are VLQ offset natural numbers, hereinafter referred to as VLQ natural numbers.

 

 

VLQ characters and strings

VLQ characters refer to Unicode characters mapped on the basis of VLQ natural numbers. Each VLQ natural number corresponds to a Unicode serial number.

The VLQ character string is a seamless splicing of several VLQ characters, and a VLQ natural number is required before the character string to indicate the number of characters. VLQ string is one of the basic types of Zipack.

Compatibility is the root of all evil. utf8 is a serious waste of space from the perspective of information theory. Zipack's character encoding adopts the Unicode-on-VLQ encoding scheme, which is completely decoupled from utf8, and stores the Unicode serial number (natural number) of each character as a VLQ integer , Stitched together to form a Zipack string.

 

 

VLQ length prefix

The VLQ length prefix refers to the VLQ natural number. The VLQ natural number prefix implies the length of a certain data type. The so-called length is divided into 4 cases:

  • Byte stream: In pure binary type (byte stream), VLQ natural number implies the number of bytes .

  • String: In the string type (character stream), the VLQ natural number implies the number of characters .

  • List: In the list type (array), VLQ natural number implies the number of elements in the list .

  • Dictionary: In the type of dictionary storing key-value pairs, VLQ natural numbers imply the number of key-value pairs .

  • Floating point number: In the floating point number type, VLQ natural number implies the size of the exponent bit . (Not necessarily used in Zipack)

Note that in the case of a short number, the length will be stored in the first byte in the form of a fixed-length natural number. The current width is 5bit (see the Zipack tree at the bottom).

 

Zipack data types

  • Small natural numbers

  • Positive integer (positive natural number)

  • Negative integer

  • Positive non-integer

  • Negative non-integer

  • Small list

  • List

  • Small dictionary

  • dictionary

  • Short string

  • String

  • Byte stream

  • True

  • False

  • null

  • (Retention type)

Byte stream type (pure binary type)

This is easier to understand, is to store the unformatted byte stream (byteArray). Types that cannot be easily stored in byte stream text format.

 

List (array)

A list is a nested type, and its format is a seamless splicing of several elements in sequence. Zipack streams are also spliced ​​like this.

 

Dictionary (key-value pair)

A dictionary is a nested type, and its format is a seamless splicing of several key-value pairs: [key, value, key, value...].

First, lock the type of the key as a VLQ string (length prefix is ​​required), thereby eliminating the type byte.

Originally, according to the theory of "unordered dictionary", string keys should be sorted by force, and increments should be used to replace actual values. However, because we use VLQ characters uniformly, the upper limit of the Unicode number of characters is uncertain (not limited to 65535), so it is impossible to perform The strings are sorted, so our dictionary is still "ordered dictionary".

 

 

About non-integer numbers (decimals)

Regarding non-integer encoding, Zipack uses an original " precision inversion algorithm " to replace IEEE floating point numbers.

In the history of computer development, storing natural numbers in binary is the most "natural" choice. Later, in order to store negative numbers, codes such as original codes, 2's complement codes, zigzag, etc. appeared, and then 3 codes appeared to store floating-point numbers. Mainstream IEEE standards:

  • Half precision half: 16bit

  • Single precision: 32bit

  • Double precision double: 64bit

However, the above three types are all floating-point number encodings. Floating-point number encoding is only one of non-integer encodings. There are 3 types of all types:

  • Floating-point number encoding: storage [exponent, effective part]

  • Fraction coding: storage [numerator, denominator]

  • Decimal point separation: storage [integer part, decimal part]

This is the three paradigms of encoding a non-integer with two natural numbers. Since positive and negative decimals are completely symmetrical (because 0 is not included), only another sign bit is needed to imply positive and negative.

It should also be pointed out that the IEEE 3 precisions are just 3 implementations of the floating-point number encoding paradigm, and they also include special types such as integer, NaN, and Infinity. Therefore, a lot of space is wasted . Zipack, which aims at compactness, naturally cannot be copied. , We want an original non-integer format.

After comprehensive consideration, Zipack plans to use decimal point-separated encoding, that is, the decimal is expressed as the natural value of the integer part and the decimal part.

 

 

Single-byte true, false, null

These three are relatively simple and are all single-byte constants. For details, please refer to Zipack's specifications: https://gitee.com/zipack/spec

 

Short strings, short lists, small dictionaries, weak-precision floating-point numbers

The short/little here refers to the smallness of the string length, the number of list elements, the number of dictionary key-value pairs, and the exponent (base 2) of the floating point number. For these 4 cases, VLQ natural numbers are not used to specify the length, and the 4~5 bits embedded in the first byte are directly used to represent the length of 0~31.

Compared with these 4 types, byte stream types are generally large (such as pictures), and their units are the smallest (byte), so short byte streams are not considered.

 

Retention type

In addition to the dozens of existing formats, there are several vacancies on the Zipack tree, which can be used as "customizable" types, such as those commonly used in company internal communication. The most common usage is to specify one by a number Object, avoid passing the entire object.

 

Special preferential real number type: small natural number (small non-negative integer 0~127)

In all real numbers, classified according to frequency of use, there are roughly the following three "trends" (the ">" symbol below compares the frequency of use):

  • Integer> floating point

  • Number with small absolute value> number with large absolute value

  • Positive number> negative number

Combining the three real numbers on the left of ">" together, the most frequently used type is born: the smaller positive integer and 0, that is, the small natural number. Of course, the status of small natural numbers is the highest and should receive the highest treatment, so the prefix of small natural numbers must be the shortest, that is, 1 bit:0. The overall length of small natural numbers is 1 byte, and the range that can be represented is an integer between 0 and 127.

 

 

About integer encoding (positive and negative number separation), positive offset and negative offset

In the coding scheme for signed integers, there are mainly three mainstream coding schemes, namely:

  • Original code: It is simple and intuitive to express the sign of an integer through the highest bit, but it causes redundancy of "+0" and "-0".

  • 2's complement: The most popular integer encoding. It is encoded by "shifting" a negative number to a positive number, which is easy to calculate.

  • zigzag: Starting from 0, the positive and negative numbers are alternately coded. The characteristic is that the smaller the absolute value of the integer, the "shorter" the code.

But in the serialization format, without considering how to be compatible with all integers, you can treat positive integers and negative integers as different data types, and treat them in parallel with other types without any difference.

Zipack treats positive integers and negative integers as two types, and both use VLQ binary natural numbers as their absolute values, but this absolute value has to add an offset value to get the actual value.

The actual value of the VLQ positive integer is equal to the VLQ value plus 128, because we need to reserve a small natural number for special treatment, and the maximum value of the small natural number is 127.

The actual value of a negative VLQ integer is equal to -1 minus the VLQ value, because negative integers start counting from -1.

 

 

Zipack style

The Zipack format naturally supports streaming. When the Zipack unit is too large, it can be split into multiple Zipack objects, and the objects are seamlessly connected, so as to achieve "transmission while parsing". Don't need a separator to separate different JSON objects like JSON, refer to ndJSON .

 

 

Zipack's optimal binary tree

 

Zipack reference

Official website: http://zipack.github.io/ (same as Gitee)

Encyclopedia: https://baike.baidu.com/item/zipack

Warehouse: https://github.com/zipack (same as Gitee)

Introduction: https://jimmy.blog.csdn.net/article/details/107031802

Author: https: //jimmy.blog.csdn.net/

Guess you like

Origin blog.csdn.net/github_38885296/article/details/107058907