Deep Git underlying format: VLQ offset natural number

The previous article revealed the design flaws of IEEE floating-point numbers from the perspective of information theory, with the purpose of proposing a set of coding schemes that can replace IEEE floating-point numbers: precision inversion algorithm. But first we must understand the basis of the algorithm: VLQ coding.

Base127 VLQ: Variable length physical quantity

VLQ refers to variable length quantity, that is, the quantity of variable length. This quantity can be the quantity of any information. I have to say that big companies are very particular about naming, and generally like to bypass the use of the noun itself and quote more abstract meanings, such as PWA: progressive web application, progressive web application, which looks very tall, but is actually a set that can be installed locally The api of the web application.

The same is true for VLQ. Originally, VLQ was only used to encode integers. It was hoped that integers with a smaller absolute value would occupy a smaller space . Later, it was discovered that any physical quantity can be encoded by VLQ, hence the name.

This naming convention is said to have scientific basis. The so-called "industry barriers" refer to the knowledge that the insider will always unconsciously raise the learning threshold of the outsider. Although the individual is an unconscious self-interested behavior, the result of the group behavior is Increase of industry barriers. Nouns that are originally very simple will be "mysterized".

VLQ is a variable-length encoding based on the 7-bit group. It is very simple in itself. It uses the first bit of each byte to indicate whether there are subsequent bytes .

  • 0XXXXXXX

  • 1XXXXXXX  0XXXXXXX

  • 1XXXXXXX  1XXXXXXX  0XXXXXXX

  • ......

The advantage of this is that there is no need to write the length of the object in the prefix, and the "0" at the end of each byte represents a "rest". This is another important concept of information theory.

For example, the schematic diagram of converting a decimal natural number 106,903 into a VLQ byte string is as above, 106903 = 6*2^14 + 67*2^7 + 23, which is simple and clear.


Two modes of scanning termination signal: prefix VS rest

When the scanner (decoder) scans from left to right on a piece of serialized data, when scanning to a certain "sub-element/object/character/value", when it ends is a key point, usually there are two ways To hint when to stop.

  • Prefix : store the length of the child element in the prefix.

  • Rest : A "rest" at the end is used to prompt the scanner. It can be a termination character or a termination byte.

The former way of writing the length in the prefix is ​​very common in binary protocol formats, such as many IP sub-protocols and binary serialization formats; the latter way of terminating by "rests" is common in massive text formats and In the ancient text-based communication protocol, even the decoding of DNA used "terminators" to separate peptide chains.

The advantage of rests relative to prefixes is flexibility, so you don’t have to worry about the upper limit of length, such as the EOF termination character of a string: as long as the scanner does not touch EOF, it will continue scanning. Obviously, our VLQ belongs to the "rest type".

VLQ offset natural number (redundancy cancellation and cancellation)

But it is not enough. The two basic principles of coding mentioned in the previous article: "no ambiguity" and "no redundancy". If VLQ is used to represent a natural number, there will be such a situation: the number that can be represented by 1 byte (0~127) can be represented by 2 bytes (0~16383). The n-byte VLQ is compatible with n-1 byte VLQ. We reject such compatibility. If a single byte VLQ represents a natural number from 0 to 127, the double byte counts from 128 at the beginning.

The actual value of a multi-byte VLQ natural number is equal to its face value plus an offset value, which is equal to the maximum number of bytes in the previous level plus one, which is the minimum value of this level.

The reason for the offset is that in the natural state, different real number lengths share a part of the real number space. For example, a 3-byte real number contains the entire space of 2 bytes. For example, 00 00 01 and 00 01 are both 1.

Therefore, the actual value of the real number of each length must be added to the sum of all the spaces of the shorter length. For example, 00 01 represents 1, and 00 00 01 represents 257 (255+2).

The VLQ integers of different numbers of bytes and the corresponding actual values ​​have the following relationship:

Number of bytes

Integer space

min

max

1 2^7 0 -1+2^7
2 2^14 2^7 -1+2^7+2^14
3 2^21 2^7+2^14 -1+2^7+2^14+2^21
n 2 ^ 7n 2^7+2^14+...+2^7(n-1)

-1 + 2 ^ 7 + 2 ^ 14 + ... + 2 ^ 7n


Among them, each min is equal to max+1 of the previous line.

min represents the meaning of "all 0s" in several 7-bit groups in this integer space, and max represents the meaning of "all 1s" in this integer space.

The mapping relationship between the binary denomination and actual value of VLQ:

VLQ
Natural number
0000 0000

0‍

......
0111 1111
127
1000 0000  0000 0000
128
......
1111 1111  0111 1111

16511

1000 0000  1000 0000  0000 0000 16512
......

There is a one-to-one mapping (bijective) , even if a string of bytes is taken randomly, it can be parsed into a unique natural number, which not only realizes the lengthening in terms of space efficiency, but also does not waste any space. This is the basis of the "precision inversion algorithm": VLQ offset natural number, referred to as VLQ natural number. The mutual conversion function between VLQ and natural numbers also came into being.

Note that the VLQ offset natural number is not my original creation (I thought it was my original creation, but after thinking about it, I was not so smart, I can think of others). After searching, I found that Git has already implemented this algorithm. He also gave him a special name: bijective numeration means one-to-one mapping.

Code realization of bijective VLQ

  const r7 = 2 ** 7;
  const r14 = 2 ** 14;
  const r21 = 2 ** 21;
  const r28 = 2 ** 28;
  const r35 = 2 ** 35;
  const r42 = 2 ** 42;
  
  const R7 = r7;
  const R14 = r14 + r7;
  const R21 = r21 + r14 + r7;
  const R28 = r28 + r21 + r14 + r7;
  const R35 = r35 + r28 + r21 + r14 + r7;
  const R42 = r42 + r35 + r28 + r21 + r14 + r7;


  const r = [1, r7, r14, r21, r28, r35, r42];
  const R = [0, R7, R14, R21, R28, R35, R42];

Some constants are pre-calculated above to exchange space for time for use below.

The function of converting natural numbers into VLQ byte string:

function nature2vlq(number) {
    let tobeUint8Array;


    R.find((RR, index) => {
      if (number < RR) {
        const faceValue = number - R[index - 1];
        tobeUint8Array = Array.from({ length: index })
          .map((x, i) => (faceValue / r[i]) % r7 | (i ? r7 : 0))
          .reverse();
        return true;
      } else return false;
    });


    return new Uint8Array(tobeUint8Array);
}

Function to convert VLQ byte string into natural number:

function vlq2nature(vlq) {
    return (
      vlq
        .map((x) => x & (r7 - 1))
        .reduce((sum, next, i) => {
          sum += next * r[vlq.length - i - 1];
          return sum;
        }, 0) + R[vlq.length - 1]
    );
}

With VLQ offset natural numbers, variable-length integer coding is a matter of course, as long as it combines any traditional integer coding such as 2's complement or zigzag, and then maps it to VLQ as a natural number.


Picture: "Rick & Morty" Season 4

Guess you like

Origin blog.csdn.net/github_38885296/article/details/106394266