Protobuf: data encoding rules

Reference documents: https://developers.google.cn/protocol-buffers/docs/encoding

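This article reflects my own understanding of the official documentation. I may have misunderstood some points, so corrections are welcome. ^^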

1. A Simple Message

The simplest protobuf message definition:

message Test1 {
  optional int32 a = 1;
}

If a is set to 150, the encoded byte stream (in hexadecimal) is:

08 96 01

Converted to binary:

      0    8    9    6    0    1
→  0000 1000 1001 0110 0000 0001

Flag   Field number   Field type   Flag   Low bits of value   Flag   High bits of value
0      0001           000          1      0010110             0      0000001

protobuf parses the stream in units of 8 bits (1 byte).

Flag: the most significant bit of each byte. 0 means this byte ends the current parsing unit, so the next byte starts a new unit. 1 means the parsing unit is not complete, and the next byte carries the higher-order bits of the same unit.

Field number: the field number from the protobuf message definition. In the example above it is 1 in decimal.

Field type: the field type from the table below. In the example above it is 0 in decimal.

Field type table

Type value   Type name          Used for
0            Varint             int32, int64, uint32, uint64, sint32, sint64, bool, enum
1            64-bit             fixed64, sfixed64, double
2            Length-delimited   string, bytes, embedded messages, packed repeated fields
3            Start group        groups (deprecated)
4            End group          groups (deprecated)
5            32-bit             fixed32, sfixed32, float

Parsing further, we drop the flag bits and swap the high and low 7-bit groups of the value:

Field number   Field type   High bits of value   Low bits of value
0001           000          0000001              0010110
1              0            high and low groups joined, leading zeros dropped: 10010110
1              0            in decimal: 150

This gives field number 1, field type 0 (declared as int32), value 150.
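The parsing steps above can be sketched in Python. This is only an illustration, not code from the protobuf library: the helper names read_varint and read_field are made up for this article, and the sketch only handles fields of type 0. It relies on the key byte itself being a varint whose low 3 bits are the field type and whose remaining bits are the field number.

```python
def read_varint(data, pos=0):
    """Read one varint starting at pos; return (value, next position)."""
    value = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift  # the low 7 bits carry the value
        shift += 7
        if byte & 0x80 == 0:             # flag 0: last byte of this unit
            return value, pos


def read_field(data, pos=0):
    """Read one varint-typed field; return (number, type, value, next pos)."""
    key, pos = read_varint(data, pos)
    field_number = key >> 3
    field_type = key & 0x07
    if field_type != 0:
        raise ValueError("this sketch only handles field type 0 (varint)")
    value, pos = read_varint(data, pos)
    return field_number, field_type, value, pos


print(read_field(bytes.fromhex("089601")))  # prints (1, 0, 150, 3)
```

The last element of the result is the position of the next unread byte, which is where a real parser would continue with the next field.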

2. Base 128 Varints

Varints are the encoding used for integer types. An encoded value always occupies a whole number of bytes.

A 1-byte varint can represent the integers 0 to 127. This may look odd: 1 byte is 8 bits and could represent 0 to 255, so why only half of that range?

This is where the flag mentioned earlier comes in. Each byte of a varint has two parts: the most significant bit is a flag, and the remaining 7 bits carry the integer. If the flag bit is 1, the next byte is also part of the varint. This continues until a byte whose flag bit is 0, which ends the varint.

Note: if a varint is longer than 1 byte, the 7-bit groups must be reordered, because the earlier bytes hold the low-order bits of the integer and the later bytes hold the high-order bits. In short, to compute the value, strip the flag bits and reverse the order of the 7-bit groups.

For example, the value 300 is encoded as the two bytes AC 02. Decoding them in binary:

   1010 1100 0000 0010
→   010 1100  000 0010    (drop the flag bit of each byte)
→   000 0010 ++ 010 1100  (swap the high and low groups)
→   100101100
→   256 + 32 + 8 + 4 = 300
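The same rule, written as a minimal Python sketch. The function names encode_varint and decode_varint are chosen for this article and are not part of any protobuf API. The sketch only covers non-negative integers.

```python
def encode_varint(value):
    """Encode a non-negative integer as a base 128 varint."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        group = value & 0x7F          # take the lowest 7 bits
        value >>= 7
        if value:
            out.append(group | 0x80)  # more groups follow: flag = 1
        else:
            out.append(group)         # last group: flag = 0
            return bytes(out)


def decode_varint(data):
    """Decode the varint at the start of data."""
    value = 0
    for index, byte in enumerate(data):
        value |= (byte & 0x7F) << (7 * index)  # earlier bytes are lower bits
        if byte & 0x80 == 0:
            break
    return value


print(encode_varint(300).hex())                          # prints ac02
print(decode_varint(bytes.fromhex("ac02")))              # prints 300
print(len(encode_varint(127)), len(encode_varint(128)))  # prints 1 2
```

The last line shows the 0 to 127 limit of a 1-byte varint: 128 already needs a second byte.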

3. More Value Types

3.1 Signed integers

The previous section mentioned that field type 0 means varint encoding. Compared with int32 and int64, however, sint32 and sint64 encode negative numbers in far less space.
Take int32 as an example: a negative value is sign-extended to 64 bits, so it always takes 10 bytes as a varint. sint32 uses ZigZag encoding and usually needs far fewer bytes. This is not absolute: a value of large magnitude still takes up to 5 bytes.

Original value   Encoded value
0                0
-1               1
1                2
-2               3
2147483647       4294967294
-2147483648      4294967295

As the table shows, for values of small magnitude, whether positive or negative, the high-order bits of the encoded 32-bit integer are all 0. These leading zeros can be dropped for transmission and padded back to 32 bits when decoding, which reduces the amount of data sent.

ZigZag encoding and decoding formulas

The formulas below work because:
in the original two's-complement representation, the most significant bit is 0 for a positive number and 1 for a negative number;
after encoding, the least significant bit is 0 for a positive number and 1 for a negative number.

With this property we can tell whether a value is positive or negative and apply the matching encoding or decoding rule.

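n is the value
Encoding a positive number: n<<1
Decoding a positive number: n>>>1

Encoding a negative number: (n<<1)^~(n&0)
Decoding a negative number: (n>>>1)^~(n&0)

Since n&0 is always 0, ~(n&0) is all ones, so the XOR simply inverts every bit of the shifted value.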
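As a cross-check, here is a small Python sketch of ZigZag for 32-bit values. It uses the branch-free form (n << 1) ^ (n >> 31) that the official encoding documentation gives for sint32, rather than the two separate cases above. The function names and the MASK32 constant are made up for this article. Python integers are unbounded, so the sketch masks the result to 32 bits by hand.

```python
MASK32 = 0xFFFFFFFF


def zigzag_encode32(n):
    """Map a signed 32-bit integer to an unsigned 32-bit integer."""
    # n >> 31 is 0 for non-negative n and -1 (all ones) for negative n,
    # so the XOR leaves positives alone and inverts the bits of negatives.
    return ((n << 1) ^ (n >> 31)) & MASK32


def zigzag_decode32(z):
    """Inverse of zigzag_encode32."""
    # The lowest bit tells the sign: 0 means positive, 1 means negative.
    return (z >> 1) ^ -(z & 1)


for n in (0, -1, 1, -2, 2147483647, -2147483648):
    print(n, zigzag_encode32(n))
```

The loop prints the same pairs as the table of original and encoded values above.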

For example, here is how 2 and -2 are encoded according to these formulas.

Transmitting 2 without ZigZag

2
→ 0000 0000 0000 0000 0000 0000 0000 0010
→ 10 (compressed value)
→ 0000 0010 (what protobuf transmits; note that the top bit of each 8-bit byte is the flag, not part of the value)

Transmitting 2 takes 1 byte.

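Transmitting -2 without ZigZag

-2
→ 1111 1111 1111 1111 1111 1111 1111 1110 (-2 is the two's complement of 2, which is how negative numbers are represented in binary)
→ sign-extended to 64 bits, every high bit is 1 (compressed value: nothing can be dropped)
→ 1111 1110 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 0000 0001 (what protobuf transmits; the top bit of each 8-bit byte is the flag, not part of the value, and because the value spans more than 1 byte the 7-bit groups are sent lowest group first)

Transmitting -2 takes 10 bytes.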

Transmitting 2 with ZigZag

2
→ 0000 0000 0000 0000 0000 0000 0000 0010
→ 0000 0000 0000 0000 0000 0000 0000 0100 (ZigZag encoded)
→ 100 (compressed value)
→ 0000 0100 (what protobuf transmits; note that the top bit of each 8-bit byte is the flag, not part of the value)

Transmitting 2 takes 1 byte.

Transmitting -2 with ZigZag

-2
→ 1111 1111 1111 1111 1111 1111 1111 1110 (-2 is the two's complement of 2, which is how negative numbers are represented in binary)
→ 0000 0000 0000 0000 0000 0000 0000 0011 (ZigZag encoded)
→ 11 (compressed value)
→ 0000 0011 (what protobuf transmits; note that the top bit of each 8-bit byte is the flag, not part of the value)

Transmitting -2 takes 1 byte.

Comparing the two cases shows that ZigZag encoding saves space when transmitting negative numbers, so transmission is more efficient. For values of large magnitude, positive or negative, there is little room left for compression.
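The comparison can be reproduced with a short Python sketch that counts the bytes of the value only, without the key byte. The helper names are made up for this article. The int32 helper assumes the behaviour described in the official encoding documentation, where a negative int32 is sign-extended to 64 bits before varint encoding.

```python
def encode_varint(value):
    """Encode a non-negative integer as a base 128 varint."""
    out = bytearray()
    while True:
        group = value & 0x7F
        value >>= 7
        if value:
            out.append(group | 0x80)
        else:
            out.append(group)
            return bytes(out)


def encode_int32(n):
    """int32 without ZigZag: negatives are sign-extended to 64 bits."""
    return encode_varint(n & 0xFFFFFFFFFFFFFFFF)


def encode_sint32(n):
    """sint32: ZigZag first, then varint."""
    return encode_varint(((n << 1) ^ (n >> 31)) & 0xFFFFFFFF)


for n in (2, -2, 2147483647, -2147483648):
    print(n, len(encode_int32(n)), len(encode_sint32(n)))
```

For 2 both encodings take 1 byte. For -2 the plain varint takes 10 bytes and the ZigZag form takes 1. For the two extreme values the ZigZag form needs 5 bytes, which is the case where little compression is possible.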

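3.2 Non-varint Numbers

fixed64, sfixed64, double, fixed32, sfixed32 and float are non-varint numbers.
As the names suggest, fixed64, sfixed64 and double are 64 bits (8 bytes), and fixed32, sfixed32 and float are 32 bits (4 bytes).
On the wire a field takes slightly more than that, because the value is preceded by its key (field number and field type).

The official documentation says little about this part: field types 1 and 5 tell the parser to expect a fixed 64-bit or 32-bit chunk of data, stored in little-endian byte order (lowest byte first). Unlike varints, these values carry no flag bits.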
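To make the fixed-width case concrete, here is a hedged Python sketch based on the official encoding documentation: the value is written as a plain little-endian chunk of 4 or 8 bytes after the key, with no flag bits. The function names are made up for this article, and the sketch assumes field numbers 1 to 15 so that the key fits in a single byte.

```python
import struct


def make_key(field_number, field_type):
    """Build a one-byte key; only valid for field numbers 1 to 15."""
    if not 1 <= field_number <= 15:
        raise ValueError("this sketch only handles field numbers 1 to 15")
    return bytes([(field_number << 3) | field_type])


def encode_fixed32_field(field_number, value):
    # Field type 5: 32-bit, little-endian unsigned integer.
    return make_key(field_number, 5) + struct.pack("<I", value)


def encode_double_field(field_number, value):
    # Field type 1: 64-bit, little-endian IEEE 754 double.
    return make_key(field_number, 1) + struct.pack("<d", value)


print(encode_fixed32_field(1, 1).hex(" "))   # prints 0d 01 00 00 00
print(encode_double_field(1, 1.0).hex(" "))  # prints 09 00 00 00 00 00 00 f0 3f
```

The value part is always exactly 4 or 8 bytes, however small the number is, which is why varints are usually more compact for small integers.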

Strings

Strings use the length-delimited encoding: the key is followed by a varint giving the length in bytes, and then by the data itself.

message Test2 {
  optional string b = 2;
}

b="testing"

The encoding is:

Field number & field type   Field length   Field value
12                          07             74 65 73 74 69 6e 67

0x12 → field number = 2, type = 2

0x07 → 7 bytes
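The Test2 example can be reproduced with a few lines of Python. This is a sketch with a made-up function name, not the protobuf library. It assumes a field number of 1 to 15 and a payload of at most 127 bytes, so that the key and the length each fit in one byte.

```python
def encode_string_field(field_number, text):
    """Encode a string field as key, length, UTF-8 bytes."""
    payload = text.encode("utf-8")
    if not 1 <= field_number <= 15 or len(payload) > 127:
        raise ValueError("this sketch only handles one-byte keys and lengths")
    key = (field_number << 3) | 2   # field type 2: length-delimited
    return bytes([key, len(payload)]) + payload


print(encode_string_field(2, "testing").hex(" "))
# prints 12 07 74 65 73 74 69 6e 67
```

The length counts bytes, not characters, so a string with non-ASCII characters has a length larger than its character count.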

Embedded Messages

Consider the following nested message structure, where field c of Test3 has type Test1 and Test1.a is again set to 150.

message Test1 {
  optional int32 a = 1;
}
message Test3 {
  optional Test1 c = 3;
}

Below is the actual encoding. Let us parse it.
As mentioned earlier, Test1.a = 150 is encoded as 08 96 01, and the second half of the bytes below is exactly that.

 1a 03 08 96 01

The remaining 1a 03 belongs to Test3.c. Parsing it:

1a 03
→ 0001 1010 0000 0011
→ 0 (flag) 0011 (field number) 010 (field type) 0 (flag) 0000011 (field length)
→ 3 (field number) 2 (field type) 3 (field length)

Checking against the field type table confirms that embedded messages use the length-delimited type. The decoder reads as many bytes as the length says, and then decodes them according to the definition of the sub-message.
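The two-step decoding described above can be sketched as a tiny Python parser. The names read_varint and parse_fields are made up for this article. The sketch only understands field types 0 and 2 and returns length-delimited values as raw bytes, because without the message definition it cannot know whether they hold a string or a sub-message.

```python
def read_varint(data, pos):
    """Read one varint starting at pos; return (value, next position)."""
    value = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        shift += 7
        if byte & 0x80 == 0:
            return value, pos


def parse_fields(data):
    """Return a list of (field number, field type, value)."""
    fields = []
    pos = 0
    while pos < len(data):
        key, pos = read_varint(data, pos)
        number, field_type = key >> 3, key & 0x07
        if field_type == 0:        # varint
            value, pos = read_varint(data, pos)
        elif field_type == 2:      # length-delimited: take `length` bytes
            length, pos = read_varint(data, pos)
            value = data[pos:pos + length]
            pos += length
        else:
            raise ValueError("this sketch only handles field types 0 and 2")
        fields.append((number, field_type, value))
    return fields


outer = parse_fields(bytes.fromhex("1a03089601"))
print(outer)                       # prints [(3, 2, b'\x08\x96\x01')]
print(parse_fields(outer[0][2]))   # prints [(1, 0, 150)]
```

The second call plays the role of the sub-message decoder: knowing from the definition that field 3 of Test3 is a Test1, it parses the 3 payload bytes as a message of their own.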

3.3 Packed Repeated Fields

In proto2, repeated fields default to [packed = false], that is, no packing.
In proto3, repeated fields of numeric value types (those with field type 0, 1 or 5) default to [packed = true]. In a gRPC message a packed field is laid out as (field number + field type, payload length in bytes, element 1, element 2, ...), as shown below:

message Test4 {
  repeated int32 d = 4 [packed=true];
}
22        // key (field number 4, wire type 2)
06        // payload size (6 bytes)
03        // first element (varint 3)
8E 02     // second element (varint 270)
9E A7 05  // third element (varint 86942)

The remaining length-delimited types (string, bytes, embedded messages) cannot be packed. In a gRPC message they are laid out as (field number + field type, length 1, value 1, field number + field type, length 2, value 2, ...).
As this layout shows, the field number and field type are repeated for every element and take up extra bytes.
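To see the cost of the repeated key, the Test4 values can be encoded both ways. This is a sketch with made-up function names. The unpacked form here is the packed = false layout for the same int32 field, where every element carries its own key.

```python
def encode_varint(value):
    """Encode a non-negative integer as a base 128 varint."""
    out = bytearray()
    while True:
        group = value & 0x7F
        value >>= 7
        if value:
            out.append(group | 0x80)
        else:
            out.append(group)
            return bytes(out)


def encode_packed(field_number, values):
    """One key (field type 2), one length, then all elements back to back."""
    payload = b"".join(encode_varint(v) for v in values)
    key = encode_varint((field_number << 3) | 2)
    return key + encode_varint(len(payload)) + payload


def encode_unpacked(field_number, values):
    """One key (field type 0) in front of every element."""
    key = encode_varint((field_number << 3) | 0)
    return b"".join(key + encode_varint(v) for v in values)


values = [3, 270, 86942]
print(encode_packed(4, values).hex(" "))    # prints 22 06 03 8e 02 9e a7 05
print(encode_unpacked(4, values).hex(" "))  # prints 20 03 20 8e 02 20 9e a7 05
```

With three elements the packed form is 8 bytes and the unpacked form is 9. The gap grows by roughly one key per additional element.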

Note: the official documentation places a requirement on protobuf decoders here. For example, the gRPC message you send may encode a repeated field with packed=false (no packing), while the receiver's protobuf definition declares the field with packed=true. The decoder must handle this case and still parse the message correctly.

Some additional notes on how a packed = false field is encoded in a gRPC message:

  1. The array elements need not be contiguous. They may be interleaved with other fields of the message.
  2. The order of the array elements is preserved: after decoding, the elements appear in the same order as before encoding.
  3. Each array element is transmitted as its own key-value pair.
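A decoder that meets the compatibility requirement could look roughly like the Python sketch below. The function names are made up for this article, and the sketch only handles field types 0 and 2 for a repeated varint field. It accepts both layouts, tolerates other fields in between, and keeps the elements in wire order.

```python
def read_varint(data, pos):
    """Read one varint starting at pos; return (value, next position)."""
    value = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        shift += 7
        if byte & 0x80 == 0:
            return value, pos


def decode_repeated_varints(data, target):
    """Collect every element of repeated varint field `target`."""
    values = []
    pos = 0
    while pos < len(data):
        key, pos = read_varint(data, pos)
        number, field_type = key >> 3, key & 0x07
        if field_type == 0:                  # unpacked element or other varint
            value, pos = read_varint(data, pos)
            if number == target:
                values.append(value)
        elif field_type == 2:                # packed block or other payload
            length, pos = read_varint(data, pos)
            end = pos + length
            if number == target:
                while pos < end:
                    value, pos = read_varint(data, pos)
                    values.append(value)
            pos = end                        # skip payloads of other fields
        else:
            raise ValueError("this sketch only handles field types 0 and 2")
    return values


packed = bytes.fromhex("22 06 03 8e 02 9e a7 05")
# Unpacked, with an unrelated field 1 (value 150) between the elements.
unpacked = bytes.fromhex("20 03 08 96 01 20 8e 02 20 9e a7 05")
print(decode_repeated_varints(packed, 4))    # prints [3, 270, 86942]
print(decode_repeated_varints(unpacked, 4))  # prints [3, 270, 86942]
```

Both inputs decode to the same list, which is the behaviour the note above asks of a decoder regardless of how the receiver's definition is declared.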

Origin www.cnblogs.com/lowezheng/p/11778052.html