Reference documents: https://developers.google.cn/protocol-buffers/docs/encoding
This article reflects my own understanding of the official documentation. I may have misunderstood some points, and corrections are welcome. ^^
1. A Simple Message
The simplest protobuf message definition:
message Test1 {
optional int32 a = 1;
}
If a is set to 150, the encoded byte stream (in hexadecimal) is:
08 96 01
Converted to a binary representation as follows:
0 8 9 6 0 1
→ 0000 1000 1001 0110 0000 0001
Flag | Field number | Field type | Flag | Low 7 bits of value | Flag | High 7 bits of value |
---|---|---|---|---|---|---|
0 | 0001 | 000 | 1 | 0010110 | 0 | 0000001 |
Protobuf parses data in units of 8 bits (1 byte).
Flag: 0 means this byte is the last byte of the current parsing unit, so the next byte starts a new unit; 1 means the unit is not finished, and the following byte holds its higher-order bits.
Field number: the number of the field in the protobuf message definition; in the example above it converts to decimal 1.
Field type: the wire type of the field; here it converts to decimal 0.
Field type table
Type value | Type name | Used for |
---|---|---|
0 | Varint | int32, int64, uint32, uint64, sint32, sint64, bool, enum |
1 | 64-bit | fixed64, sfixed64, double |
2 | Length-delimited | string, bytes, embedded messages, packed repeated fields |
3 | Start group | groups (deprecated) |
4 | End group | groups (deprecated) |
5 | 32-bit | fixed32, sfixed32, float |
Decoding further, we drop the flag bits and swap the high and low 7-bit groups of the value:
Field Number | Field Type | High field value | Low field values |
---|---|---|---|
0001 | 000 | 0000001 | 0010110 |
1 | 0 | Concatenate the high and low groups, drop leading zeros | 10010110 |
1 | 0 | Convert to decimal | 150 |
This gives field number 1, field type 0 (varint), with the int32 value 150.
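The walkthrough above can be sketched in a few lines of Python. This is only an illustrative sketch, not the official parser; the helper name `decode_varint` is made up for this example.

```python
def decode_varint(data, pos=0):
    """Return (value, next_position) for the varint starting at data[pos]."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift   # keep the low 7 bits, drop the flag
        if (byte & 0x80) == 0:             # flag 0: last byte of this unit
            return result, pos
        shift += 7                         # later bytes hold higher-order bits


raw = bytes.fromhex("089601")
key, pos = decode_varint(raw)
field_number = key >> 3    # bits above the lowest three
field_type = key & 0x07    # lowest three bits
value, pos = decode_varint(raw, pos)
print(field_number, field_type, value)  # 1 0 150
```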
2. Base 128 Varints
Varints are an encoding for integers. An encoded value always occupies a whole number of bytes.
A 1-byte varint can represent the integers 0 to 127. This may seem odd: a byte has 8 bits and could represent 0 to 255, so why only half of that?
The reason is the flag mentioned earlier. Each byte of a varint has two parts: the most significant bit is a 1-bit flag, and the remaining 7 bits carry part of the integer. If the flag is 1, the next byte is also part of the varint; this continues until a byte whose flag is 0, which ends the varint.
Note: if a varint is longer than 1 byte, the 7-bit groups must be reordered, because earlier bytes hold the low-order bits of the integer and later bytes hold the high-order bits. In general, to compute the value, strip the flag from each byte and reverse the order of the 7-bit groups.
For example, take the value 300. Its varint encoding in binary is:
1010 1100 0000 0010
→ 010 1100 000 0010
→ 000 0010 ++ 010 1100 (high and low groups swapped)
→ 100101100
→ 256 + 32 + 8 + 4 = 300
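The reverse direction, producing the bytes for 300, might look like the following. It is a minimal sketch that handles non-negative integers only; `encode_varint` is a name chosen for this example.

```python
def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        group = value & 0x7F           # lowest 7 bits go out first
        value >>= 7
        if value:
            out.append(group | 0x80)   # more groups follow: set the flag
        else:
            out.append(group)          # last group: flag stays 0
            return bytes(out)


print(encode_varint(300).hex())  # ac02
print(encode_varint(127).hex())  # 7f
print(encode_varint(128).hex())  # 8001
```

The jump from one byte to two between 127 and 128 is the 7-bit limit discussed above.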
3. More Value Types
3.1 Signed Integers
As mentioned in the previous section, field type 0 means varint encoding. sint32 and sint64 differ from int32 and int64 in how they represent negative numbers, and they save space when doing so.
Take int32 as an example. A negative int32 is sign-extended to 64 bits before it is written as a varint, so it always takes 10 bytes. With sint32, which uses ZigZag encoding, it can take far fewer. This is not guaranteed, though: a value with a large magnitude still needs up to 5 bytes.
Original value | Encoded value |
---|---|
0 | 0 |
-1 | 1 |
1 | 2 |
-2 | 3 |
2147483647 | 4294967294 |
-2147483648 | 4294967295 |
As the table shows, values of small magnitude, whether positive or negative, encode to small unsigned integers whose high-order bits are all 0. Those leading zeros can be dropped for transmission and restored up to 32 bits when decoding, which reduces the amount of data sent.
ZigZag encoding and decoding formulas
The formulas work because:
in the original two's complement form, a most significant bit of 0 means a non-negative number and 1 means a negative number;
in the encoded form, a least significant bit of 0 means a non-negative number and 1 means a negative number.
With this property we can tell whether the current value is positive or negative and apply the matching formula.
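n is the value being encoded or decoded.
Encoding a non-negative number: n << 1
Decoding a non-negative number: n >>> 1
Encoding a negative number: (n << 1) ^ ~(n & 0)
Decoding a negative number: (n >>> 1) ^ ~(n & 0)
Here ~(n & 0) is simply all 1 bits, so the negative case inverts every bit of the shifted value. For sint32, the two cases can be combined into a single formula per direction: (n << 1) ^ (n >> 31) for encoding, with an arithmetic right shift, and (n >>> 1) ^ -(n & 1) for decoding.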
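To check the formulas against the table, here is a small Python sketch of the 32-bit case. The function names are invented for this example. Python integers have unlimited width, so the encoder masks the result to 32 bits to mimic a fixed-width type.

```python
def zigzag_encode32(n):
    """Map a signed 32-bit integer to an unsigned 32-bit integer."""
    # n >> 31 is 0 for non-negative n and -1 (all 1 bits) for negative n.
    return ((n << 1) ^ (n >> 31)) & 0xFFFFFFFF


def zigzag_decode32(z):
    """Inverse of zigzag_encode32."""
    # -(z & 1) is 0 when the lowest bit is 0 and -1 when it is 1.
    return (z >> 1) ^ -(z & 1)


for original in (0, -1, 1, -2, 2147483647, -2147483648):
    encoded = zigzag_encode32(original)
    print(original, "->", encoded, "->", zigzag_decode32(encoded))
```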
For example, here is how 2 and -2 are encoded according to the formulas.
Without ZigZag:
2
→ 0000 0000 0000 0000 0000 0000 0000 0010
→ 10 (compressed value, leading zeros dropped)
→ 0000 0010 (bytes sent by protobuf; note that the most significant bit of each byte is the flag, not part of the value)
Sending 2 takes 1 byte.
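Without ZigZag:
-2
→ 1111 1111 1111 1111 1111 1111 1111 1110 (the two's complement of 2, which is how negative numbers are represented in binary)
→ sign-extended to 64 bits, all of them 1 except the lowest (compressed value; nothing can be dropped because the high-order bits are all 1)
→ FE FF FF FF FF FF FF FF FF 01 (bytes sent by protobuf, in hexadecimal; the most significant bit of each byte is the flag, not part of the value, and because the value spans more than 1 byte the 7-bit groups are written low-order first)
Sending -2 takes 10 bytes.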
With ZigZag:
2
→ 0000 0000 0000 0000 0000 0000 0000 0010
→ 0000 0000 0000 0000 0000 0000 0000 0100 (encoded)
→ 100 (compressed value)
→ 0000 0100 (bytes sent by protobuf; note that the most significant bit of each byte is the flag, not part of the value)
Sending 2 takes 1 byte.
With ZigZag:
-2
→ 1111 1111 1111 1111 1111 1111 1111 1110 (the two's complement of 2, which is how negative numbers are represented in binary)
→ 0000 0000 0000 0000 0000 0000 0000 0011 (encoded)
→ 11 (compressed value)
→ 0000 0011 (bytes sent by protobuf; note that the most significant bit of each byte is the flag, not part of the value)
Sending -2 takes 1 byte.
Comparing the two cases shows that ZigZag encoding saves space when sending negative numbers, which makes transmission more efficient. For values of large magnitude, positive or negative, there is little room left for compression.
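The size difference can be measured directly. The sketch below assumes, as the official encoding guide states, that a negative int32 is sign-extended to 64 bits before being written as a varint. All function names are invented for this example.

```python
def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        group = value & 0x7F
        value >>= 7
        if value:
            out.append(group | 0x80)
        else:
            out.append(group)
            return bytes(out)


def encode_int32(n):
    # Plain int32: reinterpret the sign-extended value as unsigned 64-bit.
    return encode_varint(n & 0xFFFFFFFFFFFFFFFF)


def encode_sint32(n):
    # sint32: ZigZag first, then varint.
    return encode_varint(((n << 1) ^ (n >> 31)) & 0xFFFFFFFF)


for n in (2, -2, -2147483648):
    print(n, len(encode_int32(n)), len(encode_sint32(n)))
```

For -2 this reports 10 bytes as int32 and 1 byte as sint32. For -2147483648 it reports 10 bytes and 5 bytes.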
3.2 Non-varint Numbers
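fixed64, sfixed64, double, fixed32, sfixed32, and float are non-varint numbers.
As the names suggest, fixed64, sfixed64, and double are 64 bits (8 bytes), while fixed32, sfixed32, and float are 32 bits (4 bytes).
A field takes more bytes than that on the wire, because the key (field number and field type) precedes the value. The value itself carries no flag bits.
I could not find a detailed description of this part. As far as I can tell, the value is not encoded with varint rules: it is written as a fixed-size block in little-endian byte order, least significant byte first, which resembles how a varint puts its low-order group first.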
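A quick way to see the fixed-size layout is Python's standard `struct` module. This is a sketch under assumptions: the field number 1 is hypothetical, and the key is written as a single byte, which only holds for field numbers below 16.

```python
import struct


def encode_fixed32_field(field_number, value):
    """Key byte (field type 5) followed by 4 little-endian bytes."""
    key = (field_number << 3) | 5
    return bytes([key]) + struct.pack("<I", value)


def encode_double_field(field_number, value):
    """Key byte (field type 1) followed by 8 little-endian bytes."""
    key = (field_number << 3) | 1
    return bytes([key]) + struct.pack("<d", value)


print(encode_fixed32_field(1, 1).hex())   # 0d01000000
print(encode_double_field(1, 1.0).hex())  # 09000000000000f03f
```

The value 1 still occupies all 4 bytes as fixed32, whereas it would take a single byte as a varint.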
Strings
Strings use the length-delimited format: the encoding contains a varint that gives the length of the payload in bytes, followed by the payload itself.
message Test2 {
optional string b = 2;
}
Set b = "testing".
The encoding is:
Field number & Field Type | Field Length | Field Values |
---|---|---|
12 | 07 | 74 65 73 74 69 6e 67 |
0x12 → field number = 2, type = 2
0x07 → 7 bytes
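The same bytes can be reproduced with a short sketch. The helper names are made up for this example, and the text is assumed to be UTF-8 encoded.

```python
def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        group = value & 0x7F
        value >>= 7
        if value:
            out.append(group | 0x80)
        else:
            out.append(group)
            return bytes(out)


def encode_string_field(field_number, text):
    payload = text.encode("utf-8")
    key = (field_number << 3) | 2          # field type 2: length-delimited
    return encode_varint(key) + encode_varint(len(payload)) + payload


print(encode_string_field(2, "testing").hex())  # 120774657374696e67
```

The length counts bytes, not characters, so non-ASCII text yields a length larger than its character count.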
Embedded Messages
Consider the nested message structure below: field c of Test3 has type Test1, and a is again set to 150.
message Test1 {
optional int32 a = 1;
}
message Test3 {
optional Test1 c = 3;
}
Below is the actual encoding; let's parse it.
As mentioned earlier, Test1 with a = 150 encodes as 08 96 01, and the last three bytes below are exactly that.
1a 03 08 96 01
The remaining 1a 03 is the header of Test3.c. Parsing it:
1a 03
→ 0001 1010 0000 0011
→ 0 (flag) 0011 (field number) 010 (field type) 0 (flag) 0000011 (field length)
→ 3 (field number) 2 (field type) 3 (field length)
Checking the field type table confirms that nested messages use the length-delimited type. The decoder reads the number of bytes given by the length field and then decodes them according to the sub-message's definition.
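The two-step parse, outer message first and then the sub-message, can be sketched as follows. It is a simplified parser written for this example: it handles only field types 0 and 2, and `parse_fields` is an invented name.

```python
def decode_varint(data, pos=0):
    """Return (value, next_position) for the varint starting at data[pos]."""
    result, shift = 0, 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if (byte & 0x80) == 0:
            return result, pos
        shift += 7


def parse_fields(data):
    """Return a list of (field_number, field_type, value) tuples."""
    fields, pos = [], 0
    while pos < len(data):
        key, pos = decode_varint(data, pos)
        field_number, field_type = key >> 3, key & 0x07
        if field_type == 0:                     # varint
            value, pos = decode_varint(data, pos)
        elif field_type == 2:                   # length-delimited
            length, pos = decode_varint(data, pos)
            value = data[pos:pos + length]      # raw payload bytes
            pos += length
        else:
            raise NotImplementedError("field type %d" % field_type)
        fields.append((field_number, field_type, value))
    return fields


outer = parse_fields(bytes.fromhex("1a03089601"))
inner = parse_fields(outer[0][2])   # decode the payload as Test1
print(outer)  # [(3, 2, b'\x08\x96\x01')]
print(inner)  # [(1, 0, 150)]
```

The parser cannot tell from the bytes alone that the payload of field 3 is a message rather than a string. That knowledge comes from the .proto definition.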
3.3 Packed Repeated Fields
In proto2, repeated fields default to [packed = false], that is, they are not packed.
In proto3, repeated fields of numeric value types (field types 0, 1, and 5) default to [packed = true]. In the encoded gRPC message the layout is (field number + field type, payload length, element 1, element 2, ...), as shown below:
message Test4 {
repeated int32 d = 4 [packed=true];
}
22 // key (field number 4, wire type 2)
06 // payload size (6 bytes)
03 // first element (varint 3)
8E 02 // second element (varint 270)
9E A7 05 // third element (varint 86942)
The remaining, length-delimited types cannot be packed this way. Their layout in the gRPC message is (field number + field type, length 1, value 1, field number + field type, length 2, value 2, ...).
As this layout shows, the field number and field type are repeated for every element, and each repetition takes up bytes.
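To make the overhead concrete, the sketch below encodes the same three values of field 4 both ways. The function names are invented for this example and only non-negative values are handled.

```python
def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        group = value & 0x7F
        value >>= 7
        if value:
            out.append(group | 0x80)
        else:
            out.append(group)
            return bytes(out)


def encode_packed(field_number, values):
    # One key (field type 2), one length, then the elements back to back.
    payload = b"".join(encode_varint(v) for v in values)
    key = encode_varint((field_number << 3) | 2)
    return key + encode_varint(len(payload)) + payload


def encode_unpacked(field_number, values):
    # The key (field type 0) is repeated in front of every element.
    key = encode_varint((field_number << 3) | 0)
    return b"".join(key + encode_varint(v) for v in values)


values = [3, 270, 86942]
print(encode_packed(4, values).hex())    # 2206038e029ea705
print(encode_unpacked(4, values).hex())  # 2003208e02209ea705
```

With three elements the packed form is only one byte shorter, but the saving grows by one key per additional element.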
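Note: the official documentation places a requirement on protobuf decoders here. Suppose the gRPC message you send encodes list data with packed = false, that is, unpacked, while the receiver's protobuf definition is configured with packed = true. The decoder must handle this case and parse the message correctly.
Some additional notes on how gRPC messages are encoded with packed = false: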
- The array elements need not be contiguous; they may be interleaved with other fields of the message.
- The order of the array elements is preserved: after decoding, the elements appear in the same order as before encoding.
- Each array element is sent over the network as its own key-value pair.
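A decoder that meets the compatibility requirement above has to accept both layouts for the same field. The following is a minimal sketch of that idea, not the official implementation. It only understands field types 0 and 2, treats values as unsigned, and the name `decode_repeated_varints` is invented for this example.

```python
def decode_varint(data, pos=0):
    """Return (value, next_position) for the varint starting at data[pos]."""
    result, shift = 0, 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if (byte & 0x80) == 0:
            return result, pos
        shift += 7


def decode_repeated_varints(data, field_number):
    """Collect the elements of a repeated varint field, packed or not."""
    values, pos = [], 0
    while pos < len(data):
        key, pos = decode_varint(data, pos)
        number, field_type = key >> 3, key & 0x07
        if field_type == 0:                      # unpacked element or other varint field
            value, pos = decode_varint(data, pos)
            if number == field_number:
                values.append(value)
        elif field_type == 2:                    # packed block or other length-delimited field
            length, pos = decode_varint(data, pos)
            end = pos + length
            if number == field_number:
                while pos < end:
                    value, pos = decode_varint(data, pos)
                    values.append(value)
            pos = end                            # skip payloads of other fields
        else:
            raise NotImplementedError("field type %d" % field_type)
    return values


packed = bytes.fromhex("2206038e029ea705")
# Unpacked elements of field 4 with an unrelated field 1 (08 01) in between.
unpacked = bytes.fromhex("20030801208e02209ea705")
print(decode_repeated_varints(packed, 4))    # [3, 270, 86942]
print(decode_repeated_varints(unpacked, 4))  # [3, 270, 86942]
```

The second input also illustrates the first two notes above: the elements are interleaved with another field, yet they come out in their original order.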