A Walkthrough of the Protobuf Wire Protocol

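Protobuf is best known for its compact encoding. It is mainly used for data exchange and data storage, in scenarios where smaller messages pay off: saving bandwidth or disk space, or coping with poor mobile network conditions. Serialization speed depends mostly on the SDK you use, so this article does not cover it. Everything here is described using proto3 syntax.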

Protocol overview

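The main differences between proto3 and proto2 are:

proto3 gives every scalar field an implicit default value, and a field that holds its default value is not serialized. That makes it hard to tell whether a field was explicitly set, which is why proto3 later added support for optional.
proto3 has no required keyword, which is a problem if your business logic depends on one.
proto3 does not allow custom default values and does not support groups either. For these reasons I still recommend proto2 if you are doing business development.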
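To make the presence problem concrete, here is a minimal sketch in plain Go. It does not use the real protobuf runtime: encodeImplicit and encodeOptional are hypothetical helpers that only mimic how an int64 field behaves without and with optional.

```go
package main

import "fmt"

// appendVarint appends v in base-128 varint form.
func appendVarint(b []byte, v uint64) []byte {
	for v >= 0x80 {
		b = append(b, byte(v)|0x80)
		v >>= 7
	}
	return append(b, byte(v))
}

// encodeImplicit mimics a plain proto3 int64 field: the zero value is never written.
func encodeImplicit(fieldNumber uint64, v int64) []byte {
	if v == 0 {
		return nil
	}
	return appendVarint(appendVarint(nil, fieldNumber<<3), uint64(v))
}

// encodeOptional mimics an optional field: nil means unset, an explicit 0 is written.
func encodeOptional(fieldNumber uint64, v *int64) []byte {
	if v == nil {
		return nil
	}
	return appendVarint(appendVarint(nil, fieldNumber<<3), uint64(*v))
}

func main() {
	zero := int64(0)
	fmt.Printf("implicit, 0:     [% x]\n", encodeImplicit(2, 0))      // []
	fmt.Printf("optional, unset: [% x]\n", encodeOptional(2, nil))    // []
	fmt.Printf("optional, 0:     [% x]\n", encodeOptional(2, &zero)) // [10 00]
}
```

With the implicit variant a receiver sees no bytes for the field whether it was set to 0 or never set, while the optional variant keeps the two cases apart.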

The basic syntax looks like this; see the official documentation for details: developers.google.com/protocol-bu…

syntax = "proto3";

message TestData {
  enum EnumType {
    UnknownType = 0; // the first enum value must be 0 in proto3
    Test1Type = 1;
    Test2Type = 2;
  }
  message TestObj {
    int64 t_int64 = 1;
  }
  string t_string = 1;
  int64 t_int64 = 2;
  bool t_bool = 3;
  fixed64 t_fix64 = 4;
  repeated int64 t_list_i64 = 5;
  map<int64, string> t_map = 6;
  EnumType t_enum = 7;
  TestObj t_obj = 8;
  repeated TestObj t_list_obj = 9;
  map<string, TestData> t_map_obj = 10;
}

oneof is left out for now.

Compiling:

/Users/fanhaodong/software/protoc-3.17.3-osx-x86_64/bin/protoc \
--experimental_allow_proto3_optional \
--proto_path=/Users/fanhaodong/go/src/code.xxxxxxxxx/fanhaodong.516/tool/pb_idl/test \
--plugin=protoc-gen-go=/Users/fanhaodong/go/bin/protoc-gen-go \
--go_opt=Mtest.proto=github.com/fanhaodong.516/go-tool/pb_gen/test \
--go_out=/Users/fanhaodong/go/src \
--plugin=protoc-gen-go-grpc=/Users/fanhaodong/go/bin/protoc-gen-go-grpc \
--go-grpc_opt=Mtest.proto=github.com/fanhaodong.516/go-tool/pb_gen/test \
--go-grpc_out=/Users/fanhaodong/go/src \
test.proto

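The core idea behind protobuf serialization is the varint; see the official article: developers.google.com/protocol-bu…
The goal of this article is to be able to serialize and deserialize a simple message by hand.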

Encoding and decoding

Encoding rules for each field type: github.com/protocolbuf…

A simple example

Here is a test case; its output is shown in the trailing comment:

func Test_Marshal_Data(t *testing.T) {
   var request = test.TestData{
      TString: "hello",
      TInt64:  520,
   }
   marshal, err := proto.Marshal(&request)
   if err != nil {
      t.Fatal(err)
   }
   t.Log(hex.Dump(marshal))
   // 00000000  0a 05 68 65 6c 6c 6f 10  88 04                    |..hello...|
}

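Message format (how the key is encoded)

Let's look at how this is encoded.

A message is simply a sequence of key-value pairs. Conceptually the key is the field number and the value is the field's payload, but on the wire the key is the field number plus the wire type, and every key is varint-encoded. In the protobuf source code this key is usually called the tag.

Protobuf offers many field types, but for transport it maps them onto a small set of wire types, roughly the six listed in the official article: developers.google.com/protocol-bu…. Six values fit in three bits, so tag = (field_number << 3) | wire_type

The low three bits of the tag hold the wire type; the remaining bits hold the field number.

A small tip: protocol codecs are full of bit operations. As a rule of thumb, | sets bits and & reads (masks) bits.

This is why field numbers 1 to 15 fit in a single byte: a varint byte carries only 7 bits of data, the tag spends 3 of them on the wire type, and the remaining 4 bits hold the field number, which gives (1<<4) - 1 = 15 usable field numbers. Official article: developers.google.com/protocol-bu…

As for why the largest field number is 2^29-1: the tag is encoded as a uint32, and 32 bits minus the 3 wire-type bits leaves 29 bits for the field number, hence 2^29-1.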
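The size claims above are easy to check with a small sketch. The helper names makeTag, splitTag and varintLen are made up for this illustration and are not part of any protobuf library.

```go
package main

import "fmt"

// makeTag packs a field number and a 3-bit wire type into one tag value.
func makeTag(fieldNumber uint32, wireType uint8) uint32 {
	return fieldNumber<<3 | uint32(wireType&7)
}

// splitTag is the inverse of makeTag.
func splitTag(tag uint32) (fieldNumber uint32, wireType uint8) {
	return tag >> 3, uint8(tag & 7)
}

// varintLen reports how many bytes v occupies as a varint (7 data bits per byte).
func varintLen(v uint32) int {
	n := 1
	for v >= 0x80 {
		v >>= 7
		n++
	}
	return n
}

func main() {
	for _, f := range []uint32{1, 15, 16, 1<<29 - 1} {
		tag := makeTag(f, 2)
		num, wt := splitTag(tag)
		fmt.Printf("field=%d wire=%d tag=%#x bytes=%d\n", num, wt, tag, varintLen(tag))
	}
}
```

Fields 1 and 15 produce a one-byte tag, field 16 already needs two bytes, and the largest field number needs five.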

The corresponding code looks like this:

type pbCodec struct {
	buffer []byte
}

func (p *pbCodec) EncodeTag(fieldNumber uint32, writeType uint8) error {
	tag := fieldNumber<<3 | uint32(writeType)
	return p.EncodeVarInt(int64(tag))
}

// Placeholder; the real implementation comes in the varint section below.
func (*pbCodec) EncodeVarInt(data int64) error {
	return nil
}

func (p *pbCodec) DecodeTag() (fieldNumber uint32, writeType uint8, err error) {
	result, err := p.DecodeVarInt()
	if err != nil {
		return 0, 0, err
	}
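	// 1. Extract the low 3 bits (the wire type). The parentheses matter: in Go,
	//    result&1<<3 - 1 parses as ((result&1)<<3) - 1.
	writeType = uint8(result & (1<<3 - 1))
	// 2. Drop the low 3 bits to get the field number.
	fieldNumber = uint32(result >> 3)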
	return
}

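Varint encoding

Wikipedia: en.wikipedia.org/wiki/Variab…

The idea is that each byte uses its low 7 bits for data and its most significant bit (MSB) as a continuation flag. An int64 takes at least 1 byte and at most 10 bytes.

Case 1:

For example: data=15 -> 0000 1111

Encoding:

The varint is 0000 1111: the value fits in 7 bits, so the MSB does not need to be set.

Decoding:

We take 0000 1111 and inspect the MSB, which is 0. There are several ways to read the MSB: compare the byte against 128, or use a bit operation, e.g. 0000 1111 & (1<<7) == 0 shows that the MSB is not set. The low 7 bits are then the actual data; because bit 8 is already 0 here, masking it off can be skipped.

Case 2:

For example: data=520 -> 0000 0010 0000 1000 (big-endian notation, least significant byte at the higher address)

Encoding:

The value does not fit in 7 bits, so first take the low 7 bits, data & ((1<<7) - 1) = 000 1000, and set the MSB to get 1000 1000. That is round one.

In round two the remaining bits are 0000 0010 0 = 4, which fits in 7 bits, so the byte is 0000 0100.

The final result is 1000 1000 0000 0100, i.e. [136, 4]. Note that the output is effectively little-endian: the least significant group comes first.

Decoding:

Because varint output is little-endian, we start from the first (least significant) byte.

Take the first byte, 1000 1000: the MSB is set, and the low 7 bits are 000 1000 = 8.

Then take the second byte, 0000 0100: the MSB is not set, so this is the last byte, and its low 7 bits are 000 0100. These bits belong above 000 1000, which is simple to do: 000 0100 << 7 | 000 1000 gives 000 0100 000 1000 = 0000 0010 0000 1000 = 520.

Implementation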

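func (p *pbCodec) EncodeVarInt(data int64) error {
	// Work on the unsigned bit pattern so that negative values are emitted
	// as 10-byte varints instead of being truncated to a single byte.
	v := uint64(data)
	// 1. While the value does not fit in 7 bits, take the low 7 bits,
	// 2. set the 8th bit (MSB) as the continuation flag,
	// 3. then shift right by 7.
	for v > (1<<7 - 1) {
		p.buffer = append(p.buffer, byte(v&(1<<7-1)|(1<<7)))
		v >>= 7
	}
	p.buffer = append(p.buffer, byte(v))
	return nil
}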

func (p *pbCodec) DecodeVarInt() (int64, error) {
	var (
		x int64
		n = 0
	)
	defer func() {
		p.buffer = p.buffer[n:]
	}()
	for shift := uint(0); shift < 64; shift += 7 { // the shift starts at 0 and grows by 7 per byte
		if n >= len(p.buffer) {
			return 0, fmt.Errorf("not enough buffer")
		}
		// 1. Read the next byte.
		// 2. Take its low 7 bits.
		// 3. The groups arrive least significant first, so shift them into position.
		// 4. OR them into the accumulated value.
		b := int64(p.buffer[n])
		n++
		x |= (b & 0x7F) << shift
		if (b & 0x80) == 0 {
			return x, nil
		}
	}
	return 0, fmt.Errorf("varint is longer than 10 bytes")
}

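Non-varint types

fixed64

Encoding: write the bytes from least significant to most significant, i.e. little-endian.

Decoding: the reverse of encoding.

For example:

520 (int64)
00 00 00 00 00 00 02 08
=> after encoding
08 02 00 00 00 00 00 00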
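The same result can be reproduced with the standard library's little-endian helpers. This is only a sketch of the byte layout; encodeFixed64 and decodeFixed64 are illustrative names.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// encodeFixed64 writes v as 8 little-endian bytes, the layout used for wire type 1.
func encodeFixed64(v uint64) []byte {
	buf := make([]byte, 8)
	binary.LittleEndian.PutUint64(buf, v)
	return buf
}

// decodeFixed64 reads the 8 bytes back into a uint64.
func decodeFixed64(b []byte) uint64 {
	return binary.LittleEndian.Uint64(b)
}

func main() {
	b := encodeFixed64(520)
	fmt.Printf("% x\n", b)        // 08 02 00 00 00 00 00 00
	fmt.Println(decodeFixed64(b)) // 520
}
```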

fixed32 (same as fixed64, but with 4 bytes)

string

Encoding: a string has variable length, so it is written as len + content. len is varint-encoded and content is the UTF-8 bytes.

field_number + wire_type (proto.WireBytes)
content_len (varint-encoded)
content
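A minimal sketch of that layout, reproducing the t_string = "hello" bytes from the first example. encodeString and appendVarint are hypothetical helpers written for this illustration.

```go
package main

import "fmt"

const wireBytes = 2 // length-delimited wire type

// appendVarint appends v in base-128 varint form.
func appendVarint(b []byte, v uint64) []byte {
	for v >= 0x80 {
		b = append(b, byte(v)|0x80)
		v >>= 7
	}
	return append(b, byte(v))
}

// encodeString writes the tag, the byte length, then the raw UTF-8 bytes.
func encodeString(fieldNumber uint64, s string) []byte {
	b := appendVarint(nil, fieldNumber<<3|wireBytes)
	b = appendVarint(b, uint64(len(s)))
	return append(b, s...)
}

func main() {
	fmt.Printf("% x\n", encodeString(1, "hello")) // 0a 05 68 65 6c 6c 6f
}
```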

Embedded message

To encode an embedded message we need to know how large its body is (just like a string), so it is again len + payload, with len varint-encoded.

field_number + wire_type (proto.WireBytes)
payload_len (varint-encoded)
payload
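As a sketch, here is how t_obj (field 8) holding a TestObj with t_int64 = 520 could be encoded by hand. The helpers are hypothetical; the point is that the inner message is encoded first so that its length is known.

```go
package main

import "fmt"

// appendVarint appends v in base-128 varint form.
func appendVarint(b []byte, v uint64) []byte {
	for v >= 0x80 {
		b = append(b, byte(v)|0x80)
		v >>= 7
	}
	return append(b, byte(v))
}

// encodeVarintField writes a wire type 0 field.
func encodeVarintField(fieldNumber, v uint64) []byte {
	return appendVarint(appendVarint(nil, fieldNumber<<3|0), v)
}

// encodeMessageField wraps an already encoded sub-message as a wire type 2 field.
func encodeMessageField(fieldNumber uint64, payload []byte) []byte {
	b := appendVarint(nil, fieldNumber<<3|2)
	b = appendVarint(b, uint64(len(payload)))
	return append(b, payload...)
}

func main() {
	inner := encodeVarintField(1, 520)   // TestObj{t_int64: 520}
	outer := encodeMessageField(8, inner) // TestData{t_obj: ...}
	fmt.Printf("% x\n", inner) // 08 88 04
	fmt.Printf("% x\n", outer) // 42 03 08 88 04
}
```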

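ZigZag encoding

As shown above, a varint uses the high bit of each byte as the MSB flag, and a negative number always has its top bit set (because it is stored in two's complement). The consequence is that the varint encoding of a negative number is large: a negative int32 or int64 always takes 10 bytes. Protobuf therefore provides sint32 and sint64, which solve this by using ZigZag encoding.

Take this code, for example: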

func Test_Marshal_Data(t *testing.T) {
	t.Run("number", func(t *testing.T) {
		var request = test.TestData{
			TInt64: 520,
		}
		marshal, err := proto.Marshal(&request)
		if err != nil {
			t.Fatal(err)
		}
		t.Log(hex.Dump(marshal))
		t.Logf("len: %d\n", len(marshal))
	})
	t.Run("negative number", func(t *testing.T) {
		var request = test.TestData{
			TInt64: -520,
		}
		marshal, err := proto.Marshal(&request)
		if err != nil {
			t.Fatal(err)
		}
		t.Log(hex.Dump(marshal))
		t.Logf("len: %d\n", len(marshal))
	})
}

== Output: the size difference is significant
=== RUN   Test_Marshal_Data
=== RUN   Test_Marshal_Data/number
    demo_test.go:45: 00000000  10 88 04                                          |...|
        
    demo_test.go:46: len: 3
=== RUN   Test_Marshal_Data/negative_number
    demo_test.go:56: 00000000  10 f8 fb ff ff ff ff ff  ff ff 01                 |...........|
        
    demo_test.go:57: len: 11

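The idea is simple.

ZigZag uses one half of the unsigned range for non-negative numbers and the other half for negative numbers: non-negative inputs map to even results and negative inputs map to odd results.

For 32 bits: (n << 1) ^ (n >> 31)

For 64 bits: (n << 1) ^ (n >> 63)

XOR: equal bits give 0, different bits give 1

For example: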

# -1
1111 1111 1111 1111 1111 1111 1111 1111

# d1=uint32(n) << 1
1111 1111 1111 1111 1111 1111 1111 1110
# d2=uint32(n >> 31) (arithmetic right shift fills with the sign bit, so 1s for a negative number)
1111 1111 1111 1111 1111 1111 1111 1111
# d1 ^ d2
0000 0000 0000 0000 0000 0000 0000 0001


# 1
0000 0000 0000 0000 0000 0000 0000 0001
#n<<1
0000 0000 0000 0000 0000 0000 0000 0010
#n>>31
0000 0000 0000 0000 0000 0000 0000 0000
# result: (n<<1) ^ (n>>31)
0000 0000 0000 0000 0000 0000 0000 0010

Implementation (pay particular attention to the 32-bit versions)

// EncodeZigZag64 does zig-zag encoding to convert the given
// signed 64-bit integer into a form that can be expressed
// efficiently as a varint, even for negative values.
func EncodeZigZag64(v int64) uint64 {
	return (uint64(v) << 1) ^ uint64(v>>63)
}

// EncodeZigZag32 does zig-zag encoding to convert the given
// signed 32-bit integer into a form that can be expressed
// efficiently as a varint, even for negative values.
func EncodeZigZag32(v int32) uint64 {
	return uint64((uint32(v) << 1) ^ uint32((v >> 31)))
}

// DecodeZigZag32 decodes a signed 32-bit integer from the given
// zig-zag encoded value.
func DecodeZigZag32(v uint64) int32 {
	return int32((uint32(v) >> 1) ^ uint32((int32(v&1)<<31)>>31))
}

// DecodeZigZag64 decodes a signed 64-bit integer from the given
// zig-zag encoded value.
func DecodeZigZag64(v uint64) int64 {
	return int64((v >> 1) ^ uint64((int64(v&1)<<63)>>63))
}
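To see the effect on size, the following self-contained sketch re-implements the 64-bit mapping under local names (zigzag64, unzigzag64, varintLen are illustrative) and compares the varint length of a plain int64 with its ZigZag form.

```go
package main

import "fmt"

// zigzag64 maps signed to unsigned: 0, -1, 1, -2, ... become 0, 1, 2, 3, ...
func zigzag64(v int64) uint64 { return uint64(v<<1) ^ uint64(v>>63) }

// unzigzag64 is the inverse mapping.
func unzigzag64(u uint64) int64 { return int64(u>>1) ^ -int64(u&1) }

// varintLen reports how many bytes v occupies as a varint.
func varintLen(v uint64) int {
	n := 1
	for v >= 0x80 {
		v >>= 7
		n++
	}
	return n
}

func main() {
	for _, v := range []int64{0, -1, 1, -2, 2, -520} {
		z := zigzag64(v)
		fmt.Printf("%5d -> %4d  plain=%d bytes, zigzag=%d bytes\n",
			v, z, varintLen(uint64(v)), varintLen(z))
	}
}
```

For -520 the plain varint needs 10 bytes while the ZigZag value 1039 needs only 2.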

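repeated (list)

Nothing so far has covered collection types. Protobuf provides the repeated keyword for list types.

A repeated field can be encoded in two ways:

packed (the default in proto3; introduced in proto2 v2.1.0)
unpacked

Official documentation: developers.google.com/protocol-bu…

packed

Taking the demo from the official site as an example: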

message Test4 {
  repeated int32 d = 4 [packed=true];
}

Encoded output:

22        // key (field number 4, wire type 2)
06        // payload size (6 bytes)
03        // first element (varint 3)
8E 02     // second element (varint 270)
9E A7 05  // third element (varint 86942)
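A hand-rolled sketch that reproduces those bytes for the values 3, 270 and 86942. encodePackedInt32 is a hypothetical helper; the key point is that the tag and length are written once, followed by all elements.

```go
package main

import "fmt"

// appendVarint appends v in base-128 varint form.
func appendVarint(b []byte, v uint64) []byte {
	for v >= 0x80 {
		b = append(b, byte(v)|0x80)
		v >>= 7
	}
	return append(b, byte(v))
}

// encodePackedInt32 writes one tag (wire type 2), the payload size,
// then every element as a bare varint.
func encodePackedInt32(fieldNumber uint64, values []int32) []byte {
	var payload []byte
	for _, v := range values {
		payload = appendVarint(payload, uint64(int64(v))) // int32 is sign-extended on the wire
	}
	b := appendVarint(nil, fieldNumber<<3|2)
	b = appendVarint(b, uint64(len(payload)))
	return append(b, payload...)
}

func main() {
	fmt.Printf("% x\n", encodePackedInt32(4, []int32{3, 270, 86942}))
	// 22 06 03 8e 02 9e a7 05
}
```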

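This matches the layout you would expect: the tag (field number and wire type), the payload length, then the elements back to back. Packed encoding only applies to elements of the following wire types: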

	WireVarint     = 0
	WireFixed32    = 5
	WireFixed64    = 1

unpacked

Using the same example, but unpacked:

message Test5 {
  repeated int32 d = 4 [packed = false];
}

Output:

00000000  20 03 20 8e 02 20 9e a7  05                       | . .. ...|
20  // key (field number 4, wire type=proto.WireVarint)
03 // (varint 3)
20 //  key (field number 4, wire type=proto.WireVarint)
8e 02 // (varint 270)
20 //  key (field number 4, wire type=proto.WireVarint)
9e a7  05 // (varint 86942)
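The unpacked form can be sketched the same way: every element simply repeats the tag. encodeUnpackedInt32 is again a made-up helper name.

```go
package main

import "fmt"

// appendVarint appends v in base-128 varint form.
func appendVarint(b []byte, v uint64) []byte {
	for v >= 0x80 {
		b = append(b, byte(v)|0x80)
		v >>= 7
	}
	return append(b, byte(v))
}

// encodeUnpackedInt32 writes each element as its own (tag, varint) pair.
func encodeUnpackedInt32(fieldNumber uint64, values []int32) []byte {
	var b []byte
	for _, v := range values {
		b = appendVarint(b, fieldNumber<<3|0) // wire type 0 for every element
		b = appendVarint(b, uint64(int64(v)))
	}
	return b
}

func main() {
	fmt.Printf("% x\n", encodeUnpackedInt32(4, []int32{3, 270, 86942}))
	// 20 03 20 8e 02 20 9e a7 05
}
```

For these three values the unpacked form takes 9 bytes against 8 for packed, and the gap grows with the number of elements because each one pays for its own tag.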

map

How protobuf models a map is visible in the file descriptor: a map is simply a repeated key-value message.

{
    "name": "TestData",
    "field": [
        {
            "name": "t_map",
            "number": 6,
            "label": 3,
            "type": 11,
            "type_name": ".TestData.TMapEntry",
            "json_name": "tMap"
        }
    ],
    "nested_type": [
        {
            "name": "TMapEntry",
            "field": [
                {
                    "name": "key",
                    "number": 1,
                    "label": 1,
                    "type": 3,
                    "json_name": "key"
                },
                {
                    "name": "value",
                    "number": 2,
                    "label": 1,
                    "type": 9,
                    "json_name": "value"
                }
            ],
            "options": {
                "map_entry": true
            }
        }
    ],
    "enum_type": []
}

So the encoding is straightforward. For example:

message TestMapData1 {
  // Defining your own TMapEntry here is not allowed; protoc reports an error.
  map<int64, string> t_map = 6;
}

==> which is effectively equivalent to:

message TestMapData2 {
  message TMapEntry {
    int64 key = 1;
    string value = 2;
  }
  repeated TMapEntry t_map = 6;
}

So the encoding looks roughly like this:

t_map= {1:"1",2:"2"}
=>
32 05 08 01 12 01 31 32  05 08 02 12 01 32

=> 
32 // field_number=6 and write_type=proto.WireBytes
05 // entry data length=5
08 // entry data key field_number=1 and write_type=proto.WireVarint
01 // entry data key_value=varint(1)
12 // entry data value field_number=2 and write_type=proto.WireBytes
01 // entry data value len= varint(1)
31 // entry data value="1"

32  // field_number=6 and write_type=proto.WireBytes
05  // entry data length=5
08 02 12 01 32 // same layout as above
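One entry of that dump can be rebuilt with a short sketch. encodeMapEntry is a hypothetical helper for a map<int64, string> field; it treats the entry as an embedded message with key = field 1 and value = field 2.

```go
package main

import "fmt"

// appendVarint appends v in base-128 varint form.
func appendVarint(b []byte, v uint64) []byte {
	for v >= 0x80 {
		b = append(b, byte(v)|0x80)
		v >>= 7
	}
	return append(b, byte(v))
}

// encodeMapEntry encodes one map<int64, string> entry as a length-delimited sub-message.
func encodeMapEntry(fieldNumber uint64, key int64, value string) []byte {
	entry := appendVarint(nil, 1<<3|0) // key: field 1, varint
	entry = appendVarint(entry, uint64(key))
	entry = appendVarint(entry, 2<<3|2) // value: field 2, length-delimited
	entry = appendVarint(entry, uint64(len(value)))
	entry = append(entry, value...)

	out := appendVarint(nil, fieldNumber<<3|2) // the map field itself
	out = appendVarint(out, uint64(len(entry)))
	return append(out, entry...)
}

func main() {
	fmt.Printf("% x\n", encodeMapEntry(6, 1, "1")) // 32 05 08 01 12 01 31
	fmt.Printf("% x\n", encodeMapEntry(6, 2, "2")) // 32 05 08 02 12 01 32
}
```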

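Field order

Protobuf does not prescribe the order in which fields are written, so encoding the same message with its fields in a different order produces different bytes.

The iteration order of map keys affects the output as well.

For that reason, most APIs let you choose whether to enable deterministic output. When it is set to true, the result is generally the same every time; otherwise it may differ.

As you might expect, enabling it is always slower than leaving it off, because the encoder has to sort.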
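In the Go runtime this switch is the Deterministic field of proto.MarshalOptions. The sketch below is not that implementation; it is a stdlib-only illustration, with hypothetical helpers, of why sorting is needed: Go map iteration order is not fixed, so the encoder sorts the keys before writing the entries.

```go
package main

import (
	"fmt"
	"sort"
)

// appendVarint appends v in base-128 varint form.
func appendVarint(b []byte, v uint64) []byte {
	for v >= 0x80 {
		b = append(b, byte(v)|0x80)
		v >>= 7
	}
	return append(b, byte(v))
}

// encodeEntry encodes one map<int64, string> entry (key = field 1, value = field 2).
func encodeEntry(fieldNumber uint64, key int64, value string) []byte {
	entry := appendVarint(appendVarint(nil, 1<<3|0), uint64(key))
	entry = appendVarint(appendVarint(entry, 2<<3|2), uint64(len(value)))
	entry = append(entry, value...)
	out := appendVarint(appendVarint(nil, fieldNumber<<3|2), uint64(len(entry)))
	return append(out, entry...)
}

// encodeMapSorted writes the entries in ascending key order so the output is stable.
func encodeMapSorted(fieldNumber uint64, m map[int64]string) []byte {
	keys := make([]int64, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Slice(keys, func(i, j int) bool { return keys[i] < keys[j] })

	var out []byte
	for _, k := range keys {
		out = append(out, encodeEntry(fieldNumber, k, m[k])...)
	}
	return out
}

func main() {
	m := map[int64]string{2: "2", 1: "1"}
	fmt.Printf("% x\n", encodeMapSorted(6, m))
	// 32 05 08 01 12 01 31 32 05 08 02 12 01 32
}
```

The extra sort is exactly the cost mentioned above: it buys byte-for-byte stable output at the price of more work per map.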

The protobuf protocol at a glance

Below is a BNF-like grammar; for details see: developers.google.com/protocol-bu…

message   := (tag value)*     You can think of this as “key value”

tag       := (field << 3) BIT_OR wire_type, encoded as varint
value     := (varint|zigzag) for wire_type==0 |
             fixed32bit      for wire_type==5 |
             fixed64bit      for wire_type==1 |
             delimited       for wire_type==2 |
             group_start     for wire_type==3 | This is like “open parenthesis”
             group_end       for wire_type==4   This is like “close parenthesis”

varint       := int32 | int64 | uint32 | uint64 | bool | enum, encoded as
                varints
zigzag       := sint32 | sint64, encoded as zig-zag varints
fixed32bit   := sfixed32 | fixed32 | float, encoded as 4-byte little-endian;
                memcpy of the equivalent C types (u?int32_t, float)
fixed64bit   := sfixed64 | fixed64 | double, encoded as 8-byte little-endian;
                memcpy of the equivalent C types (u?int64_t, double)

delimited := size (message | string | bytes | packed), size encoded as varint
message   := valid protobuf sub-message
string    := valid UTF-8 string (often simply ASCII); max 2GB of bytes
bytes     := any sequence of 8-bit bytes; max 2GB
packed    := varint* | fixed32bit* | fixed64bit*,
             consecutive values of the type described in the protocol definition

varint encoding: sets MSB of each 8-bit byte when more bytes follow; a clear MSB marks the last byte
zigzag encoding: sint32 and sint64 types use zigzag encoding.
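The grammar translates almost directly into a small parser. The sketch below is a simplified walker, not a full decoder: the field type and parse function are made up for this example, groups (wire types 3 and 4) are rejected, and values are left uninterpreted since the wire format alone does not say whether a varint is an int64, a bool or an enum.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// field is one decoded (tag value) pair.
type field struct {
	Number   uint64
	WireType uint8
	Varint   uint64 // set for wire type 0
	Bytes    []byte // raw bytes for wire types 1, 2 and 5
}

// parse walks message := (tag value)* and splits it into fields.
func parse(b []byte) ([]field, error) {
	var out []field
	for len(b) > 0 {
		tag, n := binary.Uvarint(b)
		if n <= 0 {
			return nil, errors.New("bad tag")
		}
		b = b[n:]
		f := field{Number: tag >> 3, WireType: uint8(tag & 7)}
		size := 0
		switch f.WireType {
		case 0: // varint
			v, n := binary.Uvarint(b)
			if n <= 0 {
				return nil, errors.New("bad varint")
			}
			f.Varint, b = v, b[n:]
		case 1: // fixed64bit
			size = 8
		case 5: // fixed32bit
			size = 4
		case 2: // delimited: size, then payload
			l, n := binary.Uvarint(b)
			if n <= 0 || l > uint64(len(b)-n) {
				return nil, errors.New("bad length")
			}
			b, size = b[n:], int(l)
		default:
			return nil, fmt.Errorf("unsupported wire type %d", f.WireType)
		}
		if size > len(b) {
			return nil, errors.New("truncated field")
		}
		f.Bytes, b = b[:size], b[size:]
		out = append(out, f)
	}
	return out, nil
}

func main() {
	// Bytes from the first example: t_string = "hello", t_int64 = 520.
	data := []byte{0x0a, 0x05, 0x68, 0x65, 0x6c, 0x6c, 0x6f, 0x10, 0x88, 0x04}
	fields, err := parse(data)
	if err != nil {
		panic(err)
	}
	for _, f := range fields {
		fmt.Printf("field=%d wire=%d varint=%d bytes=%q\n", f.Number, f.WireType, f.Varint, f.Bytes)
	}
}
```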

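The protoc command

This section walks through the architecture of protoc's code-generation pipeline, node by node.

The first node is protoc itself, which is written in C++: github.com/protocolbuf…

The second node is the binary CodeGeneratorRequest message that protoc emits; see github.com/protocolbuf…

The third node is the plugin's own logic, the backend, which produces output artifacts for the .proto files based on the request.

The fourth node outputs the CodeGeneratorResponse.

Many projects today implement their own lexer and parser to obtain the AST of .proto files. It would be simpler if protoc were packaged as a library that Go or Java code could call directly.

Other protobuf terminology

package: within one package there must be no duplicate definitions of messages, enums, or enum values
include (import): a path relative to include_path, which acts as the root directory
option: think of these as annotations; they can be attached to fields, messages, enums, and methods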