ByteDance open-sources dynamicgo: high-performance, dynamic Go data processing on raw byte streams

Repository: https://github.com/cloudwego/dynamicgo

Background

Thrift is currently the main RPC serialization protocol used inside ByteDance. After optimization in the CloudWeGo/Kitex project, it performs considerably better than protocols that support generic encoding and decoding, such as JSON. However, while working closely with business teams on optimization, we found that some special business scenarios cannot benefit from the high performance of static code generation:

  1. Dynamic reflection: dynamically read, modify, and trim certain fields in data packets, such as field masking in privacy compliance scenarios;
  2. Data orchestration: combine multiple sub-packets and sort, filter, shift, or merge them, as some BFF (Backend For Frontend) services do;
  3. Protocol conversion: act as a proxy that converts data from one protocol into another, such as an HTTP-RPC protocol conversion gateway;
  4. Generic calls: RPC services that require hot updates within seconds or iterate very frequently, such as the large number of Kitex generic-call users.

These scenarios share one trait: it is hard to define a single static IDL for them. Even where distributed sidecar technology can work around this, teams often abandon traditional code generation because business requirements change dynamically, and fall back on self-developed or open-source generic Thrift codec libraries to make generic RPC calls. Our performance analysis showed that these libraries are far slower than generated code. In one ByteDance BFF service, for example, Thrift generic calls alone account for nearly 40% of CPU overhead, roughly 4 to 8 times that of a normal Thrift RPC service. We therefore developed dynamicgo, a Go base library that processes RPC data dynamically (no code generation required) while keeping performance high.

Design and implementation

First, why do these generic-call libraries perform poorly? The core reason is that they carry intermediate data in an inefficient generic container (typically map[string]interface{}, as in thrift-iterator). Go's heap memory management is expensive (GC plus heap bitmap), and using interface values inevitably causes a large number of memory allocations.

In practice, though, many business scenarios do not need these intermediate representations at all. Take pure protocol conversion in an HTTP-Thrift API gateway: in essence it converts JSON (or another protocol) into Thrift encoding according to the user's IDL (and vice versa), which can be done token by token as the input stream is read. Similarly, when we surveyed the generic-call code of one Douyin BFF service, the fields that actually needed to be read (Get) or written (Set) made up less than 5% of the fields in the whole packet. Fields that are not needed can be skipped rather than deserialized. The core design idea of dynamicgo is therefore to process and convert data in place, based on the raw byte stream and a dynamic type description. To this end, we designed different APIs for different scenarios.

Dynamic reflection

For Thrift reflection proxies, the requirements can be summarized as follows:

  1. A complete, self-describing structure that can express scalar data types as well as nested structures, mappings, sequences, and similar relationships;
  2. Support for create, read, update, and delete operations (Get/Set/Index/Delete/Add) as well as traversal (ForEach);
  3. Data must be safe for concurrent reads, but concurrent writes need not be supported. This is equivalent to the guarantees of map[string]interface{} or []interface{}.

Borrowing from the design of Go's reflect package, we package the quasi-static type description TypeDescriptor, obtained by parsing the IDL (it only needs to be updated when the IDL changes), together with the raw data unit Node into a fully self-describing structure, Value, which provides a complete reflection API.

// IDL type description
type TypeDescriptor interface {
    Type() Type                // data type
    Name() string              // type name
    Key() *TypeDescriptor      // for map key
    Elem() *TypeDescriptor     // for slice or map element
    Struct() *StructDescriptor // for struct
}

// Pure TLV data unit
type Node struct {
    t Type           // data type
    v unsafe.Pointer // start of the buffer
    l int            // length of the data unit
}

// Node + type descriptor
type Value struct {
    Node
    Desc thrift.TypeDescriptor
}

As long as the TypeDescriptor carries enough type information and the logic that processes the raw Thrift byte stream is robust enough, this design can even support complex business scenarios such as data clipping and aggregation.
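The following sketch shows what "Node is a reference into the raw buffer" can look like in practice. It assumes a stripped-down TypeDescriptor struct (not the interface above), and NewNode, Raw, and Int64 are hypothetical helpers written for this illustration rather than dynamicgo's API.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"unsafe"
)

type Type uint8

const I64 Type = 10 // Thrift binary-protocol code for i64

// TypeDescriptor is a hypothetical, stripped-down descriptor for this sketch.
type TypeDescriptor struct {
	typ  Type
	name string
}

// Node points into an existing buffer; it owns no memory of its own.
type Node struct {
	t Type
	v unsafe.Pointer // start of the value inside the buffer
	l int            // length of the value in bytes
}

// Value pairs the raw data unit with its type description.
type Value struct {
	Node
	Desc *TypeDescriptor
}

// NewNode wraps buf without copying it.
func NewNode(t Type, buf []byte) Node {
	if len(buf) == 0 {
		return Node{t: t}
	}
	return Node{t: t, v: unsafe.Pointer(&buf[0]), l: len(buf)}
}

// Raw returns a view of the bytes the Node refers to.
func (n Node) Raw() []byte {
	if n.v == nil {
		return nil
	}
	return unsafe.Slice((*byte)(n.v), n.l)
}

// Int64 decodes the value only when the descriptor and the data agree on i64.
func (v Value) Int64() (int64, error) {
	if v.Desc == nil || v.Desc.typ != I64 || v.t != I64 || v.l != 8 {
		return 0, errors.New("value is not an i64")
	}
	return int64(binary.BigEndian.Uint64(v.Raw())), nil
}

func main() {
	buf := []byte{0, 0, 0, 0, 0, 0, 0, 42}
	val := Value{NewNode(I64, buf), &TypeDescriptor{I64, "i64"}}
	n, _ := val.Int64()
	fmt.Println(n) // prints 42

	buf[7] = 43 // the Node shares memory with buf, so it sees the change
	n, _ = val.Int64()
	fmt.Println(n) // prints 43
}
```

Because a Node is just a pointer and a length, creating one allocates nothing on the heap, and decoding happens only when a caller asks for a typed value.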

Protocol conversion

The protocol conversion process can be expressed as a finite state machine (FSM). Taking JSON->Thrift as an example, the conversion works roughly as follows:

  1. Preload the user IDL and convert it into the runtime dynamic type description TypeDescriptor;
  2. Read one JSON value from the input byte stream and determine its type (object/array/string/number/bool/null);
  3. If it is an object, read a key, then look up the type description of the matching field through the corresponding STRUCT type description;
  4. If it is an array, recursively look up the element type description of the current type description;
  5. For any other type, use the current type description directly;
  6. Based on the type description obtained, convert the value into the equivalent Thrift bytes and write them to the output byte stream;
  7. Advance the input and output byte stream positions and jump back to step 2, looping until the input ends (EOF).

<p align=center>Figure 1 JSON2Thrift data conversion process</p>

The whole process can run in place; memory needs to be allocated only once, for the output byte stream.
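The loop described above can be sketched in a few dozen lines. This is an illustration under stated assumptions, not dynamicgo's converter: TypeDesc, userDesc, convert, and JSON2Thrift are hypothetical names, only i64, string, list, and struct are handled, and the standard encoding/json token stream stands in for a hand-written in-place scanner (so, unlike the real design, the tokenizer itself allocates).

```go
package main

import (
	"encoding/binary"
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// Thrift binary-protocol type codes used by this sketch.
const tStop, tI64, tString, tStruct, tList = 0, 10, 11, 12, 15

// TypeDesc is a hypothetical, simplified runtime type description.
type TypeDesc struct {
	Type   byte
	ID     uint16               // field ID when this describes a struct field
	Elem   *TypeDesc            // element type, for lists
	Fields map[string]*TypeDesc // fields by JSON key, for structs
}

var errType = errors.New("JSON value does not match the type description")

// convert reads one JSON value from dec and appends its Thrift encoding to out.
func convert(dec *json.Decoder, d *TypeDesc, out []byte) ([]byte, error) {
	tok, err := dec.Token()
	if err != nil {
		return nil, err
	}
	switch v := tok.(type) {
	case string:
		if d.Type == tString {
			out = binary.BigEndian.AppendUint32(out, uint32(len(v)))
			return append(out, v...), nil
		}
	case json.Number:
		if n, err := v.Int64(); err == nil && d.Type == tI64 {
			return binary.BigEndian.AppendUint64(out, uint64(n)), nil
		}
	case json.Delim:
		if v == '{' && d.Type == tStruct {
			return convertStruct(dec, d, out)
		} else if v == '[' && d.Type == tList {
			return convertList(dec, d, out)
		}
	}
	return nil, errType
}

// convertStruct handles the members of an object whose '{' was already read.
func convertStruct(dec *json.Decoder, d *TypeDesc, out []byte) ([]byte, error) {
	for dec.More() {
		key, err := dec.Token() // inside an object this token is the key
		name, _ := key.(string)
		f, ok := d.Fields[name]
		if err != nil || !ok {
			return nil, errType // a fuller version would skip unknown keys
		}
		out = append(out, f.Type, byte(f.ID>>8), byte(f.ID)) // field header
		if out, err = convert(dec, f, out); err != nil {
			return nil, err
		}
	}
	_, err := dec.Token() // closing '}'
	return append(out, tStop), err
}

// convertList handles the elements of an array whose '[' was already read.
func convertList(dec *json.Decoder, d *TypeDesc, out []byte) ([]byte, error) {
	out = append(out, d.Elem.Type, 0, 0, 0, 0) // element type + count placeholder
	countAt, n := len(out)-4, uint32(0)
	for ; dec.More(); n++ {
		var err error
		if out, err = convert(dec, d.Elem, out); err != nil {
			return nil, err
		}
	}
	binary.BigEndian.PutUint32(out[countAt:], n) // patch the real count
	_, err := dec.Token()                        // closing ']'
	return out, err
}

func JSON2Thrift(src string, d *TypeDesc) ([]byte, error) {
	dec := json.NewDecoder(strings.NewReader(src))
	dec.UseNumber() // keep integers exact instead of decoding to float64
	return convert(dec, d, make([]byte, 0, 2*len(src)))
}

var userDesc = &TypeDesc{Type: tStruct, Fields: map[string]*TypeDesc{
	"id":   {Type: tI64, ID: 1},
	"name": {Type: tString, ID: 2},
	"tags": {Type: tList, ID: 3, Elem: &TypeDesc{Type: tString}},
}}

func main() {
	out, err := JSON2Thrift(`{"id":7,"name":"ab","tags":["x","yz"]}`, userDesc)
	fmt.Printf("% x %v\n", out, err)
}
```

Each JSON token is matched against the current type description and written straight to the output buffer; no map or interface tree of the whole document is ever built. The list case shows one wrinkle of going from JSON to Thrift: the element count is only known after the last element, so a placeholder is written first and patched afterwards.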

Data orchestration

Unlike the previous two scenarios, data orchestration may change the position of data (heterogeneous conversion) and often touches a large number of data nodes (worst-case complexity O(N)). We ran into this kind of problem while working with the Douyin privacy compliance team. One of their important business scenarios is to scan horizontally across the child nodes of an array, check whether any of them contains non-compliant data, and erase the entire row if so. In this scenario, searching and inserting directly on the raw byte stream can incur a lot of repeated skip-based positioning and data copying, which ultimately degrades performance. We therefore need an efficient deserialized (pointer-linked) structure to represent the data. Drawing on past experience, we turned to the DOM (Document Object Model), which is widely used for generic JSON parsing (for example in RapidJSON and sonic/ast) and performs much better than map+interface generics.

To describe a Thrift structure as a DOM, we first need a positioning scheme that precisely describes the relationship between data nodes: Path. Its types should include list index, map key, struct field ID, and so on.

type PathType uint8

const (
    PathFieldId   PathType = 1 + iota // field ID under a STRUCT
    PathFieldName                     // field name under a STRUCT
    PathIndex                         // index under a SET/LIST
    PathStrKey                        // string key under a MAP
    PathIntkey                        // integer key under a MAP
    PathObjKey                        // object key under a MAP
)

type PathNode struct {
    Path            // path relative to the parent node
    Node            // raw data unit
    Next []PathNode // child nodes
}

On top of Path, we attach the corresponding data unit Node and store child nodes dynamically in a Next slice, which yields a generic structure similar to a BTree.
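As a rough illustration of how such a tree is addressed, the sketch below builds a small PathNode tree by hand and walks it with a sequence of path steps. The internal layout of Path, the constructors FieldID, Index, and StrKey, the Get method, and the simplified Node are all assumptions made for this example; in a real DOM the Node would reference encoded Thrift bytes rather than plain strings.

```go
package main

import "fmt"

type PathType uint8

const (
	PathFieldId   PathType = 1 + iota // field ID under a STRUCT
	PathFieldName                     // field name under a STRUCT
	PathIndex                         // index under a SET/LIST
	PathStrKey                        // string key under a MAP
)

// Path is one step from a parent to a child. Its layout is an assumption.
type Path struct {
	t PathType
	i int    // field ID or list index
	s string // field name or map key
}

func FieldID(id int) Path  { return Path{t: PathFieldId, i: id} }
func Index(i int) Path     { return Path{t: PathIndex, i: i} }
func StrKey(k string) Path { return Path{t: PathStrKey, s: k} }

// Node stands in for the raw data unit; here it just holds a byte slice.
type Node struct{ raw []byte }

type PathNode struct {
	Path            // path relative to the parent node
	Node            // raw data unit
	Next []PathNode // child nodes
}

// Get walks the tree one path step at a time and returns nil on a miss.
func (p *PathNode) Get(steps ...Path) *PathNode {
	cur := p
	for _, step := range steps {
		var found *PathNode
		for i := range cur.Next {
			if cur.Next[i].Path == step {
				found = &cur.Next[i]
				break
			}
		}
		if found == nil {
			return nil
		}
		cur = found
	}
	return cur
}

// tree models struct{1: string name; 2: map<string, list<string>> tags}.
var tree = PathNode{Next: []PathNode{
	{Path: FieldID(1), Node: Node{[]byte("alice")}},
	{Path: FieldID(2), Next: []PathNode{
		{Path: StrKey("langs"), Next: []PathNode{
			{Path: Index(0), Node: Node{[]byte("go")}},
			{Path: Index(1), Node: Node{[]byte("rust")}},
		}},
	}},
}}

func main() {
	n := tree.Get(FieldID(2), StrKey("langs"), Index(1))
	fmt.Printf("%s\n", n.raw) // prints "rust"
}
```

Every level uses the same PathNode type regardless of whether it represents a struct, a map, or a list, which is what makes the uniform memory layout possible.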

<p align=center>Figure 2 thrift DOM data structure</p>

How is this generic structure better than map+interface? First, the underlying data unit Node is a reference to the raw Thrift data, so there is no binary encoding/decoding overhead from converting values to interface. Second, our design ensures that every tree node PathNode has exactly the same memory layout, and because the core container for the parent-child relationship is a slice, we can go further and use a memory pool to pool the allocation and release of child nodes across the whole DOM tree, avoiding calls into Go's heap memory management. Test results show that in the ideal case (the number of DOM nodes to deserialize is no greater than the maximum number deserialized before, which the buffering effect of the memory pool largely guarantees), the number of memory allocations can drop to 0 and performance improves by 200% (see the [Performance Test - Full Serialization/Deserialization] section).
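The reuse idea can be sketched with nothing more than slice capacity and sync.Pool. This is one possible way to get the effect described above, not dynamicgo's actual pooling code; the simplified PathNode (an int key instead of Path and Node) and the helpers child and fill are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// PathNode is reduced to a key and children for this sketch.
type PathNode struct {
	Key  int
	Next []PathNode
}

// child appends a child to n. When n.Next still has spare capacity from an
// earlier tree, the old slot and that slot's own Next storage are reused.
func (n *PathNode) child(key int) *PathNode {
	if len(n.Next) < cap(n.Next) {
		n.Next = n.Next[:len(n.Next)+1] // reuse a previously allocated slot
	} else {
		n.Next = append(n.Next, PathNode{}) // grow only when out of capacity
	}
	c := &n.Next[len(n.Next)-1]
	c.Key, c.Next = key, c.Next[:0] // keep the grandchildren's backing array
	return c
}

// fill builds a rows x cols tree under root.
func fill(root *PathNode, rows, cols int) {
	root.Next = root.Next[:0] // forget old children but keep their memory
	for r := 0; r < rows; r++ {
		row := root.child(r)
		for c := 0; c < cols; c++ {
			row.child(c)
		}
	}
}

// pool hands out root nodes whose slices may already have capacity.
var pool = sync.Pool{New: func() interface{} { return new(PathNode) }}

func main() {
	for i := 0; i < 3; i++ {
		root := pool.Get().(*PathNode)
		fill(root, 4, 8)
		fmt.Println(len(root.Next), len(root.Next[0].Next))
		pool.Put(root) // later trees of the same shape can reuse this storage
	}
}
```

Truncating a slice to length zero keeps its backing array, so a second tree with no more nodes than the first fits entirely in memory that was already allocated. This works only because every node has the same type and size; a map[string]interface{} tree offers no comparable way to recycle its storage.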

Performance Testing

Here we define two benchmark structures, simple (Small) and complex (Medium), to compare performance at different data sizes, and add two corresponding subsets, simple partial (SmallPartial) and complex partial (MediumPartial), for the [Reflection - clipping] comparison:

  • Small: 114 B, 6 valid fields
  • SmallPartial: a subset of Small, 55 B, 3 valid fields
  • Medium: 6455 B, 284 valid fields
  • MediumPartial: a subset of Medium, 1922 B, 132 valid fields

Next, following the business scenarios above, we split the tests into three API groups: reflection, protocol conversion, and full serialization/deserialization. We benchmark against the code generation library kitex/FastAPI, the generic-call library kitex/generic, and the JSON library sonic. The rest of the test environment is kept the same:

  • Go 1.18.1
  • CPU: Intel i9-9880H, 2.3 GHz
  • OS: macOS Monterey 12.6

Reflection

Code

dynamicgo/testdata/baseline_tg_test.go

Test cases

  • GetOne: look up the last data field in the byte stream
  • GetMany: look up the first, middle, and last 5 data fields
  • MarshalMany: re-serialize the results of GetMany
  • SetOne: set the last data field
  • SetMany: set 3 data nodes at the front, middle, and end
  • MarshalTo: clip a large Thrift packet into a small one (Small -> SmallPartial or Medium -> MediumPartial)
  • UnmarshalAll+MarshalPartial: clipping with code generation or generic calls, which first deserializes all the data and then serializes part of it. The effect is equivalent to MarshalTo.

Results

  • Simple (ns/op)

  • Complex (ns/op)

Conclusions

  • dynamicgo's overhead for a single lookup + write is about 2x to 1/3 that of code generation and 1/12 to 1/15 that of generic calls, and the advantage grows with data size;
  • dynamicgo's Thrift clipping overhead is close to that of code generation and about 1/10 to 1/6 that of generic calls, and the advantage shrinks as data size grows.

Protocol conversion

Code

Test cases

  • JSON2thrift: convert JSON data into Thrift data of equivalent structure
  • thrift2JSON: convert Thrift data into JSON data of equivalent structure
  • sonic + kitex-fast: process JSON data (with struct binding) through sonic and process Thrift data through Kitex generated code

Results

  • Simple (ns/op)

  • Complex (ns/op)

Conclusions

  • dynamicgo's protocol conversion cost is about 1 to 2/3 that of code generation and 1/4 to 1/9 that of generic calls, and the advantage grows with data size.

Full serialization/deserialization

Code

dynamicgo/testdata/baseline_tg_test.go#BenchmarkThriftGetAll

Test cases

  • UnmarshalAll: deserialize all fields. dynamicgo has two modes:

    • new: DOM memory is reallocated each time;
    • reuse: DOM memory is reused through the memory pool.
  • MarshalAll: serialize all fields.

Results

  • Simple (ns/op)

  • Complex (ns/op)

Conclusions

  • The cost of dynamicgo full serialization is about 6 to 3 times that of code generation and 1/4 to 1/2 that of generic calls, and the advantage shrinks as data size grows;
  • The cost of dynamicgo full deserialization with memory reuse is about 1.8 to 0.7 times that of code generation and 1/13 to 1/8 that of generic calls, and the advantage grows with data size.

Applications and Prospects

dynamicgo is already used in many important business scenarios, including:

  1. Business privacy compliance middleware (Thrift reflection);
  2. On-demand delivery of downstream data in a Douyin BFF service (Thrift clipping);
  3. ByteDance API gateway protocol conversion (JSON<>Thrift protocol conversion).

These are going live step by step and already paying off. dynamicgo is still being iterated on, and the next steps include:

  1. Integrating it into the Kitex generic-call module to give more users high-performance Thrift generic calls;
  2. Connecting the Thrift DOM to DSL (GraphQL) components to further improve the performance of the BFF dynamic gateway;
  3. Supporting the Protobuf protocol.

Interested individuals and teams are welcome to join and build it together!



Origin my.oschina.net/u/4843764/blog/8604415