Pilota: Why Is a Code Generation Tool So Complicated? | GOTC Rust Series Talk

For a Rust RPC framework, IDL-based code generation exists to make the framework easier to use. The quality of the generated code and of the surrounding tooling has a direct impact on the user's development experience.

For that reason, ByteDance's CloudWeGo team built Pilota, a framework that generates good code for users.

At the "GOTC Global Open Source Technology Summit - Rust Forum" held on May 28, 2023, Liu Yifei, an R&D engineer on ByteDance's service framework team, introduced Pilota, a code generation tool. This article summarizes that talk.

 

Why do you need a code generation tool

First, why do we need code generation tools?

Because in the RPC world, users describe a service interface with an IDL (interface definition language).

For example, a service provides some methods, and for each method the IDL describes the types of its input parameters and the type of the structure it returns.

 

Real code cannot use such an IDL directly, so the IDL has to be converted into Rust code and handed to the framework and the user's code.

 

Such a translation process is what our code generation tool needs to do, that is, translate an IDL into Rust code.
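
As a rough illustration of that translation, the sketch below pairs a small Thrift IDL (shown in comments) with Rust code that a generator could plausibly emit. The service, the struct names, and the synchronous trait are assumptions made for this example and are not Pilota's actual output; services generated for Volo are asynchronous.

```rust
// Hypothetical Thrift IDL (not taken from the talk):
//
//   struct GetItemRequest  { 1: required i64 id }
//   struct GetItemResponse { 1: required string name }
//   service ItemService {
//       GetItemResponse GetItem(1: GetItemRequest req)
//   }

/// Rust counterpart of the `GetItemRequest` struct.
#[derive(Debug, Clone, PartialEq)]
pub struct GetItemRequest {
    pub id: i64,
}

/// Rust counterpart of the `GetItemResponse` struct.
#[derive(Debug, Clone, PartialEq)]
pub struct GetItemResponse {
    pub name: String,
}

/// Rust counterpart of the service: one trait method per IDL method.
pub trait ItemService {
    fn get_item(&self, req: GetItemRequest) -> Result<GetItemResponse, String>;
}

/// A toy in-memory implementation, only to show how the trait is used.
pub struct InMemoryItems;

impl ItemService for InMemoryItems {
    fn get_item(&self, req: GetItemRequest) -> Result<GetItemResponse, String> {
        Ok(GetItemResponse {
            name: format!("item-{}", req.id),
        })
    }
}
```

Calling `InMemoryItems.get_item(GetItemRequest { id: 7 })` then looks like an ordinary method call, which is the experience the generated code is meant to provide.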

The tool also has to generate the encoding and decoding logic.

For example, a Message trait defines encode and decode methods that every generated type must provide. encode serializes the type's structured data into binary data, and decode reconstructs structured data from binary data. This encoding and decoding logic also has to be produced by the code generation tool.
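
A minimal sketch of such a trait and one generated implementation is shown below. The trait signature and the length-prefixed byte layout are invented for illustration; they are not Pilota's real `Message` trait, nor the Thrift or Protobuf wire format.

```rust
use std::convert::TryInto;

/// Hypothetical trait; the real trait and wire format in Pilota differ.
pub trait Message: Sized {
    /// Appends the binary form of `self` to `buf`.
    fn encode(&self, buf: &mut Vec<u8>);
    /// Rebuilds a value from `buf`, or returns `None` if the bytes are malformed.
    fn decode(buf: &[u8]) -> Option<Self>;
}

#[derive(Debug, PartialEq)]
pub struct Item {
    pub id: i64,
    pub name: String,
}

impl Message for Item {
    fn encode(&self, buf: &mut Vec<u8>) {
        // Layout: 8-byte id, 4-byte name length, then the UTF-8 bytes of the name.
        buf.extend_from_slice(&self.id.to_be_bytes());
        buf.extend_from_slice(&(self.name.len() as u32).to_be_bytes());
        buf.extend_from_slice(self.name.as_bytes());
    }

    fn decode(buf: &[u8]) -> Option<Self> {
        let id = i64::from_be_bytes(buf.get(0..8)?.try_into().ok()?);
        let len = u32::from_be_bytes(buf.get(8..12)?.try_into().ok()?) as usize;
        let name = String::from_utf8(buf.get(12..12 + len)?.to_vec()).ok()?;
        Some(Item { id, name })
    }
}
```

Encoding an `Item` and decoding the resulting bytes yields an equal value, while a truncated buffer decodes to `None`.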

 

With the code generation tool, making an RPC call is very convenient. You only need to declare in build.rs which IDL files you depend on, and the tool generates the code for them. You then pull the generated code into your business logic with the include! macro, construct a client, and make an RPC call as if it were an ordinary function call.

 

 

This approach makes RPC calls very simple.

The challenge of generating Rust code

So, what are the challenges when generating code?

It is tempting to ask how complicated this can be: parse the IDL into an AST, then translate it item by item, so all you need is a simple parser. We thought so too at the beginning, and the first version of our code generation tool was built on exactly that idea.

But then we encountered a lot of problems.

For example, an IDL can contain circular dependencies: struct A depends on struct B, and struct B depends on struct A.

 

 

If you translate this directly into Rust and compile it, the compiler reports an error saying that the sizes of these two structs cannot be computed.

Rust does not allow circular dependencies to be written this way, so the size problem has to be solved by other means.

For example, wrap these fields in Box. A Box is a pointer, so the field has the size of a pointer, and the circular dependency in the size calculation disappears.
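
The effect can be seen in a small sketch. The struct and field names are placeholders, and making the fields optional is an assumption of this example rather than something stated in the talk.

```rust
// Without `Box`, these two definitions are rejected with error E0072
// (recursive type has infinite size).
#[derive(Debug, Default)]
pub struct A {
    pub b: Option<Box<B>>,
}

#[derive(Debug, Default)]
pub struct B {
    pub a: Option<Box<A>>,
}
```

Because `Option<Box<T>>` is stored as a single nullable pointer, `std::mem::size_of::<A>()` equals the size of a pointer, no matter how deeply the values nest at run time.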

 

So how do we solve this problem?

It is actually simple: we build a graph in which each IDL type is a node and each edge is a field that references another type. For example, struct A refers to struct B through field ab, and struct B refers to struct A through field ba.

Once the graph is built, we detect its cycles and add a Box to every field that lies on a cycle, which solves the problem.
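
One way to sketch that detection is shown below. It treats a field as lying on a cycle when the field's type can reach the owning type again, using a plain depth-first search; this is an illustrative approach, not necessarily the algorithm Pilota uses, and `TypeGraph`, `reaches`, and `fields_needing_box` are hypothetical names.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Type graph: each key is a type name, each value lists `(field name, field type)`.
pub type TypeGraph = BTreeMap<&'static str, Vec<(&'static str, &'static str)>>;

/// Returns true if `target` can be reached from `start` by following field edges.
fn reaches(graph: &TypeGraph, start: &'static str, target: &'static str) -> bool {
    let mut seen = BTreeSet::new();
    let mut stack = vec![start];
    while let Some(node) = stack.pop() {
        if node == target {
            return true;
        }
        if seen.insert(node) {
            if let Some(fields) = graph.get(node) {
                stack.extend(fields.iter().map(|&(_, ty)| ty));
            }
        }
    }
    false
}

/// Collects every `(owner, field)` pair whose edge lies on a cycle and so needs a `Box`.
pub fn fields_needing_box(graph: &TypeGraph) -> BTreeSet<(&'static str, &'static str)> {
    let mut boxed = BTreeSet::new();
    for (&owner, fields) in graph {
        for &(field, ty) in fields {
            // The edge owner -> ty closes a cycle if ty can get back to owner.
            if reaches(graph, ty, owner) {
                boxed.insert((owner, field));
            }
        }
    }
    boxed
}
```

For a graph where A has a field ab of type B and B has a field ba of type A, both fields are reported, while a field of a third struct that merely points at A is left unboxed. Boxing every field on the cycle is more than the minimum needed to break it, but it matches the rule described above.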

 

Then users came with a new requirement: could we generate the commonly used Rust derive macros, such as Hash and Eq?

But can we add the derive macro of Hash to all types?

No, because in Rust the floating-point types do not implement the Hash trait.

If we generate a struct with a floating-point field and add derive(Hash) to it, the compiler reports an error because f64 does not implement Hash. So we need a rule: a struct may derive Hash when all of its fields implement Hash.
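
The difference is easy to reproduce. In the sketch below, `Point`, `Measurement`, and `distinct` are illustrative names, not generated code.

```rust
use std::collections::HashSet;

// Every field implements `Hash` and `Eq`, so both derives are accepted.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct Point {
    pub x: i32,
    pub y: i32,
}

// `f64` implements neither `Hash` nor `Eq`, so adding them to this derive
// list would be rejected by the compiler.
#[derive(Debug, Clone, PartialEq)]
pub struct Measurement {
    pub value: f64,
}

/// Counts distinct points; this only compiles because `Point` is `Hash + Eq`.
pub fn distinct(points: Vec<Point>) -> usize {
    points.into_iter().collect::<HashSet<_>>().len()
}
```

A generator that blindly added `Hash` to `Measurement` would produce code that does not compile, which is why the rule has to be evaluated per struct.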

 

For example, take three structs A, B, and C. Struct A depends on struct B and struct C, struct B depends on struct A, and struct C has a field of type double, whose Rust type is f64.

 

 

Now we compute which structs can derive Hash. Starting with struct A, Hash(A) holds only if both Hash(B) and Hash(C) hold. For struct B, Hash(B) holds only if Hash(A) holds. For struct C, Hash(C) holds only if Hash(double) holds.

 

With these three propositions in hand, we need to decide whether each one is true.

Start with Hash(A): it depends on Hash(B) and Hash(C), and Hash(B) in turn depends on Hash(A).

If you handle this naively with recursion, A depends on B, B depends on A again, and the recursion never terminates.

So a different approach is needed. Evaluating Hash(A) requires Hash(B) and Hash(C). While evaluating Hash(B), we find that it depends on Hash(A), which is still being evaluated, so Hash(B) is recorded as a lazy dependency on Hash(A) and set aside to be resolved later.

Now substitute Hash(A) for Hash(B) and return to Hash(A): it depends on Hash(C) and Hash(A). Next comes Hash(C), which depends on Hash(double). But f64, the Rust type for double, does not implement Hash, so Hash(C) is false. Hash(A) is therefore equivalent to false && Hash(A), and because one operand is false, Hash(A) does not hold.

That leaves another question: Hash(A) is settled, but what about Hash(B)? Its evaluation was left half done as a lazy dependency, so it has to be evaluated again. Since Hash(A) is now known to be false, Hash(B) evaluates to false as well.

Therefore none of the three propositions Hash(A), Hash(B), and Hash(C) holds.

 

Now change the scenario: suppose the field in struct C is an int32 rather than a double. Following the same procedure, Hash(C) holds, which leaves Hash(A) depending only on Hash(A) itself. How do we handle this case?

In Rust, a struct whose only field refers to its own type (through an indirection such as Box) can derive Hash. So we can adopt a rule: if a proposition depends only on itself, it holds. Hash(A) therefore holds.

Next we go back to Hash(B), which was left as a lazy dependency on Hash(A). Since Hash(A) holds, Hash(B) holds too, and at this point Hash(A), Hash(B), and Hash(C) all hold.
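
Both scenarios can be reproduced with a short sketch. Note that it uses a fixed-point iteration instead of the recursive evaluation with lazy dependencies described in the talk: every struct starts out assumed hashable (which encodes the rule that a proposition depending only on itself holds), and structs are removed until nothing changes. `FieldTy` and `hashable_structs` are hypothetical names.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Field type as seen by the analysis: a primitive with a known answer,
/// or a reference to another named struct.
#[derive(Clone, Copy)]
pub enum FieldTy {
    Primitive { hashable: bool },
    Named(&'static str),
}

/// Returns the set of structs for which `#[derive(Hash)]` is safe.
pub fn hashable_structs(
    structs: &BTreeMap<&'static str, Vec<FieldTy>>,
) -> BTreeSet<&'static str> {
    // Start optimistic: assume every struct is hashable.
    let mut ok: BTreeSet<&'static str> = structs.keys().copied().collect();
    loop {
        let before = ok.len();
        let current = ok.clone();
        // Keep a struct only if every field is hashable under the current assumption.
        ok.retain(|name| {
            structs[name].iter().all(|field| match *field {
                FieldTy::Primitive { hashable } => hashable,
                FieldTy::Named(dep) => current.contains(dep),
            })
        });
        if ok.len() == before {
            return ok;
        }
    }
}
```

With C holding a non-hashable primitive, C is removed first, then A, then B, so the result is empty. With C holding a hashable primitive, nothing is ever removed and all three structs may derive Hash.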

 

While processing IDL you also run into constants, because Thrift lets you define a constant with const.

 

How should this constant be generated?

For example, a Thrift string constant is generated as a constant of type &'static str. The Rust compiler can optimize constants, so generating &'static str here is clearly the better choice for users.

 

But Thrift gives users a lot of freedom in constant types. For example, a const can be a map whose values are lists.

 

Rust cannot express this as a const, because types such as HashMap and Vec allocate memory. It has to be handled with a lazily initialized static instead: the tool generates a static reference together with the logic that constructs the HashMap and the Vec values inside it.
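
The two kinds of constant might come out roughly as follows. The IDL snippets in the comments and the names `GREETING` and `table` are invented for this example, and the standard library's `OnceLock` stands in for whatever lazy-static mechanism the generated code really uses.

```rust
use std::collections::HashMap;
use std::sync::OnceLock;

// Hypothetical IDL: const string GREETING = "hello world"
pub const GREETING: &'static str = "hello world";

// Hypothetical IDL: const map<string, list<i32>> TABLE = {"a": [1, 2], "b": [3]}
// A `HashMap` allocates, so it cannot be built in a `const`;
// instead it is constructed once, on first access.
pub fn table() -> &'static HashMap<&'static str, Vec<i32>> {
    static TABLE: OnceLock<HashMap<&'static str, Vec<i32>>> = OnceLock::new();
    TABLE.get_or_init(|| {
        let mut map = HashMap::new();
        map.insert("a", vec![1, 2]);
        map.insert("b", vec![3]);
        map
    })
}
```

Every call to `table()` returns a reference to the same map, so the construction cost is paid only once.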

 

What problem does this introduce? The design now has two representations for the same IDL type: one in const scope and one in struct scope.

 

Inside a const, we already generate &'static str for the string type.

But what about a string inside a struct? A string field in a struct is likely to be used to build a request or returned by the server in a response, so users need to be able to construct it easily.

If that string were a &'static str, users could hardly construct it, because the requirements for producing a &'static str are too strict.

So a rule is needed: in struct scope, the IDL string type maps to Rust's String. That leaves us with two representations of one type.

For example, "hello world" is a literal. In const scope its type is &'static str and the expression can be used directly. Outside const scope it has to become a String, so it is converted with to_owned(). When things get more complicated and the expression is a symbol rather than a literal, the code generation tool needs its own type system to work out how to convert between the different types.
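
A small sketch of the symbol case: a struct field whose default refers to a constant by name, so the generator has to know that a `&'static str` must be converted into a `String`. `DEFAULT_NAME` and `Request` are hypothetical, and using a `Default` impl is just one way such a default could surface.

```rust
// Const scope: the literal is used as-is.
pub const DEFAULT_NAME: &'static str = "hello world";

// Struct scope: the same IDL `string` becomes an owned `String`,
// which users can build from any runtime value.
#[derive(Debug, Clone, PartialEq)]
pub struct Request {
    pub name: String,
}

impl Default for Request {
    fn default() -> Self {
        // The default refers to the constant by symbol, so a conversion from
        // `&'static str` to `String` has to be generated here.
        Request {
            name: DEFAULT_NAME.to_owned(),
        }
    }
}
```

Without type information the generator could not tell that `DEFAULT_NAME` needs `.to_owned()` in this position but not inside another const.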

In practice, we also ran into the problem of generating too much code.

One business project had dozens of Thrift IDL files, from which our tool generated 1.5 million lines of Rust code, and a single cargo check took 10 minutes. At that time the tool put all the Rust code generated from the IDL into one file, so cargo check was very slow for users.

Looking closer exposes a problem. Suppose entry file a defines a service, which is what the user actually wants to use, and that file also references a second file. Together the two files contain five items: service, a, aa, b, and bb.

But users only care about the service at the entry point and what it depends on; they do not care about structures the service never uses.

So we can optimize: treat the service as the entry node, scan for the structures it depends on, and generate only those. After this optimization, only 100,000 of the original 1.5 million lines remained, and compilation became much faster.
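
The scan amounts to a reachability search from the entry node. The sketch below is one illustrative way to write it; the dependency map and the `reachable` function are hypothetical and say nothing about how Pilota stores its items.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Returns every item reachable from `entry` by following dependency edges.
/// Anything not in the returned set does not need to be generated.
pub fn reachable(
    deps: &BTreeMap<&'static str, Vec<&'static str>>,
    entry: &'static str,
) -> BTreeSet<&'static str> {
    let mut keep = BTreeSet::new();
    let mut stack = vec![entry];
    while let Some(item) = stack.pop() {
        // `insert` returns false for items already visited, which also guards against cycles.
        if keep.insert(item) {
            if let Some(children) = deps.get(item) {
                stack.extend(children.iter().copied());
            }
        }
    }
    keep
}
```

If the service uses only a and b, then aa and bb never appear in the result and their code is simply not emitted.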

CloudWeGo has an RPC framework, Volo, which gives users the ability to make RPC calls. So how does Volo consume IDL?

It needs a code generation tool as the bridge between IDL and the RPC framework. That tool is Pilota, which we have open sourced.

Pilota's Design Architecture

Pilota's structure looks simple. The entry point is a Parser, which takes an IDL as input and outputs an AST. Name resolution then turns the AST into an intermediate representation (IR). On that IR we handle circular dependencies, type conversion, dependency collection, and similar problems. Finally, user-defined plugins run, and the result is handed to the Backend, which emits the code.

Start with the Parser. It currently supports both Thrift and Protobuf, and its output is an AST.

Any IDL format that can be converted into Pilota's AST can plug into Pilota, and the complex problems described above are then handled for it.

Once we have the AST, the name resolution stage begins.

Pilota's AST is already very close to Rust's representation, with one peculiarity: symbols can share a name, such as mod Test and struct Test.

Pilota's AST allows symbols with the same name, which makes symbol resolution harder. Why allow such a design?

Because Protobuf has a special feature, nested items: a message can be defined inside a message, but Rust cannot define a struct inside a struct. To support this Protobuf feature, Pilota's AST has to be able to express it. When Pilota generates the struct, it also generates a mod with the same name to hold the nested items.
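
The shape of the output can be sketched as follows. The Protobuf message in the comment and the names `Order` and `order::Item` are made up for this example; the snake_case module name follows Rust convention and is an assumption about the generated code.

```rust
// Hypothetical Protobuf:
//   message Order {
//     message Item { int32 quantity = 1; }
//     Item item = 1;
//   }

/// The outer message becomes a struct.
#[derive(Debug, Clone, PartialEq, Default)]
pub struct Order {
    pub item: order::Item,
}

/// The nested items live in a module named after the outer message.
pub mod order {
    #[derive(Debug, Clone, PartialEq, Default)]
    pub struct Item {
        pub quantity: i32,
    }
}
```

User code then writes `order::Item` for the nested message, which mirrors the `Order.Item` path in the Protobuf definition.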

So when symbols share a name, name resolution uses separate namespaces, as Rust does. Rust's own name resolution keeps types and values in different namespaces, and Pilota has a similar design with three namespaces: type, value, and mod.

Look at the AST first: every item in the files is traversed once, and each symbol is assigned an ID.

Once IDs are assigned, we compute which ID each path in each field of each struct points to.

However, symbols with the same name can make a path ambiguous, so candidate results are filtered by namespace.
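
The two passes can be sketched with a table keyed by namespace and name. `Namespace`, `DefId`, and `SymbolTable` are hypothetical types written for this illustration and do not reflect Pilota's internal data structures.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Namespace {
    Type,
    Value,
    Mod,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct DefId(pub u32);

#[derive(Default)]
pub struct SymbolTable {
    next_id: u32,
    symbols: HashMap<(Namespace, String), DefId>,
}

impl SymbolTable {
    /// First pass: assign a fresh ID to a symbol in its namespace.
    pub fn define(&mut self, ns: Namespace, name: &str) -> DefId {
        let id = DefId(self.next_id);
        self.next_id += 1;
        self.symbols.insert((ns, name.to_owned()), id);
        id
    }

    /// Second pass: resolve a name, filtered by the namespace the use site expects.
    pub fn resolve(&self, ns: Namespace, name: &str) -> Option<DefId> {
        self.symbols.get(&(ns, name.to_owned())).copied()
    }
}
```

Defining `Test` once as a type and once as a mod yields two different IDs, and a lookup returns whichever one matches the namespace requested, so the shared name is no longer ambiguous.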

Rust code often uses derive macros from third-party libraries such as serde, so Pilota needs a flexible way for users to customize the attributes in the generated code.

In Pilota, each item, field, and variant has a corresponding adjust structure. Users can write a plugin that accesses this adjust structure and modifies its attributes to customize the attributes in the generated code, and can use the nested_item field to control additionally generated Rust code such as impl blocks.

The future of Pilota

After solving these problems, Pilota can be used in most scenarios. But Pilota is also exploring some different ways of generating code.

Some developers say the amount of generated code is still large: even after these optimizations it is around 50,000 to 100,000 lines. Some business teams would also rather not be aware of IDL at all, yet in the current system users must generate code from IDL. Can the way users consume it be changed?

For example, if a user wants service code for three IDLs, three crates can be generated for convenience.

However, some structures may be used by all three crates, so a common crate is needed to hold them. What does this way of generating code buy us?

First, it makes full use of the compiler's cache, because Rust's compilation cache works at the crate level. Previously, hundreds of thousands of generated lines might all sit in one crate of the business project. With multiple crates, those lines might be split across 6 crates of tens of thousands of lines each, so the cache granularity is finer. When the code of one crate changes, only those tens of thousands of lines need to be recompiled.

Once these crates are generated, developers can manage them with a cargo workspace and publish them to git or elsewhere. Other developers can then use the generated code directly to make RPC calls without caring that IDL exists.

Okay, that's it for today's sharing, thank you all.

GitHub address:

https://github.com/cloudwego/pilota
