ByteDance's exploration and practice in the direction of Rust microservices

The QCon Global Software Development Conference 2022 (Shanghai) recently came to a successful conclusion. The conference is a comprehensive technology event hosted by InfoQ China, where nearly 100 technology experts from China and abroad shared cutting-edge cases and innovative practices in person. This article is compiled from the talk given at the conference by Wu Di, Service Framework Engineer in ByteDance's Volcano Engine infrastructure team. The topic is "ByteDance's Exploration and Practice in the Rust Microservice Direction".

This talk is divided into three parts:

  1. Why we chose the Rust language;
  2. What we did;
  3. Looking to the future: opportunities and challenges.

Guest introduction

I will start with why we chose Rust. You may know that ByteDance's best-known framework is Kitex, a Go framework, and that ByteDance has invested heavily in Go. So why start exploring Rust now? Second, I will cover what we have done in this direction, the problems we ran into, and how we solved them. Third, I will share what we see as the current opportunities and future challenges for Rust.

Why we chose the Rust language

I started out doing Go development and working on Go's RPC framework, and along the way we ran into many problems with the Go language.

Shackles of the Go language

  • Deep optimization is difficult

Deep optimization in Go is very difficult, and as our scale grows it matters more and more. If you try to optimize deeply in Go, you often find yourself fighting the runtime and the compiler, and you end up using very hacky methods to bypass their limitations.

  • Toolchain and package management are not mature enough

Go's toolchain and package management are relatively immature, as anyone who has used our open-source Kitex framework will understand. For example, to call a gRPC or Thrift service in Go, you need generated code. You have to generate it with a command-line tool during development and then commit the generated code to version control. To put it bluntly, this is a clumsy approach. C++, Java, and Python can adopt other solutions, but Go has to do it this way because its compiler has no ability to generate code as part of compilation. You could wire up a script that generates the code at build time, but then the files do not exist locally, so there is no code completion and the IDE underlines everything, which is a very bad experience.

  • Weak abstraction ability

Go's abstraction ability is relatively weak, and it has no concept of zero-cost abstraction.

Deep optimization is difficult

Let me share a real and interesting case I ran into. Serialization and deserialization can fail. In our previous version the code was very simple: when an error occurred, it was returned directly. Later, to improve the user experience, we wanted to return more information, for example which struct and which field was being read when the error occurred. The intention was good, but after the new code went live a business team came to us and asked: "Is there a problem with your new code? Why has our performance dropped by 20%?" We thought that was impossible. None of the serialization or deserialization logic had changed, only this one line. We were puzzled and suspected the business team's test environment. After a long investigation we finally found the cause.

new version code

old version code

This is the assembly generated for the new code. Go's compiler is, frankly, not very smart here. It does not reorder code or instructions, so the error-message construction we just saw, including all of the strings, is emitted inline in the normal path. What is the problem with that? It significantly increases L1 cache misses, and because the L1 cache has a large effect on execution speed, performance dropped.

How did we solve this? A compiler problem is not necessarily unsolvable. Since the compiler would not reorder the code, we did it ourselves with a very hacky method: we moved all of the error construction after the normal return statement and used goto to jump to a label there when an error occurs. You have probably been taught to use goto with caution and to avoid it where possible, but in this scenario we had no other choice. The goto compiles directly to a jmp instruction, and the measured performance ended up even better than the old version that returned the error directly, because the cache-miss figure dropped from about 2.4 to about 1.8, a considerable improvement.
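For comparison, Rust lets you state the same hot-path and cold-path split directly instead of arranging it by hand with goto. The sketch below is not the code from the talk: DecodeError, field_error, and read_u32 are hypothetical names, and the cold attribute is only a hint to the optimizer, so the resulting code layout would still need to be checked in the generated assembly.

```rust
// Illustrative only: a decoder that keeps error formatting off the hot path.

#[derive(Debug, PartialEq)]
struct DecodeError {
    message: String,
}

// #[cold] tells the optimizer this function is rarely called, and
// #[inline(never)] keeps the string formatting out of the caller's body.
#[cold]
#[inline(never)]
fn field_error(struct_name: &str, field: &str, needed: usize, got: usize) -> DecodeError {
    DecodeError {
        message: format!("{struct_name}.{field}: need {needed} bytes, got {got}"),
    }
}

// Hot path: read a big-endian u32. The detailed error is built out of line.
fn read_u32(buf: &[u8], struct_name: &str, field: &str) -> Result<u32, DecodeError> {
    match buf.get(..4) {
        Some(b) => Ok(u32::from_be_bytes([b[0], b[1], b[2], b[3]])),
        None => Err(field_error(struct_name, field, 4, buf.len())),
    }
}

fn main() {
    println!("{:?}", read_u32(&[0, 0, 1, 2], "User", "id"));
    println!("{:?}", read_u32(&[7], "User", "id"));
}
```

The caller still gets the struct and field name in the error, but the success path contains only the bounds check and the byte conversion.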

This is a very interesting example, and it shows how difficult deep optimization in Go really is.

Zero-cost abstraction

Another issue is that Go has no zero-cost abstractions. Zero-cost abstraction means two things: what you do not use, you do not pay for; and what you do use, whether it comes from the compiler, the standard library, or a third-party library, is implemented so well that you could not reasonably do better by hand. People who write C++ and Rust will know this concept well, but it does not exist in Go.

Thrift codec abstraction

Why do we say Go lacks zero-cost abstraction? Take Thrift's codec abstraction as an example. Apache's official Thrift supports many combinations of Transport and Protocol. At the transport layer there are transports such as framed, buffered, and an in-memory transport. At the serialization layer there are basically two protocols, Binary and Compact. Together they form many combinations. To support them all, the official implementation uses Go interfaces for the abstraction. We later removed that abstraction and depended directly on one specific protocol implementation, that is, on a concrete struct.

The cost of Interface

So why did we drop the abstraction? Because abstraction has a price. In Go, interfaces are dynamically dispatched: at run time the call goes through the type's metadata and a pointer to find the method, which can cost an extra memory access.

But that is not the most important cost. The most important cost is that it prevents inlining. And Go offers no zero-cost alternative. Rust has Box<dyn Trait>, which is very similar to a Go interface, but it also has static dispatch: generic code is monomorphized for each concrete type at compile time. That is the zero-cost approach, and it is not available in Go.
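To make the two dispatch styles concrete, here is a minimal sketch in Rust. It is not Thrift's or Volo's real API: the Protocol trait, the Binary and Compact types, and both encode functions are hypothetical, and only 32-bit integers are encoded (big-endian for Binary, zigzag varint for Compact).

```rust
// Illustrative only: one trait, used through static and dynamic dispatch.

trait Protocol {
    fn write_i32(&self, out: &mut Vec<u8>, v: i32);
}

struct Binary;
struct Compact;

impl Protocol for Binary {
    // Fixed-width big-endian encoding.
    fn write_i32(&self, out: &mut Vec<u8>, v: i32) {
        out.extend_from_slice(&v.to_be_bytes());
    }
}

impl Protocol for Compact {
    // Zigzag encoding followed by a variable-length integer.
    fn write_i32(&self, out: &mut Vec<u8>, v: i32) {
        let mut n = ((v << 1) ^ (v >> 31)) as u32;
        while n >= 0x80 {
            out.push((n as u8) | 0x80);
            n >>= 7;
        }
        out.push(n as u8);
    }
}

// Static dispatch: monomorphized once per concrete P, so write_i32 can be inlined.
fn encode_static<P: Protocol>(protocol: &P, values: &[i32]) -> Vec<u8> {
    let mut out = Vec::new();
    for &v in values {
        protocol.write_i32(&mut out, v);
    }
    out
}

// Dynamic dispatch: every call goes through the vtable, like a Go interface.
fn encode_dynamic(protocol: &dyn Protocol, values: &[i32]) -> Vec<u8> {
    let mut out = Vec::new();
    for &v in values {
        protocol.write_i32(&mut out, v);
    }
    out
}

fn main() {
    println!("{:?}", encode_static(&Binary, &[1]));
    println!("{:?}", encode_static(&Compact, &[300]));
    println!("{:?}", encode_dynamic(&Compact, &[300]));
}
```

Both functions produce the same bytes. The difference is that the generic version keeps the abstraction in the source while the compiler sees the concrete type, so the library author does not have to choose between abstraction and inlining.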

Sonic

Another interesting project: our CloudWeGo community has open-sourced Sonic, which should be the fastest JSON serialization and deserialization library in Go. Why is it fast? The secret to the fastest Go code in the world is to not write it in Go; write it in assembly and C instead. Look at the language statistics on GitHub: Go accounts for only 27.1% of the code.

You will also find that basically all of that Go code is Go that generates assembly. So this is our conclusion: to write the fastest Go code in the world, don't write it in Go, write it in assembly.

The best performing Go JSON library

Yet even though this JSON library uses a lot of black magic, look at the green bar: that is serde_json, the classic Rust JSON library built on serde. In our benchmark, Sonic still did not match this Rust library. That is when we made up our minds to study Rust and try to put it into practice.

Rust History

When talking about Rust, it helps to know its history. Rust was originally developed by Graydon Hoare, a programming language engineer at Mozilla. At the time Mozilla wanted to build a browser engine called Servo, saw that the language was very valuable, and decided to sponsor it.

In 2015 Rust released version 1.0, which represented a promise of stability. Version 1.31, released in 2018 together with the 2018 edition, stood for productivity. The 2018 edition reserved the async and await keywords, and async/await itself was stabilized in 2019. Looking at where it is now, my evaluation is that Rust has a promising future.

Rust 2024

Rust's roadmap for 2024 is called "Scaling Empowerment". Why that name? It goes back to the question I asked everyone at the start of the session: what is Rust's goal and vision? Its vision is to empower everyone to build reliable and efficient software.

Everyone has heard that Rust has a relatively steep learning curve. The Rust project is aware that async Rust in particular has usability problems, and it takes them seriously. So the main goal for 2024 is to make Rust easier to learn and use, so that more projects can adopt it.

Three advantages of Rust

In our opinion Rust has three big advantages: performance, safety, and collaboration. Performance and safety are the two you have probably heard the most about.

Performance

For example, this is a result from the Benchmarks Game hosted by Debian; I picked a purely computational case. You can see that Rust is well ahead of several other languages, and about 4 times faster than Go in particular.

Some people ask how Rust can outperform C and C++. Rust asks more of the programmer: because the language restricts code more strictly, the compiler can apply some optimizations more aggressively. So in some cases its performance can surpass C and C++.

Safety

There is plenty of material on this online, so I will not repeat it. I will only state the most important conclusion: since Rust 1.0, code that does not use unsafe cannot have memory safety problems.

That may sound a bit abstract. The corollary is more useful: all memory and concurrency safety problems originate in unsafe code. So if an online service core dumps, or hits a memory safety or data race problem, you do not need to read the safe code, only the unsafe code. Rust programs contain very little unsafe code. In C and C++ everything is effectively unsafe, so you can search for a long time without knowing when you will find the cause. In Rust you only need to look at the unsafe code added in recent changes, because that is where the problem must come from. This is the benefit of Rust's safety.
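As a small illustration of what "only look at the unsafe code" means in practice, here is a sketch with a hypothetical function, sum_even_indices. The single unsafe block, together with its SAFETY comment, is the only place a reviewer has to check for memory safety; everything around it is verified by the compiler.

```rust
// Illustrative only: the audit surface is the one marked unsafe block.

/// Sums the bytes at even indices (0, 2, 4, ...).
fn sum_even_indices(data: &[u8]) -> u32 {
    let mut total = 0u32;
    let mut i = 0;
    while i < data.len() {
        // SAFETY: the loop condition guarantees i < data.len(),
        // so the unchecked access is in bounds.
        total += u32::from(unsafe { *data.get_unchecked(i) });
        i += 2;
    }
    total
}

fn main() {
    println!("{}", sum_even_indices(&[1, 2, 3, 4, 5]));
}
```

If this function ever read out of bounds, searching the change for the unsafe keyword would lead straight to the line responsible.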

Collaboration

I think Rust is great for collaboration because it is truly a language built for real engineering practice. It has a very smart compiler, complete documentation, a very complete toolchain, and mature package management. Most importantly, you can fully trust other people's code, which is not possible in C and C++, or even in Go.

If you have reviewed teammates' C or C++ code, you will understand: to a large extent you cannot see from the review where it goes wrong. A dangling pointer or a memory safety bug only shows up when the service core dumps in production, and the crash site is often far from the cause, so you would never guess this code was responsible. As a result you often cannot trust other people's code. In Rust this does not matter: if it compiles, the code is memory safe. So in review we only need to check whether the business logic is correct.

Rust Developer Survey

Stack Overflow runs a developer survey every year. Rust has been the most loved language for seven years in a row, and the gap between it and second place is quite obvious.

Industry Application Cases

I will also briefly cover industry adoption, because beyond wide use in the community, acceptance by enterprises is a very important indicator. Meta (Facebook) has accepted Rust as an officially supported backend language. At ByteDance, Rust is used in many scenarios, especially Feishu: if you have used Feishu, all of its logic is written in Rust. Google, Ant Financial, and many other companies use it as well.

Rust for Linux

Rust's use has also become more widespread recently. One heavyweight project is Rust for Linux: Rust is so far the only language other than C accepted into the Linux kernel, which makes it a very significant endorsement.

Compared with C++ and Go

Let me make a simple comparison with C++ and Go. In terms of learning difficulty, I think both C++ and Rust are high, and both are high in performance. In terms of safety, Rust beats the other two languages outright, and the same goes for collaboration: as mentioned before, C++ has no native package manager, and you cannot trust other people's code. As for cost of use, I think C++ is relatively high, while Go and Rust are medium. Why medium for Rust? Because cost of use is not only development cost. It also includes the cost of debugging after a service is launched and running steadily, and the losses when an incident happens. Taking all of that into account, I rate Rust's cost of use as medium.

Here is another question: if Rust is so good, and it is a language at the same level as C++, why can't C++ do as well? We all say there is no silver bullet in software engineering. The reason is that C++ carries too much historical baggage. Rust has none, so it was not designed around compatibility with old usage patterns the way C++ has to be. C++ cannot ship a new standard that stops all the old code from compiling; nobody would adopt it. This gives Rust something of an edge.

What we did

Ecosystem within the company (before)

Next, let me introduce what we have done at ByteDance. It starts with a very sad number: when we began, the Rust ecosystem inside the company was zero. There was nothing on the server side, and we had to build everything ourselves.

How to build an ecosystem

The first consideration was to make Rust usable by everyone, which meant infrastructure for compiling, packaging, and running services. Every company has its own build and packaging process, release process, and runtime environment, and these are key facilities. Development also needs intranet infrastructure such as internal equivalents of crates.io and docs.rs, plus some basic libraries and development frameworks. Each company's environment is different; what I describe here is a general process that companies wanting to try Rust can use as a reference.

Basic libraries

The basic libraries are things like logging, metrics, distributed tracing, MySQL, Redis, dynamic configuration, and MQ. We consider these necessary and very important. They need a promoter; in our case, our team acted as the promoter and built all of them.

Then there are non-essential libraries, which may be specific to particular businesses. For these you can mobilize the community: once the essentials are in place and someone can build a basic CRUD service, the rest can be written along the way while developing the business.

Development Framework

We also prepared 3 frameworks for development:

The first is a web framework based on Axum. Axum is the popular HTTP web framework maintained by the Tokio project.

The second is Volo, an RPC framework that supports gRPC and Thrift. It has been open-sourced under the CloudWeGo organization, so if you need RPC you can use it directly.

The third is Monoio, an asynchronous runtime. It is aimed at performance-critical businesses and infrastructure services. Its first advantage is the thread-per-core model, which avoids many of Tokio's constraints, such as the requirement that spawned futures be Send. With thread-per-core, a task is guaranteed to run on a single thread, so in many cases the Send and Sync bounds are unnecessary, and you can use TLS (thread-local storage) and lock-free techniques directly, which can greatly improve performance. Its second advantage is that the I/O layer is built on io_uring, a recent Linux I/O interface. If you have very high performance requirements, it is worth a look.
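The idea that thread-local state removes the need for Send, Sync, and locks can be sketched with the standard library alone. This is not Monoio's API: REQUESTS, handle_request, and run_worker are hypothetical names, and plain OS threads stand in for per-core workers.

```rust
// Illustrative only: per-thread state with no Arc, Mutex, or atomics.
use std::cell::RefCell;
use std::thread;

thread_local! {
    // Each thread gets its own counter. RefCell is not Sync, and it does not
    // need to be, because the value never crosses a thread boundary.
    static REQUESTS: RefCell<u64> = RefCell::new(0);
}

// Handles one request and returns how many this thread has handled so far.
fn handle_request() -> u64 {
    REQUESTS.with(|count| {
        let mut n = count.borrow_mut();
        *n += 1;
        *n
    })
}

// Simulates a worker pinned to one thread that handles a batch of requests.
fn run_worker(requests: u64) -> u64 {
    let mut handled = 0;
    for _ in 0..requests {
        handled = handle_request();
    }
    handled
}

fn main() {
    let workers: Vec<_> = (1..=3u64)
        .map(|id| thread::spawn(move || (id, run_worker(id * 10))))
        .collect();
    for worker in workers {
        let (id, handled) = worker.join().unwrap();
        println!("worker {id} handled {handled} requests");
    }
}
```

In a work-stealing runtime the same counter would have to be an atomic or sit behind a lock, because a task may resume on a different thread than the one it started on.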

Problems discovered

We were early adopters, so we were bound to run into problems. Mostly these were bugs in open-source libraries, or libraries that did not fully meet our needs. For example, one business recently used the Snappy library for compression and decompression and found that the compression and decompression writers could not be reused, so we submitted a PR upstream to support that.

So at the early-adopter stage you need to be able to solve problems yourself, submit PRs, and fork open-source libraries when necessary. Be mentally prepared for this, because it is what actually happens in practice. There is another problem that is not really technical: we found that many front-line engineers like Rust and want to use it. As mentioned earlier, Rust has topped Stack Overflow's most loved language list for 7 consecutive years. But many team leads worry that only one person on the team knows Rust: if that person transfers or leaves, will the service become unmaintainable? This is a concern many leads have.

How to promote landing

This is where we step in as promoters and help that engineer build some projects. It may not be a business service at first, because choosing a technology stack for a business service sometimes requires the whole team to discuss and agree, but a personal project is possible.

Many engineers who are very interested in Rust only lack someone to take the lead, or an opportunity. The most important point is to find some typical businesses to develop jointly and deliver measurable gains. In our experience these typical businesses share one trait: they are proxy-type services, compute-heavy but with relatively simple logic. Promoting adoption takes some up-front investment, but for these services it is worth it.

Practice: nightly + GAT + TAIT

I will share the specific gains later. First, some insights from practice. Do not treat nightly as a scourge; nightly is really good. It has many great features, such as GAT (generic associated types, which recently became stable) and TAIT (type alias impl Trait, still nightly-only). These two features can greatly reduce the resistance to promoting Rust.

A Timeout middleware

Let's look at the comparison without TAIT, that is, Tower. People who write Rust may know Tower, a library of middleware abstractions. Whether or not you know Rust, you do not need to follow the details here. With Tower, implementing a Timeout middleware with no extra cost or performance loss takes two screens of code.

With GAT and TAIT, the same middleware takes only this much code, which is a very obvious contrast. We actually started using these features very early.

So when promoting Rust in practice, consider turning on all of these useful features. One in particular is async fn in trait, which lets you define asynchronous functions in traits. It has reached MVP status on nightly; there are still some problems, but it is at least usable.
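To give a feel for how short a middleware becomes once a trait can contain async functions, here is a minimal sketch. It is not the Motore or Tower code shown in the talk's slides: Service, Sleepy, Deadline, and block_on are hypothetical, it assumes a toolchain where async fn in trait is available on stable (Rust 1.75 or later), and instead of a real timer-based timeout it only checks the elapsed time after the inner call returns, so it cannot cancel a slow call.

```rust
// Illustrative only: a middleware layered over a service via async fn in trait.
use std::future::Future;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::time::{Duration, Instant};

trait Service<Req> {
    type Resp;
    async fn call(&self, req: Req) -> Result<Self::Resp, String>;
}

// A toy inner service that uppercases its input after an artificial delay.
struct Sleepy {
    delay: Duration,
}

impl Service<String> for Sleepy {
    type Resp = String;
    async fn call(&self, req: String) -> Result<String, String> {
        // Blocking sleep: acceptable in this demo, never in a real async service.
        std::thread::sleep(self.delay);
        Ok(req.to_uppercase())
    }
}

// The middleware: wraps any service and rejects responses that took too long.
struct Deadline<S> {
    inner: S,
    limit: Duration,
}

impl<Req, S: Service<Req>> Service<Req> for Deadline<S> {
    type Resp = S::Resp;
    async fn call(&self, req: Req) -> Result<S::Resp, String> {
        let start = Instant::now();
        let resp = self.inner.call(req).await?;
        if start.elapsed() > self.limit {
            return Err("deadline exceeded".to_string());
        }
        Ok(resp)
    }
}

// A tiny executor so the sketch runs without an async runtime crate.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    loop {
        if let Poll::Ready(value) = fut.as_mut().poll(&mut cx) {
            return value;
        }
        std::thread::yield_now();
    }
}

fn main() {
    let svc = Deadline {
        inner: Sleepy { delay: Duration::from_millis(0) },
        limit: Duration::from_secs(1),
    };
    println!("{:?}", block_on(svc.call("ping".to_string())));
}
```

The whole middleware is the single impl block for Deadline: no hand-written future type, no boxing, and no pin projection, which is the kind of reduction the comparison above is pointing at.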

Achievements

Next, some results from our rollouts. First is a proxy business. Its CPU usage dropped from about 630% to 380%, a reduction of roughly 40%. Its memory usage dropped from 9GB to 2GB, and its P99 and average latency also improved greatly.

  1. Business A (Proxy class)
  • CPU Usage 630% -> 380%
  • MEM 9GB -> 2GB
  • P99 150-200ms -> 20-35ms
  • AVG 4-5ms -> 1.5ms
  2. Business B (with a lot of business logic)
  • CPU 400% -> 130%
  3. Business C
  • CPU -50%, MEM -95%

In the chart on the left, the red line marks when we went live, and the spikes before and after show a clear contrast. Business B has a lot of business logic; its CPU usage dropped from about 400% to 130%, roughly a threefold reduction. Business C also improved significantly.

There is also a relatively important online business whose improvement is very obvious; for example, its cost fell by about 50%. You may ask: what do these numbers prove? All I know is that CPU went down, memory went down, AVG went down, and latency went down. What does that mean? Let's do some simple math.

  1. Business D (important online business):
  • Throughput +95.96%
  • Cost -48.97%
  • avg -14.29%
  • p99 -18.18%
  • p999 -10.26%

Do the math

This is a screenshot of one cloud provider's pricing. A 64-core, 128G machine costs 6,262 yuan per month. Buying 5 years up front earns the steepest discount (quoted as 30%), so I will use that most favorable plan: about 28,000 yuan per year, or roughly 437 yuan per core per year. The business uses 10,000 cores, and we just saw that cost fell by 48.97%. Rounding to 50%, that saves 5,000 cores, or more than 2 million yuan per year. And that is the saving for a single year.

Someone at a meeting pointed out that I had not counted the engineering time. So let me count it, using a development cost that is deliberately very high, almost unrealistically so. Our service took about two or three months to rewrite, and the refactoring was done once it was written. Even if I assume 6 months, and take a very high figure for salary plus office costs, say 1.2 million yuan, the net gain in the first year is still close to 1 million, and every year after that is pure gain with no further cost. The real benefit should be far larger than this, because CPU is not actually that cheap. Those who have done cost and benefit accounting will know that an empirical figure is about 1,000 yuan per core, because beyond the raw CPU price there are network costs, data-center operations staff, and so on. The true cost is well above 437 yuan per core. Another point is that lower AVG, that is, lower latency, can itself drive business growth.
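The arithmetic above can be written out as a short worked example. It uses only the figures quoted in the talk (28,000 yuan per machine per year, 64 cores per machine, 10,000 cores, a saving rounded to 50%, a 1.2 million yuan development cost, and the 1,000 yuan per core empirical figure); the function names are hypothetical.

```rust
// Illustrative only: the cost calculation from the talk, step by step.

/// Yearly cost of one core, given the yearly machine price and its core count.
fn yuan_per_core_year(machine_yuan_per_year: f64, cores_per_machine: f64) -> f64 {
    machine_yuan_per_year / cores_per_machine
}

/// Yearly saving when a fraction of the fleet's cores is no longer needed.
fn yearly_saving(total_cores: f64, fraction_saved: f64, yuan_per_core: f64) -> f64 {
    total_cores * fraction_saved * yuan_per_core
}

fn main() {
    let per_core = yuan_per_core_year(28_000.0, 64.0); // 437.5 yuan
    let saving = yearly_saving(10_000.0, 0.50, per_core); // 2,187,500 yuan
    let net_first_year = saving - 1_200_000.0; // 987,500 yuan
    let saving_at_1000 = yearly_saving(10_000.0, 0.50, 1_000.0); // 5,000,000 yuan

    println!("cost per core per year:   {per_core}");
    println!("yearly saving:            {saving}");
    println!("first-year net gain:      {net_first_year}");
    println!("saving at 1000 yuan/core: {saving_at_1000}");
}
```

At the list-price figure the rewrite pays for itself within the first year, and at the 1,000 yuan per core figure the yearly saving is more than four times the assumed development cost.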

Looking Ahead: Opportunities and Challenges

The state of Rust

The current state of Rust: it is usable. It works, it is functional, it is ergonomic, but it is not yet good enough.

Second, its abstraction and expressive power are quite strong, but high-level abstraction still has some problems. People who have used HRTB (higher-ranked trait bounds) may know there are pitfalls when HRTB is combined with GAT or TAIT. These scenarios are advanced and rarely come up in business development, so this is acceptable. Third, Rust's async ecosystem is fairly complete now, but it is somewhat split from the sync ecosystem. For example, a single function cannot be both asynchronous and synchronous; the async version and the sync version have to be written as two separate functions.

The Rust project is aware of these problems and is working on them, notably in the 2024 roadmap. Niko, for example, is exploring a new approach in which one function can serve as both the async and the sync version: if the caller is async, an async implementation is generated at compile time.

There is also good news: developers love it, and user loyalty is very high. I believe people who have written Rust do not consider switching to other languages; at least I will not go back to writing Go. Open-source projects are also growing explosively. In the past two years especially, you can clearly feel that more and more open-source projects adopt Rust, whether new or rewritten, and more and more companies are willing to start using it.

Directions for Rust apps

Rust's application directions are very broad. To list them briefly: Rust for Linux fills in the last piece of the puzzle, so the OS layer and the embedded layer can now be written in Rust too. Another well-known and popular direction is WebAssembly, where Rust is in the first tier and is arguably the best-suited language.

Challenges

Here are some of the challenges we currently face. The first is that there are not enough Rust positions and not enough Rust talent. The two depend on each other: more talent leads to more positions, and more positions lead to more talent. This is not something a single company can fix by investing; we hope all programmers who love Rust will work together to build the ecosystem. The second is that Rust is not yet well known enough in China. It is slowly improving, but there is a real gap in visibility between Rust and Go, which you can see from the number of training courses.

The third is that Rust lacks a killer app like K8S, a point that is raised again and again. The following is a chart from the official Rust survey. 22.51% of respondents say they use Rust for the majority of their coding, 17% say it is one of the languages they use regularly, and 18% say they use it only occasionally. What this chart really speaks to is whether the Rust language can generate value.

The next chart is from the same official survey: has Rust helped you achieve your goals? 80% of respondents think it has. The bad news is that 47% find using Rust challenging. Another item further down is at 70%. 80% think Rust is worth the cost, and 90% expect to keep using Rust in the future.

Opportunity

We now also face a new opportunity: reducing costs and increasing efficiency. Everyone is paying more and more attention to low-level technology, and Rust gives very strong control, which makes it a very handy tool in that field. Rust also has plenty of attention now, and the community is developing rapidly.

Embrace open source and give back to the community

We have built a lot inside the company and invested a great deal of time and energy in the related infrastructure. We take from open source, so we give back to open source, and we have open-sourced our core components. Volo is our RPC framework; if you or your company need a microservice framework, consider using it. Monoio, introduced earlier, is an ultra-high-performance async runtime with a thread-per-core model built on io_uring. We also created Motore, a middleware abstraction modeled on Tower that directly uses the two important features GAT and TAIT. And to work around some well-known network problems in China, we provide a proxy, RsProxy, for everyone; if you often find that crates cannot be downloaded, consider using it.

• Volo:https://github.com/cloudweGo/volo

• Monoio:https://github.com/bytedance/monoio

• Motore:https://github.com/cloudwego/motore

• RsProxy:https://rsproxy.cn/


Origin my.oschina.net/u/4843764/blog/5699348