Save 90% of compilation time: ByteDance's open-source, Rust-based front-end build tool

Rspack is a high-performance build engine written in Rust. It interoperates with the Webpack ecosystem while providing much better build performance.

When dealing with large applications that have complex build configurations, Rspack can deliver a 5-10x improvement in compilation performance.

Since ByteDance open sourced Rspack, the project has earned 4,700+ stars on GitHub.

At the "GOTC Global Open Source Technology Summit - Rust Forum" held on May 28, 2023, ByteDance front-end engineer He Xiangjun introduced Rspack, a new generation of front-end construction tools. Today we will introduce the content shared this time .

Content outline:

  • Introduction to Rspack
  • Native technology selection for the front-end toolchain
  • Problems encountered and their solutions
  • Rspack performance gains
  • Future outlook for Rspack

Rspack Introduction and Technical Architecture

In recent years, web applications have grown larger and larger. A medium-to-large project may contain tens of thousands of modules, and bundling it with Webpack can take 5 to 10 minutes.

Although some build tools, such as esbuild and Vite, have addressed Webpack's slow build speed in recent years, they still cannot completely replace Webpack functionally.

In this context, we decided to port Webpack to Rust, improving build performance as much as possible without sacrificing Webpack's flexibility and rich feature set.

Let me briefly introduce Rspack's architecture. It is similar to Webpack's architecture, with multi-threaded parallel acceleration applied to many stages.

It can be divided into two main parts. The first is the make stage, which analyzes project dependencies and generates a module dependency graph; the second is the seal stage, which mainly performs product optimization and produces the final output.

Product optimization mainly includes tree-shaking, bundle-splitting, code-splitting, and minification.

Tree-shaking uses a mark-and-sweep algorithm similar to garbage collection: it traverses all code that may be executed and removes the code that will never be executed.
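To make the mark-and-sweep idea concrete, here is a minimal sketch (not Rspack's actual implementation; the module graph and names are made up): mark everything reachable from the entry module, then sweep away whatever was never marked.

use std::collections::{HashMap, HashSet};

// Module id -> ids of the modules it imports (a toy stand-in for the module graph).
type ModuleGraph = HashMap<&'static str, Vec<&'static str>>;

fn tree_shake(graph: &ModuleGraph, entry: &'static str) -> HashSet<&'static str> {
    let mut reachable = HashSet::new();
    let mut stack = vec![entry];
    // Mark phase: depth-first traversal from the entry module.
    while let Some(module) = stack.pop() {
        if reachable.insert(module) {
            if let Some(deps) = graph.get(module) {
                stack.extend(deps.iter().copied());
            }
        }
    }
    // Sweep phase: everything that was never marked is dead code.
    graph
        .keys()
        .copied()
        .filter(|m| !reachable.contains(m))
        .collect()
}

fn main() {
    let graph: ModuleGraph = HashMap::from([
        ("entry.js", vec!["util.js"]),
        ("util.js", vec![]),
        ("unused.js", vec![]),
    ]);
    // Prints {"unused.js"}: the only module nothing reaches.
    println!("{:?}", tree_shake(&graph, "entry.js"));
}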

Code-splitting regroups modules and applies various strategies to split them into several chunks, ultimately achieving faster browser loading and a higher CDN cache hit rate.

Technology Selection

So, how did we choose the technology for Rspack?

Our goals, like those of most native tools on the market today, come down to two things: maintain compatibility with the JavaScript API of the tool being migrated from, and increase build speed as much as possible.

After a brief survey of the candidate language ecosystems, we were left with three options:

1. Rust

2. Javascript(Node.js)

3. Golang

Why not use JavaScript (Node.js)?

With Node.js we would not have to worry about API compatibility, but Node.js has little room for single-thread optimization, so we would have to rely on the multi-threading capabilities Node.js provides to improve performance.

We ran into problems when actually doing multi-threaded programming in Node.js. Although Node.js provides worker_threads for multi-threading, it simulates multi-threading by creating new V8 instances, and these V8 instances have no way to share memory.

If you want inter-thread communication, the only option is message passing. But worker_threads messaging has a problem: every message must be structured-cloned, that is, deep copied. There is no way to move an object directly to another thread as in Rust, which adds communication overhead.

Second, its concurrent programming ecosystem is relatively weak. It does not provide the rich low-level data structures and concurrency primitives that the Rust community does: there are no ready-made lock-free concurrent data structures, and only a few basic atomic types are supported.

To give you a more intuitive feel, I wrote a fairly simple benchmark.

A simple multi-threaded benchmark: using multiple threads to solve a producer-consumer problem.
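For reference, a minimal Rust sketch of that benchmark shape (an assumed setup, not the original benchmark code): producers send owned values through a channel, so data is moved between threads rather than deep copied the way worker_threads messages are.

use std::sync::mpsc;
use std::thread;

fn main() {
    let (tx, rx) = mpsc::channel::<Vec<u64>>();

    // Producers: each thread builds values and moves them into the channel.
    let producers: Vec<_> = (0..4u64)
        .map(|id| {
            let tx = tx.clone();
            thread::spawn(move || {
                for i in 0..1_000u64 {
                    // The Vec is moved to the consumer, not structurally cloned.
                    tx.send(vec![id, i]).unwrap();
                }
            })
        })
        .collect();
    drop(tx); // close the channel once all producers are done

    // Consumer: keeps receiving until every sender has been dropped.
    let consumer = thread::spawn(move || rx.iter().map(|batch| batch.len() as u64).sum::<u64>());

    for p in producers {
        p.join().unwrap();
    }
    println!("items consumed: {}", consumer.join().unwrap());
}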

 

Result: (benchmark comparison chart omitted)

Why not Golang?

Golang itself performs well enough, but we did not choose it for the following two reasons.

1. Due to its language positioning and ecosystem, Golang does not support napi well.

Why is napi so important to us?

Because Webpack's plugin API is very flexible: in addition to literals and object types, it also supports passing functions for dynamic configuration at runtime.

Traditional IPC can also simulate function calls, but to call a JavaScript function from the native side we would need to serialize the arguments, pass them to JavaScript via IPC, deserialize them on the JavaScript side, execute the JavaScript function, and finally pass the return value back to the native side. A single function call therefore requires two cross-process communications.

The number of function calls can be proportional to the number of modules, so when the module count is large these extra costs cannot be ignored. napi lets us pass function references to the native side, cutting down the inter-process communication overhead.
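As a rough illustration, here is a minimal napi-rs sketch (assumed napi-rs 2.x style; the hook name is made up and the exact signatures depend on the napi-rs version, so treat this as a sketch rather than Rspack's actual plugin code). A native function receives a JS callback and invokes it in-process, avoiding the serialize -> IPC -> deserialize round trip described above.

use napi::{JsFunction, Result};
use napi_derive::napi;

// Hypothetical hook: the JS side passes a callback, and the native side calls it
// directly through N-API instead of going through inter-process message passing.
#[napi]
pub fn on_module_built(callback: JsFunction) -> Result<()> {
    // The callback's return value is ignored in this sketch.
    callback.call_without_args(None)?;
    Ok(())
}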

2. Golang's front-end toolchain ecosystem is not mature and prosperous enough.

The Golang community does provide infrastructure for building a front-end build tool, such as a JavaScript parser and a CSS parser, and it can do some simple analysis, but it does not support transpiling ES6 to ES5. We would have to bring in another transpiler for that, which would undoubtedly add extra cost (running two transpilers seriously hurts performance).

Why is transpiling down to ES5 important to us?

Because the average browser version in China is not very new; to support users on older browsers, we must transpile the code down to ES5.

Why use Rust?

The reason for choosing Rust is relatively simple: it does not have the problems above (this is not to say Rust is perfect, only that it avoids the issues of the previous two options in our scenario).

1. Rust's performance is excellent, on the same level as C/C++.

2. Its napi support is good, which reduces the mental burden of staying compatible with Webpack's complex API. In addition, thanks to macro support, we can avoid a lot of boilerplate code.

3. As a first-class citizen of WASM, Rust has better support for WASM features and adopts new ones faster, making it easier for us to bring existing tools to the web.

4. SWC in the Rust ecosystem provides a rich AST-manipulation API and supports transpiling down to ES5.

Summary

At this stage, if you want to speed up front-end tools by porting them, Rust is definitely worth a try. The reasons are as follows:

1. If you need the tool to run on the web as well, Rust has excellent support for WebAssembly (WASM). Combined with wasm-pack, you can bring your tools to the web platform at very low cost.

2. Since the Rust community (napi-rs) has mature napi support, you can easily stay compatible with complex JavaScript APIs.

3. In the past few years, many novice-friendly tutorials have emerged in the Rust community, which has greatly lowered the barrier to entry.

4. The Rust community has many existing front-end tool porting cases to learn from; compared with other languages, its front-end tooling ecosystem is more prosperous.

Performance gains

In our experiments, Rspack took significantly less time than Webpack.

In production mode, build time dropped by nearly 90%, from 146 seconds to 16 seconds.

In development mode, build time dropped by 87%.

Problems Encountered & Solutions

Below we walk through some performance optimization techniques via two issues we encountered.

Multi-thread optimization (example: fixing SWC's poor concurrent parsing performance)

  • In development mode there is not much optimization work, so parsing is the main bottleneck of this stage
  • Profiling revealed a large number of lock-related system calls during parsing
  • We eventually found that SWC uses a string-interning library, string-cache

A brief introduction to string cache

In many programming languages, string constants (literals) are immutable, which means that if the same string constant is used multiple times in a program, each use may create a new object in memory. This wastes memory and can hurt performance.

To avoid this problem, some programming languages provide a string pool or string cache mechanism. The string pool stores string constants, is maintained automatically at runtime, and guarantees that each string constant has only one instance. That way, if the same string constant is used multiple times in a program, every use points to the same object in the pool, saving memory and improving performance.

Briefly introduce string-cache
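As a minimal usage sketch (illustrative, not SWC's actual code): with the string-cache crate, interned strings become Atoms, and two Atoms built from the same content share a single cached entry, so equality checks are cheap.

use string_cache::DefaultAtom as Atom;

fn main() {
    // Interning: both atoms resolve to the same cached entry.
    let a = Atom::from("console");
    let b = Atom::from("console");

    // Comparison does not need a byte-by-byte string compare.
    assert!(a == b);
    println!("interned: {}", &*a);
}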


Performance bottleneck of String cache

string-cache wraps the whole intern insert operation in a Mutex, so in a multi-threaded scenario only one thread can intern strings at any moment while the other threads wait. This is why we saw so many lock-related system calls during parsing.

Performance optimization method: move the single big lock around insert down to the bucket level, so that only two strings hashing to the same bucket contend with each other; strings that hit different buckets can be interned in parallel.
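A simplified sketch of that idea (not the actual string-cache patch; the interner here is hand-rolled for illustration): shard the intern table and give each shard its own lock, so threads contend only when they hash to the same shard.

use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::Mutex;

const SHARDS: usize = 16;

struct ShardedInterner {
    shards: Vec<Mutex<HashMap<String, u32>>>,
}

impl ShardedInterner {
    fn new() -> Self {
        Self {
            shards: (0..SHARDS).map(|_| Mutex::new(HashMap::new())).collect(),
        }
    }

    fn intern(&self, s: &str) -> u32 {
        // Pick a shard by hash; only that shard's lock is taken.
        let mut hasher = DefaultHasher::new();
        s.hash(&mut hasher);
        let shard = (hasher.finish() as usize) % SHARDS;

        let mut map = self.shards[shard].lock().unwrap();
        // Encode the shard in the id so ids stay unique across shards.
        let next_id = ((shard as u32) << 16) | map.len() as u32;
        *map.entry(s.to_string()).or_insert(next_id)
    }
}

fn main() {
    let interner = ShardedInterner::new();
    assert_eq!(interner.intern("console"), interner.intern("console"));
}

In practice, a sharded concurrent map crate such as dashmap achieves the same effect without writing the sharding by hand.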

 

Comparison of string-cache optimization results:

Development mode gets a 41% improvement and production mode a 4% improvement.

Summary

To make full use of multiple cores, reduce lock usage, and maximize CPU utilization, here are some common strategies:

  • Use lock-free data structures (crossbeam, etc.)
  • Use rayon (iter -> par_iter): split mutable and immutable code and parallelize as much of your code as possible with rayon (see the sketch after this list)
  • Reduce lock granularity and shrink unnecessary critical sections
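Here is a minimal rayon sketch of the iter -> par_iter pattern (the per-module work below is made up for illustration): the code keeps the same shape, but the work items are spread across rayon's thread pool.

use rayon::prelude::*;

// A made-up stand-in for per-module work such as parsing or transforming.
fn process(source: &str) -> usize {
    source.split_whitespace().count()
}

fn main() {
    let sources = vec!["const a = 1", "let b = a + 1", "export default b"];

    // Sequential version: sources.iter().map(|s| process(s)).sum::<usize>()
    // Parallel version: same shape, but items run on rayon's worker threads.
    let total: usize = sources.par_iter().map(|s| process(s)).sum();
    println!("total tokens: {}", total);
}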

Algorithm optimization: an accidentally introduced O(n^2) algorithm causes performance problems

Background

According to feedback from the business side, there was a big performance gap in production builds between enabling source maps and not enabling them.

Before optimizing, you need to pick a profiling tool you are comfortable with:

  • Instruments
  • Samples
  • tracing (tokio tracing + perfetto / chrome-tracing)
  • Perf
  • flamegraph

Use sample to generate a profile:

How String is stored in Rust: a String is a UTF-8 encoded byte buffer, so a character's index in the string and its byte offset are not the same thing.

Take "good rspac" as an example:

  • Get the "good" byte offset
  • "You".len_utf8()=3
  • Get "k" byte offset
  • ['you','good','r','s','p','a','c' l.iter().map(|ch| ch.len_utf8()).sum()= 11
  • Use range 3..11 to get string slice
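A quick check of those numbers (illustrative only):

fn main() {
    let s = "你好rspack";
    // Byte offset of '好': everything before it is '你', which takes 3 bytes in UTF-8.
    let start = '你'.len_utf8(); // 3
    // Byte offset of 'k': the 7 characters before it add up to 3 + 3 + 5 * 1 = 11 bytes.
    let end: usize = ['你', '好', 'r', 's', 'p', 'a', 'c'].iter().map(|c| c.len_utf8()).sum();
    assert_eq!(&s[start..end], "好rspac");
}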

A brief introduction to substring::Substring

// From the substring crate: substring(start_index, end_index) takes *character*
// indices, so each call has to walk the string with char_indices from the start.
trait Substring {
    fn substring(&self, start_index: usize, end_index: usize) -> &str;
}

impl Substring for str {
    fn substring(&self, start_index: usize, end_index: usize) -> &str {
        if end_index <= start_index {
            return "";
        }

        let mut indices = self.char_indices();

        let obtain_index = |(index, _char)| index;
        let str_len = self.len();

        unsafe {
            // SAFETY: Since `indices` iterates over the `CharIndices` of `self`, we can guarantee
            // that the indices obtained from it will always be within the bounds of `self` and they
            // will always lie on UTF-8 sequence boundaries.
            self.slice_unchecked(
                // `nth` advances the iterator, so every call is an O(n) scan of the string.
                indices.nth(start_index).map_or(str_len, &obtain_index),
                indices
                    .nth(end_index - start_index - 1)
                    .map_or(str_len, &obtain_index),
            )
        }
    }
}

Causes of performance bottlenecks:

1. The number of substring calls is proportional to the number of mappings

2. Usually the number of mappings is relatively small, so in most cases this implementation has no performance problems

3. In the minify scenario, every line of code except the first has its position changed, so the number of mappings is only a constant multiple of the size of the minified output (the constant is ignored when computing time complexity)

4. Since the number of calls grows with n and each substring call re-scans the string from the beginning, the overall time complexity is about O(n^2), where n is the size of the minified output

Performance optimization:

Precompute the mapping from each char offset to its byte offset with a prefix-style offset array, so that each lookup becomes O(1).
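A minimal sketch of that idea (illustrative, not the actual Rspack source-map code): build the char-offset -> byte-offset table once in O(n), after which every substring lookup is just two array reads.

struct CharOffsets {
    // byte_offsets[i] is the byte offset where the i-th character starts;
    // one extra entry holds the total byte length so ranges can end at the last char.
    byte_offsets: Vec<usize>,
}

impl CharOffsets {
    fn new(s: &str) -> Self {
        let mut byte_offsets: Vec<usize> = s.char_indices().map(|(i, _)| i).collect();
        byte_offsets.push(s.len());
        Self { byte_offsets }
    }

    // O(1) per call, instead of re-walking the string with char_indices every time.
    fn substring<'a>(&self, s: &'a str, start_char: usize, end_char: usize) -> &'a str {
        &s[self.byte_offsets[start_char]..self.byte_offsets[end_char]]
    }
}

fn main() {
    let s = "你好rspack";
    let offsets = CharOffsets::new(s);
    assert_eq!(offsets.substring(s, 1, 7), "好rspac");
}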

 

Take a look at the optimization benefits:

 

There is a 30%~1000% improvement for products of different sizes.

Summary:

  • A good profiling tool gets you twice the result with half the effort
  • Algorithms of different time complexities show little difference when the data size is small
  • When analyzing an algorithm's time complexity, you cannot count only the code you can see; you also have to account for the functions and libraries it calls

Future outlook

For Rspack, we plan to do three aspects of work in the future:

1. Use io-uring to speed up the IO part

2. Learn from salsa-rs to further optimize incremental build performance

3. Explore using native languages to write high-performance plugins

That’s all for today’s sharing, thank you all.

GitHub:

https://github.com/web-infra-dev/rspack

Official website:

https://www.rspack.dev/zh/


Source: blog.csdn.net/weixin_47098359/article/details/131115372