Hango Rider: NetEase Shufan Open-Sources an Enterprise-Grade Custom Extension Framework for Envoy

Extensibility is one of the most critical features of network proxy software: flexible and powerful extension mechanisms greatly expand what a proxy can do. Envoy, an emerging open source high-performance network proxy, already offers fairly rich extensibility, such as native extensions in C++ and dynamic extensions based on WASM or Lua. Each of these mechanisms, however, has its own limitations. While deploying Envoy-based gateways and service meshes at scale, NetEase Shufan built Rider, a Lua-based enterprise-grade custom extension framework for Envoy. Rider is used in the Qingzhou microservice platform to meet business requirements for high performance and rich functionality.

Rider (https://github.com/hango-io/rider) has been fully open-sourced and integrated into the open source API gateway Hango (https://github.com/hango-io/hango-gateway), providing flexible, powerful, and easy-to-use custom extension capabilities for the Hango gateway.

1. The current state of Envoy extensibility

In Internet architectures, almost any system exposed to the outside world needs a network proxy. HAProxy and Nginx, which appeared earlier, remain popular; in the microservices era, API gateways with richer features and stronger management capabilities have become required components at the traffic entry point. Thanks to its excellent performance, extensibility, and observability, Envoy has become the data plane of choice for many API gateways. Beyond the basics a traffic proxy requires, Envoy natively implements many advanced capabilities, such as advanced load balancing, circuit breaking, and rate limiting, so an Envoy-based API gateway already covers the needs of most applications. In real-world use, however, some applications or businesses need to extend the gateway with functionality of their own, perhaps simple processing of HTTP headers, or integration with their own APM system. The gateway must therefore let applications add such functionality themselves, and this ability ultimately depends on the extensibility Envoy provides. Let us look at the extension mechanisms Envoy currently offers.

1.1 Native C++ Extensions

Envoy supports native C++ plug-ins through a pluggable filter mechanism. As shown in the figure below, L4 filters extend protocol proxying and L4 traffic governance, while L7 filters implement rich traffic governance features.

(Figure: Envoy's pluggable L4/L7 filter chain)

Since this extension mechanism is native to Envoy, its performance is naturally the best, but it faces two main problems. First, plug-in developers must be proficient in C++. Second, after a plug-in is written, the Envoy binary must be recompiled and the deployment upgraded; plug-ins cannot be loaded dynamically. To solve these two problems, the Envoy community implemented extension mechanisms based on Lua and then WASM. Let us first look at how the community Lua extension works.

1.2 Community Lua extension

To develop in Lua the kind of Envoy plug-ins originally written in C++, two questions must be answered: first, how Lua scripts are executed inside the Envoy process; second, how Lua scripts access Envoy's internal data and functions, for example to read headers and bodies. These two questions frame the implementation of the community Lua extension (and, in fact, of WASM and Rider as well).

As shown in the figure below, the difference from the native C++ scheme above is that one of Envoy's L7 filters is a Lua plug-in. This Lua plug-in, itself written in C++, answers both questions. First, how are Lua scripts executed inside the Envoy process? Through the Lua plug-in: because it is written in C++, it can load and run Lua scripts in an embedded Lua virtual machine. Second, how do Lua scripts access Envoy's internal data and functions? The Lua plug-in exposes them to the scripts through the Lua C API.

(Figure: the community Lua filter embedding a Lua VM in Envoy's L7 filter chain)
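To make this concrete, below is a minimal sketch of the community Lua filter as configured in an Envoy listener (the header name and log line are illustrative; the filter name, typed_config type, and the envoy_on_request/envoy_on_response entry points follow Envoy's documented Lua filter API):

```yaml
http_filters:
- name: envoy.filters.http.lua
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.lua.v3.Lua
    inline_code: |
      -- Runs inside the Lua VM embedded in Envoy's C++ Lua filter.
      function envoy_on_request(request_handle)
        -- Envoy data and functions are reached through the handle.
        request_handle:headers():add("x-lua-extended", "true")
      end
      function envoy_on_response(response_handle)
        response_handle:logInfo("status: " ..
          response_handle:headers():get(":status"))
      end
```

Note that there is no plug-in configuration block anywhere in this snippet; the script is the entire extension, which is exactly the limitation discussed next.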

On the one hand, the community Lua extension lets users develop plug-ins in Lua, which is much simpler than C++; on the other hand, it lets Envoy load Lua scripts dynamically without recompilation or redeployment. At the same time, the interaction between C++ and the Lua virtual machine introduces overhead, so Lua extensions are naturally slower than native C++ extensions, and the Lua C API interaction style the Envoy community uses makes the problem worse. Beyond performance, the community Lua extension has a bigger flaw: it does not support plug-in configuration, which sharply limits its practicality. By contrast, WASM and Rider both make plug-ins configurable, and Rider additionally optimizes Lua performance, so Rider's Lua extensions can meet enterprise-level requirements in both performance and functionality.

1.3 Community WASM Extension

WASM is a technology that originated in the front-end world, born to cope with increasingly complex Web applications and the limited performance of interpreted JavaScript. WASM is not a language but a bytecode standard: in theory, any language can be compiled to WASM bytecode and executed in a WASM virtual machine.

The WASM extension works on the same principle as the Lua extension. Envoy implements a WASM plug-in as one of its own L4 or L7 filters; the plug-in embeds a WASM virtual machine that dynamically loads and runs pluggable extension code compiled to WASM bytecode, and it likewise exposes an interface through the virtual machine for accessing Envoy's internal data and functions. The principle is shown in the following figure:

(Figure: the WASM filter embedding a WASM VM in Envoy's filter chain)
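A sketch of how such a plug-in is wired into the proxy (the plug-in name, root_id, and file path are illustrative; the field layout follows the Envoy 1.17-era WASM filter configuration): the filter references precompiled bytecode and runs it in the embedded VM:

```yaml
http_filters:
- name: envoy.filters.http.wasm
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.wasm.v3.Wasm
    config:
      name: "example_plugin"        # illustrative plug-in name
      root_id: "example_root"
      vm_config:
        runtime: "envoy.wasm.runtime.v8"   # the embedded WASM VM
        code:
          local:
            filename: "/etc/envoy/example_plugin.wasm"  # illustrative path
```

Because the bytecode file is referenced from configuration, swapping in a new plug-in is a config push rather than a binary rebuild.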

WASM seems to solve all of Envoy's extensibility problems perfectly: it supports multiple languages, dynamic plug-in loading, and plug-in configuration. In our tests, however, WASM stumbled badly on performance. Even plug-ins compiled from C++ performed worse than Lua extensions, let alone plug-ins written in other languages (detailed comparison results appear in the performance section below).

1.4 Summary

The following table summarizes the characteristics of the current extensibility options: native C++ extensions have the best performance but do not support dynamic loading; the community Lua extension supports dynamic loading but not plug-in configuration, which makes it nearly unusable in practice; the community WASM extension supports both dynamic loading and configuration, but performs poorly.

| Extension scheme | Plug-in configuration | Dynamic loading | Performance | Supported languages | Development complexity |
| --- | --- | --- | --- | --- | --- |
| Native C++ | Yes | No | Best | C++ | Complex |
| Community Lua | No | Yes | Poor | Lua | Simple |
| Community WASM | Yes | Yes | Poor | C++/Rust/Go, etc. | Medium |

Weighing the pros and cons of these solutions, NetEase Qingzhou microservices designed and implemented its own extension framework, Rider, with the following main design goals:

  • Support plug-in development in Lua
  • Support dynamic loading, updating, and removal of Lua plug-ins in Envoy
  • Support defining Lua plug-in configuration
  • Support a configurable effective scope for Lua plug-ins: gateway level, project level, or route level
  • Deliver better performance than the Envoy community Lua and WASM extensions

Next, let's look at the design, optimization, and practice of the Rider extension framework.

2. Design, optimization, and practice of the Rider extension framework

2.1 Early exploration

To address the community Lua extension's poor performance and lack of plug-in configuration, Rider's early architecture consisted of two modules:

  • Rider Filter: an Envoy L7 filter that initializes and invokes Lua code, exposing Envoy's internal data and functions to the Lua SDK through the Lua C API or the FFI. Note that Rider implements most interfaces with the FFI, whose theoretical performance is better than the C-API-based community Lua extension;
  • Lua SDK: a code framework for Lua plug-ins; users implement request processing by calling the APIs the SDK provides. Note that the SDK exposes APIs for reading global and route-level plug-in configuration, so Rider's Lua extensions support plug-in configuration, solving the community Lua extension's biggest shortcoming.

The following figure is the overall architecture diagram:

(Figure: Rider's early architecture: Rider Filter plus Lua SDK)

Our early architecture basically met the extensibility needs: Lua language support, dynamic loading of Lua plug-ins, Lua plug-in configuration, and so on. But the following problems remained:

  • Some Rider Filter interfaces still did not use the FFI, leaving performance slightly short;
  • The Lua SDK needed further enrichment to support more plug-in functionality;
  • A severe Rider performance problem, discovered while addressing the first issue.

To address these problems, we refined Rider's architecture, analyzed its performance in detail, and arrived at a new architecture.

2.2 Practical optimization

The first task for the new Rider architecture was to push the FFI through to the end. As background, Lua calls C in two ways. One is the native C API: each call allocates a stack space (unlike a stack frame, this is a contiguous block allocated on the heap) and passes parameters and return values through it. The other is the FFI provided by LuaJIT: C functions are called directly from Lua using C data structures, the code can benefit from JIT compilation, and performance improves greatly over native Lua. We therefore wanted to convert the interfaces Rider implemented with the C API to the FFI.
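As a standalone illustration of the FFI style (independent of Envoy and Rider; assumes a LuaJIT runtime), a C function is declared once and then called directly from Lua, with no stack-based marshalling:

```lua
-- LuaJIT FFI demo: declare a libc function and call it directly.
local ffi = require("ffi")

ffi.cdef[[
  size_t strlen(const char *s);  // declaration copied from the C header
]]

-- In a compiled trace this becomes a direct C call; arguments and
-- results do not pass through the Lua stack.
local n = tonumber(ffi.C.strlen("rider"))
print(n)  -- 5
```

Contrast this with the C API, where the C side must push and pop every argument and result through a Lua stack by hand.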

The first step of the conversion hit a problem: the early Rider architecture apparently could not expose Envoy's body-related interfaces via the FFI. Consider why early Rider had to use the native C API for them. In the early architecture, the Lua code has two entry points, on_request and on_response, which the framework requires every plug-in to implement; when executing the Lua code, the Rider Filter looks up only these two functions in the Lua virtual machine and runs them in the decodeHeaders and encodeHeaders stages respectively. If on_request or on_response calls a body-related interface, the Rider Filter has not yet reached the decodeData or encodeData stage and the body data is not yet available. The only option is to suspend the Lua coroutine and resume it once the Rider Filter reaches decodeData or encodeData, and such suspension cannot be expressed through the FFI; it requires interacting with the Lua virtual machine.

(Figure: Rider's new architecture with separate header-stage and body-stage hooks)

To solve this, Rider's new architecture refines the earlier one. As the figure above shows, the modules are unchanged; what changed are the functions the framework requires a Lua plug-in to implement and the points at which the Rider Filter calls them. The original on_request and on_response are each split into a header-stage and a body-stage function, called while the Rider Filter processes headers and bodies respectively, which removes the need to suspend the Lua coroutine for body access (we later found that WASM makes a similar subdivision). The request processing flow of the new architecture is as follows:

  • Rider plug-in configuration (including the Lua plug-in's own configuration) is delivered through Pilot as part of LDS and RDS;
  • Envoy builds an L7 filter chain for each HTTP request, including the Rider Filter. During initialization, the Rider Filter loads the Lua SDK module and the corresponding plug-in from the file system into the Lua VM;
  • During request processing, the Rider Filter calls the Lua code's on_request_header and on_request_body methods in the Decode stages, and on_response_header and on_response_body in the Encode stages;
  • While the user's Lua code runs, it calls the interfaces encapsulated by the Rider Filter through the Lua SDK, for example to read or modify request and response data, call external services, or print logs.
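Putting the flow together, a Rider plug-in is a Lua module implementing the four stage hooks and calling the SDK inside them. The skeleton below is illustrative only: it uses the hook names above and API names from the SDK list later in this article, but the configuration field (mark_requests) and header names are invented, and the exact module conventions live in the Rider repository:

```lua
-- Illustrative Rider plug-in skeleton. The `envoy` table is provided
-- by the Rider Lua SDK at runtime, not by standard Lua.

function on_request_header()
    -- Route-level plug-in configuration, delivered via LDS/RDS.
    local conf = envoy.get_route_config()
    if conf and conf.mark_requests then
        envoy.req.set_header("x-rider-marked", "true")
    end
end

function on_request_body()
    -- Body access is safe here: this hook runs in the decodeData stage.
    local body = envoy.req.get_body()
    envoy.logDebug("request body size: " .. (body and #body or 0))
end

function on_response_header()
    envoy.resp.set_header("x-served-by", "rider")
end

function on_response_body()
end
```

Because each hook runs only once its stage's data is available, none of them ever needs to yield the coroutine to wait for the body.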

2.2.1 Performance optimization

The new architecture was designed for performance, so we ran performance tests as soon as it was built. The test setup was as follows:

  • Environment: local container environment
  • Backend: Nginx, 4 cores
  • Proxy (Envoy): 4 cores
  • Client: wrk, 4 threads, 32 connections
  • Lua plug-in: calls the get_body interface 100 times

Implementations compared:

  • CAPI: Rider's C API implementation, the original Lua-C interaction that passes parameters and return values through stack space;
  • FFIOld: Rider's early FFI implementation.

The results are shown in the figure below. A classic negative optimization, and a large one (amplified, of course, by the 100 calls): FFIOld's QPS is 30% lower than CAPI's. Our first thought was that the code was at fault and the new architecture had introduced extra overhead, so we also tested Rider's header API implemented on FFIOld. Its performance was similarly poor, which meant the problem lay in FFIOld itself: the APIs implemented with Rider's early FFI could actually be slower than the C API!

(Figure: get_body benchmark, CAPI vs. FFIOld)

So we dug into how the FFI works. The FFI is a feature of LuaJIT, a virtual machine that runs Lua. Like the Java virtual machine, LuaJIT has two execution modes, compiled and interpreted (compiled mode performs better; the reasons are worth exploring if you are interested). LuaJIT starts in interpreted mode, records compilable hot code as the program runs, then tries to translate that hot code directly into machine code on subsequent runs, improving performance.

Back to Rider. Rider's Lua plug-ins also run in the LuaJIT virtual machine, and tens of thousands of QPS certainly make the plug-in code hot, so LuaJIT tries to translate the plug-in, including the C functions declared via the FFI, into machine code. Performance should improve, yet it fell short of expectations. The reason is that LuaJIT only attempts the translation; the attempt can fail, and on failure execution falls back to interpreted mode, where performance drops sharply. LuaJIT provides a way to confirm whether a program is running in compiled mode: add the following code at the top of the Lua script:

```lua
local verbo = require("jit.v")
verbo.start()
```

Running the load test again, LuaJIT printed the following:

(Figure: LuaJIT jit.v output showing aborted traces)

This output shows two key pieces of information: first, a TRACE line ending in --- means LuaJIT has aborted the trace and left compiled mode; second, the reason for leaving compiled mode is an unsupported type conversion of a parameter in an FFI-declared C function. Tracking that parameter down leads to this code:

```lua
local function get_header_map_value(source, key)
    local ctx = get_context_handle()
    if not ctx then
        error("no context")
    end

    if type(key) ~= "string" then
        error("header name must be a string", 2)
    end

    local buffer = ffi_new("envoy_lua_ffi_str_t[1]")
    local rc = C.envoy_http_lua_ffi_get_header_map_value(ctx, source, key, #key, buffer)
    if rc ~= FFI_OK then
        return nil
    end
    return ffi_str(buffer[0].data, buffer[0].len)
end
```

The ctx passed to C.envoy_http_lua_ffi_get_header_map_value(ctx, source, key, #key, buffer) is a Lua light userdata, while the function declaration types that parameter as a pointer to a class. LuaJIT cannot perform this conversion during trace compilation, so it exits compiled mode. With the cause clear, the next step was to fix it. The design details are too involved to cover here; if you are interested, see our open source repository. Let's look at the results after optimization.
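One common way around this kind of NYI (not-yet-implemented) trace abort, shown here as a sketch rather than Rider's actual design, is to keep the context handle in FFI land: declare an opaque struct type and obtain the handle through an FFI call, so it arrives as typed cdata that LuaJIT can compile, instead of a light userdata coming from the C API:

```lua
local ffi = require("ffi")

-- Sketch: an opaque handle type (the names are illustrative). A ctx
-- obtained via an FFI call is cdata of type ContextBase*, which LuaJIT
-- can keep on the compiled trace; a light userdata cannot be converted.
ffi.cdef[[
  typedef struct ContextBase ContextBase;
  ContextBase* envoy_http_lua_ffi_get_context(void);
  int envoy_http_lua_ffi_get_header_map_value(
      ContextBase* ctx, int source,
      const char* key, int key_len, void* buffer);
]]
```

With every value crossing the boundary expressed as cdata, the whole call stays eligible for trace compilation.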

First, we reran the get_body performance test above with an additional data set, FFINew (the optimized FFI implementation). As shown in the figure below, FFINew outperforms FFIOld by 66% and CAPI by 16%; the advantages of the FFI finally show.

(Figure: get_body benchmark, CAPI vs. FFIOld vs. FFINew)

These gains are only indicative, since the plug-in calls get_body 100 times, so we also ran simple tests with plug-ins of varying complexity:

  • simple filter: calls get_header 10 times;
  • normal filter: calls set_header 20 times and get_header 10 times, then removes the 20 headers;
  • complex filter: the normal filter plus 30 calls to get_body.

The results are shown in the figure below: FFINew beats FFIOld across the board, with improvements of 15%, 22%, and 29% respectively.

(Figure: benchmark of simple, normal, and complex filters, FFIOld vs. FFINew)

Finally, we compared the performance of Rider against the community WASM and Lua extensions:

  • RiderOld: Rider's early architecture;
  • RiderNew: the current Rider implementation;
  • WASMC++: the community WASM implementation in Envoy 1.17;
  • RawLua: the community Lua extension in Envoy 1.17;
  • RawC++: a native Envoy C++ extension.

(Figure: benchmark of Rider vs. community WASM, community Lua, and native C++)

As shown above, Rider outperforms the community WASM and Lua extensions by roughly 10%, and is only about 10% slower than a native C++ plug-in. The WASM plug-in here was built with the C++ SDK; in our internal tests, plug-ins built with other WASM language SDKs performed worse. Rider's lead over community Lua is under 10%, which I suspect is because the benchmark plug-ins exchange only simple strings between Lua and C++ and therefore do not show off the FFI's advantages.

2.2.2 Function enhancement

With the performance problem solved, the next step was functional enhancement, that is, enriching the Lua SDK. Here is a summary of all the APIs Rider's Lua SDK currently supports:

  • envoy.req.get_header(name)
  • envoy.req.get_header_size(name)
  • envoy.req.get_header_index(name, index)
  • envoy.req.get_headers()
  • envoy.req.get_body()
  • envoy.req.get_metadata(key, filter_name)
  • envoy.req.get_dynamic_metadata(key, filter_name)
  • envoy.req.get_query_parameters(max_args)
  • envoy.req.set_header(name, value)
  • envoy.req.set_headers(headers)
  • envoy.req.clear_header(name)
  • envoy.resp.get_header(name)
  • envoy.resp.get_header_size(name)
  • envoy.resp.get_header_index(name, index)
  • envoy.resp.get_headers()
  • envoy.resp.get_body()
  • envoy.resp.set_header(name, value)
  • envoy.resp.set_headers(headers)
  • envoy.resp.clear_header(name)
  • envoy.streaminfo.start_time()
  • envoy.streaminfo.current_time_milliseconds()
  • envoy.streaminfo.downstream_local_address()
  • envoy.streaminfo.downstream_remote_address()
  • envoy.streaminfo.upstream_cluster()
  • envoy.streaminfo.upstream_host()
  • envoy.logTrace(message)
  • envoy.logDebug(message)
  • envoy.logInfo(message)
  • envoy.logWarn(message)
  • envoy.logErr(message)
  • envoy.filelog(msg)
  • envoy.get_base_config()
  • envoy.get_route_config()
  • envoy.httpCall(cluster, headers, body, timeout)
  • envoy.respond(headers, body)

2.3 Practice of Rider

NetEase's internal media business has developed and launched multiple Lua plug-ins based on Rider. Among them, a Trace plug-in that prints full-link trace logs went live in Q1 2020; it is now deployed on all of the business's gateways, handles hundreds of thousands of QPS, and runs stably.

3. Future plans for the Rider extension framework

Going forward, we will continue to maintain and optimize Rider along three dimensions: stability, performance, and functionality:

  • Stability: Rider is already deployed at scale by multiple business teams inside and outside NetEase, and we will keep improving and safeguarding its stability;
  • Performance: although Rider already outperforms the community Lua and WASM extensions, we will continue optimizing to further narrow the gap with native C++ extensions;
  • Functionality: align the Rider API with community Lua and WASM to provide the most comprehensive API surface.


About the author: Wang Kai, senior engineer at NetEase Shufan, mainly responsible for data plane R&D and extension for Qingzhou microservices, the Qingzhou API gateway, and related products, with rich experience in extending Envoy as a data plane and putting it into practice.



Origin: my.oschina.net/u/4565392/blog/5440026