Why Envoy can beat Nginx: a thread model analysis

Introduction: with the rise of Service Mesh over the past year, Envoy, one of its key components, has become familiar to many engineers. The author, Matt Klein, is one of Envoy's developers. This article explains Envoy's threading model in detail and is very helpful for understanding how Envoy works. It goes fairly deep, so careful reading is recommended.

There is currently a lack of low-level technical documentation about Envoy. To help fix that, I am planning a series of articles about Envoy's various subsystems. This is the first one; please let me know what you think and which other topics you would like covered. One of the most common questions I get asked is about the threading model Envoy uses.

This article describes how Envoy maps connections to threads, as well as the thread local storage (TLS) system Envoy uses internally. It is this system that allows Envoy to run in a highly parallel way while still guaranteeing high performance.
Thread overview

Figure 1: Thread overview

Envoy uses three different types of threads, as shown in Figure 1.

  • Main: this thread owns server startup and shutdown, all xDS API handling (including DNS, health checking, and general cluster management), runtime, stat flushing, admin, and general process management (signals, hot restart, etc.). Everything that happens on this thread is asynchronous and "non-blocking." In general, the main thread coordinates all critical functionality that does not require a large amount of CPU to accomplish. This allows the majority of management code to be written as if it were single-threaded.
  • Worker: by default, Envoy spawns one worker thread for every hardware thread in the system (this can be controlled via the --concurrency option). Each worker thread runs a "non-blocking" event loop that is responsible for listening on every listener, accepting new connections, instantiating a filter stack for each connection, and processing all IO events for the lifetime of the connection. Again, this allows the majority of connection handling code to be written as if it were single-threaded. (A minimal sketch of this thread layout follows the list.)
  • File flusher: every file Envoy writes (mainly access logs) has an independent flushing thread. This is because writing to the filesystem can block, even when using O_NONBLOCK. When a worker thread needs to write to a file, the data is actually moved into an in-memory buffer, which is eventually flushed to disk by the file flush thread. Technically, all workers can block on the same lock protecting that in-memory buffer when they try to write at the same time. This is discussed further below.
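To make the layout concrete, here is a minimal, self-contained sketch of this structure. It is not Envoy's actual code (Envoy builds its event loops on libevent behind a Dispatcher abstraction; EventLoop and its methods here are hypothetical stand-ins), but it shows a main thread spawning one worker per hardware thread, each draining its own event queue:

```cpp
#include <condition_variable>
#include <functional>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical stand-in for a per-worker event loop.
class EventLoop {
public:
  // Run until stop() is called, executing queued events one at a time.
  // Because a single thread drains the queue, handlers never race.
  void run() {
    while (true) {
      std::function<void()> event;
      {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return stopped_ || !events_.empty(); });
        if (stopped_ && events_.empty()) return;
        event = std::move(events_.front());
        events_.pop();
      }
      event(); // "non-blocking" work: accepts, reads, writes, timers...
    }
  }

  void post(std::function<void()> event) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      events_.push(std::move(event));
    }
    cv_.notify_one();
  }

  void stop() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      stopped_ = true;
    }
    cv_.notify_one();
  }

private:
  std::mutex mutex_;
  std::condition_variable cv_;
  std::queue<std::function<void()>> events_;
  bool stopped_{false};
};

int main() {
  // Default --concurrency: one worker per hardware thread.
  const unsigned concurrency = std::thread::hardware_concurrency();
  std::vector<std::unique_ptr<EventLoop>> loops;
  std::vector<std::thread> workers;
  for (unsigned i = 0; i < concurrency; i++) {
    loops.push_back(std::make_unique<EventLoop>());
    workers.emplace_back([loop = loops.back().get()] { loop->run(); });
  }
  // ... the main thread would now handle xDS, admin, signals, etc. ...
  for (auto& loop : loops) loop->stop();
  for (auto& w : workers) w.join();
}
```

Because only one thread ever runs events from a given queue, handlers on that loop never race with each other, which is what lets the rest of the code pretend it is single-threaded.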

Connection handling

As mentioned above, all worker threads listen on all listeners, without any sharding. The kernel is thus used to intelligently dispatch accepted sockets to the worker threads. Modern kernels are generally very good at this: they use features such as IO priority boosting to try to fill up one thread's work before starting to use other threads that are also listening on the same socket, and they avoid using a single spin lock to process each accept.
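As a rough illustration, here is a minimal POSIX sketch (Linux; error handling omitted, port 8080 chosen arbitrarily) of this shared-listener pattern: every worker blocks in accept() on the same listening fd, and the kernel decides which worker receives each connection. Envoy does the equivalent with non-blocking listener events inside each worker's event loop rather than blocking accept() calls:

```cpp
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
  int listen_fd = socket(AF_INET, SOCK_STREAM, 0);
  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
  addr.sin_port = htons(8080); // hypothetical port
  bind(listen_fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
  listen(listen_fd, SOMAXCONN);

  std::vector<std::thread> workers;
  for (int i = 0; i < 4; i++) {
    workers.emplace_back([listen_fd, i] {
      while (true) {
        // All workers wait on the SAME fd; the kernel picks one of us.
        int conn_fd = accept(listen_fd, nullptr, nullptr);
        if (conn_fd < 0) break;
        std::printf("worker %d owns fd %d for its lifetime\n", i, conn_fd);
        close(conn_fd); // real code would do all IO here, then close
      }
    });
  }
  for (auto& w : workers) w.join();
}
```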

Once a connection is accepted by a worker, it never leaves that worker. All further processing of the connection happens entirely on that worker thread, including forwarding. This means:

  • All connection pools in Envoy are bound to a worker thread. So although HTTP/2 connection pools only establish one connection to each upstream host at a time, if there are four workers, each upstream host will have four HTTP/2 connections in steady state (as sketched below).
  • The reason Envoy works this way is that by keeping all processing of a connection on a single worker thread, almost all code can be written without locks, as if it were single-threaded. This design makes most code easier to write, and it scales almost arbitrarily with the number of workers.
  • The main takeaway is that, from a memory and connection-pool efficiency standpoint, tuning the --concurrency option is actually quite important. Running more workers than needed wastes memory, creates more idle connections, and lowers the connection pool hit rate. At Lyft, sidecar Envoys run with very low concurrency so that their performance roughly matches the services they sit next to; edge Envoys, however, run at maximum concurrency.
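The per-worker pool effect described in the first bullet is easy to see with a toy model. The types below are hypothetical, not Envoy's cluster manager API: each worker owns a thread_local pool map, so four workers talking to the same HTTP/2 upstream yield four connections:

```cpp
#include <cstdio>
#include <map>
#include <string>
#include <thread>
#include <vector>

// Hypothetical per-worker connection pool: keyed by upstream host and
// thread_local, so each worker builds its own pool with no locking.
struct ConnPool {
  int connections = 0;
};
thread_local std::map<std::string, ConnPool> pools;

void handle_request(const std::string& upstream_host) {
  ConnPool& pool = pools[upstream_host]; // no lock: the map is per-thread
  if (pool.connections == 0) pool.connections = 1; // HTTP/2: one conn/host
}

int main() {
  std::vector<std::thread> workers;
  for (int i = 0; i < 4; i++) {
    workers.emplace_back([i] {
      handle_request("10.0.0.1:443");
      // Each of the 4 workers holds its own connection, so the upstream
      // host sees 4 HTTP/2 connections in steady state.
      std::printf("worker %d: %d connection(s) to 10.0.0.1:443\n", i,
                  pools["10.0.0.1:443"].connections);
    });
  }
  for (auto& w : workers) w.join();
}
```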

What does non-blocking mean

So far, in discussing how the main thread and the worker threads operate, the term "non-blocking" has been used many times, and all code is written under the assumption that nothing ever blocks. However, this is not entirely true. Envoy does use a few process-wide locks:

  • As mentioned earlier, when access logs are being written, all workers acquire the same lock before writing to the in-memory log buffer. The lock hold time should be very short, but the lock can be contended at high concurrency and high throughput.
  • Envoy uses a fairly involved system for handling thread-local statistics, which I will cover in a follow-up article. Briefly: as part of thread-local stat handling, it is sometimes necessary to acquire a lock on the central "stat store." This lock should not be highly contended.
  • The main thread periodically needs to coordinate with all the worker threads. This is done by "posting" from the main thread to the worker threads (and sometimes from a worker thread back to the main thread). Posting requires taking a lock to place the posted message into a queue for later delivery. These locks should never be highly contended, but they can still technically block. (A sketch of the posting mechanism follows this list.)
  • When Envoy logs to standard error, it acquires a process-wide lock. In general, Envoy's local logging is assumed to be terrible for performance anyway, so no special effort has gone into improving this lock.
  • There are a few other miscellaneous locks, but none of them are in performance-critical paths and they should never be contended.
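For the posting mechanism in the third bullet, here is a bare-bones sketch of the idea (hypothetical code, not Envoy's actual implementation, though Envoy exposes the same concept via its dispatcher): the sender holds the lock only long enough to enqueue a closure, and the target thread later drains the queue from its own event loop, so the posted work itself runs without locks:

```cpp
#include <functional>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

std::mutex queue_lock;
std::queue<std::function<void()>> posted;

// Called from any thread: the lock is held only to enqueue.
void post(std::function<void()> fn) {
  std::lock_guard<std::mutex> guard(queue_lock); // short, rarely contended
  posted.push(std::move(fn));
}

// Called from the target thread's event loop: grab everything under one
// brief lock, then run the closures lock-free on this thread.
void drain_one_pass() {
  std::queue<std::function<void()>> batch;
  {
    std::lock_guard<std::mutex> guard(queue_lock);
    std::swap(batch, posted);
  }
  while (!batch.empty()) {
    batch.front()();
    batch.pop();
  }
}

int main() {
  post([] { std::cout << "ran on the target thread\n"; });
  std::thread target(drain_one_pass); // pretend this is a worker loop pass
  target.join();
}
```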

Thread local storage

Because Envoy separates the responsibilities of the main thread from those of the worker threads, complex processing can be done on the main thread and then made available to each worker thread in a highly concurrent way. This section introduces Envoy's thread local storage (TLS) system at a high level. In the next section I will describe how it is used for cluster management.

Figure 2: Thread local storage (TLS) system

As already described, the main thread handles essentially all management/control plane functionality in Envoy. (The main thread ends up doing quite a lot, but that seems appropriate when considered against the work the workers do.) A common pattern is for the main thread to perform some work whose result then needs to be available to every worker thread, without the workers having to acquire a lock on each access.

Envoy's TLS system works as follows:

  • Code running on the main thread can allocate a process-wide TLS slot. Though abstracted, in practice this is an index into a vector, allowing O(1) access.
  • The main thread can set arbitrary data into its slot. When this is done, the data is posted to each worker as a normal event-loop event.
  • Worker threads can read from their TLS slot and will retrieve whatever thread-local data is available there.

Though very simple, this is a very powerful paradigm, quite similar in concept to RCU locking. (In essence, worker threads never see any change to the data in a TLS slot while they are doing work; changes only happen during the quiescent period between work events.)

Envoy uses it in two different ways:

  • Storing different data on each worker, which can then be accessed without any locks.
  • Storing a shared pointer to read-only global data on each worker. Each worker thus holds a reference count on the data that cannot drop to zero while it is working; only when all workers have quiesced and loaded the new shared data will the old data be destroyed. This is identical to RCU.
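A compressed sketch of the second pattern follows. The names are hypothetical (the real interface lives in include/envoy/thread_local/thread_local.h, linked at the end of this article): the main thread publishes a new immutable snapshot, each worker ends up holding its own shared_ptr to it, and the old snapshot is destroyed only when the last worker lets go:

```cpp
#include <cstdio>
#include <memory>
#include <string>
#include <vector>

struct Snapshot {
  std::vector<std::string> healthy_hosts; // immutable once published
};

struct Slot {
  // One entry per worker. In Envoy, entry i is only written from worker i,
  // via an event posted to that worker's loop, so reads need no lock.
  std::vector<std::shared_ptr<const Snapshot>> per_worker;

  // Called on the main thread: "publish" a new snapshot to every worker.
  // Real code posts a closure to each worker; each worker then swaps its
  // own pointer during a quiescent period (this is the RCU-like step).
  void publish(std::shared_ptr<const Snapshot> snap) {
    for (auto& entry : per_worker) entry = snap; // stand-in for the posts
  }
};

int main() {
  Slot slot;
  slot.per_worker.resize(4); // four workers

  auto v1 = std::make_shared<const Snapshot>(
      Snapshot{{"10.0.0.1", "10.0.0.2"}});
  slot.publish(v1);
  v1.reset(); // main drops its ref; the workers still pin the snapshot

  // A worker reads its own entry with no lock.
  const Snapshot& snap = *slot.per_worker[0];
  std::printf("worker 0 sees %zu hosts\n", snap.healthy_hosts.size());

  // Publishing v2 overwrites every per-worker pointer; once the last
  // worker replaces its reference, v1's Snapshot is destroyed, RCU-style.
  slot.publish(std::make_shared<const Snapshot>(Snapshot{{"10.0.0.3"}}));
}
```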

Cluster update thread

In this section, I will describe how TLS is used for cluster management.
Cluster management includes xDS API handling and/or DNS, as well as health checking.

Figure 3: Cluster manager thread

Figure 3 shows the overall flow of the following components and steps:

1. The cluster manager is the component inside Envoy that manages all known upstream clusters, the CDS API, the SDS/EDS APIs, DNS, and active health checking. It is responsible for creating an eventually consistent view of every upstream cluster, including the discovered hosts as well as their health status.
2. The health checker performs active health checking and reports health state changes back to the cluster manager.
3. CDS/SDS/EDS/DNS are performed to determine cluster membership. State changes are reported back to the cluster manager.
4. Every worker thread is continuously running an event loop.
5. When the cluster manager determines that the state of a cluster has changed, it creates a new read-only snapshot of the cluster state and posts it to every worker thread.
6. During the next quiescent period, the worker thread updates the snapshot in the allocated TLS slot.
7. During an IO event in which a host needs to be load balanced, the load balancer queries the TLS slot for host information. No locks are required for this. (Note also that TLS can fire events on update so that load balancers and other components can recompute caches, data structures, etc. This is beyond the scope of this article, but it is used in various places in the code. A sketch of this lock-free read path follows the list.)
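Here is a small sketch of the read path in step 7 (hypothetical types, not Envoy's load balancer code): the worker's load balancer simply round-robins over the read-only snapshot sitting in its TLS slot:

```cpp
#include <cstdio>
#include <memory>
#include <string>
#include <vector>

struct ClusterSnapshot {
  std::vector<std::string> hosts; // immutable once published
};

// Stand-ins for "this worker's TLS slot" and its round-robin cursor.
thread_local std::shared_ptr<const ClusterSnapshot> tls_snapshot;
thread_local size_t rr_index = 0;

const std::string& pick_host() {
  const ClusterSnapshot& snap = *tls_snapshot;        // lock-free read
  return snap.hosts[rr_index++ % snap.hosts.size()];  // simple round robin
}

int main() {
  // Pretend the cluster manager already posted this snapshot to us.
  tls_snapshot = std::make_shared<const ClusterSnapshot>(
      ClusterSnapshot{{"10.0.0.1", "10.0.0.2", "10.0.0.3"}});
  for (int i = 0; i < 4; i++) {
    std::printf("routing to %s\n", pick_host().c_str());
  }
}
```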

Using the process just described, Envoy is able to handle every request without taking any locks (beyond those described earlier). Outside of the TLS code itself, most code has no awareness of threading and can be written as if it were single-threaded. In addition to performing extremely well, this makes most code easier to write.

Other subsystems that use TLS

TLS and RCU are widely used in Envoy. Some other examples include:

  • Runtime (feature flag) override lookup: the current feature flag override map is computed on the main thread. A read-only snapshot is then provided to each worker using RCU semantics.
  • Route table swapping: for route tables provided by RDS, the route table is instantiated on the main thread. A read-only snapshot is then provided to each worker using RCU semantics.
  • HTTP date header caching: it turns out that computing the HTTP date header for every request is expensive (when each core is doing ~25K+ RPS). Envoy computes the date header roughly every half second and provides it to each worker via TLS and RCU (a simplified sketch follows this list).
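As a simplified stand-in for the date header trick: Envoy's real version refreshes a shared value from a timer and publishes it via TLS/RCU, while the sketch below instead keeps a per-worker cache, which captures the same saving of one strftime per interval rather than one per request:

```cpp
#include <chrono>
#include <cstdio>
#include <ctime>
#include <string>

// Returns a cached HTTP Date header, re-formatted at most every 500ms per
// worker, instead of calling strftime on every request at ~25K+ RPS/core.
const std::string& cached_date_header() {
  using namespace std::chrono;
  thread_local std::string cached;
  thread_local steady_clock::time_point last_refresh;
  auto now = steady_clock::now();
  if (cached.empty() || now - last_refresh >= milliseconds(500)) {
    char buf[64];
    std::time_t t = std::time(nullptr);
    std::tm tm_utc;
    gmtime_r(&t, &tm_utc); // POSIX; use gmtime_s on Windows
    std::strftime(buf, sizeof(buf), "%a, %d %b %Y %H:%M:%S GMT", &tm_utc);
    cached = buf;
    last_refresh = now;
  }
  return cached;
}

int main() {
  std::printf("date: %s\n", cached_date_header().c_str());
}
```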

There are other examples, but the previous examples should have illustrated how TLS is widely used inside Envoy.

Known performance pitfalls

Although Envoy's overall performance is quite good, there are a few known areas that will need attention when it is used at very high concurrency and throughput:

  • As described in this article, all workers currently acquire a single lock when writing to the in-memory access log buffer. At high concurrency and high throughput, it will become necessary to batch the access logs per worker, at the cost of out-of-order delivery when writing to the final file. Alternatively, each worker thread could get its own access log.
  • Although stats are heavily optimized, at very high concurrency and throughput there will likely be atomic contention on individual stats. The solution is per-worker counters that are periodically synced to the central counters; this will be discussed in a follow-up article (a sketch of the idea follows this list).
  • If Envoy is deployed in a scenario in which a small number of connections consume a large amount of resources, the existing architecture will not perform well, because there is no guarantee that connections are evenly balanced across workers. This can be solved by implementing worker connection balancing, in which a worker is able to forward a connection to another worker for handling.
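Here is a sketch of the per-worker counter idea from the second bullet (a proposal at the time of writing, not Envoy's shipped implementation): increments touch only a plain thread_local counter, and only a periodic flush touches the shared atomic, so the hot path has no cross-core contention:

```cpp
#include <atomic>
#include <cstdint>
#include <cstdio>
#include <thread>
#include <vector>

std::atomic<uint64_t> central_requests{0};
thread_local uint64_t local_requests = 0;

void on_request() { local_requests++; } // hot path: no atomics, no locks

void flush() { // called periodically from each worker's event loop
  central_requests.fetch_add(local_requests, std::memory_order_relaxed);
  local_requests = 0;
}

int main() {
  std::vector<std::thread> workers;
  for (int i = 0; i < 4; i++) {
    workers.emplace_back([] {
      for (int n = 0; n < 1000; n++) on_request();
      flush();
    });
  }
  for (auto& w : workers) w.join();
  std::printf("total: %llu\n",
              static_cast<unsigned long long>(central_requests.load()));
}
```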

Conclusion

Envoy's threading model is designed to favor programming simplicity and massive parallelism, at the expense of potentially wasteful memory and connection usage if not tuned correctly. This model allows Envoy to perform very well at very high worker counts and throughput.

As I mentioned on Twitter, the design is also amenable to running on top of a full userspace networking stack such as DPDK, which could allow commodity servers to handle millions of requests per second. It will be very interesting to see what gets built over the next few years.

One final point: I have been asked many times why we chose C++ for Envoy. The reason is that it is still the only widely deployed production-grade language in which the architecture described in this article can be built. C++ is certainly not right for every project, or even for many projects, but for certain use cases it is still the only tool that can get the job done.

Code links

Links to some interfaces and header files discussed in this article:
https://github.com/lyft/envoy/blob/master/include/envoy/thread_local/thread_local.h
https://github.com/lyft/envoy/blob/master/source/common/thread_local/thread_local_impl.h
https://github.com/lyft/envoy/blob/master/include/envoy/upstream/cluster_manager.h
https://github.com/lyft/envoy/blob/master/source/common/upstream/cluster_manager_impl.h

Original English post:
https://blog.envoyproxy.io/envoy-threading-model-a8d44b922310

More Envoy introduction:
https://www.envoyproxy.io/


This article was written by Matt Klein and translated by Fangyuan. Please credit the source when reprinting.
