The principle and application of coroutines: real coroutines in C

The principle of coroutines

Coroutines are different from threads, which are an operating-system concept. A coroutine is really a program component, much like a function: you can easily create hundreds of thousands of coroutines in a single thread, just as you can make hundreds of thousands of function calls. The difference is that a function has a single entry point and is finished once it returns, whereas a coroutine can be entered at its starting point or resumed from wherever it last yielded. This means execution can be transferred between coroutines via yield, with the coroutines calling each other symmetrically as peers, rather than in the caller/callee hierarchy of functions. Of course, coroutines can also mimic functions and implement a caller/callee relationship; these are called asymmetric coroutines.

Let's look at a symmetric coroutine scenario: the familiar "producer-consumer" event-driven model. One coroutine produces items and adds them to a queue; the other removes items from the queue and uses them. To improve efficiency, you want to add or remove several items at a time. The pseudocode looks like this:

# producer coroutine
loop
    while queue is not full
        create some new items
        add the items to queue
    yield to consumer

# consumer coroutine
loop
    while queue is not empty
        remove some items from queue
        use the items
    yield to producer

If the producer-consumer model were implemented with multiple threads, a synchronization mechanism would be needed between them to avoid race conditions on shared resources. That inevitably brings system overhead such as sleeping, scheduling, and context switching, and thread scheduling also introduces nondeterministic timing.

For coroutines, "suspending" simply means handing execution over to another coroutine. When the other coroutine finishes its slice of work, it transfers control back, "waking" the first coroutine at its suspension point. The calls between coroutines are logically controlled and deterministic in order; everything is under the program's control.

Many languages have coroutine semantics today: the relatively heavyweight C#, Erlang, and Go; the lightweight Python, Lua, JavaScript, and Ruby; and the functional Scala and Scheme. C, by contrast, is in an awkward position as a "native" language. The reason is that C relies on stack frames for routine calls: a routine's internal state and return value live on the stack, which means a producer and consumer cannot yield to each other as peers. You could, of course, rewrite the code so that the producer is the main routine and passes each product as a parameter to the consumer routine, but such code is unpleasant to write and painful to read, and once the number of coroutines reaches the order of 100,000 that style becomes far too rigid.

If each coroutine's context (such as its program counter) is saved somewhere other than the stack, then when coroutines transfer control to one another, the resumed coroutine only needs to restore the context it saved at its last transfer point — much like a CPU context switch. The C runtime gives us two scheduling primitives for this: setjmp/longjmp from the standard library, and the POSIX ucontext components (implemented internally in assembly, naturally). The former has considerable pitfalls in practice (it is hard to encapsulate, for example; see the online documentation for details), so the latter is more widely used, and most C coroutine libraries found online are built on ucontext.

We know that Python's yield acts as an iterator generator: the function retains its state across calls and resumes from the last return point on the next call. For example:

def cols():
    for i in range(10):
        yield i

g=cols()
for k in g:
    print(k) 

Let's take a look at how yield semantics can be implemented in C:

int function(void) {
  static int i, state = 0;
  switch (state) {
    case 0: goto LABEL0;
    case 1: goto LABEL1;
  }
  LABEL0: /* start of function */
  for (i = 0; i < 10; i++) {
    state = 1; /* so we will come back to LABEL1 */
    return i;
    LABEL1:; /* resume control straight after the return */
  }
  state = 0;
  return -1; /* loop finished: reset and signal exhaustion */
}

This is achieved with static variables and goto jumps. If you'd rather not use goto, you can rely on switch's ability to jump straight into a loop:

int function(void) {
  static int i, state = 0;
  switch (state) {
    case 0: /* start of function */
    for (i = 0; i < 10; i++) {
      state = 1; /* so we will come back to "case 1" */
      return i;
      case 1:; /* resume control straight after the return */
    }
  }
  state = 0;
  return -1; /* loop finished: reset and signal exhaustion */
}

We can also use the __LINE__ macro to make the pattern more generic:

int function(void) {
  static int i, state = 0;
  switch (state) {
    case 0: /* start of function */
    for (i = 0; i < 10; i++) {
      state = __LINE__ + 2; /* so we will come back to "case __LINE__" */
      return i;
      case __LINE__:; /* resume control straight after the return */
    }
  }
  state = 0;
  return -1; /* loop finished: reset and signal exhaustion */
}

From here we can distill the pattern into macros and package it as a component:

#define Begin() static int state=0; switch(state) { case 0:
#define Yield(x) do { state=__LINE__; return x; case __LINE__:; } while (0)
#define End() }
int function(void) {
  static int i;
  Begin();
  for (i = 0; i < 10; i++)
    Yield(i);
  End();
}

This style of coroutine has one limitation: the scheduling state is kept in static variables rather than in stack locals. Indeed, stack locals cannot be used to save the state, so the code is neither reentrant nor usable from multiple threads. If the locals are instead wrapped in a context structure passed in through a function parameter, with dynamically allocated heap memory "simulating" the stack, the reentrancy problem is solved — but at a cost in clarity: every local variable must be written as a member access through the context pointer, which is tedious when there are many locals, and macro tricks over malloc/free quickly become too elaborate to control.

Since a coroutine is itself a single-threaded construct, we can assume a single-threaded environment with no reentrancy problem, and freely use static variables to keep the code simple and readable. Such a simple coroutine should really not be used in a multithreaded environment at all; if you must, glibc's ucontext component mentioned earlier is a viable alternative, since it gives each coroutine a private stack for its context. Even then its use across threads is not unrestricted — read its documentation carefully.


Concurrent applications of coroutines

A coroutine lets you apply the synchronous programming style, within a single thread, to an asynchronous processing flow. A single thread can then serve hundreds or thousands of requests concurrently while each request's handling remains linear, without obscure callback mechanisms stitching the flow together.

Event-driven state machine

Traditional web servers (such as nginx and squid) process requests concurrently using the EDSM (event-driven state machine) mechanism: an asynchronous processing style that avoids blocking threads by using callbacks.

The most common form of EDSM is asynchronous callbacks on I/O events. There is a single-threaded main loop called the dispatcher (also known as the event loop); users register callback functions (also called event handlers) with the dispatcher to receive asynchronous notification, so they need not waste resources waiting in place. In its main loop the dispatcher waits for I/O events via system calls such as select()/epoll(). When the kernel detects that an event has fired and data is readable or writable, select()/epoll() returns, and the dispatcher invokes the corresponding callback to process the user's request.

The entire process is single-threaded. This kind of processing essentially serializes a set of disjoint callbacks, as if they were strung together on a linked list: the dispatcher multiplexes I/O events and activates the callbacks in turn. Each callback is a basket holding the processing of various requests (not every request has its own callback — one request may pass through different callbacks). A request here is analogous to a thread (not an operating-system thread), and a "context switch" happens at the end of each callback (assuming different requests map to different callbacks): the next callback is registered to wait for its event, and processing of other requests resumes. The dispatcher's execution state can be saved and passed along as a callback argument.

The disadvantage of asynchronous callbacks is that they are difficult to implement and extend. There are general-purpose libraries such as libevent, along with actor/reactor design patterns and frameworks, but as Dean Gaudet (an Apache developer) put it: "The inherent complexity — breaking up linear thought into a bucketload of callbacks — it still exists." The request logic is discontinuous across callbacks: switching between callbacks interrupts some requests, while new requests require re-registration.

The coroutine approach is still EDSM at heart, but it aims to replace the traditional asynchronous callbacks. It abstracts each request into a thread-like concept to get closer to the natural programming model (the "linear thought" above — as natural as switching between operating-system threads).

The following describes one coroutine implementation: the State Threads library.

ST library

The ST (State Threads) library provides a high-performance, scalable implementation scheme for servers (web servers, proxy servers, mail agents, and so on).

The ST library simplifies the multithreading programming paradigm: each request corresponds to one thread. Note that a "thread" here is really a coroutine, not a kernel-backed thread as with pthreads.

A word on how ST scheduling works. The ST runtime maintains four queues: IOQ (the waiting queue), RUNQ (the run queue), SLEEPQ (the timeout queue), and ZOMBIEQ. A thread's queue membership corresponds to its state (hence the name: ST is, as it says, a thread state machine). For example, when a thread polls for an event, it is added to IOQ to wait (and to SLEEPQ as well if a timeout was given). When the event fires, the thread is removed from IOQ (and from SLEEPQ if applicable) and moved to RUNQ to await scheduling as the next running thread — RUNQ is the analogue of an operating system's ready queue, and in traditional EDSM terms this transition is callback registration and activation. Similarly, when simulating synchronization with wait/sleep/lock, the current thread is placed on SLEEPQ until it is woken or times out and re-enters RUNQ to be scheduled.

ST's scheduling has the dual advantages of performance and memory. On performance: ST implements its own setjmp/longjmp to perform the switch without any system overhead, and the context (a jmp_buf) is implemented in assembly for each platform and architecture, with portability comparable to libc. The code below shows the scheduling implementation:

/*
 * Switch away from the current thread context by saving its state 
 * and calling the thread scheduler
 */
#define _ST_SWITCH_CONTEXT(_thread)       \
    ST_BEGIN_MACRO                        \
    if (!MD_SETJMP((_thread)->context)) { \
      _st_vp_schedule();                  \
    }                                     \
    ST_END_MACRO
 
/*
 * Restore a thread context that was saved by _ST_SWITCH_CONTEXT 
 * or initialized by _ST_INIT_CONTEXT
 */
#define _ST_RESTORE_CONTEXT(_thread)   \
    ST_BEGIN_MACRO                     \
    _ST_SET_CURRENT_THREAD(_thread);   \
    MD_LONGJMP((_thread)->context, 1); \
    ST_END_MACRO
 
void _st_vp_schedule(void)
{
    _st_thread_t *thread;
 
    if (_ST_RUNQ.next != &_ST_RUNQ) {
        /* Pull thread off of the run queue */
        thread = _ST_THREAD_PTR(_ST_RUNQ.next);
        _ST_DEL_RUNQ(thread);
    } else {
        /* If there are no threads to run, switch to the idle thread */
        thread = _st_this_vp.idle_thread;
    }
    ST_ASSERT(thread->state == _ST_ST_RUNNABLE);
 
    /* Resume the thread */
    thread->state = _ST_ST_RUNNING;
    _ST_RESTORE_CONTEXT(thread);
}

If you are familiar with setjmp/longjmp, this reads directly: the current thread calls MD_SETJMP to save its context in a jmp_buf (returning 0), then calls _st_vp_schedule() to schedule itself out. The scheduler first looks at RUNQ; if that queue is empty, it picks the idle thread, a special thread created when ST initializes. It then sets the current-thread pointer to the chosen thread and calls MD_LONGJMP to jump back to wherever that thread last called MD_SETJMP, restoring its context from thread->context (this time returning 1), and the thread continues executing. Like EDSM, the whole process happens inside a single operating-system thread, so there is no system overhead or blocking.

In fact, the real blocking happens while waiting for I/O event multiplexing, i.e. in select()/epoll() — the only blocking system call in all of ST. At that point the whole environment is idle: every thread has finished its request processing and RUNQ is empty. _st_idle_thread_start then runs a main loop (similar to an event loop) responsible for three tasks: (1) perform I/O multiplexing checks for all threads on IOQ; (2) check SLEEPQ for timeouts; (3) schedule the idle thread out. The code is as follows:

void *_st_idle_thread_start(void *arg)
{
    _st_thread_t *me = _ST_CURRENT_THREAD();
 
    while (_st_active_count > 0) {
        /* Idle vp till I/O is ready or the smallest timeout expired */
        _ST_VP_IDLE();
 
        /* Check sleep queue for expired threads */
        _st_vp_check_clock();
 
        me->state = _ST_ST_RUNNABLE;
        _ST_SWITCH_CONTEXT(me);
    }
 
    /* No more threads */
    exit(0);
 
    /* NOTREACHED */
    return NULL;
}

Here me is the idle thread, since _st_idle_thread_start is its entry point. Each time control switches back from the previous _ST_SWITCH_CONTEXT(), the loop polls for I/O events in _ST_VP_IDLE(); once some other thread's event fires or a SLEEPQ timeout expires, the idle thread switches itself out with _ST_SWITCH_CONTEXT(). If RUNQ is non-empty at that point, control goes to the first thread in the queue. The main loop itself never exits here.

On memory: ST keeps execution state on the stack as local variables, instead of allocating it dynamically the way callbacks must. Compare how a user writes the same logic in thread style versus callback style:

/* thread land */
int foo()
{
    int local1;
    int local2;
    do_some_io();
}
 
/* callback land */
struct foo_data {
    int local1;
    int local2;
};
 
void foo_cb(void *arg)
{
    struct foo_data *locals = arg;
    ...
}
 
void foo()
{
    struct foo_data *locals = malloc(sizeof(struct foo_data));
    register(foo_cb, locals);
} 

There are two other points worth noting. First, ST threads use non-priority, non-preemptive scheduling: because ST is built on EDSM, each thread is event- or data-driven and will sooner or later schedule itself out at a well-defined point, rather than being preempted by time slice, which simplifies thread management. Second, ST ignores all signals — _st_io_init sets sigact.sa_handler to SIG_IGN — to minimize per-thread resources and avoid signal masks and their system calls (which cannot be avoided with ucontext). That does not mean ST cannot handle signals: the recommended approach is to convert signals into ordinary I/O events by writing them to a pipe.

The multithreading programming paradigm

POSIX Threads (hereafter pthreads) is a general-purpose thread library. It maps user-level threads onto kernel execution entities (also called lightweight processes in some books) 1:1 or m:n to implement multithreading. The Apache server, for example, uses pthreads for concurrent request handling: each thread processes one request synchronously, blocking as needed, and accepts no other request until its current one completes.

ST is single-threaded (an n:1 mapping); its "threads" are really coroutines. General network applications cannot bypass the operating system's multithreading paradigm, but in certain server domains, resources shared between threads bring extra complexity — locks, race conditions, concurrent access, file handles, global variables, pipes, signals — and pthreads' flexibility is greatly diminished. ST's scheduling is precise: a context switch happens only at a well-defined I/O or synchronization call. That is the nature of coroutines, so ST needs no mutual-exclusion protection and can freely use static variables and non-reentrant library functions (something Protothreads, also a coroutine library, cannot allow, since it is stackless and cannot save that context). This greatly simplifies programming and debugging while improving performance.

As an aside, as far as I know there are only three ways to implement coroutines in C:

1. switch-case label jumping, as represented by Protothreads;

2. setjmp/longjmp context switching independent of libc, as represented by ST;

3. glibc's ucontext interface (as in Yun Feng's coroutine library).

Among these, Protothreads is the lightest but most restrictive, while ucontext consumes more resources and performs worse. For now, ST appears to be the best of the three.

To sum up

ST's core idea is to use the simple, elegant multithreading paradigm in place of the complex, obscure implementation of traditional asynchronous callbacks, while retaining EDSM's performance and decoupled architecture to avoid the overhead and pitfalls that multithreading imposes on the system.

The main limitation of ST is that all of an application's I/O operations must go through the APIs ST provides, since only then can its threads be managed by the scheduler without blocking.

One last note: the ngx_lua module also uses coroutines to simplify Nginx request processing. Each request corresponds to one Lua coroutine, so the request is handled linearly within its coroutine, avoiding callback-style asynchronous code.


Origin blog.csdn.net/Linuxhus/article/details/114088289