Skynet's overall architecture and task scheduling analysis

Yunfeng's open-source skynet currently weighs in at about 8K lines of C code and 2K lines of lua code. It implements a multi-threaded, highly concurrent online game backend service framework, providing timers, concurrent scheduling, a service extension framework, asynchronous message queues, naming services and other foundational capabilities to support lua scripts. A single server supports 10K+ concurrent client connections.

I am personally most interested in high performance and concurrent scheduling, so I spent the past two days reading the skynet code; this is a brief summary.

1. Overall Architecture

A picture is worth a thousand words. Leaving out monitoring, service extension, timers and other features, the simplified framework of skynet's service processing is shown below:

[Figure: simplified skynet service-processing framework]

Each online client has a socket connected to the skynet server. In skynet, each socket corresponds to a lua virtual machine and a per-client message queue (per client mq). When there are messages in a per-client message queue, that queue is mounted on the global message queue to be scheduled and processed by the worker threads.

The main flow of skynet's service processing is fairly simple: a socket thread polls all sockets; when a client request arrives, it is packaged into a message and sent to the per-client message queue corresponding to that socket, and the per-client queue is then linked onto the tail of the global queue. N worker threads take per-client message queues from the head of the global queue, pop one message from the per-client queue, process it, and re-hang the per-client queue on the tail of the global queue when done.

The actual code is more complicated: the timer thread periodically checks the registered timers and sends expired timer messages to the per-client message queues; each lua vm also sends messages to the per-client message queues of other lua vms (or its own); and the monitor thread watches the status of each client to detect, for example, message processing that has fallen into an infinite loop. This article focuses on message scheduling, because even a slight adjustment to message scheduling may have a large impact on server performance.

In addition, it can be seen that each client's messages are processed in the order they arrive, and that a given client's messages are only ever scheduled on one worker thread at a time. Client processing logic therefore does not need to consider multi-thread concurrency and basically needs no locks.
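In pseudocode, each worker's dispatch step then looks roughly like this (a simplified sketch of the idea only, not skynet's actual skynet_context_message_dispatch, which also handles monitoring and service lookup; handle_message is a hypothetical stand-in for invoking the client's lua callback):

static int
dispatch_one(void) {
    struct message_queue *mq = skynet_globalmq_pop();
    if (mq == NULL)
        return 1;                  // nothing to do; the caller may sleep

    struct skynet_message msg;
    if (skynet_mq_pop(mq, &msg)) { // per-client queue drained in the meantime
        return 0;                  // it is re-hung on the next message send
    }
    handle_message(mq, &msg);      // process one message for this client
    skynet_globalmq_push(mq);      // re-hang the queue at the global tail
    return 0;
}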

2. Concurrent task scheduling method

Lua supports non-preemptive coroutines, and a single lua virtual machine can host a massive number of concurrent tasks. The main problem with coroutines is that they do not support multiple cores, so they cannot exploit the multi-core capability of today's servers. There are therefore many projects that add OS thread support to lua, such as Lua Lanes and LuaProc. One problem every such project must solve is how concurrent tasks are composed and scheduled.

Concurrent tasks can be represented by coroutines: one lua virtual machine (lua_State) is created per OS thread, and a large number of coroutines are created on that virtual machine. This scheduling model is shown in the following figure:

[Figure: 1:1 scheduling of OS threads and lua vms]
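As a minimal sketch of this 1:1 model (my own illustration, not code from Lua Lanes or LuaProc; scheduler.lua is a hypothetical script that would create this thread's tasks as coroutines and resume them in turn):

#include <stdio.h>
#include <pthread.h>
#include <lua.h>
#include <lualib.h>
#include <lauxlib.h>

#define NUM_WORKERS 4

static void *
worker(void *arg) {
    (void)arg;
    lua_State *L = luaL_newstate();  // one lua vm per OS thread
    luaL_openlibs(L);
    // multiplex this thread's tasks as coroutines inside its own vm
    if (luaL_dofile(L, "scheduler.lua") != 0)
        fprintf(stderr, "%s\n", lua_tostring(L, -1));
    lua_close(L);
    return NULL;
}

int
main(void) {
    pthread_t tid[NUM_WORKERS];
    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}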

This 1:1 scheduling of OS threads and lua vms has several advantages:

Each OS thread has a private message queue with multiple writers but only one reader, which allows a lock-free design on the reader side.
A lua vm may or may not be bound to an OS thread. Since a modern OS already tries to keep each OS thread on the same CPU core, binding each lua vm to one OS thread greatly reduces cpu cache invalidation and increases the cache hit rate.
The number of lua vms equals the number of OS threads and is independent of the number of tasks. A large number of tasks share the same lua vm, sharing its lua bytecode, string constants and other data, which greatly reduces per-task memory usage.
But this method also has serious flaws:

Tasks cannot migrate across lua vms. Each task is a coroutine, and a coroutine is a data structure inside its lua vm: during execution its stack references a large amount of data shared inside that vm, so it cannot be moved to another lua vm for execution. When multiple tasks on one lua vm are all busy, they can only be executed serially by one OS thread, and cannot be handed to other OS threads by techniques such as work stealing. A telecommunications project I worked on earlier used exactly this business-bound-to-thread model. For telecom services whose business logic is relatively fixed, the processing cost of each client request is similar, so CPU usage stays balanced after binding, and thanks to the excellent cache locality this model performs extremely well. However, when the workload per request is not fixed, or even fluctuates frequently, this model easily leaves some threads busy while others sit idle, and the multi-core capacity cannot be used effectively.

Multiple tasks in the same lua vm share that vm's memory space, so when one task misbehaves it can easily affect the others. Simply put, the isolation between tasks is poor.

The other approach is for each lua vm to represent one task, so a large number of concurrent tasks are carried by a large number of lua vms. This is what skynet does. It effectively fixes both shortcomings above: each task is completely independent and can be handed to any OS thread, and tasks do not share lua vm memory, so isolation is excellent and a problem in one task will not affect the execution of other tasks. The main drawback is wasted memory: every lua vm loads a large number of identical lua bytecodes and constants, so the memory requirements are very high, and since tasks cannot reuse each other's cpu cache lines during execution, the cache hit rate is much lower. Yunfeng's remedy is to modify lua vm's code loading mechanism so that multiple lua vms in the same process share bytecode. Concretely, one dedicated lua vm loads all bytecode and is responsible for garbage collecting it, while the other lua vms in the same process share the bytecode loaded by that dedicated vm. This solves bytecode sharing but not string constant sharing. Even so, it saves about 1 MB of memory per online user; with 10K concurrent users, that is on the order of 10 GB.

3. Global Message Queue

skynet has one global message queue plus a per-online-user "task-specific message queue". Similar to the goroutine scheduler before go 1.1, all worker threads share a single global queue. skynet's global queue is a circular queue implemented with an array. A global queue is not a very efficient way to schedule messages: every message's dequeue and enqueue normally needs a global lock. go 1.1 optimized the scheduler algorithm so that each thread uses a local queue, which improved performance considerably (reportedly close to 40% for some applications).
The skynet global queue has multiple concurrent producers and multiple concurrent consumers. Normally, access to such a queue needs a lock, but skynet actually uses a wait-free implementation; see the following code.

Enqueue:

static void
skynet_globalmq_push(struct message_queue * queue) {
    struct global_queue *q = Q;

    uint32_t tail = GP(__sync_fetch_and_add(&q->tail,1));
    q->queue[tail] = queue;
    __sync_synchronize();
    q->flag[tail] = true;
}

Dequeue:

struct message_queue *
skynet_globalmq_pop() {
    struct global_queue *q = Q;
    uint32_t head = q->head;
    uint32_t head_ptr = GP(head);
    if (head_ptr == GP(q->tail)) {
        return NULL;
    }

    if (!q->flag[head_ptr]) {
        return NULL;
    }

    __sync_synchronize();

    struct message_queue * mq = q->queue[head_ptr];
    if (!__sync_bool_compare_and_swap(&q->head, head, head+1)) {
        return NULL;
    }
    q->flag[head_ptr] = false;

    return mq;
}

As you can see, neither enqueue nor dequeue does any locking or unlocking; skynet relies only on a few atomic builtins provided by gcc.

On enqueue, the task-specific queue is appended directly to the tail of the global queue without any check for a full queue: the global queue is assumed never to fill up. It is allocated 65536 slots at initialization and never grows. The number of concurrent clients on one server generally does not exceed 10,000; each client corresponds to one task-specific message queue, and each task-specific message queue is in the global queue at most once, so as long as the number of simultaneously online clients does not exceed 65535 the queue cannot fill. It is on this assumption that skynet lets multiple threads enqueue concurrently in a wait-free manner. This places a strict requirement on the code that accesses task-specific message queues: that access must be locked, to prevent the same message queue from being added to the global queue more than once.

In fact, although enqueueing into the global queue is wait-free, after q->tail is advanced by the atomic increment, the content of the slot the old tail pointed to (q->queue[tail]) has not yet been published. That is why the global queue carries a flag array alongside the queue array, marking whether q->queue[tail] has been set. And to prevent the CPU's out-of-order execution from setting the flag first, a memory barrier (__sync_synchronize()) must be issued before setting the flag to true, guaranteeing that the flag store happens after the q->queue[tail] assignment.
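For reference, the enqueue/dequeue code above relies on roughly the following definitions (my reconstruction from the same generation of skynet's skynet_mq.c; the exact field layout is an assumption):

#include <stdint.h>
#include <stdbool.h>

#define MAX_GLOBAL_MQ 0x10000        // 65536 slots, allocated once, never grown
#define GP(p) ((p) % MAX_GLOBAL_MQ)  // wrap a 32-bit counter into the ring

struct global_queue {
    uint32_t head;                   // consumer counter, advanced by CAS
    uint32_t tail;                   // producer counter, advanced by fetch-and-add
    struct message_queue **queue;    // ring of per-task queue pointers
    bool *flag;                      // has queue[i] been published yet?
};

static struct global_queue *Q;       // the single process-wide instance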

On dequeue, besides returning NULL when the global queue is empty (the if (head_ptr == GP(q->tail)) check), there is the race where multiple threads try to take the same message at once. The compare-and-swap on q->head guarantees that exactly one thread wins the message; the losing threads do not move on to the following messages, but simply behave as if the queue were empty and go idle. See the following code:

static void *
_worker(void *p) {
    struct worker_parm *wp = p;
    int id = wp->id;
    struct monitor *m = wp->m;
    struct skynet_monitor *sm = m->m[id];
    for (;;) {
        if (skynet_context_message_dispatch(sm)) {
            // no message was fetched: wait here and suspend the thread
            CHECK_ABORT
            if (pthread_mutex_lock(&m->mutex) == 0) {
                ++ m->sleep;
                pthread_cond_wait(&m->cond, &m->mutex);
                -- m->sleep;
                if (pthread_mutex_unlock(&m->mutex)) {
                    fprintf(stderr, "unlock mutex error");
                    exit(1);
                }
            }
        }
    }
    return NULL;
}

Under heavy concurrent client load, such dequeue collisions should be fairly frequent, and this handling causes threads to suspend often. To avoid waking threads too frequently, skynet wakes a suspended thread only when all workers are idle and a new socket request then arrives:

static void
wakeup(struct monitor *m, int busy) {
    if (m->sleep >= m->count - busy) {
        // with busy == 0, a thread is woken only when the number of
        // sleeping threads equals the number of worker threads
        pthread_cond_signal(&m->cond);
    }
}

static void *
_socket(void *p) {
    struct monitor * m = p;
    for (;;) {
        int r = skynet_socket_poll();
        if (r==0)
            break;
        if (r<0) {
            CHECK_ABORT
            continue;
        }
        wakeup(m,0); // the busy parameter is 0
    }
    return NULL;
}
 

It follows that once some threads have been suspended by collisions on the global queue, they will not be woken as long as even one thread is still working, even if plenty of messages remain in the global queue. This does not use the multi-core capacity very effectively.

A fix can borrow from the go 1.1 runtime's goroutine scheduler: split the global queue into per-thread queues, where the task-specific queues in a given thread-specific queue are processed by that thread. Each thread-specific queue then has a single consumer thread, so a lock-free dequeue is easy to achieve and the problem above disappears. In addition, to avoid imbalance between threads, after processing a message a thread can decide, based on the current load of its own thread-specific queue (for example its length and how long messages have been waiting), whether to push the task-specific queue onto the tail of another thread's queue or back onto the tail of its own. And when its thread-specific queue is empty, a thread can take the next message directly from the task-specific queue it just serviced and continue, saving one enqueue and one dequeue. This scheme keeps the load balanced across threads without a complicated work-stealing design, and no thread starves.
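A minimal sketch of such a thread-specific queue, written with the same gcc builtins skynet uses (my illustration of the proposal, not skynet or go code; it assumes a fixed power-of-two capacity, ignores the full-queue case, and assumes only the owning worker ever calls tq_pop):

#define TQ_CAP 1024   // power of two, so modulo indexing survives wrap-around

struct thread_queue {
    struct message_queue *slots[TQ_CAP];
    uint32_t tail;    // advanced by producers, under the spin lock
    uint32_t head;    // touched only by the owning worker thread
    int lock;
};

// multiple producers: serialize them with a spin lock
static void
tq_push(struct thread_queue *q, struct message_queue *mq) {
    while (__sync_lock_test_and_set(&q->lock, 1)) {}
    q->slots[q->tail % TQ_CAP] = mq;
    __sync_synchronize();            // publish the slot before moving the tail
    ++q->tail;
    __sync_lock_release(&q->lock);
}

// single consumer (the owning worker): no lock needed
static struct message_queue *
tq_pop(struct thread_queue *q) {
    if (q->head == q->tail)          // a stale tail just delays the message
        return NULL;                 // until the worker's next iteration
    __sync_synchronize();            // read the slot only after the tail
    return q->slots[q->head++ % TQ_CAP];
}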

4. Task specific message queue

Each task-specific message queue belongs to one online client and corresponds to one socket and one lua vm. The queue has multiple concurrent producers but only one concurrent consumer. For such a queue we could normally take a spin lock on enqueue and go lock-free on dequeue (using a memory barrier to keep the cpu from reordering memory accesses). In skynet, however, a spin lock is taken on both enqueue and dequeue:

#define LOCK(q) while (__sync_lock_test_and_set(&(q)->lock,1)) {}
#define UNLOCK(q) __sync_lock_release(&(q)->lock);

int
skynet_mq_pop(struct message_queue *q, struct skynet_message *message) {
    int ret = 1;
    LOCK(q)

    if (q->head != q->tail) {
        *message = q->queue[q->head];
        ret = 0;
        if ( ++ q->head >= q->cap) {
            q->head = 0;
        }
    }

    if (ret) {
        q->in_global = 0;
    }

    UNLOCK(q)

    return ret;
}

Enqueue:

void
skynet_mq_push(struct message_queue *q, struct skynet_message *message) {
    assert(message);
    LOCK(q)

    if (q->lock_session !=0 && message->session == q->lock_session) {
        _pushhead(q,message);
    } else {
        q->queue[q->tail] = *message;
        if (++ q->tail >= q->cap) {
            q->tail = 0;
        }

        if (q->head == q->tail) {
            expand_queue(q);
        }

        if (q->lock_session == 0) {
            if (q->in_global == 0) {
                q->in_global = MQ_IN_GLOBAL;
                skynet_globalmq_push(q);
            }
        }
    }

    UNLOCK(q)
}

A strange design at first glance. Careful reading of the enqueue code shows that skynet does not use the task-specific queue as a plain FIFO queue, but as a deque supporting both FIFO and LIFO access. The reason is to support skynet.blockcall in lua, for which a lock_session member was added to the task-specific queue structure:

struct message_queue {
    uint32_t handle;
    int cap;
    int head;
    int tail;
    int lock;
    int release;
    int lock_session;
    int in_global;
    struct skynet_message *queue;
};
 

The skynet.blockcall implementation code is as follows:

function skynet.blockcall(addr, typename , ...)
    local p = proto[typename]
    c.command("LOCK")
    local session = c.send(addr, p.id , nil , p.pack(...))
    if session == nil then
        c.command("UNLOCK")
        error("call to invalid address " .. skynet.address(addr))
    end
    return p.unpack(yield_call(addr, session))
end
 
When c.command("LOCK") executes, it sets the message queue's lock_session to the context's next session value; the session returned by c.send when the request is sent is exactly that value. When the response message comes back, skynet_mq_push sees that the session in the response matches the queue's lock_session and places the response at the head of the queue via the _pushhead function, so that skynet handles it as soon as possible. _pushhead is implemented as follows:
 
static void
_pushhead(struct message_queue *q, struct skynet_message *message) {
    int head = q->head - 1;
    if (head < 0) {
        head = q->cap - 1;
    }
    if (head == q->tail) {
        expand_queue(q);
        --q->tail;
        head = q->cap - 1;
    }

    q->queue[head] = *message;
    q->head = head;

    _unlock(q);
}

static void
_unlock(struct message_queue *q) {
    // this api use in push a unlock message, so the in_global flags must not be 0 ,
    // but the q is not exist in global queue.
    if (q->in_global == MQ_LOCKED) {
        skynet_globalmq_push(q);
        q->in_global = MQ_IN_GLOBAL;
    } else {
        assert(q->in_global == MQ_DISPATCHING);
    }
    q->lock_session = 0;
}

In my opinion, the deque is actually unnecessary. skynet.blockcall uses a new session, so it could use a new coroutine, and even if the response message were placed at the tail of the queue it would still be handled correctly. The only benefit of lock_session is to raise the priority of the response message, so that it is processed first when it arrives. But the same effect can be achieved with a separate high-priority queue, as in the erlang scheduler: give each task an additional high-priority queue and you get the behavior of the current deque, while dequeueing can go back to being lock-free, which should improve performance a little.
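A minimal sketch of that alternative (my illustration; struct skynet_message mirrors skynet's definition, while fifo and task_queue are hypothetical helpers):

#include <stdint.h>
#include <stddef.h>

struct skynet_message {      // as in skynet_mq.h
    uint32_t source;
    int session;
    void * data;
    size_t sz;
};

struct fifo {                // plain circular buffer, single consumer
    struct skynet_message *buf;
    int cap, head, tail;
};

static int
fifo_pop(struct fifo *f, struct skynet_message *out) {
    if (f->head == f->tail) return 1;      // empty
    *out = f->buf[f->head];
    if (++f->head >= f->cap) f->head = 0;
    return 0;
}

// each task owns two FIFOs; a response whose session matches a pending
// blockcall would be pushed to `high` instead of _pushhead'ed into a deque
struct task_queue {
    struct fifo high;        // response messages, always drained first
    struct fifo normal;      // ordinary messages
};

static int
task_pop(struct task_queue *q, struct skynet_message *out) {
    if (fifo_pop(&q->high, out) == 0) return 0;
    return fifo_pop(&q->normal, out);
}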

5. Summary

I have never done game development; I am just riffing here on years of experience building telecom server software plus some personal interest. Telecom server software and game backend software have much in common, but the two fields also differ in major ways, and many issues that look important to me may not actually matter in game development. skynet has been in real-world use for a while, and its specific design and implementation presumably answer specific needs. As I always say, we can argue right and wrong in software development, but good or bad is ultimately decided by how it runs in practice. Since I have not seen the concrete application scenarios, the summary here merely extrapolates from general server-software principles.

In the skynet code, handle access uses a read-write lock, task-specific queue access uses a spin lock, and global queue access uses a wait-free lock-free design. Handle access is read-mostly, so the read-write lock is appropriate there; the scheduling design of the task-specific queues and the global queue, however, leaves room for improvement. Every message the system processes costs one dequeue and one enqueue on the global queue, so the global queue is used very frequently and at first glance the wait-free design seems ideal. On closer analysis, though, when the system is busy the wait-free design makes conflicting threads wait unnecessarily. Replacing the global queue with thread-specific queues, using a spin lock on enqueue and no lock on dequeue, would improve multi-core efficiency. And since a task-specific queue has only one concurrent consumer, once it is a plain FIFO queue its dequeue needs no lock either.



Origin: blog.csdn.net/lingshengxueyuan/article/details/111702218