One Article to Understand Golang Goroutine Scheduling [GMP Design Ideas]

1 The origin of the Golang scheduler

1.1 The problem with a single process: blocking wastes CPU time

  1. With a single sequential program, the computer can only handle one task at a time
  2. When the process blocks, CPU time is wasted

1.2 Problems with multi-process / multi-thread: complex design, high memory and CPU overhead

  1. The design becomes complicated:
    • The more processes/threads there are, the higher the switching cost and the more time is wasted
    • Multithreading usually brings synchronization and contention (e.g. locks, resource conflicts)
  2. The inherent overhead of processes and threads:
    • High memory usage:
      - A process occupies virtual memory (4GB on a 32-bit OS)
      - A thread occupies about 4MB
    • High CPU cost for scheduling

1.3 Coroutine (co-routine) models (M:N, performance depends on the scheduler)

The CPU fundamentally only runs threads; we merely make a logical distinction between user-space coroutines and kernel-space threads, and then use the programming language to manage the user-space threads (coroutines).

①N:1 model:

  • Cannot take advantage of multiple CPUs
  • One blocking call becomes a bottleneck for all coroutines

②1:1 model:

  • Same as the multi-thread/multi-process model
  • Switching coroutines is as expensive as switching threads

③M:N model:

  • Can take advantage of multiple cores
  • Relies on the optimization and algorithms of the coroutine scheduler

1.4 Optimization of Scheduling

Goroutine optimizations:

  • Small memory footprint (a few KB), so goroutines can be created in huge numbers
  • Flexible scheduling with low switching cost
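
For instance, because each goroutine starts with a stack of only a few KB, launching very large numbers of them is cheap. A minimal sketch (the count of 100000 and the sync.WaitGroup pattern are just illustrative):

package main

import (
	"fmt"
	"sync"
)

func main() {
	const n = 100000 // 100k goroutines are affordable thanks to tiny initial stacks

	var wg sync.WaitGroup
	wg.Add(n)
	for i := 0; i < n; i++ {
		go func(id int) {
			defer wg.Done()
			_ = id * id // trivial stand-in for real work
		}(i)
	}
	wg.Wait()
	fmt.Println("all goroutines finished")
}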

Disadvantages of the early Go scheduler:

It was based on a single global G queue and simple polling, with multiple threads (M) scheduling from it. Disadvantages:

  1. Creating, destroying, and scheduling a G (goroutine) required each M to acquire a global lock, causing fierce lock contention
  2. Handing a G from one M (thread) to another caused delays and extra system load
  3. System calls (Ms switching on and off the CPU) caused frequent thread blocking and unblocking, which is expensive

2 Design ideas of GMP model


In Go, the thread (M) is the entity that actually runs goroutines; the scheduler's job is to assign runnable goroutines to worker threads.

2.1 Introduction to the GMP model

①GMP: goroutine - processor - machine (thread)

G: goroutine (the coroutine)
P: processor (the logical processor)
M: machine (the kernel thread)

②Global queue: stores Gs waiting to run

Holds goroutines that are waiting to be scheduled.

③P's local queue: stores Gs waiting to run

The processor's local run queue:

  • Stores goroutines waiting to run
  • Limited capacity: no more than 256 Gs
  • A newly created goroutine is placed in the creating P's local queue first; it goes to the global queue only when the local queue is full

④P list: created when the program starts

  • Created when the program starts
  • At most GOMAXPROCS Ps (configurable)

⑤M list: the kernel threads allocated to the Go program by the OS

The set of kernel threads the operating system has allocated to the current Go program.

⑥ Numbers of P and M

  1. Number of Ps:
    • Set by the environment variable $GOMAXPROCS
    • Or by calling runtime.GOMAXPROCS() in the program
  2. Number of Ms:
    • The Go runtime limits the number of Ms to at most 10,000 by default
    • It can be changed with SetMaxThreads in the runtime/debug package
    • If an M blocks, a new M may be created
    • If an M is idle, it will be recycled or put to sleep
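
Both knobs can be adjusted at runtime. A minimal sketch (the values 4 and 20000 are purely illustrative):

package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

func main() {
	// Number of Ps: defaults to the number of CPU cores; can also be set
	// via the GOMAXPROCS environment variable before the program starts.
	prev := runtime.GOMAXPROCS(4) // use 4 Ps from now on; returns the previous value
	fmt.Println("previous GOMAXPROCS:", prev, "NumCPU:", runtime.NumCPU())

	// Upper limit on Ms (OS threads): defaults to 10000.
	debug.SetMaxThreads(20000) // exceeding this limit crashes the program
}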

2.2 Scheduler Design Strategy

① Thread reuse: the work stealing and hand off mechanisms

Avoid frequently creating and destroying threads; reuse existing threads instead.

  • Work stealing: when a thread has no runnable G, it first tries to take a G from the global queue; if the global queue is empty, it steals Gs from Ps bound to other threads instead of destroying the idle thread [local queue → global queue → other Ps' queues]
  • Hand off: when the G running on a thread makes a blocking system call, the thread releases its bound P and hands the P over to another idle thread so execution can continue.

②Exploiting parallelism

GOMAXPROCS sets the number of Ps, so at most GOMAXPROCS threads run Go code simultaneously, distributed across multiple CPUs.

③ Preemption

In a classic coroutine model, the next coroutine can only run after the current one voluntarily yields the CPU. In Go, a goroutine may occupy the CPU for at most about 10ms before being preempted, which prevents other goroutines from being starved.
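
A minimal sketch of this guarantee (assumes Go 1.14 or newer, where the runtime preempts goroutines asynchronously; on much older versions the tight loop below could starve main forever):

package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	runtime.GOMAXPROCS(1) // a single P, so both goroutines compete for one CPU slot

	go func() {
		for {
			// tight CPU-bound loop: no function calls, channel operations, or allocations
		}
	}()

	time.Sleep(100 * time.Millisecond)
	// The busy goroutine is preempted, so main gets rescheduled and this line prints.
	fmt.Println("main was not starved by the busy-loop goroutine")
}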

④Global G queue

When an M performs work stealing, it can also take Gs from the global G queue.

  • If there is no runnable goroutine in the local queue, the stealing mechanism obtains runnable Gs from elsewhere. The lookup order is: the local queue first, then the global queue, and finally stealing from other Ps. See proc.go in the runtime source for details; a simplified sketch follows below.
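
A highly simplified sketch of that lookup order, with hypothetical G and P stand-in types (this is not the real findrunnable code in proc.go):

package main

import "fmt"

// G and P are stand-ins for the runtime's goroutine and processor structures.
type G struct{ id int }

type P struct {
	runq []*G // local run queue
}

var globalQueue []*G // stand-in for the scheduler's global run queue

// findRunnable mirrors the order described above:
// local queue -> global queue -> steal from another P.
func findRunnable(self *P, others []*P) *G {
	// 1. Local queue first.
	if len(self.runq) > 0 {
		g := self.runq[0]
		self.runq = self.runq[1:]
		return g
	}
	// 2. Then the global queue.
	if len(globalQueue) > 0 {
		g := globalQueue[0]
		globalQueue = globalQueue[1:]
		return g
	}
	// 3. Finally, steal the back half of another P's local queue.
	for _, p := range others {
		if n := len(p.runq); n > 0 {
			half := (n + 1) / 2
			stolen := p.runq[n-half:]
			p.runq = p.runq[:n-half]
			self.runq = append(self.runq, stolen[1:]...)
			return stolen[0]
		}
	}
	return nil // nothing runnable anywhere
}

func main() {
	p1 := &P{runq: []*G{{id: 1}, {id: 2}}}
	p2 := &P{}
	fmt.Println(findRunnable(p2, []*P{p1}).id) // steals the back half of p1's queue, prints 2
}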

2.3 What process does go func() go through


  1. We create a goroutine with go func()
  2. There are two kinds of queues that store Gs: each P's local queue and the global G queue. A newly created G is stored in P's local queue first; if that local queue is full, it is put into the global queue.
  3. A G can only run on an M, and an M must hold a P; the relationship between M and P is 1:1. M pops a runnable G from its P's local queue and executes it. If the local queue is empty, it takes a G from the global queue, or steals one from another M/P combination (local → global → other Ps' queues).
  4. The process of M scheduling and executing Gs is a loop.
  5. When M is executing a G and a syscall or other blocking operation occurs, M blocks. If there are other Gs waiting to run, the runtime detaches M from its P and assigns the P to a newly created OS thread (or to an existing idle thread if one is available).
  6. When M's system call finishes, its G tries to acquire an idle P to continue running and is put into that P's local queue. If no P can be acquired, the thread M goes to sleep and joins the idle-thread list, and the G is put into the global queue.

2.4 The scheduler's life cycle: M0 and G0


package main

import "fmt"

func main() {
    fmt.Println("Hello world")
}

Let's analyze the flow of this code:

  1. The runtime creates the initial thread M0 and the initial goroutine G0 and associates them with each other
  2. Scheduler initialization: initialize M0, stacks, and the garbage collector, and create and initialize the P list consisting of GOMAXPROCS Ps
  3. The main function in the sample code is main.main, and the runtime also has its own main function, runtime.main (after compilation, runtime.main calls main.main). When the program starts, a goroutine is created for runtime.main; it is called the main goroutine and is added to a P's local queue.
  4. M0 starts. M0 is already bound to a P, so it takes a G from that P's local queue and gets the main goroutine
  5. The G owns its stack, and M sets up the execution environment according to the stack and scheduling information in the G
  6. M runs the G
  7. When the G exits, M returns to fetch another runnable G, and this repeats until main.main exits; then runtime.main performs defer and panic handling and finally calls runtime.exit to terminate the program.

The life cycle of the scheduler spans almost the entire life of a Go program. Everything that happens before the runtime.main goroutine runs is preparation for the scheduler; scheduling really begins when the runtime.main goroutine starts running, and it lasts until runtime.main ends.

①M0: the first main thread created when the Go program starts

M0 is the main thread, numbered 0, created when the program starts. The corresponding M instance lives in the global variable runtime.m0 and does not need to be allocated on the heap. M0 is responsible for performing initialization and starting the first G; after that, M0 behaves like any other M.

②G0: the first goroutine created for each M when it starts

G0 is the first goroutine created every time an M is started. G0 is used only for scheduling; it does not point to any executable user function. Each M has its own G0, whose stack is used during scheduling and system calls. The global variable g0 refers to M0's G0.

2.5 GMP visual debugging (trace programming)

① Basic trace programming

  1. Create a trace file: f, err := os.Create("trace.out")
  2. Start the trace: trace.Start(f)
  3. Stop the trace: trace.Stop()
  4. After go build and running the binary, you get a trace.out file

trace/main.go:

package main

import (
	"fmt"
	"os"
	"runtime/trace"
)

func main() {
	// 1. Create the trace file
	file, err := os.Create("trace.out")
	if err != nil {
		panic(err)
	}
	defer file.Close()
	// 2. Start the trace
	trace.Start(file)
	// Run the business logic
	fmt.Println("do something...")
	// 3. Stop the trace
	trace.Stop()
}

After running the above program, a trace.out file is produced in the current directory.

② Open the trace file with the go tool trace tool

  1. go tool trace trace.out
  2. Open the printed address, e.g. http://127.0.0.1:xxxx (the port is random)


Click view trace on the page to open the timeline view.

Click on different items in the graph to see their details at the bottom left.

1. G information

Click the visualization bar of the Goroutines row to see the details.

There are two Gs in the program. One is the special G0, the scheduling G that every M has; we do not need to discuss it here.

G1 is the main goroutine (the coroutine that executes the main function); it is in the runnable and running states for a period of time.

2. M information

Click the visualization bar of the Threads row to see the details.

There are two Ms in the program. One is the special M0, used for initialization; we do not need to discuss it here.

3. P information

main.main is called in G1, which creates the trace goroutine G18. G1 runs on P1 and G18 runs on P0.

There are two Ps here, and a P must be bound to an M before it can schedule Gs.

4. M information (viewed again)

While G18 is running on P0, an extra M appears in the Threads row. Clicking on it shows that this extra M2 is the thread dynamically created so that P0 can execute G18.

③ Debugging with GODEBUG schedtrace

main.go

package main

import (
    "fmt"
    "time"
)

func main() {
    for i := 0; i < 5; i++ {
        time.Sleep(time.Second)
        fmt.Println("Hello World")
    }
}

# Build the program
go build main.go
# Run the compiled binary with scheduler tracing every 1000ms
GODEBUG=schedtrace=1000 ./main
# Check the output
SCHED 0ms: gomaxprocs=2 idleprocs=0 threads=4 spinningthreads=1 idlethreads=1 runqueue=0 [0 0]
Hello World
SCHED 1003ms: gomaxprocs=2 idleprocs=2 threads=4 spinningthreads=0 idlethreads=2 runqueue=0 [0 0]
Hello World
SCHED 2014ms: gomaxprocs=2 idleprocs=2 threads=4 spinningthreads=0 idlethreads=2 runqueue=0 [0 0]
Hello World
SCHED 3015ms: gomaxprocs=2 idleprocs=2 threads=4 spinningthreads=0 idlethreads=2 runqueue=0 [0 0]
Hello World
SCHED 4023ms: gomaxprocs=2 idleprocs=2 threads=4 spinningthreads=0 idlethreads=2 runqueue=0 [0 0]
Hello World

● SCHED: a flag marking this line as output from the goroutine scheduler;
● 0ms: the time from program start until this line was printed;
● gomaxprocs: the number of Ps; by default it equals the number of CPU cores and can be changed via GOMAXPROCS;
● idleprocs: the number of Ps in the idle state; gomaxprocs minus idleprocs is the number of Ps currently executing Go code;
● threads: the number of OS threads (Ms), including those used by the scheduler plus threads used by the runtime itself, such as sysmon;
● spinningthreads: the number of OS threads in the spinning state;
● idlethreads: the number of OS threads in the idle state;
● runqueue=0: the number of Gs in the scheduler's global queue;
● [0 0]: the number of Gs in the local queues of the 2 Ps, respectively.

3 GMP scheduling scenario analysis

3.1 G1 creates G2: G2 goes into the local queue first

Gs created by a running G are added to that P's local queue first.
P1 owns G1, and M1 starts running G1 after acquiring P1. G1 creates G2 with go func(), and for locality G2 is added to P1's local queue first.


3.2 G1 finishes and G2 is scheduled: G0 does the scheduling

After G1 finishes, G0 takes over, fetches G2 from the queue, and lets P execute it.
After G1 finishes running (function: goexit), the goroutine running on M1 switches to G0, which is responsible for switching coroutines during scheduling (function: schedule). G0 fetches G2 from P's local queue, switches from G0 to G2, and starts running G2 (function: execute). This is how thread M1 is reused.


3.3 G2 creates too many Gs

Assume each P's local queue can hold only 4 Gs. G2 creates 6 Gs; the first 4 (G3, G4, G5, G6) have joined P1's local queue, which is now full.


3.4 The local queue is full and G2 creates another G: spill half to the global queue

The local queue of G2's P is already full, but a new G is created; the front half of the local queue, together with the newly created G, is moved to the global queue. (This keeps things randomized and prevents the new G from being starved.)
When G2 creates G7, it finds that P1's local queue is full, so load balancing is needed: the front half of the Gs in P1's local queue and the newly created G are moved to the global queue.

  • (In the implementation it is not necessarily the new G that is moved: if G7 is to be executed right after G2, it stays in the local queue and an older G is moved to the global queue in its place.)
  • When these Gs are moved to the global queue they are shuffled, so G3, G4 and G7 end up in the global queue.


3.5 G2 creates G8 while the local queue is not full: put it in the local queue

If the local queue is not full when a new G is created, the new G goes into the local queue first.
When G2 creates G8, P1's local queue is not full, so G8 is added to P1's local queue.

  • G8 is added to P1's local queue because P1 is bound to M1 at this moment, and M1 is executing G2. A new G created by G2 is therefore placed first on the P bound to G2's own M.


3.6 When a new G is created, idle Ms and Ps are woken up to consume it

After a new G is created successfully, the runtime tries to wake up an idle M and P and pair them up to consume new Gs.
Rule: when a G is created, the running G tries to wake up other idle P and M combinations to execute it.

  • Assume G2 wakes up M2. M2 binds P2 and runs G0, but there is no G in P2's local queue, so M2 is now a spinning thread (a thread that is running but has no G, constantly looking for Gs).


3.7 Pulling Gs from the global queue to a P's local queue (load balancing)

A spinning thread pulls Gs from the global queue according to a load-balancing rule.
M2 tries to take a batch of Gs from the global queue ("GQ" for short) and put them into P2's local queue (function: findrunnable()). The number of Gs M2 takes from the global queue follows this formula:

n = min(len(GQ)/GOMAXPROCS + 1, cap(LQ)/2)


At least 1 G is taken from the global queue, but not too many Gs are moved into one P's local queue at a time, leaving some for the other Ps. This is load balancing from the global queue to the P local queues.
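
A worked example of the formula (integer division; the local-queue capacity of 256 comes from section 2.1, and the helper below is an illustrative sketch, not the actual runtime code):

package main

import "fmt"

// grabCount mirrors n = min(len(GQ)/GOMAXPROCS + 1, cap(LQ)/2).
func grabCount(globalLen, gomaxprocs, localCap int) int {
	n := globalLen/gomaxprocs + 1
	if half := localCap / 2; n > half {
		n = half
	}
	return n
}

func main() {
	// With 6 Gs in the global queue, GOMAXPROCS=4, and a 256-slot local queue:
	// min(6/4+1, 256/2) = min(2, 128) = 2, so two Gs are moved at once.
	fmt.Println(grabCount(6, 4, 256)) // prints 2
}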

3.8 M2 steals Gs from M1

If M2's queues are empty, it steals the back half of M1's local queue and consumes it.
Assume G2 keeps running on M1. After two rounds, M2 has taken G7 and G4 from the global queue into P2's local queue and finished running them; both the global queue and P2's local queue are now empty.

Since the global queue has no Gs either, M2 performs work stealing: it steals half of the Gs from another P that still has Gs and puts them into its own P's local queue. P2 takes half of the Gs from the tail of P1's local queue; in this example "half" is just one G, G8, which is put into P2's local queue and executed.

3.9 The upper limit on spinning threads: spinning + running <= GOMAXPROCS

Running threads + spinning threads <= GOMAXPROCS.
G5 and G6 in P1's local queue have been stolen by other Ms and have finished running. Currently M1 and M2 are running G2 and G8 respectively, while M3 and M4 have no goroutines to run, so they are spinning, constantly looking for goroutines.


Why let M3 and M4 spin? Spinning is essentially running: the thread is running but executing no G, which looks like wasted CPU. So why not destroy them to save resources? Because creating and destroying threads also takes time, and we want an M to be ready to run a newly created goroutine immediately; destroying and re-creating threads would add latency and reduce efficiency. On the other hand, too many spinning threads also waste CPU, so the system allows at most GOMAXPROCS spinning threads (GOMAXPROCS=4 in this example, so there are 4 Ps in total); any additional idle threads are put to sleep.

3.10 A G makes a blocking system call

P unbinds from the blocked M and wakes up a sleeping M to pair with.
Assume that besides the spinning threads M3 and M4 there are also idle threads M5 and M6 (not bound to any P; note that there can be at most 4 Ps here, so the number of Ms is normally >= the number of Ps, with most Ms competing for a P to run on). G8 creates G9, and then G8 makes a blocking system call. M2 and P2 are immediately unbound, and P2 performs the following check: if P2's local queue has Gs, or the global queue has Gs, or there is an idle M, then P2 immediately wakes up an M and binds to it; otherwise P2 joins the idle P list and waits for an M to pick it up. In this scenario P2's local queue has G9, so it can bind to the idle thread M5.


3.11 A G makes a non-blocking system call

The M tries to reacquire the P it was previously bound to; if that P has already paired with another M, the M tries to get a P from the idle P list. If that also fails, the G whose system call has finished is put into the global queue.
G8 creates G9; this time assume G8 makes a non-blocking system call.


M2 and P2 are unbound, but M2 remembers P2; then G8 and M2 enter the system-call state. When G8 and M2 exit the system call, M2 tries to reacquire P2. If it cannot, it tries to get an idle P; if there is none, G8 is marked as runnable and put into the global queue, and M2, having no P to bind to, goes to sleep and joins the idle-thread list.

Summary: the Go scheduler is lightweight and simple, yet powerful enough to handle goroutine scheduling and to give Go its native concurrency capability. The essence of Go scheduling is to distribute a large number of goroutines over a small number of kernel threads and to use multi-core parallelism to achieve strong concurrency.

Reference: https://www.yuque.com/aceld/golang/srxd6d

Origin blog.csdn.net/weixin_45565886/article/details/132132019