Golang runtime scheduling

Go was designed as a concurrent language from the start, and high concurrency is one of its defining characteristics. Concurrent (and parallel) programming in Go is built on the goroutine, one of the language's most important features: goroutines are cheap to create, consume few resources, and are highly efficient. The Go team claims that running tens of thousands of goroutines concurrently is no problem, so the goroutine has become a tool Gophers reach for constantly.

The goroutine is the core of concurrent (and parallel) programming in Go. So what exactly is a goroutine?

At a fairly superficial level, a goroutine is usually treated as Go's implementation of coroutines, and that understanding is reasonable as far as it goes.

However, a goroutine is not a coroutine in the traditional sense.

There are three mainstream threading models today:

  • The kernel-level threading model
  • The user-level threading model
  • The two-level threading model (also known as the hybrid threading model)

Traditional coroutine libraries belong to the user-level threading model.

Goroutines and the Go scheduler beneath them actually belong to the two-level threading model.

So while it is sometimes convenient, for the sake of understanding, to loosely compare goroutines to coroutines, keep a clear distinction in mind: a goroutine is not the same thing as a coroutine.

Thread

There are three main threading models: the kernel-level threading model, the user-level threading model, and the two-level threading model (also known as the hybrid threading model). The biggest difference between them is the correspondence between user threads and kernel scheduling entities (KSE, Kernel Scheduling Entity). A KSE is an entity that the operating system kernel's scheduler can schedule.

Put simply, a KSE is a kernel-level thread, the smallest scheduling unit of the operating system kernel; it is what we ordinarily mean by a "thread" when writing code.

User-level thread model

In this model, user threads map to KSEs many-to-many-to-one (N:1). Typically a single process has many user threads, and scheduling among them is done entirely by the process's own thread library: creating threads, destroying them, and coordinating between them is handled by the user-level library without system calls. All threads created within the process dynamically bind to a single KSE at runtime; that is, the operating system only knows about the user process and has no awareness of the threads inside it. Most languages' coroutine libraries (for example Python's gevent) work this way.

Since thread scheduling happens in user space, the CPU never needs to switch between user mode and kernel mode the way it does under a kernel scheduler, so this implementation is very lightweight compared to kernel-level threads: it consumes far fewer system resources, making both thread creation and context switching much cheaper. But the model has a fatal flaw: it is not truly concurrent. Suppose one user thread in a process is taken off the CPU because of a blocking call (such as blocking I/O). Then every thread in that process blocks, because user-level scheduling inside the process has no CPU timer interrupt and therefore no preemptive round-robin scheduling: the whole process hangs. Even a multi-CPU machine does not help, because in the user-level model a CPU is associated with the entire process; the threads inside the process are scheduled by the process itself and are invisible to the CPU. Here the CPU's scheduling unit can be understood as the user process.

For this reason, many coroutine libraries repackage their blocking operations into fully non-blocking forms; at what would otherwise be a blocking point, the current coroutine voluntarily yields and arranges, by some notification mechanism, for other user threads to be woken and run on the KSE. This prevents the kernel scheduler from context-switching the KSE away due to blocking, so the whole process never stalls.

Kernel-level threading model

In this model user threads map to KSEs one-to-one (1:1): each user thread is bound to a real kernel thread, and scheduling is delegated entirely to the operating system kernel. Applications create, terminate, and synchronize threads through system calls provided by the kernel. Most languages' thread libraries (Java's java.lang.Thread, C++11's std::thread, and so on) are thin wrappers over operating-system threads (kernel-level threads); each created thread is statically bound to its own KSE, and scheduling is done entirely by the kernel scheduler. In other words, each of the multiple threads created within a process binds its own KSE. The model's advantages and disadvantages are equally clear. The advantage is simplicity: by leaning directly on the OS kernel's threads and scheduler, thread switching on the CPU is fast, and multiple threads genuinely run simultaneously, so unlike the user-level model it achieves true parallelism. The disadvantage is that because creation, destruction, and context switching between threads all go through the OS kernel, resource costs rise sharply and performance suffers.

Two-level threading model

The two-level threading model is a product of the first two: it absorbs their advantages and tries to avoid their disadvantages. In this model, user threads map to KSEs many-to-many (N:M). First, unlike the user-level model, a process's threads can be associated with multiple KSEs; multiple threads within a process can each bind to their own KSE, which resembles the kernel-level model. Second, unlike the kernel-level model, threads are not uniquely bound to a single KSE; several user threads can map onto one KSE, and when a KSE is moved off the CPU by the kernel scheduler because its bound thread blocked, the remaining user threads associated with that process can re-bind to other KSEs and keep running.

So the two-level model neither schedules entirely on its own like the user-level model, nor relies entirely on the operating system scheduler like the kernel-level model; it is an intermediate state (self-scheduling and system scheduling working together). Because of its high complexity, OS kernel developers generally do not use it directly; it more often appears in third-party libraries. The Go runtime scheduler uses exactly this approach to implement the dynamic association between goroutines and KSEs, only Go's implementation is more advanced and elegant. Why is the model called "two-level"? Because the user-level scheduler "schedules" user threads onto KSEs, and the kernel scheduler then schedules KSEs onto CPUs.

GPM model

Each OS thread has a fixed-size block of memory (typically 2MB) for its stack, which stores the local variables of the function currently executing or suspended (that is, waiting while another function it called runs). This fixed size cuts both ways: 2MB of stack is a huge waste of memory for a small goroutine, yet too small for complex tasks (such as deeply nested recursion). So Go implements its own "threads".

In Go, each goroutine is an independent unit of execution. Instead of the fixed 2MB allocated to each OS thread, a goroutine's stack grows dynamically: it starts at just 2KB and grows on demand as the task runs, up to a maximum of 1GB on 64-bit machines and 256MB on 32-bit machines, and goroutines are scheduled entirely by Go's own scheduler, the Go scheduler.

In addition, the GC periodically collects stack memory that is no longer in use, shrinking stack space. So the fact that a Go program can run thousands of goroutines concurrently is thanks to this powerful scheduler and efficient memory model. Go's creators clearly positioned the goroutine as the core component of concurrent programming in Go (developers build their programs on goroutines), and goroutines appear everywhere in the standard library, for example in the net/http package; even runtime components such as the GC garbage collector run on goroutines. The importance the designers attach to the goroutine is evident.

Any user thread is ultimately executed on an OS thread, and goroutines (called G) are no exception. But a G does not bind directly to an OS thread; instead the goroutine scheduler interposes P, the Logical Processor, as an intermediary. A P can be seen as an abstract resource, or a context, and a P binds to one OS thread.

In Go's implementation, the OS thread is abstracted into a data structure called M. A G is actually scheduled onto an M for execution through a P, but from the G's point of view, the P provides all the resources and environment the G needs to run, so to a G the P looks like the "CPU" it runs on. These three abstractions, G, P, and M, form the basic structure of the Go scheduler:

  • G: represents a goroutine. Each goroutine corresponds to a G struct, which stores the goroutine's stack, the function to run, and the task's status; Gs are reusable. A G is not itself executable; each G must be bound to a P before it can be scheduled for execution.

  • P: Processor, the logical processor. To a G, a P is like a CPU core: a G can be scheduled only by binding to a P (into that P's local runq). To an M, a P provides the execution context, such as the memory allocation state (mcache) and the task queue of Gs. The number of Ps determines the maximum number of Gs that can run in parallel (provided the number of physical CPU cores >= the number of Ps). The number of Ps is set by the user via GOMAXPROCS, but however GOMAXPROCS is set, the number of Ps is capped at 256.

  • M: Machine, the abstraction of an OS thread. It represents the real compute resource that executes tasks. After binding a valid P, an M enters the scheduling loop; roughly, that loop takes a G from the Global queue, from the P's Local queue, or from the wait queue, switches to the G's stack and runs the G's function, then calls goexit to clean up and returns to the M, and so on. An M does not preserve G state, which is what allows a G to be rescheduled across different Ms. The number of Ms is not fixed; it is adjusted by the Go runtime, which prevents the creation of so many OS threads that the system scheduler is overwhelmed. The current default upper limit is 10,000.

In the current release, Go 1.13.6, the GPM model is defined in the source file src/runtime/runtime2.go. As for why the maximum number of Ms is limited to 10,000, it can be viewed here.

A note on P: when Go 1.0 was released, the scheduler was actually a GM model, that is, there was no P, and the whole scheduling process was carried out by Gs and Ms alone. This model revealed several problems:

  • A single global mutex (Sched.Lock) protected centrally stored state, so all goroutine-related operations, such as creation and rescheduling, had to take the lock;

  • Goroutine hand-off: Ms frequently passed runnable goroutines between one another, causing extra scheduling latency and performance loss;

  • Every M kept its own memory cache, leading to excessive memory usage and poor data locality;

  • Aggressive worker-thread blocking and unblocking around syscalls caused additional performance loss.

These problems were severe enough that Go 1.0, despite being known for native concurrency support, was criticized for its concurrent performance. So Dmitry Vyukov analyzed the model's scalability problems in the Scalable Go Scheduler Design Doc and improved it by adding P (Processors).

The redesigned Go scheduler (introducing P into the original GM model) implements what is called a work-stealing scheduling algorithm:

  • Each P maintains a local queue of Gs;

  • When a G is created or becomes runnable, it is placed into the runnable queue of the current P;

  • When a G finishes running on an M, the next G is popped from that P's queue; if the queue is empty at that point, i.e. there is no other runnable G, the M randomly picks another P and steals half of the runnable Gs from its queue.

This algorithm avoids using a global lock when scheduling goroutines.

GPM scheduling process

The Go scheduler maintains two kinds of run queues to hold runnable Gs: a Global run queue, and a Local run queue maintained by each P.

When a new goroutine is created with the go keyword, it is preferentially placed into the current P's Local queue. To run goroutines, an M must hold (bind) a P; the M then starts an OS thread that loops, taking a goroutine from the P's Local queue and executing it.

Then there is the work-stealing algorithm mentioned above: once an M has executed all the Gs in its P's Local queue, the P does not sit idle waiting for work; it first tries to fetch a G from the Global queue, and if the Global queue is also empty, it randomly picks another P and steals half of the Gs from that P's queue to run itself.

// go1.13.6 src/runtime/proc.go

// GC checks and other details are omitted; only the main flow is kept
// g:       the G struct
// sched:   holds the Global queue

// Get a runnable G.
// Try to steal from other Ps, take a G from the global queue, or poll the network.
func findrunnable() (gp *g, inheritTime bool) {
   // get the current G
	_g_ := getg()

	// The conditions here and in handoffp must agree: if
	// findrunnable would return a G to run, handoffp must start
	// an M.

top:
    // get the current P
	_p_ := _g_.m.p.ptr()
	if sched.gcwaiting != 0 {
		gcstopm()
		goto top
	}
	if _p_.runSafePointFn != 0 {
		runSafePointFn()
	}
	if fingwait && fingwake {
		if gp := wakefing(); gp != nil {
			ready(gp, 0, true)
		}
	}
	if *cgo_yield != nil {
		asmcgocall(*cgo_yield, nil)
	}

	// 1. Try to get a G from the P's Local queue: _p_.runnext first, then the Local queue itself
	if gp, inheritTime := runqget(_p_); gp != nil {
		return gp, inheritTime
	}

	// 2. Try to get a G from the Global queue
	if sched.runqsize != 0 {
		lock(&sched.lock)
		// globrunqget takes a G from the Global queue and moves a batch of Gs to _p_'s Local queue
		gp := globrunqget(_p_, 0)
		unlock(&sched.lock)
		if gp != nil {
			return gp, false
		}
	}

	// Poll network.
	// This netpoll is only an optimization before we resort to stealing.
	// We can safely skip it if there are no waiters or a thread is blocked
	// in netpoll already. If there is any kind of logical race with that
	// blocked thread (e.g. it has already returned from netpoll, but does
	// not set lastpoll yet), this thread will do blocking netpoll below
	// anyway.
	
	// 3. Check for netpoll work
	if netpollinited() && atomic.Load(&netpollWaiters) > 0 && atomic.Load64(&sched.lastpoll) != 0 {
		if list := netpoll(false); !list.empty() { // non-blocking
			gp := list.pop()
			// netpoll returns a list of Gs; inject the other Gs back into the Global queue
			injectglist(&list)
			casgstatus(gp, _Gwaiting, _Grunnable)
			if trace.enabled {
				traceGoUnpark(gp, 0)
			}
			return gp, false
		}
	}

	// 4. Try to steal work from other Ps
	procs := uint32(gomaxprocs)
	if atomic.Load(&sched.npidle) == procs-1 {
		// Either GOMAXPROCS=1 or everybody, except for us, is idle already.
		// New work can appear from returning syscall/cgocall, network or timers.
		// Neither of that submits to local run queues, so no point in stealing.
		goto stop
	}
	// If number of spinning M's >= number of busy P's, block.
	// This is necessary to prevent excessive CPU consumption
	// when GOMAXPROCS>>1 but the program parallelism is low.
	if !_g_.m.spinning && 2*atomic.Load(&sched.nmspinning) >= procs-atomic.Load(&sched.npidle) {
		goto stop
	}
	if !_g_.m.spinning {
		_g_.m.spinning = true
		atomic.Xadd(&sched.nmspinning, 1)
	}
	for i := 0; i < 4; i++ {
	     // randomize the order in which Ps are traversed
		for enum := stealOrder.start(fastrand()); !enum.done(); enum.next() {
			if sched.gcwaiting != 0 {
				goto top
			}
			stealRunNextG := i > 2 // first look for ready queues with more than 1 g
			// runqsteal does the actual stealing, moving half of the Gs over from the target P's Local queue
			// stealRunNextG says whether to also steal the target P's p.runnext G
			if gp := runqsteal(_p_, allp[enum.position()], stealRunNextG); gp != nil {
				return gp, false
			}
		}
	}

stop:
// we have nothing to do if we



Origin blog.csdn.net/weixin_41818794/article/details/104227423