前言
在go语言中,每一个并发的执行单元叫做一个goroutine。你可以启动许多甚至成千上万的goroutine,Go的runtime负责对goroutine进行管理。所谓的管理就是“调度”,粗糙地说调度就是决定何时哪个goroutine将获得资源开始执行、哪个goroutine应该停止执行让出资源、哪个goroutine应该被唤醒恢复执行等
Goroutine调度器
线程数过多,意味着操作系统会不断的切换线程,频繁的上下文切换就成了性能瓶颈。 Go提供一种机制,可以在线程中自己实现调度,上下文切换更轻量。而线程中调度的就是Goroutine.
PMG模型
Dmitry Vyukov亲在Go 1.1中实现了G-P-M调度模型,一直沿用至今
基本概念:
- M(Machine):工作线程,可以认为它就是os thread(系统线程)。真正调度系统的执行者。M在绑定有效P后,从P的本地队列获得G,如果P的本地队列为空,则到全局队列获取,或者从其他P的队列偷取一半过来。G执行完成后,M从P的队列获取下一个G,不断循环下去。
- P(Processor):处理器(不是指CPU),是线程M执行的上下文,包含各种G对象队列、链表、cahce和状态,也有调度goroutine的能力。通过runtime.GOMAXPROCS控制P的数量
- G(Goroutine):即Go协程,调度系统的最小单位。存储了goroutine的执行stack信息、goroutine状态以及goroutine的任务函数等
runtime准备好G,P,M,然后M绑定P,M从各种队列中获取G,切换到G的执行栈上并执行G上的任务函数
其关系如下图所示:
- 全局队列(Global Queue):存放等待运行的G,是加锁的
- P的本地队列:通全局队列相似,也是存放的等待运行的G,但它是无锁的,而且数量有限,存满256个之后,会将一半的G存放到全局队列中。新建G时,会优先加入到P的本地队列中
PMG的创建及基本状态
下面,我们通过源码了解PMG的创建过程以及它们的基本状态
M
runtime/proc.go
func newm(fn func(), _p_ *p) {
// 根据fn和p和绑定一个m对象
mp := allocm(_p_, fn)
mp.nextp.set(_p_)
mp.sigmask = initSigmask
if gp := getg(); gp != nil && gp.m != nil && (gp.m.lockedExt != 0 || gp.m.incgo) && GOOS != "plan9" {
lock(&newmHandoff.lock)
if newmHandoff.haveTemplateThread == 0 {
throw("on a locked thread with no template thread")
}
mp.schedlink = newmHandoff.newm
newmHandoff.newm.set(mp)
if newmHandoff.waiting {
newmHandoff.waiting = false
notewakeup(&newmHandoff.wake)
}
unlock(&newmHandoff.lock)
return
}
newm1(mp)
}
func newm1(mp *m) {
if iscgo {
var ts cgothreadstart
if _cgo_thread_start == nil {
throw("_cgo_thread_start missing")
}
ts.g.set(mp.g0)
ts.tls = (*uint64)(unsafe.Pointer(&mp.tls[0]))
ts.fn = unsafe.Pointer(funcPC(mstart))
if msanenabled {
msanwrite(unsafe.Pointer(&ts), unsafe.Sizeof(ts))
}
execLock.rlock() // Prevent process clone.
asmcgocall(_cgo_thread_start, unsafe.Pointer(&ts))
execLock.runlock()
return
}
execLock.rlock() // Prevent process clone.
// 创建一个系统线程
newosproc(mp)
execLock.runlock()
}
M的状态转换如下图所示:
P
//runtime.GOMAXPROCS也会调用这个函数
func procresize(nprocs int32) *p {
old := gomaxprocs
// 如果 gomaxprocs <=0 抛出异常
if old < 0 || nprocs <= 0 {
throw("procresize: invalid arg")
}
if trace.enabled {
traceGomaxprocs(nprocs)
}
// update statistics
now := nanotime()
if sched.procresizetime != 0 {
sched.totaltime += int64(old) * (now - sched.procresizetime)
}
sched.procresizetime = now
// Grow allp if necessary.
if nprocs > int32(len(allp)) {
// Synchronize with retake, which could be running
// concurrently since it doesn't run on a P.
lock(&allpLock)
if nprocs <= int32(cap(allp)) {
allp = allp[:nprocs]
} else {
分配nprocs个*p
nallp := make([]*p, nprocs)
// Copy everything up to allp's cap so we
// never lose old allocated Ps.
copy(nallp, allp[:cap(allp)])
allp = nallp
}
unlock(&allpLock)
}
// initialize new P's
for i := int32(0); i < nprocs; i++ {
pp := allp[i]
if pp == nil {
pp = new(p)
pp.id = i
// 更改状态
pp.status = _Pgcstop
pp.sudogcache = pp.sudogbuf[:0]
for i := range pp.deferpool {
pp.deferpool[i] = pp.deferpoolbuf[i][:0]
}
pp.wbBuf.reset()
// 将pp保存到allp数组里, allp[i] = pp
atomicstorep(unsafe.Pointer(&allp[i]), unsafe.Pointer(pp))
}
if pp.mcache == nil {
if old == 0 && i == 0 {
if getg().m.mcache == nil {
throw("missing mcache?")
}
pp.mcache = getg().m.mcache // bootstrap
} else {
pp.mcache = allocmcache()
}
}
if raceenabled && pp.racectx == 0 {
if old == 0 && i == 0 {
pp.racectx = raceprocctx0
raceprocctx0 = 0 // bootstrap
} else {
pp.racectx = raceproccreate()
}
}
}
// free unused P's
for i := nprocs; i < old; i++ {
p := allp[i]
if trace.enabled && p == getg().m.p.ptr() {
// moving to p[0], pretend that we were descheduled
// and then scheduled again to keep the trace sane.
traceGoSched()
traceProcStop(p)
}
// move all runnable goroutines to the global queue
for p.runqhead != p.runqtail {
// pop from tail of local queue
p.runqtail--
gp := p.runq[p.runqtail%uint32(len(p.runq))].ptr()
// push onto head of global queue
globrunqputhead(gp)
}
if p.runnext != 0 {
globrunqputhead(p.runnext.ptr())
p.runnext = 0
}
// if there's a background worker, make it runnable and put
// it on the global queue so it can clean itself up
if gp := p.gcBgMarkWorker.ptr(); gp != nil {
casgstatus(gp, _Gwaiting, _Grunnable)
if trace.enabled {
traceGoUnpark(gp, 0)
}
globrunqput(gp)
// This assignment doesn't race because the
// world is stopped.
p.gcBgMarkWorker.set(nil)
}
// Flush p's write barrier buffer.
if gcphase != _GCoff {
wbBufFlush1(p)
p.gcw.dispose()
}
for i := range p.sudogbuf {
p.sudogbuf[i] = nil
}
p.sudogcache = p.sudogbuf[:0]
for i := range p.deferpool {
for j := range p.deferpoolbuf[i] {
p.deferpoolbuf[i][j] = nil
}
p.deferpool[i] = p.deferpoolbuf[i][:0]
}
freemcache(p.mcache)
p.mcache = nil
gfpurge(p)
traceProcFree(p)
if raceenabled {
raceprocdestroy(p.racectx)
p.racectx = 0
}
p.gcAssistTime = 0
p.status = _Pdead
// can't free P itself because it can be referenced by an M in syscall
}
// Trim allp.
if int32(len(allp)) != nprocs {
lock(&allpLock)
allp = allp[:nprocs]
unlock(&allpLock)
}
_g_ := getg()
// 如果当前的M已经绑定P,继续使用,否则将当前的M绑定一个P
if _g_.m.p != 0 && _g_.m.p.ptr().id < nprocs {
// continue to use the current P
_g_.m.p.ptr().status = _Prunning
_g_.m.p.ptr().mcache.prepareForSweep()
} else {
// 释放当前的p 再去获取allp[0]
if _g_.m.p != 0 {
_g_.m.p.ptr().m = 0
}
_g_.m.p = 0
_g_.m.mcache = nil
p := allp[0]
p.m = 0
p.status = _Pidle
acquirep(p)
if trace.enabled {
traceGoStart()
}
}
var runnablePs *p
for i := nprocs - 1; i >= 0; i-- {
p := allp[i]
if _g_.m.p.ptr() == p {
continue
}
//修改状态
p.status = _Pidle
// 将空闲p放入空闲链表
if runqempty(p) {
pidleput(p)
} else {
p.m.set(mget())
p.link.set(runnablePs)
runnablePs = p
}
}
stealOrder.reset(uint32(nprocs))
var int32p *int32 = &gomaxprocs // make compiler check that gomaxprocs is an int32
atomic.Store((*uint32)(unsafe.Pointer(int32p)), uint32(nprocs))
return runnablePs
}
所有的P在程序启动的时候就设置好了,并用一个名为allp的slice维护,可以通过调用runtime.GOMAXPROCS调整P的个数
P的状态转换如下图所示:
G
// 新建一个goroutine
func newproc(siz int32, fn *funcval) {
argp := add(unsafe.Pointer(&fn), sys.PtrSize)
gp := getg()
pc := getcallerpc()
systemstack(func() {
newproc1(fn, (*uint8)(argp), siz, gp, pc)
})
}
// Create a new g running fn with narg bytes of arguments starting
// at argp. callerpc is the address of the go statement that created
// this. The new g is put on the queue of g's waiting to run.
// 根据函数参数和函数地址,创建一个新的G,然后将这个G加入队列等待运行
func newproc1(fn *funcval, argp *uint8, narg int32, callergp *g, callerpc uintptr) {
_g_ := getg()
if fn == nil {
_g_.m.throwing = -1 // do not dump full stacks
throw("go of nil func value")
}
_g_.m.locks++ // disable preemption because it can be holding p in a local var
siz := narg
siz = (siz + 7) &^ 7
// We could allocate a larger initial stack if necessary.
// Not worth it: this is almost always an error.
// 4*sizeof(uintreg): extra space added below
// sizeof(uintreg): caller's LR (arm) or return address (x86, in gostartcall).
// 如果函数的参数大小比2048大的话,直接panic
if siz >= _StackMin-4*sys.RegSize-sys.RegSize {
throw("newproc: function arguments too large for new goroutine")
}
//获取p
_p_ := _g_.m.p.ptr()
newg := gfget(_p_)
if newg == nil {
newg = malg(_StackMin)
//将g的状态改为_Gdead
casgstatus(newg, _Gidle, _Gdead)
// 添加到全局allg数组,防止gc扫描清除掉
allgadd(newg) // publishes with a g->status of Gdead so GC scanner doesn't look at uninitialized stack.
}
if newg.stack.hi == 0 {
throw("newproc1: newg missing stack")
}
if readgstatus(newg) != _Gdead {
throw("newproc1: new g is not Gdead")
}
totalSize := 4*sys.RegSize + uintptr(siz) + sys.MinFrameSize // extra space in case of reads slightly beyond frame
totalSize += -totalSize & (sys.SpAlign - 1) // align to spAlign
sp := newg.stack.hi - totalSize
spArg := sp
if usesLR {
// caller's LR
*(*uintptr)(unsafe.Pointer(sp)) = 0
prepGoExitFrame(sp)
spArg += sys.MinFrameSize
}
if narg > 0 {
memmove(unsafe.Pointer(spArg), unsafe.Pointer(argp), uintptr(narg))
// This is a stack-to-stack copy. If write barriers
// are enabled and the source stack is grey (the
// destination is always black), then perform a
// barrier copy. We do this *after* the memmove
// because the destination stack may have garbage on
// it.
if writeBarrier.needed && !_g_.m.curg.gcscandone {
f := findfunc(fn.fn)
stkmap := (*stackmap)(funcdata(f, _FUNCDATA_ArgsPointerMaps))
if stkmap.nbit > 0 {
// We're in the prologue, so it's always stack map index 0.
bv := stackmapdata(stkmap, 0)
bulkBarrierBitmap(spArg, spArg, uintptr(bv.n)*sys.PtrSize, 0, bv.bytedata)
}
}
}
memclrNoHeapPointers(unsafe.Pointer(&newg.sched), unsafe.Sizeof(newg.sched))
newg.sched.sp = sp
newg.stktopsp = sp
newg.sched.pc = funcPC(goexit) + sys.PCQuantum // +PCQuantum so that previous instruction is in same function
newg.sched.g = guintptr(unsafe.Pointer(newg))
gostartcallfn(&newg.sched, fn)
newg.gopc = callerpc
newg.ancestors = saveAncestors(callergp)
newg.startpc = fn.fn
if _g_.m.curg != nil {
newg.labels = _g_.m.curg.labels
}
if isSystemGoroutine(newg, false) {
atomic.Xadd(&sched.ngsys, +1)
}
newg.gcscanvalid = false
// 更改当前g的状态为_Grunnable
casgstatus(newg, _Gdead, _Grunnable)
if _p_.goidcache == _p_.goidcacheend {
// Sched.goidgen is the last allocated id,
// this batch must be [sched.goidgen+1, sched.goidgen+GoidCacheBatch].
// At startup sched.goidgen=0, so main goroutine receives goid=1.
_p_.goidcache = atomic.Xadd64(&sched.goidgen, _GoidCacheBatch)
_p_.goidcache -= _GoidCacheBatch - 1
_p_.goidcacheend = _p_.goidcache + _GoidCacheBatch
}
newg.goid = int64(_p_.goidcache)
_p_.goidcache++
if raceenabled {
newg.racectx = racegostart(callerpc)
}
if trace.enabled {
traceGoCreate(newg, newg.startpc)
}
// 将当前新生成的g,放入队列
runqput(_p_, newg, true)
// 如果有空闲的p 且 m没有处于自旋状态 且 main goroutine已经启动,那么唤醒某个m来执行任务
if atomic.Load(&sched.npidle) != 0 && atomic.Load(&sched.nmspinning) == 0 && mainStarted {
wakep()
}
_g_.m.locks--
if _g_.m.locks == 0 && _g_.preempt {
// restore the preemption request in case we've cleared it in newstack
_g_.stackguard0 = stackPreempt
}
}
通过调用runtime.newproc来创建一个goroutine,创建好的这个goroutine会被放到,它所对应的内核线程M所使用的上下文P中的runqueue中,但是在后续的调度中,有些goroutine因为调用了runtime.gosched(调度),会被放到全局队列中
runtime.gosched:调用runtime·gosched函数也可以让当前goroutine放弃cpu,这种情况下会将goroutine设置称runnable,放置到全局队列中
G的状态转换如下图所示:
Goroutine调度策略
队列轮转(work stealing机制)
上图中可见每个P维护着一个包含G的队列runqueue,不考虑G进入系统调用或IO操作的情况下,P周期性的将G调度到M中执行,执行一小段时间,将上下文保存下来,然后将G放到队列尾部,然后从队列中重新取出一个G进行调度。
如果runqueue没有课执行的goroutine,则从全局队列里面尝试取出一个goroutine来执行,如果还是没有的话,则从其他的线程M的P中,偷出一些goroutine来执行(一偷就偷一半,使用的算法叫做work stealing),如果偷失败了,那说明M确实没啥做的,就去休息,等待被唤醒
hand off机制
当本线程因为G进行系统调用阻塞时,线程释放绑定的P,把P转移给其他空闲的线程继续执行剩下的G。而当之前阻塞的G系统调用结束后,当地线程目前没有P,就去找空闲的P,找到的话,就继续执行,如果没有空闲的P,则将该G放入全局队列,然后本地线程进入睡眠。
抢占调度
Go程序启动时,runtime会去启动一个名为sysmon的m(一般称为监控线程),该m无需绑定p即可运行,该m在整个Go程序的运行过程中至关重要。
sysmon每20us~10ms启动一次,按照《Go语言学习笔记》中的总结,sysmon主要完成如下工作:
- 释放闲置超过5分钟的span物理内存;
- 如果超过2分钟没有垃圾回收,强制执行;
- 将长时间未处理的netpoll结果添加到任务队列;
- 向长时间运行的G任务发出抢占调度;
- 收回因syscall长时间阻塞的P;
- 我们看到sysmon将“向长时间运行的G任务发出抢占调度”,这个事情由retake实施
如果一个G任务运行10ms,sysmon就会认为其运行时间太久而发出抢占式调度的请求。一旦G的抢占标志位被设为true,那么待这个G下一次调用函数或方法时,runtime便可以将G抢占,并移出运行状态,放入P的local runq中,等待下一次被调度。
Goroutine与线程的区别
动态栈
每一个OS线程都有一个固定大小的内存块(一般是2M)来做栈。而goroutine会以一个很小的栈开始起声明周期,一般只需要2kb,并且goroutine的栈大小不是固定的,最大值有1GB
调度
OS线程会被操作系统内核调度,从一个线程到另一个线程需要完整的上下文切换,操作很慢。
Go的运行时有自己的调度器,不需要进入内核的上下文,所以比线程的调度的代价要小得多
Gomaxprocs
Go调度器使用Gomaxprocs变量来决定多少个操作系统的线程同时执行。默认是CPU核心数
ID号
在goroutine中,没有被程序员获取到的身份的概念,也就是说,在内部每一个goroutine都有一个ID,但是没有开发接口供人去查询而已