Original article: A detailed look at preemptive scheduling triggered when a goroutine runs for too long
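This article focuses on two points:
- When preemptive scheduling happens.
- The characteristics of preemption triggered by a goroutine running for too long.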
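The sysmon monitoring thread periodically calls retake, which requests preemption of any goroutine that has been running continuously for more than 10 milliseconds.
retake is defined around line 4376 of runtime/proc.go: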
// forcePreemptNS is the time slice given to a G before it is
// preempted.
const forcePreemptNS = 10 * 1000 * 1000 // 10ms
func retake(now int64) uint32 {
	n := 0
	// Prevent allp slice changes. This lock will be completely
	// uncontended unless we're already stopping the world.
	lock(&allpLock)
	// We can't use a range loop over allp because we may
	// temporarily drop the allpLock. Hence, we need to re-fetch
	// allp each time around the loop.
	for i := 0; i < len(allp); i++ { // iterate over every P
		_p_ := allp[i]
		if _p_ == nil {
			// This can happen if procresize has grown
			// allp but not yet created new Ps.
			continue
		}
		// _p_.sysmontick is where sysmon records the syscall time and run time of the P it monitors
		pd := &_p_.sysmontick
		s := _p_.status
		if s == _Psyscall { // P is in a system call; check whether it should be retaken
			// Retake P from syscall if it's there for more than 1 sysmon tick (at least 20us).
			t := int64(_p_.syscalltick)
			if int64(pd.syscalltick) != t {
				pd.syscalltick = uint32(t)
				pd.syscallwhen = now
				continue
			}
			// On the one hand we don't want to retake Ps if there is no other work to do,
			// but on the other hand we want to retake them eventually
			// because they can prevent the sysmon thread from deep sleep.
			if runqempty(_p_) && atomic.Load(&sched.nmspinning)+atomic.Load(&sched.npidle) > 0 && pd.syscallwhen+10*1000*1000 > now {
				continue
			}
			// Drop allpLock so we can take sched.lock.
			unlock(&allpLock)
			// Need to decrement number of idle locked M's
			// (pretending that one more is running) before the CAS.
			// Otherwise the M from which we retake can exit the syscall,
			// increment nmidle and report deadlock.
			incidlelocked(-1)
			if atomic.Cas(&_p_.status, s, _Pidle) {
				if trace.enabled {
					traceGoSysBlock(_p_)
					traceProcStop(_p_)
				}
				n++
				_p_.syscalltick++
				handoffp(_p_)
			}
			incidlelocked(1)
			lock(&allpLock)
		} else if s == _Prunning { // P is running; check whether its goroutine has run for too long
			// Preempt G if it's running for too long.
			// _p_.schedtick: the scheduler increments this on every scheduling event
			t := int64(_p_.schedtick)
			if int64(pd.schedtick) != t {
				// sysmon has observed a new scheduling event, so reset its own schedtick and schedwhen
				pd.schedtick = uint32(t)
				pd.schedwhen = now
				continue
			}
			// pd.schedtick == t means no scheduling happened between pd.schedwhen and now,
			// so the same goroutine has been running the whole time; check whether that exceeds 10ms
			if pd.schedwhen+forcePreemptNS > now {
				// less than 10ms has passed since sysmon first saw this goroutine running
				continue
			}
			// it has run continuously for more than 10ms; request preemption
			preemptone(_p_)
		}
	}
	unlock(&allpLock)
	return uint32(n)
}
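retake decides whether to request preemption based on two P states:
- _Prunning: the goroutine on this P is running; if it has run for more than 10 milliseconds, it should be preempted.
- _Psyscall: the goroutine on this P is executing a system call in the kernel; whether to retake the P depends on several conditions.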
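The time-slice bookkeeping in the _Prunning branch can be sketched in isolation. This is a minimal model, not runtime code: sysmonTick and shouldPreempt are hypothetical names, and time is passed in as plain nanoseconds.

```go
package main

import "fmt"

// forcePreemptNS mirrors the runtime constant quoted above (10ms in nanoseconds).
const forcePreemptNS = 10 * 1000 * 1000

// sysmonTick is a simplified, hypothetical stand-in for the
// schedtick/schedwhen fields of the runtime's sysmontick.
type sysmonTick struct {
	schedtick uint32 // last schedtick value sysmon observed
	schedwhen int64  // time (ns) at which that value was first observed
}

// shouldPreempt models the _Prunning branch of retake. pSchedtick is the
// P's current schedtick and now is the current time in nanoseconds.
func shouldPreempt(pd *sysmonTick, pSchedtick uint32, now int64) bool {
	if pd.schedtick != pSchedtick {
		// A scheduling event happened since the last check: restart the clock.
		pd.schedtick = pSchedtick
		pd.schedwhen = now
		return false
	}
	// Same goroutine since schedwhen; preempt only once the slice is used up.
	return pd.schedwhen+forcePreemptNS <= now
}

func main() {
	const ms = int64(1000 * 1000)
	pd := &sysmonTick{}
	fmt.Println(shouldPreempt(pd, 1, 0))     // new tick observed, clock starts
	fmt.Println(shouldPreempt(pd, 1, 5*ms))  // 5ms elapsed, still within the slice
	fmt.Println(shouldPreempt(pd, 1, 10*ms)) // 10ms elapsed, slice used up
	fmt.Println(shouldPreempt(pd, 2, 12*ms)) // rescheduled, clock restarts
}
```

Note that the clock starts when sysmon first notices the goroutine, not when the goroutine actually started running, so the real running time before a request is issued can be somewhat longer than 10ms.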
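When sysmon sees that a goroutine has run continuously for more than 10 milliseconds, it calls preemptone to request preemption of that goroutine. preemptone is defined around line 4465 of runtime/proc.go: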
// Tell the goroutine running on processor P to stop.
// This function is purely best-effort. It can incorrectly fail to inform the
// goroutine. It can send inform the wrong goroutine. Even if it informs the
// correct goroutine, that goroutine might ignore the request if it is
// simultaneously executing newstack.
// No lock needs to be held.
// Returns true if preemption request was issued.
// The actual preemption will happen at some point in the future
// and will be indicated by the gp->status no longer being
// Grunning
func preemptone(_p_ *p) bool {
	mp := _p_.m.ptr()
	if mp == nil || mp == getg().m {
		return false
	}
	// gp is the goroutine to be preempted
	gp := mp.curg
	if gp == nil || gp == mp.g0 {
		return false
	}
	gp.preempt = true // set the preemption flag
	// Every call in a go routine checks for stack overflow by
	// comparing the current stack pointer to gp->stackguard0.
	// Setting gp->stackguard0 to StackPreempt folds
	// preemption into the normal stack overflow check.
	// stackPreempt is the constant 0xfffffffffffffade, a very large number
	gp.stackguard0 = stackPreempt // set stackguard0 so the goroutine handles the request itself
	return true
}
preemptone sets preempt to true and stackguard0 to stackPreempt (the constant 0xfffffffffffffade, a very large number) in the g struct of the target goroutine and then returns. It does not force the goroutine to stop running.
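To see where 0xfffffffffffffade comes from, here is a small sketch that derives the constant and models the "mark and return" behaviour. It assumes 64-bit pointers; toyG and requestPreempt are hypothetical names used only for illustration, and the guard value is made up.

```go
package main

import "fmt"

const (
	uintptrMask  = 1<<64 - 1           // assumes 64-bit pointers
	stackPreempt = uintptrMask & -1314 // same expression the runtime uses
)

// toyG is a hypothetical, reduced stand-in for the runtime's g.
type toyG struct {
	preempt     bool
	stackguard0 uint64
}

// requestPreempt models preemptone: it only marks the goroutine and
// returns; nothing here stops the goroutine from running.
func requestPreempt(gp *toyG) bool {
	if gp == nil {
		return false
	}
	gp.preempt = true
	gp.stackguard0 = stackPreempt
	return true
}

func main() {
	gp := &toyG{stackguard0: 0xc000044370} // hypothetical normal guard value
	fmt.Println(requestPreempt(gp))
	fmt.Printf("%#x\n", gp.stackguard0)
}
```

Because the request is just two stores, it only takes effect when the target goroutine itself next looks at stackguard0.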
The preemption flag is handled through the call chain morestack_noctxt() -> morestack() -> newstack().
Take the following program as an example:
package main

import "fmt"

func sum(a, b int) int {
	a2 := a * a
	b2 := b * b
	c := a2 + b2
	fmt.Println(c)
	return c
}

func main() {
	sum(1, 2)
}
Disassembling main with gdb gives:
=> 0x0000000000486a80 <+0>: mov %fs:0xfffffffffffffff8,%rcx
0x0000000000486a89 <+9>: cmp 0x10(%rcx),%rsp
0x0000000000486a8d <+13>: jbe 0x486abd <main.main+61>
0x0000000000486a8f <+15>: sub $0x20,%rsp
0x0000000000486a93 <+19>: mov %rbp,0x18(%rsp)
0x0000000000486a98 <+24>: lea 0x18(%rsp),%rbp
0x0000000000486a9d <+29>: movq $0x1,(%rsp)
0x0000000000486aa5 <+37>: movq $0x2,0x8(%rsp)
0x0000000000486aae <+46>: callq 0x4869c0 <main.sum>
0x0000000000486ab3 <+51>: mov 0x18(%rsp),%rbp
0x0000000000486ab8 <+56>: add $0x20,%rsp
0x0000000000486abc <+60>: retq
0x0000000000486abd <+61>: callq 0x44ece0 <runtime.morestack_noctxt>
0x0000000000486ac2 <+66>: jmp 0x486a80 <main.main>
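The call to morestack_noctxt sits at the end of the function and is reached through the jbe. Look at the first three instructions:
0x0000000000486a80 <+0>: mov %fs:0xfffffffffffffff8,%rcx # first instruction of main, rcx = g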
0x0000000000486a89 <+9>: cmp 0x10(%rcx),%rsp
0x0000000000486a8d <+13>: jbe 0x486abd <main.main+61>
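jbe is a conditional jump: whether it is taken depends on the result of the preceding instruction.
The first instruction of main reads the pointer to the currently running g from TLS (Go implements TLS through the fs register) into rcx. The second instruction uses indirect addressing for its first operand: it reads the value stored at offset 16 from g and compares it with rsp. It does not modify rsp.
First, the definition of the g struct: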
type g struct {
	stack       stack
	stackguard0 uintptr
	stackguard1 uintptr
	......
}

type stack struct {
	lo uintptr // 8 bytes
	hi uintptr // 8 bytes
}
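The stack field of g occupies 16 bytes (lo and hi are 8 bytes each), so offset 16 from the start of g is stackguard0. The second instruction of main therefore compares the stack pointer rsp with stackguard0. Normally, rsp being less than or equal to stackguard0 means the stack of the current g is nearly used up and must grow to avoid overflow. If the main goroutine has been marked for preemption, stackguard0 holds stackPreempt, which is larger than any real stack address, so rsp is far below it and the jbe is taken. Execution jumps to 0x0000000000486abd, where the call to morestack_noctxt pushes the address of the instruction that follows it, 0x0000000000486ac2, onto the stack and then jumps to morestack_noctxt.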
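The offset arithmetic and the prologue comparison can be checked with a small sketch. The struct below only mirrors the first fields of the runtime's g; guardOffset and needsMorestack are hypothetical helpers, all addresses are made up, and a 64-bit platform is assumed.

```go
package main

import (
	"fmt"
	"unsafe"
)

// Mirror of the first fields of the runtime's g; illustration only.
type stack struct {
	lo uintptr
	hi uintptr
}

type g struct {
	stack       stack
	stackguard0 uintptr
	stackguard1 uintptr
}

// guardOffset reports the byte offset of stackguard0 within the mirror struct.
func guardOffset() uintptr {
	return unsafe.Offsetof(g{}.stackguard0)
}

// needsMorestack models `cmp 0x10(%rcx),%rsp; jbe`: an unsigned
// SP <= stackguard0 test.
func needsMorestack(gp *g, sp uintptr) bool {
	return sp <= gp.stackguard0
}

func main() {
	fmt.Println(guardOffset())

	gp := &g{stack: stack{lo: 0xc000044000, hi: 0xc000046000}}
	gp.stackguard0 = gp.stack.lo + 880 // hypothetical guard distance
	sp := uintptr(0xc000045f00)        // hypothetical stack pointer
	fmt.Println(needsMorestack(gp, sp))

	gp.stackguard0 = ^uintptr(0) - 1313 // the stackPreempt bit pattern
	fmt.Println(needsMorestack(gp, sp))
}
```

The same comparison serves both purposes: with a normal guard it fires only when the stack is almost exhausted, and with the stackPreempt pattern it fires on the next function entry regardless of how much stack is left.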
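[Figure: state of rsp, g, and the main goroutine's stack at this point]
morestack_noctxt jumps straight to morestack with JMP rather than CALL, so rsp does not change.
morestack works much like mcall, analyzed earlier: it first saves the scheduling context of the goroutine that called morestack (the main goroutine here) into the sched field of its g struct, then switches to the g0 stack of the current worker thread and runs newstack.
morestack is defined around line 433 of runtime/asm_amd64.s: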
// morestack but not preserving ctxt.
TEXT runtime·morestack_noctxt(SB),NOSPLIT,$0
	MOVL $0, DX
	JMP runtime·morestack(SB)

// Called during function prolog when more stack is needed.
//
// The traceback routines see morestack on a g0 as being
// the top of a stack (for example, morestack calling newstack
// calling the scheduler calling newm calling gc), so we must
// record an argument size. For that purpose, it has no arguments.
TEXT runtime·morestack(SB),NOSPLIT,$0-0
	......
	get_tls(CX)
	MOVQ g(CX), SI // SI = g (the g struct of the main goroutine)
	......
	// SP now points at the return address pushed by the call to morestack_noctxt,
	// so after the next instruction AX = 0x0000000000486ac2
	MOVQ 0(SP), AX
	// The next two instructions set g.sched.pc and g.sched.g. In this example g.sched.pc
	// becomes 0x0000000000486ac2, the address to resume at once morestack_noctxt is done.
	MOVQ AX, (g_sched+gobuf_pc)(SI) // g.sched.pc = 0x0000000000486ac2
	MOVQ SI, (g_sched+gobuf_g)(SI) // g.sched.g = g
	LEAQ 8(SP), AX // rsp of main before it called morestack_noctxt
	// The next three instructions set g.sched.sp, g.sched.bp and g.sched.ctxt
	MOVQ AX, (g_sched+gobuf_sp)(SI)
	MOVQ BP, (g_sched+gobuf_bp)(SI)
	MOVQ DX, (g_sched+gobuf_ctxt)(SI)
	// The instructions above saved g's context; now switch to g0.
	// Switch to the g0 stack and set the g in TLS to g0.
	// Call newstack on m->g0's stack.
	MOVQ m_g0(BX), BX
	MOVQ BX, g(CX) // set the g in TLS to g0
	// Load g0's saved stack pointer into the CPU register to switch stacks. Before this
	// instruction the CPU is still on the stack of the calling g; afterwards it is on g0's stack.
	MOVQ (g_sched+gobuf_sp)(BX), SP
	CALL runtime·newstack(SB)
	CALL runtime·abort(SB) // crash if newstack returns
	RET
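Before the switch to g0, the context of the current goroutine is saved into the sched field of its g struct. The next time main is scheduled, the scheduler restores g.sched.sp into rsp to switch stacks and then restores g.sched.pc into rip, so the CPU resumes at the instruction after the callq: 0x0000000000486ac2 <+66>: jmp 0x486a80 <main.main>.
[Figure: state after the goroutine's context has been saved]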
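The two values morestack records, the resume pc and the caller's sp, can be modelled with a toy memory map. This is only a sketch: gobuf here is a reduced, hypothetical version of the runtime's struct, saveCaller is an illustrative name, and the stack address is made up.

```go
package main

import "fmt"

// gobuf is a reduced, hypothetical model of the runtime's saved context.
type gobuf struct {
	sp uint64
	pc uint64
}

// saveCaller models what morestack records. mem maps stack addresses to the
// 8-byte words stored there; sp is the stack pointer on entry to morestack.
func saveCaller(mem map[uint64]uint64, sp uint64) gobuf {
	return gobuf{
		pc: mem[sp], // MOVQ 0(SP), AX: return address pushed by the call
		sp: sp + 8,  // LEAQ 8(SP), AX: caller's SP before the call
	}
}

func main() {
	sp := uint64(0xc000045ef8)             // hypothetical SP on entry to morestack
	mem := map[uint64]uint64{sp: 0x486ac2} // return address from the disassembly above
	buf := saveCaller(mem, sp)
	fmt.Printf("pc=%#x sp=%#x\n", buf.pc, buf.sp)
}
```

Restoring these two values is enough to make the goroutine continue at the jmp back to the start of main, where the prologue check runs again with a normal stackguard0.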
newstack is defined around line 899 of runtime/stack.go:
// Called from runtime·morestack when more stack is needed.
// Allocate larger stack and relocate to new stack.
// Stack growth is multiplicative, for constant amortized cost.
//
// g->atomicstatus will be Grunning or Gscanrunning upon entry.
// If the GC is trying to stop this g then it will set preemptscan to true.
//
// This must be nowritebarrierrec because it can be called as part of
// stack growth from other nowritebarrierrec functions, but the
// compiler doesn't check this.
//
//go:nowritebarrierrec
func newstack() {
	thisg := getg() // thisg = g0
	......
	// Fetch g0.m.curg, the goroutine that needs more stack or must respond to preemption.
	// In this example gp is the main goroutine.
	gp := thisg.m.curg
	......
	// NOTE: stackguard0 may change underfoot, if another thread
	// is about to try to preempt gp. Read it just once and use that same
	// value now and below.
	// Check whether g.stackguard0 has been set to stackPreempt
	preempt := atomic.Loaduintptr(&gp.stackguard0) == stackPreempt
	// Be conservative about where we preempt.
	// We are interested in preempting user Go code, not runtime code.
	// If we're holding locks, mallocing, or preemption is disabled, don't
	// preempt.
	// This check is very early in newstack so that even the status change
	// from Grunning to Gwaiting and back doesn't happen in this case.
	// That status change by itself can be viewed as a small preemption,
	// because the GC might change Gwaiting to Gscanwaiting, and then
	// this goroutine has to wait for the GC to finish before continuing.
	// If the GC is in some way dependent on this goroutine (for example,
	// it needs a lock held by the goroutine), that small preemption turns
	// into a real deadlock.
	if preempt {
		// Check whether the goroutine can safely be preempted right now
		if thisg.m.locks != 0 || thisg.m.mallocing != 0 || thisg.m.preemptoff != "" || thisg.m.p.ptr().status != _Prunning {
			// Let the goroutine keep running for now.
			// gp->preempt is set, so it will be preempted next time.
			// Restore stackguard0 to its normal value to mark this request as handled
			gp.stackguard0 = gp.stack.lo + _StackGuard
			// Do not preempt: call gogo to keep running this g; no need to call schedule to pick another goroutine
			gogo(&gp.sched) // never return
		}
	}
	// The elided code performs other checks, which is why the same condition is tested twice
	if preempt {
		if gp == thisg.m.g0 {
			throw("runtime: preempt g0")
		}
		if thisg.m.p == 0 && thisg.m.locks == 0 {
			throw("runtime: g is running but p is not")
		}
		......
		// Respond to the preemption request from here on
		// Act like goroutine called runtime.Gosched.
		// Set gp's status back to _Grunning; the elided code changed it to _Gwaiting while handling GC
		casgstatus(gp, _Gwaiting, _Grunning)
		// Call gopreempt_m to switch gp out
		gopreempt_m(gp) // never return
	}
	......
}
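The branching in the excerpt above boils down to a three-way decision, which the following sketch models. mState and decide are hypothetical names, pRunning stands in for the P status check, and the "grow stack" branch represents the elided code that runs when no preemption was requested.

```go
package main

import "fmt"

// mState is a hypothetical, reduced view of the fields newstack inspects.
type mState struct {
	locks      int32
	mallocing  int32
	preemptoff string
	pRunning   bool // stands in for m.p.ptr().status == _Prunning
}

// decide returns a label for the path the quoted newstack code would take.
func decide(preempt bool, m mState) string {
	if !preempt {
		return "grow stack" // ordinary stack growth, no preemption requested
	}
	if m.locks != 0 || m.mallocing != 0 || m.preemptoff != "" || !m.pRunning {
		return "keep running" // request seen, but preempting now is unsafe
	}
	return "yield" // request honoured via gopreempt_m
}

func main() {
	idle := mState{pRunning: true}
	fmt.Println(decide(false, idle))
	fmt.Println(decide(true, mState{locks: 1, pRunning: true}))
	fmt.Println(decide(true, idle))
}
```

In the "keep running" case the preempt flag on g stays set, which is how the runtime comment's "it will be preempted next time" comes about.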
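newstack has two jobs: growing the stack, and responding to preemption requests from sysmon. It first checks whether g.stackguard0 has been set to stackPreempt. If so, sysmon has found that the goroutine ran too long and has requested preemption. After some basic checks, if the current goroutine can be preempted, newstack calls gopreempt_m to carry out the switch. gopreempt_m is defined around line 2644 of runtime/proc.go: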
func gopreempt_m(gp *g) {
	if trace.enabled {
		traceGoPreempt()
	}
	goschedImpl(gp)
}
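gopreempt_m calls goschedImpl to do the actual scheduling switch.
goschedImpl first changes gp's status from _Grunning to _Grunnable, then calls dropg to detach gp from the current worker thread m, puts gp on the global run queue to wait for the scheduler, and finally calls schedule to enter the next round of scheduling.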
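The four steps attributed to goschedImpl can be modelled with a toy scheduler. Everything here (toyG, toyM, toySched, yieldToGlobal) is hypothetical, and the final step is a deliberate simplification: the real schedule looks at more sources than the global queue.

```go
package main

import "fmt"

type status int

const (
	running status = iota
	runnable
)

type toyG struct {
	id     int
	status status
}

type toyM struct{ curg *toyG }

type toySched struct{ globalRunq []*toyG }

// yieldToGlobal models the steps described above for the preempted goroutine
// and then picks the next goroutine from the head of the global queue.
func yieldToGlobal(s *toySched, m *toyM) *toyG {
	gp := m.curg
	gp.status = runnable                    // _Grunning -> _Grunnable
	m.curg = nil                            // dropg: detach gp from m
	s.globalRunq = append(s.globalRunq, gp) // put gp on the global run queue

	// Simplified stand-in for schedule(): take the head of the global queue.
	next := s.globalRunq[0]
	s.globalRunq = s.globalRunq[1:]
	next.status = running
	m.curg = next
	return next
}

func main() {
	g1 := &toyG{id: 1, status: running}
	g2 := &toyG{id: 2, status: runnable}
	s := &toySched{globalRunq: []*toyG{g2}}
	m := &toyM{curg: g1}

	next := yieldToGlobal(s, m)
	fmt.Println(next.id, len(s.globalRunq), s.globalRunq[0].id)
}
```

The preempted goroutine is not blocked; it stays runnable and simply waits its turn behind whatever was already queued.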
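This walkthrough shows that preemption in Go is conditional. sysmon only sets the preemption flag on the target goroutine. The goroutine being preempted checks g.stackguard0 at function entry, decides whether to call morestack_noctxt, and eventually reaches newstack, which completes the preemption. A goroutine that never makes a function call never reaches this check, so a request made this way cannot take effect until it does.
The above is only my personal understanding and may not be entirely accurate; I hope it is helpful.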