Goroutine Scheduling Policy: How Are Goroutines Stolen from Other Worker Threads?

Original article: Goroutine调度策略-如何从其它工作线程盗取Goroutine？

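findrunnable() handles the logic for stealing goroutines, and also takes care of GC, netpoll, and related work. In short, it does its best to find a goroutine in every run queue, and goes to sleep only when it really cannot find one.

Here is the code with unrelated logic removed and simplified, from line 2176 of runtime/proc.go: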

// Finds a runnable goroutine to execute.
// Tries to steal from other P's, get g from global queue, poll network.
func findrunnable() (gp *g, inheritTime bool) {
    _g_ := getg()

    // The conditions here and in handoffp must agree: if
    // findrunnable would return a G to run, handoffp must start
    // an M.

top:
    _p_ := _g_.m.p.ptr()

    ......

    // local runq
    // Check the local run queue once more for a goroutine to run.
    if gp, inheritTime := runqget(_p_); gp != nil {
        return gp, inheritTime
    }

    // global runq
    // Then check the global run queue for a goroutine to run.
    if sched.runqsize != 0 {
        lock(&sched.lock)
        gp := globrunqget(_p_, 0)
        unlock(&sched.lock)
        if gp != nil {
            return gp, false
        }
    }

    ......

    // Steal work from other P's.
    // If every worker thread other than the current one is already asleep,
    // there is nothing to steal.
    procs := uint32(gomaxprocs)
    if atomic.Load(&sched.npidle) == procs-1 {
        // Either GOMAXPROCS=1 or everybody, except for us, is idle already.
        // New work can appear from returning syscall/cgocall, network or timers.
        // Neither of that submits to local run queues, so no point in stealing.
        goto stop
    }
    // If number of spinning M's >= number of busy P's, block.
    // This is necessary to prevent excessive CPU consumption
    // when GOMAXPROCS>>1 but the program parallelism is low.
    // This check mainly prevents burning too much CPU on the search for runnable goroutines.
    // Enough worker threads are already searching, so leave the search to them and go to sleep.
    if !_g_.m.spinning && 2*atomic.Load(&sched.nmspinning) >= procs-atomic.Load(&sched.npidle) {
        goto stop
    }
    if !_g_.m.spinning {
        // Mark m as spinning.
        _g_.m.spinning = true
        // Increment the number of spinning Ms.
        atomic.Xadd(&sched.nmspinning, 1)
    }
    // Steal goroutines from the local run queues of other Ps.
    for i := 0; i < 4; i++ {
        for enum := stealOrder.start(fastrand()); !enum.done(); enum.next() {
            if sched.gcwaiting != 0 {
                goto top
            }
            stealRunNextG := i > 2 // first look for ready queues with more than 1 g
            if gp := runqsteal(_p_, allp[enum.position()], stealRunNextG); gp != nil {
                return gp, false
            }
        }
    }

stop:

    ......

    // Before we drop our P, make a snapshot of the allp slice,
    // which can change underfoot once we no longer block
    // safe-points. We don't need to snapshot the contents because
    // everything up to cap(allp) is immutable.
    allpSnapshot := allp

    // return P and block
    lock(&sched.lock)

    ......

    if sched.runqsize != 0 {
        gp := globrunqget(_p_, 0)
        unlock(&sched.lock)
        return gp, false
    }

    // The current worker thread unbinds from its p and prepares to sleep.
    if releasep() != _p_ {
        throw("findrunnable: wrong p")
    }
    // Put p on the idle list.
    pidleput(_p_)
    unlock(&sched.lock)

    // Delicate dance: thread transitions from spinning to non-spinning state,
    // potentially concurrently with submission of new goroutines. We must
    // drop nmspinning first and then check all per-P queues again (with
    // #StoreLoad memory barrier in between). If we do it the other way around,
    // another thread can submit a goroutine after we've checked all run queues
    // but before we drop nmspinning; as the result nobody will unpark a thread
    // to run the goroutine.
    // If we discover new work below, we need to restore m.spinning as a signal
    // for resetspinning to unpark a new worker thread (because there can be more
    // than one starving goroutine). However, if after discovering new work
    // we also observe no idle Ps, it is OK to just park the current thread:
    // the system is fully loaded so no spinning threads are required.
    // Also see "Worker thread parking/unparking" comment at the top of the file.
    wasSpinning := _g_.m.spinning
    if _g_.m.spinning {
        // m is about to sleep, so it is no longer spinning.
        _g_.m.spinning = false
        if int32(atomic.Xadd(&sched.nmspinning, -1)) < 0 {
            throw("findrunnable: negative nmspinning")
        }
    }

    // check all runqueues once again
    // Before sleeping, look once more for work to do.
    for _, _p_ := range allpSnapshot {
        if !runqempty(_p_) {
            lock(&sched.lock)
            _p_ = pidleget()
            unlock(&sched.lock)
            if _p_ != nil {
                acquirep(_p_)
                if wasSpinning {
                    _g_.m.spinning = true
                    atomic.Xadd(&sched.nmspinning, 1)
                }
                goto top
            }
            break
        }
    }

    ......

    // Sleep.
    stopm()
    goto top
}

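As the code shows, before a worker thread gives up looking for a runnable goroutine, it repeatedly tries to find one in the various run queues. Two points deserve attention:

The spinning state of worker thread M. A worker thread is said to be spinning while it is trying to obtain goroutines from the local run queues of other worker threads. As the code above shows, before M steals from another p's run queue it sets spinning to true and increments the count of spinning Ms. When there is an idle P and a goroutine that needs to run, the number of spinning Ms determines whether a worker thread has to be woken or created.
The stealing algorithm. Stealing uses two nested for loops, and the inner loop implements the stealing logic. In essence, stealing walks every p in allp and checks whether its run queue holds any goroutines. If it does, half of them are moved to the current worker thread's run queue and findrunnable returns; if not, the walk continues with the next p. For fairness, the walk does not always start at allp[0], the first p, but at a p in a random position. The visiting order is randomized as well: visiting the i-th p does not mean that the (i+1)-th p comes next. Instead, a pseudo-random scheme walks every p in allp, so that successive walks do not access the elements of allp in the same order.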
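The two checks that sit in front of the stealing loop can be modeled as a small pure function. This is only a sketch: shouldSteal is a hypothetical helper, and it takes plain values where the runtime reads sched.nmspinning and sched.npidle atomically.

```go
package main

import "fmt"

// shouldSteal models the two gates in front of the steal loop in the
// findrunnable excerpt. It is an illustrative helper, not a runtime function.
func shouldSteal(spinning bool, nmspinning, npidle, procs uint32) bool {
	// Every other P is idle, so no local run queue can hold work.
	if npidle == procs-1 {
		return false
	}
	// Enough Ms are already spinning relative to the number of busy Ps.
	if !spinning && 2*nmspinning >= procs-npidle {
		return false
	}
	return true
}

func main() {
	fmt.Println(shouldSteal(false, 0, 7, 8)) // false: all other Ps are idle
	fmt.Println(shouldSteal(false, 2, 4, 8)) // false: 2*2 >= 8-4
	fmt.Println(shouldSteal(false, 1, 4, 8)) // true: 2*1 < 8-4
	fmt.Println(shouldSteal(true, 3, 4, 8))  // true: an M that is already spinning keeps searching
}
```

The factor of two means that at most about half as many Ms spin as there are busy Ps, which is how the excerpt bounds the CPU spent on searching.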

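Here is the stealing algorithm in pseudocode:

offset := uint32(random()) % nprocs
coprime := a randomly chosen number that is less than nprocs and coprime with nprocs
for i := 0; i < nprocs; i++ {
    p := allp[offset]
    steal goroutines from p's run queue
    if the steal succeeded {
        break
    }
    offset += coprime
    offset = offset % nprocs
}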

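Suppose nprocs is 8, that is, there are 8 ps in total. If the first random choice gives offset=6 and coprime=3 (3 is coprime with 8, as the algorithm requires), the indices of allp are visited in the order 6, 1, 4, 7, 2, 5, 0, 3.

The calculation is:

6, (6+3)%8=1, (1+3)%8=4, (4+3)%8=7, (7+3)%8=2, (2+3)%8=5, (5+3)%8=0, (0+3)%8=3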

If the first random choice gives offset=4 and coprime=5, the indices of allp are visited in the order 4, 1, 6, 3, 0, 5, 2, 7.

The calculation is:

4, (4+5)%8=1, (1+5)%8=6, (6+5)%8=3, (3+5)%8=0, (0+5)%8=5, (5+5)%8=2, (2+5)%8=7

To sum up, different random numbers produce different visiting orders, but it is guaranteed that after 8 iterations every p has been visited.
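The pseudocode can be turned into a small runnable sketch that prints the visiting order for a given offset and coprime. The names visitOrder and gcd are illustrative and do not come from the runtime, and offset and coprime are passed in explicitly instead of being drawn at random.

```go
package main

import "fmt"

// visitOrder returns the indices of allp in the order the pseudocode would
// visit them. It assumes coprime is coprime with nprocs.
func visitOrder(nprocs, offset, coprime uint32) []uint32 {
	order := make([]uint32, 0, nprocs)
	pos := offset % nprocs
	for i := uint32(0); i < nprocs; i++ {
		order = append(order, pos)
		pos = (pos + coprime) % nprocs
	}
	return order
}

// gcd is used only to check the coprime precondition.
func gcd(a, b uint32) uint32 {
	for b != 0 {
		a, b = b, a%b
	}
	return a
}

func main() {
	fmt.Println(gcd(3, 8) == 1)      // true: 3 is coprime with 8
	fmt.Println(visitOrder(8, 6, 3)) // [6 1 4 7 2 5 0 3]
}
```

Because the step is coprime with nprocs, the position cannot return to an already visited index before all nprocs indices have been produced, which is why every p is visited exactly once.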

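Once the p to steal from has been picked, runqsteal is called to steal goroutines from its run queue, and runqsteal in turn calls runqgrab to take a batch of goroutines out of that queue.

Let's look at runqgrab, at line 4854 of runtime/proc.go: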

// Grabs a batch of goroutines from _p_'s runnable queue into batch.
// Batch is a ring buffer starting at batchHead.
// Returns number of grabbed goroutines.
// Can be executed by any P.
func runqgrab(_p_ *p, batch *[256]guintptr, batchHead uint32, stealRunNextG bool) uint32 {
    for {
        h := atomic.LoadAcq(&_p_.runqhead) // load-acquire, synchronize with other consumers
        t := atomic.LoadAcq(&_p_.runqtail) // load-acquire, synchronize with the producer
        n := t - h  // number of goroutines in the queue
        n = n - n/2 // take half of the goroutines in the queue
        if n == 0 {
            ......
            return ......
        }
        // A small detail: the queue can hold at most len(_p_.runq) goroutines,
        // so n should be at most len(_p_.runq)/2. Why is this check needed?
        if n > uint32(len(_p_.runq)/2) { // read inconsistent h and t
            continue
        }
        ......
    }
}

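The calculation of n above is simple: n should be half the number of goroutines in runq (rounded up), so it should never exceed half the capacity of the run queue.

So why check whether n exceeds half the capacity of the run queue?

The key point is that reading runqhead and reading runqtail are two separate operations, not a single atomic one. If other threads rapidly advance the two values after runqhead has been read but before runqtail is read (other thieves advance runqhead when they steal goroutines from the queue, and the queue's owner advances runqtail when it adds goroutines), the runqtail that is read can end up far larger than the runqhead value saved earlier in the local variable h. This is what the code comment means by inconsistent h and t, and the if check detects this case so that the loop can read both values again.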
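A small sketch of the arithmetic makes the inconsistent case concrete. grabCount is a hypothetical helper, and runqCap = 256 is an assumption taken from the [256]guintptr batch type in the excerpt; the numbers in main are made up for illustration.

```go
package main

import "fmt"

// runqCap is the assumed capacity of a P's local run queue.
const runqCap = 256

// grabCount repeats the arithmetic from the runqgrab excerpt: take half of
// t-h (rounded up) and report whether the result is plausible.
func grabCount(h, t uint32) (n uint32, ok bool) {
	n = t - h
	n = n - n/2
	if n > runqCap/2 {
		return n, false // h and t cannot belong to the same queue state
	}
	return n, true
}

func main() {
	// Consistent snapshot: 7 goroutines queued, grab 4 of them.
	fmt.Println(grabCount(100, 107)) // 4 true

	// Stale h: head was read as 100, then thieves and the owner moved the
	// queue forward until tail reached 500 before t was read.
	fmt.Println(grabCount(100, 500)) // 200 false
}
```

In the second call the real queue never held 400 goroutines; the apparent length is purely an effect of pairing an old head with a new tail, so retrying is the only sensible response.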

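If a worker thread still cannot find a goroutine to run after all these attempts, it calls stopm to go to sleep and waits for another worker thread to wake it.

Let's look at stopm, at line 1918 of runtime/proc.go: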

// Stops execution of the current m until new work is available.
// Returns with acquired P.
func stopm() {
    _g_ := getg()

    if _g_.m.locks != 0 {
        throw("stopm holding locks")
    }
    if _g_.m.p != 0 {
        throw("stopm holding p")
    }
    if _g_.m.spinning {
        throw("stopm spinning")
    }

    lock(&sched.lock)
    mput(_g_.m) // put the m struct on the sched.midle idle list
    unlock(&sched.lock)
    notesleep(&_g_.m.park) // go to sleep

    // Woken by another worker thread.
    noteclear(&_g_.m.park)
    acquirep(_g_.m.nextp.ptr())
    _g_.m.nextp = 0
}

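The core of stopm is to call mput to place the m struct on sched's midle idle list, and then to put the thread to sleep with notesleep(&m.park).

note is a one-shot sleep and wakeup mechanism implemented by the Go runtime. A thread goes to sleep by calling notesleep(*note), and another thread wakes it by calling notewakeup(*note). The implementation underneath depends heavily on the operating system, and different systems use different mechanisms: Linux uses the futex system call, while macOS uses pthread_cond_t condition variables. note abstracts and wraps these low-level mechanisms, which helps extensibility: when sleep and wakeup need to support a new platform, only the note layer has to add support for it, and the code above it does not change.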
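As a user-space analogy only, a one-shot note can be sketched with sync.Mutex and sync.Cond. This is not how the runtime implements note; the type and its methods (sleep, wakeup, clear) are hypothetical names chosen to mirror notesleep, notewakeup, and noteclear.

```go
package main

import (
	"fmt"
	"sync"
)

// note is an illustrative one-shot sleep/wakeup primitive.
type note struct {
	mu    sync.Mutex
	cond  *sync.Cond
	woken bool
}

func newNote() *note {
	n := &note{}
	n.cond = sync.NewCond(&n.mu)
	return n
}

// sleep blocks until wakeup has been called.
func (n *note) sleep() {
	n.mu.Lock()
	for !n.woken { // re-check the flag every time Wait returns
		n.cond.Wait()
	}
	n.mu.Unlock()
}

// wakeup marks the note as signaled and wakes the sleeper, if any.
func (n *note) wakeup() {
	n.mu.Lock()
	n.woken = true
	n.mu.Unlock()
	n.cond.Signal()
}

// clear resets the note so that it can be used for another sleep.
func (n *note) clear() {
	n.mu.Lock()
	n.woken = false
	n.mu.Unlock()
}

func main() {
	n := newNote()
	done := make(chan struct{})
	go func() {
		n.sleep()
		close(done)
	}()
	n.wakeup() // works whether or not the goroutine is already sleeping
	<-done
	fmt.Println("woken")
}
```

Because the flag is checked under the same lock that wakeup takes, a wakeup that arrives before the sleeper blocks is not lost; the sleeper simply sees woken == true and returns.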

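After returning from notesleep, the thread must bind a p again and then return to findrunnable to look for a runnable goroutine once more. When it finds one, it returns to schedule, which then schedules that goroutine to run.

Next, look at notesleep, at line 139 of runtime/lock_futex.go: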

func notesleep(n *note) {
    gp := getg()
    if gp != gp.m.g0 {
        throw("notesleep not on g0")
    }
    ns := int64(-1) // a timeout of -1 means wait indefinitely
    if *cgo_yield != nil {
        // Sleep for an arbitrary-but-moderate interval to poll libc interceptors.
        ns = 10e6
    }
    // Loop to make sure the wakeup was not spurious.
    for atomic.Load(key32(&n.key)) == 0 {
        gp.m.blocked = true
        futexsleep(key32(&n.key), 0, ns)
        if *cgo_yield != nil {
            asmcgocall(*cgo_yield, nil)
        }
        gp.m.blocked = false
    }
}

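notesleep calls futexsleep to go to sleep. The loop is there because futexsleep may return from sleep spuriously, so after futexsleep returns the code checks whether note.key is still 0. If it is, the thread was not woken by another worker thread; futexsleep returned spuriously, and it has to be called again to go back to sleep.

futexsleep calls futex to sleep. See line 32 of runtime/os_linux.go: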

// Atomically,
//    if(*addr == val) sleep
// Might be woken up spuriously; that's allowed.
// Don't sleep longer than ns; ns < 0 means forever.
//go:nosplit
func futexsleep(addr *uint32, val uint32, ns int64) {
    var ts timespec

    // Some Linux kernels have a bug where futex of
    // FUTEX_WAIT returns an internal error code
    // as an errno. Libpthread ignores the return value
    // here, and so can we: as it says a few lines up,
    // spurious wakeups are allowed.
    if ns < 0 {
        // Call futex to go to sleep.
        futex(unsafe.Pointer(addr), _FUTEX_WAIT_PRIVATE, val, nil, nil, 0)
        return
    }

    // It's difficult to live within the no-split stack limits here.
    // On ARM and 386, a 64-bit divide invokes a general software routine
    // that needs more stack than we can afford. So we use timediv instead.
    // But on real 64-bit systems, where words are larger but the stack limit
    // is not, even timediv is too heavy, and we really need to use just an
    // ordinary machine instruction.
    if sys.PtrSize == 8 {
        ts.set_sec(ns / 1000000000)
        ts.set_nsec(int32(ns % 1000000000))
    } else {
        ts.tv_nsec = 0
        ts.set_sec(int64(timediv(ns, 1000000000, (*int32)(unsafe.Pointer(&ts.tv_nsec)))))
    }
    futex(unsafe.Pointer(addr), _FUTEX_WAIT_PRIVATE, val, unsafe.Pointer(&ts), nil, 0)
}

futex is implemented in Go assembly. Its main job is to execute the futex system call, entering the operating system kernel to sleep.

See line 525 of runtime/sys_linux_amd64.s:

// int64 futex(int32 *uaddr, int32 op, int32 val,
//    struct timespec *timeout, int32 *uaddr2, int32 val2);
TEXT runtime·futex(SB),NOSPLIT,$0
    // The next six instructions set up the arguments for the futex system call.
    MOVQ    addr+0(FP), DI
    MOVL    op+8(FP), SI
    MOVL    val+12(FP), DX
    MOVQ    ts+16(FP), R10
    MOVQ    addr2+24(FP), R8
    MOVL    val3+32(FP), R9
    MOVL    $SYS_futex, AX    // put the system call number in AX
    SYSCALL                   // execute the futex system call and sleep; once woken, execution continues with the next MOVL
    MOVL    AX, ret+40(FP)    // save the return value of the system call
    RET

futex takes quite a few parameters. Its prototype is:

int64 futex(int32 *uaddr, int32 op, int32 val, struct timespec *timeout, int32 *uaddr2, int32 val2);

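With the wait operation used here, the futex system call goes to sleep if *uaddr == val and returns immediately otherwise.

Why does futex need the third parameter val? Why is *uaddr == val checked in the kernel rather than in user code?

The reason is that checking *uaddr == val and going to sleep must be one atomic operation; otherwise there is a race condition. If the two steps were not atomic, another worker thread could perform a wakeup and change the value of *uaddr in the window after the current worker thread has checked *uaddr == val but before it goes to sleep, and the current worker thread would then sleep forever. User code has no way to make the check and the transition to sleep a single atomic operation, so the kernel has to provide it.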

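Once a thread is asleep it stops working. When a goroutine later needs a thread to run it, how does the sleeping thread find out?

As the code analysis above shows, the argument stopm passes to notesleep is the park member of the m struct, and the m has already been put on the global midle idle list by mput. So as soon as a running worker thread finds that other goroutines need to run, it can find a sleeping m through the global idle m list and wake it by calling notewakeup(&m.park).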
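The register-then-sleep and find-then-wake pattern can be sketched in ordinary Go. Everything here is illustrative: worker stands in for m, a buffered channel stands in for m.park, and idleList with put, get, and wakeOne stands in for sched.midle with mput, mget, and the wakeup path. None of these names are runtime APIs.

```go
package main

import (
	"fmt"
	"sync"
)

// worker is a stand-in for m; park is buffered so a wakeup is never lost.
type worker struct {
	id   int
	park chan struct{}
	next *worker
}

// idleList is a stand-in for the global idle m list.
type idleList struct {
	mu   sync.Mutex
	head *worker
}

// put registers w as idle (the role mput plays in stopm).
func (l *idleList) put(w *worker) {
	l.mu.Lock()
	w.next = l.head
	l.head = w
	l.mu.Unlock()
}

// get removes and returns one idle worker, or nil if there is none.
func (l *idleList) get() *worker {
	l.mu.Lock()
	defer l.mu.Unlock()
	w := l.head
	if w != nil {
		l.head = w.next
		w.next = nil
	}
	return w
}

// wakeOne wakes one idle worker and reports whether it found one.
func (l *idleList) wakeOne() bool {
	w := l.get()
	if w == nil {
		return false
	}
	w.park <- struct{}{}
	return true
}

func main() {
	var idle idleList
	var wg sync.WaitGroup

	w := &worker{id: 1, park: make(chan struct{}, 1)}
	idle.put(w) // register on the idle list before going to sleep
	wg.Add(1)
	go func() {
		defer wg.Done()
		<-w.park // sleep until someone wakes this worker
		fmt.Println("worker", w.id, "woken")
	}()

	ok := idle.wakeOne() // a busy worker found new work and wakes a sleeper
	wg.Wait()
	fmt.Println(ok, idle.wakeOne()) // true false
}
```

The important ordering is the same as in stopm: the worker is placed on the idle list before it sleeps, so anyone who later finds work can always locate it.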

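That completes the discussion of the scheduler's scheduling policy. The next article will cover when scheduling happens.

The above is only my personal view and may not be entirely accurate; I hope it is helpful.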
