Go language goroutine scheduler initialization (part 12)

The following content is reproduced from https://mp.weixin.qq.com/s/W9D4Sl-6jYfcpczzdPfByQ

Original: source code travel notes by Awa Zhang, who loves to write programs, 2019-05-05

This article is the twelfth chapter of the "Go Language Scheduler Source Code Scenario Analysis" series, and it is also the second subsection of Chapter 2.


 

This chapter takes the following simple Hello World program as an example and traces its complete running process from startup to exit, analyzing the initialization of the Go scheduler, the creation and exit of goroutines, the scheduling loop of worker threads, the switching of goroutines, and other important topics.

package main

import "fmt"

func main() {
    fmt.Println("Hello World!")
}

First, we analyze the initialization of the scheduler from the start of the program.

Before analyzing the startup process of the program, let's first take a look at the initial state of the program's stack before executing the first instruction.

Any program written in a compiled language (whether C, C++, Go, or assembly) goes through the following stages in sequence when it is loaded and run by the operating system:

  1. Read the executable program from the disk into the memory;

  2. Create process and main thread;

  3. Allocate stack space for the main thread;

  4. Copy the parameters entered by the user on the command line to the stack of the main thread;

  5. Put the main thread into the run queue of the operating system and wait for it to be scheduled to run.

Before the main thread is scheduled to execute the first instruction for the first time, the function stack of the main thread is shown in the following figure:

(figure: initial state of the main thread's stack)

After understanding the initial state of the program, let's officially start.

Program entry

Use go build to compile hello.go on the Linux command line to get the executable hello, then debug it with gdb. In gdb, we first use the info files command to find the program's entry point address, 0x452270, and then use b *0x452270 to set a breakpoint at that address; gdb tells us that the source corresponding to this entry is line 8 of the runtime/rt0_linux_amd64.s file.

bobo@ubuntu:~/study/go$ go build hello.go 
bobo@ubuntu:~/study/go$ gdb hello
GNU gdb (GDB) 8.0.1
(gdb) info files
Symbols from "/home/bobo/study/go/main".
Local exec file:
`/home/bobo/study/go/main', file type elf64-x86-64.
Entry point: 0x452270
0x0000000000401000 - 0x0000000000486aac is .text
0x0000000000487000 - 0x00000000004d1a73 is .rodata
0x00000000004d1c20 - 0x00000000004d27f0 is .typelink
0x00000000004d27f0 - 0x00000000004d2838 is .itablink
0x00000000004d2838 - 0x00000000004d2838 is .gosymtab
0x00000000004d2840 - 0x00000000005426d9 is .gopclntab
0x0000000000543000 - 0x000000000054fa9c is .noptrdata
0x000000000054faa0 - 0x0000000000556790 is .data
0x00000000005567a0 - 0x0000000000571ef0 is .bss
0x0000000000571f00 - 0x0000000000574658 is .noptrbss
0x0000000000400f9c - 0x0000000000401000 is .note.go.buildid
(gdb) b *0x452270
Breakpoint 1 at 0x452270: file /usr/local/go/src/runtime/rt0_linux_amd64.s, line 8.

Open a code editor and find the runtime/rt0_linux_amd64.s file, a source file written in Go assembly. We discussed its format in the first part of this book. Now look at line 8:

runtime/rt0_linux_amd64.s : 8

TEXT _rt0_amd64_linux(SB),NOSPLIT,$-8
    JMP _rt0_amd64(SB)

The first line above defines the symbol _rt0_amd64_linux; it is not a real CPU instruction. The JMP on the second line is the first instruction of the main thread; it simply jumps (the equivalent of goto in Go or C) to the symbol _rt0_amd64 and continues execution there. _rt0_amd64 is defined in the runtime/asm_amd64.s file:

runtime/asm_amd64.s : 14

TEXT _rt0_amd64(SB),NOSPLIT,$-8
    MOVQ 0(SP), DI // argc
    LEAQ 8(SP), SI // argv
    JMP  runtime·rt0_go(SB)

The first two instructions store argc and the address of the argv array, passed in by the operating system kernel, into the DI and SI registers respectively; the third instruction jumps to rt0_go.

The rt0_go function completes all the initialization work done when a Go program starts, so it is relatively long and complicated. Here we focus only on the parts related to the scheduler, looking at it section by section:

runtime/asm_amd64.s : 87

TEXT runtime·rt0_go(SB),NOSPLIT,$0
    // copy arguments forward on an even stack
    MOVQ DI, AX       // AX = argc
    MOVQ SI, BX       // BX = argv
    SUBQ $(4*8+7), SP // 2args 2auto
    ANDQ $~15, SP     // align the stack pointer to 16 bytes
    MOVQ AX, 16(SP)   // place argc at SP + 16
    MOVQ BX, 24(SP)   // place argv at SP + 24

The fourth instruction above adjusts the stack pointer so that it is aligned to 16 bytes, that is, so that the address held in SP is a multiple of 16. The reason for 16-byte alignment is that the CPU has a set of SSE instructions whose memory operands must be multiples of 16. The last two instructions move argc and argv to their new locations. The rest of this code is commented in detail, so we won't dwell on it here.

Initialize g0

Continuing with the code below, the global variable g0 is initialized next. As mentioned earlier, the main purpose of g0 is to provide a stack for runtime code to execute on, so here we mainly initialize g0's stack-related members. As the code shows, g0's stack is about 64 KB, spanning the address range SP - 64*1024 + 104 through SP.

runtime/asm_amd64.s : 96

// create istack out of the given (operating system) stack.
// _cgo_init may update stackguard.
// the code below carves part of the system thread's stack out as g0's stack,
// then initializes g0's stack bounds and stack guards
MOVQ $runtime·g0(SB), DI        // DI = &g0
LEAQ (-64*1024+104)(SP), BX     // BX = SP - 64*1024 + 104
MOVQ BX, g_stackguard0(DI)      // g0.stackguard0 = SP - 64*1024 + 104
MOVQ BX, g_stackguard1(DI)      // g0.stackguard1 = SP - 64*1024 + 104
MOVQ BX, (g_stack+stack_lo)(DI) // g0.stack.lo = SP - 64*1024 + 104
MOVQ SP, (g_stack+stack_hi)(DI) // g0.stack.hi = SP

The relationship between g0 and the stack after running the above lines of instructions is shown in the following figure:

(figure: the relationship between g0 and the stack)

 

The main thread is bound to m0

After setting up the g0 stack, we skip the CPU model check and the code related to cgo initialization, and continue the analysis directly from line 164.

runtime/asm_amd64.s : 164

  // initialize TLS (thread local storage)
LEAQ runtime·m0+m_tls(SB), DI // DI = &m0.tls; load the address of m0's tls member into DI
CALL runtime·settls(SB)       // call settls to set up thread local storage; its argument is in DI

// store through it, to make sure it works
// verify that settls works; if not, abort the program
get_tls(BX)                   // load the fs segment base into BX — effectively the address of m0.tls[1]; get_tls is generated by the compiler
MOVQ $0x123, g(BX)            // copy the constant 0x123 to offset -8 from the fs base, i.e. m0.tls[0] = 0x123
MOVQ runtime·m0+m_tls(SB), AX // AX = m0.tls[0]
CMPQ AX, $0x123               // check that m0.tls[0] holds the 0x123 stored via TLS, to verify that TLS works
JEQ 2(PC)
CALL runtime·abort(SB)        // thread local storage does not work; exit the program

This code first calls the settls function to initialize the main thread's thread local storage (TLS), with the goal of associating m0 with the main thread. We already explained in a previous section why an m is bound to a worker thread, so we won't repeat it here. After TLS is set up, the next few instructions verify that TLS works, and abort the program if it does not.

Let's take a detailed look at how the settls function implements thread private global variables.

runtime/sys_linux_amd64.s : 606

// set tls base to DI
TEXT runtime·settls(SB),NOSPLIT,$32
// ......
// DI holds the address of m.tls[0]; m's tls member is an array — refer back to the m struct definition if you've forgotten
// the next instruction adds 8 to DI; the +8 is tied to how TLS is implemented in the ELF executable format
// after it executes, DI holds the address of m.tls[1]
ADDQ $8, DI // ELF wants to use -8(FS)

// set the FS segment base via the arch_prctl system call
MOVQ DI, SI              // SI = second argument to arch_prctl
MOVQ $0x1002, DI         // ARCH_SET_FS, the first argument to arch_prctl
MOVQ $SYS_arch_prctl, AX // system call number
SYSCALL
CMPQ AX, $0xfffffffffffff001
JLS 2(PC)
MOVL $0xf1, 0xf1 // crash // the system call failed; crash directly
RET

As the code shows, the address of m0.tls[1] is set as the base address of the fs segment via the arch_prctl system call. The CPU has a segment register named fs that corresponds to it, and each thread has its own set of CPU register values. When a thread is scheduled off the CPU, the operating system saves all of its register values to memory; when the thread is scheduled back onto the CPU, those values are restored from memory. From then on, worker thread code can find m.tls through the fs register. Readers can refer to the TLS verification code above (right after TLS is initialized) to understand this process.

Let's continue analyzing rt0_go.

runtime/asm_amd64.s : 174

ok:
// set the per-goroutine and per-mach "registers"
get_tls(BX)             // load the fs segment base into BX
LEAQ runtime·g0(SB), CX // CX = &g0
MOVQ CX, g(BX)          // save g0's address in thread local storage: m0.tls[0] = &g0
LEAQ runtime·m0(SB), AX // AX = &m0

// link m0 and g0: m0.g0 = &g0, g0.m = &m0
// save m->g0 = g0
MOVQ CX, m_g0(AX) // m0.g0 = &g0
// save m0 to g0->m
MOVQ AX, g_m(CX)  // g0.m = &m0

The above code first stores the address of g0 in the main thread's thread local storage, and then executes

m0.g0 = &g0
g0.m = &m0

to bind m0 and g0 together. Thus, in the main thread, g0 can be obtained via get_tls, and m0 can be found through g0's m member; this establishes the association between m0, g0, and the main thread. Note that the value stored in the main thread's local storage is the address of g0, which means a worker thread's private global variable is really a pointer to a g, not a pointer to an m. For now this pointer points to g0, indicating that the code is running on g0's stack. The relationship between the main thread's stack, m0, and g0 at this point is shown in the following figure:

(figure: the relationship between the main thread's stack, m0, and g0)

 

 

Initialize m0

The following code starts by processing command line parameters; we don't care about that part, so we skip it. After the command line parameters are processed, the osinit function is called to obtain the number of CPU cores and store it in the global variable ncpu. The scheduler needs to know how many CPU cores the current system has when it is initialized.

runtime/asm_amd64.s : 189

// prepare to call the args function; the first four instructions place its arguments on the stack
MOVL 16(SP), AX       // AX = argc
MOVL AX, 0(SP)        // argc at the top of the stack
MOVQ 24(SP), AX       // AX = argv
MOVQ AX, 8(SP)        // argv at SP + 8
CALL runtime·args(SB) // process the arguments and env passed in by the OS; not our concern here

// on Linux, osinit's only job is to get the CPU core count into the global variable ncpu;
// the scheduler needs to know this at initialization
CALL runtime·osinit(SB)    // result: the global variable ncpu = number of CPU cores
CALL runtime·schedinit(SB) // scheduler initialization

Next, continue to see how the scheduler is initialized.

runtime/proc.go : 526

func schedinit() {
// raceinit must be the first call to race detector.
// In particular, it must be done before mallocinit below calls racemapshadow.

    //getg has no definition in the source code; the compiler inserts code like:
    //get_tls(CX)
    //MOVQ g(CX), BX  // BX now holds the address of the current g struct object
    _g_ := getg() // _g_ = &g0

    ......

    //allow at most 10000 operating system threads to be started, i.e. at most 10000 Ms
    sched.maxmcount = 10000

    ......

    mcommoninit(_g_.m) //initialize m0; from the code above we know g0.m = &m0

    ......

    sched.lastpoll = uint64(nanotime())
    procs := ncpu //create and initialize one p struct object per CPU core in the system
    if n, ok := atoi32(gogetenv("GOMAXPROCS")); ok && n > 0 {
        procs = n //if the GOMAXPROCS environment variable is set, create that many Ps instead
    }
    if procresize(procs) != nil { //create and initialize the global variable allp
        throw("unknown runnable goroutine during bootstrap")
    }

    ......
}

As we saw earlier, the address of g0 has been stored in thread local storage. schedinit uses the getg function (implemented by the compiler; you won't find its definition in the source code) to fetch the currently running g from thread local storage — here it's g0 — and then calls mcommoninit to do the necessary initialization of m0 (g0.m). After m0 is initialized, procresize is called to initialize the p struct objects the system needs. In Go's official terminology, p stands for processor, and the number of Ps determines the maximum number of goroutines that can run in parallel. Besides initializing m0 and the Ps, schedinit also sets the maxmcount member of the global variable sched to 10000, limiting the number of operating system threads that can be created to at most 10000.

Here we need to focus on how mcommoninit initializes m0 and how the procresize function creates and initializes the p struct objects. First, let's dive into mcommoninit. Note that this function runs not only during initialization but also whenever a worker thread is created while the program is running, which is why the function contains the lock, the check that the thread count hasn't exceeded the limit, and related code.

runtime/proc.go : 596

func mcommoninit(mp *m) {
    _g_ := getg() //during initialization _g_ = g0

    // g0 stack won't make sense for user (and is not necessary unwindable).
    if _g_ != _g_.m.g0 { //call-stack traceback; not our concern here
        callers(1, mp.createstack[:])
    }

    lock(&sched.lock)
    if sched.mnext+1 < sched.mnext {
        throw("runtime: thread ID overflow")
    }
    mp.id = sched.mnext
    sched.mnext++
    checkmcount() //check whether the number of created system threads exceeds the limit (10000)

    //initialize the random state
    mp.fastrand[0] = 1597334677 * uint32(mp.id)
    mp.fastrand[1] = uint32(cputicks())
    if mp.fastrand[0]|mp.fastrand[1] == 0 {
        mp.fastrand[1] = 1
    }

    //create the gsignal used for signal handling: just allocate a g struct
    //object from the heap, set up its stack, and return
    mpreinit(mp)
    if mp.gsignal != nil {
        mp.gsignal.stackguard1 = mp.gsignal.stack.lo + _StackGuard
    }

    //link m into the global list allm
    // Add to allm so garbage collector doesn't free g->m
    // when it is just in a register or thread-local storage.
    mp.alllink = allm 

    // NumCgoCall() iterates over allm w/o schedlock,
    // so we need to publish it safely.
    atomicstorep(unsafe.Pointer(&allm), unsafe.Pointer(mp))
    unlock(&sched.lock)

    // Allocate memory to hold a cgo traceback if the cgo call crashes.
    if iscgo || GOOS == "solaris" || GOOS == "windows" {
        mp.cgoCallers = new(cgoCallers)
    }
}

As the source shows, this function performs no scheduling-related initialization of m0, so for our purposes you can simply think of it as putting m0 into the global linked list allm and returning.

After m0 completes its basic initialization, procresize is called to create and initialize the p struct objects. This function creates the specified number of p objects (determined by the CPU core count or the environment variable), stores them in the global variable allp, and binds m0 to allp[0], so when it finishes we have:

m0.p = allp[0]
allp[0].m = &m0

At this point, m0, g0, and the p that m requires are all linked together.

Initialize allp

Let's look at the procresize function. After initialization has completed, user code can also invoke it by calling the GOMAXPROCS() function to re-create and re-initialize the p objects. Dynamically adjusting the Ps at run time involves many subtleties, so the function's handling is fairly complicated; considering only initialization, however, it is relatively simple, so only the code executed during initialization is kept here:

runtime/proc.go : 3902

func procresize(nprocs int32) *p {
    old := gomaxprocs //at system initialization gomaxprocs = 0

    ......

    // Grow allp if necessary.
    if nprocs > int32(len(allp)) { //at initialization len(allp) == 0
        // Synchronize with retake, which could be running
        // concurrently since it doesn't run on a P.
        lock(&allpLock)
        if nprocs <= int32(cap(allp)) {
            allp = allp[:nprocs]
        } else { //at initialization this branch is taken, creating the allp slice
            nallp := make([]*p, nprocs)
            // Copy everything up to allp's cap so we
            // never lose old allocated Ps.
            copy(nallp, allp[:cap(allp)])
            allp = nallp
        }
        unlock(&allpLock)
    }

    // initialize new P's
    //loop to create nprocs Ps and do their basic initialization
    for i := int32(0); i < nprocs; i++ {
        pp := allp[i]
        if pp == nil {
            pp = new(p) //allocate a struct p from the heap via the memory allocator
            pp.id = i
            pp.status = _Pgcstop
            ......
            atomicstorep(unsafe.Pointer(&allp[i]), unsafe.Pointer(pp))
        }

        ......
    }

    ......

    _g_ := getg()  // _g_ = g0
    if _g_.m.p != 0 && _g_.m.p.ptr().id < nprocs { //at initialization m0.p is not yet set, so this branch is not taken
        // continue to use the current P
        _g_.m.p.ptr().status = _Prunning
        _g_.m.p.ptr().mcache.prepareForSweep()
    } else { //at initialization this branch is taken
        // release the current P and acquire allp[0]
        if _g_.m.p != 0 { //not executed at initialization
            _g_.m.p.ptr().m = 0
        }
        _g_.m.p = 0
        _g_.m.mcache = nil
        p := allp[0]
        p.m = 0
        p.status = _Pidle
        acquirep(p) //link p and m0 — really just assigning the members of the two structs to each other
        if trace.enabled {
            traceGoStart()
        }
    }

    //the for loop below puts all idle Ps into the idle list
    var runnablePs *p
    for i := nprocs - 1; i >= 0; i-- {
        p := allp[i]
        if _g_.m.p.ptr() == p { //allp[0] is linked to m0, so skip it
            continue
        }
        p.status = _Pidle
        if runqempty(p) { //at initialization every p except allp[0] takes this branch and goes into the idle list
            pidleput(p)
        } else {
            ......
        }
    }

    ......

    return runnablePs
}

This function is fairly long but not complicated. To summarize its main flow:

  1. Use make([]*p, nprocs) to initialize the global variable allp, that is, allp = make([]*p, nprocs)

  2. Create and initialize nprocs p structure objects cyclically and save them in allp slices in turn

  3. Bind m0 and allp[0] together, that is, m0.p = allp[0], allp[0].m = &m0

  4. Put all p except allp[0] into the pidle free queue of the global variable sched

After the procresize function is executed, the initialization work related to the scheduler is basically over. At this time, the relationship between the various components of the entire scheduler is shown in the following figure:

(figure: the relationship between the scheduler's components after initialization)

 

Having analyzed the scheduler's basic initialization, in the next section we will look at how the first goroutine in the program is created.





Origin blog.csdn.net/pyf09/article/details/115238748