Go diary: morestack and goroutine pool

Reposted from http://www.zenlife.tk/goroutine-pool.md

The initial stack of a Go goroutine is only 2KB. If the call chain grows deep enough at runtime to exceed that size, the stack grows automatically, and a function called runtime.morestack is invoked. Starting a goroutine is itself very cheap, but calling morestack to grow the stack is comparatively expensive. Think about it: when the stack is moved, what happens to references into the old stack? Inside morestack, the objects on the stack have to be adjusted and pointers relocated to the new stack. The larger the stack, the more objects need adjusting, and the higher the cost of the morestack call.

We can write a simple benchmark. This function consumes stack space through recursion:

func f(n int) {
  // each call frame keeps a 100-byte array alive on the stack
  var useStack [100]byte
  if n == 0 {
    return
  }
  _ = useStack[3]
  f(n - 1)
}
Here is the comparison benchmark:

func bench1() {
  var wg sync.WaitGroup

  for i := 0; i < benchCount; i++ {
    wg.Add(1)
    go func() {
      for j := 0; j < 15; j++ {
        f(2)
      }
      wg.Done()
    }()
  }
  wg.Wait()
}

func bench2() {
  var wg sync.WaitGroup

  for i := 0; i < benchCount; i++ {
    wg.Add(1)
    go func() {
      f(30)
      wg.Done()
    }()
  }
  wg.Wait()
}
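
For reference, a minimal driver to produce timings in the format shown below might look like this, assuming f, bench1, and bench2 live in the same file. The value of benchCount is my assumption; the post does not state it:

package main

import (
  "fmt"
  "sync" // used by bench1/bench2 above
  "time"
)

const benchCount = 100000 // assumed value, not from the original post

func main() {
  start := time.Now()
  bench1()
  fmt.Printf("bench1 used: %d ns\n", time.Since(start).Nanoseconds())

  start = time.Now()
  bench2()
  fmt.Printf("bench2 used: %d ns\n", time.Since(start).Nanoseconds())
}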

The two do the same amount of work, but bench1 never triggers runtime.morestack, while bench2 does. The results differ by an order of magnitude:

bench1 used: 52480486 ns
bench2 used: 559074503 ns

We found this problem in our own project: morestack accounted for almost 10% of CPU time. Our neighbors over at CockroachDB ran into it too. So how do we solve it?

There are two directions. The first is to allocate a larger stack up front. For example, right after starting a goroutine, call the following function to grow the stack to 8KB:

// reserveStack reserves 8KB of stack up front to avoid runtime.morestack later.
func reserveStack(dummy bool) {
  var buf [8 << 10]byte
  // prevent the compiler from optimizing buf away
  if dummy {
    for i := range buf {
      buf[i] = byte(i)
    }
  }
}
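
It would be called at the top of a freshly started goroutine, something like this (doWork is a placeholder of mine):

go func() {
  reserveStack(false) // pay the stack-growth cost once, up front
  doWork()            // placeholder for the goroutine's real body
}()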

Reserving stack space up front grows the stack to full size in one shot, and paying that cost at goroutine startup is cheaper than growing the stack later, once the program is deep in its call chain. Cockroach takes this approach. In my measurements, though, the morestack overhead was not eliminated, only moved.

So what I want to talk about is another direction, goroutine pool.

A goroutine is such a lightweight thing that pooling it normally makes little sense; just start one whenever you need it. But when morestack is triggered, the overhead is high enough to show up on a flame graph (go pprof is less sensitive to it). With a pool, a goroutine whose stack has already grown is returned to the pool, and the next time it is taken out it comes with the grown stack, so morestack is avoided.

Next, let's talk about how to write this goroutine pool.

The interface I want looks like this:

pool = New()  // create the pool
pool.Go(func() {
    // do something
})

Calling it should have exactly the same effect as

go func() {
}()

except that after pool.Go executes the closure, the goroutine does not exit; it goes back into the pool to await the next call.

To that end, we abstract the goroutine into a resource:

func (pool *Pool) Go(f func()) {
  res := pool.get()
  res.run(f)
  // we cannot put res back here yet; the reason is explained below
}

This resource is special. It consists of a channel and a background goroutine:

type res struct {
  ch   chan func()
  pool *Pool
}

// each res is paired with a background goroutine that loops on the channel:
go func(r *res) {
  for work := range r.ch {
    work()
    r.pool.put(r)
  }
}(r)

run only needs to send the work into the channel; the background goroutine picks it up and executes it:

func (r *res) run(f func()) {
  r.ch <- f
}

There is a detail here: returning the res to the pool must be left until after the work has finished executing, because res.run(f) is non-blocking. That is why put is called by the background goroutine rather than by Go itself.
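
Putting the pieces together, creating a res along with its background goroutine could be wrapped in a constructor like this (newRes is my name for it, not the author's):

func newRes(pool *Pool) *res {
  r := &res{ch: make(chan func()), pool: pool}
  go func() {
    for work := range r.ch {
      work()
      r.pool.put(r) // return to the pool only after the work has run
    }
  }()
  return r
}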

Implementing the pool itself is fairly easy: string the res objects together on a linked list, pushing at the tail and popping at the head. Just pay attention to thread safety. To optimize, you can lock the head and tail nodes separately, or even use a lock-free queue.
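
As a minimal sketch of that idea (the node layout and the single mutex are my simplifications; the post suggests finer-grained locking as an optimization):

type node struct {
  r    *res
  next *node
}

// Pool strings res objects on a singly linked list:
// put appends at the tail, get pops from the head.
type Pool struct {
  mu   sync.Mutex
  head *node
  tail *node
}

func (pool *Pool) get() *res {
  pool.mu.Lock()
  defer pool.mu.Unlock()
  if pool.head == nil {
    return newRes(pool) // pool empty: create a fresh res
  }
  n := pool.head
  pool.head = n.next
  if pool.head == nil {
    pool.tail = nil
  }
  return n.r
}

func (pool *Pool) put(r *res) {
  n := &node{r: r}
  pool.mu.Lock()
  defer pool.mu.Unlock()
  if pool.tail == nil {
    pool.head, pool.tail = n, n
  } else {
    pool.tail.next = n
    pool.tail = n
  }
}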

Notice that I did not design a Close interface for the pool. Why? This is intentional. So when are the goroutines in the pool released? They are designed to be recycled automatically after going unused for a while.

Experience says that any code reusing goroutines has a high probability of leaking, especially around designing and implementing Close. Does a handed-out resource still belong to the pool or not? When the pool is closed, do you wait for outstanding resources to come back, or not? What if a resource is being returned at the very moment the pool closes? And at that moment, how do you handle the locks guarding the pool? Close is a design problem, so I avoid it altogether.

Instead, each res is stamped with its last-used time on every use, and one extra recycling goroutine periodically scans the pool: anything unused for too long gets recycled. If the recycler finds the pool empty, it exits too, so everything is cleaned up and nothing leaks at all. Before exiting it sets a flag; if the pool is used again, the recycling goroutine is re-created and resumes working.
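
A rough sketch of that recycler, assuming a lastUsed time.Time field on res (stamped inside put), an idleTimeout, and a recycling flag on Pool, none of which appear in the original post:

func (pool *Pool) startRecycler(idleTimeout time.Duration) {
  go func() {
    for {
      time.Sleep(idleTimeout)
      pool.mu.Lock()
      // with tail-in/head-out order, the head was returned longest ago
      for pool.head != nil && time.Since(pool.head.r.lastUsed) > idleTimeout {
        close(pool.head.r.ch) // its background goroutine exits when the channel closes
        pool.head = pool.head.next
      }
      if pool.head == nil {
        pool.tail = nil
        pool.recycling = false // mark: the next use of the pool re-creates the recycler
        pool.mu.Unlock()
        return
      }
      pool.mu.Unlock()
    }
  }()
}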
