1 Memory optimization

1.1 Merge small objects into a struct and allocate once to reduce the number of memory allocations

Developers with a C/C++ background may know that frequently allocating and freeing small objects on the heap causes memory fragmentation (sometimes called holes), which can make it impossible to find a contiguous region when a large object is later allocated. The usual recommendation is a memory pool. The Go runtime also uses a memory pool underneath: memory is carved into spans, and the runtime maintains a cache on top of them. The cache holds an array of free lists, one per size class; each list links together free memory blocks, and all blocks on the same list are the same size, while different lists hold blocks of different sizes. In other words, the cache caches memory blocks across a range of fixed sizes. An allocation request is served from the list whose block size is closest (rounding up) to the requested size; when the cache runs out, it is refilled from the span allocator.

Suggestion: combine small objects into a single struct and allocate them in one go, as shown below:

for k, v := range m {
    k, v := k, v // copy for capturing by the goroutine
    go func() {
        // using k & v
    }()
}

Replace with:

for k, v := range m {
    x := struct{ k, v string }{k, v} // one allocation copies both values for the goroutine
    go func() {
        // using x.k & x.v
    }()
}


1.2 Allocate enough buffer space in one go and reuse it appropriately

Protocol encoding and decoding requires frequent operations on []byte, for which bytes.Buffer or another byte-buffer object can be used.

Suggestion: pre-allocate enough memory for the bytes.Buffer (or similar) so that it does not have to request memory dynamically while growing, which reduces the number of allocations. At the same time, consider reusing the buffer objects where appropriate.
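A sketch of both ideas together (the function name encode and the framing format are illustrative, not from the text): pre-size the buffer with Grow so the writes cause at most one allocation, and recycle buffers through a sync.Pool:

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool recycles bytes.Buffer objects across encode calls,
// so each call does not allocate a fresh buffer.
var bufPool = sync.Pool{
	New: func() interface{} { return new(bytes.Buffer) },
}

// encode frames a payload as "len:<n>;<payload>". Grow reserves
// enough space up front so the writes below do not reallocate.
func encode(payload []byte) []byte {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()
	buf.Grow(len(payload) + 16)
	fmt.Fprintf(buf, "len:%d;", len(payload))
	buf.Write(payload)
	out := make([]byte, buf.Len())
	copy(out, buf.Bytes()) // copy out before returning the buffer to the pool
	bufPool.Put(buf)
	return out
}

func main() {
	fmt.Println(string(encode([]byte("hello")))) // len:5;hello
}
```

Note that the result must be copied out before Put, since a pooled buffer may be reused (and overwritten) by the next caller.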

1.3 When creating slices and maps with make, specify the capacity from an estimated size

Unlike arrays, slices and maps have no fixed size; they grow dynamically as elements are added.

A slice initially refers to a backing array. When an operation such as append runs out of capacity, the slice grows automatically:

  • If the required size is more than twice the current capacity, the capacity grows directly to the required size;
  • Otherwise, the following is repeated until the capacity meets or exceeds the required size: if the current capacity is less than 1024, it doubles; beyond that, it grows by 1/4 of the current capacity each time.

Map growth is more complicated: each expansion doubles the previous capacity. The map structure keeps both buckets and oldbuckets to support incremental growth:

  • Normally, buckets is used directly and oldbuckets is empty;
  • While an expansion is in progress, oldbuckets is non-empty and buckets is twice the size of oldbuckets.

Recommendation: Estimate the size and specify the capacity during initialization

m := make(map[string]string, 100)
s := make([]string, 0, 100) // note: for slices, make's second argument is the initial length; the third is the capacity
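To make the effect of the capacity hint concrete, the helper below (countReallocs is a made-up name for illustration) counts how many times append has to move the backing array, with and without a hint:

```go
package main

import "fmt"

// countReallocs appends n elements to a slice created with the given
// capacity hint and counts how often the backing array is replaced
// (observed as a change in cap).
func countReallocs(n, hint int) int {
	s := make([]int, 0, hint)
	lastCap := cap(s)
	moves := 0
	for i := 0; i < n; i++ {
		s = append(s, i)
		if cap(s) != lastCap {
			moves++
			lastCap = cap(s)
		}
	}
	return moves
}

func main() {
	fmt.Println("reallocations without hint:", countReallocs(100, 0))
	fmt.Println("reallocations with hint:   ", countReallocs(100, 100)) // 0
}
```

With the hint, all 100 appends fit in the initial allocation; without it, the runtime reallocates and copies several times as the capacity steps up.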

1.4 Avoid requesting too many temporary objects on long call stacks

The default goroutine stack is small: 2KB since Go 1.4 (earlier releases used larger defaults). Go uses a contiguous-stack mechanism, and when stack space runs short, the runtime keeps growing the stack:

  • When stack space runs out, the stack doubles in size; variables on the old stack are copied to the new space, and pointers into the stack are updated to the new addresses;
  • Returning from calls frees stack space, and when the GC finds that less than 1/4 of the stack is in use, it shrinks the stack by half.

For example, a stack that ends up at 2MB has, in the extreme case, gone through about 10 doubling operations, each of which costs performance.

Suggestion:

  • Control the complexity of the call stack and of individual functions; don't put all the logic in one goroutine;
  • If a deep call stack is genuinely needed, consider a goroutine pool, so that stack growth caused by repeatedly creating goroutines is avoided.

1.5 Avoid creating temporary objects frequently

Go's GC stops the world, i.e. the whole program pauses. GC performance has improved greatly since version 1.7, and in 1.8 pauses are around 100µs even in bad cases. Nevertheless, the pause time still depends on the number of temporary objects: the more temporary objects there are, the longer the pause may be, and the more CPU the GC consumes.

Recommendation: the way to optimize for GC is to reduce the number of temporary objects as much as possible:

  • Prefer local variables;
  • Combine many small local variables into one larger struct or array, so the GC scans fewer objects and memory is allocated and reclaimed in fewer, larger chunks.
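One standard-library mechanism for cutting down temporary objects, not named in the text but widely used for exactly this purpose, is sync.Pool. A minimal sketch (the handler name and the 4KB scratch size are illustrative):

```go
package main

import (
	"fmt"
	"sync"
)

// scratchPool recycles 4KB working buffers that would otherwise be
// allocated (and become garbage) on every request.
var scratchPool = sync.Pool{
	New: func() interface{} { return make([]byte, 4096) },
}

// handle is a hypothetical request handler: it borrows a scratch
// buffer instead of allocating a temporary one each time, and
// returns how many bytes it staged there.
func handle(req string) int {
	buf := scratchPool.Get().([]byte)
	defer scratchPool.Put(buf)
	return copy(buf, req) // use buf as temporary working space
}

func main() {
	fmt.Println(handle("ping")) // 4
}
```

Pooled objects may be dropped at any GC, so sync.Pool suits reusable scratch space, not state that must survive.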

2 Concurrency optimization

2.1 Use goroutine pool for highly concurrent task processing

Although goroutines are lightweight, creating a fresh goroutine for every task in a high-concurrency, lightweight-task workload is not particularly efficient:

  • Creating too many goroutines adds scheduling pressure on the go runtime and increases GC cost;
  • Under high concurrency, if calls block abnormally, a large backlog of goroutines can build up in a short time and may crash the program.
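A minimal fixed-size goroutine pool can look like the sketch below (runPool and the summing task are illustrative): n tasks are drained by a handful of long-lived workers instead of n goroutines:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// runPool processes tasks 1..n with a fixed number of long-lived
// worker goroutines draining a shared channel, instead of spawning
// one goroutine per task.
func runPool(n, workers int) int64 {
	tasks := make(chan int, 64)
	var sum int64
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for t := range tasks {
				atomic.AddInt64(&sum, int64(t)) // stand-in for the real lightweight task
			}
		}()
	}
	for i := 1; i <= n; i++ {
		tasks <- i
	}
	close(tasks) // lets the range loops (and thus the workers) exit
	wg.Wait()
	return sum
}

func main() {
	fmt.Println(runPool(100, 4)) // 5050
}
```

The buffered task channel also acts as a bounded queue, so a burst of submissions applies backpressure instead of creating an unbounded goroutine backlog.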

2.2 Avoid high concurrent calls to synchronous system interfaces

Goroutines are implemented by presenting synchronous operations on top of underlying asynchronous mechanisms. The following operations do not block the go runtime's thread scheduling:

  • network IO
  • locks
  • channels
  • time.Sleep
  • syscalls that the underlying system implements asynchronously

The following operations block the thread, forcing the runtime to create a new scheduling thread:

  • local file IO calls
  • syscalls that the underlying system only implements synchronously
  • blocking calls (IO or otherwise) into C dynamic libraries via CGo

Network IO can be built on epoll's asynchronous mechanism (or kqueue and the like), but some system functions have no asynchronous form. For example, in the common POSIX API, file operations are synchronous. Open-source projects exist that simulate asynchronous file operations ("fileepoll"-style), but Go's syscall package still depends on the underlying operating system's API; where the system API is not asynchronous, Go cannot make it asynchronous.

Suggestion: isolate goroutines that make synchronous system calls into a controllable set of goroutines, rather than making such calls directly from the high-concurrency goroutines.
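One simple way to implement this isolation is a counting semaphore built from a buffered channel, sketched below (blockingCall is a stand-in for a real synchronous syscall such as local file IO; the limit of 8 is arbitrary):

```go
package main

import (
	"fmt"
	"sync"
)

// sem caps how many goroutines may sit inside a blocking synchronous
// call at once, so a traffic burst cannot force the runtime to spawn
// an unbounded number of OS threads.
var sem = make(chan struct{}, 8)

// blockingCall stands in for a synchronous operation; callers acquire
// a semaphore slot before blocking and release it afterwards.
func blockingCall(id int) int {
	sem <- struct{}{}        // acquire a slot
	defer func() { <-sem }() // release it
	return id * 2            // imagine a synchronous syscall here
}

func main() {
	var wg sync.WaitGroup
	results := make([]int, 100)
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			results[i] = blockingCall(i)
		}(i)
	}
	wg.Wait()
	fmt.Println(results[99]) // 198
}
```

The high-concurrency goroutines still call the wrapper freely; at most 8 of them are ever inside the blocking region at the same time.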

2.3 Avoid mutex contention on shared objects under high concurrency

In traditional multi-threaded programming, performance often hits an inflection point once roughly 4 to 8 threads contend. Go's advice is not to communicate by sharing memory. Goroutines are very cheap to create, and when a large number of them share the same mutex object, the same kind of inflection point appears beyond a certain goroutine count.

Suggestion: let goroutines run independently and without conflicts; where conflicts are unavoidable, use partitioning to limit how many goroutines contend on any one mutex object.
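The partitioning idea can be sketched as a sharded counter: one mutex per shard, with keys hashed to shards so most goroutines contend on different locks (the type name and shard count are illustrative):

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

const shards = 16

// shardedCounter splits one contended mutex into 16 independent
// shards; goroutines whose keys hash to different shards never
// conflict with each other.
type shardedCounter struct {
	mu [shards]sync.Mutex
	n  [shards]int64
}

func (c *shardedCounter) Inc(key string) {
	h := fnv.New32a()
	h.Write([]byte(key))
	i := h.Sum32() % shards // pick the shard for this key
	c.mu[i].Lock()
	c.n[i]++
	c.mu[i].Unlock()
}

func (c *shardedCounter) Total() int64 {
	var t int64
	for i := 0; i < shards; i++ {
		c.mu[i].Lock()
		t += c.n[i]
		c.mu[i].Unlock()
	}
	return t
}

func main() {
	var c shardedCounter
	var wg sync.WaitGroup
	for g := 0; g < 8; g++ {
		wg.Add(1)
		go func(g int) {
			defer wg.Done()
			for i := 0; i < 1000; i++ {
				c.Inc(fmt.Sprintf("key-%d", g))
			}
		}(g)
	}
	wg.Wait()
	fmt.Println(c.Total()) // 8000
}
```

The same pattern applies to sharded maps and caches; the shard count trades memory for reduced contention.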

3 Other optimizations

3.1 Avoid using CGO or reduce the number of CGO calls

Go can call C library functions, but Go has a garbage collector and dynamically growing stacks, and neither interoperates seamlessly with C. Before control transfers from Go code to C code, a separate stack must be set up for the C call, stack variables copied across, and results copied back when the call completes. This call overhead is large: the runtime must maintain both the Go and C calling contexts and the mapping between the two call stacks. Compared with a plain Go call, a CGo call can be two or even three orders of magnitude more expensive.

Suggestion: avoid using CGO as much as possible; when it is unavoidable, reduce the number of calls that cross the CGO boundary.

3.2 Reduce conversions between []byte and string; prefer []byte for string processing

The string type in Go is immutable. Unlike std::string in C++, which can expose its char* and share the same underlying memory, []byte and string in Go are two different structures at the bottom, and converting between them genuinely copies the value object. So try to reduce these unnecessary conversions.

Suggestion: If there is processing such as string concatenation, try to use []byte, for example:

func Prefix(b []byte) []byte {
    return append([]byte("hello"), b...)
}

3.3 Prefer bytes.Buffer for string concatenation

Since the string type is immutable, every concatenation creates a new string. Common ways to concatenate strings in Go:

  • the + operator: causes multiple object allocations and value copies;
  • fmt.Sprintf: parses its arguments dynamically, so efficiency is poor;
  • strings.Join: internally appends to a []byte;
  • bytes.Buffer: its size can be pre-allocated, reducing object allocation and copying.

Recommendation: where performance matters, prefer bytes.Buffer with a pre-allocated size. On non-critical paths, fmt.Sprintf can be used for brevity, since it also simplifies converting and joining values of different types.
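A sketch of the recommendation (joinCSV is an illustrative helper): compute the total size, reserve it once with Grow, then write the pieces:

```go
package main

import (
	"bytes"
	"fmt"
)

// joinCSV concatenates parts with a pre-sized bytes.Buffer, avoiding
// the intermediate strings that repeated + concatenation would create.
func joinCSV(parts []string) string {
	n := 0
	for _, p := range parts {
		n += len(p) + 1 // each part plus a separator
	}
	var buf bytes.Buffer
	buf.Grow(n) // single up-front allocation
	for i, p := range parts {
		if i > 0 {
			buf.WriteByte(',')
		}
		buf.WriteString(p)
	}
	return buf.String()
}

func main() {
	fmt.Println(joinCSV([]string{"a", "b", "c"})) // a,b,c
}
```

On Go 1.10 and later, strings.Builder does the same job and additionally avoids the copy that buf.String() makes.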

