Go must-know series: Performance Tuning and Benchmarking

Author: Zen and the Art of Computer Programming

1. Introduction

1.1 Overview of performance tuning

For web applications and enterprise systems, performance is particularly important: how to improve a system's processing capacity and response speed is a central engineering question. This article introduces the essentials of performance tuning.

1.2 Performance evaluation method

First, we need to define what exactly we mean by "performance". The question can be confusing because different people mean different things by it: many think of performance as the number of requests per second a website can sustain under a load of, say, one million page views, while others think of it as the latency of a single request. Whichever perspective you take, performance is inseparable from the service-quality requirements placed on the system, so the first step is to settle on a clear evaluation method.

Generally speaking, performance evaluation usually includes two aspects:

  • Throughput: the number of requests the server can handle per second (QPS);
  • Response time: the latency of a request, that is, the time interval from the client sending the request to its receiving the response.

In addition, system stability can be measured through stress testing: simulating the load of a high-concurrency scenario and using tools to check whether the system continues to provide acceptable service quality.

Performance evaluation is therefore a comprehensive process that must weigh throughput (concurrency), response time, and stability together. Beyond the usual performance analysis tools, targeted performance test plans should also be designed around the business scenario.

2. Core concepts

2.1 CPU and memory

CPU

The CPU (Central Processing Unit) is the component that executes a computer's instructions. Its lineage traces back to the stored-program designs of the 1940s, and the first single-chip microprocessor, the Intel 4004, appeared in 1971. Modern CPUs are designed for high-speed, multi-core parallel computation and are used in virtually every kind of electronic device.

The CPU can execute many kinds of operation instructions, including addition, subtraction, multiplication, division, logical operations, shifts, and control-flow instructions. Early processors had no unified instruction-set standard, since each was designed with different goals; today, mature instruction sets such as x86/x64 give software a consistent target and let the CPU perform operations efficiently.

CPU performance indicators are:

  • Clock frequency: the number of clock pulses generated per unit time. For example, a CPU with a 1 GHz clock runs 1 billion clock cycles per second.
  • Instructions per second: the number of instructions a core can execute in one second, equal to clock frequency × IPC (instructions per cycle), or equivalently clock frequency / CPI (cycles per instruction). For example, a 1 GHz core with an IPC of 2 executes about 2 billion instructions per second; generally, the higher the IPC, the more work the CPU does per clock.
  • Bus bandwidth: the data transfer rate between the CPU and the rest of the system. Under the mainstream x86/x64 architectures the data bus is 32 or 64 bits wide, so the achievable bandwidth depends on the system architecture and hardware configuration.

Memory

Memory is the storage used to hold data while programs run. There are two main kinds:

  • Static random-access memory (SRAM): "static" means each cell holds its data for as long as power is supplied, with no refresh required; "random access" means any location can be read or written directly. SRAM is fast but expensive, so it is used mainly for CPU caches.
  • Dynamic random-access memory (DRAM): each bit is stored as charge in a tiny capacitor that leaks, so the contents must be refreshed periodically, much like a display redrawing its frame. DRAM is denser and cheaper, which makes it the choice for main memory.

The size of memory determines how much data the system can hold, but the memory hierarchy also shapes performance: because SRAM is fast but small, it sits between the CPU registers and main memory as a cache that accelerates computation, while the larger, slower DRAM holds the bulk of a program's working data.

Memory performance indicators include:

  • Capacity: the total amount of storage, for example 8 GB of memory.
  • Read/write rate: how many bytes of data can be transferred per unit time. For example, DDR3-1600 has a peak transfer rate of about 12.8 GB/s.

2.2 Goroutine and threads

Goroutine

Goroutines are the lightweight threads Go provides for concurrent programming. A goroutine is similar to an OS thread but much smaller: its stack starts at only a few kilobytes, so a Go program can cheaply run many thousands of goroutines where the same number of OS threads would be prohibitively expensive.

Each goroutine has its own stack space and consumes CPU only while it is running. Because goroutines in the same process share memory, communication between them is easy. Note, however, that goroutines should not be abused: too many runnable goroutines cause excessive context switching and degrade performance.

Thread

Threads are the most basic mechanism the operating system provides for concurrent programming. A thread is an execution path within a process; a process can create multiple threads, which share the process's heap and code segments.

Threads provide an abstraction layer that allows multiple threads to be viewed as independent sequences of execution. Each thread has a private register set and stack, but global variables and static variables can be shared between threads.

Compared with goroutines, OS threads are heavier: each carries a larger fixed-size stack, and switching between them goes through the kernel. And because threads share memory, multi-threaded programs must handle locking, synchronization, and the hazards that come with them.

2.3 Asynchronous IO and event-driven model

Asynchronous IO model

The asynchronous I/O (Asynchronous I/O) model is the key technique for overlapping computation with I/O. It lets an application issue non-blocking I/O operations and continue with other work instead of waiting for each operation to complete, which improves throughput and shortens response time.

In the asynchronous I/O model, the application initiates an I/O request through a system call, immediately goes on to other work, and is notified when the I/O completes. Asynchronous I/O therefore relies on mechanisms such as callback functions, message queues, or completion channels.
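
As a concrete illustration, here is a minimal Go sketch of the pattern: the read runs in a goroutine, the caller keeps working, and a channel delivers the completion notification. The file path is just an illustrative placeholder.

package main

import (
    "fmt"
    "os"
)

// readResult carries the outcome of one asynchronous read.
type readResult struct {
    data []byte
    err  error
}

// readFileAsync starts the read in a goroutine and returns a channel
// that receives the result once the I/O completes.
func readFileAsync(path string) <-chan readResult {
    ch := make(chan readResult, 1) // buffered: the goroutine never blocks
    go func() {
        data, err := os.ReadFile(path)
        ch <- readResult{data, err}
    }()
    return ch
}

func main() {
    ch := readFileAsync("/etc/hostname") // placeholder path
    fmt.Println("doing other work while the read is in flight...")
    res := <-ch // the completion notification
    fmt.Println(len(res.data), res.err)
}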

Event-driven model

The event-driven programming model uses an event loop and message queues to decouple tasks: the application registers interest in certain events, and the main loop dispatches them as they occur. The main advantages of this model are simplicity, ease of maintenance, and, when combined with worker goroutines, good use of multi-core CPUs.
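
A minimal sketch of an event loop in Go, using select over channels as the dispatch mechanism; the two event sources here are invented for illustration.

package main

import (
    "fmt"
    "time"
)

func main() {
    // Two event sources feeding one main loop.
    ticks := time.Tick(200 * time.Millisecond)
    msgs := make(chan string)

    go func() {
        msgs <- "user clicked"
        msgs <- "user scrolled"
        close(msgs)
    }()

    // The main loop blocks until any registered event fires.
    for {
        select {
        case <-ticks:
            fmt.Println("timer event")
        case m, ok := <-msgs:
            if !ok {
                fmt.Println("message source closed, exiting loop")
                return
            }
            fmt.Println("message event:", m)
        }
    }
}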

2.4 GC

GC (Garbage Collection) is the Go runtime's automatic memory management. It frees memory that is no longer referenced while the program runs, preventing memory leaks.

Garbage collectors generally draw on three families of algorithms: mark-sweep, copying, and mark-compact. Mark-sweep reclaims unreachable objects in place; copying evacuates the live (reachable) objects into a fresh region; and mark-compact slides live objects toward one end of memory to eliminate fragmentation. Go's own collector is a concurrent tri-color mark-and-sweep.

There are two ways a GC cycle can be triggered, both illustrated in the sketch below:

  • Manual: the programmer calls runtime.GC() explicitly.
  • Adaptive: the runtime starts a cycle on its own, by default when the heap has grown by the percentage set by GOGC, and paces collection to keep pause times low.
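
A minimal sketch of both triggers: runtime.GC() forces a cycle, and debug.SetGCPercent adjusts the heap-growth threshold (the same knob as the GOGC environment variable).

package main

import (
    "fmt"
    "runtime"
    "runtime/debug"
)

func main() {
    // Adaptive trigger: with GOGC=100 (the default), a collection
    // starts roughly when the heap doubles since the last cycle.
    debug.SetGCPercent(100)

    // Allocate some garbage.
    for i := 0; i < 1000; i++ {
        _ = make([]byte, 1<<16)
    }

    var before, after runtime.MemStats
    runtime.ReadMemStats(&before)

    runtime.GC() // manual trigger

    runtime.ReadMemStats(&after)
    fmt.Printf("heap before: %d KB, after: %d KB, completed cycles: %d\n",
        before.HeapAlloc/1024, after.HeapAlloc/1024, after.NumGC)
}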

2.5 Channels

Channels are the mechanism Go provides for communication between goroutines. A channel is similar to a pipe, but type-safe and more capable: each channel carries values of a single declared element type, from plain values to structs and the results of function calls.

Channels have the following properties:

  • Message sending: a value sent into a channel is held in the channel's buffer until a receiver takes it out.
  • Synchronization: on an unbuffered channel a send blocks until a receiver is ready, so sender and receiver rendezvous; this is what makes channels useful as a synchronization primitive.
  • Buffering: the buffer size is the channel's capacity. When the buffer is full, further sends block (and receives block when it is empty). Both behaviors appear in the sketch below.
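
A minimal sketch of both behaviors:

package main

import "fmt"

func main() {
    // Unbuffered: the send blocks until a receiver is ready, so the
    // channel doubles as a synchronization point between goroutines.
    unbuf := make(chan string)
    go func() { unbuf <- "hello" }()
    fmt.Println(<-unbuf)

    // Buffered with capacity 2: sends succeed immediately until the
    // buffer is full, after which they block.
    buf := make(chan int, 2)
    buf <- 1
    buf <- 2
    fmt.Println(<-buf, <-buf)
}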

2.6 Mutex and Semaphore

A mutex (mutual exclusion lock) is a locking mechanism that protects critical resources. At any moment, at most one thread or goroutine can hold the lock, which prevents concurrent access to the shared resource.

A semaphore is a mechanism for limiting the number of simultaneous accesses to a shared resource. It manages an internal counter: each call to acquire decrements the counter, and once it reaches zero further acquires block until another holder calls release to increment it again. Go sketches of both follow.
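
A minimal sketch of both primitives. Note that Go's standard library has no acquire/release semaphore type (one exists in golang.org/x/sync/semaphore); a buffered channel is the common stand-in used here.

package main

import (
    "fmt"
    "sync"
)

func main() {
    var mu sync.Mutex
    counter := 0

    // Semaphore stand-in: capacity 3 allows at most 3 concurrent holders.
    sem := make(chan struct{}, 3)

    var wg sync.WaitGroup
    for i := 0; i < 10; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()

            sem <- struct{}{}        // acquire: blocks when all 3 slots are taken
            defer func() { <-sem }() // release

            mu.Lock() // only one goroutine may touch counter at a time
            counter++
            mu.Unlock()
        }()
    }
    wg.Wait()
    fmt.Println("counter =", counter) // always 10
}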

2.7 TCP and UDP

TCP (Transmission Control Protocol) is a transport-layer protocol. It is connection-oriented: before formal communication begins, the client and server must first establish a connection.

UDP (User Datagram Protocol) is also a transport-layer protocol, but connectionless: client and server exchange datagrams without establishing a connection first, and delivery is not guaranteed. The sketch below shows the difference in Go's net package.
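
A minimal sketch using Go's net package; the addresses are illustrative only.

package main

import (
    "fmt"
    "net"
    "time"
)

func main() {
    // TCP: Dial completes the connection handshake before data moves.
    tcpConn, err := net.DialTimeout("tcp", "example.com:80", 3*time.Second)
    if err == nil {
        fmt.Println("TCP connected:", tcpConn.RemoteAddr())
        tcpConn.Close()
    }

    // UDP: no handshake; Dial merely records the destination, and
    // Write sends a datagram with no delivery guarantee.
    udpConn, err := net.Dial("udp", "8.8.8.8:53")
    if err == nil {
        udpConn.Write([]byte("ping")) // fire and forget
        udpConn.Close()
    }
}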

3. Concurrency primitives and patterns

3.1 WaitGroup

WaitGroup is the mechanism Go provides for waiting until a group of goroutines has finished executing. It holds a counter indicating how many goroutines are still outstanding.

Typically, Add() increments the counter once per goroutine launched. Whenever a goroutine completes its task, it calls Done() to decrement the counter by one; Wait() blocks until the counter reaches zero, meaning all goroutines have finished.

Typical usage is as follows:

package main

import (
    "fmt"
    "sync"
    "time"
)

func worker(id int, wg *sync.WaitGroup) {
    defer wg.Done() // decrement the counter when this worker returns

    // do some work
    time.Sleep(time.Second)

    fmt.Println("worker", id, "done")
}

func main() {
    var wg sync.WaitGroup
    for i := 0; i < 10; i++ {
        wg.Add(1) // increment the waitgroup counter before starting the goroutine

        go worker(i, &wg)
    }

    wg.Wait() // block until all workers are done

    fmt.Println("all workers are done")
}

3.2 Coroutine pool

A coroutine (goroutine) pool is a technique for controlling the number of goroutines running at once. By capping the maximum concurrency, it prevents excessive resource consumption.

Specifically, a goroutine pool works as follows:

  • Initialize the pool with a maximum concurrency and a task queue.
  • When a new task is submitted, place it on the queue; if the queue is full, the submitter waits until a slot frees up.
  • A fixed set of worker goroutines repeatedly take tasks from the queue, execute them, and return for more.

A typical implementation is as follows:

package main

import (
    "fmt"
    "sync"
)

const maxConcurrentWorkers = 4 // upper bound on concurrency

// Job carries one task and a channel that receives its result.
type Job struct {
    ID     int
    Input  string
    Result chan int
}

var jobs = make(chan *Job, maxConcurrentWorkers)

// submitJob enqueues a job; it blocks when the queue is full,
// so callers wait for capacity instead of dropping work.
func submitJob(j *Job) {
    jobs <- j
}

// startPool launches a fixed number of workers that consume
// jobs from the queue until it is closed.
func startPool(wg *sync.WaitGroup) {
    for i := 0; i < maxConcurrentWorkers; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for j := range jobs {
                j.Result <- runTask(j.Input) // send back the task result
                close(j.Result)              // done sending for this job
            }
        }()
    }
}

func runTask(input string) int { return len(input) } // placeholder work

func main() {
    var wg sync.WaitGroup
    startPool(&wg)
    pending := make([]*Job, 0, 10)
    for i := 0; i < 10; i++ {
        j := &Job{ID: i, Input: "task", Result: make(chan int, 1)}
        submitJob(j)
        pending = append(pending, j)
    }
    close(jobs) // no more jobs; workers exit after draining the queue
    for _, j := range pending {
        fmt.Println("job", j.ID, "->", <-j.Result)
    }
    wg.Wait()
}

4. Performance tuning tools and processes

4.1 pprof

pprof is the profiling tool the Go toolchain provides for performance debugging. Through an HTTP service it exposes runtime profiles of the current program: CPU, heap memory, goroutines, OS threads, GC, blocking, and more.

How to use it:

  1. Import the "net/http/pprof" package (with a blank identifier, for its handler-registering side effects).
  2. Start an HTTP server, for example http.ListenAndServe(":6060", nil), to bind a listening address and port.
  3. Open http://localhost:6060/debug/pprof in a browser to see the available profiles; a minimal setup is sketched below.
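
A minimal setup sketch; importing the package for its side effects registers the /debug/pprof handlers on the default mux.

package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers /debug/pprof/* handlers
)

func main() {
    // Serve the profiling endpoints on a side port; the real
    // application would do its work alongside this.
    log.Println(http.ListenAndServe(":6060", nil))
}

From the command line, go tool pprof http://localhost:6060/debug/pprof/profile then collects a 30-second CPU profile from the running process.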

4.2 Benchmark testing

Benchmarks are tests that measure the execution speed of specific operations or functions. They show how efficiently the code runs and where the bottlenecks are.

Go's testing package includes benchmarking support. The framework calls a benchmark function repeatedly, raising the iteration count b.N until the timing is statistically stable, and then reports the average time per operation.

How to use it:

  1. Create a test file whose name ends in _test.go in the source directory.
  2. Import the testing package in this file.
  3. Write benchmark functions named with the BenchmarkXxx prefix, each taking a *testing.B parameter.
  4. Run go test -bench=. to execute them; a sketch follows the list.
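
A minimal sketch, assuming a file named concat_test.go:

// concat_test.go
package main

import (
    "strings"
    "testing"
)

// BenchmarkJoin runs strings.Join b.N times; the framework grows b.N
// until the timing is stable, then reports ns/op.
func BenchmarkJoin(b *testing.B) {
    parts := []string{"alpha", "beta", "gamma", "delta"}
    for i := 0; i < b.N; i++ {
        _ = strings.Join(parts, ",")
    }
}

Adding -benchmem to the go test invocation also reports allocations per operation.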

4.3 Trace

Trace is the execution tracer the Go runtime provides. It captures events while the program executes (for example goroutine creation, running, blocking, and exiting, as well as GC activity) and saves them to a file.

How to use it:

  1. Open an output file, e.g. f, err := os.Create("trace.out").
  2. Call trace.Start(f) before the code you want to trace runs.
  3. Call trace.Stop() after the workload ends (typically via defer).
  4. Inspect the result with go tool trace trace.out; a sketch follows the list.
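
A minimal sketch wiring the steps together:

package main

import (
    "log"
    "os"
    "runtime/trace"
)

func main() {
    f, err := os.Create("trace.out") // step 1: the output file
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    if err := trace.Start(f); err != nil { // step 2: start tracing
        log.Fatal(err)
    }
    defer trace.Stop() // step 3: stop when the workload ends

    // ... the workload to be traced runs here ...
}

Afterwards, go tool trace trace.out opens the captured events in a browser-based viewer.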
