100 Go mistakes and how to avoid them: 9~12

License: CC BY-NC-SA 4.0

Translator: Flying Dragon

This article comes from the [OpenDocCN Saturated Translation Project], which uses a machine translation post-editing (MTPE) process to improve efficiency as much as possible.

Once the truth is in your eyes, you can no longer ignore it. --"The Matrix"

9. Concurrency: practice

This chapter covers

  • Preventing common mistakes with goroutines and channels
  • Understanding the impact of standard data structures in concurrent code
  • Using the standard library and some extensions
  • Avoiding data races and deadlocks

In the previous chapter, we discussed the basics of concurrency. It's time to look at the actual mistakes Go developers make when using concurrency primitives.

9.1 #61: Propagating an inappropriate context

Contexts are ubiquitous when dealing with concurrency in Go, and in many situations it is recommended to propagate them. However, context propagation can sometimes lead to subtle bugs that prevent sub-functions from executing correctly.

Let's consider the following example. We expose an HTTP handler that performs some task and returns a response. But just before returning the response, we also want to send it to a Kafka topic. We don't want to increase latency for the HTTP consumer, so we want the publish action to be handled asynchronously in a new goroutine. Let's assume we have a publish function that accepts a context, so that the action of publishing a message can be interrupted if the context is canceled, for example. Here is a possible implementation:

func handler(w http.ResponseWriter, r *http.Request) {
    response, err := doSomeTask(r.Context(), r)          // ❶
    if err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }

    go func() {
        err := publish(r.Context(), response)            // ❷
        // Do something with err
    }()

    writeResponse(response)                              // ❸
}

❶ Performs some task to compute the HTTP response

❷ Creates a goroutine to publish the response to Kafka

❸ Writes the HTTP response

First we call the doSomeTask function to get a response variable. This response is used both in the goroutine calling publish and when formatting the HTTP response. In addition, when calling publish, we propagate the context attached to the HTTP request. Can you guess what's wrong with this code?

We must be aware that the context attached to the HTTP request can be canceled in different cases:

  • when the client connection is closed

  • In the case of HTTP/2 requests, when the request is canceled

  • When the response is written back to the client

In the first two cases, we probably get things right. For example, if we get a response from doSomeTask but the client has closed the connection, the context may already be canceled when publish is called, so the message isn't published. But what about the last case?

When the response is written to the client, the context associated with the request will be canceled. Therefore, we are faced with a race condition:

  • If the response is written after the Kafka publication, we both return a response and publish the message successfully.

  • However, if the response was written before or during Kafka's publication, the message should not be published.

In the latter case, calling publish will return an error because we returned the HTTP response quickly.

How can we fix this problem? One idea is to not propagate the parent context. Instead, we would call publish with an empty context:

err := publish(context.Background(), response)    // ❶

❶ Use empty context instead of HTTP request context

This works: regardless of how long it takes to write back the HTTP response, we can call publish.

But what if the context contains useful values? For example, we can correlate HTTP requests with Kafka publications if the context contains a correlation ID for distributed tracing. Ideally, we'd like to have a new context that is unrelated to the underlying parent's cancellation, but still communicates the value.

The standard package does not provide a direct solution to this problem. Therefore, a possible solution is to implement our own Go context, similar to the one provided, except that it does not carry the cancellation signal.

A context.Context is an interface containing four methods:

type Context interface {
    Deadline() (deadline time.Time, ok bool)
    Done() <-chan struct{}
    Err() error
    Value(key any) any
}

The context's deadline is managed with the Deadline method, and the cancellation signal is managed via the Done and Err methods. When a deadline has passed or the context has been canceled, Done should return a closed channel, whereas Err should return an error. Finally, values are carried via the Value method.

Let's create a custom context that decouples the cancellation signal from the parent context:

type detach struct {                  // ❶
    ctx context.Context
}

func (d detach) Deadline() (time.Time, bool) {
    return time.Time{}, false
}

func (d detach) Done() <-chan struct{} {
    return nil
}

func (d detach) Err() error {
    return nil
}

func (d detach) Value(key any) any {
    return d.ctx.Value(key)           // ❷
}

❶ The custom struct acts as a wrapper on top of the initial context

❷ Delegate the call to get the value to the parent context

Except for the Value method, which calls the parent context to retrieve a value, all the other methods return default values, so the context is never considered expired or canceled.

Thanks to our custom context, we can now call publish with the cancellation signal detached:

err := publish(detach{ctx: r.Context()}, response)    // ❶

❶ Uses detach on top of the HTTP context

The context passed to publish will never expire or be canceled, but it will carry the parent context's values.

In summary, we should propagate a context cautiously. In this section, we illustrated this with an example of handling an asynchronous action based on the context associated with an HTTP request. Because the context is canceled once we return the response, the asynchronous action can also be stopped unexpectedly. Let's bear in mind what it means to propagate a given context, and remember that it is always possible to create a custom context for a specific action if necessary.

The next section discusses a common concurrency mistake: starting a goroutine without planning to stop it.

9.2 #62: Starting a goroutine without knowing when to stop it

Starting a goroutine is easy and cheap—so easy and cheap that we may not necessarily plan for when to stop a new goroutine, which can lead to leaks. Not knowing when to stop a goroutine is a design issue and a common concurrency mistake in Go. Let's understand why and how to prevent it.

First, let's quantify what a goroutine leak means. In terms of memory, a goroutine starts with a minimum stack size of 2 KB, which can grow and shrink as needed (the maximum stack size is 1 GB on 64-bit and 250 MB on 32-bit). A goroutine can also hold references to variables allocated on the heap. Meanwhile, a goroutine can hold resources such as HTTP or database connections, open files, and network sockets that should eventually be closed gracefully. If a goroutine is leaked, these kinds of resources will be leaked as well.

Let's look at an example where the point at which a goroutine stops is not clear. Here, the parent goroutine calls a function that returns a channel, and then creates a new goroutine that will continue to receive messages from that channel:

ch := foo()
go func() {
    for v := range ch {
        // ...
    }
}()

When ch is closed, the created goroutine will exit. But do we know exactly when this channel will be closed? It may not be obvious, because ch is created by the foo function. If the channel is never closed, it's a leak. So we should always be cautious about a goroutine's exit point and make sure one is eventually reached.

Let's discuss a concrete example. We will design an application that needs to watch some external configuration (for example, using a database connection). Here's a first implementation:

func main() {
    newWatcher()

    // Run the application
}

type watcher struct { /* Some resources */ }

func newWatcher() {
    w := watcher{}
    go w.watch()      // ❶
}

❶ Creates a goroutine that watches the external configuration

We call newWatcher, which creates a watcher struct and spins up a goroutine in charge of watching the configuration. The problem with this code is that when the main goroutine exits (perhaps because of an OS signal or because it has a finite workload), the application stops. Hence, the resources created by watcher aren't closed gracefully. How can we prevent this from happening?

One option is to pass newWatcher a context that will be canceled when main returns:

func main() {
    ctx, cancel := context.WithCancel(context.Background())
    defer cancel()

    newWatcher(ctx)      // ❶

    // Run the application
}

func newWatcher(ctx context.Context) {
    w := watcher{}
    go w.watch(ctx)      // ❷
}

❶ Passes to newWatcher a context that will eventually be canceled

❷ Propagates this context

We propagate the created context to the watch method. When the context is canceled, the watcher struct should close its resources. However, can we guarantee that watch will have time to do so? Absolutely not—and that's a design flaw.

The problem is that we used signaling to convey that a goroutine has to be stopped. The parent goroutine isn't blocked until the resources have been closed. Let's make sure it is:

func main() {
    w := newWatcher()
    defer w.close()     // ❶

    // Run the application
}

func newWatcher() watcher {
    w := watcher{}
    go w.watch()
    return w
}

func (w watcher) close() {
    // Close the resources
}

❶ Defers the call to the close method

watcher now has a new method: close. Instead of signaling watcher that it's time to close its resources, we call the close method, using defer to guarantee that the resources are closed before the application exits.
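To make this more concrete, here is one possible sketch (not from the book) of how watcher could coordinate watch and close so that close blocks until the resources are released. The stop/done channels and the ticker-based polling are assumptions made for illustration:

type watcher struct {
    stop chan struct{} // closed by close() to ask the watching goroutine to stop
    done chan struct{} // closed by watch() once the resources have been released
}

func newWatcher() watcher {
    w := watcher{stop: make(chan struct{}), done: make(chan struct{})}
    go w.watch()
    return w
}

func (w watcher) watch() {
    defer close(w.done)
    ticker := time.NewTicker(time.Second)
    defer ticker.Stop()
    for {
        select {
        case <-w.stop:
            // Release the resources (connections, files, ...) here
            return
        case <-ticker.C:
            // Poll the external configuration
        }
    }
}

func (w watcher) close() {
    close(w.stop)
    <-w.done // Blocks until the watching goroutine has released its resources
}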

In conclusion, let's be mindful that a goroutine is a resource like any other that must eventually be closed to free memory or other resources. Starting a goroutine without knowing when to stop it is a design issue. Whenever a goroutine is started, we should have a clear plan about when it will stop. Last but not least, if a goroutine creates resources and its lifetime is bound to the lifetime of the application, it's probably safer to wait for this goroutine to complete before exiting the application. This way, we can ensure that the resources are freed.

Now let's discuss one of the most common mistakes when working in Go: mishandling goroutines and loop variables.

9.3 #63: Not being careful with goroutines and loop variables

Mishandling goroutines and loop variables is probably one of the most common mistakes Go developers make when writing concurrent applications. Let's look at a concrete example; then we'll define the conditions for such a bug and how to prevent it.

In the example below, we initialize a slice. Then, in the closure executed as a new goroutine, we access this element:

s := []int{1, 2, 3}

for _, i := range s {     // ❶
    go func() {
        fmt.Print(i)       // ❷
    }()
}

❶ Iterates over each element

❷ Accesses the loop variable

We might expect this code to print 123 in no particular order (since there is no guarantee that the first goroutine created will complete first). However, the output of this code isn't deterministic. For example, sometimes it prints 233 and sometimes 333. What's the reason?

In this example, we create new goroutines from a closure. As a reminder, a closure is a function value that references variables from outside its body: here, the i variable. We have to be aware that when the closure goroutine is executed, it doesn't capture the value of i at the time the goroutine is created. Instead, all the goroutines refer to the exact same variable. When a goroutine runs, it prints the value of i at the moment fmt.Print is executed. Hence, i may have been modified since the goroutine was launched.

Figure 9.1 shows a possible execution when the code prints 233. Over time, the value of i varies: 1, 2, and then 3. In each iteration, we spin up a new goroutine. Because there is no guarantee when each goroutine starts and finishes, the result also varies. In this example, the first goroutine prints i when it equals 2. Then the other goroutines print i when its value is already 3. Therefore, this example prints 233. The behavior of this code isn't deterministic.

Figure 9.1 The goroutines access an i variable that isn't fixed but varies over time.

What's the solution if we want each closure to access the value of i when the goroutine is created? If we want to keep using a closure, the first option is to create a new variable:

for _, i := range s {
    val := i            // ❶
    go func() {
        fmt.Print(val)
    }()
}

❶ Create a local variable for each iteration

Why does this code work? In each iteration, we create a new local variable val. This variable captures the current value of i before the goroutine is created. Hence, when each closure goroutine executes the print statement, it does so with the expected value. This code prints 123 (again, in no particular order).

The second option no longer relies on closures, but uses an actual function:

for _, i := range s {
    go func(val int) {     // ❶
        fmt.Print(val)
    }(i)                   // ❷
}

❶ Executes a function that takes an integer as an argument

❷ Calls this function and passes it the current value of i

We still execute an anonymous function within a new goroutine (we don't run go f(i), for example), but this time it isn't a closure. The function doesn't reference val as a variable from outside its body; val is now part of the function's input. By doing so, we fix i in each iteration and make our application work as expected.

We have to be careful with goroutines and loop variables. They become a problem if the goroutine is a closure that accesses an iteration variable declared from outside its body. We can fix it either by creating a local variable (for example, as we've seen using val := i before executing the goroutine) or by making the function no longer a closure. Both options work, and we shouldn't favor one over the other. Some developers may find the closure approach handier, whereas others may find the function approach more expressive.

What happens with a select statement on multiple channels? Let's find out.

9.4 #64: Expecting deterministic behavior using select and channels

One common mistake made by Go developers while working with channels is to make wrong assumptions about how select behaves with multiple channels. A false assumption can lead to subtle bugs that are hard to identify and reproduce.

Suppose we want to implement a goroutine that needs to receive data from two channels:

  • messageCh for new messages to be processed.

  • disconnectCh to receive notifications conveying disconnections. In that case, we want to return from the parent function.

Of these two channels, we want to prioritize messageCh. For example, if a disconnection occurs, we want to make sure we have received all the messages before returning.

We can decide to handle priority like this:

for {
    select {                         // ❶
    case v := <-messageCh:           // ❷
        fmt.Println(v)
    case <-disconnectCh:             // ❸
        fmt.Println("disconnection, return")
        return
    }
}

❶ Uses the select statement to receive from multiple channels

❷ Receives new messages

❸ Handles disconnections

We use select to receive from multiple channels. Because we want to prioritize messageCh, we might assume that we should write the messageCh case first and the disconnectCh case second. But does this code even work? Let's give it a try by writing a dummy producer goroutine that sends 10 messages and then sends a disconnection notification:

for i := 0; i < 10; i++ {
    messageCh <- i
}
disconnectCh <- struct{}{}

If we run this example and messageCh is buffered, here is a possible output:

0
1
2
3
4
disconnection, return

Instead of consuming the 10 messages, we received only 5 of them. What's the reason? It lies in the specification of the select statement with multiple channels (go.dev/ref/spec):

If one or more of the communications can proceed, a single one that can proceed is chosen via a uniform pseudo-random selection.

Unlike a switch statement, where the first case with a match wins, the select statement selects one randomly if multiple options are possible.

This behavior might look odd at first, but there's a good reason for it: to prevent possible starvation. Suppose the first possible communication chosen was based on source order. In that case, we might fall into a situation where, for example, we only receive from one channel because the sender is fast. To prevent this, the language designers decided to use random selection.

Coming back to our example, even though case v := <-messageCh comes first in source order, if there are messages in both messageCh and disconnectCh, there is no guarantee about which case will be chosen. Therefore, the example's behavior isn't deterministic. We might receive 0, 5, or 10 messages.

How can we overcome this situation? There are different possibilities if we want to receive all the messages before returning in case of a disconnection.

If there is only one producer, we have two options:

  • Make messageCh an unbuffered channel instead of a buffered one. Because the sender goroutine blocks until the receiver goroutine is ready, this approach guarantees that all the messages from messageCh are received before the disconnection from disconnectCh.

  • Use a single channel instead of two. For example, we can define a struct that conveys either a new message or a disconnection (see the sketch after this list). Channels guarantee that messages are delivered in the order they were sent, so we can ensure that the disconnection is received last.
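Here is a minimal sketch of the single-channel idea; the event struct and its fields are assumptions made for illustration:

type event struct {
    message      int
    disconnected bool
}

eventCh := make(chan event)

// Consumer side: because a single channel preserves ordering from a single
// producer, the disconnection event is guaranteed to arrive last.
for e := range eventCh {
    if e.disconnected {
        fmt.Println("disconnection, return")
        return
    }
    fmt.Println(e.message)
}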

If we face the case of multiple producer goroutines, it may be impossible to guarantee which one writes first. Hence, whether we have an unbuffered messageCh channel or a single channel, it will lead to a race condition among the producer goroutines. In that case, we can implement the following solution:

  1. Receive from either messageCh or disconnectCh.

  2. If a disconnection is received:

    • Read all the existing messages in messageCh, if any.
    • Then return.

Here is the solution:

for {
    select {
    case v := <-messageCh:
        fmt.Println(v)
    case <-disconnectCh:
        for {                          // ❶
            select {
            case v := <-messageCh:     // ❷
                fmt.Println(v)
            default:                   // ❸
                fmt.Println("disconnection, return")
                return
            }
        }
    }
}

❶ Inner for/select

❷ Reads the remaining messages

❸ Then returns

This solution uses an inner for/select with two cases: one on messageCh and a default case. A default case in a select statement is chosen only if none of the other cases match. Here, it means we return only after we have received all the remaining messages in messageCh.

Let's look at an example of how this code works. We will consider the case where there are two messages in messageCh and one disconnection in disconnectCh, as shown in figure 9.2.


Figure 9.2 Initial state

In this situation, as we have said, select chooses one case or the other randomly. Let's assume select chooses the second case; see figure 9.3.

Figure 9.3 Receiving the disconnection

So, we receive the disconnection and enter the inner select (figure 9.4). Here, as long as messages remain in messageCh, select will always prioritize the first case over default (figure 9.5).

Figure 9.4 The inner select

Figure 9.5 Receiving the remaining messages

Once we have received all the messages from messageCh, select doesn't block and chooses the default case (figure 9.6). Hence, we return and stop the goroutine.

Figure 9.6 The default case

This is a way to ensure that we receive all the remaining messages from one channel when receiving from multiple channels. Of course, if a message is sent to messageCh after the goroutine has returned (for example, if we have multiple producer goroutines), we will miss this message.

When using select with multiple channels, we must remember that if multiple options are possible, the first case in source order does not automatically win. Instead, Go selects randomly, so there's no guarantee about which option will be chosen. To overcome this behavior, in the case of a single producer goroutine, we can use either an unbuffered channel or a single channel. In the case of multiple producer goroutines, we can use an inner select with a default case to handle priorities.

The next section discusses a common channel type: notification channels.

9.5 #65: Not using notification channels

Channels are a mechanism for communicating across goroutines via signaling. A signal can be either with or without data. But how to handle the latter case isn't always straightforward for Go programmers.

Let's look at a concrete example. We will create a channel that will notify us whenever a disconnection occurs. One idea is to handle it as a chan bool:

disconnectCh := make(chan bool)

Now, suppose we interact with an API that provides us with such a channel. Because it's a channel of Booleans, we can receive either true or false messages. It's probably clear what true conveys. But what does false mean? Does it mean we haven't been disconnected? And in this case, how frequently will we receive such a signal? Does it mean we have reconnected?

Should we even expect to receive false? Perhaps we should only expect to receive true messages. If that's the case, it means we don't need a specific value to convey the information; we need a channel without data. The idiomatic way to handle this is a channel of empty structs: chan struct{}.

In Go, an empty struct is a struct without any fields. Regardless of the architecture, it occupies zero bytes of storage, as we can verify with unsafe.Sizeof:

var s struct{}
fmt.Println(unsafe.Sizeof(s))

0

NOTE Why not use an empty interface (var i interface{})? Because an empty interface isn't free; it occupies 8 bytes on 32-bit architectures and 16 bytes on 64-bit architectures.

An empty struct is a de facto standard to convey an absence of meaning. For example, if we need a hash set structure (a collection of unique elements), we should use an empty struct as the value: map[K]struct{}.
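For instance, here is a minimal sketch of using empty structs as map values to implement a set of strings:

set := make(map[string]struct{})
set["foo"] = struct{}{}
set["bar"] = struct{}{}

_, exists := set["foo"]
fmt.Println(exists) // true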

Applied to channels, if we want to create a channel to send notifications without data, the appropriate way to do so in Go is a chan struct{}. One of the best-known applications of a channel of empty structs is Go contexts, which we discuss in this chapter.

A channel can be with or without data. If we want to design an idiomatic API with regard to Go standards, let's remember that a channel without data should be expressed with a chan struct{} type. This way, it clarifies for receivers that they shouldn't expect any meaning from a message's content—only the fact that they have received a message. In Go, such channels are called notification channels.
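For example, here is a minimal sketch of a notification channel where the only information conveyed is the fact that a disconnection happened; the channel name and the closing goroutine are assumptions made for illustration:

disconnectCh := make(chan struct{})

go func() {
    // ... detect the disconnection ...
    close(disconnectCh) // Notifies every receiver without carrying any data
}()

<-disconnectCh // Blocks until notified
fmt.Println("disconnected")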

The next section discusses how Go handles nil channels and the rationale for using them.

9.6 #66: Not using nil channels

A common mistake while working with Go and channels is forgetting that nil channels can sometimes be helpful. So what are nil channels, and why should we care about them? That's the scope of this section.

Let's start with a goroutine that creates a nil channel and waits to receive a message. What should this code do?

var ch chan int     // ❶
<-ch

❶ A nil channel

ch is of type chan int. The zero value of a channel is nil, so ch is nil. The goroutine won't panic; however, it will block forever.

The principle is the same if we send a message to a nil channel. This goroutine also blocks forever:

var ch chan int
ch <- 0

So why does Go allow messages to be received from or sent to a nil channel? We will discuss this question with a concrete example.

We will implement a func merge(ch1, ch2 <-chan int) <-chan int function to merge two channels into a single channel. By merging them (see figure 9.7), we mean that each message received in either ch1 or ch2 will be sent to the channel returned.

Figure 9.7 Merging two channels into one

How do we do this in Go? Let's first write a naive implementation that spins up a goroutine and receives from both channels (the resulting channel will be a buffered channel with one element):

func merge(ch1, ch2 <-chan int) <-chan int {
    ch := make(chan int, 1)

    go func() {
        for v := range ch1 {        // ❶
            ch <- v
        }
        for v := range ch2 {        // ❷
            ch <- v
        }
        close(ch)
    }()

    return ch
}

❶ Receives from ch1 and publishes to the merged channel

❷ Receives from ch2 and publishes to the merged channel

Within another goroutine, we receive from both channels, and each message ends up being published in ch.

The main problem with this first version is that we receive from ch1 and then from ch2. This means we won't receive from ch2 until ch1 is closed. This doesn't fit our use case, since ch1 may be open forever, so we want to receive from both channels simultaneously.

Let's write an improved version with concurrent receivers using select:

func merge(ch1, ch2 <-chan int) <-chan int {
    ch := make(chan int, 1)

    go func() {
        for {
            select {              // ❶
            case v := <-ch1:
                ch <- v
            case v := <-ch2:
                ch <- v
            }
        }
        close(ch)
    }()

    return ch
}

❶ Receives from both ch1 and ch2 concurrently

The select statement lets a goroutine wait on multiple operations at the same time. Because we wrap it inside a for loop, we should repeatedly receive messages from one channel or the other, right? But does this code even work?

One problem is that the close(ch) statement is unreachable. Looping over a channel with the range operator breaks when the channel is closed. However, the way we implemented the for/select doesn't catch when either ch1 or ch2 is closed. Even worse, if at some point ch1 or ch2 is closed, here is what a receiver of the merged channel will receive when logging the value:

received: 0
received: 0
received: 0
received: 0
received: 0
...

So the receiver will repeatedly receive an integer equal to zero. Why? Receiving from a closed channel is a non-blocking operation:

ch1 := make(chan int)
close(ch1)
fmt.Print(<-ch1, <-ch1)

Although we might expect this code to panic or block, it runs and prints 0 0. What we're capturing here is the closure event, not an actual message. To check whether we have received a message or a closure signal, we must do it this way:

ch1 := make(chan int)
close(ch1)
v, open := <-ch1        // ❶
fmt.Print(v, open)

❶ Assigns to open whether or not the channel is open

Using the open Boolean, we can now see whether ch1 is still open:

0 false

Meanwhile, we also assign 0 to v because it's the zero value of an integer.

Let's come back to our second solution. We said it doesn't work well if ch1 is closed; for example, because the select case is case v := <-ch1, we will keep entering this case and publishing a zero integer to the merged channel.

Let's take a step back and look at the best possible way to deal with this problem (see figure 9.8). We have to receive from both channels. Then, either:

  • ch1 is closed first, so we have to receive from ch2 until it is closed.

  • ch2 is closed first, so we have to receive from ch1 until it is closed.

Figure 9.8 Handling the different cases depending on whether ch1 or ch2 is closed first

How can we implement this in Go? Let's write a version like we might do using a state machine approach with Booleans:

func merge(ch1, ch2 <-chan int) <-chan int {
    ch := make(chan int, 1)
    ch1Closed := false
    ch2Closed := false

    go func() {
        for {
            select {
            case v, open := <-ch1:
                if !open {                 // ❶
                    ch1Closed = true
                    break
                }
                ch <- v
            case v, open := <-ch2:
                if !open {                 // ❷
                    ch2Closed = true
                    break
                }
                ch <- v
            }

            if ch1Closed && ch2Closed {    // ❸
                close(ch)
                return
            }
        }
    }()

    return ch
}

❶ Handles whether ch1 is closed

❷ Handles whether ch2 is closed

❸ Closes ch and returns if both channels are closed

We define two Booleans, ch1Closed and ch2Closed. Once we receive a message from a channel, we check whether it's a closure signal. If it is, we handle it by marking the channel as closed (for example, ch1Closed = true). Once both channels are closed, we close the merged channel and stop the goroutine.

What's wrong with this code, apart from the fact that it's becoming complex? There is one major issue: when one of the two channels is closed, the for loop acts as a busy-wait loop, meaning it keeps looping even though no new message is received in the other channel. In our case, we have to keep in mind the behavior of the select statement. Let's say ch1 is closed (so we won't receive any new messages here); when we reach select again, it will wait for one of these three conditions to happen:

  • ch1 is closed.

  • ch2 has a new message.

  • ch2 is closed.

The first condition—ch1 is closed—will always be valid. Therefore, as long as we don't receive a message in ch2 and this channel isn't closed, we will keep looping over the first case. This leads to wasted CPU cycles and must be avoided. Therefore, our solution isn't viable.

We could try to enhance the state machine part and implement sub-for/select loops within each case. But this would make our code more complex and harder to understand.

It's time to come back to nil channels. As we mentioned, receiving from a nil channel blocks forever. How about using this idea in our solution? Instead of setting a Boolean after a channel is closed, we will assign this channel to nil. Let's write the final version:

func merge(ch1, ch2 <-chan int) <-chan int {
    ch := make(chan int, 1)

    go func() {
        for ch1 != nil || ch2 != nil {    // ❶
            select {
            case v, open := <-ch1:
                if !open {
                    ch1 = nil             // ❷
                    break
                }
                ch <- v
            case v, open := <-ch2:
                if !open {
                    ch2 = nil             // ❸
                    break
                }
                ch <- v
            }
        }
        close(ch)
    }()

    return ch
}

❶ Continues as long as at least one channel isn't nil

❷ Assigns ch1 to a nil channel once it's closed

❸ Assigns ch2 to a nil channel once it's closed

First, we loop as long as at least one channel is still open. Then, if ch1 is closed, for example, we assign ch1 to nil. Hence, during the next loop iteration, the select statement will only wait for two conditions:

  • ch2 has a new message.

  • ch2 is closed.

ch1 is no longer part of the equation because it's a nil channel. Meanwhile, we keep the same logic for ch2 and assign it to nil after it's closed. Finally, when both channels are closed, we close the merged channel and return. Figure 9.9 shows a model of this implementation.

Figure 9.9 Receiving from both channels. If one is closed, we assign it to nil so that we receive only from the other channel.

This is the implementation we've been waiting for. We cover all the different cases, and it doesn't require a busy loop that wastes CPU cycles.
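As a usage example, here is a small sketch of how the merge function above could be consumed, with two producers feeding the channels being merged:

ch1 := make(chan int)
ch2 := make(chan int)

go func() {
    defer close(ch1)
    ch1 <- 1
    ch1 <- 2
}()
go func() {
    defer close(ch2)
    ch2 <- 3
}()

for v := range merge(ch1, ch2) {
    fmt.Println(v) // Prints 1, 2, and 3 in some interleaving
}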

In conclusion, we have seen that waiting to receive from or sending to a nil channel blocks forever, and this behavior is useful. As we saw throughout the example of merging two channels, we can use nil channels to implement an elegant state machine that removes one case from a select statement. Let's keep this idea in mind: nil channels are useful in some situations and should be part of the Go developer's toolset when dealing with concurrent code.

In the next section, we'll discuss the size that should be set when creating a channel.

9.7 #67: Being puzzled about channel size

When we create a channel using the make built-in function, the channel can be either unbuffered or buffered. Related to this topic, two mistakes happen fairly frequently: not knowing when to use one or the other, and, if we use a buffered channel, what size to use. Let's examine these points.

First, let's remember the core concepts. An unbuffered channel is a channel without any capacity. It can be created either by omitting the size or by providing a 0 size:

ch1 := make(chan int)
ch2 := make(chan int, 0)

With an unbuffered channel (sometimes called a synchronous channel), the sender blocks until the receiver receives data from the channel.

Conversely, a buffered channel has a capacity, and it must be created with a size greater than or equal to 1:

ch3 := make(chan int, 1)

With a buffered channel, a sender can send messages while the channel isn't full. Once the channel is full, it blocks until a receiver goroutine receives a message. For example:

ch3 := make(chan int, 1)
ch3 <- 1                  // ❶
ch3 <- 2                  // ❷

❶ No blocking

❷ Blocking

The first send does not block, while the second one blocks because the channel is full at this stage.

Let's take a step back and discuss the fundamental difference between these two channel types. A channel is a concurrency abstraction used to support communication between goroutines. But what about synchronization? In concurrency, synchronization means that we can guarantee that multiple goroutines are in a known state at some point. For example, a mutex provides synchronization because it ensures that only one goroutine is in a critical section at a time. About channels:

  • Unbuffered channels support synchronization. We guarantee that two goroutines will be in a known state: one receiving messages and the other sending them.

  • Buffered channels don't provide any strong synchronization. Indeed, a producer goroutine can send a message and then continue its execution if the channel isn't full. The only guarantee is that a goroutine won't receive a message before it is sent. But this is only a guarantee because of causality (you don't drink your coffee before you prepare it).

It's essential to keep this fundamental distinction in mind. Both channel types enable communication, but only one provides synchronization. If we need synchronization, we must use unbuffered channels. Unbuffered channels may also be easier to reason about: buffered channels can lead to non-obvious deadlocks that would be immediately apparent with unbuffered channels.
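As a small illustration of this synchronization guarantee, here is a minimal sketch: with an unbuffered channel, the send and the receive form a rendezvous point, so both goroutines are in a known state when the exchange happens:

done := make(chan struct{}) // Unbuffered: the send and the receive must meet

go func() {
    // ... do some work ...
    done <- struct{}{} // Blocks until the main goroutine is receiving
}()

<-done // At this point, we know the work above has completed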

There are other cases where unbuffered channels are preferable: for example, in the case of a notification channel where the notification is handled via a channel closure (close(ch)). Here, using a buffered channel wouldn't bring any benefit.

But what if we need a buffered channel? What size should we provide? The default value we should use for buffered channels is their minimal possible value: 1. So, we may approach the problem from this standpoint: is there any good reason not to use a value of 1? Here's a list of possible cases where we should use another size:

  • When using a worker pooling-like pattern, meaning spinning up a fixed number of goroutines that need to send data to a shared channel. In that case, we can tie the channel size to the number of goroutines created (see the sketch after this list).

  • When using channels for rate-limiting problems. For example, if we need to enforce resource utilization by bounding the number of requests, we should set up the channel size according to the limit.
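Here is a minimal sketch tying the channel capacity to the number of worker goroutines, as described in the first bullet; the number of workers and the squaring operation are arbitrary choices for illustration:

const workers = 4
results := make(chan int, workers) // One buffer slot per worker goroutine

var wg sync.WaitGroup
wg.Add(workers)
for i := 0; i < workers; i++ {
    go func(id int) {
        defer wg.Done()
        results <- id * id // Each worker sends exactly one result
    }(i)
}

wg.Wait()
close(results)
for r := range results {
    fmt.Println(r)
}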

If we are outside of these cases, using a different channel size should be done cautiously. It's pretty common to see codebases using magic numbers to set a channel size:

ch := make(chan int, 40)

Why 40? What's the rationale? Why not 50 or even 1000? Setting such a value should be done for a good reason. Perhaps it was decided following a benchmark or performance tests. In many cases, commenting on the rationale for such a value is probably a good idea.

Let's remember that deciding on an exact queue size is not a simple matter. First, it's a balance between CPU and memory. The smaller the value, the more CPU contention we face. But the larger the value, the more memory needs to be allocated.

Another issue to consider is the one mentioned in the 2011 white paper on the LMAX Disruptor (Martin Thompson et al; lmax-exchange.github.io/disruptor/files/Disruptor-1.0.pdf):

Due to the speed difference between consumers and producers, the queue is usually always close to full or close to empty. They seldom operate in a balanced middle ground where the ratios of production and consumption are evenly matched.

Therefore, it is difficult to find a stable and accurate channel size, which means an exact value that does not cause too much contention or wasted memory allocation.

That's why, aside from the cases described, it's usually best to start with a default channel size of 1. When in doubt, we can still measure it using, for example, benchmarks.

As with almost any topic in programming, exceptions can be found. Therefore, this section's goal isn't to be exhaustive but to give directions about what size to use when creating a channel. Synchronization is guaranteed only with unbuffered channels, not buffered ones. Also, if we need a buffered channel, we should remember to use 1 as the default value for the channel size. Deciding to use another value should be done carefully, with a precise process, and the rationale should be commented. Last but not least, let's keep in mind that choosing a buffered channel can also lead to non-obvious deadlocks that would be easier to spot with unbuffered channels.

In the next section, we discuss possible side effects when dealing with string formatting.

9.8 #68: Forgetting about possible side effects with string formatting

Formatting strings is a common action for developers, whether returning an error or logging a message. However, when working in concurrent applications, it's easy to forget about the potential side effects of string formatting. This section will look at two concrete examples: one from the etcd repository, leading to a data race, and another leading to a deadlock situation.

9.8.1 etcd data race

etcd is a distributed key-value store implemented in Go. It is used in many projects, including Kubernetes, to store all cluster data. It provides an API to interact with a cluster. For example, the Watcher interface is used to receive data-change notifications:

type Watcher interface {
    // Watch watches on a key or prefix. The watched events will be returned
    // through the returned channel.
    // ...
    Watch(ctx context.Context, key string, opts ...OpOption) WatchChan
    Close() error
}

The API relies on gRPC streaming. If you're not familiar with it, it's a technology to continuously exchange data between a client and a server. The server has to maintain a list of all the clients using this feature. Hence, the Watcher interface is implemented by a watcher struct containing all the active streams:

type watcher struct {
    // ...

    // streams hold all the active gRPC streams keyed by ctx value.
    streams map[string]*watchGrpcStream
}

The keys of this map are based on the context provided when calling the Watch method:

func (w *watcher) Watch(ctx context.Context, key string,
    opts ...OpOption) WatchChan {
    // ...
    ctxKey := fmt.Sprintf("%v", ctx)       // ❶
    // ...
    wgs := w.streams[ctxKey]
    // ...

❶ Formats the map key depending on the provided context

ctxKey is the key of the map, formatted from the context provided by the client. When formatting a string from a context created with values (context.WithValue), Go will read all the values in that context. In this case, the etcd developers found that under certain conditions the context provided to Watch contained mutable values (for example, pointers to structs). They found a case where one goroutine was updating a context value while another was executing Watch, hence reading all the values in this context. This led to a data race.

The fix (github.com/etcd-io/etcd/pull/7816) was to not rely on fmt.Sprintf to format the map's key, to prevent traversing and reading the chain of wrapped values in the context. Instead, the solution was to implement a custom streamKeyFromCtx function to extract the key from a specific context value that isn't mutable.
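The actual fix lives in the pull request linked above. Purely as an illustration of the idea, a function along these lines could derive the key from one specific, immutable context value instead of formatting the whole context; the streamKeyType type and the string value are hypothetical, not etcd's real implementation:

type streamKeyType struct{}

// streamKeyFromCtx derives the map key from a single immutable value
// stored in the context rather than formatting the entire context.
func streamKeyFromCtx(ctx context.Context) string {
    if key, ok := ctx.Value(streamKeyType{}).(string); ok {
        return key
    }
    return ""
}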

NOTE Potentially mutable values in a context introduce additional complexity if we want to prevent data races. It's probably a design decision to be considered with care.

This example illustrates that we must be careful about side effects of string formatting in concurrent applications—in this case, data races. In the following example, we will see the side effects that lead to a deadlock situation.

9.8.2 Deadlock

Suppose we have to deal with a Customer struct that can be accessed concurrently. We will use a sync.RWMutex to protect access, whether reads or writes. We will implement an UpdateAge method to update the customer's age and check that the age is positive. Meanwhile, we will also implement the fmt.Stringer interface.

Can you see what the problem is in this code? The Customer struct exposes an UpdateAge method and implements the fmt.Stringer interface.

type Customer struct {
    mutex sync.RWMutex                                    // ❶
    id    string
    age   int
}

func (c *Customer) UpdateAge(age int) error {
    c.mutex.Lock()                                        // ❷
    defer c.mutex.Unlock()

    if age < 0 {                                          // ❸
        return fmt.Errorf("age should be positive for customer %v", c)
    }

    c.age = age
    return nil
}

func (c *Customer) String() string {
    c.mutex.RLock()                                       // ❹
    defer c.mutex.RUnlock()
    return fmt.Sprintf("id %s, age %d", c.id, c.age)
}

❶ Uses a sync.RWMutex to protect concurrent accesses

❷ Locks and defers unlocking as we update Customer

❸ Returns an error if the age is negative

❹ Locks and defers unlocking as we read Customer

The problem here might not be straightforward. If the provided age is negative, we return an error. Because the error is formatted using the %v directive on the receiver, formatting it will call the String method on Customer. But because UpdateAge has already acquired the mutex lock, the String method won't be able to acquire it (see figure 9.10).

Figure 9.10 The execution of UpdateAge if age is negative

Hence, this leads to a deadlock situation. If all goroutines are also asleep, it leads to a panic:

fatal error: all goroutines are asleep - deadlock!

goroutine 1 [semacquire]:
sync.runtime_SemacquireMutex(0xc00009818c, 0x10b7d00, 0x0)
...

How should this situation be handled? First, it illustrates the importance of unit testing. In this case, we might decide that it's not worth creating a test for a negative age, since the logic is so simple. However, without proper test coverage, we might miss the issue.

One possible improvement here is to restrict the scope of the mutex lock. In UpdateAge, we first acquire the lock and then check whether the input is valid. We should do the opposite: first check the input, and only if the input is valid acquire the lock. This has the benefit of reducing the potential side effects and can also be beneficial performance-wise—the lock is acquired only when needed, not before:

func (c *Customer) UpdateAge(age int) error {
    if age < 0 {
        return fmt.Errorf("age should be positive for customer %v", c)
    }

    c.mutex.Lock()          // ❶
    defer c.mutex.Unlock()

    c.age = age
    return nil
}

❶ Locks the mutex only if the input is valid

In our case, locking the mutex only after the age has been checked avoids the deadlock situation. If the age is negative, String is called without the mutex being locked beforehand.

However, in some cases it isn't straightforward or possible to restrict the scope of a mutex lock. In these cases, we have to be extremely careful with string formatting. Perhaps we want to call another function that doesn't try to acquire the mutex, or we only want to change the way we format the error so that it doesn't call the String method. For example, the following code doesn't lead to a deadlock because we only log the customer ID by accessing the id field directly:

func (c *Customer) UpdateAge(age int) error {
    c.mutex.Lock()
    defer c.mutex.Unlock()

    if age < 0 {
        return fmt.Errorf("age should be positive for customer id %s", c.id)
    }

    c.age = age
    return nil
}

We have seen two concrete examples, one formatting a key from a context and another returning an error that formats a struct. In both cases, formatting a string leads to a problem: a data race and a deadlock situation, respectively. Therefore, we should remain cautious about the possible side effects of string formatting in concurrent applications.

The next section discusses the behavior of append when called concurrently.

9.9 #69: Creating data races with append

We mentioned earlier what data races are and what their impacts can be. Now, let's look at slices and whether adding an element to a slice using append is data-race-free. Spoiler: it depends.

In the following example, we will initialize a slice and create two goroutines that will use append to create a new slice with an additional element:

s := make([]int, 1)

go func() {                // ❶
    s1 := append(s, 1)
    fmt.Println(s1)
}()

go func() {                // ❷
    s2 := append(s, 1)
    fmt.Println(s2)
}()

❶ In a new goroutine, appends a new element to s

❷ Same

Do you believe this example has a data race? The answer is no.

We have to recall some slice fundamentals described in chapter 3. A slice is backed by an array and has two properties: a length and a capacity. The length is the number of available elements in the slice, whereas the capacity is the total number of elements in the backing array. When we use append, the behavior depends on whether the slice is full (length == capacity). If it is, the Go runtime creates a new backing array to add the new element; otherwise, the runtime adds it to the existing backing array.

In this example, we create a slice with make([]int, 1). The code creates a slice of length 1 and capacity 1. Thus, because the slice is full, using append in each goroutine returns a slice backed by a new array. It doesn't mutate the existing array; hence, it doesn't lead to a data race.

Now, let's run the same example with a slight change in how we initialize s. Instead of creating a slice with a length of 1, we create it with a length of 0 but a capacity of 1:

s := make([]int, 0, 1)      // ❶

// Same

❶ Changes the way the slice is initialized

How about this new example? Does it contain a data race? The answer is yes:

==================
WARNING: DATA RACE
Write at 0x00c00009e080 by goroutine 10:
  ...

Previous write at 0x00c00009e080 by goroutine 9:
  ...
==================

We create a slice with make([]int, 0, 1). Therefore, the array isn't full. Both goroutines attempt to update the same index of the backing array (index 1), which is a data race.

How can we prevent the data race if we want both goroutines to work on a slice containing the initial elements of s plus an extra element? One solution is to create a copy of s:

s := make([]int, 0, 1)

go func() {
    sCopy := make([]int, len(s), cap(s))
    copy(sCopy, s)                          // ❶

    s1 := append(sCopy, 1)
    fmt.Println(s1)
}()

go func() {
    sCopy := make([]int, len(s), cap(s))
    copy(sCopy, s)                          // ❷

    s2 := append(sCopy, 1)
    fmt.Println(s2)
}()

❶ Makes a copy and uses append on the copied slice

❷ Same

Both goroutines make a copy of the slice. Then they use append on the slice copy, not the original slice. This prevents a data race because both goroutines work on isolated data.

Data races with slices and maps

How much do data races impact slices and maps? When we have multiple goroutines, the following is true:

  • Accessing the same slice index with at least one goroutine updating the value is a data race. The goroutines access the same memory location.

  • Accessing different slice indices isn't a data race, regardless of the operation; different indices mean different memory locations.

  • Accessing the same map (whether it's the same or a different key) with at least one goroutine updating it is a data race. Why is this different from the slice data structure? As mentioned in chapter 3, a map is an array of buckets, and each bucket is a pointer to an array of key-value pairs. A hashing algorithm is used to determine the array index of the bucket. Because this algorithm contains some randomness during map initialization, one execution may lead to the same array index, whereas another may not. The race detector handles this case by raising a warning regardless of whether an actual data race occurs.

When working with slices in concurrent contexts, we must remember that using append on a slice isn't always race-free. Depending on the slice and whether it's full, the behavior changes. If the slice is full, append is race-free. Otherwise, multiple goroutines may compete to update the same array index, resulting in a data race.

In general, we shouldn't have different implementations depending on whether the slice is full. We should consider that using append on a shared slice in concurrent applications can lead to a data race. Hence, it should be avoided.

Now, let's discuss a common mistake with inaccurate use of mutexes on slices and maps.

9.10 #70: Using mutexes inaccurately with slices and maps

While working in concurrent contexts where data is both mutable and shared, we often have to implement protected access to data structures using mutexes. A common mistake is to use mutexes inaccurately when dealing with slices and maps. Let's look at a concrete example and understand the potential problems.

We will implement a Cache struct used to handle caching for customer balances. This struct will contain a map of balances per customer ID and a mutex to protect concurrent accesses:

type Cache struct {
    mu       sync.RWMutex
    balances map[string]float64
}

NOTE This solution uses a sync.RWMutex to allow multiple readers as long as there are no writers.

Next, we add an AddBalance method that mutates the balances map. The mutation is done in a critical section (between the mutex lock and unlock):

func (c *Cache) AddBalance(id string, balance float64) {
    c.mu.Lock()
    c.balances[id] = balance
    c.mu.Unlock()
}

Meanwhile, we have to implement a method to calculate the average balance for all the customers. One idea is to handle a minimal critical section this way:

func (c *Cache) AverageBalance() float64 {
    c.mu.RLock()
    balances := c.balances                  // ❶
    c.mu.RUnlock()

    sum := 0.
    for _, balance := range balances {      // ❷
        sum += balance
    }
    return sum / float64(len(balances))
}

❶ Creates a copy of balances

❷ Iterates over the copy, outside the critical section

First, we create a copy of the map in a local balances variable. Only the copy is done in the critical section, so we iterate over each balance and calculate the average outside the critical section. Does this solution work?

If we run a test using the -race flag with two concurrent goroutines, one calling AddBalance (hence mutating balances) and the other calling AverageBalance, a data race occurs. What's the problem here?

Internally, a map is a runtime.hmap struct containing mostly metadata (for example, a counter) and a pointer referencing data buckets. So, balances := c.balances doesn't copy the actual data. It's the same principle with a slice:

s1 := []int{1, 2, 3}
s2 := s1
s2[0] = 42
fmt.Println(s1)

Printing s1 returns [42 2 3] even though we modified s2. The reason is that s2 := s1 creates a new slice: s2 has the same length and the same capacity as s1 and is backed by the same array.

Coming back to our example, we assign to balances a new map referencing the same data buckets as c.balances. Meanwhile, the two goroutines operate on the same data set, and one of them mutates it. Hence, it's a data race. How can we fix it? We have two options.

If the iteration operation isn't heavy (which is the case here, as we perform an increment operation), we should protect the whole function:

func (c *Cache) AverageBalance() float64 {
    c.mu.RLock()
    defer c.mu.RUnlock()       // ❶

    sum := 0.
    for _, balance := range c.balances {
        sum += balance
    }
    return sum / float64(len(c.balances))
}

❶ Unlocks when the function returns

The critical section now encompasses the entire function, including iteration. This prevents data races.

Another option, if the iteration operation isn't lightweight, is to work on an actual copy of the data and protect only the copy:

func (c *Cache) AverageBalance() float64 {
    c.mu.RLock()
    m := make(map[string]float64, len(c.balances))     // ❶
    for k, v := range c.balances {
        m[k] = v
    }
    c.mu.RUnlock()

    sum := 0.
    for _, balance := range m {
        sum += balance
    }
    return sum / float64(len(m))
}

❶ Copies the map

Once we have made a deep copy, we release the mutex. The iteration is done on the copy, outside the critical section.

Let's think about this solution. We have to iterate over the map values twice: once for the copy and once to perform the operation (here, an increment). But the critical section is only the map copy. Therefore, this solution can be a good fit if and only if the operation isn't fast. For example, if an operation requires calling an external database, this solution will probably be more efficient. It's impossible to define a threshold when choosing one solution or the other, because the choice depends on factors such as the number of elements and the average size of the struct.

In conclusion, we have to be careful with the boundaries of a mutex lock. In this section, we have seen why assigning an existing map (or an existing slice) to a new variable isn't enough to protect against data races. The new variable, whether a map or a slice, is backed by the same data set. There are two leading solutions to prevent this: protect the whole function, or work on a copy of the actual data. In all cases, let's be cautious when designing critical sections and make sure the boundaries are accurately defined.

Now let's discuss a common mistake while using sync.WaitGroup.

9.11 #71: Misusing sync.WaitGroup

sync.WaitGroup is a mechanism to wait for n operations to complete; generally, we use it to wait for n goroutines to complete. Let's first recall the public API; then we will look at a pretty frequent mistake leading to non-deterministic behavior.

A wait group can be created with the zero value of sync.WaitGroup:

wg := sync.WaitGroup{}

Internally, a sync.WaitGroup holds an internal counter initialized by default to 0. We can increment this counter using the Add(int) method and decrement it using Done() or using Add with a negative value. If we want to wait for the counter to be equal to 0, we have to use the Wait() method, which blocks.

NOTE The counter cannot be negative, or the goroutine panics.

In the following example, we will initialize a wait group, start three goroutines that atomically update a counter, and then wait for them to complete. We want to wait for these three goroutines before printing the value of the counter (which should be 3). Can you guess whether there's an issue with this code?

wg := sync.WaitGroup{}
var v uint64

for i := 0; i < 3; i++ {
    go func() {                     // ❶
        wg.Add(1)                   // ❷
        atomic.AddUint64(&v, 1)     // ❸
        wg.Done()                   // ❹
    }()
}

wg.Wait()                           // ❺
fmt.Println(v)

❶ Creates a goroutine

❷ Increments the wait group counter

❸ Atomically increments v

❹ Decrements the wait group counter

❺ Waits until all the goroutines have incremented v before printing it

If we run this example, we get a non-deterministic value: the code can print any value from 0 to 3. Also, if we enable the -race flag, Go even catches a data race. How is this possible, considering we are using the sync/atomic package to update v? What's wrong with this code?

The problem is that wg.Add(1) is called within the newly created goroutine, not in the parent goroutine. Hence, there is no guarantee that we have indicated to the wait group that we want to wait for three goroutines before calling wg.Wait().

Figure 9.11 shows a possible scenario where the code prints 2. In this scenario, the main goroutine spins up three goroutines. But the last goroutine is executed after the first two goroutines have already called wg.Done(), so the parent goroutine is already unblocked. Hence, in this case, the main goroutine reads v when it equals 2. Also, the race detector can detect unsafe accesses to v.

Figure 9.11 The last goroutine calls wg.Add(1) after the main goroutine is already unblocked.

The crucial thing to remember when dealing with goroutines is that the execution isn't deterministic without synchronization. For example, the following code could print either ab or ba:

go func() {
    fmt.Print("a")
}()
go func() {
    fmt.Print("b")
}()

Both goroutines can be assigned to different threads, and there's no guarantee which thread will be executed first.

The CPU has to use a memory fence (also called a memory barrier) to ensure order. Go provides different synchronization techniques to implement memory fences: for example, sync.WaitGroup enables a happens-before relationship between wg.Add and wg.Wait.

Coming back to our example, there are two options to fix the issue. First, we can call wg.Add(3) before the loop:

wg := sync.WaitGroup{}
var v uint64

wg.Add(3)
for i := 0; i < 3; i++ {
    go func() {
        // ...
    }()
}

// ...

Or, second, we can call wg.Add during each loop iteration, before spinning up the child goroutine:

wg := sync.WaitGroup{}
var v uint64

for i := 0; i < 3; i++ {
    wg.Add(1)
    go func() {
        // ...
    }()
}

// ...

Both solutions are fine. If the value that we want to eventually set to the wait group counter is known in advance, the first solution saves us from having to call wg.Add multiple times. However, it requires making sure the same count is used everywhere to avoid subtle bugs.
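For completeness, here is a runnable sketch of the second option applied to the original example, filling in the elided goroutine body with the atomic increment shown earlier:

wg := sync.WaitGroup{}
var v uint64

for i := 0; i < 3; i++ {
    wg.Add(1) // Increments the counter in the parent goroutine
    go func() {
        defer wg.Done()
        atomic.AddUint64(&v, 1)
    }()
}

wg.Wait()
fmt.Println(v) // Always prints 3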

Let's be careful not to reproduce this common mistake made by Go developers. When using sync.WaitGroup, the Add operation must be done in the parent goroutine before starting the goroutine, whereas the Done operation must be done inside the goroutine.
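To make this concrete, here is a minimal, runnable sketch (the package main wrapper is added for illustration) that applies the second fix, calling Add in the parent goroutine before each child goroutine is started:

package main

import (
    "fmt"
    "sync"
    "sync/atomic"
)

func main() {
    wg := sync.WaitGroup{}
    var v uint64

    for i := 0; i < 3; i++ {
        wg.Add(1) // Increments the counter in the parent goroutine, before starting the child
        go func() {
            defer wg.Done()         // Decrements the counter inside the goroutine
            atomic.AddUint64(&v, 1) // Atomically increments the shared counter
        }()
    }

    wg.Wait()      // Blocks until the counter is back to 0
    fmt.Println(v) // Always prints 3
}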

The next section discusses another primitive of the sync package: sync.Cond.

9.12 #72: Forgetting about sync.Cond

Among the synchronization primitives in the sync package, sync.Cond is probably the least used and understood. However, it provides features that we cannot achieve with channels. This section walks through a concrete example showing when and how sync.Cond is useful.

The example in this section implements a donation goal mechanism: an application that raises an alert whenever a certain goal is reached. We will have one goroutine responsible for increasing the balance (an updater goroutine). In contrast, other goroutines (listener goroutines) will receive updates and print a message whenever a specific goal is reached. For example, one goroutine is waiting for a donation goal of $10, while another is waiting for a donation goal of $15.

The first naive solution is to use a mutex. The updater goroutine increments the balance every second. On the other side, the listener goroutines loop until their donation goal is reached:

type Donation struct { // ❶
    mu      sync.RWMutex
    balance int
}
donation := &Donation{}

// Listener goroutines
f := func(goal int) { // ❷
    donation.mu.RLock()
    for donation.balance < goal { // ❸
        donation.mu.RUnlock()
        donation.mu.RLock()
    }
    fmt.Printf("$%d goal reached\n", donation.balance)
    donation.mu.RUnlock()
}
go f(10)
go f(15)

// Updater goroutine
go func() {
    for { // ❹
        time.Sleep(time.Second)
        donation.mu.Lock()
        donation.balance++
        donation.mu.Unlock()
    }
}()

❶ Creates and instantiates a Donation struct containing the current balance and a mutex

❷ Creates a goal

❸ Checks whether the goal is reached

❹ Keeps incrementing the balance

We use a mutex to protect access to the shared donation.balance variable. If we run this example, it works as expected:

$10 goal reached
$15 goal reached

The main problem - and what makes this a poor implementation - is the busy loop. Each listener goroutine keeps looping until its donation goal is reached, which wastes a lot of CPU cycles and makes the CPU usage enormous. We need to find a better solution.

Let's take a step back. We need to find a way to signal from the updater goroutine whenever the balance is updated. If we think about signaling in Go, we think about channels. So let's try another version using a channel:

type Donation struct {
    balance int
    ch      chan int                        // ❶
}

donation := &Donation{ch: make(chan int)}

// Listener goroutines
f := func(goal int) {
    for balance := range donation.ch {      // ❷
        if balance >= goal {
            fmt.Printf("$%d goal reached\n", balance)
            return
        }
    }
}
go f(10)
go f(15)

// Updater goroutine
for {
    time.Sleep(time.Second)
    donation.balance++
    donation.ch <- donation.balance         // ❸
}

❶ Updates Donation so it contains a channel

❷ Receives updates from the channel

❸ Sends a message whenever the balance is updated

Each listener receives from a shared channel. Meanwhile, the updater goroutine sends a message whenever the balance is updated. However, if we try this solution, here is a possible output:

$11 goal reached
$15 goal reached

The first goroutine should be notified when the balance is $10 instead of $11. What happened?

A message sent to a channel can only be received by one goroutine. In our example, Figure 9.12 shows what might happen if the first goroutine receives from the channel before the second.

Figure 9.12 The first goroutine receives the $1 message, then the second goroutine receives the $2 message, then the first goroutine receives the $3 message, and so on.

The default distribution mode when multiple goroutines receive from a shared channel is round-robin. This can change if one goroutine isn't ready to receive a message (not in a waiting state on the channel); in that case, Go distributes the message to the next available goroutine.

Each message is received by a single goroutine. So in our example, the first goroutine didn't receive the $10 message, but the second one did. Only a channel close event can be broadcast to multiple goroutines, but here we don't want to close the channel, because then the updater goroutine couldn't send messages anymore.

There is another problem with using channels in this situation. The listener goroutines return whenever their donation goal is reached. Therefore, the updater goroutine has to know when all the listeners have stopped receiving messages from the channel. Otherwise, the channel will eventually become full and block the sender. A possible solution would be to add a sync.WaitGroup to the mix, but doing so would make the solution more complicated.

Ideally, we need to find a way to repeatedly broadcast notifications to multiple goroutines whenever the balance is updated. Fortunately, Go has a solution: sync.Cond. Let's first discuss the theory; then we'll see how to use this primitive to solve our problem.

According to the official documentation ( pkg.go.dev/sync),

Cond implements a condition variable, a rendezvous point for goroutines waiting for or announcing the occurrence of an event.

A condition variable is a container of threads (here, goroutines) waiting for a certain condition. In our example, the condition is a balance update. The updater goroutine broadcasts a notification whenever the balance is updated, and the listener goroutines wait until an update occurs. Furthermore, sync.Cond relies on a sync.Locker (a *sync.Mutex or *sync.RWMutex) to prevent data races. Here is a possible implementation:

type Donation struct {
    cond    *sync.Cond                    // ❶
    balance int
}

donation := &Donation{
    cond: sync.NewCond(&sync.Mutex{}),    // ❷
}

// Listener goroutines
f := func(goal int) {
    donation.cond.L.Lock()
    for donation.balance < goal {
        donation.cond.Wait()              // ❸
    }
    fmt.Printf("%d$ goal reached\n", donation.balance)
    donation.cond.L.Unlock()
}
go f(10)
go f(15)

// Updater goroutine
for {
    time.Sleep(time.Second)
    donation.cond.L.Lock()
    donation.balance++                    // ❹
    donation.cond.L.Unlock()
    donation.cond.Broadcast()             // ❺
}

❶ Adds a *sync.Cond

❷ The *sync.Cond relies on a mutex

❸ Waits for a condition (balance updated) within the lock/unlock section

❹ Increments the balance within the lock/unlock section

❺ Broadcasts the fact that a condition was met (balance updated)

First, we create a *sync.Cond using sync.NewCond and provide a *sync.Mutex. What about the listener and updater goroutines?

The listener goroutines loop until the donation balance reaches the goal. Within the loop, we use the Wait method, which blocks until the condition is met.

NOTE Let's make sure the term condition is well understood here. In this context, we are talking about the balance being updated, not the donation-goal condition. So it is a single condition variable shared by the two listener goroutines.

The call to Wait must happen within a critical section, which may sound odd. Won't the lock prevent other goroutines from waiting for the same condition? Actually, the implementation of Wait is the following:

  1. Unlock the mutex.

  2. Suspend the goroutine and wait for a notification.

  3. Lock the mutex when the notification arrives.

Therefore, the listener goroutines have two critical sections:

  • Accessing donation.balance in for donation.balance < goal

  • Accessing donation.balance in fmt.Printf

This way, all accesses to the shared donation.balance variable are protected.

Now, what about the updater goroutine? The balance update is done within a critical section to prevent data races. Then we call the Broadcast method, which wakes up all the goroutines waiting on the condition every time the balance is updated.

So if we run this example, it prints what we expect:

10$ goal reached
15$ goal reached

In our implementation, the condition variable is based on the balance being updated. So the listener goroutines are woken up every time a new donation is made, to check whether their donation goal is reached. This solution prevents us from having busy loops that burn CPU cycles on repeated checks.

Let's also note a possible downside when using sync.Cond. When we send a notification -- for example, to a chan struct{} -- even if there is no active receiver, the message is buffered, which guarantees that the notification will be received eventually. Using sync.Cond with the Broadcast method wakes all the goroutines currently waiting on the condition; if there are none, the notification will be missed. This is an essential principle that we have to keep in mind.

Signal() vs. Broadcast()

We can use Signal() instead of Broadcast() to wake a single goroutine. In terms of semantics, it is the same as sending a message to a chan struct{} in a non-blocking fashion:

ch := make(chan struct{})
select {
case ch <- struct{}{}:
default:
}

Signaling in Go can be implemented with channels. The only event multiple goroutines can catch is a channel closure, but that can happen only once. So if we repeatedly have to send notifications to multiple goroutines, sync.Cond is a solution. This primitive is based on condition variables, which act as containers of threads waiting for a certain condition. Using sync.Cond, we can broadcast signals that wake up all the goroutines waiting on a condition.

Let's extend our knowledge of concurrency primitives with the golang.org/x repository and the errgroup package.

9.13 #73: Not using errgroup

Regardless of the programming language, reinventing the wheel is rarely a good idea. It's also common for codebases to reimplement how to spin up multiple goroutines and aggregate the errors. But a package in the Go ecosystem is designed to support this frequent use case. Let's look at it and understand why it should be part of a Go developer's toolset.

golang.org/x is a repository that provides extensions to the standard library. The sync sub-repository contains a convenient package: errgroup.

Suppose we have to handle a function in which we receive as an argument some data that we use to call an external service. Due to constraints, we can't make a single call; instead, we make multiple calls, each with a different subset of the data. Furthermore, these calls are made in parallel (see Figure 9.13).

Figure 9.13 Each circle results in a parallel call.

In case of an error during a call, we want to return it. If there are multiple errors, we only want to return one of them. Let's write the skeleton of the implementation using only the standard concurrency primitives:

func handler(ctx context.Context, circles []Circle) ([]Result, error) {
    results := make([]Result, len(circles))
    wg := sync.WaitGroup{}                    // ❶
    wg.Add(len(results))

    for i, circle := range circles {
        i := i                                // ❷
        circle := circle                      // ❸

        go func() {                           // ❹
            defer wg.Done()                   // ❺
            result, err := foo(ctx, circle)
            if err != nil {
                // ?
            }
            results[i] = result               // ❻
        }()
    }

    wg.Wait()
    // ...
}

❶ Creates a wait group to wait for all the goroutines we spin up

❷ Creates a new i variable used in the goroutine's scope (see mistake #63, "Accidental use of goroutine and loop variables")

❸ The same applies to circle

❹ Triggers a goroutine per loop iteration

❺ Indicates when the goroutine is done

❻ Aggregates the results

We decided to use a sync.WaitGroup to wait for all the goroutines to finish and handle the aggregation in a slice. This is one way of doing it; another would be to send each partial result to a channel and aggregate them in another goroutine. If ordering is required, the main challenge would be reordering the incoming messages. Therefore, we decided to go with the simplest approach and a shared slice.

Note that this implementation is data-race-free because each goroutine writes to a specific index.

However, there is one crucial case we haven't tackled yet. What if foo (the call made in a new goroutine) returns an error? How should we handle it? There are various options, including these:

  • Just like the results slice, we could share a slice of errors among the goroutines. Each goroutine would write to this slice in case of an error. We would have to iterate over this slice in the parent goroutine to determine whether an error occurred (O(n) time complexity).

  • We can give goroutines access to an error variable through a shared mutex.

  • We can consider sharing an error channel, which the parent goroutine will receive and handle.

Regardless of the option chosen, it starts to make the solution quite complex. For that reason, the errgroup package was designed and developed.

It exports a single WithContext function that returns a *Group struct given a context. This struct provides synchronization, error propagation, and context cancellation for a group of goroutines and exports only two methods:

  • Go triggers a call in a new goroutine.

  • Wait blocks until all the goroutines have completed. It returns the first non-nil error, if any.

Let's rewrite the solution using errgroup. First, we need to fetch the errgroup package:

$ go get golang.org/x/sync/errgroup

The implementation is as follows:

func handler(ctx context.Context, circles []Circle) ([]Result, error) {
    results := make([]Result, len(circles))
    g, ctx := errgroup.WithContext(ctx)      // ❶

    for i, circle := range circles {
        i := i
        circle := circle
        g.Go(func() error {                  // ❷
            result, err := foo(ctx, circle)
            if err != nil {
                return err
            }
            results[i] = result
            return nil
        })
    }

    if err := g.Wait(); err != nil {         // ❸
        return nil, err
    }
    return results, nil
}

❶ Creates an *errgroup.Group given the parent context

❷ Calls Go to run the logic of handling the error and aggregating the result in a new goroutine

❸ Calls Wait to wait for all the goroutines

First, we create an *errgroup.Group by providing the parent context. In each iteration, we use g.Go to trigger a call in a new goroutine. This method takes a func() error as input, wrapping in a closure the call to foo and the handling of the result and the error. The main difference from our first implementation is that if we get an error, we return it from this closure. Then, g.Wait allows us to wait for all the goroutines to complete.

This solution is inherently simpler than the first one (which was partial, as we didn't handle the errors). We don't have to rely on extra concurrency primitives, and the errgroup.Group is sufficient to tackle our use case.

Another benefit that we haven't tackled yet is the shared context. Suppose we have to trigger three parallel calls:

  • The first returns an error within 1 millisecond.

  • The second and third calls return results or errors within 5 seconds.

We want to return an error, if any. Hence, there's no point in waiting until the second and third calls are complete. Using errgroup.WithContext creates a shared context that is used in all the parallel calls. Because the first call returns an error within 1 millisecond, it cancels the context and thus the other goroutines. So we won't have to wait 5 seconds to return an error. This is another benefit of using errgroup.

NOTE The process invoked by g.Go must be context-aware. Otherwise, canceling the context won't have any effect.
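To illustrate what context-aware means here, the sketch below shows one hypothetical way foo could honor the shared context by attaching it to the HTTP request it makes; the endpoint URL and the ID field on Circle are assumptions for the example, not part of the original listing:

// Hypothetical context-aware implementation of foo: the request carries ctx,
// so canceling the shared errgroup context aborts the in-flight call.
func foo(ctx context.Context, circle Circle) (Result, error) {
    req, err := http.NewRequestWithContext(ctx, http.MethodGet,
        "https://example.com/circles/"+circle.ID, nil) // assumed endpoint and field
    if err != nil {
        return Result{}, err
    }
    resp, err := http.DefaultClient.Do(req) // fails fast once ctx is canceled
    if err != nil {
        return Result{}, err
    }
    defer func() { _ = resp.Body.Close() }()

    var res Result
    err = json.NewDecoder(resp.Body).Decode(&res) // assumes Result is JSON-decodable
    return res, err
}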

In conclusion, when we have to trigger multiple goroutines and handle errors plus context propagation, it may be worth considering whether errgroup could be a solution. As we've seen, this package enables synchronization for a group of goroutines and provides an answer for dealing with errors and shared contexts.

The last section of this chapter discusses a common mistake Go developers make when copying sync types.

9.14 #74: Copying a sync type

The sync package provides basic synchronization primitives such as mutexes, condition variables, and wait groups. For all these types, there is a hard rule to follow: they should never be copied. Let's understand the rationale and the possible problems.

We will create a thread-safe data structure to store counters. It will contain a map[string]int representing the current value of each counter. We will also use a sync.Mutex because accesses have to be protected. Let's add an Increment method to increment a given counter name:

type Counter struct {
    mu       sync.Mutex
    counters map[string]int
}

func NewCounter() Counter {                    // ❶
    return Counter{counters: map[string]int{}}
}

func (c Counter) Increment(name string) {
    c.mu.Lock()                                // ❷
    defer c.mu.Unlock()
    c.counters[name]++
}

❶ Factory function

❷ Increment the counter in the critical section

The increment logic is done in a critical section: between c.mu.Lock() and c.mu.Unlock(). Let's try our approach by running the following example with the -race option, which spins up two goroutines and increments their respective counters:

counter := NewCounter()

go func() {
    counter.Increment("foo")
}()
go func() {
    counter.Increment("bar")
}()

If we run this example, it will cause a data race:

==================
WARNING: DATA RACE
...

The problem in our Counter implementation is that the mutex is copied. Because the receiver of Increment is a value, a copy of the Counter struct is performed whenever we call Increment, and that copy also includes the mutex. Therefore, the increment isn't done in a shared critical section.

sync types shouldn't be copied. This rule applies to the following types:

  • sync.Cond

  • sync.Map

  • sync.Mutex

  • sync.RWMutex

  • sync.Once

  • sync.Pool

  • sync.WaitGroup

Therefore, mutexes should not be copied. What are the alternatives?

The first is to modify the receiver type of the Increment method:

func (c *Counter) Increment(name string) {
    // Same code
}

Changing the receiver type avoids copying Counter when Increment is called. Therefore, the internal mutex isn't copied.

If we want to keep a value receiver, the second option is to change the type of the mu field in Counter to a pointer:

type Counter struct {
    mu       *sync.Mutex        // ❶
    counters map[string]int
}

func NewCounter() Counter {
    return Counter{
        mu:       &sync.Mutex{},      // ❷
        counters: map[string]int{},
    }
}

❶ Changes the type of mu

❷ Changes the way mu is initialized

If Increment has a value receiver, calling it still copies the Counter struct. However, because mu is now a pointer, only a pointer copy is performed, not an actual copy of the sync.Mutex. Therefore, this solution also prevents data races.

NOTE We also changed the way mu is initialized. Because mu is a pointer, if we omit it when creating a Counter, it will be initialized to the zero value of a pointer: nil. This would cause a goroutine panic when c.mu.Lock() is called.

We may face the issue of unintentionally copying a sync field in the following conditions:

  • Calling a method with a value receiver (as we have seen)

  • Calling a function with a sync argument

  • Calling a function with an argument that contains a sync field

In each case, we should remain very cautious. Also, let's note that some linters can catch this issue - for example, go vet:

$ go vet .
./main.go:19:9: Increment passes lock by value: Counter contains sync.Mutex

As a rule of thumb, whenever multiple goroutines have to access a common sync element, we must ensure that they all rely on the same instance. This rule applies to all the types defined in the sync package. Using pointers is a way to solve this problem: we can have either a pointer to the sync element or a pointer to a struct containing the sync element.

Summary

  • When propagating a context, it's important to understand the conditions under which the context can be canceled: for example, an HTTP handler canceling its context once the response has been sent.

  • Avoiding leaks means being mindful that whenever a goroutine is started, you should have a plan to stop it eventually.

  • To avoid bugs with goroutines and loop variables, create local variables or call functions instead of closures.

  • Understanding that with multiple channels, select chooses a case randomly if multiple options are possible prevents making wrong assumptions that can lead to subtle concurrency bugs.

  • Send notifications using the chan struct{} type.

  • Using nil channels should be part of your concurrency toolset because it allows you to remove cases from select statements.

  • Given a problem, carefully decide on the right channel type to use. Only unbuffered channels provide strong synchronization guarantees.

  • You should have a good reason to specify a channel size other than one for buffered channels.

  • Being aware that string formatting may lead to calling existing functions means watching out for possible deadlocks and other data races.

  • Calling append isn't always data-race-free; hence, it shouldn't be used concurrently on a shared slice.

  • Remembering that slices and maps are pointers can prevent common data races.

  • To use sync.WaitGroup accurately, call the Add method before spinning up goroutines.

  • You can send repeated notifications to multiple goroutines with sync.Cond.

  • You can synchronize a group of goroutines and handle errors and contexts with the errgroup package.

  • sync types shouldn't be copied.

10. Standard library

This chapter covers

  • Providing the correct duration
  • Understanding potential memory leaks when using time.After
  • Avoiding common mistakes in JSON handling and SQL
  • Closing transient resources
  • Remembering the return statement in HTTP handlers
  • Why production-grade applications shouldn't use the default HTTP client and server

The Go standard library is a set of core packages that enhance and extend the language. For example, Go developers can write HTTP clients or servers, process JSON data, or interact with SQL databases. All of these features are provided by the standard library. However, it is very easy to misuse the standard library, or we may have limited knowledge of its behavior, which can lead to bugs and writing applications that should not be considered production-grade. Let's look at some of the most common mistakes when using the standard library.

10.1 #75: Wrong duration provided

The standard library provides common functions and methods that accept a time.Duration. However, because time.Duration is an alias for the int64 type, newcomers to the language can get confused and provide a wrong duration. For example, developers with a Java or JavaScript background are used to passing raw numeric types.

To illustrate this common mistake, let's create a new time.Ticker that will deliver a tick every second:

ticker := time.NewTicker(1000)
for {
    select {
    case <-ticker.C:
        // Do something
    }
}

If we run this code, we'll notice that ticks aren't delivered every second; they're delivered every microsecond.

Because time.Duration is based on the int64 type, the previous code is valid since 1000 is a valid int64. But time.Duration represents the elapsed time between two instants in nanoseconds. Therefore, we provided NewTicker with a duration of 1,000 nanoseconds = 1 microsecond.

This mistake happens frequently. Indeed, standard libraries in languages such as Java and JavaScript sometimes ask developers to provide durations in milliseconds.

Furthermore, if we purposely wanted to create a time.Ticker with an interval of 1 microsecond, we shouldn't pass a raw int64 directly. Instead, we should always use the time.Duration API to avoid possible confusion:

ticker = time.NewTicker(time.Microsecond)
// Or
ticker = time.NewTicker(1000 * time.Nanosecond)

This isn't the most complex mistake in this book, but developers with a background in other languages can easily fall into the trap of believing that the functions and methods in the time package expect milliseconds. We must remember to use the time.Duration API and provide an int64 together with a time unit.
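As a small sketch of that advice (the intervalMs variable standing for a raw value read from a configuration is an assumption for the example), converting a numeric value into a time.Duration explicitly avoids any ambiguity:

intervalMs := int64(1000) // e.g., a raw value coming from a configuration file

// Explicit conversion: 1,000 milliseconds = 1 second
ticker := time.NewTicker(time.Duration(intervalMs) * time.Millisecond)
defer ticker.Stop()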

Now, let's discuss a common mistake when working with the time package and time.After.

10.2 #76: time.After and memory leaks

time.After(time.Duration) is a convenient function that returns a channel and waits for a provided duration to elapse before sending a message to this channel. Usually, it's used in concurrent code; otherwise, if we want to sleep for a given duration, we can use time.Sleep(time.Duration). The advantage of time.After is that it can be used to implement scenarios such as "If I don't receive any message in this channel for 5 seconds, I will ...". But codebases often include calls to time.After in a loop, which, as we describe in this section, may be a root cause of memory leaks.

Let's consider the following example. We will implement a function that repeatedly consumes messages from a channel. We also want to log a warning if we haven't received any messages for more than 1 hour. Here is a possible implementation:

func consumer(ch <-chan Event) {
    for {
        select {
        case event := <-ch:               // ❶
            handle(event)
        case <-time.After(time.Hour):     // ❷
            log.Println("warning: no messages received")
        }
    }
}

❶ Handles the event

❷ Logs a warning after one hour without receiving a message

Here, we use select in two cases: receiving a message from ch, or no message received after 1 hour (time.After is evaluated during each iteration, so the timeout is reset every time). At first glance, this code doesn't look bad. However, it can lead to memory usage issues.

As we said, time.After returns a channel. We might expect this channel to be closed during each loop iteration, but this isn't the case. The resources created by time.After (including the channel) are released only once the timeout expires, and they use memory until then. How much memory? In Go 1.15, about 200 bytes of memory are used per call to time.After. If we receive a significant volume of messages, such as 5 million per hour, our application will consume 1 GB of memory just to store the time.After resources.

Can we fix this issue by closing the channel programmatically during each iteration? No. The returned channel is a <-chan time.Time, meaning it is a receive-only channel that can't be closed.

We have several options to fix our example. The first is to use a context instead of time.After:

func consumer(ch <-chan Event) {
    for {                                                                    // ❶
        ctx, cancel := context.WithTimeout(context.Background(), time.Hour) // ❷
        select {
        case event := <-ch:
            cancel()                                                        // ❸
            handle(event)
        case <-ctx.Done():                                                  // ❹
            log.Println("warning: no messages received")
        }
    }
}

❶ Main loop

❷ Creates a context with a timeout

❸ Cancels the context if we receive a message

❹ Context cancellation

The downside of this approach is that we have to re-create a context during every loop iteration. Creating a context isn't the most lightweight operation in Go: for example, it requires creating a channel. Can we do better?

The second option comes from the time package: time.NewTimer. This function creates a time.Timer struct that exports the following:

  • A C field, which is the internal timer channel

  • A Reset(time.Duration) method to reset the duration

  • A Stop() method to stop the timer

time.After internals

Let's note that time.After also relies on time.Timer. However, it only returns the C field, so we don't have access to the Reset method:

package time

func After(d Duration) <-chan Time {
    return NewTimer(d).C                // ❶
}

❶ Creates a new timer and returns the channel field

Let's implement a new version using time.NewTimer:

func consumer(ch <-chan Event) {
    timerDuration := 1 * time.Hour
    timer := time.NewTimer(timerDuration)     // ❶

    for {                                     // ❷
        timer.Reset(timerDuration)            // ❸
        select {
        case event := <-ch:
            handle(event)
        case <-timer.C:                       // ❹
            log.Println("warning: no messages received")
        }
    }
}

❶ Creates a new timer

❷ Main loop

❸ Resets the duration

❹ Timer expiration

In this implementation, we keep one recurring action in each loop iteration: calling the Reset method. However, calling Reset is much simpler than creating a new context every time. It's faster and puts less pressure on the garbage collector because it doesn't require any new heap allocation. Therefore, using time.Timer is the best possible solution to our initial problem.

NOTE For the sake of simplicity, in this example the previous goroutine isn't stopped. As we mentioned in mistake #62, "Starting a goroutine without knowing when to stop it," this isn't a best practice. In production-grade code, we should find an exit condition, such as a context that can be canceled. In that case, we should also remember to stop the time.Timer using defer timer.Stop(), for example, right after timer is created, as sketched below.
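The sketch below shows what such a production-oriented variant could look like (the context-based exit condition is an assumption for illustration, not the chapter's listing): the consumer stops when its context is canceled and releases the timer with defer timer.Stop():

func consumer(ctx context.Context, ch <-chan Event) {
    timerDuration := 1 * time.Hour
    timer := time.NewTimer(timerDuration)
    defer timer.Stop() // Releases the timer's resources when the consumer exits

    for {
        timer.Reset(timerDuration)
        select {
        case <-ctx.Done(): // Exit condition: the caller canceled the context
            return
        case event := <-ch:
            handle(event)
        case <-timer.C:
            log.Println("warning: no messages received")
        }
    }
}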

Using time.After in a loop isn't the only case that can lead to a peak in memory consumption. The problem relates to code that is called repeatedly. A loop is one case, but using time.After in an HTTP handler function can lead to the same issue because the function is called multiple times.

In general, we should be cautious when using time.After. Remember that the resources created are released only when the timer expires. When the call to time.After is repeated (for example, in a loop, a Kafka consumer function, or an HTTP handler), it may lead to a peak in memory consumption. In that case, we should favor time.NewTimer.

The next section discusses the most common mistakes related to JSON handling.

10.3 #77: Common JSON-handling mistakes

Go has excellent support for JSON with the encoding/json package. This section covers three common mistakes related to encoding (marshaling) and decoding (unmarshaling) JSON data.

10.3.1 Unexpected behavior due to type embedding

In Mistake #10 "Not realizing possible problems with type embedding", we discuss issues related to type embedding. In the context of JSON processing, let's discuss another potential effect of type embedding, which can lead to unexpected marshaling/unmarshalling results.

In the example below, we create an Event struct containing an ID and an embedded timestamp:

type Event struct {
    ID int
    time.Time       // ❶
}

❶ Embedded fields

Because time.Time is embedded, in the way we described previously, we can access the time.Time methods directly at the Event level: for example, event.Second().

What are the possible impacts of embedded fields on JSON marshaling? Let's find out in the following example. We'll instantiate an Event and marshal it into JSON. What should be the output of this code?

event := Event{
    ID:   1234,
    Time: time.Now(),       // ❶
}

b, err := json.Marshal(event)
if err != nil {
    return err
}

fmt.Println(string(b))

❶ The name of the anonymous field during a struct instantiation is the name of the struct type (Time).

We might expect this code to print something like this:

{"ID":1234,"Time":"2021-05-18T21:15:08.381652+02:00"}

Instead, it prints the following:

"2021-05-18T21:15:08.381652+02:00"

How can we interpret this output? What happened to the ID field and the 1234 value? Because this field is exported, it should have been marshaled. To understand this problem, we have to highlight two points.

First, as discussed in mistake #10, if an embedded field type implements an interface, the struct containing the embedded field will also implement that interface. Second, we can change the default marshaling behavior by making a type implement the json.Marshaler interface. This interface contains a single MarshalJSON function:

type Marshaler interface {
    MarshalJSON() ([]byte, error)
}

Here is an example of custom marshaling:

type foo struct{}                             // ❶

func (foo) MarshalJSON() ([]byte, error) {    // ❷
    return []byte(`"foo"`), nil               // ❸
}

func main() {
    b, err := json.Marshal(foo{})             // ❹
    if err != nil {
        panic(err)
    }
    fmt.Println(string(b))
}

❶ Defines the struct

❷ Implements the MarshalJSON method

❸ Returns a static response

❹ json.Marshal then relies on the custom MarshalJSON implementation.

Because we have changed the default JSON marshaling behavior by implementing the Marshaler interface, this code prints "foo".

Having clarified these two points, let's get back to the initial problem with the Event struct:

type Event struct {
    ID int
    time.Time
}

We must know that time.Time implements the json.Marshaler interface. Because time.Time is an embedded field of Event, the compiler promotes its methods. Therefore, Event also implements json.Marshaler.

Consequently, passing an Event to json.Marshal uses the marshaling behavior provided by time.Time instead of the default behavior. This is why marshaling an Event leads to the ID field being ignored.

NOTE We would also face the reverse problem if we unmarshaled an Event using json.Unmarshal.

To fix this issue, there are two main possibilities. First, we can add a name so the time.Time field is no longer embedded:

type Event struct {
    ID   int
    Time time.Time      // ❶
}

❶ time.Time is no longer embedded.

This way, if we marshal a version of this Event struct, it prints something like this:

{"ID":1234,"Time":"2021-05-18T21:15:08.381652+02:00"}

If we want or have to keep the time.Time field embedded, the other option is to make Event implement the json.Marshaler interface:

func (e Event) MarshalJSON() ([]byte, error) {
    return json.Marshal(
        struct {        // ❶
            ID   int
            Time time.Time
        }{
            ID:   e.ID,
            Time: e.Time,
        },
    )
}

❶ An anonymous structure is created

In this solution, we implement a custom MarshalJSON method and define an anonymous struct that mirrors the Event struct. But this solution is more cumbersome and requires ensuring that the MarshalJSON method is always kept up to date with the Event struct.

We should be careful with embedded fields. While promoting the fields and methods of an embedded field type can sometimes be convenient, it can also lead to subtle bugs because it can make the parent struct implement an interface without a clear signal. Again, when using embedded fields, we should clearly understand the possible side effects.

In the next section, we see another common JSON mistake related to the use of time.Time.

10.3.2 JSON and monotonic clocks

When marshaling or unmarshaling a struct containing a time.Time type, we can sometimes face unexpected comparison errors. It's helpful to examine time.Time to refine our assumptions and prevent possible mistakes.

An operating system handles two different clock types: wall clocks and monotonic clocks. This section looks first at these clock types and then at a possible impact when working with JSON and time.Time.

Wall clocks are used to determine the current time of day. This clock is subject to change. For example, if a clock is synchronized using the Network Time Protocol (NTP), it can jump backwards or forwards in time. We should not use wall clock to measure duration because we may face strange behavior like negative duration. This is why the operating system provides a second type of clock: the monotonic clock. A monotonic clock guarantees that time always moves forward, unaffected by time jumps. It will be affected by frequency adjustments (for example, if the server detects that the local quartz clock is moving at a different speed than the NTP server), but not by time jumps.

In the following example, we consider an Event struct containing a single time.Time field (not embedded):

type Event struct {
    Time time.Time
}

We instantiate an Event, marshal it into JSON, and unmarshal it into another struct. Then we compare both structs. Let's find out if the marshaling/unmarshaling process is always symmetric:

t := time.Now()                    // ❶
event1 := Event{                   // ❷
    Time: t,
}

b, err := json.Marshal(event1)     // ❸
if err != nil {
    return err
}

var event2 Event
err = json.Unmarshal(b, &event2)   // ❹
if err != nil {
    return err
}

fmt.Println(event1 == event2)

❶ Gets the current local time

❷ Instantiates an Event struct

❸ Marshals into JSON

❹ Unmarshals the JSON

What should be the output of this code? It prints false, not true. How can we explain that?

First, let's print the contents of event1 and event2:

fmt.Println(event1.Time)
fmt.Println(event2.Time)
2021-01-10 17:13:08.852061 +0100 CET m=+0.000338660
2021-01-10 17:13:08.852061 +0100 CET

The code prints different contents for event1 and event2. They are the same except for the m=+0.000338660 part. What does it mean?

In Go, instead of splitting the two clocks into two different APIs, time.Time may contain both a wall clock and a monotonic time. When we get the local time using time.Now(), it returns a time.Time with both times:

2021-01-10 17:13:08.852061 +0100 CET m=+0.000338660
------------------------------------ --------------
             Wall time               Monotonic time

Conversely, when we unmarshal the JSON, the time.Time field doesn't contain the monotonic time - only the wall time. So when we compare the structs, the result is false because of the monotonic time difference; this is also why we see a difference when printing both structs. How can we fix this problem? There are two main options.

When we use the == operator to compare two time.Time fields, it compares all the struct fields, including the monotonic part. To avoid this, we can use the Equal method instead:

fmt.Println(event1.Time.Equal(event2.Time))
true

The Equal method doesn't consider monotonic time; therefore, this code prints true. But in this case, we're only comparing the time.Time fields, not the parent Event structs.

The second option is to keep comparing the two structs with == but strip away the monotonic time using the Truncate method. This method returns the result of rounding the time.Time value down to a multiple of the given duration. We can use it by providing a zero duration, like so:

t := time.Now()
event1 := Event{
    Time: t.Truncate(0),             // ❶
}

b, err := json.Marshal(event1)
if err != nil {
    return err
}

var event2 Event
err = json.Unmarshal(b, &event2)
if err != nil {
    return err
}

fmt.Println(event1 == event2)        // ❷

❶ Strips away the monotonic time

❷ Performs the comparison using the == operator

In this version, the two time.Time fields are equal. Hence, this code prints true.

time.Time and location

Let's also note that each time.Time is associated with a time.Location that defines its time zone. For example:

t := time.Now() // 2021-01-10 17:13:08.852061 +0100 CET

Here, the location is set to CET because I used time.Now(), which returns my current local time. The JSON marshaling result depends on the location. To prevent this, we can stick to a specific location:

location, err := time.LoadLocation("America/New_York")    // ❶
if err != nil {
    return err
}
t := time.Now().In(location) // 2021-05-18 22:47:04.155755 -0500 EST

❶ Acquired "America/New_York"current position

Alternatively, we can get the current time in UTC:

t := time.Now().UTC() // 2021-05-18 22:47:04.155755 +0000 UTC

In conclusion, the marshaling/unmarshaling process isn't always symmetric, and we faced this case with a struct containing a time.Time. We should keep this principle in mind so that we don't, for example, write erroneous tests.

10.3.3 Maps of any

When unmarshaling data, we can provide a map instead of a struct. The rationale is that when the keys and values are uncertain, passing a map gives us more flexibility than a static struct. However, there is a rule to bear in mind to avoid wrong assumptions and possible goroutine panics.

Let's write an example that unmarshals messages into a map:

b := getMessage()
var m map[string]any
err := json.Unmarshal(b, &m)    // ❶
if err != nil {
    return err
}

❶ Provides a map pointer

Let's provide the following JSON for the preceding code:

{
    "id": 32,
    "name": "foo"
}

Since we use a generic map[string]any, it automatically parses all the different fields:

map[id:32 name:foo]

However, there is an important gotcha to remember when using a map of any: any numeric value, whether or not it contains a decimal, is converted to the float64 type. We can observe this by printing the type of m["id"]:

fmt.Printf("%T\n", m["id"])
float64

We should make sure we don't make the wrong assumption and expect numeric values without decimals to be converted into integers by default. Making incorrect assumptions about type conversion could, for example, lead to a goroutine panic.
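As a minimal sketch of how to avoid that assumption (it reuses the m map from above and assumes we are inside a function returning an error), we can assert the float64 explicitly and convert it:

rawID, ok := m["id"]
if !ok {
    return errors.New("missing id")
}
f, ok := rawID.(float64) // JSON numbers land in the map as float64, never int
if !ok {
    return fmt.Errorf("unexpected type %T for id", rawID)
}
id := int(f) // Explicit conversion once the float64 assertion succeeded
fmt.Println(id)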

The next section discusses the most common mistakes when writing applications that interact with SQL databases.

10.4 #78: Common SQL errors

The database/sql package provides a generic interface around SQL (or SQL-like) databases. It's also fairly common to see some patterns or mistakes while using this package. Let's delve into five common mistakes.

10.4.1 Forgetting that sql.Open doesn't necessarily establish connections to a database

When using sql.Open, one common misconception is expecting this function to establish connections to a database:

db, err := sql.Open("mysql", dsn)
if err != nil {
    
    
    return err
}

But that's not necessarily the case. According to the documentation (pkg.go.dev/database/sql),

Open may just validate its arguments without creating a connection to the database.

Actually, the behavior depends on the SQL driver used. For some drivers, sql.Open doesn't establish a connection: it's only a preparation for later use (for example, with db.Query). Therefore, the first connection to the database may be established lazily.

Why do we need to know about this behavior? For example, in some cases, we want the service to be ready only after we know that all the dependencies are correctly set up and reachable. If we don't know this, the service may accept traffic despite an erroneous configuration.

If we want to ensure that the function that uses sql.Open also guarantees that the underlying database is reachable, we should use the Ping method:

db, err := sql.Open("mysql", dsn)
if err != nil {
    
    
    return err
}
if err := db.Ping(); err != nil {
    
         // ❶
    return err
}

❶ Calls the Ping method following sql.Open

Ping forces the code to establish a connection, which ensures that the data source name is valid and the database is reachable. Note that an alternative to Ping is PingContext, which asks for an additional context conveying when the ping should be canceled or time out.
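As a hedged sketch of the PingContext variant (the "mysql" driver, the dsn variable, and the 5-second timeout are assumptions carried over for illustration), bounding the check could look like this:

db, err := sql.Open("mysql", dsn)
if err != nil {
    return err
}

// Bound the connectivity check so a misconfigured database doesn't block startup forever.
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()

if err := db.PingContext(ctx); err != nil {
    return fmt.Errorf("database not reachable: %w", err)
}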

Counterintuitive as it may be, let's remember that sql.Open doesn't necessarily establish a connection, and the first connection can be opened lazily. If we want to test our configuration and make sure the database is reachable, we should follow sql.Open with a call to the Ping or PingContext method.

10.4.2 Forgetting about connection pooling

Just as default HTTP clients and servers provide default behaviors that may not be effective in production (see mistake #81, "Using the default HTTP client and server"), it's essential to understand how database connections are handled in Go. sql.Open returns an *sql.DB struct. This struct doesn't represent a single database connection; instead, it represents a pool of connections. This is worth noting so we're not tempted to implement it manually. A connection in the pool can have two states:

  • Already in use (for example, by another goroutine that triggered a query)

  • Idle (already created but not in use for the moment)

It's also important to remember that creating a pool leads to four available configuration parameters that we may want to override. Each of these parameters is an exported method of *sql.DB:

  • SetMaxOpenConns- The maximum number of open connections to the database (default: unlimited)

  • SetMaxIdleConns- maximum number of idle connections (default: 2)

  • SetConnMaxIdleTime- the maximum amount of time a connection can be idle before being closed (default: unlimited)

  • SetConnMaxLifetime- the maximum amount of time a connection can remain open before being closed (default: unlimited)

Figure 10.1 shows an example with up to five connections. It has four ongoing connections: three idle and one in use. Therefore, there is still a slot available for an additional connection. If a new query comes in, it will pick an idle connection (if still available). If there are no more idle connections, the pool will create a new connection if an additional slot is available; otherwise, it will wait until a connection becomes available.

Figure 10.1 A connection pool with five connections

So, why do we need to adjust these configuration parameters?

  • Setting SetMaxOpenConns is important for production-grade applications. Because the default value is unlimited, we should set it to make sure it fits what the underlying database can handle.

  • The value of SetMaxIdleConns (default: 2) should be increased if our application generates a significant number of concurrent requests. Otherwise, the application may experience frequent reconnects.

  • Setting SetConnMaxIdleTime is important if our application may face a burst of requests. When the application returns to a more peaceful state, we want to make sure the connections created are eventually released.

  • Setting SetConnMaxLifetime can be helpful if, for example, we connect to a load-balanced database server. In that case, we want to make sure our application never uses a connection for too long.

For production-grade applications, we must consider these four parameters. We can also use multiple connection pools if an application faces different use cases.
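As a sketch of overriding these four parameters (the specific values and the "mysql" driver are illustrative assumptions, not recommendations), the configuration could look like this:

db, err := sql.Open("mysql", dsn)
if err != nil {
    return err
}

db.SetMaxOpenConns(100)                 // Caps open connections to what the database can handle
db.SetMaxIdleConns(10)                  // Keeps more idle connections to avoid frequent reconnects
db.SetConnMaxIdleTime(5 * time.Minute)  // Releases connections that stay idle after a burst
db.SetConnMaxLifetime(30 * time.Minute) // Recycles connections periodically (e.g., behind a load balancer)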

10.4.3 Not using prepared statements

A prepared statement is a feature implemented by many SQL databases to execute a repeated SQL statement. Internally, the SQL statement is precompiled and separated from the data provided. There are two main benefits:

  • Efficiency - statements do not need to be recompiled (compilation is parsing + optimization + translation).

  • Security - This approach reduces the risk of SQL injection attacks.

Therefore, if a statement is repeated, we should use prepared statements. We should also use prepared statements in untrusted contexts (such as exposing an endpoint on the internet where a request is mapped to a SQL statement).

To use prepared statements, instead of calling the Query method of *sql.DB, we call Prepare:

stmt, err := db.Prepare("SELECT * FROM ORDER WHERE ID = ?")   // ❶
if err != nil {
    return err
}
rows, err := stmt.Query(id)                                   // ❷
// ...

❶ Prepared statement

❷ Execute prepared queries

We prepare the statement and then execute it while providing the arguments. The first output of the Prepare method is an *sql.Stmt, which can be reused and run concurrently. When the statement is no longer needed, it must be closed using the Close() method.

NOTE The Prepare and Query methods provide alternatives that accept a context: PrepareContext and QueryContext.

For efficiency and safety, we need to remember to use prepared statements when it makes sense.

10.4.4 Mishandling null values

The next mistake is to mishandle null values with queries. Let's write an example where we retrieve the department and age of an employee:

rows, err := db.Query("SELECT DEP, AGE FROM EMP WHERE ID = ?", id)    // ❶
if err != nil {
    return err
}
// Defer closing rows

var (
    department string
    age        int
)
for rows.Next() {
    err := rows.Scan(&department, &age)                               // ❷
    if err != nil {
        return err
    }
    // ...
}

❶ Execute query

❷ scan each row

We use Query to execute the query. Then we iterate over the rows and use Scan to copy the columns into the values pointed to by the department and age pointers. If we run this example, we may get the following error when calling Scan:

2021/10/29 17:58:05 sql: Scan error on column index 0, name "DEPARTMENT":
converting NULL to string is unsupported

Here, the SQL driver raises an error because the department value is NULL. If a column can be nullable, there are two options to prevent Scan from returning an error.

The first approach is to declare department as a string pointer:

var (
    department *string      // ❶
    age        int
)
for rows.Next() {
    err := rows.Scan(&department, &age)
    // ...
}

❶ Changes the type from string to *string

We provide Scan with the address of the pointer, not the address of a string type directly. By doing so, if the value is NULL, department will be nil.

The other approach is to use one of the sql.NullXXX types, such as sql.NullString:

var (
    department sql.NullString    // ❶
    age        int
)
for rows.Next() {
    err := rows.Scan(&department, &age)
    // ...
}

❶ Changes the type to sql.NullString

sql.NullString is a wrapper on top of a string. It contains two exported fields: String contains the string value, and Valid conveys whether the string is not NULL. The following wrappers are available:

  • sql.NullString

  • sql.NullBool

  • sql.NullInt32

  • sql.NullFloat64

  • sql.NullTime

Both approaches work. Using an sql.NullXXX type expresses the intent more clearly, as core Go maintainer Russ Cox puts it (mng.bz/rJNX):

There's no effective difference. We thought people might want to use NullString because it is so common and perhaps expresses the intent more clearly than *string. But either will work.

Therefore, the best practice with a nullable column is to either handle it as a pointer or use an sql.NullXXX type.
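As a small usage sketch (not part of the original listing), checking Valid after Scan distinguishes a NULL department from an actual value:

var (
    department sql.NullString
    age        int
)
for rows.Next() {
    if err := rows.Scan(&department, &age); err != nil {
        return err
    }
    if department.Valid {
        fmt.Println("department:", department.String)
    } else {
        fmt.Println("department is NULL")
    }
}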

10.4.5 Not handling row iteration errors

Another common mistake is to miss possible errors while iterating over rows. Let's look at a function where error handling is misused:

func get(ctx context.Context, db *sql.DB, id string) (string, int, error) {
    rows, err := db.QueryContext(ctx,
        "SELECT DEP, AGE FROM EMP WHERE ID = ?", id)
    if err != nil {                                     // ❶
        return "", 0, err
    }
    defer func() {
        err := rows.Close()                             // ❷
        if err != nil {
            log.Printf("failed to close rows: %v\n", err)
        }
    }()

    var (
        department string
        age        int
    )
    for rows.Next() {
        err := rows.Scan(&department, &age)             // ❸
        if err != nil {
            return "", 0, err
        }
    }

    return department, age, nil
}

❶ Handles the error while executing the query

❷ Handles the error while closing the rows

❸ Handles the error while scanning a row

In this function, we handle three errors: while executing the query, while closing the rows, and while scanning a row. But this isn't enough. We must know that the for rows.Next() {} loop can break either when there are no more rows or when an error occurs while preparing the next row. Following a row iteration, we should call rows.Err to distinguish between the two cases:

func get(ctx context.Context, db *sql.DB, id string) (string, int, error) {
    // ...
    for rows.Next() {
        // ...
    }

    if err := rows.Err(); err != nil {    // ❶
        return "", 0, err
    }

    return department, age, nil
}

❶ Checks rows.Err to determine whether the previous loop stopped because of an error

Here's the best practice to remember: because rows.Next can stop either when we have iterated over all the rows or when an error occurs while preparing the next row, we should check rows.Err after the iteration.

Now let's discuss a common mistake: forgetting to close a transient resource.

10.5 #79: Not closing transient resources

Developers frequently work with transient (temporary) resources that must be closed at some point in the code: for example, to avoid leaks on disk or in memory. Structs can generally implement the io.Closer interface to convey that a transient resource has to be closed. Let's look at three common examples of what happens when resources aren't correctly closed and how to handle them properly.

10.5.1 HTTP body

First, let's discuss this problem within the HTTP context. We'll write a getBody method that makes an HTTP GET request and returns the HTTP response body. Here's a first implementation:

type handler struct {
    client http.Client
    url    string
}

func (h handler) getBody() (string, error) {
    resp, err := h.client.Get(h.url)           // ❶
    if err != nil {
        return "", err
    }

    body, err := io.ReadAll(resp.Body)         // ❷
    if err != nil {
        return "", err
    }

    return string(body), nil
}

❶ Makes an HTTP GET request

❷ Reads resp.Body and gets the body as a []byte

We use http.Get and parse the response using io.ReadAll. This method looks OK, and it correctly returns the HTTP response body. However, there's a resource leak. Let's understand where.

resp is of the *http.Response type. It contains a Body io.ReadCloser field (io.ReadCloser implements both io.Reader and io.Closer). This body must be closed if http.Get doesn't return an error; otherwise, it's a resource leak. In that case, our application will keep some memory allocated that is no longer needed and can't be reclaimed by the GC and, in the worst case, may prevent clients from reusing TCP connections.

The most convenient way to deal with body closure is to handle it as a defer statement this way:

defer func() {
    err := resp.Body.Close()
    if err != nil {
        log.Printf("failed to close response: %v\n", err)
    }
}()

In this implementation, we handle the body resource closure as a defer function that is executed once getBody returns.

NOTE On the server side, while implementing an HTTP handler, we aren't required to close the request body, because the server does this automatically.

We should also understand that a response body must be closed regardless of whether we read it. For example, if we are only interested in the HTTP status code and not in the body, it has to be closed no matter what, to avoid a leak:

func (h handler) getStatusCode(body io.Reader) (int, error) {
    resp, err := h.client.Post(h.url, "application/json", body)
    if err != nil {
        return 0, err
    }

    defer func() {                 // ❶
        err := resp.Body.Close()
        if err != nil {
            log.Printf("failed to close response: %v\n", err)
        }
    }()

    return resp.StatusCode, nil
}

❶ closes the response body even if we don't read it

This function closes the body even if we didn't read it.

Another important thing to remember is that when we close the body, the behavior is different depending on whether we have read it:

  • The default HTTP transport may close the connection if we close the body without reading it.

  • The default HTTP transport does not close the connection if we close the body after reading it; therefore, it can be reused.

So, if getStatusCode is called repeatedly and we want to use keep-alive connections, we should read the body even though we aren't interested in it:

func (h handler) getStatusCode(body io.Reader) (int, error) {
    resp, err := h.client.Post(h.url, "application/json", body)
    if err != nil {
        return 0, err
    }

    // Close response body

    _, _ = io.Copy(io.Discard, resp.Body)     // ❶

    return resp.StatusCode, nil
}

❶ Read the response body

In this example, we read the body to keep the connection alive. Note that instead of using io.ReadAll, we use io.Copy to io.Discard, an io.Writer implementation. This code reads the body but discards it without any copy, making it more efficient than io.ReadAll.

When to close the response body

Implementations frequently close the body if the response isn't nil, rather than checking whether the error is nil:

resp, err := http.Get(url)
if resp != nil {                // ❶
    defer resp.Body.Close()     // ❷
}

if err != nil {
    return "", err
}

❶ If the response isn't nil ...

❷ ... closes the response body as a deferred function

This implementation isn't necessary. It's based on the fact that in some conditions (such as a redirection failure), neither resp nor err will be nil. But according to the official Go documentation (pkg.go.dev/net/http),

On error, any Response can be ignored. A non-nil Response with a non-nil error only occurs when CheckRedirect fails, and even then the returned Response.Body is already closed.

Therefore, the if resp != nil {} check isn't necessary. We should stick with the initial solution that closes the body in a defer function only if there is no error.

Closing a resource to avoid leaks isn't only related to HTTP body management. In general, all structs implementing the io.Closer interface should be closed at some point. This interface contains a single Close method:

type Closer interface {
    Close() error
}

Now let's see the impacts with sql.Rows.

10.5.2 sql.Rows

sql.Rows is a struct used as a result of an SQL query. Because this struct implements io.Closer, it has to be closed. The following example omits closing the rows:

db, err := sql.Open("postgres", dataSourceName)
if err != nil {
    
    
    return err
}

rows, err := db.Query("SELECT * FROM CUSTOMERS")    // ❶
if err != nil {
    
    
    return err
}

// Use rows

return nil

❶ Execute SQL query

Forgetting to close the rows means a connection leak, which prevents the database connection from being put back into the pool.

We can handle the closure as a defer function following the if err != nil block:

// Open connection

rows, err := db.Query("SELECT * FROM CUSTOMERS")     // ❶
if err != nil {
    return err
}

defer func() {                                       // ❷
    if err := rows.Close(); err != nil {
        log.Printf("failed to close rows: %v\n", err)
    }
}()

// Use rows

❶ Execute SQL query

❷ Close a row

Following the Query call, if it doesn't return an error, we should eventually close rows to prevent a connection leak.

NOTE As mentioned in the previous section, the db variable (*sql.DB type) represents a pool of connections. It also implements the io.Closer interface. But as the documentation suggests, it is rare to close an sql.DB, as it is meant to be long-lived and shared among many goroutines.

Next, let's discuss closing resources when processing files.

10.5.3 os.File

os.File represents an open file descriptor. Like sql.Rows, it must be closed eventually:

f, err := os.OpenFile(filename, os.O_APPEND|os.O_WRONLY, os.ModeAppend)   // ❶
if err != nil {
    return err
}

defer func() {
    if err := f.Close(); err != nil {                                     // ❷
        log.Printf("failed to close file: %v\n", err)
    }
}()

❶ Open the file

❷ Close the file descriptor

In this example, we use defer to delay the call to the Close method. If we don't eventually close an os.File, it won't lead to a leak per se: the file will be closed automatically when os.File is garbage collected. However, it's better to call Close explicitly because we don't know when the next GC will be triggered (unless we run it manually).

Calling Close explicitly has another benefit: actively monitoring the error that is returned. This should be the case, for example, with writable files.

Writing to a file descriptor isn't a synchronous operation. For performance concerns, data is buffered. The BSD man page for close(2) mentions that a closure can lead to an error in a previously uncommitted write (still living in a buffer) encountered during an I/O error. For that reason, if we want to write to a file, we should propagate any error that occurs while closing it:

func writeToFile(filename string, content []byte) (err error) {
    // Open file

    defer func() {               // ❶
        closeErr := f.Close()
        if err == nil {
            err = closeErr
        }
    }()

    _, err = f.Write(content)
    return
}

❶ Returns the close error if the write succeeds

In this example, we use a named result parameter and set the error to the result of f.Close if the write succeeds. This way, clients will be aware if something goes wrong with this function and can react accordingly.

Also, note that successfully closing a writable os.File doesn't guarantee that the file will be written to disk. The write can still live in a buffer of the filesystem and not be flushed to disk. If durability is a crucial factor, we can use the Sync() method to commit a change. In that case, errors coming from Close can be safely ignored:

func writeToFile(filename string, content []byte) error {
    // Open file

    defer func() {
        _ = f.Close()       // ❶
    }()

    _, err = f.Write(content)
    if err != nil {
        return err
    }

    return f.Sync()         // ❷
}

❶ Ignores possible errors

❷ Commits the write to disk

This example is a synchronous write function. It guarantees that the content is written to disk before returning. But its downside is an impact on performance.

To wrap up this section, we've seen how important it is to close ephemeral resources and thus avoid leaks. Ephemeral resources must be closed at the right time and in specific situations. It's not always clear up front what has to be closed. We can only acquire this information by carefully reading the API documentation and/or through experience. But we should remember that if a struct implements the io.Closer interface, we must eventually call the Close method. Last but not least, it's essential to understand what to do if a close fails: is logging a message enough, or should we also propagate it? The appropriate action depends on the implementation, as shown by the three examples in this section.

Now let's switch to a common mistake related to HTTP handling: forgetting return statements.

10.6 #80: Forgetting the return statement after replying to an HTTP request

When writing an HTTP handler, it's easy to forget the return statement after responding to an HTTP request. This can lead to an odd situation where we should have stopped the handler after an error, but we didn't.

We can observe this happening in the following example:

func handler(w http.ResponseWriter, req *http.Request) {
    
    
    err := foo(req)
    if err != nil {
    
    
        http.Error(w, "foo", http.StatusInternalServerError)    // ❶
    }

    // ...
}

❶ Handling Errors

If foo returns an error, we handle it with http.Error, which replies to the request with the foo error message and a 500 Internal Server Error. The problem with this code is that if we enter the if err != nil branch, the application keeps executing, because http.Error doesn't stop the handler's execution.

What is the real impact of this error? First, let's discuss it at the HTTP level. For example, suppose we complete the previous HTTP handler by adding a step to write a successful HTTP response body and status code:

func handler(w http.ResponseWriter, req *http.Request) {
    
    
    err := foo(req)
    if err != nil {
    
    
        http.Error(w, "foo", http.StatusInternalServerError)
    }

    _, _ = w.Write([]byte("all good"))
    w.WriteHeader(http.StatusCreated)
}

In the case where err != nil, the HTTP response would be as follows:

foo
all good

The response contains error and success messages.

The response returns only the first HTTP status code written: 500 in the previous example. However, Go also logs a warning:

2021/10/29 16:45:33 http: superfluous response.WriteHeader call
from main.handler (main.go:20)

This warning means that we tried to write the status code more than once, and doing so is redundant.

As far as execution is concerned, the main impact is to continue the execution of a function that should have been stopped. For example, if foo also returned a pointer along with the error, continuing execution would mean using that pointer, perhaps leading to a nil pointer dereference (and hence a goroutine panic).

The fix for this mistake is to keep in mind that http.Error doesn't stop a handler and to add the return statement after it:

func handler(w http.ResponseWriter, req *http.Request) {
    
    
    err := foo(req)
    if err != nil {
    
    
        http.Error(w, "foo", http.StatusInternalServerError)
        return    // ❶
    }

    // ...
}

❶ Added return statement

Thanks to the return statement, the function stops its execution if we end up in the if err != nil branch.

This mistake is probably not the most complex of this book. Yet, it's so easy to forget about it that it happens fairly frequently. We always need to remember that http.Error doesn't stop a handler's execution; the return has to be added manually. Such an issue can and should be caught during testing if we have decent coverage.
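As a side note, one pattern that makes this mistake harder to commit (a sketch of a common convention, not something prescribed by this section) is to have handlers return an error and centralize the error response in a small wrapper:

// handlerFunc is a hypothetical signature: the handler returns an error
// instead of writing the error response itself.
type handlerFunc func(w http.ResponseWriter, r *http.Request) error

// wrap adapts a handlerFunc to a standard http.HandlerFunc and writes the
// error response in a single place, so there is no return statement to forget.
func wrap(h handlerFunc) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        if err := h(w, r); err != nil {
            http.Error(w, err.Error(), http.StatusInternalServerError)
        }
    }
}

func handler(w http.ResponseWriter, r *http.Request) error {
    if err := foo(r); err != nil {
        return err // the wrapper replies with the 500
    }
    // ...
    return nil
}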

The final section of this chapter continues our discussion of HTTP. We will see why production-grade applications shouldn't rely on the default HTTP client and server implementations.

10.7 #81: Using the default HTTP client and server

The http package provides HTTP client and server implementations. However, developers easily make a common mistake: relying on the default implementations in the context of an application that is eventually deployed to production. Let's look at the problems and how to overcome them.

10.7.1 HTTP client

Let's define what we mean by the default client. We'll use a GET request as an example. We can use the zero value of the http.Client struct like this:

client := &http.Client{
    
    }
resp, err := client.Get("https://golang.org/")

Or we can use the http.Get function:

resp, err := http.Get("https://golang.org/")

In the end, both approaches do the same thing. The http.Get function uses http.DefaultClient, which is also based on the zero value of http.Client:

// DefaultClient is the default Client and is used by Get, Head, and Post.
var DefaultClient = &Client{
    
    }

So, what's the problem with using the default HTTP client?

First, the default client doesn't specify any timeout. This absence of timeout is not something we want for production-grade systems: it can lead to many issues, such as never-ending requests that could exhaust system resources.

Before delving into the timeouts available while making a request, let's review the five steps involved in an HTTP request:

  1. Establish the TCP connection.

  2. TLS handshake (if enabled).

  3. Send the request.

  4. Read the response headers.

  5. Read the response body.

Figure 10.2 shows how these steps relate to the main client timeouts.

Figure 10.2 The five steps during an HTTP request, and the related timeouts

The four main timeouts are the following:

  • net.Dialer.Timeout: specifies the maximum amount of time a dial waits for a connection to complete.

  • http.Transport.TLSHandshakeTimeout: specifies the maximum amount of time to wait for the TLS handshake.

  • http.Transport.ResponseHeaderTimeout: specifies the amount of time to wait for a server's response headers.

  • http.Client.Timeout: specifies the time limit for a request. It includes all the steps, from step 1 (dialing) to step 5 (reading the response body).

HTTP client timeout

While specifying http.Client.Timeout, you may face the following error:

net/http: request canceled (Client.Timeout exceeded while awaiting 
headers)

This error means that the endpoint failed to respond on time. We get this error about headers because reading them is the first step while waiting for a response.

Here's an example of an HTTP client that overrides these timeouts:

client := &http.Client{
    
    
    Timeout: 5 * time.Second,                  // ❶
    Transport: &http.Transport{
    
    
        DialContext: (&net.Dialer{
    
    
            Timeout: time.Second,              // ❷
        }).DialContext,
        TLSHandshakeTimeout:   time.Second,    // ❸
        ResponseHeaderTimeout: time.Second,    // ❹
    },
}

❶ Global request timeout

❷ Dial timeout

❸ TLS handshake timeout

❹ Response header timeout

We create a client with a 1-second timeout for dialing, the TLS handshake, and reading the response headers. Meanwhile, each request has a global 5-second timeout.

The second aspect to keep in mind about the default HTTP client is how connections are handled. By default, the HTTP client uses connection pooling. The default client reuses connections (this can be disabled by setting http.Transport.DisableKeepAlives to true). There's an extra timeout that specifies how long an idle connection stays in the pool: http.Transport.IdleConnTimeout. The default value is 90 seconds, which means the connection can be reused by other requests during this time. After that, if the connection hasn't been reused, it is closed.

To configure the number of connections in the pool, we must override http.Transport.MaxIdleConns. This value is set to 100 by default. But there's something important to note: the per-host http.Transport.MaxIdleConnsPerHost limit, which is set to 2 by default. For example, if we trigger 100 requests to the same host, only 2 connections will remain in the connection pool afterward. Hence, if we trigger 100 requests again, we will have to reopen at least 98 connections. This configuration can also impact the average latency if we have to handle a significant number of parallel requests to the same host.
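As an illustration (the values are arbitrary, not recommendations), overriding these pooling parameters might look like the following, starting from a clone of the default transport:

t := http.DefaultTransport.(*http.Transport).Clone()
t.MaxIdleConns = 100
t.MaxIdleConnsPerHost = 100          // raise the per-host limit for heavy traffic to one host
t.IdleConnTimeout = 90 * time.Second // how long an idle connection stays in the pool

client := &http.Client{
    Timeout:   5 * time.Second,
    Transport: t,
}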

For production-grade systems, we probably want to override the default timeouts. Also, tweaking the parameters related to connection pooling can have a significant impact on latency.

10.7.2 HTTP server

We should also be careful while implementing an HTTP server. Again, a default server can be created using the zero value of http.Server:

server := &http.Server{
    
    }
server.Serve(listener)

Or we can use a function such as http.Serve, http.ListenAndServe, or http.ListenAndServeTLS, which also rely on the default http.Server.

Once a connection is accepted, an HTTP response is divided into five steps:

  1. Wait for the client to send the request.

  2. TLS handshake (if enabled).

  3. Read the request headers.

  4. Read the request body.

  5. Write the response.

Note that the TLS handshake doesn't have to be repeated for an already-established connection.

Figure 10.3 shows how these steps relate to the main server timeouts. The three main timeouts are the following:

  • http.Server.ReadHeaderTimeout: a field that specifies the maximum amount of time to read the request headers

  • http.Server.ReadTimeout: a field that specifies the maximum amount of time to read the entire request

  • http.TimeoutHandler: a wrapper function that specifies the maximum amount of time for a handler to complete

Figure 10.3 The five steps of an HTTP response, and the related timeouts

The last parameter isn't a server parameter but a wrapper on top of a handler to limit its duration. If a handler fails to respond on time, the server replies 503 Service Unavailable with a specific message, and the context passed to the handler is canceled.

Note that we purposely omitted http.Server.WriteTimeout, which isn't necessary since http.TimeoutHandler was released (Go 1.8). http.Server.WriteTimeout has a few issues. First, its behavior depends on whether TLS is enabled, making it more complex to understand and use. It also closes the TCP connection if the timeout is reached, without returning a proper HTTP code. And it doesn't propagate the cancellation to the handler context, so a handler may keep executing without knowing that the TCP connection has already been closed.

When exposing our endpoints to untrusted clients, the best practice is to set at least the http.Server.ReadHeaderTimeout field and to use the http.TimeoutHandler wrapper function. Otherwise, clients may exploit this flaw by, for example, creating never-ending connections, which can lead to exhausted system resources.

Here's how to set up a server with these timeouts:

s := &http.Server{
    
    
    Addr:              ":8080",
    ReadHeaderTimeout: 500 * time.Millisecond,
    ReadTimeout:       500 * time.Millisecond,
    Handler:           http.TimeoutHandler(handler, time.Second, "foo"),   // ❶
}

❶ Wraps the HTTP handler

http.TimeoutHandler wraps the provided handler. Here, if handler fails to respond within 1 second, the server returns a 503 status code with foo as the HTTP response.

As we described for the HTTP client, on the server side we can configure the maximum amount of time to wait for the next request when keep-alives are enabled. We do so using http.Server.IdleTimeout:

s := &http.Server{
    
    
    // ...
    IdleTimeout: time.Second,
}

Note that if http.Server.IdleTimeout isn't set, the value of http.Server.ReadTimeout is used for the idle timeout. If neither is set, there won't be any timeout, and connections will remain open until they are closed by the client.

For production-grade applications, we need to make sure we don't use the default HTTP client and server. Otherwise, requests may get stuck forever because of missing timeouts, or malicious clients may even exploit the fact that our server doesn't have any timeout.

Summary

  • Be cautious with functions accepting a time.Duration. Even though passing an integer is allowed, strive to use the time API to prevent any possible confusion.

  • Avoiding calls to time.After in repeated functions (such as loops or HTTP handlers) can avoid peak memory consumption. The resources created by time.After are released only when the timer expires.

  • Be careful about using embedded fields in Go structs. Doing so may lead to sneaky bugs, such as an embedded time.Time field implementing the json.Marshaler interface and hence overriding the default marshaling behavior.

  • When comparing two time.Time structs, recall that time.Time contains both a wall clock and a monotonic clock, and comparison using the == operator is done on both clocks.

  • To avoid wrong assumptions when you provide a map while unmarshaling JSON data, remember that numerics are converted to float64 by default.

  • Call the Ping or PingContext method if you need to test your configuration and make sure a database is reachable.

  • Configure database connection parameters for production-grade applications.

  • Using SQL prepared statements makes queries more efficient and more secure.

  • Handle nullable columns in tables using pointers or sql.NullXXX types.

  • Call the Err method of *sql.Rows after row iterations to ensure that you haven't missed an error while preparing the next row.

  • Eventually close all structs implementing io.Closer to avoid possible leaks.

  • To avoid unexpected behavior in HTTP handler implementations, make sure you don't miss the return statement if you want a handler to stop after http.Error.

  • For production-grade applications, don't use the default HTTP client and server implementations. These implementations are missing timeouts and behaviors that should be mandatory in production.

Eleven, testing

This chapter covers

  • Categorizing tests to make them more robust
  • Making Go tests deterministic
  • Using utility packages such as httptest and iotest
  • Avoid Common Benchmark Mistakes
  • Improve the testing process

Testing is an important aspect of the project lifecycle. It provides countless benefits, such as building confidence in the application, acting as code documentation, and making refactoring easier. Compared to some other languages, Go has powerful primitives for writing tests. In this chapter, we'll focus on common mistakes that make the testing process brittle, inefficient, and inaccurate.

11.1 #82: Not classifying tests

A test pyramid is a model that groups tests into different categories (see figure 11.1). Unit tests occupy the base of the pyramid. Most tests should be unit tests: they are cheap to write, fast to execute, and highly deterministic. Usually, as we go further up the pyramid, tests become more complex, slower to run, and harder to keep deterministic.

Figure 11.1 An example of a test pyramid

A common technique is to be explicit about what kind of tests to run. For example, depending on the stage of the project life cycle, we may want to run only unit tests or run all tests in the project. Not categorizing tests means potential wasted time and effort, and loss of accuracy of test coverage. This section discusses the three main ways of classifying tests in Go.

11.1.1 Build tags

The most common way to classify tests is to use build tags. A build tag is a special comment at the beginning of a Go file, followed by an empty line.

For example, look at this bar.go file:

//go:build foo

package bar

This file contains the foo tag. Note that one package may contain multiple files with different build tags.

NOTE As of Go 1.17, the //go:build foo syntax supersedes // +build foo. For the time being (Go 1.18), gofmt synchronizes the two forms to help with migration.
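For example, on a codebase that still needs to build with Go versions before 1.17, gofmt keeps both comment forms together at the top of the file:

//go:build integration
// +build integration

package db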

Build tags are mainly used in two cases. First, we can use a build tag as a conditional option to build an application: for example, if we want a source file to be included only if cgo is enabled (cgo is a way to let Go packages call C code), we can add the //go:build cgo build tag. Second, if we want to classify a test as an integration test, we can add a specific build flag, such as integration.

Here is an example db_test.go file:

//go:build integration

package db

import (
    "testing"
)

func TestInsert(t *testing.T) {
    
    
    // ...
}

Here we added the integration build tag to classify that this file contains integration tests. The benefit of using build tags is that we can choose which kinds of tests to execute. For example, let's assume a package contains two test files:

  • The file we just created: db_test.go

  • Another file that doesn't contain a build tag: contract_test.go

If we run go test in this package without any options, it will run only the test files without build tags (contract_test.go):

$ go test -v .
=== RUN   TestContract
--- PASS: TestContract (0.01s)
PASS

However, if we provide the integration tag, running go test will also include db_test.go:

$ go test --tags=integration -v .
=== RUN   TestInsert
--- PASS: TestInsert (0.01s)
=== RUN   TestContract
--- PASS: TestContract (2.89s)
PASS

Hence, running tests with a specific tag includes both the files without tags and the files matching this tag. What if we want to run only integration tests? A possible way is to add a negation tag on the unit test files. For example, using !integration means we want to include the test file (contract_test.go) only if the integration flag is not enabled:

//go:build !integration

package db

import (
    "testing"
)

func TestContract(t *testing.T) {
    
    
    // ...
}

Using this approach,

  • Running go test with the integration flag runs only the integration tests.

  • Running go test without the integration flag runs only the unit tests.

Let's discuss an option that works at the level of a single test instead of a file.

11.1.2 Environment variables

As mentioned by Peter Bourgon, a member of the Go community, build tags have one main drawback: the absence of a signal that a test has been ignored (see mng.bz/qYlr). In the first example, when we executed go test without build flags, it showed only the tests that were executed:

$ go test -v .
=== RUN   TestUnit
--- PASS: TestUnit (0.01s)
PASS
ok      db  0.319s

If we're not careful about the way we handle tags, we may forget about existing tests. For this reason, some projects favor the approach of checking the test category using environment variables.

For example, we can implement the TestInsert integration test by checking a specific environment variable and possibly skipping the test:

func TestInsert(t *testing.T) {
    
    
    if os.Getenv("INTEGRATION") != "true" {
    
    
        t.Skip("skipping integration test")
    }

    // ...
}

If the INTEGRATION environment variable isn't set to true, the test is skipped with a message:

$ go test -v .
=== RUN   TestInsert
    db_integration_test.go:12: skipping integration test     // ❶
--- SKIP: TestInsert (0.00s)
=== RUN   TestUnit
--- PASS: TestUnit (0.00s)
PASS
ok      db  0.319s

❶ Shows the message for skipping the test

One benefit of using this approach is making explicit which tests are skipped and why. This technique is probably less widely used than build tags, but it's worth knowing because, as we discussed, it provides some advantages.

Next, let's look at another way to categorize tests: short mode.

11.1.3 Short mode

Another approach to categorizing tests is related to their speed. We may have to separate short-running tests from long-running tests.

As an example, suppose we have a set of unit tests, one of which is notoriously slow. We would like to categorize the slow test so we don't have to run it every time (especially if the trigger is after saving a file). Short mode allows us to make this distinction:

func TestLongRunning(t *testing.T) {
    
    
    if testing.Short() {
    
                            // ❶
        t.Skip("skipping long-running test")
    }
    // ...
}

❶ Marks the test as long-running

Using testing.Short, we can retrieve whether short mode was enabled while running the test. Then we use Skip to skip the test. To run tests using short mode, we have to pass -short:

% go test -short -v .
=== RUN   TestLongRunning
    foo_test.go:9: skipping long-running test
--- SKIP: TestLongRunning (0.00s)
PASS
ok      foo  0.174s

TestLongRunning is explicitly skipped when the tests are executed. Note that unlike build tags, this option works per test, not per file.

In summary, categorizing tests is a best practice for a successful testing strategy. In this section, we've seen three ways to categorize tests:

  • Using build tags at the test-file level

  • Using environment variables to mark specific tests

  • Based on test pace using short mode

We can also combine approaches: for example, if our project contains long-running unit tests, use build tags or environment variables to classify tests (for example, as unit or integration tests) and short patterns.

In the next section, we'll discuss why it's important to enable the -race flag.

11.2 #83: Not enabling the -race flag

In mistake #58, "Not understanding race problems," we defined a data race as occurring when two goroutines simultaneously access the same variable, with at least one writing to the variable. We should also know that Go has a standard race-detection tool to help detect data races. One common mistake is forgetting how important this tool is and not enabling it. This section looks at what the race detector catches, how to use it, and its limitations.

In Go, the race detector isn't a static analysis tool used during compilation; instead, it's a tool that finds data races that occur at runtime. To enable it, we have to enable the -race flag while compiling or running a test. For example:

$ go test -race ./...

Once the race detector is enabled, the compiler will instrument the code to detect data races. Instrumentation refers to the addition of extra instructions by the compiler: here, all memory accesses are tracked and recorded when and how they occur. At runtime, the race detector monitors for data races. However, we should keep in mind the runtime overhead of enabling the race detector:

  • Memory usage may increase by 5 to 10 times.

  • Execution time may increase by a factor of 2 to 20.

Because of this overhead, it is generally recommended to only enable race detectors during local testing or continuous integration (CI). In production, we should avoid using it (or only use it for canary releases).

Go raises a warning if a race is detected. For example, this example contains a data race because i can be read and written at the same time:

package main

import (
    "fmt"
)

func main() {
    
    
    i := 0
    go func() { i++ }()
    fmt.Println(i)
}

Running this application with the -race flag logs the following data race warning:

==================
WARNING: DATA RACE
Write at 0x00c000026078 by goroutine 7:                // ❶
  main.main.func1()
      /tmp/app/main.go:9 +0x4e

Previous read at 0x00c000026078 by main goroutine:     // ❷
  main.main()
      /tmp/app/main.go:10 +0x88

Goroutine 7 (running) created at:                      // ❸
  main.main()
      /tmp/app/main.go:9 +0x7a
==================

❶ Indicates that goroutine 7 was writing

❷ Indicates that the main goroutine was reading

❸ Indicates when goroutine 7 was created

Let's make sure we're comfortable reading these messages. Go always logs the following:

  • The concurrent goroutines involved: here, the main goroutine and goroutine 7.

  • Where the accesses occur in the code: in this case, lines 9 and 10.

  • When these goroutines were created: goroutine 7 was created in main().

NOTE Internally, the race detector uses vector clocks, a data structure used to determine a partial ordering of events (also used in distributed systems such as databases). Each goroutine creation leads to the creation of a vector clock. The tool updates the vector clocks at each memory access and synchronization event. Then it compares the vector clocks to detect potential data races.

The race detector cannot catch false positives (an apparent data race that isn't a real one). Therefore, we know our code contains a data race if we get a warning. Conversely, it can sometimes lead to false negatives (missing actual data races).

Regarding testing, we need to note two things. First, the race detector can only be as good as our tests. Thus, we should ensure that concurrent code is tested thoroughly against data races. Second, given the possible false negatives, if we have a test to check data races, we can put this logic inside a loop. Doing so increases the chances of catching possible data races:

func TestDataRace(t *testing.T) {
    
    
    for i := 0; i < 100; i++ {
    
    
        // Actual logic
    }
}

Furthermore, if a specific file contains tests that lead to data races, we can exclude it from race detection using the !race build tag:

//go:build !race

package main

import (
    "testing"
)

func TestFoo(t *testing.T) {
    
    
    // ...
}

func TestBar(t *testing.T) {
    
    
    // ...
}

This file is built only if the race detector is disabled. Otherwise, the whole file isn't built, so the tests aren't executed.

In summary, we should bear in mind that running tests with the -race flag, if not mandatory, is highly recommended for applications using concurrency. This approach allows us to enable the race detector, which instruments our code to catch potential data races. While enabled, it has a significant impact on memory and performance, so it must be used in specific conditions such as local tests or CI.

The following discusses two flags related to execution modes: parallel and shuffle.

11.3 #84: Not using test execution modes

While running tests, the go command can accept a set of flags to impact how tests are executed. A common mistake is not being aware of these flags and missing opportunities that could lead to faster execution or a better way to spot possible bugs. Let's look at two of these flags: parallel and shuffle.

11.3.1 The parallel flag

Parallel execution mode allows us to run specific tests in parallel, which can be very useful: for example, to speed up long-running tests. We can mark that a test has to be run in parallel by calling t.Parallel:

func TestFoo(t *testing.T) {
    
    
    t.Parallel()
    // ...
}

When we mark a test using t.Parallel, it is executed in parallel alongside all the other parallel tests. In terms of execution, though, Go first runs all the sequential tests one by one. Once the sequential tests are completed, it executes the parallel tests.

For example, the following code contains three tests, but only two of them are marked to be run in parallel:

func TestA(t *testing.T) {
    
    
    t.Parallel()
    // ...
}

func TestB(t *testing.T) {
    
    
    t.Parallel()
    // ...
}

func TestC(t *testing.T) {
    
    
    // ...
}

Running the tests in this file produces the following logs:

=== RUN   TestA
=== PAUSE TestA           // ❶
=== RUN   TestB
=== PAUSE TestB           // ❷
=== RUN   TestC           // ❸
--- PASS: TestC (0.00s)
=== CONT  TestA           // ❹
--- PASS: TestA (0.00s)
=== CONT  TestB
--- PASS: TestB (0.00s)
PASS

❶ Pauses TestA

❷ Pauses TestB

❸ Runs TestC

❹ Resumes TestA and TestB

TestC is executed first. TestA and TestB are logged first, but they are paused, waiting for TestC to complete. Then both are resumed and executed in parallel.

By default, the maximum number of tests that can run simultaneously equals the GOMAXPROCS value. To serialize tests or, for example, to increase this number in the context of long-running tests doing a lot of I/O, we can change this value using the -parallel flag:

$ go test -parallel 16 .

Here, the maximum number of parallel tests is set to 16.

Now let's see another mode when running Go tests: shuffle.

11.3.2 The shuffle flag

As of Go 1.17, it's possible to randomize the execution order of tests and benchmarks. What's the rationale? A best practice while writing tests is to make them isolated. For example, they shouldn't depend on execution order or shared variables. These hidden dependencies can mean a possible test error or, worse, a bug that won't be caught during testing. To prevent that, we can use the -shuffle flag to randomize tests. We can set it to on or off to enable or disable test shuffling (it is disabled by default):

$ go test -shuffle=on -v .

However, in some cases, we want to rerun tests in the same order. For example, if tests fail during CI, we may want to reproduce the error locally. To do that, instead of passing on to the -shuffle flag, we can pass the seed used to randomize the tests. We can access this seed value when running shuffled tests by enabling verbose mode (-v):

$ go test -shuffle=on -v .
-test.shuffle 1636399552801504000     // ❶
=== RUN   TestBar
--- PASS: TestBar (0.00s)
=== RUN   TestFoo
--- PASS: TestFoo (0.00s)
PASS
ok      teivah  0.129s

❶ Seed value

We executed the tests in a random order, but go test printed the seed value: 1636399552801504000. To force the tests to run in the same order, we provide this seed value to shuffle:

$ go test -shuffle=1636399552801504000 -v .
-test.shuffle 1636399552801504000
=== RUN   TestBar
--- PASS: TestBar (0.00s)
=== RUN   TestFoo
--- PASS: TestFoo (0.00s)
PASS
ok      teivah  0.129s

The tests are executed in the same order: TestBar and then TestFoo.

In general, we should be cautious about existing test flags and stay informed about the new features of recent Go releases. Running tests in parallel is an excellent way to decrease the overall execution time of running all the tests. And shuffle mode can help us spot hidden dependencies that could mean test errors, or even invisible bugs, while running tests in the same order.

11.4 #85: Not using table-driven tests

Table-driven tests are an efficient technique for writing condensed tests, which reduces boilerplate code and helps us focus on what matters: the testing logic. This section goes through a concrete example to see why table-driven tests are worth knowing when working with Go.

Let's consider the following function that removes all the newline suffixes (\n or \r\n) from a string:

func removeNewLineSuffixes(s string) string {
    
    
    if s == "" {
    
    
        return s
    }
    if strings.HasSuffix(s, "\r\n") {
    
    
        return removeNewLineSuffixes(s[:len(s)-2])
    }
    if strings.HasSuffix(s, "\n") {
    
    
        return removeNewLineSuffixes(s[:len(s)-1])
    }
    return s
}

This function removes all the trailing \r\n and \n suffixes recursively. Now, let's say we want to test this function extensively. We should at least cover the following cases:

  • Input is empty.

  • Input ends with \n.

  • Input ends with \r\n.

  • Input ends with multiple \n.

  • Input ends without any newlines.

The following approach creates one unit test per case:

func TestRemoveNewLineSuffix_Empty(t *testing.T) {
    
    
    got := removeNewLineSuffixes("")
    expected := ""
    if got != expected {
    
    
        t.Errorf("got: %s", got)
    }
}

func TestRemoveNewLineSuffix_EndingWithCarriageReturnNewLine(t *testing.T) {
    
    
    got := removeNewLineSuffixes("a\r\n")
    expected := "a"
    if got != expected {
    
    
        t.Errorf("got: %s", got)
    }
}

func TestRemoveNewLineSuffix_EndingWithNewLine(t *testing.T) {
    
    
    got := removeNewLineSuffixes("a\n")
    expected := "a"
    if got != expected {
    
    
        t.Errorf("got: %s", got)
    }
}

func TestRemoveNewLineSuffix_EndingWithMultipleNewLines(t *testing.T) {
    
    
    got := removeNewLineSuffixes("a\n\n\n")
    expected := "a"
    if got != expected {
    
    
        t.Errorf("got: %s", got)
    }
}

func TestRemoveNewLineSuffix_EndingWithoutNewLine(t *testing.T) {
    
    
    got := removeNewLineSuffixes("a\n")
    expected := "a"
    if got != expected {
    
    
        t.Errorf("got: %s", got)
    }
}

Each function represents a specific case we want to cover. However, there are two main drawbacks. First, the function names are more complex (TestRemoveNewLineSuffix_EndingWithCarriageReturnNewLine is 55 characters long), which can quickly affect the clarity of what the function is supposed to test. The second drawback is the amount of duplication among these functions, given that the structure is always the same:

  1. Call removeNewLineSuffixes.

  2. Define the expected value.

  3. Compare the values.

  4. Log an error message.

If we want to change one of these steps (for example, including the expected value as part of the error message), we have to repeat it in all the tests. And the more tests we write, the more difficult the code becomes to maintain.

Instead, we can use table-driven tests so the logic is written only once. Table-driven tests rely on subtests, and a single test function can include multiple subtests. For example, the following test contains two subtests:

func TestFoo(t *testing.T) {
    
    
    t.Run("subtest 1", func(t *testing.T) {
    
        // ❶
        if false {
    
    
            t.Error()
        }
    })
    t.Run("subtest 2", func(t *testing.T) {
    
        // ❷
        if 2 != 2 {
    
    
            t.Error()
        }
    })
}

❶ Runs the first subtest, called subtest 1

❷ Runs the second subtest, called subtest 2

The TestFoo function includes two subtests. If we run this test, it shows the results of both subtest 1 and subtest 2:

--- PASS: TestFoo (0.00s)
    --- PASS: TestFoo/subtest_1 (0.00s)
    --- PASS: TestFoo/subtest_2 (0.00s)
PASS

We can also run a single test using the -run flag and concatenating the parent test name with the subtest name. For example, we can run only subtest 1:

$ go test -run=TestFoo/subtest_1 -v      // ❶
=== RUN   TestFoo
=== RUN   TestFoo/subtest_1
--- PASS: TestFoo (0.00s)
    --- PASS: TestFoo/subtest_1 (0.00s)

❶ Uses the -run flag to run only subtest 1

Let's get back to our example and see how to use subtests to prevent repeating the testing logic. The main idea is to create one subtest per case. Variations exist, but we'll discuss a map data structure where the key represents the test name and the value represents the test data (input, expected).

Table-driven tests avoid boilerplate code by using a data structure containing the test data together with subtests. Here is a possible implementation using a map:

func TestRemoveNewLineSuffix(t *testing.T) {
    
    
    tests := map[string]struct {
    
                       // ❶
        input    string
        expected string
    }{
    
    
        `empty`: {
    
                                     // ❷
            input:    "",
            expected: "",
        },
        `ending with \r\n`: {
    
    
            input:    "a\r\n",
            expected: "a",
        },
        `ending with \n`: {
    
    
            input:    "a\n",
            expected: "a",
        },
        `ending with multiple \n`: {
    
    
            input:    "a\n\n\n",
            expected: "a",
        },
        `ending without newline`: {
    
    
            input:    "a",
            expected: "a",
        },
    }
    for name, tt := range tests {
    
                      // ❸
        t.Run(name, func(t *testing.T) {
    
               // ❹
            got := removeNewLineSuffixes(tt.input)
            if got != tt.expected {
    
    
                t.Errorf("got: %s, expected: %s", got, tt.expected)
            }
        })
    }
}

❶ Defines the test data

❷ : Each entry in the map represents a subtest.

❸ Iterate over the map

❹ Run a new subtest for each map entry

The tests variable is a map. The key represents the test name, and the value represents the test data: in our case, the input and expected strings. Each map entry is a new test case that we want to cover. We run a new subtest for each map entry.

This test addresses the two shortcomings we discussed:

  • Each test name is now a string instead of a Pascal-cased function name, which makes it easier to read.

  • That logic is written once and shared across all the different cases. Modifying the test structure or adding a new test requires minimal effort.

One last thing to mention about table-driven tests, because it can also be a source of bugs: as we noted earlier, we can mark a test to be run in parallel by calling t.Parallel. We can also do this in subtests, inside the closure provided to t.Run:

for name, tt := range tests {
    
    
    t.Run(name, func(t *testing.T) {
    
    
        t.Parallel()                   // ❶
        // Use tt
    })
}

❶ Marks subtests that run in parallel

However, this closure uses a loop variable. To prevent an issue similar to the one discussed in mistake #63, "Not being careful with goroutines and loop variables," which could lead the closures to use the wrong value of the tt variable, we should create another variable or shadow tt:

for name, tt := range tests {
    
    
    tt := tt                          // ❶
    t.Run(name, func(t *testing.T) {
    
    
        t.Parallel()
        // Use tt
    })
}

❶ Shadows tt so it is local to the loop iteration

This way, each closure accesses its own tt variable.

In summary, if multiple unit tests have a similar structure, we can use table-driven tests to commonize them. Because this technique prevents duplication, it makes it simple to change test logic and add new use cases more easily.

Next, let's discuss how to prevent flaky tests in Go.

11.5 #86: Sleeping in unit tests

A flaky test is a test that may both pass and fail without any code change. Flaky tests are among the biggest hurdles in testing because they are expensive to debug and undermine our confidence in testing accuracy. In Go, calling time.Sleep in a test can be a signal of possible flakiness. For example, concurrent code is often tested using sleeps. This section presents concrete techniques to remove sleeps from tests and hence prevent us from writing flaky tests.

We will illustrate this section with a function that returns a value and spins up a goroutine that performs a job in the background. We'll call a function that gets a slice of Foo structs and returns the best element (the first one). In the meantime, the other goroutine is in charge of calling a Publish method with the first n Foo elements:

type Handler struct {
    
    
    n         int
    publisher publisher
}

type publisher interface {
    
    
    Publish([]Foo)
}

func (h Handler) getBestFoo(someInputs int) Foo {
    
    
    foos := getFoos(someInputs)        // ❶
    best := foos[0]                    // ❷

    go func() {
    
    
        if len(foos) > h.n {
    
               // ❸
            foos = foos[:h.n]
        }
        h.publisher.Publish(foos)      // ❹
    }()

    return best
}

❶ Gets a slice of Foo

❷ Keeps the first element (checking the length of foos is omitted for the sake of simplicity)

❸ Keeps only the first n Foo structs

❹ Calls the Publish method

The Handler struct contains two fields: the n field and the publisher dependency used to publish the first n Foo structs. First we get a slice of Foo; but before returning the first element, we spin up a new goroutine, filter the foos slice, and call Publish.

How can we test this function? Writing the part to assert the response is pretty straightforward. However, what if we also want to check what is passed to Publish?

We could mock the publisher interface to record the arguments passed while calling the Publish method. Then we could sleep for a few milliseconds before checking the recorded arguments:

type publisherMock struct {
    
    
    mu  sync.RWMutex
    got []Foo
}

func (p *publisherMock) Publish(got []Foo) {
    
    
    p.mu.Lock()
    defer p.mu.Unlock()
    p.got = got
}

func (p *publisherMock) Get() []Foo {
    
    
    p.mu.RLock()
    defer p.mu.RUnlock()
    return p.got
}

func TestGetBestFoo(t *testing.T) {
    
    
    mock := publisherMock{
    
    }
    h := Handler{
    
    
        publisher: &mock,
        n:         2,
    }

    foo := h.getBestFoo(42)
    // Check foo

    time.Sleep(10 * time.Millisecond)    // ❶
    published := mock.Get()
    // Check published
}

❶ Sleeps for 10 milliseconds before checking the arguments passed to Publish

We write a mock of publisher that relies on a mutex to protect access to the got field. In the unit test, we call time.Sleep to leave some time before checking the arguments passed to Publish.

This test is inherently flaky. There is no strict guarantee that 10 ms is enough (in this example, it is likely but not guaranteed).

So, what are the options to improve this unit test? First, we can use retries to periodically assert a given condition. For example, we could write a function that takes as arguments an assertion, a maximum number of retries plus a wait time, and calls the function periodically to avoid busy loops:

func assert(t *testing.T, assertion func() bool,
    maxRetry int, waitTime time.Duration) {
    
    
    for i := 0; i < maxRetry; i++ {
    
    
        if assertion() {
    
                   // ❶
            return
        }
        time.Sleep(waitTime)           // ❷
    }
    t.Fail()                           // ❸
}

❶ Check assertions

❷ sleep before retry

❸ After many attempts, finally failed

This function checks the provided assertion and eventually fails after a certain number of retries. We also use time.Sleep, but we can use a shorter sleep with this code.

As an example, let's go back to TestGetBestFoo:

assert(t, func() bool {
    
    
    return len(mock.Get()) == 2
}, 30, time.Millisecond)

Instead of sleeping for 10 milliseconds, we sleep each millisecond and configure a maximum number of retries. Such an approach reduces the execution time if the test succeeds, because we reduce the waiting time. Therefore, implementing a retry strategy is a better approach than using passive sleeps.

NOTE Some testing libraries, such as testify, offer retry features. For example, in testify, we can use the Eventually function, which implements assertions that should eventually succeed, plus other features such as configuring the error message.
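As an illustration, here is a hedged sketch using testify's assert.Eventually (assuming the mock from the previous listings and the github.com/stretchr/testify/assert package), which replaces the hand-written retry loop:

import (
    "testing"
    "time"

    "github.com/stretchr/testify/assert"
)

func TestGetBestFoo(t *testing.T) {
    // ... set up the mock and call h.getBestFoo as before

    // assert.Eventually polls the condition every tick until it returns true
    // or the waitFor duration elapses, in which case the test fails.
    assert.Eventually(t, func() bool {
        return len(mock.Get()) == 2
    }, 30*time.Millisecond, time.Millisecond)
}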

Another strategy is to use channels to synchronize the goroutine publishing the Foo structs and the testing goroutine. For example, in the mock implementation, instead of copying the received slice into a field, we could send this value to a channel:

type publisherMock struct {
    
    
    ch chan []Foo
}

func (p *publisherMock) Publish(got []Foo) {
    
    
    p.ch <- got                               // ❶
}

func TestGetBestFoo(t *testing.T) {
    
    
    mock := publisherMock{
    
    
        ch: make(chan []Foo),
    }
    defer close(mock.ch)

    h := Handler{
    
    
        publisher: &mock,
        n:         2,
    }
    foo := h.getBestFoo(42)
    // Check foo

    if v := len(<-mock.ch); v != 2 {
    
              // ❷
        t.Fatalf("expected 2, got %d", v)
    }
}

❶ Send received parameters

❷ compared the parameters

The publisher sends the received argument to the channel. Meanwhile, the testing goroutine sets up the mock and creates the assertion based on the received value. We could also implement a timeout strategy to make sure we don't wait forever on mock.ch if something goes wrong. For example, we can use select with two cases: mock.ch and time.After.
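For example, a minimal sketch of such a timeout policy (the 1-second duration is arbitrary) could look like this:

select {
case got := <-mock.ch:
    if len(got) != 2 {
        t.Fatalf("expected 2, got %d", len(got))
    }
case <-time.After(time.Second): // safety net so the test can't block forever
    t.Fatal("timeout while waiting for Publish to be called")
}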

Which option should we favor: retry or synchronization? Indeed, synchronization reduces the waiting time to the bare minimum and, if well designed, makes a test fully deterministic.

If we can't apply synchronization, we should perhaps reconsider our design, since we may have a problem. If synchronization is truly impossible, we should use the retry option, which is a better choice than using passive sleeps to eliminate non-determinism in tests.

Let's continue discussing how to prevent flakiness in tests, this time when using the time API.

11.6 #87: Not dealing with the time API efficiently

Some functions have to rely on the time API: for example, to retrieve the current time. In such a case, it can be pretty easy to write brittle unit tests that may fail at some point. In this section, we go through a concrete example and discuss the options. The goal is not to cover every use case and technique but to give pointers about writing more robust tests of functions that use the time API.

Say an application receives events that we want to store in an in-memory cache. We will implement a Cache struct to hold the most recent events. This struct will expose three methods that do the following:

  • Append events

  • Get all the events

  • Trim the events for a given duration (we will focus on this method)

Each of these methods needs to access the current time. Let's write a first implementation of the third method using time.Now() (we will assume that all the events are sorted by time):

type Cache struct {
    
    
    mu     sync.RWMutex
    events []Event
}

type Event struct {
    
    
    Timestamp time.Time
    Data string
}

func (c *Cache) TrimOlderThan(since time.Duration) {
    
    
    c.mu.RLock()
    defer c.mu.RUnlock()

    t := time.Now().Add(-since)               // ❶
    for i := 0; i < len(c.events); i++ {
    
    
        if c.events[i].Timestamp.After(t) {
    
    
            c.events = c.events[i:]           // ❷
            return
        }
    }
}

❶ Subtracts the given duration from the current time

❷ Trims the events

We compute a t variable that is the current time minus the provided duration. Then, because the events are sorted by time, we update the internal events slice as soon as we reach an event whose time is after t.

How can we test this method? We could rely on the current time, using time.Now, to create the events:

func TestCache_TrimOlderThan(t *testing.T) {
    
    
    events := []Event{
    
                                            // ❶
        {
    
    Timestamp: time.Now().Add(-20 * time.Millisecond)},
        {
    
    Timestamp: time.Now().Add(-10 * time.Millisecond)},
        {
    
    Timestamp: time.Now().Add(10 * time.Millisecond)},
    }
    cache := &Cache{
    
    }
    cache.Add(events)                                         // ❷
    cache.TrimOlderThan(15 * time.Millisecond)                // ❸
    got := cache.GetAll()                                     // ❹
    expected := 2
    if len(got) != expected {
    
    
        t.Fatalf("expected %d, got %d", expected, len(got))
    }
}

❶ Creates events using time.Now()

❷ Adds these events to the cache

❸ Trims the events since 15 milliseconds ago

❹ Retrieves all the events

We add a slice of events to the cache using time.Now(), adding or subtracting some small durations. Then we trim these events since 15 milliseconds ago and perform the assertion.

This approach has a major drawback: if the machine executing the tests is suddenly busy, we may prune fewer events than expected. We might be able to increase the provided duration to reduce the chance of test failures, but doing so is not always possible. For example, what if the timestamp field is an unexported field generated when an event is added? In this case it's not possible to pass a specific timestamp and you may end up adding sleeps in your unit tests.

The issue is related to the implementation of TrimOlderThan. Because it calls time.Now(), it's harder to implement robust unit tests. Let's discuss two approaches to make the test less flaky.

The first approach is to make the way to retrieve the current time a dependency of the Cache struct. In production, we inject the real implementation, whereas in unit tests, we pass a stub.

There are various techniques to handle this dependency, such as interfaces or function types. In our case, since we only rely on one method ( time.Now()), we can define a function type:

type now func() time.Time

type Cache struct {
    
    
    mu     sync.RWMutex
    events []Event
    now    now
}

now is a function type that returns a time.Time. In the factory function, we can pass the actual time.Now function this way:

func NewCache() *Cache {
    
    
    return &Cache{
    
    
        events: make([]Event, 0),
        now:    time.Now,
    }
}

Because the now dependency remains unexported, it isn't accessible by external clients. Furthermore, in our unit test, we can create a Cache struct by injecting a fake func() time.Time implementation based on a predefined time:

func TestCache_TrimOlderThan(t *testing.T) {
    
    
    events := []Event{
    
                                             // ❶
        {
    
    Timestamp: parseTime(t, "2020-01-01T12:00:00.04Z")},
        {
    
    Timestamp: parseTime(t, "2020-01-01T12:00:00.05Z")},
        {
    
    Timestamp: parseTime(t, "2020-01-01T12:00:00.06Z")},
    }
    cache := &Cache{
    
    now: func() time.Time {
    
                        // ❷
        return parseTime(t, "2020-01-01T12:00:00.06Z")
    }}
    cache.Add(events)
    cache.TrimOlderThan(15 * time.Millisecond)
    // ...
}

func parseTime(t *testing.T, timestamp string) time.Time {
    
    
    // ...
}

❶ Create events based on specific timestamps

❷ Inject a static function to fix the time

While creating a new Cache struct, we inject the now dependency based on a given time. Thanks to this approach, the test is robust. Even in the worst conditions, the outcome of this test is deterministic.
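The parseTime helper is elided in the listing above; a minimal sketch (assuming RFC 3339 timestamps such as those used in the test) could be:

func parseTime(t *testing.T, timestamp string) time.Time {
    t.Helper()
    // Parse an RFC 3339 timestamp and fail the test on error.
    ts, err := time.Parse(time.RFC3339, timestamp)
    if err != nil {
        t.Fatal(err)
    }
    return ts
}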

Using a global variable

Instead of using fields, we can retrieve the time through a global variable:

var now = time.Now      // ❶

❶ defines global variablesnow

In general, we should try to avoid this mutable shared state. In our case, this will cause at least one specific problem: the tests will no longer be isolated, since they all depend on a shared variable. So, for example, tests cannot run in parallel. If possible, we should handle these cases as part of structural dependencies, promoting test isolation.

This solution is also extensible. For example, what if the function also calls time.After? We could either add another after dependency or create an interface grouping the two methods, Now and After. However, this approach has one main drawback: the now dependency isn't available if we create a unit test from an external package, for instance (we explore this in mistake #90, "Not exploring all the Go testing features").
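For illustration, a hedged sketch of that interface-based option (names are illustrative, not from the book) could group Now and After behind a small clock abstraction:

type clock interface {
    Now() time.Time
    After(d time.Duration) <-chan time.Time
}

// realClock is the production implementation, delegating to the time package.
type realClock struct{}

func (realClock) Now() time.Time                         { return time.Now() }
func (realClock) After(d time.Duration) <-chan time.Time { return time.After(d) }

type Cache struct {
    mu     sync.RWMutex
    events []Event
    clock  clock // realClock in production, a stub in tests
}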

When that's a problem, we can use another technique. Instead of handling the time as an unexported dependency, we can ask clients to provide the current time:

func (c *Cache) TrimOlderThan(now time.Time, since time.Duration) {
    
    
    // ...
}

To go even further, we can merge the two function arguments into a single time.Time that represents a specific point in time up to which we want to trim the events:

func (c *Cache) TrimOlderThan(t time.Time) {
    
    
    // ...
}

It is up to the caller to calculate this point in time:

cache.TrimOlderThan(time.Now().Add(time.Second))

And in the test, we also have to pass the corresponding time:

func TestCache_TrimOlderThan(t *testing.T) {
    
    
    // ...
    cache.TrimOlderThan(parseTime(t, "2020-01-01T12:00:00.06Z").
        Add(-15 * time.Millisecond))
    // ...
}

This approach is the simplest because it doesn't require creating another type and a stub.

In general, we should be cautious about testing code that uses the time API. It can be an open door for flaky tests. In this section, we have seen two ways to deal with it. We can keep the time interactions as part of a dependency that we can fake in unit tests by using our own implementation or relying on an external library; or we can rework our API to ask clients to provide us with the information we need, such as the current time (this technique is simpler but more limited).

Let's now discuss two helpful Go packages related to testing: httptest and iotest.

11.7 #88: Not using the test utility package

The standard library provides utility packages for testing. A common mistake is being unaware of these packages and trying to reinvent the wheel or rely on other solutions that aren't as handy. This section examines two of these packages: one to help us when using HTTP and another to use when doing I/O and working with readers and writers.

11.7.1 The httptest package

The httptest package (pkg.go.dev/net/http/httptest) provides utilities for HTTP testing for both clients and servers. Let's look at these two use cases.

First, let's see how httptest can help while writing an HTTP server. We will implement a handler that performs some basic actions: writing a header and a body and returning a specific status code. For the sake of clarity, we will omit error handling:

func Handler(w http.ResponseWriter, r *http.Request) {
    
    
    w.Header().Add("X-API-VERSION", "1.0")
    b, _ := io.ReadAll(r.Body)
    _, _ = w.Write(append([]byte("hello "), b...))     // ❶
    w.WriteHeader(http.StatusCreated)
}

❶ Concatenates hello with the request body

An HTTP handler accepts two arguments: the request and a way to write the response. The httptest package provides utilities for both. For the request, we can use httptest.NewRequest to build an *http.Request using an HTTP method, a URL, and a body. For the response, we can use httptest.NewRecorder to record the mutations made within the handler. Let's write a unit test of this handler:

func TestHandler(t *testing.T) {
    
    
    req := httptest.NewRequest(http.MethodGet, "http://localhost",     // ❶
        strings.NewReader("foo"))
    w := httptest.NewRecorder()                                        // ❷
    Handler(w, req)                                                    // ❸

    if got := w.Result().Header.Get("X-API-VERSION"); got != "1.0" {
    
       // ❹
        t.Errorf("api version: expected 1.0, got %s", got)
    }

    body, _ := ioutil.ReadAll(w.Result().Body)                         // ❺
    if got := string(body); got != "hello foo" {
    
    
        t.Errorf("body: expected hello foo, got %s", got)
    }

    if http.StatusOK != w.Result().StatusCode {
    
                            // ❻
        t.FailNow()
    }
}

❶ Builds the request

❷ Creates the response recorder

❸ Calls the handler

❹ Verifies the HTTP header

❺ Verifies the HTTP body

❻ Verifies the HTTP status code

Testing a handler using httptest doesn't test the transport (the HTTP part). The focus of the test is calling the handler directly with a request and a way to record the response. Then, using the response recorder, we write assertions to verify the HTTP header, body, and status code.

Let's look at the other side of the coin: testing an HTTP client. We'll write a client in charge of querying an HTTP endpoint that calculates how long it takes to drive from one coordinate to another. The client looks like this:

func (c DurationClient) GetDuration(url string,
    lat1, lng1, lat2, lng2 float64) (
    time.Duration, error) {
    
    
    resp, err := c.client.Post(
        url, "application/json",
        buildRequestBody(lat1, lng1, lat2, lng2),
    )
    if err != nil {
    
    
        return 0, err
    }

    return parseResponseBody(resp.Body)
}

This code performs an HTTP POST request to the provided URL and returns the parsed response (let's say, some JSON).

What if we want to test this client? One option is to use Docker and start a mock server that returns some pre-registered responses. However, this approach makes test execution slow. Another option is to use httptest.NewServerto create a local HTTP server based on the handler we will provide. Once the server is up and running, we can pass its URL to GetDuration:

func TestDurationClientGet(t *testing.T) {
    
    
    srv := httptest.NewServer(                                             // ❶
        http.HandlerFunc(
            func(w http.ResponseWriter, r *http.Request) {
    
    
                _, _ = w.Write([]byte(`{"duration": 314}`))                // ❷
            },
        ),
    )
    defer srv.Close()                                                      // ❸

    client := NewDurationClient()
    duration, err :=
        client.GetDuration(srv.URL, 51.551261, -0.1221146, 51.57, -0.13)   // ❹
    if err != nil {
    
    
        t.Fatal(err)
    }

    if duration != 314*time.Second {
    
                                           // ❺
        t.Errorf("expected 314 seconds, got %v", duration)
    }
}

❶ Start the HTTP server

❷ Register handlers to serve responses

❸ Shut down the server

❹ Server URL provided

❺ Validated the response

In this test, we create a server with a static handler returning 314 seconds. We could also make assertions based on the request sent. Furthermore, when we call GetDuration, we provide the URL of the server that was started. Compared to testing a handler, this test performs an actual HTTP call, but it executes in only a few milliseconds.

We can also start a new server using TLS with httptest.NewTLSServer and create an unstarted server with httptest.NewUnstartedServer so that we can start it lazily.

Let's remember how helpful httptest is when working in the context of HTTP applications. Whether we're writing a server or a client, httptest helps us create efficient tests.

11.7.2 The iotest package

The iotest package (pkg.go.dev/testing/iotest) implements utilities for testing readers and writers. It's a convenient package that Go developers too often forget.

When implementing a custom io.Reader, we should remember to test it using iotest.TestReader. This utility function tests that a reader behaves correctly: it accurately returns the number of bytes read, fills the provided slice, and so on. It also tests different behaviors if the provided reader implements interfaces such as io.ReaderAt.

Let's assume we have a custom LowerCaseReader that streams lowercase letters from a given input io.Reader. Here's how to test that this reader doesn't misbehave:

func TestLowerCaseReader(t *testing.T) {
    
    
    err := iotest.TestReader(
        &LowerCaseReader{
    
    reader: strings.NewReader("aBcDeFgHiJ")},   // ❶
        []byte("acegi"),                                             // ❷
    )
    if err != nil {
    
    
        t.Fatal(err)
    }
}

❶ Provides an io.Reader

❷ The expectation

We call iotest.TestReader by providing the custom LowerCaseReader and an expectation: the lowercase letters acegi.

Another use case for the iotest package is to make sure an application using readers and writers is tolerant to errors:

  • iotest.ErrReader creates an io.Reader that returns a provided error.

  • iotest.HalfReader creates an io.Reader that reads only half as many bytes as requested from the io.Reader.

  • iotest.OneByteReader creates an io.Reader that reads a single byte for each non-empty read from the io.Reader.

  • iotest.TimeoutReader creates an io.Reader that returns an error on the second read with no data. Subsequent calls will succeed.

  • iotest.TruncateWriter creates an io.Writer that writes to an io.Writer but stops silently after n bytes.

For example, suppose we implement the following function that starts by reading all the bytes from a reader:

func foo(r io.Reader) error {
    
    
    b, err := io.ReadAll(r)
    if err != nil {
    
    
        return err
    }

    // ...
}

We want to make sure our function is resilient if, for example, the provided reader fails during a read (such as to simulate a network error):

func TestFoo(t *testing.T) {
    
    
    err := foo(iotest.TimeoutReader(            // ❶
        strings.NewReader(randomString(1024)),
    ))
    if err != nil {
    
    
        t.Fatal(err)
    }
}

❶ Wraps the provided io.Reader using iotest.TimeoutReader

We wrap the provided io.Reader with iotest.TimeoutReader. As we mentioned, the second read will fail. If we run this test to make sure our function is tolerant to errors, we get a test failure. Indeed, io.ReadAll returns any error it finds.

Knowing this, we can implement our custom readAll function that tolerates up to n errors:

func readAll(r io.Reader, retries int) ([]byte, error) {
    
    
    b := make([]byte, 0, 512)
    for {
    
    
        if len(b) == cap(b) {
    
    
            b = append(b, 0)[:len(b)]
        }
        n, err := r.Read(b[len(b):cap(b)])
        b = b[:len(b)+n]
        if err != nil {
    
    
            if err == io.EOF {
    
    
                return b, nil
            }
            retries--
            if retries < 0 {
    
         // ❶
                return b, err
            }
        }
    }
}

❶ Tolerates retries

This implementation is similar to io.ReadAll, but it also handles configurable retries. If we change the implementation of the initial function to use the custom readAll instead of io.ReadAll, the test no longer fails:

func foo(r io.Reader) error {
    
    
    b, err := readAll(r, 3)       // ❶
    if err != nil {
    
    
        return err
    }

    // ...
}

❶ Indicates up to three retries

We have seen an example of how to check that a function is tolerant to errors while reading from an io.Reader. We performed the test by relying on the iotest package.

When doing I/O and working with io.Reader and io.Writer, let's remember how handy the iotest package is. As we've seen, it provides utilities to test the behavior of a custom io.Reader and to test our application against errors that occur while reading or writing data.

The next section discusses some common pitfalls that can lead to writing inaccurate benchmarks.

11.8 #89: Writing inaccurate benchmarks

In general, we should never guess performance. When writing optimizations, many factors may come into play, and even if we have strong opinions about the results, it's not a bad idea to test them. However, writing benchmarks is not straightforward. It can be very simple to write inaccurate benchmarks and make wrong assumptions based on them. The goal of this section is to examine common and specific pitfalls that lead to inaccuracies.

Before discussing these pitfalls, let's briefly review how benchmarks work in Go. The framework of the benchmark is as follows:

func BenchmarkFoo(b *testing.B) {
    
    
    for i := 0; i < b.N; i++ {
    
    
        foo()
    }
}

The function name starts with the Benchmark prefix. The function under test (foo) is called within the for loop. b.N represents a variable number of iterations. When running a benchmark, Go tries to make it match the requested benchmark time. The benchmark time is set by default to 1 second and can be changed with the -benchtime flag. b.N starts at 1; if the benchmark completes in 1 second, b.N is increased, and the benchmark runs again until b.N roughly matches benchtime:

$ go test -bench=.
cpu: Intel(R) Core(TM) i5-7360U CPU @ 2.30GHz
BenchmarkFoo-4                73          16511228 ns/op

Here, the benchmark took about 1 second, and foo was executed 73 times, for an average execution time of 16,511,228 nanoseconds. We can change the benchmark time using -benchtime:

$ go test -bench=. -benchtime=2s
BenchmarkFoo-4               150          15832169 ns/op

foo was executed roughly twice as many times as during the previous benchmark duration.

Next, let's look at some common pitfalls.

11.8.1 Not resetting or pausing the timer

In some cases, we need to perform operations before the benchmark loop. These operations may take quite a while (for example, generating a large data set) and may significantly impact the benchmark results:

func BenchmarkFoo(b *testing.B) {
    
    
    expensiveSetup()
    for i := 0; i < b.N; i++ {
    
    
        functionUnderTest()
    }
}

In this case, we can use the ResetTimer method before entering the loop:

func BenchmarkFoo(b *testing.B) {
    
    
    expensiveSetup()
    b.ResetTimer()                // ❶
    for i := 0; i < b.N; i++ {
    
    
        functionUnderTest()
    }
}

❶ Resets the benchmark timer

Calling ResetTimer zeroes the elapsed benchmark time and memory allocation counters since the beginning of the test. This way, the expensive setup can be discarded from the test results.

What if we have to perform the expensive setup not just once but within each loop iteration?

func BenchmarkFoo(b *testing.B) {
    
    
    for i := 0; i < b.N; i++ {
    
    
        expensiveSetup()
        functionUnderTest()
    }
}

We can't reset the timer, because that would be executed during each loop iteration. But we can stop and resume the benchmark timer, surrounding the call to expensiveSetup:

func BenchmarkFoo(b *testing.B) {
    
    
    for i := 0; i < b.N; i++ {
    
    
        b.StopTimer()                // ❶
        expensiveSetup()
        b.StartTimer()               // ❷
        functionUnderTest()
    }
}

❶ Pauses the benchmark timer

❷ Resumes the benchmark timer

Here, we pause the benchmark timer to perform the expensive setup and then resume the timer.

Note that there's one catch to remember about this approach: if the function under test is too fast to execute compared to the setup function, the benchmark may take too long to complete. The reason is that it would take much longer than 1 second to reach benchtime. Calculating the benchmark time is based solely on the execution time of functionUnderTest. So, if we wait a significant amount of time during each loop iteration, the benchmark will be much slower than 1 second. If we want to keep the benchmark, one possible mitigation is to decrease benchtime.

We must be sure to use the timer methods to keep benchmarks accurate.

11.8.2 Making wrong assumptions about micro-benchmarks

A micro-benchmark measures a tiny computation unit, and it can be extremely easy to make wrong assumptions about it. Let's say, for example, that we aren't sure whether to use atomic.StoreInt32 or atomic.StoreInt64 (assuming that the values we handle will always fit in 32 bits). We want to write a benchmark to compare both functions:

func BenchmarkAtomicStoreInt32(b *testing.B) {
    
    
    var v int32
    for i := 0; i < b.N; i++ {
    
    
        atomic.StoreInt32(&v, 1)
    }
}

func BenchmarkAtomicStoreInt64(b *testing.B) {
    
    
    var v int64
    for i := 0; i < b.N; i++ {
    
    
        atomic.StoreInt64(&v, 1)
    }
}

If we run this benchmark, here's some example output:

cpu: Intel(R) Core(TM) i5-7360U CPU @ 2.30GHz
BenchmarkAtomicStoreInt32
BenchmarkAtomicStoreInt32-4       197107742             5.682 ns/op
BenchmarkAtomicStoreInt64
BenchmarkAtomicStoreInt64-4       213917528             5.134 ns/op

We could easily take this benchmark for granted and decide to use atomic.StoreInt64 because it appears to be faster. Now, for the sake of doing a fair benchmark, we reverse the order and test atomic.StoreInt64 first, followed by atomic.StoreInt32. Here is some example output:

BenchmarkAtomicStoreInt64
BenchmarkAtomicStoreInt64-4       224900722             5.434 ns/op
BenchmarkAtomicStoreInt32
BenchmarkAtomicStoreInt32-4       230253900             5.159 ns/op

This time, atomic.StoreInt32 has better results. What happened?

In the case of microbenchmarks, many factors can affect the results, such as machine activity while running the benchmark, power management, cooling, and better cache alignment of instruction sequences. We must remember that many factors, even outside the scope of our Go project, can affect the outcome.

NOTE We should make sure the machine running the benchmark is idle. However, external processes may run in the background, which may affect the benchmark results. For that reason, tools such as perflock can limit how much CPU a benchmark consumes. For example, we can run a benchmark using 70% of the total available CPU, giving 30% to the operating system and other processes and reducing the impact of machine activity on the results.
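For instance, assuming perflock (github.com/aclements/perflock) is installed, a benchmark could be run under such a CPU limit like this (the exact flag syntax follows the tool's README):

$ perflock -governor 70% go test -bench=.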

One option is to increase the benchmark time using the -benchtime option. Similar to the law of large numbers in probability theory, if we run the benchmark a large number of times, it should tend to approach its expected value (assuming we ignore the benefits of instruction caching and similar mechanisms).

Another option is to use external tools on top of the classic benchmark tooling. For instance, benchstat, which is part of the golang.org/x repository, allows us to compute and compare statistics about benchmark executions.

Let's run the benchmark 10 times using the -count option and pipe the output to a specific file:

$ go test -bench=. -count=10 | tee stats.txt
cpu: Intel(R) Core(TM) i5-7360U CPU @ 2.30GHz
BenchmarkAtomicStoreInt32-4     234935682                5.124 ns/op
BenchmarkAtomicStoreInt32-4     235307204                5.112 ns/op
// ...
BenchmarkAtomicStoreInt64-4     235548591                5.107 ns/op
BenchmarkAtomicStoreInt64-4     235210292                5.090 ns/op
// ...

Then we can run benchstat against this file:

$ benchstat stats.txt
name                time/op
AtomicStoreInt32-4  5.10ns ± 1%
AtomicStoreInt64-4  5.10ns ± 1%

The results are the same: both functions take on average 5.10 nanoseconds to complete. We also see the percent variation between the executions of a given benchmark: ± 1%. This metric tells us that both benchmarks are stable, giving us more confidence in the computed averages. Therefore, instead of concluding that atomic.StoreInt32 is faster or slower, we can conclude that its execution time is similar to that of atomic.StoreInt64 for the usage we tested (on a specific machine with a specific Go version).

In general, we should be cautious about micro-benchmarks. Many factors can significantly affect the results and potentially lead to wrong assumptions. Increasing the benchmark time or repeating the benchmark executions and computing statistics with tools such as benchstat can be an efficient way to limit external factors, get more accurate results, and draw better conclusions.

We would also emphasize that we should be careful when using the results of microbenchmarks performed on a given machine if another system ends up running the app. Production systems can behave very differently than the systems on which we run microbenchmarks.

11.8.3 Not paying attention to compiler optimizations

Another common mistake related to writing benchmarks is being fooled by compiler optimizations, which can also lead to wrong benchmark assumptions. In this section, we look at a population count function (a function that counts the number of bits set to 1) from Go issue 14813 (github.com/golang/go/issues/14813), also discussed by Go project member Dave Cheney:

const m1 = 0x5555555555555555
const m2 = 0x3333333333333333
const m4 = 0x0f0f0f0f0f0f0f0f
const h01 = 0x0101010101010101

func popcnt(x uint64) uint64 {
    
    
    x -= (x >> 1) & m1
    x = (x & m2) + ((x >> 2) & m2)
    x = (x + (x >> 4)) & m4
    return (x * h01) >> 56
}

This function takes and returns a uint64. To benchmark this function, we can write the following code:

func BenchmarkPopcnt1(b *testing.B) {
    
    
    for i := 0; i < b.N; i++ {
    
    
        popcnt(uint64(i))
    }
}

However, if we execute this benchmark, we get surprisingly low results:

cpu: Intel(R) Core(TM) i5-7360U CPU @ 2.30GHz
BenchmarkPopcnt1-4      1000000000               0.2858 ns/op

A duration of 0.28 nanoseconds is about one clock cycle, so this number is unreasonably low. The problem is that developers aren't careful enough about compiler optimizations. In this case, the function under test is simple enough to be a candidate for inlining : this is an optimization that replaces the function call with the body of the called function, allows us to avoid function calls, and has a small memory footprint . Once the function is inlined, the compiler notices that the call has no side effects and replaces it with the following benchmark:

func BenchmarkPopcnt1(b *testing.B) {
    
    
    for i := 0; i < b.N; i++ {
    
    
        // Empty
    }
}

The benchmark is now empty - that's why we get close to a clock cycle result. To prevent this from happening, best practice is to follow the following pattern:

  1. On each loop iteration, assign the result to a local variable (the local variable in the reference function context).

  2. Assign the latest result to a global variable.

In our case, we wrote the following benchmarks:

var global uint64                         // ❶

func BenchmarkPopcnt2(b *testing.B) {
    
    
    var v uint64                          // ❷
    for i := 0; i < b.N; i++ {
    
    
        v = popcnt(uint64(i))             // ❸
    }
    global = v                            // ❹
}

❶ defines a global variable

❷ defines a local variable

❸ Assign the result to a local variable

❹ assign the result to a global variable

global is a global variable, whereas v is a local variable whose scope is the benchmark function. During each loop iteration, we assign the result of popcnt to the local variable. Then we assign the latest result to the global variable.

Note Why not popcntjust assign the result of the call globalto to simplify testing? Writing to a global variable is slower than writing to a local variable (we discuss these concepts in Bug #95 "Don't understand stack and heap"). Therefore, we should write each result to a local variable to limit the memory footprint during each loop iteration.

If we run these two benchmarks, we now get significantly different results:

cpu: Intel(R) Core(TM) i5-7360U CPU @ 2.30GHz
BenchmarkPopcnt1-4      1000000000               0.2858 ns/op
BenchmarkPopcnt2-4      606402058                1.993 ns/op

BenchmarkPopcnt2 is the accurate version of the benchmark. It guarantees that we avoid the inlining optimizations that could artificially lower the execution time or even remove the call to the function under test. Relying on the results of BenchmarkPopcnt1 could have led to wrong assumptions.

Let's remember the pattern that prevents compiler optimizations from fooling benchmark results: assign the result of the function under test to a local variable, then assign the latest result to a global variable. This best practice also prevents us from making incorrect assumptions.

11.8.4 Confused by the Observer Effect

In physics, the observer effect is the perturbation of the observed system by the act of observing. This effect can also be seen in benchmarks and can lead to incorrect assumptions about the results. Let's look at a concrete example and try to mitigate it.

We want to implement a function that receives a matrix of int64 elements. The matrix has a fixed number of 512 columns, and we want to compute the sum of the first eight columns, as shown in Figure 11.2.

Figure 11.2 Computing the sum of the first eight columns

For optimization, we also want to determine if changing the number of columns has an effect, so we also implement a second function, with 513 columns. The implementation is as follows:

func calculateSum512(s [][512]int64) int64 {
    var sum int64
    for i := 0; i < len(s); i++ {         // ❶
        for j := 0; j < 8; j++ {          // ❷
            sum += s[i][j]                // ❸
        }
    }
    return sum
}

func calculateSum513(s [][513]int64) int64 {
    // Same implementation as calculateSum512
}

❶ Iterate over each row

❷ Traverse the first eight columns

❸ Increments sum

We iterate over each row, then over the first eight columns, incrementing the sum variable that we return. The implementation of calculateSum513 remains the same.

We want to benchmark these functions to see which one is the most performant given a fixed number of rows:

const rows = 1000

var res int64

func BenchmarkCalculateSum512(b *testing.B) {
    var sum int64
    s := createMatrix512(rows)       // ❶
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        sum = calculateSum512(s)     // ❷
    }
    res = sum
}

func BenchmarkCalculateSum513(b *testing.B) {
    var sum int64
    s := createMatrix513(rows)       // ❸
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        sum = calculateSum513(s)     // ❹
    }
    res = sum
}

❶ Created a matrix with 512 columns

❷ Calculate the total

❸ A matrix with 513 columns is created

❹ Calculate the total

We want to create the matrices only once, to limit their impact on the results. Therefore, we call createMatrix512 and createMatrix513 outside the loop. We might expect the results to be similar, since we only iterate over the first eight columns, but this is not the case (on my machine):

cpu: Intel(R) Core(TM) i5-7360U CPU @ 2.30GHz
BenchmarkCalculateSum512-4        81854             15073 ns/op
BenchmarkCalculateSum513-4       161479              7358 ns/op

The second benchmark with 513 columns was about 50% faster. Again, since we only iterated the first eight columns, this result is rather surprising.

To understand this difference, we need to understand the basics of CPU caches. In a nutshell, a CPU is composed of different caches (usually L1, L2, and L3). These caches lower the average cost of accessing data from main memory. Under some conditions, the CPU can fetch data from main memory and copy it to L1. In this case, the CPU tries to copy into L1 the subset of the matrix that calculateSum is interested in (the first eight columns of each row). However, in one case (513 columns) this data fits in the cache, whereas in the other (512 columns) it doesn't.

Note that explaining why is beyond the scope of this chapter; we look at it in Bug #91, "Not understanding CPU caches."

Back to the benchmark, the main problem is that we reuse the same matrix in both cases. Because the function is repeated thousands of times, we don't measure its execution when it receives a brand-new matrix. Instead, we measure a function that receives a matrix whose cells of interest are already partly present in the cache. Therefore, because calculateSum513 leads to fewer cache misses, it has a better execution time.

This is an example of the observer effect. Because we've been observing a CPU-bound function that is called repeatedly, CPU caches may come into play and significantly affect the results. In this example, to prevent this effect, we should create a matrix per test session instead of reusing one:

func BenchmarkCalculateSum512(b *testing.B) {
    var sum int64
    for i := 0; i < b.N; i++ {
        b.StopTimer()
        s := createMatrix512(rows)     // ❶
        b.StartTimer()
        sum = calculateSum512(s)
    }
    res = sum
}

❶ A new matrix is created on each loop iteration

Now, a new matrix is created on each loop iteration. If we run the benchmark again (and tune benchtime; otherwise, it takes too long to execute), the results are much closer:

cpu: Intel(R) Core(TM) i5-7360U CPU @ 2.30GHz
BenchmarkCalculateSum512-4         1116             33547 ns/op
BenchmarkCalculateSum513-4          998             35507 ns/op

Instead of wrongly assuming that calculateSum513 is faster, we see that both benchmarks lead to similar results when they are fed a new matrix.
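For reference, because the benchmark now re-creates a matrix on every iteration, we can cap the number of iterations with the -benchtime flag so the run stays short (the benchmark name filter and the iteration count below are only an example):

$ go test -bench=BenchmarkCalculateSum -benchtime=1000x .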

As we saw in this section, because we reused the same matrix, CPU caches significantly impacted the results. To prevent this, we had to create a new matrix during each loop iteration. In general, we should remember that observing a function under test may lead to significant differences in the results, especially in the context of micro-benchmarks of CPU-bound functions where low-level optimizations matter. Forcing a benchmark to re-create data during each iteration is a good way to prevent this effect.

In the last section of this chapter, let's look at some common techniques related to Go testing.

11.9 #90: Not exploring all the Go testing features

When writing tests, developers should know about Go's specific testing features and options. Otherwise, the testing process may be less accurate and even less efficient. This section discusses topics that can make us more comfortable while writing Go tests.

11.9.1 Code coverage

During the development process, it can be handy to see visually which parts of our code are covered by tests. We can access this information using the -coverprofile flag:

$ go test -coverprofile=coverage.out ./...

This command creates a coverage.out file that we can then open using go tool cover:

$ go tool cover -html=coverage.out

This command opens a web browser and shows the coverage for each line of code.

By default, code coverage is analyzed only for the package being tested. For example, suppose we have the following structure:

/myapp
  |_ foo
    |_ foo.go
    |_ foo_test.go
  |_ bar
    |_ bar.go
    |_ bar_test.go

If some part of foo.go is tested only in bar_test.go, by default it won't be shown in the coverage report. To include it, we have to be in the myapp folder and use the -coverpkg flag:

go test -coverpkg=./... -coverprofile=coverage.out ./...

We need to remember this feature to see the current code coverage and decide which parts deserve more tests.

Note: Remain cautious when it comes to tracking code coverage. Having 100% test coverage doesn't mean a bug-free application. Properly reasoning about what our tests cover matters more than any static threshold.

11.9.2 Testing from a different package

When writing unit tests, one approach is to focus on behavior rather than internals. Suppose we expose an API to clients. We probably want our tests to focus on things that are visible from the outside, rather than implementation details. That way, if the implementation changes (for example, if we refactor one function into two), the tests will remain the same. They're also easier to understand because they show how our API is used. If we want to enforce this practice, we can use a different package.

In Go, all the files in a folder should belong to the same package, with one exception: a test file can belong to a _test package. For example, suppose the following counter.go source file belongs to the counter package:

package counter

import "sync/atomic"

var count uint64

func Inc() uint64 {
    atomic.AddUint64(&count, 1)
    return count
}

Test files can live in the same package and access internals such as the count variable. Or they can live in a counter_test package, like this counter_test.go file:

package counter_test

import (
    "testing"

    "myapp/counter"
)

func TestCount(t *testing.T) {
    if counter.Inc() != 1 {
        t.Errorf("expected 1")
    }
}

In this case, the test is implemented in an external package and cannot access internals such as the count variable. Using this practice, we guarantee that a test doesn't use any unexported elements; hence, it focuses on testing the exposed behavior.

11.9.3 Utility functions

When writing tests, we can handle errors differently than we do in production code. For example, suppose we want to test a function that takes a Customer struct as an argument. Because the creation of a Customer will be reused, we decide to create a specific createCustomer function for testing purposes. This function returns a possible error alongside a Customer:

func TestCustomer(t *testing.T) {
    customer, err := createCustomer("foo")     // ❶
    if err != nil {
        t.Fatal(err)
    }
    // ...
}

func createCustomer(someArg string) (Customer, error) {
    // Create customer
    if err != nil {
        return Customer{}, err
    }
    return customer, nil
}

❶ Creates a Customer and checks for errors

We create a customer using the createCustomer utility function, and then we perform the rest of the test. However, in the context of testing functions, we can simplify error management by passing the *testing.T variable to the utility function:

func TestCustomer(t *testing.T) {
    customer := createCustomer(t, "foo")     // ❶
    // ...
}

func createCustomer(t *testing.T, someArg string) Customer {
    // Create customer
    if err != nil {
        t.Fatal(err)                         // ❷
    }
    return customer
}

❶ Calls the utility function and provides t

❷ Fails the test directly if we can't create a Customer

Instead of returning an error, createCustomer fails the test directly if it can't create a Customer. This makes TestCustomer smaller to write and easier to read.

Let's keep in mind this practice regarding error management and testing to improve our tests.

11.9.4 Setup and teardown

In some cases, we may have to prepare a testing environment. For example, in an integration test, we spin up a specific Docker container and then stop it. We can call setup and teardown functions per test or per package. Fortunately, in Go, both are possible.

To do so per test, we can call a setup function as a pre-action and a teardown function using defer:

func TestMySQLIntegration(t *testing.T) {
    setupMySQL()
    defer teardownMySQL()
    // ...
}

It's also possible to register a function to be executed at the end of a test. For example, let's assume TestMySQLIntegration needs to call createConnection to create a database connection. If we want this function to also include the teardown part, we can register a cleanup function using t.Cleanup:

func TestMySQLIntegration(t *testing.T) {
    // ...
    db := createConnection(t, "tcp(localhost:3306)/db")
    // ...
}

func createConnection(t *testing.T, dsn string) *sql.DB {
    db, err := sql.Open("mysql", dsn)
    if err != nil {
        t.FailNow()
    }
    t.Cleanup(          // ❶
        func() {
            _ = db.Close()
        })
    return db
}

❶ Registers a function to be executed at the end of the test

The closure provided to t.Cleanup is executed at the end of the test. This makes future unit tests easier to write because they aren't responsible for closing the db variable.

Note that we can register multiple cleanup functions. In that case, they are executed just as if we were using defer: last in, first out.
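Here is a minimal sketch to illustrate that last-in, first-out order (the test name and log messages are only illustrative):

func TestCleanupOrder(t *testing.T) {
    t.Cleanup(func() { t.Log("registered first, runs last") })
    t.Cleanup(func() { t.Log("registered second, runs first") })
    // ...
}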

To handle setup and teardown per package, we have to use the TestMain function. Here is a simple implementation of TestMain:

func TestMain(m *testing.M) {
    os.Exit(m.Run())
}

This particular function accepts a *testing.M argument that exposes a single Run method to run all the tests. Therefore, we can surround this call with setup and teardown functions:

func TestMain(m *testing.M) {
    setupMySQL()                 // ❶
    code := m.Run()              // ❷
    teardownMySQL()              // ❸
    os.Exit(code)
}

❶ Sets up MySQL

❷ Runs the tests

❸ Tears down MySQL

This code starts MySQL once before all the tests and then tears it down.

Using these practices to add setup and teardown functions, we can configure a complex environment for our tests.

Summary

  • Categorizing tests using build flags, environment variables, or short patterns makes the testing process more efficient. You can use build flags or environment variables to create test categories (for example, unit tests vs. integration tests) and distinguish between short-running and long-running tests to decide which tests to execute.

  • Enabling the -race flag is strongly recommended when writing concurrent applications. Doing so allows you to catch potential data races that could lead to software bugs.

  • Using the -parallel flag is an efficient way to speed up tests, especially long-running ones.

  • Use the -shuffle flag to help ensure that a test suite doesn't rely on wrong assumptions that could hide bugs.

  • Table-driven tests are an effective way to group a set of similar tests to prevent code duplication and make future updates easier to handle.

  • Use sync to avoid sleeps to make tests less volatile and more robust. If synchronization is not possible, consider a retry approach.

  • Understanding how to use the time API handler functions is another way to make your tests less error-prone. You can use standard techniques like handling the time as part of a hidden dependency, or requiring the client to provide the time.

  • The httptest package is helpful for HTTP applications. It provides a set of utilities to test both clients and servers.

  • The iotest package helps write io.Reader implementations and test that an application is tolerant to errors.

  • About benchmarks:

    • Use time methods to preserve the accuracy of a benchmark.
    • Increasing benchtime or using tools such as benchstat can be helpful when dealing with micro-benchmarks.
    • Be careful with micro-benchmark results if you end up running your application on a different system than the system you ran the micro-benchmark on.
    • Make sure the function under test leads to a side effect, to prevent compiler optimizations from fooling you with benchmark results.
    • To prevent the observer effect, force a benchmark to re-create the data used by a CPU-bound function.
  • Use code coverage with the -coverprofile flag to quickly see which parts of your code need more attention.

  • Put unit tests in a different package to force writing tests that focus on exposed behavior rather than internals.

  • Handling errors via the *testing.T variable instead of the classic if err != nil makes code shorter and easier to read.

  • You can use the setup and teardown functions to configure a complex environment, such as in the case of integration tests.

12. Optimization

This chapter covers

  • Investigate the concept of mechanical empathy
  • Understand the heap and stack and reduce allocations
  • Use standard Go diagnostic tools
  • Understand how the garbage collector works
  • Understand how Go runs inside Docker and Kubernetes

Before we start this chapter, a disclaimer: In most cases, it is better to write readable, clear code than to write optimized but more complicated and harder to understand code. Optimization often comes at a cost, and we recommend you follow this famous quote from software engineer Wes Dyer:

Make it right, make it clear, make it concise, make it fast, in that order.

This does not mean that optimizing your application for speed and efficiency is prohibited. For example, we can try to identify code paths that need to be optimized because it is necessary to do so, such as to keep our customers happy or to reduce our costs. In this chapter, we discuss common optimization techniques; some are specific and some are not. We also discuss ways to identify bottlenecks so we don't work blind.

12.1 #91: Not understanding CPU caches

You don't have to be an engineer to be a race car driver, but you do have to have mechanical empathy.

- Jackie Stewart, three-time F1 world champion

In short, when we understand how a system is designed to be used, whether it's an F1 car, an airplane or a computer, we can stay consistent with the design for optimal performance. In this section, we discuss some concrete examples where mechanical empathy for how CPU caches work can help us optimize Go applications.

12.1.1 CPU Architecture

Let's first go over the basics of CPU architecture and why CPU caches are important. We'll use the Intel Core i5-7300 as an example.

Modern CPUs rely on caches to speed up memory access, in most cases through three cache levels: L1, L2, and L3. On an i5-7300, the sizes of these caches are as follows:

  • L1: 64 KB

  • L2: 256 KB

  • L3: 4 MB

The i5-7300 has two physical cores but four logical cores (also known as virtual cores or threads). In the Intel family, dividing one physical core into multiple logical cores is called hyperthreading.

Figure 12.1 gives an overview of the Intel Core i5-7300 (Tn represents thread n). Each physical core (core 0 and core 1) is divided into two logical cores (thread 0 and thread 1). The L1 cache is divided into two sub-caches: L1D for data and L1I for instructions (32 KB each). Caching isn't only about data: when a CPU executes an application, it can also cache some instructions, for the same reason — to speed up overall execution.

Figure 12.1 The i5-7300 has L3 cache, two physical cores, and four logical cores.

The closer the memory location is to the logical core, the faster the access (see mng.bz/o29v):

  • L1: About 1 nanosecond

  • L2: about 4 times slower than L1

  • L3: about 10 times slower than L1

The physical location of the CPU cache can also explain these differences. L1 and L2 are called on-die , which means they are on the same piece of silicon as the rest of the processor. In contrast, L3 is off-chip , which partly explains the difference in latency compared to L1 and L2.

Regarding main memory (or RAM), average accesses are between 50 and 100 times slower than L1. We could access up to 100 variables stored on L1 for the price of a single access to main memory. Therefore, as Go developers, one avenue for improvement is making sure our applications use the CPU caches.

12.1.2 Cache Lines

Understanding the concept of a cache line is crucial. But before introducing what they are, let's understand why we need them.

When a specific memory location is accessed (for example, by reading a variable), one of the following things may happen in the near future:

  • The same location will be referenced again.

  • A nearby storage location will be referenced.

The former refers to temporal locality, and the latter refers to spatial locality. Both are part of a principle known as locality of reference.

For example, let's look at the following sum function that computes the total of a slice of int64 elements:

func sum(s []int64) int64 {
    var total int64
    length := len(s)
    for i := 0; i < length; i++ {
        total += s[i]
    }
    return total
}

In this example, temporal locality applies to multiple variables: i, length, and total. Throughout the iteration, we keep accessing these variables. Spatial locality applies to the code instructions and to the slice s. Because a slice is backed by an array allocated contiguously in memory, accessing s[0] in this case also means accessing s[1], s[2], and so on.

Temporal locality is part of why we need CPU caches: to speed up repeated accesses to the same variables. However, because of spatial locality, the CPU copies what we call a cache line instead of copying a single variable from main memory to a cache.

A cache line is a contiguous memory segment of a fixed size, usually 64 bytes (eight int64 variables). Whenever a CPU decides to cache a memory block from RAM, it copies the memory block to a cache line. Because memory is a hierarchy, when the CPU wants to access a specific memory location, it first checks L1, then L2, then L3, and finally, if the location isn't present in those caches, main memory.

Let's illustrate fetching memory blocks with a concrete example. We call the sum function for the first time with a slice of 16 int64 elements. When sum accesses s[0], this memory address isn't in the cache yet. If the CPU decides to cache this variable (we also discuss this decision later in the chapter), it copies the whole memory block; see Figure 12.2.

Figure 12.2 Accessing s[0] makes the CPU copy the 0x000 memory block.

At first, accessing s[0] results in a cache miss because the address isn't in the cache. This kind of miss is called a compulsory miss. However, because the CPU fetched the 0x000 memory block, accessing elements 1 to 7 results in cache hits. The same logic applies when sum accesses s[8] (see Figure 12.3).

Figure 12.3 Accessing s[8] makes the CPU copy the 0x100 memory block.

Again, accessing s[8] results in a compulsory miss. But copying the 0x100 memory block into a cache line also speeds up accesses to elements 9 to 15. In the end, iterating over the 16 elements results in 2 compulsory cache misses and 14 cache hits.

CPU caching strategies

You may wonder about the exact strategy the CPU follows when it copies a memory block. For example, does it copy a block to all the levels? Only to L1? In that case, what about L2 and L3?

We have to know that different strategies exist. Sometimes caches are inclusive (for example, L2 data is also present in L3), and sometimes caches are exclusive (for example, L3 is called a victim cache because it contains only data evicted from L2).

In general, these strategies are hidden by CPU vendors and aren't necessarily useful to know. So we won't dig deeper into these questions.

Let's look at a concrete example to illustrate how fast CPU caches are. We will implement two functions that compute a total while iterating over a slice of int64 elements. In one case, we will iterate over every two elements, and in the other, over every eight elements:

func sum2(s []int64) int64 {
    var total int64
    for i := 0; i < len(s); i += 2 {     // ❶
        total += s[i]
    }
    return total
}

func sum8(s []int64) int64 {
    var total int64
    for i := 0; i < len(s); i += 8 {     // ❷
        total += s[i]
    }
    return total
}

❶ Iterates over every two elements

❷ Iterates over every eight elements

Apart from the iteration step, these two functions are identical. If we benchmark them, our intuition may be that the second version will be about four times faster because it has about four times fewer elements to increment. However, running the benchmark shows that sum8 is only about 10% faster on my machine: still faster, but only by 10%.
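A minimal benchmark sketch for this comparison, following the side-effect pattern from Bug #89 (the slice length and the sink variable name are assumptions for the example):

var sink int64

func BenchmarkSum2(b *testing.B) {
    s := make([]int64, 1_000_000)
    b.ResetTimer()
    var local int64
    for i := 0; i < b.N; i++ {
        local = sum2(s)
    }
    sink = local
}

func BenchmarkSum8(b *testing.B) {
    s := make([]int64, 1_000_000)
    b.ResetTimer()
    var local int64
    for i := 0; i < b.N; i++ {
        local = sum8(s)
    }
    sink = local
}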

The reason has to do with cache lines. We saw that a cache line is usually 64 bytes, containing up to eight int64 variables. Here, the running time of these loops is dominated by memory accesses, not by the increment instruction. In the first case, three out of four accesses result in a cache hit. Therefore, the execution-time difference between the two functions isn't significant. This example shows why cache lines matter and how easily we can be fooled by our intuition if we lack mechanical empathy — in this case, about how the CPU caches data.

Let's move on to locality of reference and look at a concrete example using spatial locality.

12.1.3 Slice of structs vs. struct of slices

This section looks at an example that compares the execution times of two functions. The first takes as an argument a slice of structs and sums all the a fields:

type Foo struct {
    a int64
    b int64
}

func sumFoo(foos []Foo) int64 {          // ❶
    var total int64
    for i := 0; i < len(foos); i++ {     // ❷
        total += foos[i].a
    }
    return total
}

❶ Receives a slice of Foo

❷ Iterates over each Foo and sums each a field

sumFoo receives a slice of Foo and increments total by reading each a field.

The second function also calculates the sum. But this time, the argument is a struct containing slices:

type Bar struct {
    a []int64                           // ❶
    b []int64
}

func sumBar(bar Bar) int64 {            // ❷
    var total int64
    for i := 0; i < len(bar.a); i++ {   // ❸
        total += bar.a[i]               // ❹
    }
    return total
}

❶ a and b are now slices.

❷ Receives a single struct

❸ Iterates over a

❹ Increments total

sumBar receives a Bar struct containing two slices: a and b. It iterates over each element of a to increment total.

Do we expect any difference in speed between these two functions? Before running the benchmark, let's visualize the differences in memory in Figure 12.4. The amount of data is the same in both cases: 16 Foo elements in the slice and 16 elements in the slices of Bar. Each black bar represents an int64 read to compute the sum, whereas each gray bar represents an int64 that is skipped.

Figure 12.4 Slices are more compact, so fewer cache lines need to be iterated over.

In the case of sumFoo, we receive a slice of structs containing two fields, a and b. Therefore, we have a succession of a and b in memory. Conversely, in the case of sumBar, we receive a struct containing two slices, a and b. Therefore, all the elements of a are allocated contiguously.

This difference doesn't lead to any change regarding memory compaction. But the goal of both functions is to iterate over each a, and doing so requires four cache lines in one case and only two in the other.

If we benchmark both functions, sumBar is faster (about 20% on my machine). The main reason is better spatial locality, which makes the CPU fetch fewer cache lines from memory.

This example demonstrates how spatial locality can have a significant impact on performance. To optimize applications, we should organize data to get the most value out of each individual cache line.

But is using spatial locality enough to help the CPU? We are still missing one key feature: predictability.

12.1.4 Predictability

Predictability refers to the ability of a CPU to anticipate what an application wants to do, in order to speed up its execution. Let's look at a concrete example where a lack of predictability negatively impacts application performance.

Again, let's look at two functions that sum a list of elements. The first iterates over a linked list and sums all the values:

type node struct {             // ❶
    value int64
    next  *node
}

func linkedList(n *node) int64 {
    var total int64
    for n != nil {             // ❷
        total += n.value       // ❸
        n = n.next
    }
    return total
}

❶ Linked list data structure

❷ Iterate over each node

❸ Increments total

This function takes a linked list, traverses it, and increments a total.

On the other hand, let's again take the sum2 function, which iterates over a slice, reading one element out of two:

func sum2(s []int64) int64 {
    var total int64
    for i := 0; i < len(s); i += 2 {     // ❶
        total += s[i]
    }
    return total
}

❶ Iterate every two elements

Let's assume the linked list is allocated contiguously: for example, by a single allocation. On 64-bit architectures, a word is 64 bits long. Figure 12.5 compares the two data structures (linked list and slice) that the functions receive; the darker bars represent the int64 elements we use to increment the total.

Figure 12.5 In memory, linked lists and slices are compressed in a similar fashion.

In both examples, we face similar compaction. Because the linked list is a succession of values and 64-bit pointer elements, we use one out of every two elements to increment the sum. Meanwhile, the sum2 example reads only one out of every two elements.

The two data structures have the same spatial locality, so we might expect a similar execution time for both functions. But the function iterating over the slice is significantly faster (about 70% faster on my machine). What's the reason?

To understand this, we have to discuss the concept of striding. Striding relates to how the CPU works through data. There are three different types of strides (see Figure 12.6):

  • Unit stride -- all the values we want to access are allocated contiguously: for example, a slice of int64 elements. This stride is predictable for the CPU and the most efficient, because it requires the minimum number of cache lines to walk the elements.

  • Constant stride -- still predictable for the CPU: for example, a slice that is iterated over every two elements. This stride requires more cache lines to walk the data, so it is less efficient than a unit stride.

  • Non-unit stride -- a stride the CPU cannot predict: for example, a linked list or a slice of pointers. Because the CPU doesn't know whether the data is allocated contiguously, it won't fetch any cache lines ahead of time.

Figure 12.6 The three types of strides

For sum2, we face a constant stride. However, for the linked list, we face a non-unit stride. Even though we know the data is allocated contiguously, the CPU doesn't know that. Therefore, it cannot predict how to walk the linked list.

Because of the different strides and similar spatial locality, iterating over a linked list is significantly slower than iterating over a slice of values. We generally prefer a unit stride over a constant stride because of the better spatial locality. But a non-unit stride cannot be predicted by the CPU regardless of how the data is allocated, leading to negative performance impacts.

So far, we have discussed that CPU caches are fast but significantly smaller than main memory. Therefore, the CPU needs a strategy to fetch memory blocks into cache lines. This strategy is called the cache placement policy, and it can significantly impact performance.

12.1.5 Cache placement strategy

In Bug #89, "Writing inaccurate benchmarks," we discussed an example with a matrix in which we had to sum the first eight columns. At that point, we didn't explain why changing the total number of columns impacted the benchmark results. It might sound counterintuitive: because we only need to read the first eight columns, why does changing the total number of columns affect the execution time? Let's look at the reason in this section.

As a reminder, the implementation is as follows:

func calculateSum512(s [][512]int64) int64 {     // ❶
    var sum int64
    for i := 0; i < len(s); i++ {
        for j := 0; j < 8; j++ {
            sum += s[i][j]
        }
    }
    return sum
}

func calculateSum513(s [][513]int64) int64 {     // ❷
    // Same implementation as calculateSum512
}

❶ Receive a matrix of 512 columns

❷ Receive a matrix with 513 columns

We iterate over each row, summing the first eight columns each time. When these two functions are benchmarked with a new matrix created every time, we don't observe any difference. However, if we keep reusing the same matrix, calculateSum513 is about 50% faster on my machine. The reason lies in CPU caches and how a memory block is copied to a cache line. Let's examine this to understand the difference.

When the CPU decides to copy a block of memory and put it in the cache, it must follow a certain strategy. Assuming the L1D cache is 32 KB and the cache line is 64 bytes, if a block is randomly placed in L1D, the CPU will have to iterate 512 cache lines to read a variable in the worst case. This kind of cache is called fully associative .

To improve the speed of accessing an address in a CPU cache, designers created different policies for cache placement. Let's skip the history and discuss the most widely used option today: the set-associative cache, which relies on partitioning a cache into sets.

To make the following diagram clearer, we will simplify the problem:

  • We assume that the L1D cache is 512 bytes (8 cache lines).

  • The matrix consists of 4 rows and 32 columns, we will only read the first 8 columns.

Figure 12.7 shows how this matrix is stored in memory. We will use the binary representation of the memory block addresses. Also, the gray blocks represent the first 8 int64 elements we want to iterate over. The remaining blocks are skipped during the iteration.

Figure 12.7 A matrix stored in memory, and an empty cache for execution

Each memory block contains 64 bytes and hence 8 int64 elements. The first memory block starts at 0x000000000000, the second at 0001000000000 (512 in binary), and so on. We also show the cache, which can hold 8 lines.

Note that, as we will see in Bug #94, "Not being aware of data alignment," a slice doesn't necessarily start at the beginning of a block.

With the set-associative cache policy, a cache is partitioned into sets. We assume the cache is two-way set associative, meaning each set contains two lines. A memory block can belong to only one set, and its placement is determined by its memory address. To understand this, we have to decompose the memory block address into three parts:

  • The block offset is based on the block size. Here the block size is 512 bytes, and 512 equals 2^9. Therefore, the first 9 bits of the address represent the block offset (BO).

  • The set index indicates the set an address belongs to. Because the cache is two-way set associative and contains 8 lines, we have 8 / 2 = 4 sets. Furthermore, 4 equals 2^2, so the next two bits represent the set index (SI).

  • The remainder of the address consists of tag bits (TB). In Figure 12.7, for simplicity, we use 13 bits to represent an address. To calculate TB, we use 13 - BO - SI. This means that the remaining two bits represent the tag bits.

Suppose the function starts and tries to read s[0][0], whose address is 000000000000. Because this address isn't in the cache yet, the CPU computes its set index and copies the block into the corresponding cache set (Figure 12.8).

Figure 12.8 Memory address 000000000000 is copied to set 0.

As mentioned, 9 bits represent the block offset; they form the minimum common address prefix of each memory block. Then, 2 bits represent the set index. With the address 000000000000, SI equals 00. Therefore, this memory block is copied to set 0.
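As a small illustration of this decomposition (a sketch written only for this simplified 13-bit scheme, not code from the book), the three parts can be extracted with bit operations:

// Decomposes a 13-bit address into block offset, set index, and tag bits,
// assuming a 9-bit block offset and a 2-bit set index, as described above.
func decompose(addr uint16) (bo, si, tb uint16) {
    bo = addr & 0x1FF      // lowest 9 bits: block offset
    si = (addr >> 9) & 0x3 // next 2 bits: set index
    tb = addr >> 11        // remaining 2 bits: tag bits
    return bo, si, tb
}

For the address 000000000000, decompose returns a set index of 00, which is why the block lands in set 0.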

When the function reads from s[0][1] to s[0][7], the data is already in the cache. How does the CPU know? It computes the starting address of the memory block, derives the set index and the tag bits, and then checks whether tag 00 is present in set 0.

Next, the function reads s[1][0], and this address isn't cached yet. So the same operation happens to copy memory block 0100000000000 (Figure 12.9).

Figure 12.9 Memory address 0100000000000 is copied to set 0.

This memory block has a set index equal to 00, so it also belongs to set 0. The cache line is copied to the next available line within set 0. Then, again, reading from s[1][1] to s[1][7] results in cache hits.

Now things are getting interesting. The function reads s[2][0], and this address isn't in the cache. The same operation is performed (Figure 12.10).

Figure 12.10 Memory address 1000000000000 replaces an existing cache line in set 0.

The set index is again equal to 00. However, set 0 is full; what does the CPU do? Copy the memory block into another set? No. The CPU replaces one of the existing cache lines to copy memory block 1000000000000.

The cache replacement policy depends on the CPU, but it's usually a pseudo-LRU policy (a real LRU (least recently used) would be too complex to handle). In this case, let's assume it replaces our first cache line: 000000000000. This situation repeats while iterating over row 3: memory address 1100000000000 also has a set index equal to 00, leading to the replacement of an existing cache line.

Now, let's say the benchmark executes the function with a slice pointing to the same matrix starting at address 000000000000. When the function reads s[0][0], the address isn't in the cache. This block has been replaced.

Instead of benefiting from the CPU cache from one execution to the next, the benchmark leads to more cache misses. This type of cache miss is called a conflict miss: a miss that wouldn't occur if the cache wasn't partitioned. All the variables we iterate over belong to memory blocks whose set index is 00. Therefore, we use only one cache set instead of having the data distributed across the whole cache.

Earlier we discussed the concept of strides, which define how the CPU works through our data. In this example, such a stride is called a critical stride: it leads to accessing memory addresses with the same set index that are hence stored in the same cache set.

Let's come back to the real-world example with the two functions calculateSum512 and calculateSum513. The benchmarks were run on a 32 KB, eight-way set-associative L1D cache: 64 sets in total. Because a cache line is 64 bytes, the critical stride equals 64 × 64 bytes = 4 KB. Four KB of int64 types represents 512 elements. Therefore, we reach a critical stride with a matrix of 512 columns, so we have a poor cache distribution. Conversely, a matrix of 513 columns doesn't lead to a critical stride. This is why we observed such a massive difference between the two benchmarks.

In conclusion, we must be aware that modern caches are partitioned. Depending on the stride, in some cases only one set is used, which can harm application performance and lead to conflict misses. This kind of stride is called a critical stride. For performance-intensive applications, we should avoid critical strides to get the most out of the CPU cache.

Note that this example also highlights why we should be careful about the results of micro-benchmarks executed on a system different from production. Performance may vary significantly if the production system has a different caching architecture.

Let's move on to the CPU cache impact. This time, we see concrete effects when writing concurrent code.

12.2 #92: Writing concurrent code that leads to false sharing

So far, we have discussed the fundamental concepts of CPU caches. We have seen that some specific caches (typically L1 and L2) are not shared among all the logical cores but are specific to a physical core. This specificity has some concrete impacts, such as concurrency and the concept of false sharing, which can lead to a significant performance decrease. Let's look at what false sharing is via an example and then see how to prevent it.

In this example, we use two structs, Input and Result:

type Input struct {
    a int64
    b int64
}

type Result struct {
    sumA int64
    sumB int64
}

The goal is to implement a count function that receives a slice of Input and computes the following:

  • The sum of all the Input.a fields into Result.sumA

  • The sum of all the Input.b fields into Result.sumB

For the sake of the example, we implement a concurrent solution with one goroutine computing sumA and another computing sumB:

func count(inputs []Input) Result {
    wg := sync.WaitGroup{}
    wg.Add(2)

    result := Result{}                        // ❶

    go func() {
        for i := 0; i < len(inputs); i++ {
            result.sumA += inputs[i].a        // ❷
        }
        wg.Done()
    }()

    go func() {
        for i := 0; i < len(inputs); i++ {
            result.sumB += inputs[i].b        // ❸
        }
        wg.Done()
    }()

    wg.Wait()
    return result
}

❶ Initializes the Result struct

❷ Computes sumA

❸ Computes sumB

We spin up two goroutines: one iterating over each a field and another over each b field. This example is fine from a concurrency perspective. For instance, it doesn't lead to a data race, because each goroutine increments its own variable. But this example illustrates the concept of false sharing, which degrades the expected performance.

Let's look at main memory (see Figure 12.11). Because sumA and sumB are allocated contiguously, in most cases (seven times out of eight), both variables end up in the same memory block.

Figure 12.11 In this example, sumA and sumB are part of the same memory block.

Now, let's assume the machine contains two cores. In most cases, we should eventually have the two threads scheduled on different cores. So if the CPU decides to copy this memory block to a cache line, it is copied twice (Figure 12.12).

Figure 12.12 Each block is copied to a cache line on core 0 and core 1.

Both cache lines are replicated because L1D (L1 data) is per core. Recall that in our example, each goroutine updates its own variable: sumA on one side, and sumB on the other (Figure 12.13).

Figure 12.13 Each goroutine updates its own variables.

Because these cache lines are replicated, one of the CPU's goals is to guarantee cache coherency. For example, if one goroutine updates sumA and another reads sumA (after some synchronization), we expect our application to get the latest value.

However, our example doesn't do that. Both goroutines access their own variables, not shared variables. We might hope that the CPU knows this and understands it's not a conflict, but it doesn't. When we write to a variable in the cache, the granularity of CPU tracking is not the variable: it is the cache line.

When a cache line is shared across multiple cores and at least one goroutine is a writer, the whole cache line is invalidated. This happens even if the updates are logically independent (for example, sumA and sumB). This is the problem of false sharing, and it degrades performance.

Note that internally, the CPU uses the MESI protocol to guarantee cache coherency. It keeps track of each cache line, marking it as Modified, Exclusive, Shared, or Invalid (MESI).

One of the most important aspects to understand about memory and caches is that sharing memory across cores is not real - it's an illusion. This understanding comes from the fact that we don't think of machines as black boxes; instead, we try to generate mechanical empathy for the underlying layers.

So how do we tackle false sharing? There are two main solutions.

The first solution is to use the same approach we've shown but ensure that sumA and sumB don't belong to the same cache line. For example, we can update the Result struct to add padding between the fields. Padding is a technique to allocate extra memory. Because an int64 requires an 8-byte allocation and a cache line is 64 bytes long, we need 64 - 8 = 56 bytes of padding:

type Result struct {
    sumA int64
    _    [56]byte     // ❶
    sumB int64
}

❶ Padding

Figure 12.14 shows a possible memory allocation. Using padding, sumA and sumB will always be part of different memory blocks and hence different cache lines.

Figure 12.14 sumA and sumB are part of different memory blocks.

If we benchmark both solutions (with and without padding), we see that the padding solution is significantly faster (about 40% faster on my machine). This is an important improvement that results from adding padding between the two fields to prevent false sharing.

The second solution is to rework the structure of the algorithm. For example, instead of having both goroutines share the same struct, we can make them communicate their local results via channels. The benchmark result is roughly the same as with padding.
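A possible sketch of this channel-based variant (not necessarily the book's exact implementation): each goroutine accumulates into its own local variable and communicates only the final result, so no cache line is written to by two cores:

func countChan(inputs []Input) Result {
    chA := make(chan int64)
    chB := make(chan int64)

    go func() {
        var sumA int64
        for i := 0; i < len(inputs); i++ {
            sumA += inputs[i].a
        }
        chA <- sumA               // Communicates the local result once
    }()

    go func() {
        var sumB int64
        for i := 0; i < len(inputs); i++ {
            sumB += inputs[i].b
        }
        chB <- sumB
    }()

    return Result{sumA: <-chA, sumB: <-chB}
}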

In conclusion, we must remember that sharing memory across goroutines is an illusion at the lowest memory level. False sharing occurs if a cache line is shared between two cores when at least one goroutine is a writing thread. If we need to optimize an application that depends on concurrency, we should check whether false sharing is applicable, because this mode will reduce the performance of the application. We can prevent false sharing through padding or communication.

The next section discusses how CPUs execute instructions in parallel, and how to take advantage of this ability.

12.3 #93: Instruction-level parallelism is not considered

Instruction-level parallelism is another factor that can significantly affect performance. Before defining this concept, let's discuss a specific example and how to optimize it.

We will write a function that receives an array of two int64 elements. This function will iterate a given number of times (a constant). During each iteration, it will do the following:

  • Increments the first element of the array.

  • If the first element is even, increment the second element of the array.

Here is the Go version:

const n = 1_000_000

func add(s [2]int64) [2]int64 {
    for i := 0; i < n; i++ {       // ❶
        s[0]++                     // ❷
        if s[0]%2 == 0 {           // ❸
            s[1]++
        }
    }
    return s
}

❶ Iterates n times

❷ Increments s[0]

❸ If s[0] is even, increments s[1]

The instructions executed in the loop are shown in Figure 12.15 (one increment requires a read plus a write operation). The sequence of instructions is sequential: first we increment s[0]; then, before incrementing s[1], we have to read s[0] again.

Figure 12.15 Three main steps: Increment, Check, Increment

Note that this sequence of instructions does not match the granularity of assembly instructions. But for clarity, we use a simplified view.

Let's take a moment to discuss the theory behind instruction-level parallelism (ILP). Decades ago, CPU designers stopped focusing solely on clock speed to improve CPU performance. They developed multiple optimizations, including ILP, which allows a sequence of instructions to be executed in parallel. A processor implementing ILP within a single virtual core is called a superscalar processor. For example, Figure 12.16 shows a CPU executing an application composed of three instructions, I1, I2, and I3.

Executing a sequence of instructions requires different stages. In short, the CPU needs to decode the instructions and execute them. The execution is handled by execution units, which perform various operations and calculations.

Figure 12.16 Although written sequentially, the three instructions are executed in parallel.

In Figure 12.16, the CPU decides to execute these three instructions in parallel. Note that not all instructions have to complete in a single clock cycle. For example, an instruction to read a value already present in a register will complete in one clock cycle, but an instruction to read an address that must be fetched from main memory may take tens of clock cycles to complete.

If executed sequentially, this sequence of instructions would take the following time (the function t(x) represents the time the CPU takes to execute instruction x):

total time = t(I1) + t(I2) + t(I3)

Due to ILP, the total time is as follows:

total time = max(t(I1), t(I2), t(I3))

In theory, ILP looks like magic. But it also brings some challenges called hazards.

For example, what if I3 sets a variable to 42, but I2 is a conditional instruction (for example, if foo == 1)? In theory, this scenario should prevent executing I2 and I3 in parallel. This is called a control hazard or branching hazard. In practice, CPU designers solve control hazards using branch prediction.

For example, the CPU can compute that the condition has been true 99 times out of the last 100; hence, it will execute I2 and I3 in parallel. In case of a wrong prediction (say, I2 turns out to be false), the CPU flushes its current execution pipeline, making sure there is no inconsistency. This flush leads to a performance penalty of 10 to 20 clock cycles.

Other types of hazards prevent parallel execution of instructions. As software engineers, we should be aware of this. For example, let's consider the following two instructions that update registers (temporary storage areas used to perform operations):

  • I1 adds the numbers in registers A and B and stores the result in C.

  • I2 adds the numbers in registers C and D and stores the result in D.

Because I2 depends on the value of register C, which is the result of I1, the two instructions cannot be executed simultaneously. I1 must complete before I2. This is called a data hazard. To handle data hazards, CPU designers came up with a trick called forwarding, which basically bypasses writing to a register. This technique doesn't solve the problem but rather tries to mitigate its impact.

Note that there are also structural hazards, when at least two instructions in the pipeline need the same resource. As Go developers, we can't really influence these kinds of hazards, so we don't discuss them in this section.

Now that we have a decent understanding of ILP theory, let's go back to our original problem and focus on the contents of the loop:

s[0]++
if s[0]%2 == 0 {
    s[1]++
}

As we discussed, data hazards prevent instructions from being executed concurrently. Let's look at the sequence of instructions in Figure 12.17; this time we highlight the hazards between the instructions.

Figure 12.17 The types of hazards between the instructions

Because of the if statement, this sequence contains a control hazard. However, as discussed, optimizing the execution and predicting which branch to take is the CPU's domain. There are also multiple data hazards. As we discussed, data hazards prevent ILP from executing instructions in parallel. Figure 12.18 shows the sequence of instructions from an ILP perspective: the only independent instructions are the s[0] check and the s[1] increment, so these two instruction sets can be executed in parallel thanks to branch prediction.

Figure 12.18 Both increments are performed sequentially.

What about increments? Can we improve the code to reduce data hazards?

Let's write another version ( add2) that introduces a temporary variable:

func add(s [2]int64) [2]int64 {         // ❶
    for i := 0; i < n; i++ {
        s[0]++
        if s[0]%2 == 0 {
            s[1]++
        }
    }
    return s
}

func add2(s [2]int64) [2]int64 {        // ❷
    for i := 0; i < n; i++ {
        v := s[0]                       // ❸
        s[0] = v + 1
        if v%2 != 0 {
            s[1]++
        }
    }
    return s
}

❶ First version

❷ Second version

❸ Introduces a new variable to fix the value of s[0]

In this new version, we fix the value of s[0] in a new variable v. Previously, we incremented s[0] and then checked whether it was even. To replicate that behavior, because v is based on the value of s[0] before the increment, we now check whether v is odd in order to increment s[1].

Figure 12.19 compares the hazards of the two versions. The number of steps is the same. The significant difference is about the data hazards: the s[0] increment step and the v check step now depend on the same instruction (read s[0] into v).

Figure 12.19 The notable difference: the data hazards of the v check step

Why is this important? Because it allows the CPU to increase parallelism (Figure 12.20).

Figure 12.20 In the second version, the two incremental steps can be executed in parallel.

Despite having the same number of steps, the second version increases the number of steps that can be executed in parallel: three parallel paths instead of two. Meanwhile, the execution time should be optimized because the longest path has been reduced. If we benchmark these two functions, we notice a significant speed improvement for the second version (about 20% on my machine), mainly thanks to ILP.

Let's take a step back to close this section. We discussed how modern CPUs use parallelism to optimize the execution time of a set of instructions. We also looked at data hazards, which prevent instructions from being executed in parallel. And we optimized a Go example by reducing the number of data hazards, thereby increasing the number of instructions that can be executed in parallel.

Understanding how Go compiles our code to assembly and how to benefit from CPU optimizations such as ILP is another avenue for improvement. Here, introducing a temporary variable significantly improved performance. This example demonstrates how mechanical empathy can help us optimize Go applications.

Let's also remember to remain cautious about this kind of micro-optimization. Because the Go compiler keeps evolving, the assembly generated for an application may also change when the Go version changes.

The next section discusses the effects of data alignment.

12.4 #94: Not being aware of data alignment

Data alignment is a way of arranging how data is allocated to speed up memory accesses by the CPU. Not being aware of this concept can lead to extra memory consumption and even degraded performance. This section discusses the concept, where it applies, and techniques to prevent under-optimized code.

To understand how data alignment works, let's first discuss what can happen without it. Suppose we allocate two variables, an int32 (32 bits) and an int64 (64 bits):

var i int32
var j int64

Without data alignment, on a 64-bit architecture, these two variables could be allocated as shown in Figure 12.21. The j variable allocation would be spread over two words. If the CPU wanted to read j, it would require two memory accesses instead of one.

Figure 12.21 The allocation of j spread over two words

To prevent such a case, a variable's memory address should be a multiple of its own size. This is the concept of data alignment. In Go, the alignment guarantees are as follows:

  • byte, uint8, int8: 1 byte

  • uint16, int16: 2 bytes

  • uint32, int32, float32: 4 bytes

  • uint64, int64, float64, complex64: 8 bytes

  • complex128: 16 bytes

All these types are guaranteed to be aligned: their addresses are multiples of their sizes. For example, the address of any int32 variable is a multiple of 4.

Let's get back to the real world. Figure 12.22 shows two different cases of i and j being allocated in memory.

Figure 12.22 In both cases, j is aligned with its own size.

In the first case, a 32-bit variable was allocated just before i. Hence, i and j are allocated contiguously. In the second case, the variable allocated before i was not 32 bits (for example, it was a 64-bit variable); so i starts a new word. Given data alignment (j's address must be a multiple of 8), j cannot be allocated right next to i and is instead allocated at the next multiple of 8. The gray box represents 32 bits of padding.

Next, let's see when padding can become an issue. We will consider the following struct containing three fields:

type Foo struct {
    b1 byte
    i  int64
    b2 byte
}

We have a byte field (1 byte), an int64 (8 bytes), and another byte field (1 byte). On a 64-bit architecture, this struct is allocated in memory as shown in Figure 12.23. b1 is allocated first. Because i is an int64, its address must be a multiple of 8. Therefore, it can't be allocated at 0x01 right next to b1. What's the next address that is a multiple of 8? 0x08. b2 is allocated to the next available address that is a multiple of 1: 0x10.

Figure 12.23 The struct occupies 24 bytes in total.

Because a struct's size must be a multiple of the word size (8 bytes), its size isn't 17 bytes but 24 bytes in total. During compilation, the Go compiler adds padding to guarantee data alignment:

type Foo struct {
    b1 byte
    _  [7]byte     // ❶
    i  int64
    b2 byte
    _  [7]byte     // ❶
}

❶ Added by the compiler

Every time a Foo struct is created, it requires 24 bytes of memory, but only 10 bytes contain data; the remaining 14 bytes are padding. Because a struct is an atomic unit, it will never be reorganized, even after garbage collection (GC); it will always occupy 24 bytes in memory. Note that the compiler doesn't rearrange the fields; it only adds padding to guarantee data alignment.

How can we reduce the amount of memory allocated? A rule of thumb is to reorganize a struct so that its fields are sorted by type size in descending order. In our case, the int64 field comes first, followed by the two byte fields:

type Foo struct {
    i  int64
    b1 byte
    b2 byte
}

Figure 12.24 shows how this new version of Foo is allocated in memory. i is allocated first and occupies a full word. The main difference is that b1 and b2 can now coexist in the same word.

Figure 12.24 The structure now occupies 16 bytes of memory.

Again, the struct size must be a multiple of the word size, but it occupies only 16 bytes instead of 24. We save 33% of memory simply by moving i to the first position.
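We can verify both sizes with unsafe.Sizeof; the struct names below are only for this side-by-side check and aren't part of the example above:

package main

import (
    "fmt"
    "unsafe"
)

type fooUnoptimized struct { // b1, i, b2: padding is required around i
    b1 byte
    i  int64
    b2 byte
}

type fooOptimized struct { // i first, then the two byte fields
    i  int64
    b1 byte
    b2 byte
}

func main() {
    fmt.Println(unsafe.Sizeof(fooUnoptimized{})) // 24
    fmt.Println(unsafe.Sizeof(fooOptimized{}))   // 16
}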

What's the concrete impact if we use the first version of the Foo struct (24 bytes) instead of the compacted one? If we keep the Foo structs (for example, an in-memory cache of Foo), our application consumes extra memory. However, even if we don't keep the Foo structs, there are other impacts. For example, if we frequently create Foo variables and allocate them to the heap (a concept we discuss in the next section), the result is more frequent GC cycles, impacting overall application performance.

Speaking of performance, there's another effect: spatial locality. For example, let's consider the following sum function that takes a slice of Foo structs as an argument. This function iterates over the slice and sums all the i fields (int64):

func sum(foos []Foo) int64 {
    var s int64
    for i := 0; i < len(foos); i++ {
        s += foos[i].i                 // ❶
    }
    return s
}

❶ Sums all the i fields

Because a slice is backed by an array, it implies a contiguous allocation of Foo structs.

Let's discuss the backing arrays of the two versions of Foo and look at two cache lines' worth of data (128 bytes). In Figure 12.25, each gray bar represents 8 bytes of data, and the darker bars are the i variables (the fields we want to sum).

Figure 12.25 Because each cache line contains more i variables, iterating over a slice of Foo requires fewer cache lines.

As we can see, each cache line is more useful in the latest version of Foo because it contains on average 33% more i variables. Therefore, iterating over a Foo slice to sum all the int64 elements is more efficient.

We can confirm this observation with a benchmark. If we run both benchmarks with slices of 10,000 elements, the version using the latest Foo struct is about 15% faster on my machine: 15% faster simply by changing the position of a single field in the struct.

Let's be mindful of data alignment. As we've seen in this section, reorganizing the fields of a Go struct in descending size order prevents padding. Preventing padding means allocating more compact structs, which may lead to optimizations such as a reduced GC frequency and better spatial locality.

The next section discusses the fundamental differences between the stack and the heap and why they matter.

12.5 #95: Not understanding stack vs. heap

In Go, a variable can be allocated either on the stack or on the heap. These two types of memory are fundamentally different and can significantly impact data-intensive applications. Let's look at these concepts and the rules the compiler follows when deciding where a variable should be allocated.

12.5.1 Stack vs. heap

First, let's discuss the difference between the stack and the heap. The stack is the default memory; it's a last-in, first-out (LIFO) data structure that stores all the local variables of a specific goroutine. When a goroutine starts, it gets 2 KB of contiguous memory as its stack space (this size has changed over time and could change again). However, this size isn't fixed at run time and can grow or shrink as needed (but it always remains contiguous in memory, preserving data locality).

When Go enters a function, a stack frame is created, representing an interval in memory that only the current function can access. Let's look at a concrete example to understand this concept. Here, the main function prints the result of a sumValue function:

func main() {
    a := 3
    b := 2

    c := sumValue(a, b)        // ❶
    println(c)                 // ❷
}

//go:noinline                  // ❸
func sumValue(x, y int) int {
    z := x + y
    return z
}

❶ Calls the sumValue function

❷ Prints the result

❸ Disables inlining

There are two things to note here. First, we use the println built-in function instead of fmt.Println, which would force the c variable to be allocated on the heap. Second, we disable inlining on the sumValue function; otherwise, the function call wouldn't occur (we discuss inlining in Bug #97, "Not relying on inlining").

Figure 12.26 shows the stack once a and b have been allocated. Because we are executing main, a stack frame was created for this function. In this stack frame, the two variables a and b are allocated on the stack. All the stored variables are valid addresses, meaning they can be referenced and accessed.

Figure 12.26 a and b are allocated on the stack.

Figure 12.27 shows what happens if we step into sumValue up to its return statement. The Go runtime creates a new stack frame as part of the current goroutine's stack. x and y are allocated in this current stack frame, alongside z.

Figure 12.27 Calling sumValue creates a new stack frame.

The previous stack frame (main) contains addresses that are still considered valid. We cannot access a and b directly, but if we had a pointer to a, for example, it would be valid. We discuss pointers shortly.

Let's move to the last statement of the main function: println. We exited the sumValue function, so what happens to its stack frame? See Figure 12.28.

Figure 12.28 The sumValue stack frame is deleted and replaced by the variables of main. In this example, x has been erased by c, while y and z are still allocated in memory but cannot be accessed.

The stack frame isn't completely wiped from memory. When a function returns, Go doesn't take time to deallocate the variables to reclaim the free space. But these previous variables can no longer be accessed, and when new variables from the parent function are allocated on the stack, they replace the previous allocations. In a sense, a stack is self-cleaning; it doesn't require an extra mechanism such as a GC.

Now, let's make a slight change to understand the limits of the stack. The function will return a pointer instead of an int:

func main() {
    a := 3
    b := 2

    c := sumPtr(a, b)
    println(*c)
}

//go:noinline
func sumPtr(x, y int) *int {    // ❶
    z := x + y
    return &z
}

❶ returns a pointer

The c variable in main is now of type *int. After calling sumPtr, let's jump straight to the last println statement. What would happen if z remained allocated on the stack (which cannot be the case)? See Figure 12.29.

Figure 12.29 The c variable references an address that is no longer valid.

If c referenced the address of the z variable and z was allocated on the stack, we would have a major problem. The address would no longer be valid, plus the stack frame of main would keep growing and erase the z variable. For this reason, the stack isn't enough, and we need another type of memory: the heap.

The memory heap is a pool of memory shared by all goroutines. In Figure 12.30, the three goroutines G1, G2, and G3 each have their own stack. They all share the same heap.

Figure 12.30 Three goroutines have their own stack but share the heap

In the previous example, we saw that the z variable cannot live on the stack; therefore, it escapes to the heap. If the compiler cannot prove that a variable isn't referenced after the function returns, the variable is allocated on the heap.

Why should we care? What is the point of understanding the difference between stack and heap? Because this has a big impact on performance.

As we said, the stack is self-cleaning and accessed by a single goroutine. Instead, the heap must be cleaned up by an external system: GC. The more heap we allocate, the more pressure we put on the GC. When the GC is running, it uses 25% of the available CPU capacity and may incur millisecond "stop-the-world" latencies (phases in which the application is paused).

We must also understand that allocating on the stack is faster for the Go runtime because it's trivial: a pointer references the next available memory address. Conversely, allocating on the heap requires more effort to find the right place and hence takes more time.

To illustrate these differences, let's benchmark sumValue and sumPtr:

var globalValue int
var globalPtr *int

func BenchmarkSumValue(b *testing.B) {
    b.ReportAllocs()                    // ❶
    var local int
    for i := 0; i < b.N; i++ {
        local = sumValue(i, i)          // ❷
    }
    globalValue = local
}

func BenchmarkSumPtr(b *testing.B) {
    b.ReportAllocs()                    // ❸
    var local *int
    for i := 0; i < b.N; i++ {
        local = sumPtr(i, i)            // ❹
    }
    globalPtr = local
}

❶ Reporting heap allocations

❷ Sum by value

❸ Report heap allocation

❹ Summing with pointers

If we run these benchmarks (with inlining still disabled), we get the following results:

BenchmarkSumValue-4   992800992    1.261 ns/op   0 B/op   0 allocs/op
BenchmarkSumPtr-4     82829653     14.84 ns/op   8 B/op   1 allocs/op

sumPtr is about an order of magnitude slower than sumValue, a direct consequence of using the heap instead of the stack.

Note that this example shows that using pointers to avoid a copy isn't necessarily faster; it depends on the context. So far in this book, we have only discussed values and pointers through the prism of semantics: a pointer is used when a value has to be shared. In most cases, this should be the rule to follow. Also keep in mind that modern CPUs are extremely efficient at copying data, especially within the same cache line. Let's avoid premature optimization and focus on readability and semantics first.

We should also note that in the previous benchmark we called b.ReportAllocs(), which highlights heap allocations (stack allocations aren't counted):

  • B/op: the number of bytes allocated per operation

  • allocs/op: the number of allocations per operation

Next, let's discuss the conditions under which variables escape to the heap.

12.5.2 Escape Analysis

Escape analysis refers to the work performed by the compiler to decide whether a variable should be allocated on the stack or the heap. Let's look at the main rules.

When an allocation cannot be done on the stack, it is done on the heap. Even though this sounds like a simplistic rule, it's important to remember. For example, if the compiler cannot prove that a variable isn't referenced after a function returns, the variable is allocated on the heap. In the previous section, the sumPtr function returned a pointer to a variable created within the function's scope. In general, sharing up escapes to the heap.

But what about the opposite case? What if we accept a pointer, like in the example below?

func main() {
    a := 3
    b := 2
    c := sum(&a, &b)
    println(c)
}

//go:noinline
func sum(x, y *int) int {      // ❶
    return *x + *y
}

❶ Accepts pointers

sumAccepts two pointers to variables created in the parent. Figure 12.31 shows the current stack if we move to the statements sumin the function .return

Figure 12.31 xand yvariables refer to effective addresses.

Despite being part of another stack frame, xand yvariables refer to effective addresses. So, aand bdon't escape; they can stay on the stack. In general, down sharing stays on the stack.

Here are other situations where variables can venture to the heap:

  • Global variables because multiple goroutines can access them.

  • A pointer sent to a channel:

    type Foo struct{ s string }
    ch := make(chan *Foo, 1)
    foo := &Foo{s: "x"}
    ch <- foo

    Here, foo escapes to the heap.

  • A variable referenced by a value sent to a channel:

    type Foo struct{ s *string }
    ch := make(chan Foo, 1)
    s := "x"
    bar := Foo{s: &s}
    ch <- bar

    Because s is referenced by bar via its address, it escapes to the heap in this case.

  • Local variables that are too large to fit on the stack.

  • Local variables whose size is unknown. For example, s := make([]int, 10) probably won't escape to the heap, but s := make([]int, n) will, because its size is based on a variable.

  • Slices whose backing array is reallocated using append.

Although this list gives us an idea of the compiler's decisions, it isn't exhaustive and may change in future Go versions. To confirm an assumption, we can access the compiler's decisions using -gcflags:

$ go build -gcflags "-m=2"
...
./main.go:12:2: z escapes to heap:

Here, the compiler informs us that the z variable will escape to the heap.
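To make these rules concrete, here is a minimal sketch (not from the book's benchmarks) that combines two of the cases above: sharing a pointer up and creating a slice whose size depends on a variable. Running go build -gcflags "-m=2" on it should report both allocations moving to the heap, although the exact wording of the diagnostics varies between Go versions.

```go
package main

//go:noinline
func newInt() *int {
    v := 42
    return &v // shared up: v is expected to be moved to the heap
}

//go:noinline
func newSlice(n int) []int {
    return make([]int, n) // size depends on a variable: expected to escape
}

func main() {
    p := newInt()
    s := newSlice(10)
    println(*p, len(s))
}
```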

Understanding the fundamental difference between the heap and the stack is critical to optimizing Go applications. As we've seen, heap allocation is more complicated for the Go runtime, requiring an external system with GC to free data. In some data-intensive applications, heap management can consume as much as 20 or 30 percent of the total CPU time. On the other hand, the stack is self-cleaning and local to a single goroutine, which makes allocations faster. Therefore, optimizing memory allocation can have a large return on investment.

Understanding the rules of escape analysis is also essential when writing efficient code. As a general rule, sharing down stays on the stack, whereas sharing up escapes to the heap. This should prevent common mistakes such as premature optimizations where we want to return pointers "to avoid a copy." Let's focus first on readability and semantics, and then optimize allocations if needed.

The next section discusses how to reduce allocations.

12.6 Don't know how to reduce the allocation

Allocation reduction is a common optimization technique to speed up Go applications. This book has covered a few ways to reduce the number of heap allocations:

  • Under-optimized string concatenation (mistake #39): Use strings.Builder instead of the + operator to concatenate strings.

  • Useless string conversions (mistake #40): Whenever possible, avoid unnecessary []byte-to-string conversions.

  • Inefficient slice and map initialization (mistakes #21 and #27): Pre-allocate slices and maps if the length is known.

  • Better data structure alignment to reduce struct size (mistake #94).

As part of this section, we discuss three common ways to reduce allocations:

  • Changing our API

  • Relying on compiler optimizations

  • Using tools such as sync.Pool

12.6.1 API Changes

The first option is to work on the API we provide. Let's take a concrete example: the io.Reader interface:

type Reader interface {
    
    
    Read(p []byte) (n int, err error)
}

The Read method accepts a slice and returns the number of bytes read. Now, imagine that the io.Reader interface had been designed the other way around: passing an int to indicate how many bytes have to be read and returning a slice:

type Reader interface {
    
    
    Read(n int) (p []byte, err error)
}

Semantically, there is nothing wrong with this. But in this case, the returned slice would automatically escape to the heap. We would be in the sharing-up situation described in the previous section.

The Go designers used a sharing-down approach to prevent the slice from automatically escaping to the heap. Therefore, it's up to the caller to provide a slice. That doesn't necessarily mean this slice won't escape: the compiler may decide that it cannot stay on the stack. However, it's up to the caller to handle it, not a constraint imposed by calling the Read method.

Sometimes even small changes in an API can have a positive effect on allocations. When designing an API, let's be aware of the escape analysis rules described in the previous section and, if needed, use -gcflags to understand the compiler's decisions.
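As an illustration of this sharing-down design, here is a minimal sketch of how a caller typically provides the buffer when using io.Reader (process is a hypothetical helper, not part of the original example):

```go
func consume(r io.Reader) error {
    buf := make([]byte, 1024) // the caller owns the buffer (sharing down)
    for {
        n, err := r.Read(buf) // Read fills the slice we pass in
        if n > 0 {
            process(buf[:n]) // hypothetical handler for the bytes read
        }
        if err == io.EOF {
            return nil
        }
        if err != nil {
            return err
        }
    }
}
```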

12.6.2 Compiler Optimizations

One of the goals of the Go compiler is to optimize our code whenever possible. Here's a concrete example regarding maps.

In Go, we can't define a map using a slice as the key type. In some cases, especially in applications doing I/O, we may receive []byte data that we'd like to use as a key. We have to convert it to a string first, so we can write the following code:

type cache struct {
    
    
    m map[string]int                                // ❶
}

func (c *cache) get(bytes []byte) (v int, contains bool) {
    
    
    key := string(bytes)                            // ❷
    v, contains = c.m[key]                          // ❸
    return
}

❶ A map containing strings

❷ Converts []byte into a string

❸ Queries the map using the string value

Because the get function receives a []byte slice, we convert it into a key string to query the map.

However, the Go compiler implements a specific optimization if we query the map using string(bytes) directly:

func (c *cache) get(bytes []byte) (v int, contains bool) {
    
    
    v, contains = c.m[string(bytes)]                         // ❶
    return
}

❶ Queries the map directly using string(bytes)

Despite this being almost the same code (we call string(bytes) directly instead of passing a variable), the compiler will avoid doing the bytes-to-string conversion. Hence, the second version is faster than the first.

This example illustrates how two versions of a function that look similar can lead to different assembly code following the work of the Go compiler. We should also be aware of possible compiler optimizations when optimizing an application. And we should keep an eye on future Go versions to check whether new optimizations are added to the language.
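If we want to validate the impact on our own workload, a benchmark sketch along the following lines can compare both versions of get (it assumes the cache type above; the exact numbers depend on the key size and the Go version):

```go
func BenchmarkGetIndirect(b *testing.B) {
    c := cache{m: map[string]int{"foo": 1}}
    k := []byte("foo")
    b.ReportAllocs()
    for i := 0; i < b.N; i++ {
        key := string(k) // conversion assigned to a variable first
        _, _ = c.m[key]
    }
}

func BenchmarkGetDirect(b *testing.B) {
    c := cache{m: map[string]int{"foo": 1}}
    k := []byte("foo")
    b.ReportAllocs()
    for i := 0; i < b.N; i++ {
        _, _ = c.m[string(k)] // conversion done inside the map lookup
    }
}
```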

12.6.3 sync.Pool

If we want to tackle the number of allocations, another avenue of improvement is using sync.Pool. We should understand that sync.Pool isn't a cache: there's no fixed size or maximum capacity that we can set. Rather, it's a pool for reusing common objects.

Suppose we want to implement a write function that receives an io.Writer, calls a function to get a []byte slice, and writes it to the io.Writer. Our code looks like this (for clarity, we omit error handling):

func write(w io.Writer) {
    
    
    b := getResponse()       // ❶
    _, _ = w.Write(b)        // ❷
}

❶ Gets a []byte response

❷ Writes it to the io.Writer

Here, getResponse returns a new []byte slice on every call. What if we want to reduce the number of allocations by reusing this slice? Let's assume all responses have a maximum size of 1,024 bytes. In that case, we can use sync.Pool.

Creating a sync.Pool requires a func() any factory function; see figure 12.32. sync.Pool exposes two methods:

  • Get() any: Gets an object from the pool

  • Put(any): Returns an object to the pool

Figure 12.32 Defining a factory function that creates a new object on each call

Get creates a new object if the pool is empty or reuses an object otherwise. Then, after using the object, we can put it back into the pool with Put. Figure 12.33 shows an example with the factory defined earlier, with a Get when the pool is empty and a Put and then a Get when the pool isn't empty.

Figure 12.33 Get creates a new object or returns one from the pool. Put puts an object back into the pool.

When are objects drained from the pool? There's no specific method to do this: it relies on the GC. After each GC, the objects in the pool are destroyed.

Going back to our example, assuming we can update the getResponse function to write data to a given slice instead of creating one, we can implement another version of write that relies on a pool:

var pool = sync.Pool{
    
    
    New: func() any {
    
                    // ❶
        return make([]byte, 1024)
    },
}

func write(w io.Writer) {
    
    
    buffer := pool.Get().([]byte)    // ❷
    buffer = buffer[:0]              // ❸
    defer pool.Put(buffer)           // ❹

    getResponse(buffer)              // ❺
    _, _ = w.Write(buffer)
}

❶ Created a pool and set up a factory function

❷ Gets or creates a []byte from the pool

❸ The buffer is reset

❹ Put the buffer back into the pool

❺ Write the response to the provided buffer

We define a new pool using the sync.Pool struct and set up the factory function to create a new []byte with a length of 1,024 elements. In the write function, we try to retrieve one buffer from the pool. If the pool is empty, the function creates a new buffer; otherwise, it selects an arbitrary buffer from the pool and returns it. One crucial step is to reset the buffer with buffer[:0], as the slice may already have been used. Then we defer the call to Put to put the slice back into the pool.

With this new version, calling write doesn't lead to creating a new []byte slice for every call. Instead, we can reuse existing allocated slices. In the worst case (for example, after a GC), the function will create a new buffer; however, the amortized allocation cost is reduced.

To sum up, if we frequently allocate many objects of the same type, we can consider using sync.Pool. It is a set of temporary objects that can help us prevent reallocating the same kind of data repeatedly. And sync.Pool is safe for use by multiple goroutines simultaneously.
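A common variation of this pattern (not part of the book's example) is pooling *bytes.Buffer values instead of raw slices; a minimal sketch:

```go
var bufPool = sync.Pool{
    New: func() any {
        return new(bytes.Buffer) // pooling a pointer avoids copying the struct on Put
    },
}

func writeBody(w io.Writer, body []byte) error {
    buf := bufPool.Get().(*bytes.Buffer)
    buf.Reset()             // reset before reuse, like buffer[:0] above
    defer bufPool.Put(buf)

    buf.Write(body)
    _, err := w.Write(buf.Bytes())
    return err
}
```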

Next, let's discuss the concept of inlining, to understand why this compiler optimization is worth knowing about.

12.7 #97: Don't rely on inlining

Inlining refers to replacing a function call with the body of the function. In Go, inlining is done automatically by the compiler. Understanding the fundamentals of inlining is also a way to optimize specific code paths of an application.

Let's look at a concrete example of inlining, using a simple sum function that adds two int values:

func main() {
    
    
    a := 3
    b := 2
    s := sum(a, b)
    println(s)
}

func sum(a int, b int) int {
    
         // ❶
    return a + b
}

❶ The function is inlined

If we run go build with the -gcflags option, we can access the decisions made by the compiler about the sum function:

$ go build -gcflags "-m=2"
./main.go:10:6: can inline sum with cost 4 as: func(int, int) int { return a + b }
...
./main.go:6:10: inlining call to sum func(int, int) int { return a + b }

The compiler decides to inline the call to sum. Therefore, the preceding code is replaced with the following:

func main() {
    
    
    a := 3
    b := 2
    s := a + b     // ❶
    println(s)
}

❶ Replaces the call to sum with its body

Inlining only works for functions whose complexity is below a threshold, also known as the inlining budget. Otherwise, the compiler informs us that the function is too complex to be inlined:

./main.go:10:6: cannot inline foo: function too complex:
    cost 84 exceeds budget 80

Inlining has two main benefits. First, it removes the overhead of a function call (even though the overhead has been mitigated since Go 1.17 and register-based calling conventions). Second, it allows the compiler to proceed with further optimizations. For example, after inlining a function, the compiler can decide that a variable that was originally supposed to escape to the heap can stay on the stack.

The question is, if this optimization is applied automatically by the compiler, why should we care about it as Go developers? The answer lies in the concept of mid-stack inlining.

Mid-stack inlining is about inlining functions that call other functions. Before Go 1.9, inlining only considered leaf functions. Now, thanks to mid-stack inlining, the following foo function can also be inlined:

func main() {
    
    
    foo()
}

func foo() {
    
    
    x := 1
    bar(x)
}

Because the foo function isn't too complex, the compiler can inline its call:

func main() {
    
    
    x := 1       // ❶
    bar(x)
}

❶ Replaced by the body of foo

Thanks to mid-stack inlining, as Go developers we can now optimize an application by using the concept of fast-path inlining to distinguish between fast and slow paths. Let's look at a concrete example from the sync.Mutex implementation to understand how this works.

Before mid-stack inlining was introduced, the Lock method was implemented as follows:

func (m *Mutex) Lock() {
    
    
    if atomic.CompareAndSwapInt32(&m.state, 0, mutexLocked) {
    
    
        // Mutex isn't locked
        if race.Enabled {
    
    
            race.Acquire(unsafe.Pointer(m))
        }
        return
    }

    // Mutex is already locked
    var waitStartTime int64
    starving := false
    awoke := false
    iter := 0
    old := m.state
    for {
    
    
        // ...    // ❶
    }
    if race.Enabled {
    
    
        race.Acquire(unsafe.Pointer(m))
    }
}

❶ Complex logic

We can distinguish two main paths:

  • A fast path, if the mutex isn't locked (atomic.CompareAndSwapInt32 returns true)

  • A slow path, if the mutex is already locked (atomic.CompareAndSwapInt32 returns false)

However, regardless of the path taken, the function cannot be inlined because of its complexity. To take advantage of mid-stack inlining, the Lock method was refactored so that the slow path lives in a dedicated function:

func (m *Mutex) Lock() {
    
    
    if atomic.CompareAndSwapInt32(&m.state, 0, mutexLocked) {
    
    
        if race.Enabled {
    
    
            race.Acquire(unsafe.Pointer(m))
        }
        return
    }
    m.lockSlow()     // ❶
}

func (m *Mutex) lockSlow() {
    
    
    var waitStartTime int64
    starving := false
    awoke := false
    iter := 0
    old := m.state
    for {
    
    
        // ...
    }

    if race.Enabled {
    
    
        race.Acquire(unsafe.Pointer(m))
    }
}

❶ The path when the mutex is already locked

Thanks to this change, the Lock method can now be inlined. The benefit is that a mutex that isn't locked is now locked without paying the overhead of a function call (a speedup of about 5%). The slow path, when the mutex is already locked, doesn't change. Previously it required a function call to execute this logic; it's still a function call, this time to lockSlow.

This optimization technique is all about distinguishing between fast and slow paths. If a fast path can be inlined but a slow path can't, we can extract the slow path into a dedicated function. Hence, if the inlining budget isn't exceeded, our function becomes a candidate for inlining.
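Outside the standard library, the same idea applies to our own code. Here is a hypothetical sketch of the pattern (not taken from the book): the common case is kept small enough to stay under the inlining budget (which we can verify with -gcflags "-m=2"), and the general case is extracted into a dedicated function:

```go
// parseUint parses a decimal string. The single-digit case is the fast path.
func parseUint(s string) (uint64, error) {
    if len(s) == 1 && s[0] >= '0' && s[0] <= '9' {
        return uint64(s[0] - '0'), nil // fast path: cheap and inlinable
    }
    return parseUintSlow(s) // slow path extracted into its own function
}

func parseUintSlow(s string) (uint64, error) {
    if len(s) == 0 {
        return 0, fmt.Errorf("empty input")
    }
    var n uint64
    for i := 0; i < len(s); i++ {
        c := s[i]
        if c < '0' || c > '9' {
            return 0, fmt.Errorf("invalid character %q", c)
        }
        n = n*10 + uint64(c-'0')
    }
    return n, nil
}
```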

Inlining isn't just an invisible compiler optimization that we shouldn't care about. As shown in this section, understanding how inlining works and how to access the compiler's decisions is one path toward optimizing via the fast-path inlining technique. Extracting the slow path into a dedicated function prevents a function call if the fast path is taken.

The next section discusses common diagnostics tools that help us understand what should be optimized in our Go applications.

12.8 #98: Not using Go diagnostics tooling

Go offers a few excellent diagnostics tools to help us get insights into how an application performs. This section focuses on the most important ones: profiling and the execution tracer. Both tools are so important that they should be part of the core toolset of any Go developer who is interested in optimization. First, let's discuss profiling.

12.8.1 Profiling

Profiling provides insights into the execution of an application. It allows us to resolve performance issues, detect contention, locate memory leaks, and more. These insights can be collected via several profiles:

  • CPU: Determines where the application spends its time

  • Goroutine: Reports the stack traces of the ongoing goroutines

  • Heap: Reports heap memory allocation to monitor current memory usage and check for possible memory leaks

  • Mutex: Reports lock contention to see the behavior of the mutexes used in our code and whether the application spends too much time in locking calls

  • Block: Shows where goroutines block waiting on synchronization primitives

Profiling is accomplished using a tool called a profiler: in Go, pprof. First, let's look at how and when to enable pprof; then we'll discuss the most important kinds of profiles.

Enabling pprof

There are several ways to enable pprof. For example, we can use the net/http/pprof package to serve profiling data via HTTP:

package main

import (
    "fmt"
    "log"
    "net/http"
    _ "net/http/pprof"                                                      // ❶
)

func main() {
    
    
    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
    
         // ❷
        fmt.Fprintf(w, "")
    })
    log.Fatal(http.ListenAndServe(":80", nil))
}

❶ Blank import of pprof

❷ Exposes an HTTP endpoint

Importing net/http/pprof leads to a side effect that allows us to reach the pprof URL: http://host/debug/pprof. Note that enabling pprof is safe even in production (go.dev/doc/diagnostics#profiling). The profiles that impact performance, such as CPU profiling, aren't enabled by default and don't run continuously: they are activated only for a specific period.
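Note that the HTTP endpoint isn't the only option: profiles can also be captured programmatically with the runtime/pprof package. A minimal sketch writing a CPU profile to a file (run is a hypothetical workload, not part of the original example):

```go
package main

import (
    "log"
    "os"
    "runtime/pprof"
)

func main() {
    f, err := os.Create("cpu.out")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    if err := pprof.StartCPUProfile(f); err != nil { // start CPU sampling
        log.Fatal(err)
    }
    defer pprof.StopCPUProfile() // flush the profile before exiting

    run() // hypothetical workload to profile
}
```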

Now that we have seen how to expose a pprof endpoint, let's discuss the most common profiles.

CPU profiling

The CPU profiler relies on the OS and signaling. When it's activated, the application asks the OS to interrupt it every 10 ms, by default, via a SIGPROF signal. When the application receives a SIGPROF, it suspends the current activity and transfers the execution to the profiler. The profiler collects data such as the current goroutine activity and aggregates execution statistics that we can retrieve. Then it stops, and the execution resumes until the next SIGPROF.

We can access the /debug/pprof/profile endpoint to activate CPU profiling. By default, accessing this endpoint executes a 30-second CPU profiling: for 30 seconds, our application is interrupted every 10 ms. Note that we can change these two default values: we can use the seconds parameter to pass to the endpoint how long the profiling should last (for example, /debug/pprof/profile?seconds=15), and we can change the interruption rate (even to less than 10 ms). In most cases, 10 ms should be enough; when decreasing this value (meaning increasing the rate), we should be careful not to harm performance. After 30 seconds, we download the results of the CPU profiler.

CPU profiling during benchmarks

We can also enable the CPU profiler using the -cpuprofile flag, such as when running a benchmark:

$ go test -bench=. -cpuprofile profile.out

This command produces the same type of file that can be downloaded via /debug/pprof/profile.

From this file, we can navigate into the results using go tool:

$ go tool pprof -http=:8080 <file>

This command opens a web UI showing the call graph. Figure 12.34 shows an example taken from an application. The larger the arrow, the hotter the path. We can then navigate into this graph and get insights about the execution.

Figure 12.34 The call graph of the application in 30 seconds

For example, the graph in figure 12.35 tells us that over 30 seconds, 0.06 seconds were spent in the decode method (*FetchResponse receiver). Out of these 0.06 seconds, 0.02 were spent in RecordBatch.decode and 0.01 in makemap (creating a map).

Figure 12.35 Example call graph

We can also access this kind of information from the web user interface through different representations. For example, the top view sorts functions by execution time, while the flame graph visualizes the execution time hierarchy. The UI can even display expensive parts of the source code line by line.

Note that we can also drill down into the data via the command line. However, in this section we will focus on the web UI.

With the help of this data, we can get a general idea of ​​how the app behaves:

  • Too many calls to runtime.mallocgc mean an excessive number of small heap allocations, which we can try to minimize.

  • Too much time spent on channel operations or mutexes can indicate excessive contention, which can hurt your application's performance.

  • Too much time spent in syscall.Read or syscall.Write means the application spends a significant amount of time in kernel mode. Working on I/O buffering may be an avenue for improvement.

These are the kinds of insight we can get from the CPU profiler. It's valuable to understand the hottest code paths and identify bottlenecks. But the profiler won't capture anything at a finer granularity than the configured sampling rate, because it runs at a fixed pace (by default, every 10 ms). To get finer-grained insights, we should use tracing, which we discuss later in this chapter.

Note We can also attach labels to the different functions. For example, imagine a common function called from different clients. To track the time spent for both clients, we can use pprof.Labels.
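A minimal sketch of this labeling, using pprof.Do from runtime/pprof (process is a hypothetical function):

```go
func handle(ctx context.Context, clientID string) {
    // CPU samples collected while the closure runs are tagged with the
    // "client" label, so the profile can be split per client in the pprof UI.
    pprof.Do(ctx, pprof.Labels("client", clientID), func(ctx context.Context) {
        process(ctx)
    })
}
```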

Heap profiling

Heap profiling allows us to get statistics about current heap usage. Like CPU profiling, heap profiling is sample-based. We can change this rate, but we shouldn't make it too fine-grained, because the more we decrease the rate, the more effort heap profiling requires to collect data. By default, samples are profiled at one allocation for every 512 KB of heap allocation.

We can reach debug/pprof/heap/?debug=0 to download a heap profile and then open it with go tool (the same command as in the previous section) to navigate into the data using the web UI.

Figure 12.36 A heap graph

Figure 12.36 shows an example of a heap graph. Calling the MetadataResponse.decode method leads to allocating 1536 KB of heap data (6.32% of the total heap). However, 0 of these 1536 KB were allocated directly by this function, so we need to inspect the second call. The TopicMetadata.decode method allocated 512 KB out of the 1536 KB; the rest, 1024 KB, were allocated by another method.

This is how we can navigate the call chain to understand which part of the application is responsible for most of the heap allocations. We can also look at different sample types:

  • alloc_objects: Total number of objects allocated

  • alloc_space: Total amount of memory allocated

  • inuse_objects: Number of objects allocated and not yet released

  • inuse_space: Amount of memory allocated and not yet released

Another very useful capability of heap profiling is tracking memory leaks. With a GC-based language, the usual procedure is the following:

  1. Trigger a GC.

  2. Download the heap data.

  3. Wait a few seconds/minutes.

  4. Trigger another GC.

  5. Download another set of heap data.

  6. Compare.

Forcing a GC before downloading the data is a way to prevent false assumptions. For example, if we see a peak of retained objects without running a GC first, we can't be sure whether it's a leak or objects that the next GC will collect.

Using pprof, we can download a heap profile while a GC is forced. The procedure in Go is the following:

  1. Go to /debug/pprof/heap?gc=1 (trigger the GC and download the heap profile).

  2. Wait a few seconds/minutes.

  3. Go to /debug/pprof/heap?gc=1 again.

  4. Use go tool to compare both heap profiles:

$ go tool pprof -http=:8080 -diff_base <file2> <file1>

Figure 12.37 shows the kind of data we can access. For example, the amount of heap memory held by the newTopicProducer method (top left) has decreased (–513 KB). In contrast, the amount held by updateMetadata (bottom right) has increased (+512 KB). Slow increases are normal; the second heap profile may have been calculated in the middle of a service call, for example. We can repeat this process or wait longer; the important part is to track steady increases in allocations of a specific object.

Figure 12.37 The difference between the two heap profiles

Note that another type of profiling related to the heap is allocs, which reports allocations. Heap profiling shows the current state of the heap memory. To get insights about memory allocations since the application started, we can use allocations profiling. As discussed, because stack allocations are cheap, they aren't part of this profiling, which focuses only on the heap.

Goroutine profiling

The goroutine profile reports the stack traces of all the current goroutines in an application. We can download it with debug/pprof/goroutine/?debug=0 and, again, use go tool. Figure 12.38 shows the kind of information we can get.

Figure 12.38 A goroutine graph

We can see the current state of the application and how many goroutines were created per function. In this case, withRecover has created 296 ongoing goroutines (63%), and 29 of them are related to a call to responseFeeder.

This kind of information is also beneficial if we suspect goroutine leaks. We can look at goroutine profiler data to know which part of a system is suspicious.

Block profiling

The block profile reports where ongoing goroutines block waiting on synchronization primitives. Possibilities include

  • Sending or receiving on an unbuffered channel

  • Sending to a full channel

  • Receiving from an empty channel

  • Mutex contention

  • Network or filesystem waits

The block profile also records how long a goroutine has been waiting and can be accessed via debug/pprof/block. This profile can be helpful if we suspect that blocking calls harm performance.

By default, the block profile isn't enabled: we have to call runtime.SetBlockProfileRate to enable it. This function controls the fraction of goroutine blocking events reported. Once enabled, the profiler keeps collecting data in the background even if we don't call the debug/pprof/block endpoint. Let's be careful if we want to set a high rate so we don't harm performance.

Full goroutine stack dump

If we face a deadlock or suspect that goroutines are blocked, the full goroutine stack dump (debug/pprof/goroutine/?debug=2) creates a dump of all the current goroutine stack traces. This can be helpful as a first analysis step. For example, the following dump shows a Sarama goroutine stuck for 1,420 minutes in a channel-receive operation:

goroutine 2494290 [chan receive, 1420 minutes]:
github.com/Shopify/sarama.(*syncProducer).SendMessages(0xc00071a090,
    {0xc0009bb800, 0xfb, 0xfb})
    /app/vendor/github.com/Shopify/sarama/sync_producer.go:117 +0x149

Mutex profiling

The last profile type is related to blocking, but only regarding mutexes. If we suspect that our application spends significant time waiting for locked mutexes, thus harming execution, we can use mutex profiling. It's accessible via /debug/pprof/mutex.

This profile works like the one for blocking. It's disabled by default: we have to enable it using runtime.SetMutexProfileFraction, which controls the fraction of mutex contention events reported.
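A minimal sketch of enabling both of these profiles at startup (a rate of 1 records every event, which is the most expensive setting; production services usually sample more coarsely):

```go
import "runtime"

func init() {
    // Block profile: aim to record every blocking event
    // (one event per nanosecond spent blocked).
    runtime.SetBlockProfileRate(1)

    // Mutex profile: report 1/rate of contention events; 1 reports them all.
    runtime.SetMutexProfileFraction(1)
}
```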

Here are some additional notes about profiling:

  • We didn't mention the threadcreate profile because it's been broken since 2013 (github.com/golang/go/issues/6104).

  • Be sure to enable only one profiler at a time: for example, do not enable CPU and heap profiling simultaneously. Doing so can lead to erroneous observations.

  • pprof is extensible, and we can create our own custom profiles using pprof.Profile.

We have seen the most important profiles that can help us understand how an application performs and possible avenues for optimization. In general, enabling pprof is recommended, even in production, because in most cases it offers an excellent balance between its footprint and the amount of insight we can get from it. Some profiles, such as the CPU profile, lead to performance penalties, but only while they are enabled.

Now let's look at the execution tracer.

12.8.2 Execution Tracker

The execution tracer is a tool that captures a wide range of runtime events with go tool, making them available for visualization. It can help with the following:

  • Understand runtime events such as how GC performs

  • Understand how goroutines execute

  • Identify poorly parallelized executions

Let's try it with the example given in mistake #56, "It's always faster to think concurrently." In that mistake, we discussed two parallel versions of the merge sort algorithm. The issue with the first version was poor parallelization, leading to the creation of too many goroutines. Let's see how the tracer can help us validate this statement.

We'll write a benchmark for the first version and execute it with the -trace flag to enable the execution tracer:

$ go test -bench=. -v -trace=trace.out

Note that we can also download a remote trace file using the pprof endpoint /debug/pprof/trace?debug=0.

This command creates a trace.out file that we can open using go tool:

$ go tool trace trace.out
2021/11/26 21:36:03 Parsing trace...
2021/11/26 21:36:31 Splitting trace...
2021/11/26 21:37:00 Opening browser. Trace viewer is listening on
    http://127.0.0.1:54518

Once the web browser is open, we can click View Trace to see all the traces during a specific timeframe, as shown in figure 12.39. This figure represents about 150 ms, and we can see several useful metrics, such as the goroutine count and the heap size. The heap size grows steadily until a GC is triggered. We can also observe the activity of the Go application per CPU core. The timeframe starts with user-level code; then a stop-the-world is executed, which occupies the four CPU cores for about 40 ms.

Figure 12.39 shows goroutine activity and runtime events such as GC phases

Regarding concurrency, we can see that this version uses all the available CPU cores on the machine. However, figure 12.40 zooms in on a portion of 1 ms. Each bar corresponds to a single goroutine execution. Having too many small bars doesn't look right: it means execution that is poorly parallelized.

Figure 12.40 Too many small bars means poor parallel execution.

Figure 12.41 zooms even closer to see how these goroutines are orchestrated. Roughly 50% of the CPU time isn't spent executing application code. The white spaces represent the time the Go runtime takes to spin up and orchestrate new goroutines.

Figure 12.41 About 50% of CPU time is spent processing goroutine switches.

Let's compare this to the second parallel implementation, which is about an order of magnitude faster. Figure 12.42 zooms in again to the 1 millisecond time frame.

Figure 12.42 The number of blank spaces is significantly reduced, which proves that the CPU is more fully occupied.

Each goroutine takes more time to execute, and the amount of whitespace has been significantly reduced. As a result, the CPU spends significantly more time executing application code than in the first version. Every millisecond of CPU time is being used more efficiently, which explains the difference in the benchmarks.

Note that the granularity of the tracer is per goroutine, not per function like CPU profiling. However, it's possible to define user-level tasks using the runtime/trace package to get insights per function or group of functions.

For example, imagine a function that computes a Fibonacci number and then writes the result to a global variable using atomic. We can define two different tasks:

var v int64
ctx, fibTask := trace.NewTask(context.Background(), "fibonacci")     // ❶
trace.WithRegion(ctx, "main", func() {
    
    
    v = fibonacci(10)
})
fibTask.End()
ctx, fibStore := trace.NewTask(ctx, "store")                         // ❷
trace.WithRegion(ctx, "main", func() {
    
    
    atomic.StoreInt64(&result, v)
})
fibStore.End()

❶ Created a Fibonacci task

❷ Create a storage task

Using go tool, we can obtain more precise information about how these two tasks are performed. In the previous trace UI (Figure 12.42), we can see the boundaries of each task in each goroutine. In user-defined tasks, we can follow a duration distribution (see Figure 12.43).

Figure 12.43 Distribution of user-level tasks

We can see that in most cases, the fibonacci task is executed in less than 15 microseconds, whereas the store task takes less than 6,309 nanoseconds.

In the previous section, we discussed the various kinds of information we can obtain from CPU profiling. What are the main differences compared to the data we can get from user-level traces?

  • CPU profiling:

    • Based on samples.
    • Per function.
    • Doesn't go below the sampling rate (10 ms by default).
  • User-level traces:

    • Not based on samples.
    • Per goroutine execution (unless we use the runtime/trace package).
    • Not bound by any rate regarding time execution.

In conclusion, the execution tracer is a powerful tool for understanding how an application performs. As we saw with the merge sort example, we can identify poorly parallelized execution. However, the tracer's granularity remains per goroutine unless we manually use runtime/trace, in contrast to CPU profiling, which is per function. We can use both profiling and the execution tracer to get the most out of the standard Go diagnostics tools when optimizing an application.
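Finally, note that besides the -trace flag and the pprof endpoint, a trace can also be captured programmatically with trace.Start and trace.Stop from runtime/trace; a minimal sketch (run is a hypothetical workload), with the resulting file opened via go tool trace as before:

```go
package main

import (
    "log"
    "os"
    "runtime/trace"
)

func main() {
    f, err := os.Create("trace.out")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    if err := trace.Start(f); err != nil { // start recording runtime events
        log.Fatal(err)
    }
    defer trace.Stop() // flush the trace before exiting

    run() // hypothetical workload to trace
}
```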

The next section discusses how GC works and how to tune it.

12.9 #99: Not understanding how GC works

The Garbage Collector (GC) is a key part of the Go language that simplifies a developer's life. It allows us to track and free heap allocations that are no longer needed. Since we cannot replace every heap allocation with a stack allocation, understanding how the GC works should be part of a Go developer's toolset for optimizing applications.

12.9.1 Concept

The GC keeps a tree of the object references. The Go GC is based on the mark-and-sweep algorithm, which relies on two stages:

  • Mark stage: Traverses all the objects of the heap and marks whether they are still in use

  • Sweep stage: Traverses the tree of references from the root and deallocates blocks of objects that are no longer referenced

When the GC runs, it first performs a set of actions that result in a stop-the-world (two stops-the-world per GC, to be precise). That is, all available CPU time is spent performing GC, pausing our application code. Following these steps, it starts the world again, resumes our application, and runs a concurrent stage at the same time. Go GC is called concurrent mark-and-sweep for this reason: its goal is to reduce the number of stop-the-world operations per GC cycle, and it runs mostly concurrently with our application.

The cleaner

Go GC also includes a method to free memory after consumption spikes. Suppose our application is based on two stages:

  • Initialization phase leading to frequent allocations and a large heap

  • Runtime phase with modest allocations and a small heap

How do we deal with the fact that the large heap is useful only when the application starts, and not after that? This is handled as part of the GC with a so-called periodic cleaner: after a while, the GC detects that such a large heap is no longer necessary, so it frees some memory and returns it to the OS.

Note that if the cleaner isn't fast enough, we can use debug.FreeOSMemory() to manually force memory to be returned to the OS.

The important question is, when will a GC cycle run? Compared to other languages such as Java, the Go configuration remains fairly simple. It relies on a single environment variable: GOGC. This variable defines the percentage of heap growth since the last GC before triggering another GC; the default value is 100%.

Let's look at a concrete example to make sure we understand. Let's assume a GC has just been triggered, and the current heap size is 128 MB. If GOGC=100, the next GC is triggered when the heap size reaches 256 MB. By default, a GC is executed every time the heap size doubles. In addition, Go forces a GC to run if no GC has been executed during the last 2 minutes.
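Note that besides the environment variable, the same knob can be adjusted at run time with debug.SetGCPercent from runtime/debug; a minimal sketch, assuming we want the equivalent of GOGC=200:

```go
import "runtime/debug"

func init() {
    // Equivalent to starting the process with GOGC=200: the next GC triggers
    // once the heap has grown by 200% since the previous collection.
    debug.SetGCPercent(200)
}
```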

If we profile our application with a production workload, we can fine-tune GOGC:

  • Reducing it will cause the heap to grow more slowly, increasing the pressure on the GC.

  • Conversely, increasing it will cause the heap to grow faster, reducing the pressure on the GC.

GC traces

We can print GC traces by setting the GODEBUG environment variable, such as while running a benchmark:

$ GODEBUG=gctrace=1 go test -bench=. -v

Enabling gctrace writes a trace to stderr each time the GC runs.

Let's go through some concrete examples to understand the behavior of the GC when the load increases.

12.9.2 Examples

Suppose we expose some public service to users. At the peak time of 12:00 pm, 1 million users are connected. However, the number of connected users grows steadily. Figure 12.44 shows the average heap size and when GCs are triggered when we set GOGC to 100.

Figure 12.44 A steady increase of connected users

Because GOGC is set to 100, a GC is triggered every time the heap size doubles. In this situation, because the number of users grows steadily, we should face an acceptable number of GCs throughout the day (figure 12.45).

Figure 12.45 The GC frequency never reaches more than a moderate state.

At the beginning of the day, we should have a moderate number of GC cycles. When we reach 12:00 pm and the number of users starts to decrease, the number of GC cycles should also decrease steadily. In this scenario, keeping GOGC at 100 should be fine.

Now, let's consider a second scenario in which most of the 1 million users connect in less than one hour; see figure 12.46. At 8:00 am, the average heap size grows rapidly, reaching its peak roughly one hour later.

Figure 12.46 A sudden increase of users

During this hour, the frequency of GC cycles is severely affected, as shown in figure 12.47. Because of the significant and sudden increase of the heap, we face frequent GC cycles within a short period. Even though the Go GC is concurrent, this situation leads to a lot of stop-the-world periods and can cause effects such as an increase in the average latency perceived by users.

Figure 12.47 During one hour, we observe a high frequency of GCs.

In this case, we should consider increasing GOGC to a higher value to reduce the pressure on the GC. Note that increasing GOGC doesn't bring a linear benefit: the bigger the heap, the longer it takes to clean up. Therefore, we should be careful when configuring GOGC and validate it using production-like workloads.

In special cases where the spike is even more significant, tuning GOGC may not be enough. For example, instead of going from 0 to 1 million users in one hour, we do it in a few seconds. During these few seconds, the number of GCs can reach a critical state, making the application perform very poorly.

If we know the peak size of the heap, we can use a trick that forces the allocation of a large amount of memory to improve heap stability. For example, we can force the allocation of 1 GB using a global variable in main.go:

var min = make([]byte, 1_000_000_000) // 1 GB

What's the point of such an allocation? If GOGC is kept at 100, instead of triggering a GC every time the heap doubles (which, again, happens very frequently during those few seconds), Go will trigger a GC only when the heap reaches 2 GB. This should reduce the number of GC cycles triggered when all the users connect, thus reducing the impact on average latency.

We may say that this trick wastes a lot of memory when the heap size is reduced. But that's not the case. On most OSs, allocating this min variable doesn't make our application consume 1 GB of memory. Calling make results in an mmap() system call, which leads to lazy allocation. For example, on Linux, memory is virtually addressed and mapped via page tables: mmap() allocates 1 GB of memory in the virtual address space, not the physical one. Only a read or a write leads to a page fault, which causes an actual physical memory allocation. So even if the application starts without any connected clients, it doesn't consume 1 GB of physical memory.

Note that we can verify this behavior using a tool such as ps.

To optimize the GC, it's important to understand its behavior. As Go developers, we can use GOGC to configure when the next GC cycle is triggered. In most cases, keeping it at 100 should be enough. However, if our application may face request spikes leading to frequent GCs and a latency impact, we can increase this value. Finally, in the event of exceptional request spikes, we can consider using the trick of forcing a minimum virtual heap size.

The last section of this chapter discusses the implications of running Go in Docker and Kubernetes.

12.10 #100: Not understanding the impact of running Go in Docker and Kubernetes

According to the 2021 Go Developer Survey ( go.dev/blog/survey2021-results), writing services in Go is the most common usage. Meanwhile, Kubernetes is the most widely used platform for deploying these services. It's important to understand the implications of running Go inside Docker and Kubernetes to prevent common situations like CPU throttling.

As we mentioned in mistake #56, "It's always faster to think concurrently," the GOMAXPROCS variable defines the limit on the number of OS threads in charge of executing user-level code simultaneously. By default, it's set to the number of logical CPU cores visible to the OS. What does this mean in the context of Docker and Kubernetes?

Suppose our Kubernetes cluster is composed of eight-core nodes. When a container is deployed in Kubernetes, we can define a CPU limit to ensure that the application doesn't consume all the host resources. For example, the following configuration limits CPU usage to 4,000 millicores (or millicpu), hence four CPU cores:

spec:
  containers:
  - name: myapp
    image: myapp
    resources:
      limits:
        cpu: 4000m

We may expect that when our application is deployed, GOMAXPROCS will be based on these limits and will therefore have a value of 4. But it doesn't; it's set to the number of logical cores on the host: 8. So, what's the impact?

Kubernetes uses the Completely Fair Scheduler (CFS) as its process scheduler. CFS is also used to enforce CPU limits for pod resources. When managing a Kubernetes cluster, an administrator can configure these two parameters:

  • cpu.cfs_period_us (set globally)

  • cpu.cfs_quota_us (set per pod)

The former defines a period, and the latter a quota. By default, the period is set to 100 ms. Meanwhile, the default quota is how much CPU time the application can consume during 100 ms. The limit is set to four cores, which means 400 ms (4 × 100 ms). Therefore, CFS ensures that our application won't consume more than 400 ms of CPU time per 100-ms period.

Let's imagine a scenario in which multiple goroutines are currently executing on four different threads. Each thread is scheduled on a different core (1, 3, 4, and 8); see figure 12.48.

Figure 12.48 During every 100-ms period, the application consumes less than 400 ms.

During the first 100-ms period, four threads are busy, so we consume 400 ms out of 400 ms: 100% of the quota. During the second period, we consume 360 ms out of 400 ms, and so on. Everything is fine, because the application consumes fewer resources than the quota.

However, let's remember that GOMAXPROCS is set to 8. Therefore, in the worst case, we can have eight threads, each of them scheduled on a different core (figure 12.49).

Figure 12.49 During every 100-ms period, the CPU is throttled after 50 ms.

Every 100 ms, the quota is set to 400 ms. If eight threads are busy executing goroutines, after 50 ms we reach the 400-ms quota (8 × 50 ms = 400 ms). What are the consequences? CFS throttles the CPU resources: no more CPU resources will be allocated until the next period begins. In other words, our application is put on hold for 50 ms.

For example, a service with an average latency of 50 ms could take up to 150 ms to complete. This can be a latency penalty of up to 300%.

So, what's the solution? First, let's keep an eye on Go issue 33803 (github.com/golang/go/issues/33803). Perhaps in a future version of Go, GOMAXPROCS will be CFS-aware.

In the meantime, one solution is to rely on automaxprocs, a library made available at github.com/uber-go/automaxprocs. We can use this library by adding a blank import of go.uber.org/automaxprocs in main.go; it automatically sets GOMAXPROCS to match the Linux container's CPU quota. In the previous example, GOMAXPROCS would be set to 4 instead of 8, so we wouldn't be able to reach a state where the CPU is throttled.
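A minimal sketch of that blank import, following the library's documented usage:

```go
package main

import (
    _ "go.uber.org/automaxprocs" // sets GOMAXPROCS from the container CPU quota at startup
)

func main() {
    // ...
}
```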

In summary, let's remember that, currently, Go isn't CFS-aware: GOMAXPROCS is based on the host machine rather than on the defined CPU limit. Consequently, we can reach a state where the CPU is throttled, leading to long pauses and substantial effects such as a significant latency increase. Until Go becomes CFS-aware, one solution is to rely on automaxprocs to automatically set GOMAXPROCS to the defined quota.

Summary

  • Knowing how to use CPU caches is important for optimizing CPU-bound applications, because the L1 cache is about 50 to 100 times faster than the main memory.

  • Understanding cache line concepts is critical to understanding how data is organized in data-intensive applications. The CPU doesn't fetch memory word by word; instead, it typically copies blocks of memory into 64-byte cache lines. To fully utilize each individual cache line, implement spatial locality.

  • Making code predictable for the CPU is also an effective way to optimize certain functions. For example, a CPU's unit or constant strides are predictable, but non-unit strides (for example, a linked list) are unpredictable.

  • To avoid a critical stride, hence utilizing only a tiny portion of the cache, be aware that caches are partitioned.

  • Knowing that lower-level CPU caches are not shared across all cores helps avoid performance-degrading patterns such as false sharing when writing concurrent code. Sharing memory is an illusion.

  • Use instruction-level parallelism (ILP) to optimize specific sections of code to allow the CPU to execute as many instructions in parallel as possible. Identifying data risks is one of the main steps.

  • Remember that in Go, basic types are aligned according to their own size; keeping this in mind avoids common mistakes. For example, keep in mind that reorganizing the fields of a struct in descending order of size can lead to more compact structs (less memory allocation and potentially better spatial locality).

  • Understanding the fundamental difference between the heap and the stack should also be part of your core knowledge when optimizing Go applications. Stack allocation is almost free, while heap allocation is slower and relies on GC to clean up memory.

  • Reducing allocations is also an important aspect of optimizing Go applications. This can be achieved in different ways, such as carefully designing the API to prevent sharing, understanding common Go compiler optimizations, and using sync.Pool.

  • Use the fast-path inlining technique to effectively reduce the amortized time of calling functions.

  • Rely on profiling and the execution tracer to understand how an application performs and the parts to optimize.

  • Knowing how to tune the GC can have several benefits, such as handling sudden increases in load more efficiently.

  • To help avoid CPU throttling when deployed in Docker and Kubernetes, keep in mind that Go isn't CFS-aware.

Final words

Congratulations on completing 100 Go Mistakes and How to Avoid Them. I sincerely hope you enjoy reading this book and that it will be useful for your personal and/or professional projects.

Remember, making mistakes is part of the learning process, and as I emphasized in the preface, it is also a great source of inspiration for this book. At the end of the day, what matters is our ability to learn from it.

If you want to continue the discussion, you can follow me on Twitter: @teivah.
