Analyzing a SIGSEGV failure when deploying Hyperledger Fabric on Alibaba Cloud

Recently, some friends in the Hyperledger community reported that they hit a fatal SIGSEGV error while deploying the open source blockchain project Hyperledger Fabric on Alibaba Cloud. I have run into and solved a similar problem before, so I am sharing the analysis and what we learned here, in the hope that it is useful to others.

Problem Description

During the deployment of Hyperledger Fabric, the peer and orderer services failed to start, and an error was also reported when executing the cli-test.sh test on the cli container. The error type is signal SIGSEGV: segmentation violation. An example of the error log is as follows:

```
2017-11-01 02:44:04.247 UTC [peer] updateTrustedRoots -> DEBU 2a0 Updating trusted root authorities for channel mychannel
fatal error: unexpected signal during runtime execution
[signal SIGSEGV: segmentation violation code=0x1 addr=0x63 pc=0x7f9d15ded259]
runtime stack:
runtime.throw(0xdc37a7, 0x2a)
        /opt/go/src/runtime/panic.go:566 +0x95
runtime.sigpanic()
        /opt/go/src/runtime/sigpanic_unix.go:12 +0x2cc
goroutine 64 [syscall, locked to thread]:
runtime.cgocall(0xb08d50, 0xc4203bcdf8, 0xc400000000)
        /opt/go/src/runtime/cgocall.go:131 +0x110 fp=0xc4203bcdb0 sp=0xc4203bcd70
net._C2func_getaddrinfo(0x7f9d000008c0, 0x0, 0xc420323110, 0xc4201a01e8, 0x0, 0x0, 0x0)
```

Analysis process

We analyzed the failure and ran a number of experiments. Following a hint in the Hyperledger Fabric bug https://jira.hyperledger.org/browse/FAB-5822, we adopted this workaround:

Add GODEBUG=netdns=go to the environment variables of peer, orderer and cli in the docker compose yaml.

This setting forces the pure Go resolver instead of the cgo resolver. The error log shows that the crash happens in the cgo resolver: the failing goroutine is inside net._C2func_getaddrinfo, which is a cgo call.

Next, we looked at the circumstances under which Go switches between the cgo resolver and the pure Go resolver.

The official Go documentation for the net package (https://golang.org/pkg/net/) says:

Name Resolution
The method for resolving domain names, whether indirectly with functions like Dial or directly with functions like LookupHost and LookupAddr, varies by operating system.
On Unix systems, the resolver has two options for resolving names. It can use a pure Go resolver that sends DNS requests directly to the servers listed in /etc/resolv.conf, or it can use a cgo-based resolver that calls C library routines such as getaddrinfo and getnameinfo.
By default the pure Go resolver is used, because a blocked DNS request consumes only a goroutine, while a blocked C call consumes an operating system thread. When cgo is available, the cgo-based resolver is used instead under a variety of conditions: on systems that do not let programs make direct DNS requests (OS X), when the LOCALDOMAIN environment variable is present (even if empty), when the RES_OPTIONS or HOSTALIASES environment variable is non-empty, when the ASR_CONFIG environment variable is non-empty (OpenBSD only), when /etc/resolv.conf or /etc/nsswitch.conf specify the use of features that the Go resolver does not implement, and when the name being looked up ends in .local or is an mDNS name.
The resolver decision can be overridden by setting the netdns value of the GODEBUG environment variable (see package runtime) to go or cgo, as in:
export GODEBUG=netdns=go # force pure Go resolver
export GODEBUG=netdns=cgo # force cgo resolver
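For code that you control yourself, the same preference can be expressed in the program instead of through the environment. The sketch below is only an illustration and is not what we changed in Hyperledger Fabric: it assumes a Go release that provides net.Resolver with the PreferGo field, and the helper name newPureGoResolver is made up for this example.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// newPureGoResolver (hypothetical helper) returns a resolver that asks the
// net package to prefer Go's built-in DNS client over the cgo-based one.
func newPureGoResolver() *net.Resolver {
	return &net.Resolver{PreferGo: true}
}

func main() {
	// Bound the lookup so a broken DNS setup cannot hang the program.
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()

	addrs, err := newPureGoResolver().LookupHost(ctx, "localhost")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Println("localhost resolves to:", addrs)
}
```

This only affects lookups made through that resolver value, so for prebuilt binaries such as the Fabric images the GODEBUG variable remains the practical switch.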

With this clue, we compared the resolver configuration files of the earlier environment, where deployment succeeded, with those of the recently created environment, where it failed, and found the difference:

In a container in the old environment (blockchain deployment succeeded):
```
# cat /etc/resolv.conf 
nameserver 127.0.0.11
options ndots:0
```

In a container in the new environment (blockchain deployment failed):

```
# cat /etc/resolv.conf 
nameserver 127.0.0.11
options timeout:2 attempts:3 rotate single-request-reopen ndots:0
```

This difference explains the behavior. In the old environment the pure Go resolver was used. In the new environment Go switched to the cgo resolver, because resolv.conf contained the option single-request-reopen, which the pure Go resolver does not support.

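Note: at the commit linked below, the pure Go resolver recognizes only the options ndots, timeout, attempts and rotate. Any other option falls into the default branch and sets conf.unknownOpt.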
https://github.com/golang/go/blob/964639cc338db650ccadeafb7424bc8ebb2c0f6c/src/net/dnsconfig_unix.go

```go
case "options": // magic options
	for _, s := range f[1:] {
		switch {
		case hasPrefix(s, "ndots:"):
			n, _, _ := dtoi(s[6:])
			if n < 0 {
				n = 0
			} else if n > 15 {
				n = 15
			}
			conf.ndots = n
		case hasPrefix(s, "timeout:"):
			n, _, _ := dtoi(s[8:])
			if n < 1 {
				n = 1
			}
			conf.timeout = time.Duration(n) * time.Second
		case hasPrefix(s, "attempts:"):
			n, _, _ := dtoi(s[9:])
			if n < 1 {
				n = 1
			}
			conf.attempts = n
		case s == "rotate":
			conf.rotate = true
		default:
			conf.unknownOpt = true
		}
	}
```
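To check a given resolv.conf for this condition before deploying, the same matching rules can be applied in a small standalone program. The following is a minimal sketch, not code from the Go standard library: the function name unsupportedResolvOptions is hypothetical, and the list of recognized options is copied from the excerpt at the linked commit, so it may differ in other Go versions.

```go
package main

import (
	"fmt"
	"strings"
)

// unsupportedResolvOptions (hypothetical helper) returns every option on an
// "options" line that the excerpt's switch would send to its default branch,
// that is, every option that would set conf.unknownOpt.
func unsupportedResolvOptions(resolvConf string) []string {
	var unknown []string
	for _, line := range strings.Split(resolvConf, "\n") {
		fields := strings.Fields(line)
		if len(fields) == 0 || fields[0] != "options" {
			continue // blank lines, comments, nameserver lines, etc.
		}
		for _, opt := range fields[1:] {
			switch {
			case strings.HasPrefix(opt, "ndots:"),
				strings.HasPrefix(opt, "timeout:"),
				strings.HasPrefix(opt, "attempts:"),
				opt == "rotate":
				// Recognized by the pure Go resolver at that commit.
			default:
				unknown = append(unknown, opt)
			}
		}
	}
	return unknown
}

func main() {
	failing := "nameserver 127.0.0.11\noptions timeout:2 attempts:3 rotate single-request-reopen ndots:0\n"
	working := "nameserver 127.0.0.11\noptions ndots:0\n"

	fmt.Println(unsupportedResolvOptions(failing)) // prints [single-request-reopen]
	fmt.Println(unsupportedResolvOptions(working)) // prints []
}
```

Run against the two container files shown earlier, it flags only single-request-reopen, which matches the difference we found by hand.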

We then looked at why the content of resolv.conf differed between the old and new containers, and found that the resolv.conf on newly created ECS hosts had changed recently:

Failed Environment - Newly Created ECS:

```
# cat /etc/resolv.conf
# Dynamic resolv.conf(5) file for glibc resolver(3) generated by resolvconf(8)
#     DO NOT EDIT THIS FILE BY HAND -- YOUR CHANGES WILL BE OVERWRITTEN
nameserver 100.100.2.138
nameserver 100.100.2.136
options timeout:2 attempts:3 rotate single-request-reopen
```

Success Environment - Original ECS:

```
# cat /etc/resolv.conf
# Dynamic resolv.conf(5) file for glibc resolver(3) generated by resolvconf(8)
#     DO NOT EDIT THIS FILE BY HAND -- YOUR CHANGES WILL BE OVERWRITTEN
nameserver 100.100.2.136
nameserver 100.100.2.138
```

We also tried to understand why switching to the cgo resolver produces a SIGSEGV. The following article explains that statically linking cgo code can lead to SIGSEGV errors:
https://tschottdorf.github.io/golang-static-linking-bug

And this Hyperledger Fabric bug points out that the Hyperledger Fabric build is in fact statically linked, in particular the parts related to getaddrinfo:
https://jira.hyperledger.org/browse/FAB-6403

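At this point we had found the root cause and could reconstruct the whole chain of events:

The resolv.conf on recently created ECS hosts changed -> domain name resolution inside the Hyperledger Fabric containers switched from the pure Go resolver to the cgo resolver -> this triggered a known SIGSEGV error caused by statically linked cgo -> the Hyperledger Fabric deployment failed.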

Suggested solution

Update the Hyperledger Fabric docker compose yaml template and add the environment variable GODEBUG=netdns=go to all Hyperledger Fabric nodes (such as orderer, peer, ca and cli) to force the use of the pure Go resolver.
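After updating the template, it can be worth confirming that the variable actually reaches the process inside each container. The following is a minimal sketch of such a check, under stated assumptions: the helper name forcesPureGo is hypothetical, it only inspects the GODEBUG string and does not query the Go runtime, and it assumes that the last netdns entry wins if the variable lists more than one.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// forcesPureGo (hypothetical helper) reports whether a GODEBUG value such as
// "netdns=go" or "gctrace=1,netdns=go+1" selects the pure Go resolver.
func forcesPureGo(godebug string) bool {
	setting := ""
	for _, kv := range strings.Split(godebug, ",") {
		if strings.HasPrefix(kv, "netdns=") {
			// Assumption: a later netdns entry overrides an earlier one.
			setting = strings.TrimPrefix(kv, "netdns=")
		}
	}
	// "go" may carry a debug level suffix, as in "go+1".
	return setting == "go" || strings.HasPrefix(setting, "go+")
}

func main() {
	godebug := os.Getenv("GODEBUG")
	if forcesPureGo(godebug) {
		fmt.Println("GODEBUG forces the pure Go resolver")
	} else {
		fmt.Printf("GODEBUG=%q does not force the pure Go resolver\n", godebug)
	}
}
```

Running it with the same environment as the peer, orderer, ca and cli containers shows whether the compose change took effect for each of them.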