Troubleshooting and fixing the Datadog eBPF module offset-guess.o

Solving the right problems well for the company has always been my professional principle, and I consider it a real pleasure.

1. Locating the problem

After enabling network-tracer in the Datadog system-probe module, the following error appeared:

ERROR | (cmd/system-probe/api/module/loader.go:60 in Register) | new module `network_tracer` error: error guessing offsets: could not load bpf module for offset guessing: load BTF maps: section .maps: map tracer_status already exists

offset-guess.o is the offset-guessing module. I assume Datadog does this for Linux kernel compatibility, but why would compatibility code like this fail here?

Searching the code for the error message leads to this location:

	if err := ec.loadBTFMaps(maps); err != nil {
		return nil, fmt.Errorf("load BTF maps: %w", err)
	}

Inside loadBTFMaps:

			v, ok := vs.Type.(*btf.Var)
			if !ok {
				return fmt.Errorf("section %v: unexpected type %s", sec.Name, vs.Type)
			}
			name := string(v.Name)

			// The BTF metadata for each Var contains the full length of the map
			// declaration, so read the corresponding amount of bytes from the ELF.
			// This way, we can pinpoint which map declaration contains unexpected
			// (and therefore unsupported) data.
			_, err := io.Copy(internal.DiscardZeroes{}, io.LimitReader(rs, int64(vs.Size)))
			if err != nil {
				return fmt.Errorf("section %v: map %s: initializing BTF map definitions: %w", sec.Name, name, internal.ErrNotSupported)
			}

			if maps[name] != nil {
				return fmt.Errorf("section %v: map %s already exists", sec.Name, name)
			}

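From the loadBTFMaps code, the error is raised because the eBPF .maps section seen through BTF shows up twice, so the next step is to inspect .maps in the ELF file itself: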

readelf -x .maps -r  /opt/datadog-agent/embedded/share/system-probe/ebpf/offset-guess.o

The .maps section shows up twice, which should not happen under normal circumstances:

Hex dump of section '.maps':
  0x00000000 00000000 00000000 00000000 00000000 ................
  0x00000010 00000000 00000000 00000000 00000000 ................
  0x00000020 00000000 00000000                   ........


Hex dump of section '.maps':
  0x00000000 00000000 00000000 00000000 00000000 ................
  0x00000010 00000000 00000000 00000000 00000000 ................
  0x00000020 00000000 00000000                   ........
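The same check can be scripted with Go's standard debug/elf package instead of reading the hex dump by eye. This is a rough sketch under the assumption that the duplication is visible as two ELF sections with the same name; `countNamed` and `sectionNames` are hypothetical helpers written for this example, and the object file path is passed as the first argument.

```go
package main

import (
	"debug/elf"
	"fmt"
	"os"
)

// countNamed returns how many entries in names are equal to want.
func countNamed(names []string, want string) int {
	n := 0
	for _, name := range names {
		if name == want {
			n++
		}
	}
	return n
}

// sectionNames opens an ELF object file and returns its section names in order.
func sectionNames(path string) ([]string, error) {
	f, err := elf.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	names := make([]string, 0, len(f.Sections))
	for _, s := range f.Sections {
		names = append(names, s.Name)
	}
	return names, nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: countmaps <object-file>")
		return
	}
	names, err := sectionNames(os.Args[1])
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf(".maps sections: %d\n", countNamed(names, ".maps"))
}
```

A count greater than one for a given object file would match what the readelf output shows.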

So the problematic file is /opt/datadog-agent/embedded/share/system-probe/ebpf/offset-guess.o

The file most likely went wrong at compile time, so I went back over the full build log of the eBPF module and found an error:

[19/33] clang -MD -MF pkg/ebpf/bytecode/build/offset-guess-debug.bc.d -emit-llvm -D__KERNEL__ -DCONFIG_64BIT -D__BPF_TRACING__ -DKBUILD_MODNAME=\"ddsysprobe\" -Wno-unused-value -Wno-pointer-sign -Wno-compare-distinct-pointer-types -Wunused -Wall -Werror -include pkg/ebpf/c/asm_goto_workaround.h -O2 -fno-stack-protector -fno-color-diagnostics -fno-unwind-tables -fno-asynchronous-unwind-tables -fno-jump-tables -fmerge-all-constants -Ipkg/ebpf/c -isystem/usr/src/linux-headers-5.15.0-52-generic/include -isystem/usr/src/linux-headers-5.15.0-52-generic/include/uapi -isystem/usr/src/linux-headers-5.15.0-52-generic/include/generated/uapi -isystem/usr/src/linux-headers-5.15.0-52-generic/arch/x86/include -isystem/usr/src/linux-headers-5.15.0-52-generic/arch/x86/include/uapi -isystem/usr/src/linux-headers-5.15.0-52-generic/arch/x86/include/generated -Ipkg/network/ebpf/c -g -DDEBUG=1 -c pkg/network/ebpf/c/prebuilt/offset-guess.c -o pkg/ebpf/bytecode/build/offset-guess-debug.bc
[20/33] cd pkg/network/http && CC=clang go tool cgo -godefs -- -fsigned-char http_types.go | go run /home/zhanglei/data/datadog-agent/pkg/ebpf/cgo/genpost.go > http_types_linux.go
cgo-builtin-prolog:1:10: fatal error: 'stddef.h' file not found
#include <stddef.h> /* for ptrdiff_t and size_t below */

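With this error in view, the cause becomes clear: the clang in use is too old and is not compatible with the host operating system. This matters because Datadog relies on genpost to generate the cgo files used for runtime compilation.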

Datadog's system-probe uses clang version 12, which is clearly incompatible with my host. The key piece of code is:

    if clang_version_str != CLANG_VERSION:
        # download correct version from dd-agent-omnibus S3 bucket
        clang_url = f"https://dd-agent-omnibus.s3.amazonaws.com/llvm/clang-{CLANG_VERSION}.{arch}"
        ctx.run(f"{sudo} wget -q {clang_url} -O /opt/datadog-agent/embedded/bin/clang-bpf")
        ctx.run(f"{sudo} chmod 0755 /opt/datadog-agent/embedded/bin/clang-bpf")

    if llc_version_str != CLANG_VERSION:
        llc_url = f"https://dd-agent-omnibus.s3.amazonaws.com/llvm/llc-{CLANG_VERSION}.{arch}"
        ctx.run(f"{sudo} wget -q {llc_url} -O /opt/datadog-agent/embedded/bin/llc-bpf")
        ctx.run(f"{sudo} chmod 0755 /opt/datadog-agent/embedded/bin/llc-bpf")

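In other words, whenever the installed version differs from CLANG_VERSION, the build downloads and uses clang 12, and that clang 12 is not compatible with my host.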
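To see which clang the host actually provides before comparing it against the pinned version, the version string can be extracted from `clang --version`. The sketch below is an illustration only: `parseClangVersion` is a hypothetical helper, and the regular expression assumes the usual "clang version X.Y.Z" banner, which may differ between distributions.

```go
package main

import (
	"fmt"
	"os/exec"
	"regexp"
)

// clangVersionRe matches the "clang version X.Y.Z" part of the banner.
var clangVersionRe = regexp.MustCompile(`clang version (\d+\.\d+\.\d+)`)

// parseClangVersion extracts X.Y.Z from the text printed by `clang --version`.
// It returns false when the banner does not have the expected shape.
func parseClangVersion(out string) (string, bool) {
	m := clangVersionRe.FindStringSubmatch(out)
	if m == nil {
		return "", false
	}
	return m[1], true
}

func main() {
	out, err := exec.Command("clang", "--version").Output()
	if err != nil {
		fmt.Println("could not run clang:", err)
		return
	}
	if v, ok := parseClangVersion(string(out)); ok {
		fmt.Println("host clang version:", v)
	} else {
		fmt.Println("unrecognized clang version banner")
	}
}
```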

2. Solving the problem

My host runs Ubuntu, so I only need to install clang and llvm versions that are compatible with it:

apt install clang

This installed clang-14, so I replaced /usr/bin/clang with the version that is compatible with my host:

cp /usr/bin/clang-14 /usr/bin/clang

Rebuild Datadog's system-probe module:

invoke system-probe.build

The build finished without errors, and running system-probe again also produced no errors. Network data from the eBPF layer is now collected normally.


Reposted from blog.csdn.net/qq_32783703/article/details/127690233