Is it hard to stitch service topologies together? How eBPF gives Didi Observability new ideas

In the previous article we covered the practice and implementation of observability at Didi, focusing on how different observation signals are correlated. But how are the call relationships between services stitched together? And what role does eBPF, currently booming in the industry, play at Didi? This article walks through both questions.

Background

Business context: observing business interface calls

In addition to building Didi's MTL (metrics, traces, logs) capabilities, Didi's observability platform also covers more business-facing concerns such as observing data and service interface calls.

To avoid ambiguity, let's first define what we mean by interface call topology observation. The following figure describes a calling relationship:

[Figure: a request and response process]

Here we use the four-tuple [caller=A, caller-func=/a, callee=B, callee-func=/b], abbreviated as [A, /a, B, /b], together with [A, /a, C, /c], to describe service A calling B:/b and C:/c after its /a handler is triggered. Once enough interface call data has been collected, given a few call entry points of a business (such as [A, /a] above), we can chain the interface call tuples together step by step and work out the important call links of that business.
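
As a rough illustration of that chaining step, the sketch below models each observed call as a four-tuple edge and walks outward from a given entry point. It is a minimal sketch under our own naming, not Didi's actual implementation.

package main

import "fmt"

// Edge is one observed interface call: [caller, caller-func, callee, callee-func].
type Edge struct {
  Caller, CallerFunc, Callee, CalleeFunc string
}

// chain walks the edge set from a business entry point [caller, callerFunc]
// and returns every call edge reachable from it.
func chain(edges []Edge, caller, callerFunc string) []Edge {
  var links []Edge
  visited := map[[2]string]bool{}
  queue := [][2]string{{caller, callerFunc}}
  for len(queue) > 0 {
    cur := queue[0]
    queue = queue[1:]
    if visited[cur] {
      continue
    }
    visited[cur] = true
    for _, e := range edges {
      if e.Caller == cur[0] && e.CallerFunc == cur[1] {
        links = append(links, e)
        queue = append(queue, [2]string{e.Callee, e.CalleeFunc})
      }
    }
  }
  return links
}

func main() {
  edges := []Edge{
    {"A", "/a", "B", "/b"},
    {"A", "/a", "C", "/c"},
    {"B", "/b", "D", "/d"},
  }
  for _, e := range chain(edges, "A", "/a") {
    fmt.Printf("[%s, %s, %s, %s]\n", e.Caller, e.CallerFunc, e.Callee, e.CalleeFunc)
  }
}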

Building these call links matters greatly for service stability: disaster recovery, on-demand capacity expansion, pre-peak health inspection, and similar work all rely on the core call links being mapped out. In practice, for fault handling and capacity evaluation, an interface-level call topology is far more useful than a service-level or container/physical-machine-level one.


Generally speaking, an interface-granularity service topology can be stitched together from call logs or call metrics. Didi Observability initially generated the interface call topology from a combination of call logs and call metrics. Later, as unified service governance advanced, business-reported metrics came to fully cover the call relationships found in the call logs, and the cost of generating the interface topology dropped sharply. In the topology generation scenario, we therefore switched to generating it from service call metrics alone.

[Figure: schematic of stitching the interface topology from call metrics]
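
Concretely, each call-metric series carries the four-tuple in its labels, and each series becomes one topology edge. The helper below (reusing the Edge type from the sketch above) shows the idea; the label names caller, caller_func, callee, and callee_func are our illustrative assumptions, not Didi's actual metric schema.

// edgeFromLabels turns the labels of one call-metric series into a topology edge.
// The label names are assumed for illustration; real schemas will differ.
func edgeFromLabels(labels map[string]string) (Edge, bool) {
  e := Edge{
    Caller:     labels["caller"],
    CallerFunc: labels["caller_func"],
    Callee:     labels["callee"],
    CalleeFunc: labels["callee_func"],
  }
  // Series with missing fields cannot form an edge; this is exactly the kind of
  // reporting gap discussed in the next section.
  if e.Caller == "" || e.CallerFunc == "" || e.Callee == "" || e.CalleeFunc == "" {
    return Edge{}, false
  }
  return e, true
}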

Business problem: verifying the service interface topology

Stitching call links together from interface call metrics is a common approach, but the generated results clearly suffer from the following problems:

  • The generated data lacks a way to be verified. Since the data is reported by business code, even with a common SDK the caller-func information still has to be passed in explicitly at each call site. In practice, missing or incorrect caller-func values are quite common.

  • Verifying and generating the call relationships is expensive. Relying on business code reporting means the code must follow certain standards. For core call links it is relatively easy to push code changes and business teams cooperate closely, but for non-core links or legacy projects that have run stably for a long time, pushing standardized code changes is hard. Adding the data manually requires combing through the project by hand, which is impractical for links with close to a thousand calls.

Both problems are common whenever the service interface topology is stitched together from metrics.

In Didi's observability practice, once the complexity of core links reaches the order of thousands, a considerable proportion of call relationships end up missing or wrong, even with a dedicated team driving metric onboarding and governance of business call links.

[Figure: normal results under ideal circumstances]

[Figure: possible results when the metric information is incorrect]

To address the verification problem, Didi Observability developed a non-intrusive service interface topology collection solution based on eBPF (referred to simply as BPF below unless stated otherwise). Combining metric data with BPF collection lets us verify the accuracy of the interface topology and fill in missing data. We also went on to explore deeper observability uses of BPF, such as MTL fusion.
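
At its core, the metric + BPF combination is a set comparison: edges observed by BPF but absent from the metrics are gaps to fill, and metric edges never observed by BPF are candidates for review. The sketch below (again reusing the Edge type from above) illustrates the idea in simplified form; it is not Didi's production logic.

// reconcile compares metric-derived edges with BPF-observed edges. It returns
// edges missing from the metrics (to supplement the topology) and metric edges
// never seen by BPF (to review for possible mis-reporting).
func reconcile(metricEdges, bpfEdges []Edge) (missing, suspect []Edge) {
  fromMetric := map[Edge]bool{}
  for _, e := range metricEdges {
    fromMetric[e] = true
  }
  fromBPF := map[Edge]bool{}
  for _, e := range bpfEdges {
    fromBPF[e] = true
    if !fromMetric[e] {
      missing = append(missing, e)
    }
  }
  for _, e := range metricEdges {
    if !fromBPF[e] {
      suspect = append(suspect, e)
    }
  }
  return missing, suspect
}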

Solution

Introduction to BPF

BPF originally stood for Berkeley Packet Filter. Starting with kernel 3.15 it was extended: the number of BPF program registers was increased, the memory available to a BPF program was enlarged, and many new BPF event types were added, making BPF highly customizable. To distinguish it from the earlier form, the pre-3.15 BPF is called cBPF (classic BPF) and the extended version is called eBPF (extended BPF); since then, "BPF" has become more the name of a technology than an abbreviation.

As of kernel 4.18, some of the event types supported by BPF, with brief descriptions, are listed below:

[Table: BPF event types supported as of kernel 4.18, with brief descriptions]

This article mainly involves uprobe and kprobe. Most kernel functions can be hooked with kprobe, and in user programs any function present in the symbol table can be hooked with uprobe.

When a kprobe or uprobe fires, only the target function's arguments (or return value) and stack information can be obtained. The following bpftrace example observes /bin/bash, capturing the user's commands by reading the return value of readline.

#!/usr/bin/bpftrace


BEGIN
{
  printf("开始观测bash...\n使用Ctrl-C停止\n");
}


// fires when readline in /bin/bash returns; retval points to the command line the user typed
uretprobe:/bin/bash:readline
{
  printf("cmd: %s\n", str(retval));
}

Bash declares readline as follows; referring to the target function's source makes the BPF logic easier to follow.

/* Read a line of input. Prompt with PROMPT. A NULL PROMPT means none. */
extern char *readline (const char *);

Once the script is running, the probe fires each time the target function executes:

$ sudo bpftrace ./bashreadline.bt
Attaching 2 probes...
Start observing bash...
Press Ctrl-C to stop
cmd: ls -l
cmd: pwd
cmd: crontab -e
cmd: clear

Since eBPF was introduced in kernel 3.15, its functionality has kept expanding. One of the more significant additions is BTF (BPF Type Format), introduced in kernel 4.18, which makes BPF bytecode easier to load and use.

Development of BPF

To implement a given feature, native BPF development typically writes restricted C that calls the bpf-helpers functions, compiles it with LLVM into BPF bytecode, and loads it through system calls. Writing raw C this way is fairly cumbersome. The iovisor project launched the bcc library to make BPF development more convenient, and also maintains bpftrace, an extremely easy-to-use tool that supports one-liner scripts. Cilium, well known in the industry, maintains the cilium-ebpf Go library. Beyond bcc, bpftrace, and cilium-ebpf there are also tools such as coolbpf, which supports the whole production cycle, and aya, which provides BPF support in Rust.

[Figure: the BPF ecosystem, image from ebpf.io]
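
For a feel of what development with a library like cilium-ebpf looks like, the sketch below loads a BPF object compiled separately with clang and attaches one of its programs to a uprobe on a Go binary. The object file name probe.o, the program name trace_serve_http, and the target binary are placeholders of ours, not part of any real project.

package main

import (
  "log"

  "github.com/cilium/ebpf"
  "github.com/cilium/ebpf/link"
)

func main() {
  // Load the BPF programs and maps from an object file compiled with clang.
  coll, err := ebpf.LoadCollection("probe.o")
  if err != nil {
    log.Fatalf("load collection: %v", err)
  }
  defer coll.Close()

  // Open the target binary and attach one program to a symbol in it.
  ex, err := link.OpenExecutable("./http_demo")
  if err != nil {
    log.Fatalf("open executable: %v", err)
  }
  up, err := ex.Uprobe("net/http.serverHandler.ServeHTTP",
    coll.Programs["trace_serve_http"], nil)
  if err != nil {
    log.Fatalf("attach uprobe: %v", err)
  }
  defer up.Close()

  // Keep running; a real agent would read events from a BPF map or ring buffer here.
  select {}
}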

Using BPF to solve the service interface topology problem

The previous section noted that the generated service interface topology data cannot be verified. At Didi Observability this problem is now solved with BPF. Here we demonstrate the approach with a simple example and a solution built as a bpftrace script.

Example: simple golang service

Below is a simple Golang service built with go1.16. As the handler code shows, the four-tuple here is [local, /handle, local, /echo]. To keep the example easy to explain, the handle logic and the downstream request run serially, with no extra goroutines; this detail matters and will be explained later.

package main

import (
  "fmt"
  "io/ioutil"
  "net/http"

  "github.com/gin-gonic/gin"
)

// Resp is the response body used by both handlers.
type Resp struct {
  Errno  int    `json:"errno"`
  Errmsg string `json:"errmsg"`
}

func echo(c *gin.Context) {
  c.JSON(http.StatusOK, &Resp{
    Errno:  0,
    Errmsg: "ok",
  })
}

func handle(c *gin.Context) {
  client := http.Client{}
  req, _ := http.NewRequest(http.MethodGet,
    "http://0.0.0.0:9932/echo", nil)
  resp, err := client.Do(req)
  if err != nil {
    fmt.Println("failed to request", err.Error())
    c.JSON(http.StatusOK, &Resp{
      Errno:  1,
      Errmsg: "failed to request",
    })
    return
  }
  defer resp.Body.Close()

  respB, err := ioutil.ReadAll(resp.Body)
  if err != nil {
    fmt.Println("read resp failed")
    c.JSON(http.StatusOK, &Resp{
      Errno:  2,
      Errmsg: "failed to read request",
    })
    return
  }

  fmt.Println("resp: ", string(respB))
  c.JSON(http.StatusOK, &Resp{
    Errno:  0,
    Errmsg: "request okay",
  })
}

func main() {
  s := http.Server{
    Addr: "0.0.0.0:9932",
  }
  r := gin.Default()
  r.GET("/echo", echo)
  r.GET("/handle", handle)
  s.Handler = r
  s.ListenAndServe()
}

Collection logic and execution effects:

uprobe:./http_demo:net/http.serverHandler.ServeHTTP
{
  // sarg3 is the *http.Request argument; the offsets below locate req.URL,
  // then URL.Path's data pointer and length, for the target Go build
  $req_addr = sarg3;
  $url_addr = *(uint64*)($req_addr+16);
  $path_addr = *(uint64*)($url_addr+56);
  $path_len = *(uint64*)($url_addr+64);


  // where the HTTP request is handled: store caller_func info keyed by pid
  @caller_path_addr[pid] = $path_addr;
  @caller_path_len[pid] = $path_len;
  @callee_set[pid] = 0;
}


uprobe:./http_demo:"net/http.(*Client).do"
{
  // look up the caller info stored earlier, keyed by pid
  printf("caller: \n caller_path: %s\n",
  str(@caller_path_addr[pid], @caller_path_len[pid]));
  // sarg1 is the outgoing *http.Request
  $req_addr = sarg1;


  // extract the callee info from the outgoing request
  $addr = *(uint64*)($req_addr);
  $len = *(uint64*)($req_addr + 8);
  printf("callee: \n method: %s\n", str($addr, $len));


  $url_addr = *(uint64*)($req_addr + 16);
  $addr = *(uint64*)($url_addr + 40);
  $len = *(uint64*)($url_addr + 48);
  printf(" host: %s\n", str($addr, $len));


  $addr = *(uint64*)($url_addr + 56);
  $len = *(uint64*)($url_addr + 64);
  printf(" url: %s\n\n", str($addr, $len));


  @callee_set[pid] = 1;
}


uprobe:./http_demo:"net/http.(*response).finishRequest"
{
  // if there was no downstream request, output the caller on its own
  if (@callee_set[pid] == 0){
    printf("caller: \n caller_path: %s\n",
    str(@caller_path_addr[pid], @caller_path_len[pid]));
    printf("callee: none\n\n");
    @callee_set[pid] = 1;
  }
}

Running the collection script produces the following results:

# start collection
$ bpftrace ./http.bt
Attaching 2 probes... # waits here until a request is triggered
caller: # printed once a request is triggered
caller_path: /handle
callee:
  method: GET
  host: 0.0.0.0:9932
  url: /echo
caller:
  caller_path: /echo
  callee: none


# start the service
$ ./http_demo &
# trigger a request
$ curl http://0.0.0.0:9932/handle

As shown, the bpftrace script collects the four-tuple for calls to the target service's interfaces, and it does so without any code change in the target service. This is where BPF shows its appeal for observability.

Coverage and results of the actual solution

The example above demonstrates the main idea of using BPF for interface topology observation. Note that it keys the caller map by pid; in real projects, since a Golang goroutine does not correspond one-to-one with a pid, goid has to be used as the key instead.

In addition, since handler functions spawn new goroutines to issue downstream requests, the BPF side also has to track the goid derivation (parent-child) relationships, so that the caller information tied to a goid is not lost. With that in place, the processing approach for Golang services is clear.

[Figure: schematic of BPF-based service topology observation]

The figure above shows Didi Observability's current BPF solution for observing Golang interface calls. Its core consists of two parts:

  • Information collection. Caller-func, callee, callee-func, and related information are obtained by choosing appropriate hook points.

  • Information association. Given how Golang services work, goid is used as the association key, linking caller information with callee information to form the four-tuple (see the sketch below).
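
The sketch below illustrates this association step on the user-space side, assuming the BPF side emits three kinds of events tagged with a goid: a caller was observed, a goroutine was spawned, and a downstream callee was observed. The event shapes and names are ours for illustration; Didi's agent differs in the details.

package main

import "fmt"

// Hypothetical events emitted by the BPF side, each tagged with a goid.
type callerEvent struct{ Goid uint64; CallerFunc string }
type spawnEvent struct{ ParentGoid, ChildGoid uint64 }
type calleeEvent struct{ Goid uint64; Callee, CalleeFunc string }

type correlator struct {
  callerByGoid map[uint64]string // goid -> caller-func currently being handled
  parent       map[uint64]uint64 // child goid -> parent goid
}

func newCorrelator() *correlator {
  return &correlator{callerByGoid: map[uint64]string{}, parent: map[uint64]uint64{}}
}

func (c *correlator) onCaller(e callerEvent) { c.callerByGoid[e.Goid] = e.CallerFunc }
func (c *correlator) onSpawn(e spawnEvent)   { c.parent[e.ChildGoid] = e.ParentGoid }

// onCallee walks up the goroutine derivation chain to find the caller-func,
// then emits the four-tuple (the local service name is fixed here for brevity).
func (c *correlator) onCallee(e calleeEvent) {
  goid := e.Goid
  for {
    if f, ok := c.callerByGoid[goid]; ok {
      fmt.Printf("[local, %s, %s, %s]\n", f, e.Callee, e.CalleeFunc)
      return
    }
    p, ok := c.parent[goid]
    if !ok {
      fmt.Println("caller unknown for goid", e.Goid) // the association is lost
      return
    }
    goid = p
  }
}

func main() {
  c := newCorrelator()
  c.onCaller(callerEvent{Goid: 1, CallerFunc: "/handle"})
  c.onSpawn(spawnEvent{ParentGoid: 1, ChildGoid: 7})
  c.onCallee(calleeEvent{Goid: 7, Callee: "local", CalleeFunc: "/echo"})
}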

Didi Observability has so far covered Golang and PHP services with this approach. In practice, the effective coverage rate for target services is about 80%. The four-tuples that BPF added to the monitored core call links have been manually verified, and no abnormal four-tuples were found. Compared with metric-based data alone, on some core call links the number of newly discovered four-tuples reaches 20%.

Problems

Lost associations

The solution above is indeed the most intuitive one available today. The information collection part is not a big problem: although uprobe introduces a dependency on the target functions' arguments, across go1.10 through go1.20 as used in our production environment, apart from adapting to the new function calling convention introduced in go1.17, the information we need has stayed essentially unchanged.

The information association part is trickier. The existing solution correlates caller and callee information by tracking goroutine derivation relationships, but reality is often less cooperative. Real projects, for example, contain code like the following:

/* Case 1: the request is passed through a channel. In this scenario the association between events is lost and no four-tuple can be formed. */


var reqChan = make(chan *http.Request, 10)


func handle(w http.ResponseWriter, req *http.Request) {
  io.WriteString(w, "Hello, World\n")
  reqChan <- req // the request is handed off through a channel here
  return
}


func handleReq() {
  for {
    select {
    case req, ok := <-reqChan:
      if !ok {
        log.Println("channel closed")
        return
      }


      log.Println("received, ", req.Host, req.Method)
      // do some stuff
      // even if a downstream request is made here, it cannot be associated with the caller
    }
  }
}


func main() {
  go handleReq()
  http.HandleFunc("/hello", handle)
  http.ListenAndServe("0.0.0.0:9999", nil)
  return
}


/* Case 2: downstream work is dispatched to a goroutine pool. The goroutine
   derivation relationship is lost, so the events cannot be chained together. */
type GoroutinePool interface {
  Start() (error, bool)
  AddTask(func())
  Stop() (error, bool)
}


var pool GoroutinePool


func handle(w http.ResponseWriter, req *http.Request) {
  io.WriteString(w, "Hello, World\n")


  pool.AddTask(func() {
    // because a goroutine pool is used, the goroutine derivation relationship is lost and the events cannot be chained
    handleReq(req)
  })
  return
}


func handleReq(req *http.Request) {
  log.Println("received, ", req.Host, req.Method)
  // do some stuff
}


func main() {
  // init pool
  // pool = New()
  http.HandleFunc("/hello", handle)
  http.ListenAndServe("0.0.0.0:9999", nil)
 return
}

In both scenarios above, the goroutine derivation relationship cannot be obtained, so the existing solution cannot produce the four-tuple. Problems like these reduce BPF's collection effectiveness; from experience so far, the proportion of four-tuples affected by such code in Golang projects is below 20%.

uprobe: complexity of adaptation

As the previous sections show, Didi Observability's service interface topology observation solution is built on uprobe.

BPF uprobes offer efficient data processing and an intuitive overall design. Because uprobe sits close to user code, it is well suited to problems the user perceives directly, such as slow function calls inside a framework.

Most projects, however, rely more heavily on kprobe: many of the practical tools shipped with bpftrace, most of deepflow's observation capabilities, and kindling's network data processing are all built on kprobe.

In practice, only a few projects are built entirely on uprobe, because uprobe has two drawbacks:

  • Less general. As the solution above shows, a uprobe-based approach is strongly tied to the language (and even the framework), and uprobe cannot work at all when the target binary has no symbol table. If the target usage scenarios are not well defined, uprobe requires adapting to each specific scenario, and the return on investment is very low.

  • Performance. Each uprobe trigger involves two switches between user mode and kernel mode, so a single uprobe invocation is expensive (roughly 1 µs per uprobe trigger versus roughly 100 ns per kprobe trigger). When the hooked function fires frequently, the target process's performance suffers.

Despite these shortcomings, Didi Observability still chose to build its solution on uprobe, mainly because uprobe development is faster and cheaper.

With uprobe, development is what-you-see-is-what-you-get: there is no data degradation, and key information does not have to be reconstructed from transport-layer messages. That saves development time and greatly reduces processing complexity. For a long HTTP message, a uprobe can read the required data, such as the URL, directly from the target function, whereas a kprobe would fire multiple times and have to parse the message to recover the same information. Today the actual CPU overhead of Didi Observability's ebpf-agent is normally under 10% of a single core (that figure is for ordinary business processes, including PHP processes; for routing nginx services the CPU usage is higher), and the impact on the target process's performance is barely noticeable.

Outlook

The need for a user-mode VM

Didi Observability uses a large number of uprobes: in the offline environment a single physical machine normally runs more than 1,500 uprobe hook points, and as BPF functionality is extended this number will keep growing. Putting that many uprobes into the kernel not only puts stability pressure on the kernel; because the BPF VM runs in kernel mode, each uprobe trigger also causes two switches between kernel mode and user mode, adding latency to the target process's function execution.

Both points make a user-mode VM all but inevitable. Only by running uprobes in a user-mode VM can the per-trigger cost be reduced, so that large-scale use of uprobe has little impact on the target service.

A BPF-based MTL fusion solution

Revisiting bpf-helpers, we find an interesting function:

long bpf_probe_write_user(void *dst, const void *src, u32 len)

Description
    Attempt in a safe way to write len bytes from the buffer src to dst in memory.
    It only works for threads that are in user context, and dst must be a valid
    user space address.

    This helper should not be used to implement any kind of security mechanism
    because of TOC-TOU attacks, but rather to debug, divert, and manipulate
    execution of semi-cooperative processes.

    Keep in mind that this feature is meant for experiments, and it has a risk
    of crashing the system and running programs. Therefore, when an eBPF program
    using this helper is attached, a warning including PID and process name is
    printed to kernel logs.

Return
    0 on success, or a negative error in case of failure.

This helper is powerful: it means BPF can write data directly into the target process's address space, which expands the scope of what BPF can do. In MTL fusion, the hard problem is that trace information cannot be reliably associated with metrics and logs.

[Figure: the original MTL fusion solution]

As the figure shows, if the correct trace information is not attached when a metric or log is reported, that metric or log cannot be associated with the trace.

If instead BPF tracks the processing path of each request and keeps hold of the request's trace information, then metrics and logs can be associated with the trace naturally at the moment they are produced. The following figures show three options for BPF enhancement:

[Figure: BPF-enhanced MTL fusion solution]

[Figure: MTL fusion solution combining BPF and an SDK]

[Figure: MTL fusion solution based purely on BPF]
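
To make the BPF + SDK variant concrete, one possible shape of the in-process side is sketched below, under our own assumptions rather than Didi's actual design: the SDK exposes a trace-ID slot whose symbol a BPF program could locate in the binary and fill via bpf_probe_write_user, and the logging wrapper stamps each log line with whatever is currently in the slot. A real design would need a per-request or per-goroutine slot; a single global is used here only to keep the sketch short.

package sdk

import "log"

// CurrentTraceID is a slot that an external BPF program could fill via
// bpf_probe_write_user after locating this symbol in the binary.
var CurrentTraceID [16]byte

// Infof emits a log line stamped with the trace ID currently in the slot,
// so the log can be associated with the trace without the caller passing it in.
func Infof(format string, args ...interface{}) {
  log.Printf("trace_id=%x "+format, append([]interface{}{CurrentTraceID[:]}, args...)...)
}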

Summary

With all these observation and collection methods in place, a large amount of observation data has been gathered. Should this data be delivered to users in raw detail, or aggregated along specified dimensions before being displayed? And which compute engine should do the aggregation, Spark or Flink?

The next article will show how Didi's observability team implements data computation, so stay tuned.


Cloud Native Night Talk

Which observability problems do you hope eBPF can solve? Feel free to leave a comment. If you'd like to discuss further, you can also message the account directly.

The author will pick one of the most thoughtful comments and give away a Didi Yuanqi denim tote bag. The winner will be drawn at 9 pm on September 21.

