eBPF Core Technology Practice in Didi's Cloud-Native Environment


Overview

eBPF is a revolutionary Linux kernel technology that can safely and efficiently extend kernel capabilities, and it has a wide range of applications; in cloud-native observability in particular it has become an industry hot spot. In Didi's cloud-native environment, eBPF has gone through business practice and internal open-source co-construction: the HuaTuo eBPF platform landed quickly and has delivered initial benefits. It currently supports key cloud-native components and services such as service access topology, container security, host security, network diagnosis, and root cause location, and HuaTuo is a featured incubation project of Didi's Open Source Committee. We hope this article gives developers in the industry a path for quickly applying eBPF to cloud-native scenarios, so that together we can improve the deep observability of cloud-native systems.

This article is organized as follows:

1. The origins of BPF

2. BPF today

3. Business pain points in Didi's production environment

  • Traffic playback testing

  • Service access topology

  • Container security

  • Kernel root cause location

4. Didi HuaTuo eBPF platform practice

  • Platform construction

  • Platform composition

  • Platform usage

5. Didi business landing practice

  • Kernel root cause location: background

  • Kernel root cause location: approach

  • Kernel root cause location: platform

6. Future planning and outlook

1. The Origins of BPF

BPF (Berkeley Packet Filter) originated in the 1993 paper by Steven McCanne and Van Jacobson, "The BSD Packet Filter: A New Architecture for User-level Packet Capture". Its goal was to provide an efficient new way to filter network packets. Initially, BPF could only be attached to network sockets: once BPF bytecode is bound to a socket, the bytecode runs every time a packet is received, and the kernel decides from its return value whether to let the packet through.


From an instruction-set perspective, the original BPF architecture was simple: a 32-bit accumulator A, a 32-bit index register X, and a 16x32-bit scratch memory. Even so, BPF implemented four classes of instructions: load, store, jump, and arithmetic. Early BPF had no just-in-time (JIT) compiler; all instructions ran entirely in a small in-kernel virtual machine, yet its performance still far outstripped other packet filters.


BPF is also convenient to use: we can generate BPF bytecode with tcpdump, libpcap, or bpf_asm and load it into the kernel via setsockopt with SO_ATTACH_FILTER. To filter ICMP packets, for example, the program only needs to check the protocol fields at fixed offsets into the L2/L3 packet headers.


BPF was first used in sk_filter and was later adopted by netfilter, seccomp, and the team driver, but its applications remained centered on network packet processing.


2. BPF Today

The core idea of BPF is kernel programmability: extending kernel capabilities efficiently without changing kernel source code or recompiling the kernel. In 2013, Alexei Starovoitov reworked BPF, adding new features and improving its performance. This further development brought huge changes to the kernel, giving it far more powerful, programmable, dynamic extension capabilities, which are valuable in any scenario that needs customization, whether to extend functionality or to optimize performance. The new version was named eBPF ("extended BPF"), and the previous BPF became cBPF ("classic BPF"). In the following sections, BPF refers to eBPF.

First, from the instruction-set perspective, eBPF is a RISC-style instruction set with ten 64-bit general-purpose registers, R0-R9 (plus R10, a read-only frame pointer). R0 holds return values and R1-R5 pass arguments when calling into kernel functions, a convention that maps naturally onto x86_64 and aarch64. eBPF supports two instruction encodings: a 64-bit-wide basic encoding, and a 128-bit-wide encoding formed by appending a 64-bit immediate after the basic encoding. eBPF also enriches the four instruction classes.

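As a small illustration of the basic 64-bit encoding (a sketch using the Linux UAPI headers; this only builds instructions in memory without loading them), the smallest valid program, `r0 = 0; exit`, is two 8-byte instructions:

```c
#include <linux/bpf.h>

/* Two instructions in the 64-bit basic encoding: each struct bpf_insn
 * packs an 8-bit opcode, 4-bit destination and source registers,
 * a 16-bit offset, and a 32-bit immediate into 8 bytes. */
static struct bpf_insn prog[] = {
    /* r0 = 0 */
    { .code = BPF_ALU64 | BPF_MOV | BPF_K, .dst_reg = BPF_REG_0, .imm = 0 },
    /* return r0 */
    { .code = BPF_JMP | BPF_EXIT },
};
```

A wide (128-bit) instruction such as a 64-bit immediate load would simply occupy two consecutive bpf_insn slots, the second carrying the upper half of the immediate.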

Just-in-time (JIT) compilation: the kernel's JIT compiler dynamically translates BPF instructions into native machine instructions, bringing runtime overhead close to zero. Eric Dumazet first implemented this during the cBPF era, supporting only x86-64. eBPF extended the compiler again for its own instruction set: eBPF instructions can be JIT-compiled not only into native CPU instructions but also into device-specific instructions for hardware such as smart NICs, which makes those devices programmable through eBPF.

Beyond the instruction-level extensions, eBPF adds a key-value data storage mechanism (maps) that can be used to store and exchange data between kernel and user space. eBPF also adds a richer set of BPF helper functions, which let BPF programs access kernel facilities and expand BPF capabilities while maintaining the security isolation between BPF programs and the kernel.

As BPF's capabilities grew, the technology moved beyond the kernel networking subsystem and was gradually applied to dynamic tracing, event detection, performance analysis and optimization, the I/O subsystem, process scheduling, the file subsystem, and more. Its application scenarios likewise expanded from isolated points into whole areas: observability, tracing, security, performance analysis, root cause analysis, and so on. Along the way, many excellent open-source projects emerged, such as BCC, Cilium, bpftrace, Falco, Katran, and Pixie.


3. Business Pain Points in Didi's Production Environment

A data center runs many basic services that play an important role in stability, production efficiency, scalability, and security, and each of these pieces of infrastructure faces challenges of varying degrees when actually deployed.

Traffic playback testing

Cloud-native deployments keep growing, and the business types, traffic, and scale they carry are expanding rapidly. Software testing faces great challenges as a result, including test-environment construction and test-case writing and maintenance. Traffic playback testing is a brand-new testing method: regression testing is done by replaying traffic in an offline test environment. Traffic playback can faithfully reproduce complex online access scenarios, greatly improving iteration efficiency, accelerating business regression testing, and raising R&D quality. The project faced these challenges:

  • Businesses use many programming languages and many versions of basic libraries.

  • Business network models are not unified (single-process PHP, goroutine-based Go, multi-thread/multi-process in other languages).

  • Solutions customized per business are costly, and coverage is hard to improve.

  • Intrusive approaches are noticeable to the business, making stability hard to guarantee.

Service Access Topology

With the rise of microservice architectures, the number of services grows daily and inter-service dependencies become increasingly complex. Observability of service links is therefore essential for business stability: it clearly shows the access relationships between service interfaces, along with business indicators such as performance, latency, and timeouts. When a problem occurs online, the faulty node can be located quickly from the service topology. The project faced these challenges:

  • There are many business types; sorting out relationships manually is costly and error-prone.

  • Stitching call relationships together from metrics and logs requires pushing a specific SDK to every service, which is very hard to drive, and links easily break in production.

Container security

With the general trend toward cloud native, more and more enterprises are embracing it. Containers have become the standard for application delivery and the unit in which computing resources and supporting facilities are delivered in the cloud-native era. As container use spreads, container security issues become more and more prominent. Didi has a complete set of container security solutions, and some of the core pain points are solved with eBPF.

Kernel root cause location

Didi's cloud-native platform runs with high container deployment density and high overcommit ratios, and in a shared-kernel container virtualization scenario, problems caused by unreasonable resource usage are easy to trigger. Traditional kernel metrics are basic, global, coarse-grained statistical measures, and traditional diagnostic tools consume significant resources and hurt performance. To get to the root cause of kernel problems, therefore, we need deep, always-on observation of the kernel subsystems, so that hard problems such as bursts, timeouts, and latency glitches can be solved during and after the event.

4. Didi HuaTuo eBPF Platform Practice

eBPF solves the language-dependence pain points above well, and combined with dynamic tracing it provides the foundation for deep kernel observation. However, the following questions must be answered before it can actually land:

  • With many requirements, how do we satisfy them quickly and raise R&D efficiency for fast delivery?

  • Industry deployments exist but are relatively small; how do we scale from isolated points to broad coverage while guaranteeing host stability?

  • How do we uniformly observe and bound the performance cost of probe points, and how do we quickly roll back, degrade, and stop losses when problems occur?

Platform construction

Based on these considerations, we built the eBPF platform. Businesses can use the general capabilities the platform provides directly and focus only on their own logic. Platform construction centered on improving R&D efficiency, providing a business-level view, and guaranteeing stability and performance.

  • Improve R&D efficiency

Early users had to care about how to parse BPF bytecode, how to load it into the kernel, how to create key-value maps and attach the code in a given section to a specific hook point, and finally how to get data back out of the kernel. An important function of the platform is to hide these low-level details: all of the above is available through just the bpfload and bpfmap interfaces.

  • Provide business perspective

Businesses and the platform, and different businesses, release on different schedules and rules. We therefore use BPF object bytecode to decouple business logic from the platform; a business calls platform interfaces as its requirements dictate. The platform also pursues standardization and supports SEC names defined by other open-source components, so existing BPF bytecode is compatible and can run on the platform directly.


  • Guarantee stability

When landing a new technology, host stability must be the focus. Our stability guarantees are built up across the kernel layer, the framework side, and business perception.


  • Guarantee performance

All BPF code runs in kernel mode, so excessive time spent there still affects the system. Event-driven design, shared memory, and ringbuf-based production and consumption are all applied in this platform.


Platform Composition

  • BPF bytecode management

The first capability is ELF parsing, covering SEC names, map definitions, variables, and structures. From the SEC name the platform determines the BPF program type and the hook attach point for automatic loading; from a map definition it parses the map type, size, key type, value type, and other information.

  • High Performance Data Processing

The platform supports many business types, so performance must be guaranteed along several dimensions. On the kernel side, probe hook points are evaluated dynamically and backed by a circuit-breaker mechanism; on the platform side, high-performance data channels based on ringbuf production and consumption reduce latency.

  • Stability management

The platform was designed to solve online problems and safeguard business stability, so the platform itself implements event circuit-breaking and self-healing. When an anomaly is detected in a host's BPF programs, the platform automatically unloads the BPF to prevent it from adding excessive load. Besides circuit-breaking and self-healing, the platform also limits the system resources it uses, such as CPU and memory.
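The unload decision can be sketched as a simple budget check (a hypothetical illustration, not the platform's actual API): sample the probe's cumulative runtime over a window and trip the fuse when it exceeds a CPU budget.

```c
#include <stdbool.h>

/* Trip the fuse when the probe consumed more than budget_pct percent
 * of the sampling window (all times in nanoseconds). */
static bool should_unload(unsigned long long probe_ns,
                          unsigned long long window_ns,
                          unsigned int budget_pct)
{
    return probe_ns * 100ULL > (unsigned long long)budget_pct * window_ns;
}
```

A caller would sample per-program runtime (e.g. via the kernel's BPF program run-time statistics), evaluate this check periodically, and detach the program when it returns true.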

  • Container Information Management

This function exists for two reasons: first, all services run in containers, and parameter passing and service identification need container information; second, kernel cgroup information must be aggregated with container information.

Platform usage

In addition to the API, the platform provides a command-line mode, so a misbehaving BPF object, or one being debugged, can be run on the platform directly.

The code example is as follows:

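A minimal sketch of a BPF C source the platform could load might look like this (libbpf-style SEC convention; the SEC macro is defined inline here instead of including <bpf/bpf_helpers.h>, and the hook point and names are illustrative):

```c
/* Define the libbpf SEC() section macro inline so the sketch is
 * self-contained. */
#define SEC(name) __attribute__((section(name), used))

char LICENSE[] SEC("license") = "GPL";

/* Runs on every execve() entry; a real program would read fields from
 * ctx and publish events to user space through a map or ring buffer. */
SEC("tracepoint/syscalls/sys_enter_execve")
int trace_execve(void *ctx)
{
    return 0;
}
```

The clang command that follows compiles such a source into BPF bytecode rather than native code.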

Compile:

clang -O2 -g -target bpf -c $(NAME).bpf.c -o $(NAME).o

Run: the platform automatically parses the structures defined by the BPF program and prints them to stdout.


5. Didi Business Landing Practice

Many Didi businesses are already connected to the HuaTuo eBPF platform, including service test regression, container security, host security, service access topology, network diagnosis, and kernel root cause location. Below we take kernel root cause location as an example.

Kernel root cause location: background

Cost reduction and efficiency are the theme at most Internet companies today. At Didi, container deployment density and overcommit ratios are high, and in this scenario it is hard to prevent one business's unreasonable resource usage from affecting others. When a failure occurs, the first measure is to stop the loss; containers are then migrated, so the problem site is lost. Reproducing the problem offline is difficult and expensive in both people and machines. Moreover, the occasional glitches, latency spikes, and timeouts seen online follow no pattern. In short, always-on observation is a very strong requirement.

Kernel root cause location: approach

Following the three pillars of observability, we first establish deep kernel observation metrics that are fine-grained enough to reflect the health of each kernel subsystem. When establishing these metrics, we evaluate their rationality and performance impact from several angles, and finally make the observation always-on. Second, we use event-driven collection to capture the context of kernel exceptions, focusing on errors on exception paths, slow paths, and other error states of the kernel subsystems; exception events are the key to root-causing kernel problems. With these underlying capabilities in place, we can aggregate and analyze the information and finally produce an analysis report.


Kernel root cause location: platform


We divide the kernel root cause location platform into four parts:

  • Kernel data collection. Collects core kernel metrics and the context of kernel exceptions.

  • Kernel data aggregation. Aggregates kernel observation data with container/Pod information and uploads it to storage.

  • Data analysis layer. Processes the collected data and provides analysis services.

  • Data display layer. Mainly the analysis and diagnosis center, observation center, log center, analysis reports, and alarm center.

6. Future Planning and Outlook

eBPF core technology has landed at scale across multiple scenarios in Didi's cloud-native environment, and the HuaTuo eBPF platform will serve more business lines in the future. We are currently looking for a suitable foundation under which to incubate the project and to build and share it with developers across the industry. Although eBPF has advanced greatly in recent years, it is still not mature enough in some scenarios, such as performance optimization and CPU scheduling, and deeper exploration is needed in online/offline colocation scenarios. We look forward to continued exchange and discussion with you.


Origin blog.csdn.net/DiDi_Tech/article/details/131546111