Optimizing a load balancer with eBPF

I. Introduction

It’s been a long time since I wrote an article. I’ve been busy making decisions recently, and I’m tired. I don’t know which path is best for me; risks and benefits coexist. Play it safe or take the risk? Not knowing how to choose, the two ideas keep fighting in my head.

Back on topic: I have been learning eBPF for a while, and I feel I know a little, but the gap is still large. This article is an experiment written while working through an eBPF course. It focuses on eBPF network programs, which are harder than what I did before, and since the material is new to me I can only start by reproducing an experiment. The experiment comes from "eBPF Core Technology and Practice" (eBPF 核心技术与实战) by Ni Pengfei on Geek Time.

II. Environment preparation

2.1 Install the test environment

Deploy the test environment according to the network architecture shown below:

[Figure: network architecture diagram]

Docker environment installation script:

# Web servers (each responds with its hostname, e.g. http1 or http2)
docker run -itd --name=http1 --hostname=http1 feisky/webserver
docker run -itd --name=http2 --hostname=http2 feisky/webserver
# Client
docker run -itd --name=client alpine
# Nginx
docker run -itd --name=nginx nginx

Description:

What is the alpine Docker image? Alpine is a security-oriented, lightweight Linux distribution. Unlike most distributions, Alpine uses musl libc and busybox to keep the system small (around 5 MB) and to reduce runtime resource consumption, while still offering far more functionality than busybox alone, which is why it has become increasingly popular in the open source community. Despite staying slim, Alpine provides its own package manager, apk: package information can be looked up at https://pkgs.alpinelinux.org/packages, or packages can be queried and installed directly with the apk command.
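As a minimal illustration, assuming you are inside a running Alpine container, apk can be used like this (the package name is just an example):

# Inside an Alpine container (illustrative only)
apk update        # refresh the package index
apk search curl   # look up a package
apk add curl      # install it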

Check the IP addresses of the Docker containers:

root@ubuntu-lab:/home/miao# IP1=$(docker inspect http1 -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}')
root@ubuntu-lab:/home/miao# IP2=$(docker inspect http2 -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}')
root@ubuntu-lab:/home/miao# echo $IP1
172.17.0.2
root@ubuntu-lab:/home/miao# echo $IP2
172.17.0.3
root@ubuntu-lab:/home/miao# IP3=$(docker inspect nginx -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}')
root@ubuntu-lab:/home/miao# echo $IP3
172.17.0.5
root@ubuntu-lab:/home/miao#
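As an optional cross-check (a sketch using the same -f template as above), the addresses of all test containers can be printed in one go:

# Print each container's name (with a leading slash) and IP address (optional check)
docker inspect -f '{{.Name}} {{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' http1 http2 nginx client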

2.2 nginx configuration update

# Generate the nginx.conf file
cat>nginx.conf <<EOF
user  nginx;
worker_processes  auto;

error_log  /var/log/nginx/error.log notice;
pid        /var/run/nginx.pid;

events {
    worker_connections  1024;
}

http {
   include       /etc/nginx/mime.types;
   default_type  application/octet-stream;

    upstream webservers {
        server $IP1;
        server $IP2;
    }

    server {
        listen 80;

        location / {
            proxy_pass http://webservers;
        }
    }
}
EOF

Update configuration:

# Update the Nginx configuration
docker cp nginx.conf nginx:/etc/nginx/nginx.conf
docker exec nginx nginx -s reload
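Optionally, the copied configuration can be sanity-checked before and after the reload; this is just a sketch using standard nginx options:

# Optional sanity checks
docker exec nginx nginx -t                                    # validate the new configuration
docker exec nginx nginx -T | grep -A 3 'upstream webservers'  # confirm the backend IPs were substituted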

III. Principles

3.1 Sending packets between containers

[Figure: sending packets between containers]

As shown in the figure above, under normal circumstances the load balancer writes a message into the send queue associated with socket 1; the packet then passes through the protocol stack, leaves through virtual NIC 1, is forwarded to virtual NIC 2, and passes through the protocol stack a second time, where the header information is stripped before the data is delivered to socket 2. Going through the protocol stack twice is actually unnecessary for traffic between containers on the same host: the stack can be bypassed, as the purple arrow in the figure shows, which improves network forwarding performance between containers on the same host machine.

3.2 How the program works

As I understand it, in simple terms: first, every newly established socket is saved into a map of type BPF_MAP_TYPE_SOCKHASH, as shown in the figure below. The key is a five-tuple and the value is the socket's file descriptor. The key is defined as follows:

struct sock_key
{
    __u32 sip;    // source IP
    __u32 dip;    // destination IP
    __u32 sport;  // source port
    __u32 dport;  // destination port
    __u32 family; // protocol family
};
[Figure: socket mapping diagram]

Once this data is in place, for every newly sent message we invert its five-tuple: the source IP and destination IP are swapped, and the source port and destination port are swapped, which yields the five-tuple of the peer's socket. A single helper, bpf_msg_redirect_hash, then does the rest: the message on the current socket is forwarded directly to the peer socket found in the socket map, magically bypassing the protocol stack.

long bpf_msg_redirect_hash(struct sk_msg_buff *msg, struct bpf_map *map, void *key, u64 flags)

Description
    This helper is used in programs implementing policies at the socket level. If the
    message msg is allowed to pass (i.e. if the verdict eBPF program returns SK_PASS),
    redirect it to the socket referenced by map (of type BPF_MAP_TYPE_SOCKHASH) using
    hash key. Both ingress and egress interfaces can be used for redirection. The
    BPF_F_INGRESS value in flags is used to make the distinction (ingress path is
    selected if the flag is present, egress path otherwise). This is the only flag
    supported for now.

Return
    SK_PASS on success, or SK_DROP on error.

3.3 eBPF program types used

Different eBPF program types have different sets of helper functions available. For convenience, two different program types are used here (a quick way to check that the running kernel supports both is sketched after this list):

  1. BPF_PROG_TYPE_SOCK_OPS — builds the mapping from five-tuples to sockets (triggered by socket operation events).

  2. BPF_PROG_TYPE_SK_MSG — intercepts messages sent on a socket and forwards them according to the socket map above (triggered by the sendmsg system call).
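A minimal sketch of such a check, assuming bpftool is installed (the exact output format varies by version):

# Ask bpftool whether the kernel exposes both program types (optional)
sudo bpftool feature probe kernel | grep -iE 'sock_ops|sk_msg'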

The hook points of the different eBPF program types are summarized in the figure below:

[Figure: eBPF program hook points]

IV. Code summary

4.1 Saving of socket mapping data

The header file sockops.h defines the key and the socket map:

#ifndef __SOCK_OPS_H__
#define __SOCK_OPS_H__

#include <linux/bpf.h>

/* Five-tuple used as the map key */
struct sock_key {
 __u32 sip;
 __u32 dip;
 __u32 sport;
 __u32 dport;
 __u32 family;
};

/* Socket map shared by the sockops and sk_msg programs:
 * key = five-tuple, value = socket (declared as an int) */
struct bpf_map_def SEC("maps") sock_ops_map = {
 .type = BPF_MAP_TYPE_SOCKHASH,
 .key_size = sizeof(struct sock_key),
 .value_size = sizeof(int),
 .max_entries = 65535,
 .map_flags = 0,
};

#endif    /* __SOCK_OPS_H__ */

The program that adds newly established sockets to the socket map, file name sockops.bpf.c:

#include <linux/bpf.h>
#include <bpf/bpf_endian.h>
#include <bpf/bpf_helpers.h>
#include <sys/socket.h>
#include "sockops.h"

SEC("sockops")
int bpf_sockmap(struct bpf_sock_ops *skops)
{
 /* Ignore packets that are not IPv4 */
 if (skops->family != AF_INET) {
  return BPF_OK;
 }

 /* Only update for newly established active or passive connections */
 if (skops->op != BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB
     && skops->op != BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB) {
  return BPF_OK;
 }

 struct sock_key key = {
  .dip = skops->remote_ip4,
  .sip = skops->local_ip4,
  /* convert to network byte order */
  .sport = bpf_htonl(skops->local_port),
  .dport = skops->remote_port,
  .family = skops->family,
 };

 bpf_sock_hash_update(skops, &sock_ops_map, &key, BPF_NOEXIST);
 return BPF_OK;
}

char LICENSE[] SEC("license") = "Dual BSD/GPL";

The key line is:

bpf_sock_hash_update(skops, &sock_ops_map, &key, BPF_NOEXIST);

4.2 Forwarding of socket data

This program uses the saved socket map together with the BPF helper function to forward messages. File name: sockredir.bpf.c

#include <linux/bpf.h>
#include <bpf/bpf_endian.h>
#include <bpf/bpf_helpers.h>
#include <sys/socket.h>
#include "sockops.h"



SEC("sk_msg")
int bpf_redir(struct sk_msg_md *msg)
{
    // Source and destination must be swapped, because we want the peer's socket
    struct sock_key key = {
        .sip = msg->remote_ip4,
        .dip = msg->local_ip4,
        .dport = bpf_htonl(msg->local_port),
        .sport = msg->remote_port,
        .family = msg->family,
    };
    // Forward the message on this socket to the peer socket in the map
    bpf_msg_redirect_hash(msg, &sock_ops_map, &key, BPF_F_INGRESS);
    return SK_PASS;
}

char LICENSE[] SEC("license") = "Dual BSD/GPL";

Compile commands:

clang -g -O2 -target bpf -D__TARGET_ARCH_x86 -I/usr/include/x86_64-linux-gnu -I. -c sockops.bpf.c -o sockops.bpf.o

clang -g -O2 -target bpf -D__TARGET_ARCH_x86 -I/usr/include/x86_64-linux-gnu -I. -c sockredir.bpf.c -o sockredir.bpf.o

These two commands compile the BPF programs into BPF bytecode object files.
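An optional way to confirm the objects were built as BPF ELF files (just a sanity check; the exact output depends on your toolchain, and llvm-objdump must be installed):

# Inspect the generated object files (optional)
file sockops.bpf.o sockredir.bpf.o   # should report: ELF 64-bit LSB relocatable, eBPF
llvm-objdump -h sockops.bpf.o        # list the ELF sections (sockops, maps, license, ...)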

4.3 Loading and attaching the eBPF programs

Previously I relied on BCC's Python frontend or the libbpf library; this time bpftool is used to load and attach the eBPF programs, which is exciting: I finally get to see how to keep an eBPF program running long-term. The programs I ran before stayed in the foreground and were unloaded as soon as the command stopped; this one is not.

Load the sockops program:

sudo bpftool prog load sockops.bpf.o /sys/fs/bpf/sockops type sockops pinmaps /sys/fs/bpf

This loads sockops.bpf.o into the kernel and pins it to the BPF filesystem. After the command returns, the eBPF program keeps running in the background, which can be seen:

root@ubuntu-lab:/home/miao/jike-ebpf/balance# bpftool prog show 
992: sock_ops  name bpf_sockmap  tag e37ef726a3a85a2e  gpl
        loaded_at 2022-06-12T10:43:09+0000  uid 0
        xlated 256B  jited 140B  memlock 4096B  map_ids 126
        btf_id 149
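Because the load command used pinmaps /sys/fs/bpf, the map should now be pinned there as well. A small optional check, using the paths from above:

# Confirm the pinned program and map (optional)
ls /sys/fs/bpf/
sudo bpftool map show pinned /sys/fs/bpf/sock_ops_map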

The above only loads the eBPF program; it is not yet bound to any kernel event. A sockops program can be attached to the cgroup subsystem, so that it takes effect for all processes running in that cgroup, which is a rather magical thing. Two steps:

1. Find where the cgroup filesystem is mounted on the current system:

root@ubuntu-lab:/home/miao/jike-ebpf/balance# mount | grep cgroup
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot)
2. Attach the program to that cgroup:

sudo bpftool cgroup attach /sys/fs/cgroup/ sock_ops pinned /sys/fs/bpf/sockops
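To confirm the attachment took effect, bpftool can list the programs attached to the cgroup (optional check):

# Programs currently attached to the root cgroup (optional)
sudo bpftool cgroup show /sys/fs/cgroup/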

Loading and attaching the forwarding program:

sudo bpftool prog load sockredir.bpf.o /sys/fs/bpf/sockredir type sk_msg map name sock_ops_map pinned /sys/fs/bpf/sock_ops_map
sudo bpftool prog attach pinned /sys/fs/bpf/sockredir msg_verdict pinned /sys/fs/bpf/sock_ops_map

These commands differ from the ones above in several ways: the program types are different (one is sockops, the other sk_msg), and the two programs communicate through sock_ops_map, which is shared by binding to the pinned map path.
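As an optional check using the pinned paths from above, both programs should list the same map id in their map_ids field:

# Both programs should reference the same map id (optional)
sudo bpftool prog show pinned /sys/fs/bpf/sockops
sudo bpftool prog show pinned /sys/fs/bpf/sockredir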

V. Performance comparison of the optimized load balancer

5.1 Before optimization

To verify whether there is any improvement, we first need to measure performance under the original load-balancing setup, without any modification. Install the test tools and run the test from the client side:

# Enter the client container, install curl, then access Nginx
docker exec -it client sh 

# Install and verify
/ # apk add curl wrk --update

/ # curl "http://172.17.0.5"

If that works as expected, install the performance testing tool wrk and run the test:

/ # apk add wrk --update
/ # wrk -c100 "http://172.17.0.5"

The output is as follows:

/ #  wrk -c100 "http://172.17.0.5"
Running 10s test @ http://172.17.0.5
  2 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    32.81ms   28.30ms 252.86ms   87.21%
    Req/Sec     1.75k   612.19     3.26k    67.35%
  34406 requests in 10.10s, 5.41MB read
Requests/sec:   3407.42
Transfer/sec:    549.05KB

The average latency is 32.81 ms, the average throughput is 3407.42 requests per second, and each of the two threads averages about 1.75k requests per second.
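For more stable comparisons, longer runs with a latency distribution can also be used on both setups; this is just an optional sketch using standard wrk options:

# Longer run with latency percentiles (optional)
wrk -t2 -c100 -d60s --latency "http://172.17.0.5"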

5.2 After optimization

With the eBPF programs loaded and attached, re-run the same test from the client:

docker exec -it client sh
/ # wrk -c100 "http://172.17.0.5"

The result is as follows:

Running 10s test @ http://172.17.0.5
  2 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    29.21ms   27.98ms 294.16ms   89.78%
    Req/Sec     2.06k   626.54     3.25k    68.23%
  40389 requests in 10.07s, 6.36MB read
Requests/sec:   4010.77
Transfer/sec:    646.27KB

In comparison, the average latency dropped from 32.81 ms to 29.21 ms, and the average number of requests per second rose from 3407 to 4010, an increase of about 17% ((4010 − 3407) / 3407 ≈ 17.7%), which is acceptable.

The curl output is also still normal:

/ # curl "http://172.17.0.5"
Hostname: http1

/ # curl "http://172.17.0.5"
Hostname: http2

While the test is running, we can dump the values in the map:

root@ubuntu-lab:/home/miao/jike-ebpf/hello# sudo bpftool map dump name sock_ops_map
key:
ac 11 00 05 ac 11 00 03  00 00 c7 60 00 00 00 50
02 00 00 00
value:
No space left on device
key:
ac 11 00 05 ac 11 00 04  00 00 00 50 00 00 e0 86
02 00 00 00
value:
No space left on device
key:
ac 11 00 05 ac 11 00 04  00 00 00 50 00 00 e0 88
02 00 00 00

Ignore the "No space left on device" message; it is an issue with the eBPF/bpftool version being used. The key bytes correspond to the five-tuple values, and the entries disappear once the test ends.
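As a quick manual decode of the first key above (an optional sketch using the shell's printf; the field order follows struct sock_key): ac 11 00 05 is the source IP, ac 11 00 03 the destination IP, the next eight bytes are the two ports, and 02 00 00 00 is the address family.

# Decode parts of the dumped key by hand (optional)
printf '%d.%d.%d.%d\n' 0xac 0x11 0x00 0x05   # source IP       -> 172.17.0.5
printf '%d.%d.%d.%d\n' 0xac 0x11 0x00 0x03   # destination IP  -> 172.17.0.3
printf '%d\n' 0x0050                         # port bytes 00 50 -> 80
printf '%d\n' 0x02                           # family AF_INET   -> 2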

VI. Cleanup

# Detach the sk_msg program first, while the pinned map path still exists
sudo bpftool prog detach pinned /sys/fs/bpf/sockredir msg_verdict pinned /sys/fs/bpf/sock_ops_map
sudo rm -f /sys/fs/bpf/sockredir

# Then detach the sockops program from the cgroup and remove its pins
sudo bpftool cgroup detach /sys/fs/cgroup/ sock_ops name bpf_sockmap
sudo rm -f /sys/fs/bpf/sockops /sys/fs/bpf/sock_ops_map

Detach the programs from their attach points, then delete the pinned files to remove the eBPF programs.
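A quick way to confirm that nothing was left behind (optional check):

# Verify cleanup (optional)
sudo bpftool cgroup show /sys/fs/cgroup/
sudo bpftool prog show
ls /sys/fs/bpf/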

Delete the Docker containers:

docker rm -f http1 http2 client nginx

Origin: blog.csdn.net/mseaspring/article/details/125252762