[Xinchuang] JED on Kunpeng (ARM) tuning steps and results | JD Cloud technical team

Background of the project

As the country vigorously promotes the Xinchuang (IT application innovation) initiative toward independent, controllable technology, foundational components are gradually being replaced with domestic ones. Starting with the database layer, the elastic database JED was therefore deployed on domestic Huawei Kunpeng machines (ARM architecture) for tuning, and its performance was compared against Intel (x86).

Physical machine configuration

Manufacturer  Architecture  CPU model                      Cores  Turbo    Memory frequency  Operating system
Huawei        ARM           Kunpeng 920-7262C              128C   none     3200 MT/s         openEuler
Intel         x86           Xeon Platinum 8338C (3rd gen)  128C   enabled  3200 MT/s         CentOS 8
Intel         x86           Xeon Platinum 8338C (3rd gen)  128C   enabled  3200 MT/s         CentOS 8

Database configuration

Deployment site: Langfang
Deployment method: container
Gateway configuration: 16C/12G, disk /export: 30G
Database schema: 1 cluster, one master and one slave
Database configuration: 8C/24G, disk /export: 512G

Tuning results

Before tuning: with 50% background load, JED on Kunpeng delivered 58% of Intel's read performance and 68% of its write performance.

After tuning: JED on Kunpeng reaches 99% of Intel's read performance and 121% of its write performance, and 113% at a 7:3 read/write mix. TP99 and response times are also better, with database CPU utilization at 100% under this load. The main scenarios and performance data recorded during tuning are as follows:

Specific tuning process

1. BIOS optimization

This requires changes in the machine room and a restart of the host machine.

Expectation: CPU prefetching affects database performance and should be turned off; Power Policy is set to Performance out of the box; SMMU does not need to be turned off.

2. Change the host pagesize from 4K to 64K.

Original configuration:

Page size affects database performance, so confirm whether the x86 and Kunpeng host systems use the same page size. Changing the host OS page size requires recompiling the kernel, and the procedure differs across operating systems; contact the operations team to make the change.

rpm -ivh http://storage.jd.local/k8s-node/kernel/5.10-jd_614-arm64/kernel-5.10.0-1.64kb.oe.jd_614.aarch64.rpm --force
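Whether the 64K kernel is actually in effect can be checked after the reboot; a minimal sketch (the values shown are the standard 4K/64K page sizes, not specific to this kernel build):

```shell
# Print the page size of the running kernel:
# 4096 (4K) before the change, 65536 (64K) once the 64K kernel is booted.
getconf PAGESIZE
```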

3. Optimization within the host OS

3.1 Turn off the firewall

The online machines already have it disabled, so no change is required.

systemctl status firewalld.service 
systemctl stop firewalld.service 
systemctl disable firewalld.service
systemctl status firewalld.service 


3.2 Network kernel parameter optimization (will become invalid after the host is restarted)

Read and write performance showed no significant improvement, so this change was not kept.

echo 1024 >/proc/sys/net/core/somaxconn
echo 16777216 >/proc/sys/net/core/rmem_max
echo 16777216 >/proc/sys/net/core/wmem_max
echo "4096 87380 16777216" >/proc/sys/net/ipv4/tcp_rmem 
echo "4096 65536 16777216" >/proc/sys/net/ipv4/tcp_wmem 
echo 360000 >/proc/sys/net/ipv4/tcp_max_syn_backlog
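As the heading notes, these writes to /proc are lost on reboot. One way to persist them, sketched here under the assumption that a sysctl.d drop-in is acceptable on the target OS (the file name is illustrative; it is staged in /tmp so the snippet is safe to run as-is):

```shell
# Write the same tunables as a sysctl drop-in so they survive a reboot.
cat > /tmp/99-jed-net.conf <<'EOF'
net.core.somaxconn = 1024
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.ipv4.tcp_max_syn_backlog = 360000
EOF
# Then, as root:
#   cp /tmp/99-jed-net.conf /etc/sysctl.d/ && sysctl -p /etc/sysctl.d/99-jed-net.conf
```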


3.3 IO parameter optimization

Performance did not improve, so this was not changed.

echo deadline > /sys/block/nvme0n1/queue/scheduler;
echo deadline > /sys/block/nvme1n1/queue/scheduler;
echo deadline > /sys/block/nvme2n1/queue/scheduler;
echo deadline > /sys/block/nvme3n1/queue/scheduler;
echo deadline > /sys/block/sda/queue/scheduler;
echo 2048 > /sys/block/nvme0n1/queue/nr_requests;
echo 2048 > /sys/block/nvme1n1/queue/nr_requests;
echo 2048 > /sys/block/nvme2n1/queue/nr_requests;
echo 2048 > /sys/block/nvme3n1/queue/nr_requests;
echo 2048 > /sys/block/sda/queue/nr_requests


3.4 Cache parameter optimization

Performance did not improve, so this was not changed.

echo 5 >/proc/sys/vm/dirty_ratio; 
echo 1 > /proc/sys/vm/swappiness

3.5 NIC interrupt core binding

The full solution was not rolled out, but the number of ethxx NIC queues can be changed.

ethtool -l ethxxx               # check the queue count of NIC ethxxx
ethtool -L ethxxx combined 8    # set the queue count to 8, consistent with x86

After this change, performance improved (all traffic enters through the eth NIC).

systemctl stop irqbalance
systemctl disable irqbalance

ethtool -L eth0 combined 1   # configure NIC eth0's queues in combined mode, merging all queues into one
# replace eth0 with the NIC device actually in use; this parameter affects performance

# view NIC queue information
ethtool -l ethxxx

netdevice=eth0
cores=31
# check which NUMA node the NIC belongs to
cat /sys/class/net/${netdevice}/device/numa_node
# look up the NIC's interrupt numbers
cat /proc/interrupts | grep $(ethtool -i $netdevice | grep -i bus-info | awk -F ': ' '{print $2}') | awk -F ':' '{print $1}'
# bind the NIC interrupts to the core(s)
for i in `cat /proc/interrupts | grep $(ethtool -i $netdevice | grep -i bus-info | awk -F ': ' '{print $2}')| awk -F ':' '{print $1}'`;do echo ${cores} > /proc/irq/$i/smp_affinity_list;done
netdevice=eth0
# check the result after binding
for i in `cat /proc/interrupts | grep $(ethtool -i $netdevice | grep -i bus-info | awk -F ': ' '{print $2}')| awk -F ':' '{print $1}'`;do cat /proc/irq/$i/smp_affinity_list;done

netdevice=eth1
cores=31
# bind the NIC interrupts to the core(s)
for i in `cat /proc/interrupts | grep $(ethtool -i $netdevice | grep -i bus-info | awk -F ': ' '{print $2}')| awk -F ':' '{print $1}'`;do echo ${cores} > /proc/irq/$i/smp_affinity_list;done

# check the result after binding
for i in `cat /proc/interrupts | grep $(ethtool -i $netdevice | grep -i bus-info | awk -F ': ' '{print $2}')| awk -F ':' '{print $1}'`;do cat /proc/irq/$i/smp_affinity_list;done
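The pipeline for finding a NIC's IRQ numbers is repeated several times above; it can be wrapped in a small helper function with the same effect (a hedged sketch: it simply prints nothing if the device or ethtool is unavailable):

```shell
# List the IRQ numbers belonging to a NIC, by matching its PCI bus-info
# in /proc/interrupts (same logic as the one-liners above).
nic_irqs() {
    local dev="$1" bus
    bus=$(ethtool -i "$dev" 2>/dev/null | awk -F': ' '/bus-info/ {print $2}')
    [ -n "$bus" ] || return 0   # device missing or ethtool unavailable
    grep "$bus" /proc/interrupts | awk -F':' '{gsub(/ /, "", $1); print $1}'
}

# Example usage: bind every IRQ of eth0 to core 31 (requires root).
# for i in $(nic_irqs eth0); do echo 31 > /proc/irq/$i/smp_affinity_list; done
```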


4. Bind the business container to NUMA (upgrade the scheduler and deploy the co-location agent)

Before platform support is deployed, you can test by modifying the container's cgroup configuration to bind cores, isolating its CPU and memory from the NUMA node where the background load runs.

Example operations are as follows

# enter the business container's cgroup configuration directory
cd /sys/fs/cgroup/cpuset/kubepods/burstable/poded***********/7b40a68a************
# stop docker; a restart would reset the cgroup configuration
systemctl stop docker
# during the stress test, watch whether the configuration stays in effect; if the docker
# service keeps restarting, a small script can keep stopping it or keep rewriting the cgroup config
echo 16-23 > cpuset.cpus
echo 0 > cpuset.mems
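Whether the binding is in effect can be checked from the kernel's view of the process; a sketch using the current shell as a stand-in (in practice, point it at the PID of mysqld or the gateway process inside the container):

```shell
# Cpus_allowed_list / Mems_allowed_list reflect the effective cpuset binding
# for a process; replace "self" with a concrete PID to inspect another process.
grep -E 'Cpus_allowed_list|Mems_allowed_list' /proc/self/status
```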

5. mysql CRC32: switch from a software to a hardware implementation on ARM

Compiled on the database side; can be deployed uniformly.

cd /mysql-5.7.26
git apply crc32-mysql5.7.26.patch
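The hardware path only helps when the CPU exposes the ARMv8 CRC32 instructions; a quick sanity check before deploying the patched mysqld (on aarch64 the feature appears in /proc/cpuinfo; on other architectures the fallback message prints):

```shell
# Check whether the CPU advertises hardware crc32 support.
if grep -qw crc32 /proc/cpuinfo; then
    echo "hardware crc32 available"
else
    echo "no hardware crc32 feature reported"
fi
```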

6. mysqld feedback-directed compilation (PGO) optimization

Compiled on the database side; can be deployed uniformly.

Confirm that openEuler gcc 10.3.1 is used.

https://gitee.com/openeuler/A-FOT/wikis/README

Environment preparation (executed in test environment and compilation environment)

git clone https://gitee.com/openeuler/A-FOT.git
yum install -y A-FOT   # only supported on openEuler 22.03 LTS
yum -y install perf


Modify the configuration file a-fot.ini (executed in the test environment and compilation environment)

cd /A-FOT
vim ./a-fot.ini   # edit as follows
# use absolute paths for files and directories
# optimization mode (AutoFDO, AutoPrefetch, AutoBOLT, Auto_kernel_PGO); AutoBOLT is chosen here
opt_mode=AutoBOLT
# script working directory (used to build the application and store profiles/logs;
# intermediate files can be large, so make sure 150G of space is available)
work_path=/pgo-opt
# path of the application run script (an empty placeholder file is enough;
# grant execute permission with chmod 777 /root/run.sh)
run_script=/root/run.sh
# GCC path (the parent directory of bin and lib; change to the gcc you intend to use)
gcc_path=/usr
# AutoFDO, AutoPrefetch, AutoBOLT
# fill in this part for the three application-level optimization modes
# application process name
application_name=mysqld
# executable file after binary installation
bin_file=/usr/local/mysql-pgo/bin/mysqld
# path of the application build script (put the commands for building mysql from source
# in the file and grant execute permission with chmod 777 /root/build.sh)
build_script=/root/build.sh
# maximum binary startup time (seconds)
max_waiting_time=700
# perf sampling duration (seconds); sampling time set to 10 min
perf_time=600
# check whether optimization succeeded (1=enabled, 0=disabled)
check_success=1
# build mode (Bear, Wrapper)
build_mode=Wrapper
# auto_kernel_PGO
# fill in this part for the kernel optimization mode
# kernel PGO mode (arc=enable arc profile only, all=enable full PGO optimization)
pgo_mode=all
# execution phase (1=build the instrumented kernel, 2=build the optimized kernel)
pgo_phase=1
# kernel source directory (downloaded automatically if not specified)
kernel_src=/opt/kernel
# local name of the kernel build (a "-pgoing" or "-pgoed" suffix is added per phase)
kernel_name=kernel
# kernel build options (make sure the options are valid and will not break the kernel build)
#CONFIG_...=y
# timestamped directory from before the reboot (keeps logs of one run together)
last_time=
# Makefile path of the kernel source (for scenarios where the kernel is not built automatically)
makefile=
# kernel config file path (for scenarios where the kernel is not built automatically)
kernel_config=
# directory of the raw kernel profiles (for scenarios where the kernel is not built automatically)
data_dir=


/root/build.sh (reference content is as follows)

cd /mysql-8.0.25
rm -rf build
mkdir build
cd build
cmake .. -DBUILD_CONFIG=mysql_release -DCMAKE_INSTALL_PREFIX=/usr/local/mysql-pgo -DMYSQL_DATADIR=/data/mysql/data -DWITH_BOOST=/mysql-8.0.25/boost/boost_1_73_0
make -j 96
make -j 96 install


Feedback compilation

1. Compile for the first time

This step can be skipped; it is enough to put A-FOT into the docker container that runs the mysqld process.

2. Data collection (executed in test environment)

Modify the /A-FOT/a-fot file: comment out the function calls at lines 409 and 414-416.

Start the mysqld process and, at the same time, start the load generator so that mysqld begins processing traffic.

Run ./a-fot; progress output appears on the screen.

On success, a profile.gcov file can be found in the corresponding /pgo-opt directory.

Opening it shows the collected profile content.

3. Manually merge into Profile for compilation

cmake .. -DCMAKE_INSTALL_PREFIX=/usr/local/mysql-5.7.26-pgo/ -DMYSQL_DATADIR=/data/mysql/data -DSYSCONFDIR=/usr/local/mysql-5.7.26-pgo/etc -DWITH_INNOBASE_STORAGE_ENGINE=1 -DWITH_PERFSCHEMA_STORAGE_ENGINE=1 -DWITH_BLACKHOLE_STORAGE_ENGINE=1 -DDEFAULT_CHARSET=utf8 -DDEFAULT_COLLATION=utf8_general_ci -DMYSQL_UNIX_ADDR=/data/mysql/tmp/mysql.sock -DENABLED_LOCAL_INFILE=ON -DENABLED_PROFILING=ON -DWITH_DEBUG=0 -DMYSQL_TCP_PORT=3358 -DCMAKE_EXE_LINKER_FLAGS="-ljemalloc" -Wno-dev -DWITH_BOOST=/mysql-5.7.26/boost/boost_1_59_0 -DCMAKE_CXX_FLAGS="-fbolt-use=PATH_OF_PROFILE -Wl,-q" -DCMAKE_CXX_LINK_FLAGS="-Wl,-q"

Change PATH_OF_PROFILE to the original path where the profile is stored.

7. Go version upgrade and feedback compilation

Only database-related agents written in Go need this step.

7.1 Upgrade golang to 1.21

7.2 Go PGO optimization

1. Import pprof: add import _ "net/http/pprof" to the program's code

2. Start the program and perform a stress test

  3. After the load starts, collect a profiling file as follows; seconds is the collection time in seconds:

     curl -o cpu.pprof http://localhost:8080/debug/pprof/profile?seconds=30

  4. Rebuild the binary from the generated cpu.pprof, using the -pgo build option:

     mv cpu.pprof default.pgo
     go build -pgo=auto

Regarding performance improvement, the official data given by Golang is:

In Go 1.21, benchmarks on a representative set of Go programs show that building with PGO can improve performance by approximately 2-7%.

Author: JD Retail Zhu Chen

Source: JD Cloud Developer Community. Please indicate the source when reprinting.
