[Vehicle Performance Optimization] Run threads & processes on the desired CPU core

In in-vehicle Android development, a somewhat unusual requirement can come up: while the user is interacting with an application, it needs to run at full speed to keep the interaction smooth, but once it moves to the background it should run at a reduced pace and free up resources so that the system and the foreground application stay fluid. Based on this requirement, we need a framework that can dynamically adjust how efficiently an application executes.

As we all know, the most widely used automotive SoC at the moment, the Qualcomm Snapdragon 8155, uses a 1+3+4 eight-core design: one prime (large) core clocked at 2.96 GHz, three high-performance cores clocked at 2.42 GHz, and four low-power cores clocked at 1.8 GHz.

If we can run a program's process or threads on designated CPU cores, then in principle we can dynamically adjust the application's execution efficiency. To achieve this, we need a Linux function: sched_setaffinity.

The chip specifications above come from the Chinese internet; they differ noticeably from the actual frequencies of the mass-produced Snapdragon SA8155P units I have personally worked with.

Introduction to sched_setaffinity

Before introducing sched_setaffinity, we first need to introduce a concept: CPU affinity.

CPU affinity

CPU affinity refers to the tendency of a process or thread to execute on one CPU core, or a particular set of cores, rather than being scheduled randomly or migrating frequently between cores. CPU affinity can improve the performance of a process or thread because it takes advantage of CPU cache locality and reduces the overhead of cache invalidation and task migration.

CPU affinity comes in two forms: soft affinity and hard affinity.

  • Soft affinity is the default behavior of the Linux kernel's process scheduler: it tries to keep a process running on the core it last ran on, but this is not guaranteed, because load balancing across cores also has to be considered.
  • Hard affinity is an API the Linux kernel exposes to user space. It lets you explicitly specify which CPU cores a process or thread may run on, or bind it to a specific core.

On a Linux system, the following functions and macros can be used to set or get CPU affinity (a short query example follows the list):

  • sched_setaffinity(): Sets the CPU affinity mask of a process or thread, i.e. which cores it may run on.
  • sched_getaffinity(): Gets the CPU affinity mask of a process or thread, i.e. which cores it may currently run on.
  • CPU_ZERO(): Macro for operating on the CPU affinity mask; clears the mask so that it contains no cores.
  • CPU_SET(): Macro for operating on the CPU affinity mask; adds a given core to the mask.
  • CPU_CLR(): Macro for operating on the CPU affinity mask; removes a given core from the mask.
  • CPU_ISSET(): Macro for operating on the CPU affinity mask; checks whether a given core is in the mask.
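
To make the query side concrete, here is a minimal plain-Linux C++ sketch (separate from the Android project later in this article) that uses sched_getaffinity and CPU_ISSET to print which cores the calling thread is currently allowed to run on:

#define _GNU_SOURCE 1      // glibc needs this for sched_getaffinity; harmless elsewhere
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main() {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    // pid 0 means "the calling thread"
    if (sched_getaffinity(0, sizeof(mask), &mask) == -1) {
        perror("sched_getaffinity");
        return 1;
    }
    long cores = sysconf(_SC_NPROCESSORS_CONF);
    for (int i = 0; i < cores; ++i) {
        printf("core %d: %s\n", i, CPU_ISSET(i, &mask) ? "allowed" : "not allowed");
    }
    return 0;
}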

Usage

Step 1: Create a cpu_set_t variable, mask, to represent the CPU affinity mask.

Step 2: Use the CPU_ZERO and CPU_SET macros to clear the mask and then set it, so that only the bit corresponding to the desired core is 1 and all other bits are 0.

Step 3: Call sched_setaffinity to set the CPU affinity of the current thread. It returns 0 on success and -1 on failure.

    // CPU affinity mask
    cpu_set_t mask;
    // Clear the mask
    CPU_ZERO(&mask);
    // Add the target core to the mask
    CPU_SET(core, &mask);
    // Set the CPU affinity of the current thread (pid 0 = the calling thread)
    if (sched_setaffinity(0, sizeof(mask), &mask) == -1) {
        return -1;
    }

The sched_setaffinity function works by setting the CPU affinity mask of a process or thread, which specifies the CPU cores it may run on. The mask is a bitmap in which each bit corresponds to one CPU core: if a bit is 1, the process or thread may run on that core; otherwise it may not.

sched_setaffinity can therefore be used to improve the performance of a process or thread by avoiding frequent migration between cores.
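
Since the mask is a bitmap, it is not restricted to a single core: setting several bits confines a thread to a whole cluster. The sketch below assumes, purely for illustration, that cores 4-7 form the low-power cluster; the actual numbering depends on the SoC and must be checked on the target device:

#include <sched.h>

// Restrict the calling thread to an assumed little-core cluster (cores 4..7).
// The core numbering is an assumption -- verify it for your own SoC.
int bindToLittleCluster() {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    for (int core = 4; core <= 7; ++core) {
        CPU_SET(core, &mask);          // set one bit per allowed core
    }
    return sched_setaffinity(0, sizeof(mask), &mask);   // 0 = calling thread
}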

The prototype of the sched_setaffinity function is as follows:

int sched_setaffinity(pid_t pid, size_t cpusetsize, const cpu_set_t *mask);

pid: the ID of the process or thread to set. If it is 0, the calling thread is used;

cpusetsize: the size, in bytes, of the data pointed to by mask, usually sizeof(cpu_set_t);

mask: a pointer to a cpu_set_t. cpu_set_t is an opaque structure representing the CPU affinity mask; it must be manipulated through macros such as CPU_ZERO, CPU_SET, and CPU_CLR.

sched_setaffinity returns 0 on success and -1 on failure, setting errno to the corresponding error code (a small error-logging sketch follows the list). Possible error codes are:

  • EFAULT: mask pointer is invalid
  • EINVAL: There are no valid CPU cores in mask
  • EPERM: The caller does not have sufficient permissions
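
As an illustration of the failure path, the hedged sketch below logs errno from native code on Android when the call fails; the wrapper name is made up for this example and is not part of the demo project:

#include <sched.h>
#include <errno.h>
#include <string.h>
#include <android/log.h>

// Illustrative wrapper: logs errno (EFAULT / EINVAL / EPERM) when the call fails.
int setAffinityLogged(pid_t pid, const cpu_set_t *mask) {
    if (sched_setaffinity(pid, sizeof(cpu_set_t), mask) == -1) {
        __android_log_print(ANDROID_LOG_ERROR, "SOC_",
                            "sched_setaffinity failed: %s", strerror(errno));
        return -1;
    }
    return 0;
}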

Android implementation

In an Android application we need JNI to call sched_setaffinity. Create a default NDK project with Android Studio; the CMake script is as follows:

cmake_minimum_required(VERSION 3.22.1)

project("socaffinity")

add_library(${CMAKE_PROJECT_NAME} SHARED
        native-lib.cpp)

target_link_libraries(${CMAKE_PROJECT_NAME}
        android
        log)

The source code of native-lib.cpp is as follows:

#include <jni.h>
#include <unistd.h>
#include <pthread.h>
#include <sched.h>   // cpu_set_t, CPU_ZERO, CPU_SET, sched_setaffinity

// Get the number of CPU cores
int getCores() {
    int cores = (int) sysconf(_SC_NPROCESSORS_CONF);
    return cores;
}

extern "C" JNIEXPORT jint JNICALL Java_com_wj_socaffinity_ThreadAffinity_getCores(JNIEnv *env, jobject thiz){
    return getCores();
}
// Bind the current thread to the specified CPU core
extern "C" JNIEXPORT jint JNICALL Java_com_wj_socaffinity_ThreadAffinity_bindThreadToCore(JNIEnv *env, jobject thiz, jint core) {
    int num = getCores();
    if (core < 0 || core >= num) {
        return -1;
    }
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(core, &mask);
    if (sched_setaffinity(0, sizeof(mask), &mask) == -1) {
        return -1;
    }
    return 0;
}

// Bind a process to the specified CPU core
extern "C"
JNIEXPORT jint JNICALL
Java_com_wj_socaffinity_ThreadAffinity_bindPidToCore(JNIEnv *env, jobject thiz, jint pid,
                                                     jint core) {
    int num = getCores();
    if (core < 0 || core >= num) {
        return -1;
    }
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(core, &mask);
    if (sched_setaffinity(pid, sizeof(mask), &mask) == -1) {
        return -1;
    }
    return 0;
}
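
If you want to verify from Kotlin that a binding actually took effect, a helper like the one below could be appended to native-lib.cpp. This is not part of the original demo: the function name and the matching Kotlin declaration (for example `private external fun getThreadAffinityMask(): Long`) are illustrative. It reads back the calling thread's mask with sched_getaffinity and packs it into a bitmask, one bit per core:

// Optional helper (not in the original demo): return the calling thread's
// affinity mask as a bitmask so the Kotlin side can verify a binding.
extern "C" JNIEXPORT jlong JNICALL
Java_com_wj_socaffinity_ThreadAffinity_getThreadAffinityMask(JNIEnv *env, jobject thiz) {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof(mask), &mask) == -1) {
        return -1;
    }
    jlong bits = 0;
    int num = getCores();
    for (int i = 0; i < num && i < 64; ++i) {
        if (CPU_ISSET(i, &mask)) {
            bits |= (1LL << i);   // bit i set => thread may run on core i
        }
    }
    return bits;
}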

Then encapsulate the JNI calling method in an independent singleton, as shown below:

object ThreadAffinity {

    private external fun getCores(): Int

    private external fun bindThreadToCore(core: Int): Int

    private external fun bindPidToCore(pid: Int, core: Int): Int

    init {
        System.loadLibrary("socaffinity")
    }

    fun getCoresCount(): Int {
        return getCores()
    }

    fun threadToCore(core: Int, block: () -> Unit) {
        // Binds the calling thread; threads created inside block() inherit this affinity mask
        bindThreadToCore(core)
        block()
    }

    fun pidToCore(pid: Int, core: Int){
        bindPidToCore(pid, core)
    }

}

With the code above we have the simplest possible demo for modifying CPU affinity. Next, let's run some tests.

Run tests

Suppose there are two compute-intensive tasks, Task1 and Task2. Each computes the cumulative sum from 0 to 1000000000 and then prints the elapsed time to the log. The test code is as follows:

override fun onCreate(savedInstanceState: Bundle?) {
    super.onCreate(savedInstanceState)
    binding = ActivityMainBinding.inflate(layoutInflater)
    setContentView(binding.root)
    task1()
    task2()
}

// Time-consuming task 1
private fun task1() {
    Thread {
        var time = System.currentTimeMillis()
        var sum = 0L
        for (i in 0..1000000000L) {
            sum += i
        }
        time = System.currentTimeMillis() - time
        Log.e("SOC_", "start1: $time")
        runOnUiThread {
            binding.sampleText.text = time.toString()
        }
    }.start()
}

// Time-consuming task 2
private fun task2() {
    Thread {
        var time = System.currentTimeMillis()
        var sum = 0L
        for (i in 0..1000000000L) {
            sum += i
        }
        time = System.currentTimeMillis() - time
        Log.e("SOC_", "start2: $time")
        runOnUiThread {
            binding.sampleText.text = time.toString()
        }
    }.start()
}

Scenario 1: Run the time-consuming tasks directly, with no special handling

In this scenario we do nothing extra; thread scheduling is left to the Android kernel's defaults. The results are as follows:

The time-consuming tasks run on different CPUs, and peak CPU usage is about 207% out of 600%.

Task1 took 4037 ms and Task2 took 4785 ms.

Scenario 2: Binding the process to a small core

In this scenario, we use ThreadAffinity to bind the application process to CPU5 (on my device, CPU4 and CPU5 are both small cores).

class MyApp: Application() {

    override fun onCreate() {
        // Note: make sure you know which core indexes are the big cores and which are the small cores on your device.
        ThreadAffinity.pidToCore(android.os.Process.myPid(), 5)
        super.onCreate()
    }

}

The time-consuming tasks are now concentrated almost entirely on CPU5, and peak CPU usage is about 102% out of 600%.

Task1 took 18276 ms and Task2 took 18272 ms. Although this approach significantly lowers the CPU peak, task execution efficiency also drops sharply.

Scenario 3: Binding the process and the time-consuming tasks to large cores

In this scenario, the process is bound to CPU2, and Task1 and Task2 are bound to CPU0 and CPU1 respectively (on my device, CPU0-CPU3 are all large cores).

class MyApp: Application() {

    override fun onCreate() {
        // Note: make sure you know which core indexes are the big cores and which are the small cores on your device.
        ThreadAffinity.pidToCore(android.os.Process.myPid(), 2)
        super.onCreate()
    }
}
private fun start1() {
    // Bind the thread to core 0
    ThreadAffinity.threadToCore(0) {
        Thread {
            var time = System.currentTimeMillis()
            var sum = 0L
            for (i in 0..1000000000L) {
                sum += i
            }
            time = System.currentTimeMillis() - time
            Log.e("SOC_", "start1: $time")
            runOnUiThread {
                binding.sampleText.text = time.toString()
            }
        }.start()
    }
}

private fun start2() {
    // Bind the thread to core 1
    ThreadAffinity.threadToCore(1) {
        Thread {
            var time = System.currentTimeMillis()
            var sum = 0L
            for (i in 0..1000000000L) {
                sum += i
            }
            time = System.currentTimeMillis() - time
            Log.e("SOC_", "start2: $time")
            runOnUiThread {
                binding.sampleText.text = time.toString()
            }
        }.start()
    }
}

The time-consuming tasks are concentrated on CPU0 and CPU1, and peak CPU usage at this point is about 193% out of 600%.

Task1 took 3193 ms and Task2 took 3076 ms. Compared with the Android kernel's default scheduling, manually assigning cores achieves higher execution efficiency in this test.

Based on the above three situations, we can draw the following conclusions:

  1. Binding a process to small cores significantly reduces peak CPU consumption and keeps the application from eating into system resources, but it also slows the application's execution.
  2. Pinning threads to different cores can improve the application's execution efficiency while keeping the increase in peak CPU usage as small as possible.

Summary

This article introduced a way to dynamically adjust CPU affinity. It started as a personal attempt to optimize the performance of in-vehicle Android applications and is somewhat experimental; I expect its shortcomings will show up and be ironed out in future use, so for now treat it as a reference only.

Please pay attention to two points. First, if you need to use this in a project, coordinate with all application developers and keep the scope small, applying it only to a few highly performance-sensitive applications, to prevent large numbers of applications from competing for the same CPU core. Second, the approach described in this article is not suitable for mobile phones: phone vendors modify the kernel, so CPU scheduling strategies differ between brands, and the technique may simply fail on phones.

The above is all the content of this article. Thank you for reading. I hope it will be helpful to you.

Source code for this article: https://github.com/linxu-link/SocAffinity

