Take a look at how Douyin optimizes threads

Background

Recently I have been studying several large-scale apps. While looking into one of them, I found that it had done some optimization work on threads, and that the problems it solved were the same ones I had run into while working on jank in production. I therefore analyzed its implementation in depth. This article summarizes what I learned from the relevant source code, together with my own understanding.

Problems

Jank when creating threads

In Java, the real kernel thread is created only when start() runs. For the details of nativeCreate, see my earlier article, Android virtual machine thread startup process analysis. Assuming you are familiar with that process, you know that start() performs a series of low-level operations, including allocating stack memory and creating the kernel thread, and that these can occasionally take a long time. One possible cause: if, as I understand it, Linux creates threads through a dedicated queue in the kernel, a long queue or a kernel scheduling problem could make creation slow. I have not been able to reproduce this offline, so this is only a guess, but I did collect production samples in which thread creation blocked the calling thread.

So is it enough to avoid creating threads directly on the main thread and to schedule tasks through a thread pool instead? Consider the implementation of ThreadPoolExecutor.execute(Runnable command).

The documentation shows that execute() creates a new (Java) thread in many cases, and tracing its implementation shows that once the Java Thread object is created, its start() is called immediately on the current thread.
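The point can be checked with a small experiment. The sketch below is only an illustration, with hypothetical class and method names: it installs a ThreadFactory that records which thread the pool was running on when it asked for a new worker.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

class ExecuteCreatesThreadDemo {

    /** Returns the name of the thread on which the pool asked its factory for a worker. */
    static String creatorThreadName() {
        AtomicReference<String> creator = new AtomicReference<>();
        ThreadFactory recordingFactory = r -> {
            // The pool calls newThread() from inside execute(), so this code
            // runs on whichever thread called execute().
            creator.set(Thread.currentThread().getName());
            return new Thread(r, "pool-worker");
        };
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<>(), recordingFactory);
        try {
            // The pool has no workers yet, so this call has to create one.
            pool.execute(() -> { });
        } finally {
            pool.shutdown();
        }
        return creator.get();
    }

    public static void main(String[] args) {
        System.out.println("caller:            " + Thread.currentThread().getName());
        System.out.println("worker created on: " + creatorThreadName());
    }
}
```

Both lines print the same thread name, so a pool that is still below its core size gives the main thread no protection from the cost of start().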

Production data confirms this: we collected cases where the main thread scheduled a task through a thread pool and still froze.

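Too many threads

In the ART virtual machine, every new thread needs its own Java stack. When the Java layer does not set a stack size explicitly, the native layer assigns the default stack size in the FixStackSize function.

From this implementation, each thread occupies at least 1 MB of virtual memory. On a 32-bit system, a process has only 3 GB of user-space virtual memory available, so if an app has too many threads and the process runs short of virtual address space, creating a thread can fail with an OOM error.

Another problem is that some vendors limit the number of threads an app may create more strictly than stock Android does. For example, some Huawei models cap each process at 500 threads, so even on a 64-bit device, leaving the thread count unchecked can lead to OOM errors caused by too many threads.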
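To get a feel for the numbers, here is a rough back-of-the-envelope calculation. It assumes every thread reserves exactly the 1 MB floor mentioned above (real reservations are larger) and uses hypothetical helper names.

```java
class StackAddressBudget {
    static final long MB = 1024L * 1024L;

    /** Address space reserved for stacks if every thread reserves the same amount. */
    static long reservedBytes(int threadCount, long stackBytesPerThread) {
        return threadCount * stackBytesPerThread;
    }

    /** Share of the user address space taken by those stacks, as a percentage. */
    static double percentOf(long reservedBytes, long userSpaceBytes) {
        return 100.0 * reservedBytes / userSpaceBytes;
    }

    public static void main(String[] args) {
        long reserved = reservedBytes(500, MB);              // 500 threads at the 1 MB floor
        double share = percentOf(reserved, 3L * 1024 * MB);  // 3 GB of user space
        System.out.printf("%d MB reserved, %.1f%% of user space%n", reserved / MB, share);
    }
}
```

Under these assumptions, 500 threads reserve about 500 MB, roughly 16% of a 32-bit process's user address space, before the heap, mapped files, and native libraries are counted.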

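Optimization approach

Thread convergence

First, an Android app uses threads in the following ways:

Directly through the Thread class
Through ThreadPoolExecutor
Through Timer
Through AsyncTask
Through HandlerThread

The general idea of thread convergence is to create our own implementations of the classes above ahead of time, make our changes in those implementations, and then use compile-time bytecode modification to replace every use of the original classes in the app with our implementations.

These thread-related classes are generally used in one of two ways:

Creating an instance of the original class directly with new
Extending the original class, then creating instances of the subclass with new

The replacement therefore covers:

Changing the class hierarchy, for example making every class that extends Thread extend our PThread instead
Changing the places that instantiate the classes above directly, for example replacing each new ThreadPoolExecutor(..) call with our PThreadPoolExecutor

Once bytecode modification has replaced every thread usage in the code with our implementations, those implementations can carry out the actual convergence.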

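Converging the Thread class

In the Java virtual machine, each Java Thread corresponds to one kernel thread, and that kernel thread is created only when start() is called. We can therefore change the implementation of start() so that it dispatches the task to a designated thread pool instead. Sample code: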

class ThreadProxy : Thread() {
    // Do not create a kernel thread for this object; run its body on the shared pool instead.
    override fun start() {
        SuperThreadPoolExecutor.execute({
            this@ThreadProxy.run()
        }, priority = priority)
    }
}
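The same idea can be tried outside Android. The following is a minimal Java sketch, not the app's actual implementation: ProxyThreadSketch and SHARED_POOL are hypothetical names, and a plain cached pool stands in for SuperThreadPoolExecutor.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

class ProxyThreadSketch extends Thread {

    // Hypothetical shared pool standing in for SuperThreadPoolExecutor.
    static final ExecutorService SHARED_POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "shared-worker");
        t.setDaemon(true);
        return t;
    });

    ProxyThreadSketch(Runnable target) {
        super(target);
    }

    @Override
    public void start() {
        // No kernel thread is created for this object; a pooled worker
        // executes its run() body instead.
        SHARED_POOL.execute(this::run);
    }

    /** Starts one proxy thread and reports which thread actually ran its body. */
    static String runOnceAndReportWorker() {
        AtomicReference<String> worker = new AtomicReference<>();
        CountDownLatch done = new CountDownLatch(1);
        new ProxyThreadSketch(() -> {
            worker.set(Thread.currentThread().getName());
            done.countDown();
        }).start();
        try {
            done.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return worker.get();
    }

    public static void main(String[] args) {
        System.out.println("body ran on: " + runOnceAndReportWorker());
    }
}
```

One consequence of this design is that the proxy object itself is never started, so callers that rely on join(), isAlive(), or interrupt() on it will not see the behavior they expect. A production implementation has to account for that.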

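Converging thread pools

Each ThreadPoolExecutor instance keeps its own independent set of cached threads, and the caches of different instances do not interact. A large app may contain a great many thread pools, and together they add up to a minimum thread count that cannot be ignored.

Because the pools are independent, threads are also created and reclaimed independently, and nothing can schedule work from the perspective of the whole app. For example, pool A may be releasing a thread because it is idle while pool B is creating one because it has too few workers. Merging all the pools into one large shared pool avoids this kind of situation.

The core idea is:

First, make every class that directly extends ThreadPoolExecutor extend ThreadPoolExecutorProxy instead, and replace every new ThreadPoolExecutor(..) in the code with new ThreadPoolExecutorProxy(...).
ThreadPoolExecutorProxy holds a large pool instance, BigThreadPool, that is shared by every thread pool in the app, so its core thread count can be tuned to the app's actual situation. For example, if your app currently averages 200 threads, you could set BigThreadPool's core thread count to 150 and then observe how it schedules.
In the addWorker function of ThreadPoolExecutorProxy, dispatch the task to BigThreadPool for execution.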
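The steps above can be sketched in plain Java. This is a simplified illustration rather than the app's code: in the standard JDK class, addWorker is private and cannot be overridden, so the sketch intercepts execute() instead. The class name, the pool size of 4, and the thread names are all assumptions.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class SharedPoolProxyExecutor extends ThreadPoolExecutor {
    private static final AtomicInteger WORKER_ID = new AtomicInteger();

    // Hypothetical app-wide pool playing the role of BigThreadPool.
    static final ThreadPoolExecutor BIG_POOL = new ThreadPoolExecutor(
            4, 4, 30L, TimeUnit.SECONDS, new LinkedBlockingQueue<>(),
            r -> {
                Thread t = new Thread(r, "big-pool-" + WORKER_ID.incrementAndGet());
                t.setDaemon(true);
                return t;
            });

    SharedPoolProxyExecutor(int core, int max, long keepAlive, TimeUnit unit,
                            BlockingQueue<Runnable> queue) {
        super(core, max, keepAlive, unit, queue);
    }

    @Override
    public void execute(Runnable command) {
        // This instance never starts a worker of its own. submit() also ends
        // up here, because AbstractExecutorService routes it through execute().
        BIG_POOL.execute(command);
    }

    /** Runs one task through each of two proxies; returns how many ran on the shared pool. */
    static int tasksRunOnSharedPool() {
        SharedPoolProxyExecutor a = new SharedPoolProxyExecutor(
                2, 4, 1L, TimeUnit.SECONDS, new LinkedBlockingQueue<>());
        SharedPoolProxyExecutor b = new SharedPoolProxyExecutor(
                8, 8, 1L, TimeUnit.SECONDS, new LinkedBlockingQueue<>());
        AtomicInteger onShared = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(2);
        Runnable probe = () -> {
            if (Thread.currentThread().getName().startsWith("big-pool-")) {
                onShared.incrementAndGet();
            }
            done.countDown();
        };
        a.execute(probe);
        b.execute(probe);
        try {
            done.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        // Neither proxy should have created a worker thread of its own.
        return (a.getPoolSize() == 0 && b.getPoolSize() == 0) ? onShared.get() : -1;
    }

    public static void main(String[] args) {
        System.out.println("tasks on shared pool: " + tasksRunOnSharedPool());
    }
}
```

The sketch ignores everything a real proxy would still need to handle, such as shutdown(), awaitTermination(), per-pool rejection policies, and pools whose callers depend on a specific thread count or ordering.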

Converging AsyncTask

AsyncTask can be handled the same way: in the execute1 function, dispatch to a single shared thread pool.


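import android.os.AsyncTask;
import java.util.concurrent.Executor;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.TimeUnit;

// PThreadPoolExecutor and DefaultThreadFactory are project classes that are not shown here.
public abstract class AsyncTaskProxy<Params, Progress, Result> extends AsyncTask<Params, Progress, Result> {

    // Single pool shared by every AsyncTask in the app.
    private static final Executor THREAD_POOL_EXECUTOR = new PThreadPoolExecutor(0, 20,
            3, TimeUnit.MILLISECONDS,
            new SynchronousQueue<>(), new DefaultThreadFactory("PThreadAsyncTask"));

    public static void execute(Runnable runnable) {
        THREAD_POOL_EXECUTOR.execute(runnable);
    }

    /**
     * TODO: use bytecode instrumentation to replace every execute() call with execute1().
     * AsyncTask.execute(Params...) is final, so it cannot simply be overridden.
     * @param params  The parameters of the task.
     * @return This instance of AsyncTask.
     */
    public AsyncTask<Params, Progress, Result> execute1(Params... params) {
        return executeOnExecutor(THREAD_POOL_EXECUTOR, params);
    }
}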

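The Timer class

Timer is not used in many places in a typical project, and it usually has fairly strict requirements on the accuracy of task intervals. If timers were converged onto a thread pool, a time-consuming task from one Timer could affect the original business logic, so Timer is not converged for now.

Jank optimization

To address the blocking that can occur when a thread is created on the main thread, check the current thread: if it is the main thread, hand the work to a thread dedicated to creating threads.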

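    // Handler bound to a background thread whose only job is to submit tasks, so any
    // thread creation triggered by execute() happens off the main thread.
    private val asyncExecuteHandler by lazy {
        val worker = HandlerThread("asyncExecuteWorker")
        worker.start()
        Handler(worker.looper)
    }

    fun execute(runnable: Runnable, priority: Int) {
        val onMainThread = Looper.getMainLooper().thread == Thread.currentThread()
        if (onMainThread && asyncExecute) {
            // Execute asynchronously: hop to the worker before touching the executor.
            asyncExecuteHandler.post {
                mExecutor.execute(runnable, priority)
            }
        } else {
            mExecutor.execute(runnable, priority)
        }
    }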

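Thread stack space optimization on 32-bit systems

From the problem analysis we already know that each thread needs at least 1 MB of virtual memory and that a 32-bit app has limited virtual address space. To squeeze some virtual memory out of threads, you can refer to a WeChat solution that uses a PLT hook to change the stack size used when a thread is created.

Another technical article, juejin.cn/post/720930…, describes a different trick: pass a negative value at the Java layer to achieve the same effect.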
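As a rough illustration of the Java-layer trick, the sketch below passes a negative stackSize to the standard Thread(ThreadGroup, Runnable, String, long) constructor. The value of -512 KB and the helper names are assumptions. The intended effect relies on the runtime adding a fixed amount to the requested size, as described for ART above, and other runtimes may treat a non-positive value as "use the default".

```java
import java.util.concurrent.atomic.AtomicBoolean;

class SmallStackThreads {

    // Assumption: the runtime adds about 1 MB to whatever is requested, so asking
    // for -512 KB should leave roughly 512 KB. This is not guaranteed anywhere.
    static final long REQUESTED_STACK_BYTES = -512L * 1024L;

    /** Creates (but does not start) a thread that requests the reduced stack size. */
    static Thread create(Runnable task, String name) {
        return new Thread(null, task, name, REQUESTED_STACK_BYTES);
    }

    /** Starts one such thread and reports whether its body ran to completion. */
    static boolean runsToCompletion() {
        AtomicBoolean ran = new AtomicBoolean(false);
        Thread t = create(() -> ran.set(true), "small-stack-worker");
        t.start();
        try {
            t.join(5000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return ran.get();
    }

    public static void main(String[] args) {
        System.out.println("thread ran: " + runsToCompletion());
    }
}
```

A smaller stack leaves less room for deep recursion, so a change like this needs to be validated against StackOverflowError rates before it is rolled out widely.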

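OOM? There is still a chance to recover!

For OOM errors thrown during thread creation because of insufficient memory or a thread-count limit, some fallback handling is possible. For example, the task can be dispatched to a pre-created thread pool and queued there. Because this pool's core and maximum thread counts are equal, it creates no further threads once its workers exist, so the OOM cannot occur there.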
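A minimal sketch of such a fallback, assuming a hypothetical OomFallbackExecutor wrapper around whatever executor the app normally uses:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executor;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

class OomFallbackExecutor implements Executor {
    private final Executor primary;
    private final ThreadPoolExecutor fallback;

    OomFallbackExecutor(Executor primary, int fallbackThreads) {
        this.primary = primary;
        // Core size equals maximum size, so the pool never grows later.
        this.fallback = new ThreadPoolExecutor(
                fallbackThreads, fallbackThreads, 0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<>(),
                r -> {
                    Thread t = new Thread(r, "oom-fallback");
                    t.setDaemon(true);
                    return t;
                });
        // Start every worker now, while thread creation still succeeds.
        this.fallback.prestartAllCoreThreads();
    }

    @Override
    public void execute(Runnable task) {
        try {
            primary.execute(task);
        } catch (OutOfMemoryError e) {
            // Thread creation failed: queue the task on workers that already exist.
            fallback.execute(task);
        }
    }

    /** Submits one task through a primary executor that always fails; reports who ran it. */
    static String workerAfterSimulatedOom() {
        Executor alwaysOom = r -> {
            throw new OutOfMemoryError("pthread_create failed (simulated)");
        };
        OomFallbackExecutor executor = new OomFallbackExecutor(alwaysOom, 2);
        AtomicReference<String> worker = new AtomicReference<>();
        CountDownLatch done = new CountDownLatch(1);
        executor.execute(() -> {
            worker.set(Thread.currentThread().getName());
            done.countDown();
        });
        try {
            done.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return worker.get();
    }

    public static void main(String[] args) {
        System.out.println("task ran on: " + workerAfterSimulatedOom());
    }
}
```

The sketch treats every OutOfMemoryError from the primary executor as a thread-creation failure and ignores edge cases, such as a task that the primary pool had already queued before the error surfaced. A real implementation would want to distinguish those cases.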

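In addition, an app may contain a great many thread pools, each configured with some core threads. By default, core threads are never reclaimed, even if they stay idle indefinitely. This behavior is controlled by the pool's allowCoreThreadTimeOut flag.

The flag can be changed with allowCoreThreadTimeOut(value).

The implementation shows that when value differs from the current setting and value is true, interruptIdleWorkers() is called, which calls interrupt on each idle Worker to interrupt the corresponding thread.

So when thread creation hits an OOM, you can try calling allowCoreThreadTimeOut on the thread pools to trigger interruptIdleWorkers and reclaim idle threads.

We can therefore keep every thread pool in a queue of weak references once it is created, and when a Thread.start() or a pool's execute() hits an OOM, reclaim threads this way.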
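A possible shape for that registry is sketched below. PoolRegistry and its methods are hypothetical names, and the 50 ms keep-alive in the demo is chosen only to make the effect quick to observe.

```java
import java.lang.ref.WeakReference;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class PoolRegistry {
    // Weak references, so registering a pool does not keep it alive.
    private static final List<WeakReference<ThreadPoolExecutor>> POOLS =
            new CopyOnWriteArrayList<>();

    static void register(ThreadPoolExecutor pool) {
        POOLS.add(new WeakReference<>(pool));
    }

    /** Asks every pool that is still reachable to let its idle core threads exit. */
    static void trimIdleCoreThreads() {
        for (WeakReference<ThreadPoolExecutor> ref : POOLS) {
            ThreadPoolExecutor pool = ref.get();
            if (pool == null) {
                POOLS.remove(ref);   // the pool was garbage collected
                continue;
            }
            if (pool.getKeepAliveTime(TimeUnit.NANOSECONDS) <= 0) {
                continue;            // allowCoreThreadTimeOut(true) rejects a zero keep-alive
            }
            pool.allowCoreThreadTimeOut(true);
        }
    }

    /** Demo: two idle core threads should be gone shortly after trimming. */
    static int poolSizeAfterTrim() {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                2, 2, 50L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
        pool.prestartAllCoreThreads();
        register(pool);
        trimIdleCoreThreads();
        long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(5);
        try {
            while (pool.getPoolSize() > 0 && System.nanoTime() < deadline) {
                Thread.sleep(20);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        int size = pool.getPoolSize();
        pool.shutdown();
        return size;
    }

    public static void main(String[] args) {
        System.out.println("pool size after trim: " + poolSizeAfterTrim());
    }
}
```

Calling trimIdleCoreThreads() from the OOM handler frees threads only after they wake up and time out, so the failed task is better retried after a short delay or sent to a fallback pool in the meantime.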

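Thread attribution

Thread attribution means being able to tell, directly from a thread's name, which business code created it when analyzing a problem. Plenty of articles already cover this kind of optimization. The basic implementation uses ASM to modify the call site so that the current class name, or the class name plus the method name, is set as the fallback thread name. I won't go into detail here; if you are interested, see the implementation in booster.

Bytecode modification tool

The optimizations above all depend on one essential operation, bytecode modification. The requirements can be summarized as follows:

Replace a class's superclass, for example make every class that extends java.lang.Thread extend our own ProxyThread instead
Replace the instance type of a new instruction, for example replace every new Thread(..) call in the code with new ProxyThread(...)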
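The two rewrites can be pictured at source level as follows. This is only an illustration of the effect: the real change is made on compiled bytecode, and the ProxyThread below is a hypothetical stand-in with no convergence logic.

```java
// Hypothetical stand-in for the replacement class.
class ProxyThread extends Thread {
    ProxyThread() {
        super();
    }

    ProxyThread(Runnable target) {
        super(target);
    }
}

// What the app's code looks like before the build-time rewrite.
class BeforeRewrite {
    static Thread make(Runnable task) {
        return new Thread(task);            // "new" of the original type
    }

    static class Worker extends Thread { }  // original superclass
}

// What the same code is equivalent to after the rewrite.
class AfterRewrite {
    static Thread make(Runnable task) {
        return new ProxyThread(task);       // instance type replaced
    }

    static class Worker extends ProxyThread { }  // superclass replaced
}

class RewriteDemo {
    public static void main(String[] args) {
        System.out.println(BeforeRewrite.make(() -> { }).getClass().getSimpleName());
        System.out.println(AfterRewrite.make(() -> { }).getClass().getSimpleName());
        System.out.println(AfterRewrite.Worker.class.getSuperclass().getSimpleName());
    }
}
```

Because ProxyThread is itself a Thread, callers that hold the result as a Thread keep compiling and running unchanged, which is what makes this kind of blanket replacement feasible.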

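Changes this generic should not need a separate plugin each time a similar requirement comes up, so I built this capability into the LanceX plugin. The following annotations make the changes above easy to implement.

Replacing new instructions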

@Weaver
@Group("threadOptimize")
public class ThreadOptimize {

    @ReplaceNewInvoke(beforeType = "java.lang.Thread",
    afterType = "com.knightboost.lancetx.ProxyThread")
    public static void replaceNewThread(){
    }

}

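Here beforeType is the original type and afterType is the replacement type. After the project is compiled with the plugin, each new Thread(..) in the project's source is automatically rewritten to new ProxyThread(...).

Replacing a class's superclass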

@Weaver
@Group("threadOptimize")
public class ThreadOptimize {

    @ChangeClassExtends(
            beforeExtends = "java.lang.Thread",
            afterExtends = "com.knightboost.lancetx.ProxyThread"
    )
    public void changeExtendThread() {
    }

}

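Here beforeExtends is the original superclass and afterExtends is the new one. After the project is compiled, every class in the source that extends java.lang.Thread is automatically rewritten to extend ProxyThread.

Summary

This article covered several thread optimizations:

Reducing the cost of creating threads from the main thread
Converging the thread count
Reducing the default virtual memory reserved per thread
Handling OOM

Which of these to adopt depends on your project. Moving thread creation off the main thread is simple to implement and has a small blast radius, so it can be done first. Thread convergence involves bytecode instrumentation and several proxy classes and is more complex, so decide whether it is worth doing based on your project's actual thread count.

Thread OOM problems mainly occur on low-end devices or on certain vendors' models. For apps with a very large user base the fix can pay off, but if your app's daily active users are not that numerous, this optimization is a lower priority.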

Earlier articles in the performance optimization column:

Article	Address
Another way to monitor Android Looper Message scheduling	juejin.cn/post/713974…
Collecting system CPU usage on newer Android versions	juejin.cn/post/713503…
Implementation and application of Method Trace on the Android platform	juejin.cn/post/710713…
How to solve the jank and ANR caused by SharedPreferences on Android	juejin.cn/post/705476…
Performance monitoring based on JVMTI	juejin.cn/post/694278…

References

1. A certain sound App

2. Kernel thread creation process

3. Virtual memory optimization: thread + multi-process optimization, juejin.cn/post/720930…

4. github.com/didi/booste…