Baidu App Startup Performance Optimization Practice


1. Introduction

Startup performance is one of the core metrics of Baidu App. Users expect an app to respond and load quickly; an app that takes too long to start falls short of that expectation and disappoints them. A poor startup experience can lead users to give the app a low rating in the app store, or even to abandon it entirely. Startup performance has therefore become the most critical part of experience optimization, and Baidu App keeps investing in it to improve the user experience.

The startup performance series is divided into four parts: overview, tools, optimization, and regression prevention. This article covers the optimization part. For the previously published articles, see:

Baidu App low-end machine optimization - startup performance optimization (overview)

Baidu App Android Startup Performance Optimization-Tools

Baidu App Performance Optimization Tool - Thor Principle and Practice

2. Optimization theory

How well you understand startup determines the direction and approach of startup optimization, which in turn determines its results. Many developers' understanding of the startup process comes from this description in the Google developer documentation:

picture

1. Create an application object;

2. Start the main thread;

3. Create the main activity;

4. Inflate views;

5. Lay out the screen;

6. Perform initial drawing.

Once the application process finishes drawing for the first time, the system process replaces the currently displayed background window with the main activity. At this point, the user can start using the app.

The steps above describe the stages of app startup, but only as a general overview. In practice there are many ways to start an app, and different startup paths are likely to execute different logic. Understanding the full set of paths therefore plays a very important role in optimization, as shown in the figure below:

picture

Tapping the desktop icon is the mainstream cold start path, but starts triggered by push notifications, by the browser, and by other launches from outside the app are also common. Every startup path can be broken down into roughly four stages: process creation, framework loading, home page rendering, and preloading. Startup optimization therefore has to cover not only the desktop icon path but all startup paths in order to deliver the best possible experience.

Startup also needs to be understood at the system level in order to find optimization points and explore the limits of optimization. The process is complex and requires many system processes to cooperate before the page is displayed and usable. The following figure shows the startup process when the icon is tapped:

picture

The startup process can be roughly summarized as:

1. The Launcher notifies ActivityManagerService (hereinafter AMS) to start the app's main Activity;

2. AMS records the information of the Activity to be started and notifies the Launcher to enter the paused state;

3. After entering the paused state, the Launcher notifies AMS that it has paused, and AMS begins starting the app;

4. Because the app is not yet running, AMS starts a new process, creates an ActivityThread object in it, and executes its main method;

5. Once the app's main thread has started, it notifies AMS and passes in its ApplicationThread for communication;

6. AMS notifies the app to bind the Application and start MainActivity;

7. The app starts MainActivity, creates and associates its Context, and calls onCreate; finally the page is drawn and displayed on screen.

The main processes involved and their roles are:

1. Launcher process: the phone's desktop process, which receives the user's tap and notifies AMS of the event;

2. SystemServer process: responsible for scheduling the app startup flow, creating and managing processes, and creating and managing windows (StartingWindow and AppWindow). Its core services are AMS and WMS (WindowManagerService);

3. Zygote process: creates the app process through fork. When Zygote is initialized it creates a virtual machine and loads the required system class libraries and resource files into memory. A child process forked from Zygote therefore gets a virtual machine with these basic resources already loaded, which speeds up app process startup;

4. SurfaceFlinger process: mainly handles rendering-related work such as Vsync signal processing, window composition, and frame buffer management.

With this overall understanding we can look at the problem from a higher level and analyze performance bottlenecks more deeply, for example whether the device load is reasonable and how system resources are used, and consider startup optimization more comprehensively.

3. Optimization in practice

The optimization of the startup performance of Baidu App is roughly divided into three parts: general optimization, basic mechanism optimization and underlying technology optimization.

3.1 General optimization

When a business is at an early stage and iterating quickly, optimization is relatively simple, and startup time can quite possibly be cut by seconds within a short period. The work is based on understanding cold start and sorting out the startup tasks. You can use performance tools, such as the Trace tool and the Thor Hook tool mentioned above, to find time-consuming operations, evaluate whether each one can be delayed, made asynchronous, or deleted, and prioritize the work by its return on investment, so that startup performance improves quickly.

picture

As more business moved into the startup path, Baidu App gradually grew into an aircraft-carrier-class mobile application that carries a very large amount of business and has a huge code size. It is impossible to remove all of that business preloading or to solve it purely by making it asynchronous. This was the big problem we faced in startup optimization, and it called for a mechanism that handles business preloading in batches. The scheduling mechanism described under basic mechanism optimization grew out of this need to handle the preloading requirements of different businesses during startup.

3.2 Basic Mechanism Optimization

Basic mechanism optimization is mainly divided into scheduling mechanism optimization and basic component performance optimization.

3.2.1 Task scheduling optimization

There are many businesses, and their preloading tasks have different execution requirements. To balance startup performance against business preloading, Baidu App needs a task scheduling framework that business teams can integrate with to resolve performance problems quickly.

picture

The overall structure of task scheduling is shown below; it is still iterating rapidly:

picture

Intelligent scheduling responds differently depending on the task and the information it receives, for example:

1. Personalized scheduling strategy: if a business preloading task is identified as matching the user's behavior habits, the task is initialized earlier and its priority is raised. When the user enters that business's page, tasks unrelated to the business yield to it;

2. Hierarchical experience strategy: if the configuration for the specified device model contains a scheduling strategy, the corresponding capability is executed, such as immediate scheduling, delayed scheduling, or no scheduling. This is mainly used to degrade the experience gracefully;

3. Fine-grained scheduling strategy: business preloading tasks are scheduled per scenario. For example, in the splash screen scenario the business information related to the splash screen is identified and preloaded; when the app is launched from outside, the business information of the landing page is identified and preloaded;

4. Priority-based delayed scheduling: a large number of tasks rely on delayed scheduling for initialization, and business initialization must be kept orderly. Delayed scheduling therefore has a notion of priority, so that higher-priority tasks run sooner;

5. Parallel rendering scheduling for the home page UI: this mainly serves the commercial splash screen during cold start. Whether a commercial splash screen is shown, and which creative is used, is decided by a real-time network request made during cold start, so that request should be given as much time as possible to raise its success rate. In Baidu App the home page can now be initialized first without being displayed; when the home page is about to be submitted for rendering, the app checks whether the commercial splash screen should be shown. This gives the commercial network request more time without blocking home page initialization. The technique greatly improved the success rate of the commercial network request and increased commercial revenue.

Because the scheduler framework involves many details, only one of the schedulers is briefly introduced here: the hierarchical experience scheduler.
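As an illustration of item 4, the following is a minimal sketch of priority-based delayed scheduling in plain Java. It is not Baidu App's scheduler: the class name `DelayedTaskQueue`, the task names, and the rule that a larger number means higher priority are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical sketch: delayed startup tasks drained in priority order. */
final class DelayedTaskQueue {
    private static final class Task {
        final String name;
        final int priority;   // assumption: a higher value runs earlier
        final long sequence;  // keeps FIFO order among equal priorities
        final Runnable body;

        Task(String name, int priority, long sequence, Runnable body) {
            this.name = name;
            this.priority = priority;
            this.sequence = sequence;
            this.body = body;
        }
    }

    private final PriorityQueue<Task> queue = new PriorityQueue<>((a, b) ->
            a.priority != b.priority
                    ? Integer.compare(b.priority, a.priority)
                    : Long.compare(a.sequence, b.sequence));
    private long nextSequence = 0;

    /** Registers a task whose execution is deferred until drain() is called. */
    synchronized void submit(String name, int priority, Runnable body) {
        queue.add(new Task(name, priority, nextSequence++, body));
    }

    /** Runs every queued task, highest priority first, and returns the execution order. */
    synchronized List<String> drain() {
        List<String> order = new ArrayList<>();
        Task task;
        while ((task = queue.poll()) != null) {
            task.body.run();
            order.add(task.name);
        }
        return order;
    }
}
```

In a real app, `drain()` would be triggered once startup has finished, for example after the first frame, and would likely hand tasks to a thread pool rather than run them inline.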

picture

It consists of three modules: model rating, hierarchical configuration, and the hierarchical scheduling mechanism. Together they deliver the best achievable experience on phones with different hardware.

  • Model rating:

    • A score computed from device information, called the static score;

    • A score computed from performance metrics, called the dynamic score;

    • The final model score, obtained from the scoring information through a trained model;

  • Hierarchical configuration:

    • Cloud configuration table: provides a graded configuration table for each business according to the device rating. The table supports dynamic and incremental updates, and updates take effect promptly on the backend;

    • Local preset table: a configuration table is preset locally for the first installation;

    • The control strategy is derived from the model rating and the hierarchical configuration;

  • Hierarchical scheduling:

    • Business teams control their logic based on the model rating, so that high-end phones get the best experience with all features, mid-range phones get a good experience with some features, and low-end phones get a smooth experience with core features. For example, a feature can use the lazy loading strategy on mid-range phones and be turned off on low-end phones;
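The three modules can be sketched as a small lookup: a score maps to a tier, and a configuration table maps the tier to a strategy. This is only an illustration; the score thresholds, the enum names, and the default of lazy loading when no entry exists are assumptions, not values from Baidu App.

```java
import java.util.EnumMap;
import java.util.Map;

/** Hypothetical sketch of tier-based scheduling decisions. */
final class TieredScheduler {
    enum Tier { HIGH, MID, LOW }
    enum Strategy { IMMEDIATE, LAZY, OFF }

    /** Maps a device score to a tier; the thresholds are made up for illustration. */
    static Tier tierOf(int deviceScore) {
        if (deviceScore >= 80) return Tier.HIGH;
        if (deviceScore >= 50) return Tier.MID;
        return Tier.LOW;
    }

    private final Map<Tier, Strategy> config = new EnumMap<>(Tier.class);

    /** The table would come from the cloud configuration or the local preset. */
    TieredScheduler(Map<Tier, Strategy> table) {
        config.putAll(table);
    }

    /** Falls back to LAZY when the table has no entry for the tier (an assumption). */
    Strategy strategyFor(int deviceScore) {
        return config.getOrDefault(tierOf(deviceScore), Strategy.LAZY);
    }
}
```

A business feature would then ask `strategyFor(score)` once at startup and either initialize immediately, defer its initialization, or skip it.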

3.2.2 KV storage optimization

SharedPreferences (hereinafter SP) is a lightweight storage class on the Android platform used to save an application's configuration. In essence it is an XML file that stores data as key-value pairs, saved under the /data/data/pkg/shared_prefs directory. Its advantage is that key-value storage is convenient to use and easy to understand. Its disadvantages are equally obvious: read and write performance is slow, because the data is stored as XML and every write is a full rewrite of the file; multi-process support is poor and stored data is easily lost; and many threads are created, which hurts performance.

Poor read performance

Each time an SP file is loaded, a new thread is created. The source code is as follows:

private final Object mLock = new Object();
private boolean mLoaded = false;
private void startLoadFromDisk() {
    synchronized (mLock) {
        mLoaded = false;
    }
    new Thread("SharedPreferencesImpl-load") {
        public void run() {
            loadFromDisk();
        }
    }.start();
}

However, if loading has not finished when a value is read, the read blocks until the SP file has been loaded:

public String getString(String key, @Nullable String defValue) {
    synchronized (mLock) {
        awaitLoadedLocked();
        String v = (String)mMap.get(key);
        return v != null ? v : defValue;
    }
}
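The effect of these two snippets can be reproduced outside Android. The sketch below is a simplified stand-in, not the framework code: `BlockingPrefs`, the simulated disk delay, and the single preloaded key are assumptions. It shows that the first `getString` call pays for the whole file load even though loading runs on another thread.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the load-then-await pattern shown above. */
final class BlockingPrefs {
    private final Object lock = new Object();
    private boolean loaded = false;
    private Map<String, String> map = new HashMap<>();

    /** Starts loading on a background thread, as startLoadFromDisk() does. */
    BlockingPrefs(long simulatedDiskMillis) {
        Thread loader = new Thread(() -> {
            try {
                Thread.sleep(simulatedDiskMillis); // pretend to parse the XML file
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            Map<String, String> parsed = new HashMap<>();
            parsed.put("theme", "dark");
            synchronized (lock) {
                map = parsed;
                loaded = true;
                lock.notifyAll();
            }
        }, "prefs-load");
        loader.setDaemon(true);
        loader.start();
    }

    /** Blocks the caller until loading has finished, then reads the value. */
    String getString(String key, String defValue) {
        synchronized (lock) {
            while (!loaded) {
                try {
                    lock.wait();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return defValue;
                }
            }
            String v = map.get(key);
            return v != null ? v : defValue;
        }
    }
}
```

If the caller is the main thread, the wait inside `getString` is time taken directly out of startup.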

Poor write performance

SP uses the XML format, and each write is a full update, which is inefficient. There are two ways to write:

  • commit: blocks the current thread. After the modification is applied in memory, it waits for the IO to complete. Calling commit on the main thread is very likely to cause jank;

  • apply: does not block the current thread, but has a hidden pitfall that can still freeze the main thread. The apply method adds the write Runnable to QueuedWork, and when the lifecycle of one of Android's four major components changes, the framework checks whether QueuedWork has finished and waits if it has not. The code is as follows:

public void handlePauseActivity(IBinder token, boolean finished, boolean userLeaving,
        int configChanges, PendingTransactionActions pendingActions, String reason) {
    ......
    // Make sure all pending write tasks have finished
    QueuedWork.waitToFinish();
    ......
}

This is why many SharedPreferences stacks show up in ANR and jank monitoring. They look like system-level stacks, but the problem is actually introduced by SP's apply method. A typical stack is as follows:

java.lang.Object.wait(Native Method) 
java.lang.Thread.parkFor$(Thread.java: ) 
sun.misc.Unsafe.park(Unsafe.java: )
java.util.concurrent.locks.LockSupport.park(LockSupport.java: ) 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java: ) 
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java: )
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java: ) 
java.util.concurrent.CountDownLatch.await(CountDownLatch.java: ) 
android.app.SharedPreferencesImpl$EditorImpl$1.run(SharedPreferencesImpl.java: ) 
android.app.QueuedWork.waitToFinish(QueuedWork.java: ) 
android.app.ActivityThread.handleServiceArgs(ActivityThread.java: )
android.app.ActivityThread.-wrap21(ActivityThread.java) 
android.app.ActivityThread$H.handleMessage(ActivityThread.java: ) 
android.os.Handler.dispatchMessage(Handler.java: ) 
android.os.Looper.loop(Looper.java: ) 
android.app.ActivityThread.main(ActivityThread.java: )
java.lang.reflect.Method.invoke(Native Method) 
com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java: ) 
com.android.internal.os.ZygoteInit.main(ZygoteInit.java: )
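The stack above ends in a CountDownLatch wait, which matches the apply pitfall described earlier. The following plain-Java sketch imitates that mechanism under simplifying assumptions: `PendingWrites` is a hypothetical class, disk writes are simulated with sleeps, and the real QueuedWork bookkeeping is more involved.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical imitation of "apply now, wait at the next lifecycle event". */
final class PendingWrites {
    private final ConcurrentLinkedQueue<CountDownLatch> pending = new ConcurrentLinkedQueue<>();
    private final ExecutorService diskWriter = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "disk-writer");
        t.setDaemon(true);
        return t;
    });

    /** Like apply(): returns immediately while the write runs on another thread. */
    void applyAsync(long simulatedWriteMillis) {
        CountDownLatch done = new CountDownLatch(1);
        pending.add(done);
        diskWriter.execute(() -> {
            try {
                Thread.sleep(simulatedWriteMillis); // pretend to rewrite the XML file
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                done.countDown();
            }
        });
    }

    /** Like QueuedWork.waitToFinish(): the caller blocks until every write is done. */
    void waitToFinish() {
        CountDownLatch latch;
        while ((latch = pending.poll()) != null) {
            try {
                latch.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }
}
```

Each `applyAsync` call is cheap on its own, but the cost is only deferred: whoever calls `waitToFinish` absorbs the sum of all outstanding writes.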

Poor multi-process support

The MODE_MULTI_PROCESS flag is unreliable in practice, because Android has no proper mechanism to prevent conflicts between multiple processes. Applications should not use it; ContentProvider is recommended instead. When multiple processes access a SharedPreferences opened with MODE_MULTI_PROCESS, conflicts occur. For example, process A sets a key, but after switching to process B, reading the same key returns null.

picture

3.2.2.1 Optimization scheme design

Major vendors have already optimized SP to varying degrees. Some optimizations are conservative: they work within SP's existing mechanism and mainly address the ANRs caused by writes. Others are disruptive replacements, the most representative being MMKV and DataStore, but our evaluation found that these may have certain problems. In optimizing Baidu App we therefore borrowed from the mainstream approaches in the industry and adopted two methods:

  • Provide a replacement component, UniKV, which thoroughly solves the problems of native SP and gives core scenarios the best experience; business teams opt in actively;

  • Optimize the system SP mechanism to address pain points such as ANRs on write; this mainly serves SP files that have not been moved to UniKV;

3.2.2.1.1 UniKV design

Layered design

picture

1: Business code depends directly on UniKV, which implements the SharedPreferences interface and stays aligned with the native SP API;

2: The project contains both a native implementation and a UniKV implementation. The code depends directly on the native implementation, which is replaced with the UniKV implementation at build and packaging time; this preserves the business platform's ability to be delivered externally;

File storage format design

The file is divided into a file header and data blocks. The first 40 bytes of the file mainly store the version number, write-back count, reserved fields, disaster recovery data length, disaster recovery CRC, actual data length, and actual CRC.

picture

1: Space is allocated in units of 4KB, so a file occupies at least 4KB. The file is mapped with mmap, and the operating system is responsible for writing the data back to the file;

2: Data can be recovered using the disaster recovery data length and the disaster recovery CRC;

3: The reserved fields allow functional extension, for example recording whether migration from SP has succeeded;
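To make the header idea concrete, here is a sketch of a 40-byte header with a primary and a backup length/CRC pair. The article lists the fields but not their sizes or order, so the offsets below, the 16-byte reserved area, and the recovery rule (trust the backup pair when the primary pair fails its CRC check) are all assumptions rather than UniKV's actual layout.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

/** Hypothetical 40-byte header; field sizes and offsets are assumed, not UniKV's. */
final class KvHeader {
    static final int SIZE = 40;
    static final int OFF_VERSION = 0;      // 4 bytes
    static final int OFF_WRITE_BACKS = 4;  // 4 bytes
    static final int OFF_RESERVED = 8;     // 16 bytes, assumed so the total is 40
    static final int OFF_BACKUP_LEN = 24;  // 4 bytes
    static final int OFF_BACKUP_CRC = 28;  // 4 bytes
    static final int OFF_DATA_LEN = 32;    // 4 bytes
    static final int OFF_DATA_CRC = 36;    // 4 bytes

    static int crc32(byte[] data, int length) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, length);
        return (int) crc.getValue();
    }

    /** Builds a buffer holding the header followed by the data block. */
    static ByteBuffer write(int version, byte[] data) {
        ByteBuffer buf = ByteBuffer.allocate(SIZE + data.length);
        int crc = crc32(data, data.length);
        buf.putInt(OFF_VERSION, version);
        buf.putInt(OFF_DATA_LEN, data.length);
        buf.putInt(OFF_DATA_CRC, crc);
        buf.putInt(OFF_BACKUP_LEN, data.length); // copy of the last known-good state
        buf.putInt(OFF_BACKUP_CRC, crc);
        buf.position(SIZE);
        buf.put(data);
        return buf;
    }

    /** Returns how many data bytes can be trusted, or -1 if neither CRC matches. */
    static int validLength(ByteBuffer buf) {
        byte[] body = new byte[buf.capacity() - SIZE];
        ByteBuffer view = buf.duplicate();
        view.clear();
        view.position(SIZE);
        view.get(body);
        int len = buf.getInt(OFF_DATA_LEN);
        if (len >= 0 && len <= body.length && crc32(body, len) == buf.getInt(OFF_DATA_CRC)) {
            return len;
        }
        int backup = buf.getInt(OFF_BACKUP_LEN);
        if (backup >= 0 && backup <= body.length
                && crc32(body, backup) == buf.getInt(OFF_BACKUP_CRC)) {
            return backup;
        }
        return -1;
    }
}
```

In a real store the backup pair would be refreshed only after a write is known to be complete, so a crash in the middle of an append leaves at least one consistent pair.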

picture

The main data body is stored in the data block. Records are appended, and the data is compacted when necessary.

1: Values are stored with their type, which keeps getAll aligned with the native SP interface;

2: Nine types are supported: BOOL, INT, FLOAT, DOUBLE, SHORT, LONG, STRING, STRING_ARRAY, and BYTE_ARRAY, which is more than the native SP implementation supports;
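The append-plus-type-tag idea can be sketched with an in-memory log. This is an illustration only: the record encoding (one type byte, a key, then the value), the two tags shown, and the class name `AppendLog` are assumptions and do not describe UniKV's on-disk format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical append-only log of typed key-value records. */
final class AppendLog {
    static final byte TYPE_INT = 1;     // made-up type tags
    static final byte TYPE_STRING = 2;

    private final ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    private final DataOutputStream out = new DataOutputStream(bytes);

    void putInt(String key, int value) {
        try {
            out.writeByte(TYPE_INT);
            out.writeUTF(key);
            out.writeInt(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    void putString(String key, String value) {
        try {
            out.writeByte(TYPE_STRING);
            out.writeUTF(key);
            out.writeUTF(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Replays the log; a later record for the same key overrides an earlier one. */
    Map<String, Object> getAll() {
        Map<String, Object> result = new LinkedHashMap<>();
        try (DataInputStream in =
                     new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            while (in.available() > 0) {
                byte type = in.readByte();
                String key = in.readUTF();
                if (type == TYPE_INT) {
                    result.put(key, Integer.valueOf(in.readInt()));
                } else {
                    result.put(key, in.readUTF());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }

    int sizeInBytes() {
        return bytes.size();
    }
}
```

Because the type tag travels with each record, `getAll` can return properly typed values. Overwritten keys leave stale records behind, which is why an append-only format needs the occasional compaction mentioned above.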

Data migration

Migration has to read the SP content first and then write the KV file, which takes a long time, and the KV file is usable only after the write completes. This is a hidden risk online and needs to be addressed.

picture

Data migration in UniKV is designed not to affect the business. If migration has completed, the KV file is used directly. If it has not, the SP file continues to be used and a migration Runnable is submitted to the thread pool. To avoid losing data when the SP file changes during migration, a listener is registered for SP changes. The migration-complete flag is stored in a reserved field of the file header. Normally such a flag would have to be saved in a separate file; the reserved field in UniKV avoids that.
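The migration flow can be sketched with two in-memory maps standing in for the SP file and the KV file. Everything here is a simplification under stated assumptions: `MigratingStore` is hypothetical, the volatile flag stands in for the reserved header field, and writing to both stores under a lock replaces the change listener described above.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executor;

/** Hypothetical sketch: keep serving from the old store until migration finishes. */
final class MigratingStore {
    private final Map<String, String> legacy;                          // stands in for the SP file
    private final Map<String, String> kv = new ConcurrentHashMap<>();  // stands in for the KV file
    private volatile boolean migrated = false;                         // stands in for the reserved-field flag

    MigratingStore(Map<String, String> spContent, Executor executor) {
        this.legacy = new ConcurrentHashMap<>(spContent);
        executor.execute(this::migrate); // migration runs off the caller's thread
    }

    private synchronized void migrate() {
        kv.putAll(legacy);
        migrated = true;
    }

    boolean isMigrated() {
        return migrated;
    }

    /** Reads never wait for migration: they use whichever store is valid right now. */
    String get(String key) {
        return migrated ? kv.get(key) : legacy.get(key);
    }

    /** Writes made before migration completes go to both stores, so none are lost. */
    synchronized void put(String key, String value) {
        if (!migrated) {
            legacy.put(key, value);
        }
        kv.put(key, value);
    }
}
```

The point of the design is that business code never blocks on migration; it only sees a switch from one backing store to the other.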

Multi-process implementation

Data is synchronized between processes with mmap plus a custom file lock. The file is mapped into each process's memory space with mmap. The custom file lock mainly implements recursive locking and lock upgrade and downgrade: reads take a shared lock across processes and writes take an exclusive lock. Native file locks do not support recursion, and upgrading or downgrading them is prone to deadlock or to releasing the lock entirely, which is why a custom file lock is used. The multi-process implementation mainly follows MMKV's logic; for details, see: https://github.com/Tencent/MMKV/wiki/android_ipc

Results

UniKV solves the performance problems of native SP: read and write performance improves significantly, multi-process reads and writes are supported, fewer threads are created, and both overall performance metrics and business metrics improved noticeably.
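The basic combination of a memory-mapped file with shared and exclusive file locks can be sketched with standard Java NIO. This is not UniKV's or MMKV's lock: it uses the plain `FileLock`, so it has neither recursion nor upgrade and downgrade, and the class name `SharedCounter` and the single-integer payload are assumptions for the example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical sketch: an int shared through mmap and guarded by file locks. */
final class SharedCounter implements AutoCloseable {
    private final FileChannel channel;
    private final MappedByteBuffer map;

    SharedCounter(Path file) {
        try {
            channel = FileChannel.open(file, StandardOpenOption.CREATE,
                    StandardOpenOption.READ, StandardOpenOption.WRITE);
            // Map one 4KB page; the file is extended if it is shorter.
            map = channel.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Readers take a shared lock, so several processes can read at once. */
    int read() {
        try (FileLock ignored = channel.lock(0, Long.MAX_VALUE, true)) {
            return map.getInt(0);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writers take an exclusive lock for the read-modify-write. */
    int increment() {
        try (FileLock ignored = channel.lock(0, Long.MAX_VALUE, false)) {
            int next = map.getInt(0) + 1;
            map.putInt(0, next);
            return next;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public void close() {
        try {
            channel.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Java file locks are held on behalf of the whole JVM, so this sketch coordinates separate processes but not threads inside one process; requesting an overlapping lock while another thread in the same JVM holds one throws OverlappingFileLockException.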

picture

3.2.2.1.2 System SP Mechanism Optimization

Some SP usage lives in plug-ins and third-party SDKs, where UniKV cannot be adopted uniformly, so a solution that optimizes the native SP mechanism is also needed.

Optimization:

picture

Baidu App has not yet applied this optimization on Android 12. The implementation changed in Android 12, which makes the proxy approach more complicated and costly, and SP-caused ANRs are rare there, so it has not been done yet.

Optimization effect:

The effect is global. Besides a significant drop in the ANR metric, DAU and retention also moved in a positive direction. Some readers may worry that data writes are affected. Our monitoring shows that the timeliness of SP writes did not change noticeably, while the write success rate improved, most clearly on low-end devices. This indicates that the optimization reduces ANRs, so more tasks get to run and more writes succeed.

3.2.3 Lock optimization

Multi-thread performance tuning is an unavoidable topic in performance optimization. Thread synchronization relies on locks (synchronized, Lock, and so on). Locks guarantee atomicity and thread safety, but they reduce performance compared with unlocked code. Lock optimization means keeping the lock's function (the safety of multi-threaded operations) while improving performance, so that the program does not lose too much efficiency for the sake of safety.

Common lock optimization methods:

picture

The following example shows how Baidu App applied lock optimization in practice.

Early in the project, analysis with the Trace tool revealed many "monitor contention XXX" entries. This is lock information output by the Android ART virtual machine; it includes the thread and method holding the lock and the thread and method waiting for it, as shown in the figure below:

picture

Analysis showed that the contention was mainly caused by incorrect use of the synchronized keyword during initialization of the AB (ABTest) basic component. The component had performance problems in both multi-threading and file IO, so it was refactored and upgraded to resolve them.

picture

After the optimization, reads and writes are lock-free, which removes the lock contention that business code hit when using the ABTest component. The new component is compatible with both old and new AB data, caches experiment switches and experiment sid data, and stores them in JSON/PB format. The first read takes 118ms, a 95% improvement (measured on a Xiaomi 5).
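The article does not say how the lock-free reads and writes were implemented. One common pattern that fits the description is an immutable snapshot swapped atomically with compare-and-set, sketched below. The class name `ExperimentSwitches` and the boolean-switch model are assumptions, and this should not be read as the ABTest component's actual design.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical sketch: experiment switches read without taking any lock. */
final class ExperimentSwitches {
    private final AtomicReference<Map<String, Boolean>> snapshot =
            new AtomicReference<>(Collections.emptyMap());

    /** A read is a single volatile load followed by a lookup in an immutable map. */
    boolean isEnabled(String key) {
        return snapshot.get().getOrDefault(key, false);
    }

    /** A write copies the map and swaps it in with compare-and-set, retrying on conflict. */
    void update(String key, boolean enabled) {
        snapshot.updateAndGet(old -> {
            Map<String, Boolean> copy = new HashMap<>(old);
            copy.put(key, enabled);
            return Collections.unmodifiableMap(copy);
        });
    }
}
```

This trades some copying on writes for reads that never block, which suits data that is read constantly during startup and changed rarely.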

3.2.4 Other basic mechanism optimization

Baidu App's startup optimization also includes many other basic mechanism optimizations, such as thread optimization, IO optimization, SO optimization, main thread priority optimization, ContentProvider optimization, class and image preloading, and uploading images to the GPU in advance.

Thread optimization

We wrote plug-ins based on the Hook capability to discover irregular thread usage and established thread usage rules, such as:

1: Business code must not set thread priorities on its own;

2: A unified thread pool is provided so that each business does not create its own;

3: Work should be scheduled through the thread pool or the task scheduler; business code must not create threads or thread pools by itself;

4: Thread pools must avoid creating threads frequently and must use standardized parameters.
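Rules 2 to 4 amount to having one place that owns thread creation. A minimal sketch of such a holder follows; the class name `AppExecutors`, the pool sizing, the 30-second keep-alive, and the thread naming scheme are illustrative assumptions rather than Baidu App's parameters.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical single owner of background threads for all business code. */
final class AppExecutors {
    private static final AtomicInteger COUNTER = new AtomicInteger();
    private static final int CORES = Math.max(2, Runtime.getRuntime().availableProcessors());

    private static final ThreadPoolExecutor SHARED = new ThreadPoolExecutor(
            CORES, CORES, 30, TimeUnit.SECONDS, new LinkedBlockingQueue<>(),
            runnable -> {
                // Named threads make traces and ANR stacks easier to attribute.
                Thread t = new Thread(runnable, "app-pool-" + COUNTER.incrementAndGet());
                t.setDaemon(true);
                return t;
            });

    static {
        // Idle threads are released instead of being kept alive forever.
        SHARED.allowCoreThreadTimeOut(true);
    }

    private AppExecutors() {}

    /** Every business obtains the same pool instead of creating its own. */
    static ExecutorService shared() {
        return SHARED;
    }
}
```

Business code would call `AppExecutors.shared().execute(task)` and never construct a Thread or pool directly, which is also what makes a Hook-based check for stray thread creation practical.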

I/O optimization

We wrote plug-ins based on the Hook capability to find unreasonable IO, mainly:

  1. Reads or writes on the main thread that take more than 100ms. Long IO on the main thread stalls it and in serious cases causes ANRs;

  2. Read or write buffers that are too small. A small buffer leads to more read/write calls, and therefore more system calls and memory copies, which hurts performance.
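The buffer-size point in item 2 can be demonstrated without touching a real disk. In the sketch below, `CountingStream` is a hypothetical in-memory stream that counts how often it is asked for data, standing in for system calls; the buffer sizes tried are arbitrary.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

/** Hypothetical demo: how buffer size changes the number of underlying reads. */
final class BufferSizeDemo {
    /** In-memory stream that counts how often the "disk" is asked for data. */
    static final class CountingStream extends InputStream {
        private final byte[] data;
        private int pos = 0;
        int underlyingReads = 0;

        CountingStream(byte[] data) {
            this.data = data;
        }

        @Override
        public int read() {
            underlyingReads++;
            return pos < data.length ? (data[pos++] & 0xff) : -1;
        }

        @Override
        public int read(byte[] b, int off, int len) {
            underlyingReads++;
            if (pos >= data.length) return -1;
            int n = Math.min(len, data.length - pos);
            System.arraycopy(data, pos, b, off, n);
            pos += n;
            return n;
        }
    }

    /** Consumes the data byte by byte through a buffer of the given size. */
    static int underlyingReads(byte[] data, int bufferSize) {
        CountingStream source = new CountingStream(data);
        try (InputStream in = new BufferedInputStream(source, bufferSize)) {
            while (in.read() != -1) {
                // consume one byte at a time, as naive parsing code often does
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return source.underlyingReads;
    }
}
```

Comparing `underlyingReads(data, 16)` with `underlyingReads(data, 16 * 1024)` for the same 64KB input shows the number of underlying reads dropping by orders of magnitude, which is the effect a Hook-based IO check would be looking for.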

SO optimization

We wrote plug-ins based on the Hook capability to discover SO loading problems, removed unnecessary SO loading, and, where loading is necessary, tried to load ahead of time on a background thread.

Binder optimization

We wrote plug-ins based on the Hook capability to discover problems related to Binder communication, removed unnecessary Binder calls, and where needed reduced them further through memory caching, file persistence, and similar techniques.

Main thread priority

The priority of the main thread determines the resources the system allocates to it. If the main thread is accidentally changed to a low priority, it may not get CPU time slices and will run slowly. In our investigation, the most typical cause was business code that meant to set the priority of a child thread but ended up changing the main thread's priority. The problematic pattern looks like this:

Thread t = new Thread();t.start();t.setPriority(3);

Android's bizarre trap: the WeChat jank incident caused by setting a thread priority:

https://mp.weixin.qq.com/s/oLz_F7zhUN6-b-KaI8CMRw

While checking priority settings, Baidu App also found that native libraries contain logic that changes thread priority and needs to be corrected proactively, for example part of the logic in Facebook's React library:

picture

ContentProvider/FileProvider optimization

Between Application.attachBaseContext and Application.onCreate, the installContentProviders method runs and initializes the ContentProviders and FileProviders declared in AndroidManifest. FileProvider is usually the expensive one, mainly because it performs IO during initialization. The main optimizations are to remove unnecessary ContentProviders/FileProviders, to restrict them to a specific process with the android:process attribute, or to initialize them lazily within the process.

Image prepareToDraw optimization

picture

In the Trace tool you may see time spent in "Uploading XXX Texture" when syncFrameState runs on RenderThread. First check the bitmap width and height shown in the trace to make sure the image is not much larger than the area it is displayed in. You can also trigger the Bitmap's upload to the GPU in advance with the prepareToDraw method, so that the upload completes while RenderThread is idle. Ideally the image loading library does this for you; if you manage image loading yourself, or need to ensure that no Bitmap upload is triggered while drawing, you can call prepareToDraw directly.

Some readers may wonder whether this helps startup, since it does not optimize the main thread. It does. In the first few frames after startup each frame takes a long time, and each frame's work runs as a DrawFrame task on RenderThread. If the previous frame's task has not finished, the current frame's drawing is blocked, and on the main thread this shows up as a slow draw phase, for example a long nSyncAndDrawFrame.

3.3 Optimization of the underlying mechanism

This part explores underlying technology to improve performance metrics and create business value. The direction is relatively risky and costly, so the decision should be based on available manpower and the expected effect.

Baidu App has tried VerifyClass optimization, CPU Booster optimization, GC-related optimization, and others, and is still exploring further techniques. These are essentially global optimizations and will be covered in a follow-up series on fluency.

4. Summary

Startup performance optimization is a complex technical direction. Many businesses are closely tied to startup, and many system behaviors during startup also deserve attention and investment. Baidu App's startup performance has improved steadily and has now reached a plateau; breaking through it while staying closely integrated with the business is both the challenge and the opportunity ahead. Startup optimization is a process of continuous learning, rethinking, and improvement, and it is never finished.

—— END——

References:

1. Douyin startup optimization

https://heapdump.cn/article/3624814

2. Kuaishou TTI Governance Experience Sharing

https://zhuanlan.zhihu.com/p/422859543

3. Analysis of Android startup optimization

https://juejin.cn/post/7183144743411384375

4. MMKV

https://github.com/Tencent/MMKV/wiki/android_ipc


Origin blog.csdn.net/lihui49/article/details/131653586