Dewu App Android Cold Start Optimization: Application

Preface

The cold start metric is one of the most important App experience indicators; for e-commerce apps it has a decisive impact on users' willingness to stay. It is usually defined as the time from the start of the App process to the first frame of the homepage, but from the user-experience perspective it should span from the moment the user taps the App icon to the point when the homepage content is fully displayed.

Splitting the startup work into tasks and organizing them into a directed acyclic graph is by now standard for the startup framework of a componentized app. However, because of the performance limits of mobile devices, careless use of high concurrency frequently causes problems such as lock contention and blocking disk IO. How to go one step further, make the most of the limited resources during the limited startup window, and minimize main-thread time while keeping business features stable is the topic this article discusses.

This article describes how, through unified management of system resources during the startup phase, on-demand allocation, and staggered loading, we reduced Dewu App's online startup metric by 10% and its offline metric by 34%, moving it into the top 3 among comparable e-commerce apps.

1. Indicator selection

Traditional performance monitoring usually takes Application.attachBaseContext as the start point and the execution of the draw task posted via decorView.post on the homepage as the end point. However, this fails to measure the time spent on dex loading and ContentProvider initialization.

Therefore, to get closer to the real user experience, we added an offline "user-perceived" metric on top of the startup-speed metric. By analyzing a screen recording frame by frame, we take the frame where the icon-press animation starts (the icon dims) as the start frame and the first frame where homepage content appears as the end frame; the difference between the two is the startup time.

Example: the startup spans from 0:03.00 to 0:03.88 in the recording, so the startup time is 880 ms.

2. Application optimization

Depending on the business scenario, the App may land on different homepages (community / trade / H5), but the Application phase runs essentially the same steps every time and rarely changes, so Application optimization was our first choice.

The startup-framework tasks of Dewu App have gone through many rounds of optimization over recent years; the conventional routine of capturing traces, finding hot spots, and making them asynchronous no longer yields obvious gains. Further progress has to come from the perspective of lock contention and CPU utilization. This kind of optimization may not show dramatic short-term wins, but in the long run it prevents many regressions in advance.

1. WebView optimization

When the app calls the WebView constructor for the first time, the system's WebView initialization is triggered, which usually takes 200+ ms. The conventional approach to such a time-consuming task is to push it onto a child thread, but the Chromium kernel performs many thread checks, so a WebView can only be used on the thread that constructed it.

To speed up opening H5 pages, apps usually initialize a WebView and cache it during the Application phase. However, WebView initialization involves cross-process interaction and file reads, so any shortage of CPU time slices, disk bandwidth, or binder thread-pool capacity inflates its cost, and the Application phase runs many tasks, making exactly these shortages likely.


Therefore, we split WebView initialization into three steps spread across different stages of startup. This reduces the cost inflation caused by resource contention and also greatly lowers the probability of ANR.


1.1 Task splitting

a. Provider preloading

WebViewFactoryProvider is the interface class used to interact with the WebView rendering process. The first step of WebView initialization is to load the system WebView apk, build a classloader, and reflectively create a static WebViewFactoryProvider instance. This step involves no thread checks, so it can be handed directly to a child thread.


b. Initialize the webview rendering process

This step corresponds to the WebViewChromiumAwInit.ensureChromiumStartedLocked() method in the Chromium kernel. It is the most time-consuming part of WebView initialization, but it normally runs back-to-back with the third step. Code analysis shows that among the interfaces WebViewFactoryProvider exposes to the application, the getStatics method triggers ensureChromiumStartedLocked.

So we can initialize the WebView rendering process on its own simply by calling WebSettings.getDefaultUserAgent().

c. Construct webview

i.e. new Webview()

1.2 Task allocation

To minimize main-thread time, we schedule the tasks as follows:

  • a. Provider preloading can run asynchronously and has no prerequisites, so it is executed at the earliest possible point in the Application phase.
  • b. Initializing the WebView rendering process must happen on the main thread, so it is placed after the first frame of the homepage.
  • c. Constructing the WebView must also happen on the main thread. When step b completes, it is posted to the main thread, ensuring it runs in a different message from step b and reducing the chance of ANR.
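The scheduling above can be sketched as a conceptual model on the plain JVM. A single-thread executor stands in for the Android main thread and the step bodies are placeholders; on Android, step a would build the WebViewFactoryProvider on a worker thread, step b would call WebSettings.getDefaultUserAgent(), and step c would be new WebView(). The class and label names are illustrative, not Dewu's real framework code.

```java
import java.util.*;
import java.util.concurrent.*;

// Conceptual model of the three-step WebView split.
class WebViewWarmup {
    static final List<String> order = Collections.synchronizedList(new ArrayList<>());

    static void start() {
        try {
            ExecutorService worker = Executors.newSingleThreadExecutor();
            ExecutorService mainThread = Executors.newSingleThreadExecutor();
            // step a: no thread checks, so run it off the main thread, as early as possible
            worker.submit(() -> order.add("a")).get();
            // step b: must run on the "main thread", scheduled after the first frame
            mainThread.submit(() -> order.add("b"));
            // step c: posted as a separate task so it never shares step b's message
            mainThread.submit(() -> order.add("c"));
            mainThread.shutdown();
            mainThread.awaitTermination(5, TimeUnit.SECONDS);
            worker.shutdown();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

Posting step c separately matters: if b and c shared one message, their combined cost would count toward a single main-thread dispatch and raise the ANR risk.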

1.3 Summary

Although we split WebView initialization into three parts, step b, the most expensive one, may still hit the ANR threshold on low-end devices or in extreme cases, so we added safeguards. For example, each device records the time taken by a complete WebView initialization, and the segmented execution above is enabled only when that time is below a remotely configured threshold.
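The gating logic can be sketched as follows. A plain Map stands in for MMKV, and the key name and class name are illustrative assumptions, not the real storage schema:

```java
import java.util.*;

// Sketch of gating the segmented WebView init by measured device speed.
class SegmentedInitGate {
    static final Map<String, Long> mmkv = new HashMap<>(); // stand-in for MMKV

    /** Record how long a full (non-segmented) WebView init took on this device. */
    static void recordFullInitCost(long costMs) {
        mmkv.put("webview_full_init_ms", costMs);
    }

    /** Enable the three-step split only on devices proven fast enough. */
    static boolean segmentedInitEnabled(long thresholdMs) {
        Long measured = mmkv.get("webview_full_init_ms");
        return measured != null && measured < thresholdMs;
    }
}
```

A device with no measurement yet stays on the safe, non-segmented path until one full init has been timed.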

If the App is opened via push, deep links, or other delivery channels, the landing page is most likely an H5 marketing page, which is unsuitable for the segmented loading above. So we hook the main thread's MessageQueue, parse out the intent information of the launch page, and decide accordingly.

Limited by the splash-ad feature, we can currently enable this optimization only for launches without a splash ad. We plan to use the gap during the ad countdown to run step b so that splash-ad scenarios are covered as well.


2. ARouter optimization

In the current era of componentization, a routing component is a near-mandatory basic component for any large Android app. Dewu currently uses the open-source ARouter framework.

By default, ARouter takes the first level of the path registered in the annotation (e.g. trade in "/trade/homePage") as the Group of that route, and all routes of the same Group are merged into the generated registration function of a single class and registered synchronously. In a large project, the Group of a complex business line can contain hundreds of entries, and executing its registration logic takes a long time. In Dewu, the business line with the most routes already took 150+ ms to initialize its routes.


Route registration itself is lazy: the first call to a routing component under a Group triggers registration for that Group. However, ARouter also uses an SPI (service discovery) mechanism to let business components expose interfaces, so business-layer features can be called without depending on the business component. When developing these services, developers habitually register the service's route path under the Group of the component it belongs to, so constructing one of these services for the first time also triggers route loading for the whole Group.

During the Application phase, some interfaces of business-module services must be used, which triggers route registration early. Although this can run on an asynchronous thread, most of the Application-phase work needs these services, so when their first construction gets slower, overall startup time inevitably grows with it.

2.1 ARouter Service routing separation

ARouter's SPI design is intended for decoupling, and a Service's role should only be to provide interfaces. We therefore add an empty-implementation Service dedicated to triggering route loading, while the original Service is moved into its own Group and afterwards only provides interfaces. Other Application-phase tasks then no longer have to wait for the route-loading task to complete.


2.2 ARouter supports concurrent loading of routes

After route separation, we found that the total time to load the existing hot routes exceeded the duration of the Application phase. To guarantee route loading finished before entering the splash page, the main thread had to sleep and wait for it.

Analysis showed that ARouter's route-loading method holds a class-level lock because it loads routes into maps in the WareHouse class, and those maps are thread-unsafe HashMaps. All route loading is therefore effectively serial, with lock contention on top, which is why the cumulative time exceeded the Application time.

Analyzing the trace shows that the cost comes mainly from frequent calls to the route-loading loadInto operation. Looking at what the lock actually protects, the class lock exists mainly to keep the map operations in WareHouse thread-safe.


Therefore, we can downgrade the class lock to a lock on the class object GroupMeta (the class generated by ARouter's apt, corresponding to the ARouter$$Provider$$xxx classes in the apk) to keep route loading thread-safe per group. The original thread-safety problem with the map operations is solved completely by replacing those maps with ConcurrentHashMap. Under extreme concurrency a few races remain, which can be handled by adding null checks.
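The lock downgrade plus ConcurrentHashMap replacement can be sketched as below. The WareHouse and group naming follows the article's description of ARouter, but this is an illustrative model, not the real ARouter source; the per-group lock object plays the role of locking the generated GroupMeta class.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Simplified model of ARouter's warehouse after the lock downgrade.
class WareHouse {
    // ConcurrentHashMap replaces the original thread-unsafe HashMap
    static final Map<String, String> routes = new ConcurrentHashMap<>();
    static final Set<String> loadedGroups = ConcurrentHashMap.newKeySet();
}

class Router {
    // one lock per Group instead of one class lock shared by all groups
    private static final Map<String, Object> groupLocks = new ConcurrentHashMap<>();

    static void ensureGroupLoaded(String group, Runnable loadInto) {
        if (WareHouse.loadedGroups.contains(group)) return;          // fast path, no lock
        Object lock = groupLocks.computeIfAbsent(group, g -> new Object());
        synchronized (lock) {                                        // different groups load in parallel
            if (WareHouse.loadedGroups.contains(group)) return;      // double-check inside the lock
            loadInto.run();                                          // generated loadInto(...) call
            WareHouse.loadedGroups.add(group);
        }
    }
}
```

With per-group locks, two threads loading different groups no longer block each other; only callers racing on the same group serialize, and the double-check keeps loadInto from running twice.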


At this point route loading is concurrent. We then group the services to be preloaded according to the bucket effect (balancing the total cost of each group) and run the groups concurrently in coroutines, so that the overall wall-clock time is minimized.
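The "bucket effect" grouping aims to keep the slowest bucket as short as possible. A minimal sketch of that idea is the classic longest-processing-time-first greedy: sort tasks by measured cost descending and always assign the next task to the currently lightest bucket. The class name and the task costs below are made up for illustration.

```java
import java.util.Arrays;

// Greedy LPT partition of task costs into a fixed number of buckets.
class TaskBucketer {
    /** Returns the total load of each bucket after greedy assignment. */
    static long[] bucketLoads(long[] costsMs, int buckets) {
        long[] load = new long[buckets];
        long[] sorted = costsMs.clone();
        Arrays.sort(sorted);                       // ascending; iterate backwards for descending
        for (int i = sorted.length - 1; i >= 0; i--) {
            int min = 0;
            for (int b = 1; b < buckets; b++) {
                if (load[b] < load[min]) min = b;  // find the lightest bucket
            }
            load[min] += sorted[i];
        }
        return load;
    }
}
```

Each bucket then becomes one coroutine (or thread) of preload work, and the overall time is governed by the heaviest bucket.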


3. Lock optimization

Most Application-phase tasks initialize basic SDKs. Their logic is usually fairly self-contained, but there are dependencies between SDKs (for example, the tracking library depends on the network library), and most of them read files and load .so libraries. To compress main-thread time in the Application phase, time-consuming operations are pushed onto child threads to run concurrently and make full use of CPU time slices, but this inevitably introduces lock contention.

3.1 The so-loading lock

System.loadLibrary() loads a .so library from the current apk. The method synchronizes on the Runtime object, which is effectively a class-level lock.

Basic SDKs usually put the loadLibrary call in a static block of a class, so the .so is ready before the SDK's initialization code runs. If such an SDK happens to be a widely used basic library such as the network library, many other SDKs call into it, and multiple threads contend for this lock at once. In the worst case, IO pressure makes reading the .so files slow while the main thread sits at the tail of the lock's wait queue, and startup takes far longer than expected.


To avoid this, we centralize all loadSo operations onto a single thread so that they run serially. It is worth noting that the .so files inside webview.apk are also loaded during the WebView provider preloading described earlier, so the preloadProvider operation must be placed on this thread as well.
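Funneling all native-library loading through one dedicated thread can be sketched as below. SoLoader is an illustrative name, not a real SDK class; in the real app the Runnable would call System.loadLibrary(...), and the caller blocks until its load completes, matching loadLibrary's synchronous semantics.

```java
import java.util.*;
import java.util.concurrent.*;

// All .so loads are submitted to one named thread and run serially.
class SoLoader {
    private static final ExecutorService SO_THREAD =
            Executors.newSingleThreadExecutor(r -> new Thread(r, "so-loader"));

    /** Run a load task on the dedicated thread and wait for it to finish. */
    static void load(Runnable loadSo) {
        try {
            SO_THREAD.submit(loadSo).get();   // serial by construction
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

Because the executor has exactly one thread, loads submitted from any number of threads can never contend on the Runtime lock at the same time.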

Loading a .so triggers the native layer's JNI_OnLoad method, where some libraries perform initialization work. Therefore, we cannot simply call System.loadLibrary() ourselves to load these libraries, or problems could arise from repeated initialization.

We finally adopted the class-loading approach: move all of these loadLibrary calls into the static blocks of related classes, then trigger the loading of those classes. The class-loading mechanism guarantees that each .so load runs only once, and the order in which these classes are loaded is arranged to match the order in which the .so libraries are used.
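The reason the static block is a safe "load once" primitive is that the JVM runs a class's static initialization exactly once, no matter how many threads touch the class. NetSoHolder below is an illustrative stand-in; in the real app its static block would call System.loadLibrary("net"), and the counter merely makes the once-only guarantee observable.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Touching this class triggers its <clinit> exactly once.
class NetSoHolder {
    static final AtomicInteger LOADS = new AtomicInteger();
    static {
        // System.loadLibrary("net") would go here in the real app.
        LOADS.incrementAndGet();
    }
    /** Calling any static member forces class initialization. */
    static void ensureLoaded() { }
}
```

One such holder class per .so, triggered in usage order on the dedicated so-loading thread, gives both ordering and idempotency for free.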

In addition, we do not recommend running the .so loading task concurrently with other IO-heavy tasks. Measurements in Dewu App show a huge difference in cost between the two arrangements.

4. Startup framework optimization

A common startup-framework design assigns the startup work to a set of task nodes and builds a directed acyclic graph from their dependencies. As the business iterates, however, some historical dependencies no longer need to exist, yet they still slow down overall startup.

Most startup work initializes basic SDKs, which often have complex interdependencies. When optimizing startup to compress main-thread time, we usually find the main thread's expensive tasks and move them to child threads, but in an Application phase with complex dependencies, simply making a task asynchronous may not deliver the expected gain.


After finishing the WebView optimization, we found that the startup time did not drop by the full WebView initialization cost as expected, but by only about half of it. Analysis showed that a main-thread task depended on a child-thread task, so whenever the child-thread task had not finished, the main thread slept and waited.

Moreover, WebView was initialized at that point in time not because of dependency constraints, but because the main thread happened to have a long sleep there that could be exploited. The asynchronous workload was much larger than the main thread's: even with seven child threads running concurrently, the asynchronous tasks took longer than the main-thread ones.

Therefore, to expand the gains further, the task dependencies in the startup framework themselves must be optimized.

(Figure: startup-phase task DAG before optimization)

(Figure: startup-phase task DAG after optimization)

The first graph above is the pre-optimization DAG of Dewu App's startup tasks; red boxes mark tasks executed on the main thread. We focus on the tasks that block main-thread execution.

Along the dependency chains of the main-thread tasks, there are tasks with particularly many outgoing and incoming edges. Many outgoing edges mean the task is usually a very important basic library (such as the network library in the figure); many incoming edges mean it has too many prerequisites, so the time at which it starts fluctuates widely. Together, these mean the task's finish time is very unstable and directly affects subsequent main-thread tasks.

The main ideas for optimizing this type of tasks are:

  • Split the task itself and carve out operations that can run earlier or later. Before moving them, consider whether the target time window has spare CPU time slices and whether the move would worsen IO contention;
  • Optimize the task's predecessor tasks so that it finishes as early as possible, reducing how long its successors wait for it;
  • Remove unnecessary dependencies. For example, initializing the tracking library only requires registering a listener with the network library, not issuing a network request. (recommended)
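The payoff of removing a stale dependency is easy to see in a minimal DAG scheduler sketch: each task waits only for its declared predecessors, so dropping an edge immediately lets the task start earlier. The names below are illustrative, not Dewu's real startup framework.

```java
import java.util.*;
import java.util.concurrent.*;

// Minimal DAG startup scheduler: tasks are registered in topological order
// and each one runs as soon as all of its declared dependencies complete.
class StartupGraph {
    private final Map<String, CompletableFuture<Void>> done = new HashMap<>();
    private final ExecutorService pool = Executors.newFixedThreadPool(4);

    /** Register a task that runs after all named dependencies finish. */
    void task(String name, Runnable body, String... deps) {
        CompletableFuture<Void> all = CompletableFuture.allOf(
                Arrays.stream(deps).map(done::get).toArray(CompletableFuture[]::new));
        done.put(name, all.thenRunAsync(body, pool));
    }

    void await(String name) { done.get(name).join(); }
    void shutdown() { pool.shutdown(); }
}
```

A task with no deps starts immediately; a task whose only edge is "register a listener with the network library" can depend on a cheap registration task instead of the full network-library init.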

In the second DAG, after optimization, the depth of the task dependencies is clearly reduced, and tasks with extremely many incoming or outgoing edges essentially no longer appear.

(Figure: startup trace before optimization)

(Figure: startup trace after optimization)

Comparing the traces before and after optimization also shows clearly higher task concurrency on the child threads. But higher concurrency is not always better: on low-end devices where CPU time slices are already scarce, higher concurrency can perform worse, because lock contention and IO waits become more likely. It is necessary to leave some headroom and do thorough performance testing on mid- and low-end devices before release, or to use a different task arrangement for those devices.

3. Homepage optimization

1. Layout inflate optimization

The system parses a layout by reading the layout XML file via the inflate method and building the view tree from it. This involves IO and is easily affected by device state. We therefore parse the layout files at compile time with apt and generate corresponding view-construction classes, then execute those classes' methods asynchronously in advance at runtime to build and assemble the view tree, which directly cuts the page's inflate cost.


2. Message scheduling optimization

During startup we usually register some ActivityLifecycleListeners to observe page lifecycles, or post delayed tasks to the main thread. Time-consuming operations in those tasks slow down startup, so we hook the main thread's MessageQueue and move the messages for page lifecycle callbacks and page drawing to the head of the queue, speeding up the display of the homepage's first frame.
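The reordering itself can be modeled without Android at all. On a device this is done by hooking android.os.MessageQueue via reflection; here a plain list of message labels stands in for the queue, and the class and label names are illustrative.

```java
import java.util.*;

// Conceptual model: move launch-critical messages to the head of the queue,
// preserving the relative order within each class of messages.
class MessageReorder {
    static List<String> prioritize(List<String> queue, Set<String> critical) {
        List<String> head = new ArrayList<>();
        List<String> rest = new ArrayList<>();
        for (String msg : queue) {
            if (critical.contains(msg)) head.add(msg);  // lifecycle / draw messages
            else rest.add(msg);                          // everything else waits
        }
        head.addAll(rest);
        return head;
    }
}
```

The key property is stability: critical messages keep their mutual order (a lifecycle callback still precedes the draw that depends on it), and the deferred messages still run, just later.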


Details will be covered in follow-up articles in this series.

4. Stability

Performance optimization is only icing on the cake for an App; stability is the red line. The changes here run very early in the Application phase, so the stability risk is high and crash protection must be prepared in advance. Even when some stability issues are unavoidable, their negative impact must be minimized.

1. Crash protection

Since the tasks in the startup phase are all initializations of important basic libraries, catching and swallowing exceptions when a crash occurs is of little use: it would most likely just lead to later crashes or broken features. Our protection work therefore focuses on prevention beforehand and stopping the bleeding afterwards.

A configuration-center SDK is usually designed to read the cached configuration from a local file first and refresh it after the network request succeeds. If a crash occurs during startup after a configuration has been applied, the new configuration can never be pulled. In that case the user can only clear the App's cache or uninstall and reinstall, a very serious loss.

Crash fallback

We add try-catch protection to every change. After catching an exception, we report a tracking event and write a crash flag into MMKV, so that the device no longer enables the startup-optimization changes in the current version, and then rethrow the original exception so that it still crashes as before. For native crashes, the same can be done in the native-crash callback of the crash-monitoring SDK.
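The "catch, mark, rethrow" pattern can be sketched as follows. A plain Map stands in for MMKV, the key name is made up, and the tracking-event upload is left as a comment; none of these are the real SDK APIs.

```java
import java.util.*;

// Wrap every risky startup change; on failure, disable optimizations for this
// version and let the original exception surface unchanged.
class StartupGuard {
    static final Map<String, Boolean> mmkv = new HashMap<>(); // stand-in for MMKV

    static boolean optimizationsEnabled(String version) {
        return !mmkv.getOrDefault("startup_crash_" + version, false);
    }

    static void runProtected(String version, Runnable change) {
        try {
            change.run();
        } catch (RuntimeException e) {
            mmkv.put("startup_crash_" + version, true); // next launch runs unoptimized
            // reportTrackingEvent(e);  // upload a tracking event here
            throw e;                     // rethrow so the crash still surfaces
        }
    }
}
```

Rethrowing is deliberate: swallowing the exception would only defer the failure to a less diagnosable place, while the flag guarantees the next launch takes the safe path.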


Running status detection

We can catch Java crashes by registering an UncaughtExceptionHandler, but native crashes require the crash-monitoring SDK, which may not yet be initialized at the earliest points of startup. The WebView provider preloading and the .so preloading, for example, both run earlier than crash monitoring, and both touch native-layer code.

To cover this window, we write an MMKV flag at the very start of the Application and flip it to another state at the end, so that code running earlier than the configuration center can read the flag and tell whether the previous run completed normally. If an unknown crash happened during the last startup (such as a native crash before crash monitoring was initialized), the flag lets us switch off the startup-optimization changes in time.
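A sketch of the run-status flag: STARTED is written when the Application begins and FINISHED when startup completes, so finding STARTED on the next launch means the previous run died mid-startup. The Map stands in for MMKV and the key and state names are illustrative.

```java
import java.util.*;

// Detects a crash that happened before any crash-monitoring SDK was ready.
class RunStatus {
    static final Map<String, String> mmkv = new HashMap<>(); // stand-in for MMKV

    /** Returns true if the previous startup completed; also marks this run as started. */
    static boolean previousRunHealthy() {
        boolean healthy = !"STARTED".equals(mmkv.get("run_status"));
        mmkv.put("run_status", "STARTED");
        return healthy;
    }

    static void markStartupFinished() {
        mmkv.put("run_status", "FINISHED");
    }
}
```

Code that runs before crash monitoring checks previousRunHealthy() and falls back to the unoptimized path when it returns false.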

Combined with the automatic restart after a crash, the crash itself is effectively invisible from the user's perspective; it just feels like a startup that takes about 1-2 times as long as usual.


Configuration validity period

Online technical changes are usually ramped up gradually with a sampling rate combined with a random number. The configuration SDK, however, defaults to using the last locally cached value; this lets a faulty configuration be rolled back quickly when an online crash or other failure occurs, but the cache also means affected users still experience at least one crash.

We therefore attach a matching expiration timestamp to each switch configuration and let the ramp-up switch take effect only before that timestamp. This guarantees the bleeding can be stopped in time during incidents such as online crashes, and the timestamp design also avoids crashes caused by the lag before an online configuration takes effect.
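An expiring rollout switch is only a few lines. The field names below are illustrative assumptions, not the real configuration-center schema:

```java
// A sampled switch that silently turns itself off once its expiry passes,
// so a stale cached "on" value cannot keep crashing users.
class RolloutSwitch {
    final boolean enabled;        // result of the sampling-rate draw
    final long expiresAtMillis;   // shipped together with the config

    RolloutSwitch(boolean enabled, long expiresAtMillis) {
        this.enabled = enabled;
        this.expiresAtMillis = expiresAtMillis;
    }

    /** Active only while both the sample flag is on and the config is fresh. */
    boolean isActive(long nowMillis) {
        return enabled && nowMillis < expiresAtMillis;
    }
}
```

If the backend stops renewing the timestamp during an incident, every cached copy expires on its own without waiting for a config refresh to reach the device.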


From the user's perspective, comparison before and after adding the configuration validity period:

(Figure: user-perceived comparison before and after adding the configuration validity period)

5. Summary

We have now walked through the common cold-start cost cases in Android apps. However, the biggest pain point in startup optimization is often the App's own business code. Tasks must be scheduled according to real business needs; blindly relying on preloading, delayed loading, and asynchronous loading cannot fundamentally solve the cost problem, because the time does not disappear, it is merely moved, and what follows is startup degradation on low-end devices or broken features.

Performance optimization requires not only the user's perspective but also a global view. If every expensive task is pushed past the first frame just because the startup metric ends at the homepage's first frame, the user's subsequent experience will inevitably suffer jank or even ANR. So when splitting tasks, consider not only whether they compete for resources with concurrent tasks, but also whether functional stability and performance are affected at every stage of startup and for a period afterwards. Verify everything, and at minimum ensure there is no performance regression.

1. Degradation prevention

Startup optimization is by no means a one-off task; it needs long-term maintenance and polishing. A single technical change to a basic library can send the metrics back to square one overnight, so degradation prevention must be put in place as early as possible.

By adding tracking at key points, when an online metric degrades we can quickly locate the approximate position of the degraded code (such as xxActivity's onCreate) and raise an alert. This not only helps R&D locate problems quickly, but also covers online-only degradations that cannot be reproduced offline: the cost of a single startup can fluctuate by up to 20%, so jumping straight into trace analysis may make even the rough location of the regression hard to pin down.

For example, when comparing the traces of two startups, one file-read operation may be visibly slower purely because of IO blocking, while the other run's IO was normal. This misleads developers into analyzing code that is actually fine, while the truly degraded code may simply be masked by the fluctuation.

2. Outlook

For the ordinary scenario of launching by tapping the icon, full initialization runs in the Application by default. But for deeper features such as the customer-service center or editing a shipping address, even a user heading straight there as fast as possible needs at least a second of interaction to arrive, so the initialization work for those features can be postponed past the Application phase, or even made lazy, depending on how important each feature is.

Launch scenarios for recall and re-engagement via deep links and push usually account for a small share of traffic, but their business value is far greater than ordinary launches. Since their startup cost currently comes mainly from WebView initialization and homepage preloading tasks, if the landing page does not require all the basic libraries (an H5 page, for example), we can delay loading everything it does not need. This greatly speeds up the launch and makes a genuine sub-second start possible.

*Text/Jordas

This article is original to Dewu Technology. For more exciting articles, please see: Dewu Technology official website

Reprinting without the permission of Dewu Technology is strictly prohibited, otherwise legal liability will be pursued according to law!
