How to safeguard basic performance: iOS app startup speed

1 Introduction

Launch is the first impression an app gives its users. Launch speed is not only a matter of user experience; it often determines whether the app can acquire more users. At a certain stage, startup optimization therefore becomes a must. App startup falls into two types:

1.1 Cold start

Before the user taps the app icon, the app has no process in the system, so the system must create a new process for it. This is the complete startup process.

Typical cases: the app is launched for the first time, launched after a restart, launched after an update, and so on.

1.2 Warm start

After a cold start, the user sends the app to the background. While the app's process is still alive in the system, the user returns to the app. Very little work is done in this case.

So we mainly discuss cold-start optimization.

2 Startup process

2.1 What happens when the app starts?

To optimize startup speed, we need to know what the startup process looks like, what work is done in it, and which parts can be optimized in a targeted way.
The following figure is a detailed breakdown of the startup process:

1. The user taps the icon and the system creates a process
2. mmap the main binary and find the path to dyld
3. mmap dyld and set the entry address to _dyld_start

dyld is the helper program for startup. It runs in-process: at startup it is loaded into the process's address space, and the rest of the startup is handed over to it. There are two main versions of dyld: dyld2 and dyld3.

Up to iOS 12, third-party apps mainly used dyld2. Since iOS 13, Apple has enabled dyld3 for third-party apps. The most important feature of dyld3 is the launch closure. The closure is stored in the tmp/com.apple.dyld directory of the sandbox, so remember not to delete this directory when cleaning caches.

The closure mainly contains the following contents:

dependents: the list of dynamic libraries the image depends on
fixup: the addresses to bind and rebase
initializer-order: the order in which initializers are called
optimizeObjc: Objective-C metadata
others: main entry, uuid, etc.

The part above the dotted line in the figure runs out-of-process. It is executed when the app is downloaded and installed or updated to a new version; at launch the data is read directly from the cache, which speeds up loading.

This information is needed on every startup. Storing it in a cache file avoids parsing it every time. Parsing Objective-C runtime data (Class/Method…) is particularly time-consuming, so the closure is a real startup-speed optimization.

4. mmap the dynamic libraries that have not been loaded yet. The number of dynamic libraries affects this stage.

dyld reads the list of dependent dynamic libraries from the header of the main executable and then has to locate each dylib. Those dylibs may in turn depend on other dylibs, so what gets loaded is a recursively resolved set of dynamic libraries.

5. Loop over the set of dynamic libraries, mmap each into virtual memory, and perform fixups (rebase and bind) for each Mach-O.

Rebase and bind are done for every binary. The time is mainly spent on Page In, and the number of Page Ins is driven by the objc metadata.

Rebase adjusts pointers that point inside the image. In the past, a dynamic library was loaded at a fixed address, so all pointers and data in the code were already correct. With address space layout randomization (ASLR), the image is now loaded at its original address plus a random offset: when a Mach-O is mmapped into virtual memory, its start address is shifted by a random slide, and every internal pointer must have this slide added to it.
Bind fixes up pointers that point outside the image. These pointers are bound by symbol name, so dyld has to search the symbol table to find the implementation behind each symbol. For an external function such as printf, the address is only known at runtime, and bind points the pointer at that address. This is also the mechanism that lets us hook dynamic symbols with fishhook later on.

As shown in the figure below, at compile time the string 1234 sits at 0x10 in __cstring, so the pointer in the DATA segment holds 0x10. After mmap there is a slide of 0x1000, so the string's runtime address is 0x1010 and the pointer in the DATA segment is now wrong. Rebase adds the slide to the pointer, turning 0x10 into 0x1010. Similarly, the address of a class object is only known at runtime, and bind points isa at that actual memory address.
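
The rebase and bind arithmetic described above can be modeled in a few lines of plain C. This is only an illustrative sketch, not dyld's implementation: the function names rebase_pointers and bind_pointer are hypothetical, and symbol lookup is assumed to have happened already.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Rebase: internal pointers were written at link time as if the image were
   loaded at its preferred address, so the ASLR slide is added to each one. */
void rebase_pointers(uintptr_t *pointers, size_t count, uintptr_t slide) {
    for (size_t i = 0; i < count; ++i) {
        pointers[i] += slide;
    }
}

/* Bind: a pointer slot that refers to an external symbol is filled with the
   address found by looking the symbol up by name. The lookup itself is not
   modeled here; the caller passes in the resolved address. */
void bind_pointer(uintptr_t *slot, uintptr_t resolved_address) {
    assert(slot != NULL);
    *slot = resolved_address;
}
```

With a slide of 0x1000, a pointer that held 0x10 at link time becomes 0x1010 after rebase_pointers, which matches the string example in the text.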

6. Initialize the objc runtime. Since most of this work has already been done in the closure, only selector registration and category loading happen here.

7. +load methods and static initializers are called. Besides the time taken by the methods themselves, this step can also cause a large number of Page Ins. Work submitted with dispatch_async here is deferred until the runloop is running after startup, and static initialization that is only triggered later is deferred until it executes at runtime.

8. Initialize UIApplication and start the main runloop. The runloop can be used to measure the first-screen time (described later in this article), and warm-up tasks can also be scheduled after startup.

9. Execute will/didFinishLaunch. This is mainly time-consuming business code. The home page's business code runs at this stage, that is, before the first screen is rendered. It mainly includes: reading and writing the configuration files needed for first-screen initialization; reading the large data sets of the first-screen list; heavy computation for first-screen rendering; SDK initialization; and, for large componentized projects, the startup tasks of many modules.

10. Layout: viewDidLoad and layoutSubviews are called here. Too much Auto Layout increases the time spent in this step.

11. Display: drawRect: is called.

12. Prepare: image decoding happens in this step.

13. Commit: the render data for the first frame is packaged and sent to the RenderServer, the GPU rendering pipeline takes over, and startup is complete.

(Tip: steps 10-13 are the part of the graphics rendering pipeline in which the application generates primitives (the CPU stage). The rest is handed over to the separate RenderServer process, which calls the rendering framework (Metal/OpenGL ES) to produce a bitmap and place it in the frame buffer; the hardware reads the frame buffer on each clock signal and refreshes the screen.)

2.2 Timing each startup stage

The process detailed in the previous section can be roughly divided into six stages (WWDC 2019): System Interface, Runtime Init, UIKit Init, Application Init, Initial Frame Render, and Extended.

Measure the duration of each stage, optimize, and then compare.

You can set the environment variables DYLD_PRINT_STATISTICS and DYLD_PRINT_STATISTICS_DETAILS in Xcode to see the startup phases and how long each one takes (these environment variables no longer take effect from iOS 15 onward).

You can also view the launch time in Xcode itself, which is backed by MetricKit: open Xcode -> Window -> Organizer -> Launch Time.

Ideally the company has a mature monitoring system for this. Here we mainly use manual, non-invasive instrumentation points to measure startup time, and analyze the startup process in two parts: pre-main and after-main.

2.2.1 Recording the process creation time

Get the timestamp of process creation through the sysctl system call:

#import <Foundation/Foundation.h>
#import <sys/sysctl.h>
#import <mach/mach.h>


// Fills procInfo with the kernel's record for the given pid.
+ (BOOL)processInfoForPID:(int)pid procInfo:(struct kinfo_proc*)procInfo
{
    int cmd[4] = {CTL_KERN, KERN_PROC, KERN_PROC_PID, pid};
    size_t size = sizeof(*procInfo);
    return sysctl(cmd, sizeof(cmd)/sizeof(*cmd), procInfo, &size, NULL, 0) == 0;
}


// Returns the process creation time in milliseconds since 1970.
+ (NSTimeInterval)processStartTime
{
    struct kinfo_proc kProcInfo;
    if ([self processInfoForPID:[[NSProcessInfo processInfo] processIdentifier] procInfo:&kProcInfo]) {
        return kProcInfo.kp_proc.p_un.__p_starttime.tv_sec * 1000.0 + kProcInfo.kp_proc.p_un.__p_starttime.tv_usec / 1000.0;
    } else {
        NSAssert(NO, @"Unable to get the process info");
        return 0;
    }
}

2.2.2 Recording the time just before main()

// Called before main()
// End of the pre-main() stage: __t2 (seconds since 1970)
static NSTimeInterval __t2 = 0;

static void __attribute__ ((constructor)) before_main(void)
{
  if (__t2 == 0)
  {
    __t2 = CFAbsoluteTimeGetCurrent() + kCFAbsoluteTimeIntervalSince1970;
  }
}
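
Note that the two timestamps above use different units: processStartTime returns milliseconds, while __t2 is in seconds. A minimal sketch of combining them into a pre-main duration could look like this. The helper names timeval_to_ms and pre_main_ms are hypothetical and not part of the original project.

```c
#include <assert.h>
#include <sys/time.h>

/* Converts a struct timeval (the type of kinfo_proc's start time) to
   milliseconds since the Unix epoch. */
double timeval_to_ms(struct timeval tv) {
    assert(tv.tv_usec >= 0 && tv.tv_usec < 1000000);
    return (double)tv.tv_sec * 1000.0 + (double)tv.tv_usec / 1000.0;
}

/* pre-main duration in milliseconds: the constructor timestamp (seconds
   since 1970) is converted to milliseconds before the process creation
   time is subtracted from it. */
double pre_main_ms(struct timeval process_start, double constructor_time_s) {
    return constructor_time_s * 1000.0 - timeval_to_ms(process_start);
}
```

For example, a process created at 1700000000.25 s whose constructor ran at 1700000001.5 s spent 1250 ms in the pre-main stage.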

2.2.3 Recording the first-screen render time

The end point of startup is the first frame the user perceives once the launch image disappears.

iOS 12 and below: viewDidAppear of the root view controller

iOS 13+: applicationDidBecomeActive

Apple's official measurement uses the first CA::Transaction::commit, but that is implemented inside the system frameworks. We can, however, find the closest observable time points.

By analyzing and debugging the runloop source code, we found that CFRunLoopPerformBlock and kCFRunLoopBeforeTimers are the time points closest to CA::Transaction::commit(), so we can place our instrumentation there.

Specifically, register a block or a BeforeTimers observer with the runloop in didFinishLaunch to get callbacks at these two time points. The code is as follows:

// Register a block on the main runloop
CFRunLoopRef mainRunloop = [[NSRunLoop mainRunLoop] getCFRunLoop];
CFRunLoopPerformBlock(mainRunloop, kCFRunLoopDefaultMode, ^(){
    NSTimeInterval stamp = [[NSDate date] timeIntervalSince1970];
    NSLog(@"runloop block launch end:%f",stamp);
});

Observer for BeforeTimers:

// Register a kCFRunLoopBeforeTimers callback
CFRunLoopRef mainRunloop = [[NSRunLoop mainRunLoop] getCFRunLoop];
CFRunLoopActivity activities = kCFRunLoopBeforeTimers;
CFRunLoopObserverRef observer = CFRunLoopObserverCreateWithHandler(kCFAllocatorDefault, activities, YES, 0, ^(CFRunLoopObserverRef observer, CFRunLoopActivity activity) {
    if (activity == kCFRunLoopBeforeTimers) {
        NSTimeInterval stamp = [[NSDate date] timeIntervalSince1970];
        NSLog(@"runloop beforetimers launch end:%f",stamp);
        // Only the first callback is needed, so remove the observer
        CFRunLoopRemoveObserver(mainRunloop, observer, kCFRunLoopCommonModes);
    }
});
CFRunLoopAddObserver(mainRunloop, observer, kCFRunLoopCommonModes);
// The runloop retains the observer, so release our own reference
CFRelease(observer);

In summary, the average startup time of the existing project version is as follows:

[Function name: +[LaunchTrace mark]_block_invoke][Line number: 54] App start, time-consuming: pre-main: 4.147820
[Function name: +[LaunchTrace mark]_block_invoke][Line number: 55] App start, time-consuming: didfinish: 0.654687
[Function name: +[LaunchTrace mark]_block_invoke][Line number: 56] App start, time-consuming: total: 4.802507

3 Startup optimization

The previous section analyzed the app startup process and how to time it. What follows are the directions in which we optimize. Each stage should be optimized as far as is reasonable, but without over-optimizing. The problems differ from stage to stage and with the scale of the project, so analysis and optimization must be targeted.

3.1 Pre Main Optimization

3.1.1 Adjust dynamic library

A review of the existing project showed that its libraries are basically all linked as dynamic libraries, 48 in total. The approach is as follows:

Reduce dynamic libraries: convert our own dynamic libraries to static libraries
The existing libraries are managed by CocoaPods, so we hook the pod build process and modify the Xcode config to change the Mach-O type of some pods to Static Library.
At the same time, evaluate the ROI of dynamic libraries with a large code size and check whether their logic can be replaced by our own code, so that low-ROI dynamic libraries can be removed.
Merge dynamic libraries
The dynamic libraries currently used by the project are fairly simple and there is nothing to merge. Medium and large projects, however, often have many in-house infrastructure and UI libraries that are too fragmented; these should be aggregated wherever possible. For example, XXTableView, XXHUD, and XXLabel can be merged into a single XXUIKit, and utility libraries can likewise be combined according to the actual situation.
Lazy-load dynamic libraries
At the project's current scale, lazy loading of dynamic libraries is not necessary. Optimization has to weigh the benefit, so this is offered only as a reference.
Normally a dynamic library is linked, directly or indirectly, by the main binary and is therefore loaded at startup. If a library is only packaged into the app and not linked, it is not loaded automatically at startup; when its content is needed at runtime, it is loaded lazily by hand.
At runtime the library is loaded with -[NSBundle load], which essentially calls dlopen underneath.

3.1.2 rebase & binding & objc setup stage

Loading unrelated Class and Method symbols also adds startup time, so we need to reduce the number of pointers in the __DATA segment. Analysis of the project code found many similar categories, some containing only a single method, so categories were merged according to the project's situation.

Besides the time taken by the method itself, +load also causes a large number of Page Ins. +load also hurts the app's stability, because a crash there cannot be captured.
Many +load methods in the project follow similar logic. After case-by-case analysis, many of them can be moved into the launcher for governance and management and executed when the runloop is idle,
or deferred until after the first screen.

The other kind of +load logic is registration: one of the many componentized decoupling schemes binds a protocol to a class in +load. This part can be moved to compile time with a clang attribute:

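typedef struct {
    const char *cls;
    const char *protocol;
} _di_pair;

// _DI_SEGMENT, _DI_SECTION, _DI_UNIQUE_VAR, _DI_VALID_METHOD and _TO_STRING
// are helper macros defined elsewhere in the project (not shown here).
#if DEBUG
// Debug builds also emit a function that lets the compiler check that
// CLASS_NAME conforms to PROTOCOL_NAME.
#define DI_SERVICE(PROTOCOL_NAME, CLASS_NAME) \
__used static Class<PROTOCOL_NAME> _DI_VALID_METHOD(void) { \
    return [CLASS_NAME class]; \
} \
__attribute((used, section(_DI_SEGMENT "," _DI_SECTION))) static _di_pair _DI_UNIQUE_VAR = \
{ \
    _TO_STRING(CLASS_NAME), \
    _TO_STRING(PROTOCOL_NAME), \
};
#else
#define DI_SERVICE(PROTOCOL_NAME, CLASS_NAME) \
__attribute((used, section(_DI_SEGMENT "," _DI_SECTION))) static _di_pair _DI_UNIQUE_VAR = \
{ \
    _TO_STRING(CLASS_NAME), \
    _TO_STRING(PROTOCOL_NAME), \
};
#endif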

The principle is simple: the macro provides the interface, the class name and protocol name are written into the specified segment of the binary during compilation, and the relationship is read out at runtime to know which class the protocol is bound to.
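
The read side of this scheme can be sketched in plain C. This is a minimal illustration under stated assumptions, not the project's implementation: the section name di_pairs, the macro DI_REGISTER, the Demo* names, and the functions di_pair_count and di_class_for_protocol are all hypothetical. It relies on linker-provided section boundary symbols (section$start$/section$end$ on Apple's linker, __start_/__stop_ on ELF linkers); on iOS, getsectiondata is another way to locate the section. In the Objective-C version, the class name found here would then be resolved with NSClassFromString.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    const char *cls;
    const char *protocol;
} di_pair;

/* Section placement and boundary symbols differ between Mach-O and ELF. */
#ifdef __APPLE__
#define DI_ATTR __attribute__((used, section("__DATA,di_pairs")))
extern di_pair di_begin[] __asm__("section$start$__DATA$di_pairs");
extern di_pair di_end[]   __asm__("section$end$__DATA$di_pairs");
#else
#define DI_ATTR __attribute__((used, section("di_pairs")))
extern di_pair di_begin[] __asm__("__start_di_pairs");
extern di_pair di_end[]   __asm__("__stop_di_pairs");
#endif

#define DI_CONCAT2(a, b) a##b
#define DI_CONCAT(a, b) DI_CONCAT2(a, b)
/* Writes one class/protocol name pair into the section at compile time. */
#define DI_REGISTER(PROTOCOL, CLASS) \
    static di_pair DI_CONCAT(di_entry_, __LINE__) DI_ATTR = { #CLASS, #PROTOCOL }

/* Two hypothetical registrations; no code runs at startup for these. */
DI_REGISTER(DemoLoginService, DemoLoginServiceImpl);
DI_REGISTER(DemoPayService, DemoPayServiceImpl);

/* Number of pairs the linker collected into the section. */
size_t di_pair_count(void) {
    return (size_t)(di_end - di_begin);
}

/* Walks the section at runtime and returns the class name bound to the
   protocol, or NULL if nothing was registered for it. */
const char *di_class_for_protocol(const char *protocol) {
    assert(protocol != NULL);
    for (const di_pair *p = di_begin; p < di_end; ++p) {
        if (strcmp(p->protocol, protocol) == 0) {
            return p->cls;
        }
    }
    return NULL;
}
```

The lookup cost is paid only when a protocol is first requested, instead of in every +load at startup.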

Removing dead code

Dead code removal has basically the lowest ROI of all performance optimization methods. However, almost all high-ROI techniques are one-off optimizations, and after several version iterations there is little left for them to gain. In contrast, detecting and removing dead code offers plenty of room for optimization over a long period.

Detection method: statically scan the Mach-O file, take the difference between classlist and classrefs to form a preliminary set of unused classes, and then refine the set according to the characteristics of the business code.
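
The set difference at the heart of this detection can be sketched as follows. The sketch assumes the class names have already been extracted from the Mach-O file into two string arrays; the extraction itself is not shown, and the function names is_referenced and unused_classes are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if name appears in the list of referenced classes. */
static int is_referenced(const char *name,
                         const char *const *refs, size_t ref_count) {
    for (size_t i = 0; i < ref_count; ++i) {
        if (strcmp(name, refs[i]) == 0) {
            return 1;
        }
    }
    return 0;
}

/* Writes every class that is defined (classlist) but never referenced
   (classrefs) into out and returns how many were written.
   out must have room for list_count entries. */
size_t unused_classes(const char *const *classlist, size_t list_count,
                      const char *const *classrefs, size_t ref_count,
                      const char **out) {
    assert(classlist != NULL && out != NULL);
    size_t n = 0;
    for (size_t i = 0; i < list_count; ++i) {
        if (!is_referenced(classlist[i], classrefs, ref_count)) {
            out[n++] = classlist[i];
        }
    }
    return n;
}
```

The result is only a candidate list. Classes that are reached solely through subclassing or created dynamically by name would show up here as false positives, which is why the text calls for a second pass based on the business code.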

Other commonly used techniques include inspection with AppCode, IndexStoreDB-based tools such as Pecker, and online usage statistics.

However, the scheme above does not work for Swift, whose classes are stored differently from Objective-C. For Swift, refer to github.com/wuba/WBBlad…

Scanning the project showed that it still contains many unused classes:

These were then verified through a second round of analysis and cleaned up.

3.1.3 Binary rearrangement

In iOS, virtual memory is mapped to physical memory in units of pages. When a process accesses a virtual memory page whose physical memory is not present, a page fault occurs (shown as File Backed Page In in System Trace), and the operating system loads the data into physical memory. If the data is already in physical memory, the access is a Page Cache Hit instead. The latter is faster, which is one of the reasons a warm start is faster than a cold start.

Although handling a single page fault is fast, an app may trigger thousands of page faults (or even more) during startup, and the accumulated time becomes noticeable.

Based on this, our goal is to increase Page Cache Hits and reduce page faults at startup in order to shorten startup time. We need to determine which symbols are executed at startup and place them as close together in memory as possible, so that they occupy fewer pages and trigger fewer page faults.

By default, functions are laid out in the binary in link order, regardless of when they are called:

If the methods needed at startup (method1 and method3) sit in two different pages, Page1 and Page2, the system has to take two page faults to execute them.

If we rearrange the methods so that method1 and method3 are in the same page, there is one page fault fewer.

Use the System Trace tool in Instruments to see the current page fault situation.

One point to note: to make sure the launch is a true cold start, memory has to be cleared first, otherwise the result will be inaccurate. The picture below is the result I got after simply killing the app and reopening it.

It differs somewhat from the first test. After killing the app, we can open several other apps (as many as possible), or uninstall and reinstall the app, so that the next launch is a cold start.

In summary, what we need to do is place the function symbols called at startup together in order to reduce the number of page faults.
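
The effect of grouping startup symbols can be illustrated by counting how many distinct pages a set of function addresses touches. This is a simplified model that assumes a 16 KB page size and treats each function as a single address; the name pages_touched and the sample addresses are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE_BYTES 16384u  /* assumed 16 KB page size */

/* Counts the distinct pages touched by the given addresses.
   The addresses must be sorted in ascending order. */
size_t pages_touched(const uintptr_t *sorted_addrs, size_t count) {
    size_t pages = 0;
    uintptr_t last_page = 0;
    for (size_t i = 0; i < count; ++i) {
        assert(i == 0 || sorted_addrs[i] >= sorted_addrs[i - 1]);
        uintptr_t page = sorted_addrs[i] / PAGE_SIZE_BYTES;
        if (i == 0 || page != last_page) {
            ++pages;
            last_page = page;
        }
    }
    return pages;
}
```

With method1 at offset 0x0100 and method3 at 0x4100, two pages are touched; after reordering so that method3 sits at 0x0180, both fall into one page. Each page touched for the first time during a cold start costs one page fault in this model.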

Get the startup code execution order
To determine which functions (which symbols) are called when the app starts, we recommend the tool AppOrderFiles ( https://github.com/yulingtianxia/AppOrderFiles ). It uses Clang SanitizerCoverage, that is, compiler instrumentation, to record the order in which function symbols are called. (Alternatively, set Write Link Map File to YES in Build Settings; a Link Map symbol table txt file is then generated at build time, which we can analyze to create our own order file.) After the app starts, output the order file in the viewDidLoad method of the first-screen view controller.

The output file is in the app sandbox, so it is more convenient to run in the simulator. The resulting file, app.order, contains the list of symbols sorted by the app's execution order. For a relatively large project, generating it takes longer.

Put the order file in the project directory and configure it in Xcode: Build Settings -> Order File -> $(PROJECT_DIR)/xxx.order

Verification and comparison
Xcode's Build Settings has a Write Link Map File option, which generates a Link Map file. The paths are as follows:

Link Map file
Intermediates.noindex/xxxx.build/Debug-iphoneos/xxx.build/xxx-LinkMap-normal-arm64.txt
Generated app file
Products/Debug-iphoneos/xxx.app

Here we only look at the Symbols table of the Link Map file. Its order is the order of symbols in the Mach-O file. If it matches the order in xxx.order, the change has taken effect.

Run the System Trace tool again to compare before and after the change.

Comparing before and after the optimization, page faults are significantly reduced.

Because the called symbols are collected through Clang instrumentation, Objective-C methods, Swift methods, C functions, and blocks are all hooked directly, without distinction.

3.2 After Main Optimization

This part offers a lot of room for optimization. The actual work depends on analysis of each specific project, but it generally follows the same ideas.

3.2.1 Function/method optimization

Postpone and reduce I/O operations
Analysis of the project's after-main startup logic found no I/O operations, so nothing was optimized here.
Control the number of threads
The project uses only a few, necessary threads during startup, and their impact is small, so they were left alone. Each project should analyze and manage this according to its own situation.
Startup task governance
This mainly covers initialization tasks of infrastructure and third-party/group SDKs and the startup tasks of the various business components, including the +load logic mentioned earlier, which is moved here for scheduling and management.
We can maintain and monitor this part as a launcher to prevent regressions.
The launcher uses self-registration. A registration entry includes the startup closure, its execution priority, whether it runs in the background, and so on.
Self-registration is nothing more than a binding of the form "startup item: startup closure", so it can follow the same idea as the class-protocol binding above: write the entries into the DATA segment of the executable, then read them from the DATA segment at runtime and perform the corresponding operations (call the functions). This can also cover every startup stage, including the stage before main().
After analyzing the project, we lowered the priority of non-essential startup items such as keyboard initialization, map positioning, feedback, and non-home-page module initialization, and delayed their execution.
Serial -> parallel, synchronous -> asynchronous
Run time-consuming operations asynchronously and in parallel so that the main thread is not blocked.
Method time-consuming analysis
Measure how long the business code takes during startup, then analyze and fix the time-consuming methods.
High-frequency method calls
Some methods are not slow individually but become expensive when called frequently. A memory cache can help here, although each scenario needs its own analysis.
Use the splash screen time to pre-build the home page UI
The project shows a splash screen at startup, and on first launch a privacy dialog pops up; both leave a gap in which home-page preparation can be done.

Use this time to build the home page UI, pre-download the first-screen network data, warm caches, start the Flutter engine, and so on.
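
The self-registering launcher described in this section can be sketched as a small task table that is sorted by priority and run in two phases. This is a simplified model under stated assumptions, not the project's launcher: every name here (startup_task, launcher_register, launcher_run, and the demo tasks) is hypothetical, registration is done through a plain function call rather than the DATA segment, and background execution is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef void (*startup_fn)(void);

typedef struct {
    const char *name;
    int priority;           /* higher values run earlier */
    int after_first_frame;  /* nonzero: deferred until the first screen is shown */
    startup_fn run;
} startup_task;

#define MAX_TASKS 32
static startup_task g_tasks[MAX_TASKS];
static size_t g_task_count = 0;

/* Adds one task to the launcher's table. */
void launcher_register(startup_task task) {
    assert(g_task_count < MAX_TASKS);
    g_tasks[g_task_count++] = task;
}

/* qsort comparator: sorts tasks by descending priority. */
static int by_priority_desc(const void *a, const void *b) {
    const startup_task *x = a;
    const startup_task *y = b;
    return (y->priority > x->priority) - (y->priority < x->priority);
}

/* Runs the tasks of one phase in priority order and returns how many ran. */
size_t launcher_run(int after_first_frame) {
    qsort(g_tasks, g_task_count, sizeof g_tasks[0], by_priority_desc);
    size_t ran = 0;
    for (size_t i = 0; i < g_task_count; ++i) {
        if ((g_tasks[i].after_first_frame != 0) == (after_first_frame != 0)) {
            g_tasks[i].run();
            ++ran;
        }
    }
    return ran;
}

/* Demo tasks: each records one letter so the execution order is visible. */
char g_trace[16];
size_t g_trace_len = 0;
static void trace(char c) {
    if (g_trace_len < sizeof g_trace - 1) {
        g_trace[g_trace_len++] = c;
    }
}
void init_crash_reporter(void) { trace('C'); }
void init_network(void)        { trace('N'); }
void init_map_location(void)   { trace('M'); }
```

Calling launcher_run(0) in didFinishLaunch and launcher_run(1) once the first frame has been committed keeps non-essential items such as map positioning off the critical path, and the single table gives one place to monitor for newly added startup tasks.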

3.2.2 First Screen Rendering Optimization

Screen display follows a graphics rendering pipeline to complete the final display work:

1. Application phase (in-app):

Handle Events:

Touch events are processed first. During this step the layout and view hierarchy of the page may need to change.

Commit Transaction:

The app uses the CPU to pre-compute the content to display, such as layout calculation and image decoding, and then packages the computed layers and sends them to the Render Server. (Core Animation is responsible for this.)

Commit Transaction consists of four operations: Layout, Display, Prepare, and Commit. Together they form a transaction, which is submitted for rendering through CA::Transaction::commit().

Layout:

Covers view construction: layoutSubviews, adding subviews with addSubview, Auto Layout computing each view's frame from its layout constraints, text size calculation, and so on.

layoutSubviews is called at this stage, but only when certain conditions are met, such as a change to frame, bounds, or transform, adding or removing a view, or an explicit call to setNeedsLayout.

Display:

Draw the view: the view is handed to Core Graphics for drawing, which produces primitive data. Note that this is not bitmap data; the bitmap is composed from primitives in the GPU stage. However, if drawRect: is overridden, it calls Core Graphics drawing methods directly to produce bitmap data, and the system allocates an extra block of memory to hold the drawn bitmap temporarily. The drawing work thus moves from the GPU to the CPU, which costs some efficiency, and it uses additional CPU and memory. Drawing therefore has to be efficient, otherwise it easily causes CPU stalls or memory spikes.

Prepare:

Extra work done by Core Animation, mainly image decoding and conversion. Try to use formats the GPU supports; Apple recommends JPG and PNG.

For example, displaying an image in a UIImageView goes through loading, decoding, and rendering. Decoding simply means converting ordinary binary data (stored in a data buffer) into RGB data (stored in an image buffer). It has the following characteristics:

Decoding is time-consuming and is done on the CPU, that is, in the Prepare step.

The memory occupied by the decoded RGB image depends only on the bitmap's pixel format (RGB32, RGB24, Gray8 ...) and the image's width and height. The usual bitmap size is bytes per pixel * width * height, and it has nothing to do with the size of the original compressed PNG or JPG file.
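
The size of a decoded bitmap can be worked out with simple arithmetic. The sketch below assumes tightly packed rows; real bitmaps may pad each row for alignment, so treat the result as a lower bound. The function name decoded_image_bytes is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Memory needed for a decoded bitmap with tightly packed rows:
   width * height * bytes per pixel (4 for RGB32, 3 for RGB24, 1 for Gray8). */
size_t decoded_image_bytes(size_t width, size_t height, size_t bytes_per_pixel) {
    assert(bytes_per_pixel > 0);
    return width * height * bytes_per_pixel;
}
```

A 1920 x 1080 image in a 4-byte pixel format needs 8,294,400 bytes (about 7.9 MB) once decoded, no matter how small the compressed file on disk is. This is why decoding large first-screen images on the main thread is costly in both time and memory.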

2. GPU rendering stage:

This is mainly operations on primitives: geometry processing, rasterization, pixel processing, and so on. We will not go through them one by one, since there is little we can do in this stage.

The optimizations available to us are therefore roughly as follows:

Pre-rendering / asynchronous rendering:
The general idea is to draw all the views into a bitmap on a background thread and then, back on the main thread, assign it to the layer's contents.
Asynchronous image decoding:
Note that this does not mean creating a UIImage or CGImage on a background thread and then setting it on the UIImageView on the main thread. It means first drawing the image into a CGBitmapContext on a background thread and then creating the image directly from the bitmap. Commonly used image frameworks work in a similar way.
Load on demand:
Delay loading of views that the first screen does not need or that are complex, and reduce the depth of the first screen's layer hierarchy.
Other:
Avoid off-screen rendering, minimize the number of transparent views, and pay attention to other such details.

4 Results

After this series of optimizations, there is a measurable speed improvement. Although the project is not yet large, optimizing early and continuously prevents startup from becoming hard to fix as the business iterates.

iPhone 7 Plus, mean of multiple runs

Before optimization

[Function name: +[LaunchTrace mark]_block_invoke][Line number: 54] App start, time-consuming: pre-main: 4.147820
[Function name: +[LaunchTrace mark]_block_invoke][Line number: 55] App start, time-consuming: didfinish: 0.654687
[Function name: +[LaunchTrace mark]_block_invoke][Line number: 56] App start, time-consuming: total: 4.802507

After optimization

[Function name: +[LaunchTrace mark]_block_invoke][Line number: 54] App start, time-consuming: pre-main: 3.047820
[Function name: +[LaunchTrace mark]_block_invoke][Line number: 55] App start, time-consuming: didfinish: 0.254687
[Function name: +[LaunchTrace mark]_block_invoke][Line number: 56] App start, time-consuming: total: 3.302507

Based on the figures above, the pre-main stage drops by about 27%, the after-main stage by about 61%, and the total by about 31%.
These numbers come from a version that has not been released yet. After release, the monitoring platform will provide more online data across more device models for better analysis and further optimization.
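
The percentage reductions can be recomputed from the logged before and after values with a one-line formula. The helper name percent_drop is hypothetical.

```c
#include <assert.h>

/* Relative reduction from before to after, as a percentage of before. */
double percent_drop(double before, double after) {
    assert(before > 0.0);
    return (before - after) / before * 100.0;
}
```

Applied to the logged values, percent_drop(4.147820, 3.047820) is about 26.5 for pre-main, percent_drop(0.654687, 0.254687) is about 61.1 for didfinish, and percent_drop(4.802507, 3.302507) is about 31.2 for the total.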

5 Summary

Startup bottlenecks do not form overnight, and startup speed needs continuous optimization. Building and improving the monitoring system continuously is just as indispensable: daily analysis of online data keeps startup speed from degrading during rapid business iteration. New dynamic libraries, new +load methods and static initializers, and new startup tasks must all go through Code Review, and the optimized startup architecture will then safeguard this basic performance.

Author: JD Logistics Peng Xin

Source: JD Cloud developer community Ziqishuo Tech