JD Financial App Crash Governance Practice

I. Introduction

By early 2020 the JD Financial App's user base had far outgrown what it was a few years earlier, and daily active users had doubled. Over the same period we noticed the crash rate creeping up with each release. By the time we recognized that crashes were hurting users' day-to-day experience, the crash rate had reached a few per thousand. Crash rate is a key measure of app quality: it reflects stability and directly affects user experience and business growth. A crash during startup may lead users to uninstall the app outright, which in turn damages word of mouth and brand value. So while the app was growing rapidly, the team put more emphasis on quality.

The rising crash rate was closely tied to the rapid growth of the app's business. Increasingly complex business scenarios, logic coupled across multiple business lines, and the steady expansion of features all made the code more error-prone. Some legacy code had gradually been affected by repeated business iterations, and some errors surface only in rare scenarios, after a long time and with a large number of users, which delays fixes. Crashes that showed up during grayscale (staged) release but could not be tracked down were left under observation, then became prominent once the release reached the full user base. This slow accumulation eventually made the crash rate glaringly high in one particular release. The team therefore decided to address the existing crashes thoroughly and to find a way to keep the rate low afterwards.

Crash governance for the JD Financial App spanned several releases, during which nearly all of the top 20 crashes were fixed. The work was not always smooth. Some hard-to-reproduce problems were resolved only after several rounds of fixing and observing. While the existing problems were being fixed, ongoing business updates introduced new ones; the R&D team watched these closely and used the grayscale release stage to catch them early. In the end, the app's crash rate stabilized below 0.01% (one in 10,000).

According to the 2020 mobile industry performance experience report, the industry-average app crash rate is 0.29%: 0.32% on Android and 0.10% on iOS.

After sustained, high-quality fixes, the JD Financial App's crash rate is more than an order of magnitude below the industry average and has held steady at 0.007% for a long time.

The user crash rate data comes from the APM performance monitoring system.
The JD Financial App's crash rate is far better than the industry level, thanks to in-depth technical investigation by the R&D team. That investigation depends on a solid grasp of crash fundamentals. This article explains those fundamentals step by step and walks through how several typical crashes were solved.

II. Definition of a crash

1. Why crashes happen

A crash is the CPU's explicit response to an exception, and CPU exception handling is built on interrupts. On an interrupt, the CPU suspends the running program, saves its context, and executes the corresponding handler; when the handler finishes, the CPU returns to the point of interruption and resumes the interrupted program.

Operating-system references point out that the terms interrupt and exception mean different things on different CPU architectures.

  • On Intel architectures, interrupt handling entry points are defined by the interrupt descriptor table (IDT), which the operating system kernel sets up. The IDT holds 256 interrupt vectors, and the first 32 are reserved for processor-defined exceptions. In other words, interrupts include exceptions.
  • On ARM architectures, interrupt handling entry points live in the exception vector table, and 3 of the 8 exception vectors are interrupt-related. In other words, exceptions include interrupts.

However interrupts and exceptions are defined, when the CPU raises an exception it transfers control from the running program to an exception handler, and the handler runs at a privilege level no lower than before: the CPU switches to kernel mode and executes the corresponding handler. In the classic five-stage CPU pipeline, an instruction passes through [fetch, decode, execute, memory access, write back], and an exception can occur at any stage. On ARM, for example:

  • A "data abort" exception raised in the "execute" stage: if the address targeted by a data access instruction does not exist, or the current instruction is not permitted to access it, a data abort exception is generated.
  • A "prefetch abort" exception raised in the "fetch" stage: if the address of a prefetched instruction does not exist, or the current instruction is not permitted to access it, the memory system signals an abort to the processor, but the prefetch abort exception is generated only when the prefetched instruction is actually executed.

The handlers for both exceptions call, directly or indirectly, the Mach kernel function exception_triage() with EXC_BAD_ACCESS as an argument, and exception_triage() delivers the exception through the Mach message-passing mechanism.

2. How a crash happens on iOS

In the iOS kernel (Mach), exceptions are handled through the kernel's basic facility, message passing. An exception is little more than a message: the faulting thread or task raises it via msg_send(), and a handler catches it via msg_recv(). The handler can handle the exception, clear it, or decide to terminate the application.

From the app's point of view, when the app attempts something that is not allowed, either because the CPU cannot execute the code (accessing invalid memory, writing to a read-only region, and so on) or because it violates an operating-system policy (excessive memory use, an overly long launch, and so on), the operating system protects the user experience by terminating the app.

In some languages, certain objects deliberately stop the program when they encounter an error. In Objective-C and Swift, for example, an out-of-bounds access on NSArray or Array triggers a crash and halts the program.

III. Common types of crashes

1. Wild pointer

A wild pointer points to an indeterminate memory address, and accessing memory through it can have unpredictable results. If the memory at that address has not been overwritten, nothing may go wrong; if it has been overwritten or mapped as inaccessible, the program crashes immediately. When a crash is attributed to a wild pointer, the code at the crash site is probably not the root cause, and the real cause has to be found by analyzing the call chain.

In C, wild pointers typically arise when a pointer is declared without an initial value (and so holds a random address) or is not reset after the memory it points to is freed. In Objective-C, they typically arise in multithreaded code, when an object the current thread is accessing has been freed by another thread. The default value of an object pointer in Objective-C is nil, which, like NULL in C, means the pointer refers to no memory. Wild pointer errors usually show up as invalid variable or memory accesses, and the most common crash type is the EXC_BAD_ACCESS memory error.
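
The two C-level habits described above (initialize at declaration, reset after free) can be sketched as follows. This is only an illustration: JDFreeAndClear and JDWildPointerDemo are hypothetical names, not code from the app.

```objc
#import <Foundation/Foundation.h>
#include <stdlib.h>
#include <string.h>

// Hypothetical helper: frees the buffer and resets the caller's pointer,
// so a later access sees NULL instead of a dangling address.
static void JDFreeAndClear(char **buffer) {
    if (buffer == NULL || *buffer == NULL) {
        return;
    }
    free(*buffer);
    *buffer = NULL;
}

static BOOL JDWildPointerDemo(void) {
    char *name = NULL;        // initialized at declaration, never left holding a random address
    name = malloc(16);
    if (name == NULL) {
        return NO;
    }
    strcpy(name, "JD");
    JDFreeAndClear(&name);    // the pointer is cleared together with the free
    JDFreeAndClear(&name);    // a second call is now a harmless no-op, not a double free
    return name == NULL;
}
```

Resetting the pointer does not help other copies of the same address held elsewhere, which is why the multithreaded Objective-C cases later in this article need a different remedy.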

2. Deadlock

On iOS, calling dispatch_sync to run a task synchronously on the main queue from the main thread causes a deadlock. If complex business logic leads to such a call being made on the main thread (see the sample code below), the application crashes.

-(void)sceneAnalysis { 
  dispatch_sync(dispatch_get_main_queue(), ^{ 
    NSLog(@"Sync Task Result"); 
  }); 
  NSLog(@"Do Other Tasks"); 
}

The program stops at the first line of the function body with the following error: Thread 1: EXC_BAD_INSTRUCTION (code=EXC_I386_INVOP, subcode=0x0)

The sample above is the classic main-thread deadlock. dispatch_sync adds the block NSLog(@"Sync Task Result") to the main queue and blocks the calling thread, here the main thread, until that block has finished. But the main queue is a serial, first-in-first-out queue, and it is currently busy running sceneAnalysis, so the block cannot start until sceneAnalysis returns. sceneAnalysis waits for dispatch_sync, and the block submitted by dispatch_sync waits for sceneAnalysis: a deadlock.
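
One common way to avoid this particular deadlock is to check the current thread before dispatching synchronously. The helper below is a minimal sketch under that assumption; JDRunOnMainSync is a hypothetical name and is not taken from the app's code.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: run a block synchronously on the main queue without
// deadlocking when the caller is already on the main thread.
static void JDRunOnMainSync(dispatch_block_t block) {
    if (block == nil) {
        return;
    }
    if ([NSThread isMainThread]) {
        block();    // already on the main thread: call directly, no queueing
    } else {
        // From a background thread the main queue is not waiting on itself,
        // so a synchronous hop is possible.
        dispatch_sync(dispatch_get_main_queue(), block);
    }
}
```

With this helper, the body of sceneAnalysis could call JDRunOnMainSync(^{ NSLog(@"Sync Task Result"); }) from any thread. It does not protect against the main thread itself blocking on the background caller, so where the result is not needed immediately, dispatch_async remains the simpler choice.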

3. Watchdog

If the app takes too long to complete certain tasks (launching, terminating, or responding to system events), the operating system terminates the process. The clearest sign of a watchdog-triggered crash is the exception code 0x8badf00d in the crash log.

When app launch exceeds the maximum allowed time (generally 20 s), the iOS watchdog terminates the process immediately. Note that watchdog terminations are not captured by in-app crash monitoring; the crash log can only be retrieved from the device on which the crash occurred. Apple disables the watchdog in the simulator, and it is not triggered in Debug mode either. During development, therefore, keep the launch path lean and load resources on demand.

IV. Practical cases from financial app crash governance

1. Wild pointer problem caused by multithreading

The financial app uses a persistent (long-lived) connection to keep market index information up to date.

We monitored the feature throughout its grayscale release and saw no crashes. But as new versions rolled out and the number of active users kept growing, the APM performance monitoring platform began to report occasional crashes. The crash logs showed that the crash occurred inside the open-source MQTTClient library, and after talking to other business departments we learned that they had hit the same problem with MQTTClient.

Because this was a crash in production, we first assessed the risk on the APM performance monitoring platform from the crash counts and stack information and concluded that it was an intermittent problem that did not reproduce reliably. The R&D team then set out to locate the cause. By tracking the crash on the APM platform and collecting samples, we identified an operation that users commonly performed just before the crash: switching the app between foreground and background. This operation path was a strong lead.

The persistent connection in the JD Financial App uses the open MQTT protocol. The R&D team searched the open-source project's community for related issues and fixes. Similar problems had been reported, but the library had not been maintained for more than two years and no solution was available.

We therefore turned back to the business usage scenario and the MQTTClient source code. The crash occurs in the function below, which is the NSStream callback invoked while MQTT sends messages.

- (void)stream:(NSStream *)sender handleEvent:(NSStreamEvent)eventCode

In production, MQTTClient connects while the app is in the foreground and actively disconnects when the app moves to the background. The R&D team therefore tried to reproduce the problem by switching between foreground and background, simulating the transitions in code at high frequency, and eventually reproduced it in Debug mode.

The crash occurred on MQTT's internal thread. When the app moved to the background and disconnected, the MQTTCFSocketEncoder object was released on an external thread (the thread that created it). Because the release did not happen on the queue that processes the stream, the internal thread was not aware that the object had gone away and kept accessing its address, which produced the wild pointer crash.

Fix: retain self inside the callback. This increments the reference count of the object's heap memory and prevents the memory from being reclaimed while the callback is running. After the change, the scenario was simulated again with high-frequency switching and the crash no longer reproduced, so the problem was considered solved.

- (void)stream:(NSStream *)sender handleEvent:(NSStreamEvent)eventCode { 
  MQTTCFSocketDecoder *strongDecoder = self; 
  (void)strongDecoder; 
  // other code ...
}

After the fix, the financial app and the other business teams picked it up together through the team's managed updates, and a PR was submitted to the community. The R&D team will keep watching for problems in the persistent connection going forward.

Another common source of wild pointers is the notification center, NSNotificationCenter. When an observer registers, the notification center stores the observer's memory address but does not increment its reference count (unsafe_unretained). After the observer is released, that address may be reused. When the notification center later posts a notification, it still sends the message to the stored address; the object now at that address is not the original observer and cannot handle the message, so the program crashes.

Because wild pointer crashes of this kind were so common, Apple changed the notification center in iOS 9: from iOS 9 onward, an observer is removed automatically when it is deallocated, provided it is deallocated normally. Similarly, the delegate and dataSource of a table view were declared unsafe_unretained before iOS 9 and have been declared weak since iOS 9 to guard against wild pointers (a weak pointer is set to nil automatically when its object is released). If the app's minimum supported version is 8.0, wild pointers through delegates need extra attention.
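
For apps that still support systems older than iOS 9, the usual defensive pattern is to remove the observer explicitly in dealloc. The sketch below illustrates it; the class JDMarketObserver and the notification name are hypothetical and exist only for this example.

```objc
#import <Foundation/Foundation.h>

// Hypothetical notification name used only for this sketch.
static NSString * const JDMarketDidUpdateNotification = @"JDMarketDidUpdateNotification";

@interface JDMarketObserver : NSObject
@property (nonatomic, assign) NSInteger receivedCount;
@end

@implementation JDMarketObserver

- (instancetype)init {
    self = [super init];
    if (self) {
        // The notification center records this object without retaining it.
        [[NSNotificationCenter defaultCenter] addObserver:self
                                                 selector:@selector(handleMarketUpdate:)
                                                     name:JDMarketDidUpdateNotification
                                                   object:nil];
    }
    return self;
}

- (void)handleMarketUpdate:(NSNotification *)notification {
    self.receivedCount += 1;
}

- (void)dealloc {
    // Needed on systems older than iOS 9; harmless on newer ones.
    [[NSNotificationCenter defaultCenter] removeObserver:self];
}

@end
```

Removing the observer in dealloc guarantees that the notification center never holds an address that outlives the object, regardless of the system version.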

2. Excessive release caused by multi-threaded shared resources

Another error we hit in development was also caused by multithreading: a relatively rare failure in a property setter. The app uses the open-source Lottie framework instead of GIF images for complex ambient animations in order to reduce memory use. Loading a local Lottie file is time-consuming, so a background thread reads and decompresses it and then hands off to the main thread for rendering. In addition, the app sends a network request at launch to fetch the latest Lottie file; if the request completes quickly, the interface shows the latest file first.

From the crash statistics and the business code, we soon suspected a multithreading error. The local Lottie file is read on thread 1 and the network request runs on thread 7. Once each has a file path, both threads call handleCacheFilePath to load the Lottie file. The function is as follows:

-(void)handleCacheFilePath:(NSString *)filePath {

    if (!filePath) {
        return;
    }

    dispatch_async(dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0), ^{

        NSData *zipData = [NSData dataWithContentsOfFile:filePath];

        /* zip decompression and other steps omitted ... */

        manager.lottieData = zipData;

        dispatch_async(dispatch_get_main_queue(), ^{
            // back on the main thread
        });
    });
}

After decompression, the Lottie JSON data is assigned to the lottieData property, to be shown by the Lottie animation on the next screen. In production this occasionally crashed. Because the app's business logic is complex and the trigger conditions are extremely narrow, we moved the Lottie loading and display modules into a new demo project to reproduce the problem.

In the parsed crash log, the crash type is EXC_BAD_ACCESS (SIGSEGV), a memory error. Multithreading errors usually manifest as memory problems, so their crash logs look much like those of ordinary memory errors and usually carry the same EXC_BAD_ACCESS (SIGSEGV) type. Analysis showed that the crash occurred at the assignment manager.lottieData = zipData, and the rest of the crash information showed that the cause was an over-release of the variable.

Assigning to a property in Objective-C calls its setter, so why would that over-release anything? Let's look at how the Objective-C runtime source handles it (source: https://opensource.apple.com/source/objc4/objc4-723/runtime/objc-accessors.mm.auto.html). The assembly shows that the setter calls the runtime function objc_setProperty_nonatomic, and since part of Apple's runtime is open source, we can read the code directly.

void objc_setProperty_nonatomic(id self, SEL _cmd, id newValue, ptrdiff_t offset)
{
    reallySetProperty(self, _cmd, newValue, offset, false, false, false);
}

The reallySetProperty function is actually called in objc_setProperty_nonatomic.

static inline void reallySetProperty(id self, SEL _cmd, id newValue, ptrdiff_t offset, bool atomic, bool copy, bool mutableCopy)
{
    if (offset == 0) {
        object_setClass(self, newValue);
        return;
    }

    id oldValue;
    id *slot = (id*) ((char*)self + offset); // compute the slot address from the offset

    if (copy) {
        newValue = [newValue copyWithZone:nil];
    } else if (mutableCopy) {
        newValue = [newValue mutableCopyWithZone:nil];
    } else {
        if (*slot == newValue) return;
        newValue = objc_retain(newValue); // retain the new value
    }

    if (!atomic) { // nonatomic property
        oldValue = *slot; // step 1
        *slot = newValue; // step 2
    } else {
        spinlock_t& slotlock = PropertyLocks[slot];
        slotlock.lock();
        oldValue = *slot;
        *slot = newValue;        
        slotlock.unlock();
    }

    objc_release(oldValue); // release the old value (reference count -1)
}

The heart of the problem is lines 20-31 of reallySetProperty. For a nonatomic property, the function reads the slot into oldValue, stores the new value in the slot, and finally releases oldValue, decrementing its reference count by one. For an atomic property, a spinlock is taken before the slot is read; once the current thread holds the lock, any other thread has to wait until it is released.

lottieData is a shared variable and is declared nonatomic, so the assignment on thread 1 takes the nonatomic branch. Suppose thread 1's time slice runs out right after it executes oldValue = *slot. The CPU switches to thread 7, which performs the same operations: it reads the same old object into its own oldValue, stores its new value, and calls objc_release(oldValue), which frees the memory that oldValue points to. When the CPU switches back to thread 1, thread 1 stores its new value and then calls objc_release(oldValue) on memory that has already been released, and the app crashes. This matches the overrelease_error in the crash stack.

Solution: once the mechanism is understood, the fix follows naturally. lottieData is a shared variable held by the manager, so it can be declared atomic to prevent the race between threads. The setter and getter of an atomic property take a spinlock, which costs extra CPU if the property is read very frequently; fortunately, the business logic uses lottieData in only a few places, so the overhead is negligible. Alternatively, the code can be restructured so that lottieData is a local variable returned from the function, which removes the shared state altogether.
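
The atomic variant of the fix can be sketched as below. JDLottieManager and JDStressLottieSetter are hypothetical stand-ins for the real manager and its callers; the stress function only imitates many threads assigning the property at once.

```objc
#import <Foundation/Foundation.h>

// Hypothetical stand-in for the manager object that owns the shared Lottie data.
@interface JDLottieManager : NSObject
// atomic: the synthesized setter swaps the old and new values under a lock,
// so two threads can no longer read and release the same old value.
@property (atomic, strong) NSData *lottieData;
@end

@implementation JDLottieManager
@end

// Assigns the property from many threads concurrently. With a nonatomic
// property this pattern can over-release the old value; with atomic it is safe.
static NSData *JDStressLottieSetter(JDLottieManager *manager, size_t iterations) {
    dispatch_queue_t queue = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
    // dispatch_apply runs the block concurrently and returns when all iterations finish.
    dispatch_apply(iterations, queue, ^(size_t index) {
        uint64_t value = (uint64_t)index;
        manager.lottieData = [NSData dataWithBytes:&value length:sizeof(value)];
    });
    return manager.lottieData;
}
```

Note that atomic only makes each individual get or set safe. A sequence such as reading the data, modifying it, and writing it back still needs its own synchronization.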

3. Method call exception

Among the many kinds of crashes, method call exceptions are among the easiest to fix. In the app, some arise because a method was never implemented through oversight, and others because the receiver has been released and the object now at that address does not have the method. We will not discuss fixes here, since developers know them well. Instead, we look at how the "unrecognized selector sent to instance" exception comes to be thrown.

Calling a method on an object in Objective-C means sending the object a message. At compile time a method call is translated into a call to objc_msgSend, whose first parameter is the message receiver, whose second is the selector (method name), and whose remaining parameters are the method's arguments.
objc_msgSend(id self, SEL op, … )

Here are the steps to send a message:

(1) Check whether the selector should be ignored. On a Mac OS X system with garbage collection enabled, for example, retain and release are ignored.
(2) Check whether the receiver is nil. The runtime ignores messages sent to nil.
(3) Look up the method implementation (an IMP function pointer) in the class's method cache and, if it is found, execute it.
(4) If it is not in the cache, search the class's method list, then the method lists of its superclasses in turn.
(5) If it is still not found, enter dynamic method resolution and message forwarding.
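
Some of the steps above can be observed directly with the public runtime API. The sketch below is an illustration only; the JD-prefixed function names are hypothetical.

```objc
#import <Foundation/Foundation.h>
#import <objc/message.h>
#import <objc/runtime.h>

// [string length] written as the explicit call the compiler generates.
// objc_msgSend must be cast to the exact function type before it is called.
static NSUInteger JDLengthViaMsgSend(NSString *string) {
    NSUInteger (*send)(id, SEL) = (NSUInteger (*)(id, SEL))objc_msgSend;
    return send(string, @selector(length));
}

// Steps (3) and (4) seen from outside: does the class, or one of its
// superclasses, provide an instance method for this selector?
static BOOL JDClassImplements(Class cls, SEL selector) {
    return class_getInstanceMethod(cls, selector) != NULL;
}
```

Calling JDLengthViaMsgSend(nil) returns 0 rather than crashing, which is step (2) in action. A selector for which JDClassImplements returns NO is one that would fall through to step (5).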

Messages are sent to an object through objc_msgSend. If the object still cannot handle the message after the steps above, an exception is thrown. Before crashing, the Objective-C runtime goes through the following two stages:

  • Dynamic Method Resolution
    Taking an instance method as an example, the runtime calls resolveInstanceMethod: so that the class can add the method dynamically. If it returns YES, the runtime looks up the instance method again; if it returns NO, message forwarding begins.
+ (BOOL)resolveInstanceMethod:(SEL)sel {
    if (sel == @selector(handleOpenPage:)) {
        IMP imp = class_getMethodImplementation([self class], @selector(openNewPage:));
        class_addMethod([self class], sel, imp, "v@:@");
        return YES;
    }
    return [super resolveInstanceMethod:sel];
}

The example above dynamically supplies openNewPage: as the implementation of handleOpenPage: for instances of the class. "v@:@" encodes the return type and parameters: a void return, the implicit self and _cmd, and, assuming the parameter is an object, one object argument. The meaning of each character is listed in Type Encodings (https://developer.apple.com/library/archive/documentation/Cocoa/Conceptual/ObjCRuntimeGuide/Articles/ocrtTypeEncodings.html).

  • Message Forwarding
    Message forwarding first calls forwardingTargetForSelector: to obtain a new receiver on which to run the selector. For an instance method, override - (id)forwardingTargetForSelector:(SEL)aSelector; for a class method, override + (id)forwardingTargetForSelector:(SEL)aSelector.
- (id)forwardingTargetForSelector:(SEL)aSelector {
    if(aSelector == @selector(handleOpenPage:)){
        return _otherObject;
    }
    return [super forwardingTargetForSelector:aSelector];
}

If the returned object is invalid (nil, or the original receiver itself), the runtime moves on to forwardInvocation:. A class can override this method to define its own forwarding logic. The anInvocation parameter is built by the runtime from the method signature it obtains by calling methodSignatureForSelector:, so a class that overrides forwardInvocation: must also override methodSignatureForSelector:, otherwise an exception is thrown.

- (void)forwardInvocation:(NSInvocation *)anInvocation {

    if ([_otherObject respondsToSelector:[anInvocation selector]]) {
        [anInvocation invokeWithTarget:_otherObject];
    } else {
        [super forwardInvocation:anInvocation];
    }
}

When an object does not implement a method, the runtime notifies it with the forwardInvocation: message. Every object inherits forwardInvocation: from NSObject, whose implementation simply calls doesNotRecognizeSelector:. By implementing our own forwardInvocation:, we can pass the message on to another object. If none of these stages handles the message, an exception is thrown and the app crashes.
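
Putting the two overrides together, a complete forwarder might look like the following sketch. JDForwarder is a hypothetical class written for this illustration; it is not part of the app.

```objc
#import <Foundation/Foundation.h>

// Hypothetical forwarder: it implements no business methods itself and hands
// every selector it does not recognize to its target.
@interface JDForwarder : NSObject
- (instancetype)initWithTarget:(id)target;
@end

@implementation JDForwarder {
    id _target;
}

- (instancetype)initWithTarget:(id)target {
    self = [super init];
    if (self) {
        _target = target;
    }
    return self;
}

// Supplies the signature the runtime needs in order to build the NSInvocation.
- (NSMethodSignature *)methodSignatureForSelector:(SEL)aSelector {
    NSMethodSignature *signature = [super methodSignatureForSelector:aSelector];
    return signature ?: [_target methodSignatureForSelector:aSelector];
}

- (void)forwardInvocation:(NSInvocation *)anInvocation {
    if ([_target respondsToSelector:anInvocation.selector]) {
        [anInvocation invokeWithTarget:_target];
    } else {
        [super forwardInvocation:anInvocation];   // ends in doesNotRecognizeSelector:
    }
}

@end
```

For example, given id forwarder = [[JDForwarder alloc] initWithTarget:@"crash"], sending [forwarder length] is answered by the string. If the target cannot provide a signature either, the runtime still raises the unrecognized selector exception.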

The cases above, with their source code and analysis, are typical of what the financial app encountered during crash governance. Working through them gave the R&D team valuable experience in tracking down problems.

V. Lessons from practice

1. The user's operation path is the best clue for reproducing a problem. From the stack in the crash log, together with the APM performance monitoring platform, we can work out which pages the user visited, what state the app was in, and whether it was running in the foreground or background. Using the launch ID, we can see the app's network status before the crash and which network requests it sent. All of this helps developers reproduce problems faster.

2. Crashes in the app's own business code are usually easy to fix. For crashes in a third-party open-source library, first search the library's community for the same problem: actively maintained libraries have been exercised in many real apps, so similar problems and fixes often already exist. If the R&D team solves the problem itself, it can submit a pull request to the community and share the fix with other developers who hit the same problem.

3. Every crash has specific trigger conditions. When a crash truly cannot be reproduced, approach it from another angle. Reading the source code is an excellent way to understand the nature of a problem. Apple's ecosystem is closed, but some of its source code is open and can be read online or downloaded for closer study.

4. The crash log is the first-hand material for solving a crash. It contains the stack and the cause at the moment the app crashed, and being able to read it is an essential development skill. The next section therefore gives a brief introduction to the contents of a crash log.

VI. Crash log analysis

1. Crash log content

After a crash, the first questions are which line of code crashed, what the stack looked like, and which threads were running. The crash report answers all of them. Take the WWDC demo app ChocolateChip, running in the simulator, as an example. The top of the crash log holds summary information: the app name, version number, operating system, and the date and time of the crash.

Next comes the cause of the crash. The error occurred on the main thread, and the crash type is SIGILL, meaning the CPU tried to execute a nonexistent or invalid instruction. The Fatal error line gives the specific reason: force-unwrapping an optional whose value is nil.

After that comes the crash stack, which shows the crashed thread and its stack at the time of the crash.

The raw stack is not suitable for locating the problem directly and first has to be symbolicated. Symbolication is the process of converting memory addresses into method names, file names, and line numbers. Symbolicating a crash log requires three things.

(1) The crash log. It can be obtained from the Crashes panel of the Organizer window (under Xcode's Window menu) or downloaded from the app submission back end. Apps with full monitoring capability can collect crash logs themselves and report them to a server, from which they can later be downloaded.

(2) The symbol table. A dSYM (debugging SYMbols) file is also called the debug symbol table. Every build archived and uploaded through Xcode is kept automatically: open the Organizer window from Xcode's Window menu and the archived builds are listed under Archives. Select one, then use Show in Finder → Show Package Contents to find the app's dSYM file.

(3) The symbolication tool that ships with Xcode, found at /Applications/Xcode.app/Contents/SharedFrameworks/DVTFoundation.framework/Versions/A/Resources/symbolicatecrash.

Copy the three items into the same folder, check that the UUID recorded in the crash log matches that of the dSYM, then change to that folder in a terminal and run ./symbolicatecrash -v xxxx.crash xxxx.app.dSYM. The symbolicated stack in the output shows method names, file names, line numbers, and other details.

The bottom of the log holds lower-level information: the register state of the crashed thread and the binary images loaded into the process, that is, the app's executable file data. Xcode uses this information during symbolication to look up symbols, files, and line numbers and show them in the stack.
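
The binary image list matters because every address in the stack is a runtime address, shifted by the slide the system chose at launch. The arithmetic behind symbolication can be sketched as below. JDSymbolAddress is a hypothetical helper, and linkAddress (the image's link-time base address) is an assumed input that a real tool reads from the binary or dSYM.

```objc
#import <Foundation/Foundation.h>

// Converts a runtime address from a crash log into the link-time address that
// the symbol table is keyed by.
//   runtimeAddress: an address from a stack frame in the crash log
//   loadAddress:    where the image was loaded, from the binary image list
//   linkAddress:    the image's base address at link time (assumed input)
static uint64_t JDSymbolAddress(uint64_t runtimeAddress,
                                uint64_t loadAddress,
                                uint64_t linkAddress) {
    uint64_t slide = loadAddress - linkAddress;   // shift applied when the image was loaded
    return runtimeAddress - slide;
}
```

With made-up values, a frame at 0x104a3c5f0 in an image loaded at 0x104a30000 and linked at 0x100000000 maps to 0x10000c5f0. This is the address a symbolication tool would then look up in the dSYM.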

That is everything a crash log contains. Within it, the following pieces of information deserve particular attention.

First, start with the crash type. In the example, the exception type is EXC_BAD_INSTRUCTION, which means the CPU tried to execute an illegal instruction. The crash message states that the crash was caused by force unwrapping an optional whose value was nil.

Second, the crash is on the main thread, and the stack lists the functions that were running at the time of the crash. The fatalErrorMessage function appears in the stack. This is a system function, and it was called by a function in the App's code.

The stack trace entry (RecipeImage.swift:26) shows that the call occurs on line 26 of the RecipeImage.swift file. The code contains a Recipe class whose image function is called, and that function ends up calling fatalErrorMessage because of an error. When obtaining the image, the code force unwraps the optional path, which causes the crash. See below:

[Figure: code in RecipeImage.swift that force unwraps the optional path]

2. Where to view the crash log

Once you know how to interpret a crash log, the next question is where to find one. There are three common sources:

(1) Log in to Xcode with your Apple ID and view the Crashes item in the Organizer, opened from the menu bar.
(2) If you can get hold of the device that crashed, read the logs directly from the device and filter out the entries related to the App.
(3) Use an App monitoring platform, which collects crash information on the App side and classifies and analyzes it in the backend. This makes it easier for developers to locate problems.

6. Special crash governance strategy

1. Set up a special crash project and set a phased goal

Before the special governance effort, the crash rate of the JD Financial App was unstable and fluctuated with version releases and business iterations. The online statistics contained many kinds of crashes: simple array out-of-bounds crashes, crashes caused by inserting nil values, memory problems that lead to wild pointers, and multithreading exceptions. In response, the team set up a dedicated crash task force, sorted the crashes by number of occurrences and urgency, and worked through the top ten problems on the crash list in repeated cycles.
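The ranking step described above can be illustrated with a short sketch. It assumes each crash report has already been reduced to a signature and a numeric urgency level, and it breaks ties in count by urgency; the field names, the urgency scale, and the tie-breaking rule are assumptions made for illustration, not the team's actual scheme.

```python
from collections import defaultdict


def top_crashes(reports, limit=10):
    """Rank crash signatures by occurrence count, then by urgency.

    `reports` is an iterable of (signature, urgency) pairs, where a larger
    urgency number means a more urgent problem (hypothetical scale).
    Returns a list of (signature, count, urgency) tuples.
    """
    counts = defaultdict(int)
    urgency = {}
    for signature, level in reports:
        counts[signature] += 1
        # Keep the highest urgency seen for this signature.
        urgency[signature] = max(level, urgency.get(signature, level))

    ranked = sorted(
        counts,
        key=lambda sig: (-counts[sig], -urgency[sig], sig),
    )
    return [(sig, counts[sig], urgency[sig]) for sig in ranked[:limit]]
```

Running this once per version or per week gives the task force a fresh top ten to work through in each cycle.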

2. Crash module location and distribution

The App is no longer limited to a single business. It is now a collection of many business functions. After the App was split into components, the code of each business function is integrated into the App as a .a library, a .framework, and so on. When a crash occurs, it is hard to tell which business module contains the crashing code, which makes it difficult to assign and resolve the crash. To address this, the crash task force matches file names through the linkmap, or uses the grep command to find the .a and .framework binaries that contain the crashing code. Once the module is found, the crash is sent to the responsible business team so that it can be handled promptly.
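The linkmap matching described above can be sketched as follows. This is an illustrative sketch that assumes the usual text layout of a linker map file ("# Object files:" entries such as `[  2] /path/libPay.a(PayManager.o)` followed by "# Symbols:" entries that refer back to the object index). The function names and the sample library names are hypothetical.

```python
import os
import re

OBJECT_RE = re.compile(r"^\[\s*(\d+)\]\s+(.+)$")
SYMBOL_RE = re.compile(r"^0x[0-9A-Fa-f]+\s+0x[0-9A-Fa-f]+\s+\[\s*(\d+)\]\s+(.+)$")


def owning_library(object_path):
    """Return the .a/.framework binary that holds an object file, or the
    object file's own name when it is linked directly into the App."""
    if object_path.endswith(")") and "(" in object_path:
        object_path = object_path[: object_path.rindex("(")]
    return os.path.basename(object_path)


def find_modules(linkmap_text, keyword):
    """List the libraries whose symbols contain `keyword`,
    for example a class or method name taken from the crash stack."""
    objects = {}
    modules = set()
    section = None
    for line in linkmap_text.splitlines():
        if line.startswith("# Object files:"):
            section = "objects"
        elif line.startswith("# Symbols:"):
            section = "symbols"
        elif line.startswith("#"):
            # Other headers (sections, dead-stripped symbols) are skipped.
            if line.startswith("# Sections:") or line.startswith("# Dead"):
                section = None
        elif section == "objects":
            m = OBJECT_RE.match(line)
            if m:
                objects[int(m.group(1))] = m.group(2)
        elif section == "symbols":
            m = SYMBOL_RE.match(line)
            if m and keyword in m.group(2):
                path = objects.get(int(m.group(1)))
                if path:
                    modules.add(owning_library(path))
    return sorted(modules)
```

Given a class name from the symbolized stack, the result names the binary to look at, which can then be mapped to the team that owns it.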

3. Establish an App monitoring system

In the earlier crash resolution process, developers had to take the initiative to check the current crash trend in the Apple developer backend or a third-party crash monitoring backend (Bugly, Umeng, etc.). This approach depends on developers being self-driven and introduces a delay. For example, it is hard to respond in time to a burst of crashes triggered by a configuration change in the backend. The financial app now monitors the crash trend through the APM performance monitoring system and sets crash thresholds. An alarm is triggered automatically when the number of crashes within a given period reaches a specified count, or when the crash rate stays above the threshold for a given period, and the person in charge is notified by email, internal communication tools, and other channels so that the problem can be handled promptly. In addition, weekly performance reports are sent automatically to evaluate the overall performance.
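The alarm rule described above can be illustrated with a minimal sketch. It is not the APM system's implementation: the class name, the parameters, and the use of sessions as the denominator of the crash rate are assumptions chosen for illustration.

```python
from collections import deque


class CrashAlertRule:
    """Sliding-window rule: alert on too many crashes or too high a rate."""

    def __init__(self, window_seconds, max_crashes, max_crash_rate):
        self.window_seconds = window_seconds
        self.max_crashes = max_crashes
        self.max_crash_rate = max_crash_rate
        self._crashes = deque()  # crash timestamps, oldest first

    def record_crash(self, timestamp):
        self._crashes.append(timestamp)

    def should_alert(self, now, sessions_in_window):
        """Return True when either threshold is exceeded at time `now`."""
        # Drop crashes that have fallen out of the window.
        cutoff = now - self.window_seconds
        while self._crashes and self._crashes[0] <= cutoff:
            self._crashes.popleft()

        count = len(self._crashes)
        rate = count / sessions_in_window if sessions_in_window else 0.0
        return count >= self.max_crashes or rate > self.max_crash_rate
```

A caller would evaluate `should_alert` on a timer and, when it returns True, notify the person in charge through email or an internal messaging tool.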

4. Continue to pay attention to the existing problems and curb emerging problems

Daily development always leaves some problems that are persistent and hard to reproduce. Such a problem causes only a small number of crashes while daily activity is stable, but it persists across multiple versions. Normally it is buried among other crashes, but when daily activity rises sharply, for example during the 618 and Double Eleven promotions, its crash count rises as well. These peaks are therefore a good opportunity to discover such problems. In addition, while solving existing problems, the team closely monitors new business features after they go online, in case a problem that did not show up during the grayscale release causes a burst of crashes once more users start using the feature.

5. Coding specification

Good coding practices help reduce coding errors. Complex algorithms and logic make up only a small part of a program, and most crashes can be avoided through code review. Common problems such as inserting nil values into dictionaries and arrays or calling methods that cannot be found are largely eliminated by the time the code has passed development, testing, and the grayscale release.

7. Summary

This article covers the basics of crashes, including how crashes occur, typical crash scenarios, and how to interpret crash logs. It also analyzes real crashes encountered during the development of the financial app in detail: the cause of each problem, how to locate the crash, how to reproduce it, and how to fix it. During development, you can locate a problem from the crash type and the crash log, and apply the solutions given here for typical crashes. Improving App performance and user experience is a long-term process. Crashes will not disappear after one round of optimization. Only continuous attention and optimization can keep today's Apps, whose code volume has exploded, moving forward steadily.

The sixth part of this article mentions the APM performance monitoring system. It is a performance monitoring platform built by JD Technology's mobile team together with the operation and maintenance team. It provides monitoring of startup time, network request fluctuation, webView load time, user tracks, native pages, crashes and freezes, as well as custom monitoring, achieving full-link monitoring from startup through server connection to exit. At present, multiple apps in JD Technology have been connected to the APM performance monitoring system, which provides stronger quality assurance for each app and business team.

Author of this article: Jingdong Technology Wu Xinyu
For more technical best practices & innovations, please pay attention to the WeChat official account of "Jingdong Technology Technology Talk"


Origin blog.csdn.net/JDDTechTalk/article/details/119238048