In-depth analysis of Android stability optimization

Author: Programmer Jiang

Foreword

The stability of an Android app is an important indicator of Android performance, and it is also the most basic and critical part of an app quality system. If the app crashes frequently, or if key features are unavailable, user retention clearly suffers.
To ensure the stability of the application, we should first establish a correct understanding of stability. This article covers the following topics:

  1. Correct Understanding of Stability Optimization
  2. General steps in handling a crash
  3. Long-term crash governance
  4. Business High Availability Solution Construction
  5. Common interview questions on stability optimization

Correct Understanding of Stability Optimization

Key metrics for stability optimization

To optimize stability, the first question is what result we want to achieve. What crash rate counts as excellent? Only once the goal is clear can we correctly understand the role of our work.

To calculate the crash rate, we should first understand some key indicators of stability optimization.

UV crash rate vs. PV crash rate

PV (Page View) is the number of visits. UV (Unique Visitor) is the number of unique visitors: the same device is counted only once between 00:00 and 24:00 of a day.

  • UV crash rate: based on the number of users. It is the proportion of users who experienced a crash among all users within a period of time, and it is used to evaluate how widely crashes affect users.
  • PV crash rate: based on the number of times the app is used. It is used to evaluate how severe the impact of crashes is.

You can choose the appropriate indicator according to your own needs. It should be noted that you need to ensure that you always use the same measurement method.
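
As a minimal sketch of the two indicators, the snippet below computes both rates from aggregated counters. The exact formulas are an assumption based on common practice (crashed users over active users for UV, crash count over launch count for PV); the class and method names are hypothetical.

```java
// Minimal sketch: UV and PV crash rates from aggregated counters.
// The formulas are a common convention and an assumption, not the
// definition used by any particular monitoring platform.
final class CrashRates {
    private CrashRates() {}

    /** UV crash rate: users who crashed at least once / active users. */
    static double uvCrashRate(long crashedUsers, long activeUsers) {
        return ratio(crashedUsers, activeUsers);
    }

    /** PV crash rate: number of crashes / number of app launches. */
    static double pvCrashRate(long crashCount, long launchCount) {
        return ratio(crashCount, launchCount);
    }

    private static double ratio(long numerator, long denominator) {
        if (denominator <= 0) {
            return 0.0; // no traffic in the period
        }
        return (double) numerator / denominator;
    }

    public static void main(String[] args) {
        System.out.printf("UV crash rate: %.4f%n", uvCrashRate(15, 10_000));
        System.out.printf("PV crash rate: %.4f%n", pvCrashRate(40, 50_000));
    }
}
```

Whichever definition you pick, computing it in one place like this makes it easier to keep the measurement method consistent over time.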

Crash rate evaluation

So, how low does our app's crash rate need to be to count as normal or excellent?

  • The combined Java and Native crash rate must be below 2 per thousand (0.2%).
  • A crash rate in the range of a few per ten thousand is excellent.

Note that the figures above refer to the UV crash rate.

Dimensions of Stability Optimization

Many people think that stability optimization just means reducing the crash rate. But if your app does not crash while its key functions are unavailable, can it really be considered stable?
Therefore, application stability can be divided into three dimensions:

  • 1. Crash dimension: the most important indicator is the application's crash rate.
  • 2. Performance dimension: includes optimization directions such as startup speed, memory, and rendering. These are secondary compared with crashes, but they are also part of application stability.
  • 3. Business high availability dimension: this is critical. We need to use various means to ensure the stability of the app's main flow and core paths.

General steps in handling a crash

Let's look at how to handle a crash: if the application crashes, how should you analyze it?
We approach it from two angles: the crash scene and crash analysis.

Crash scene

The crash scene is our "first crime scene" and it holds many valuable clues. The more information we dig out here, the clearer the direction of the next analysis, instead of relying on blind guessing.
Next, let's take a look at what information should be collected at the crash site.

Crash information

From the basic information of the crash, we can form a preliminary judgment about it.

  • Process name, thread name. Is the crashing process a foreground or a background process, and did the crash occur on the UI thread?
  • Crash stack and type. Is it a Java crash, a Native crash, or an ANR? We focus on different things for different types of crashes. In particular, look at the top of the crash stack to see whether the crash is in system code or in our own code.

System information

In addition to the crash information, the system information sometimes contains some key clues, which are very helpful for us to solve the problem.

  • Logcat output. This includes application and system runtime logs. Sometimes the stack tells you very little, but Logcat can yield unexpected clues.
  • Model, OS version, manufacturer, CPU, ABI, Linux version, etc. We collect as many as dozens of dimensions, which is very helpful when looking for commonalities across crashes, as discussed later.
  • Device status: whether the device is rooted and whether it is an emulator. Some problems are caused by Xposed or app-cloning software, and we have to treat these problems differently.

Memory information

OOM, ANR, virtual memory exhaustion: many crashes are directly related to memory. If we divide users' phones into two buckets, "below 2GB" and "above 2GB" of memory, we will find that the crash rate of "below 2GB" users is several times that of "above 2GB" users.

  • System remaining memory. The system memory status can be read directly from the file /proc/meminfo. When the system's available memory is very low (less than 10% of MemTotal), problems such as OOM, heavy GC, and the system frequently killing and restarting processes are very likely to occur.
  • App memory usage. This includes Java memory, RSS (Resident Set Size), and PSS (Proportional Set Size), from which we can get the size and distribution of the application's own memory.
  • Virtual memory. The virtual memory size can be obtained from /proc/self/status, and its detailed distribution from the file /proc/self/maps. We usually don't pay much attention to virtual memory, but many problems such as OOM and tgkill are caused by insufficient virtual memory.

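Name:     com.sample.name   // process name
FDSize:   800               // number of file handles allocated for the current process
VmPeak:   3004628 kB        // peak virtual memory size of the current process
VmSize:   2997032 kB        // current virtual memory size of the current process
Threads:  600               // number of threads in the current process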

Generally speaking, for a 32-bit process on a 32-bit CPU, memory allocation may start to fail once virtual memory reaches 3GB. On a 64-bit CPU, the virtual memory available to a 32-bit process is generally between 3 and 4GB. Of course, if we support 64-bit processes, virtual memory won't be an issue. Therefore, our application should support 64-bit wherever possible.
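
The fields shown above can be extracted with a few lines of parsing. The following is a minimal sketch that works on the text of /proc/self/status (and on /proc/meminfo, which uses the same "Key: value" layout). The class and method names are hypothetical, and which meminfo field you treat as "available memory" (for example MemAvailable) is an assumption left to the caller.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch: parse "Key: value" lines as found in /proc/self/status.
final class ProcStatusParser {
    private ProcStatusParser() {}

    /** Parses "Key: value" lines into a map; values keep any unit suffix such as "kB". */
    static Map<String, String> parse(String text) {
        Map<String, String> fields = new HashMap<>();
        for (String line : text.split("\n")) {
            int colon = line.indexOf(':');
            if (colon <= 0) {
                continue; // skip blank or malformed lines
            }
            fields.put(line.substring(0, colon).trim(), line.substring(colon + 1).trim());
        }
        return fields;
    }

    /** Returns the leading integer of a field such as "2997032 kB", or -1 if missing. */
    static long numeric(Map<String, String> fields, String key) {
        String value = fields.get(key);
        if (value == null || value.isEmpty()) {
            return -1;
        }
        try {
            return Long.parseLong(value.split("\\s+")[0]);
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    /** True when available memory is below 10% of total, the threshold named in the text. */
    static boolean isSystemMemoryLow(long availableKb, long totalKb) {
        return totalKb > 0 && availableKb * 10 < totalKb;
    }

    public static void main(String[] args) {
        String sample = "Name:\tcom.sample.name\nFDSize:\t800\nVmPeak:\t3004628 kB\n"
                + "VmSize:\t2997032 kB\nThreads:\t600\n";
        Map<String, String> fields = parse(sample);
        System.out.println("VmSize (kB): " + numeric(fields, "VmSize"));
        System.out.println("Threads: " + numeric(fields, "Threads"));
        System.out.println("Low memory: " + isSystemMemoryLow(150_000, 2_000_000));
    }
}
```

In an app, the input string would come from reading the file itself at crash time; the sample string here only mirrors the excerpt above.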

Resource information

Sometimes we find that the application heap memory and the device memory are both plentiful, yet memory allocation still fails. This is likely related to resource leaks.

  • File handles (fd). Generally, the maximum number of file handles a single process may open is 1024. Once the count exceeds 800, the situation is already quite dangerous. At that point, output all fds and their corresponding file names to the log, and check whether there is a file or thread leak.
  • Threads. A single thread may occupy 2MB of virtual memory, so too many threads put pressure on virtual memory and file handles. In my experience, more than 400 threads is dangerous. Output all thread ids and their corresponding thread names to the log, and check whether there are thread-related problems.
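
A minimal sketch of these two checks follows. The thresholds 800 and 400 come from the text above; the class name and the way the counts are obtained are assumptions. Note that `Thread.getAllStackTraces()` only sees Java threads, so native threads would need to be collected separately.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch: flag risky fd and thread counts and dump Java thread names.
final class ResourceRisk {
    static final int FD_WARN = 800;     // from the text: above 800 (of 1024) is risky
    static final int THREAD_WARN = 400; // from the text: above 400 threads is risky

    private ResourceRisk() {}

    /** Returns one warning line per threshold that is exceeded. */
    static List<String> evaluate(int openFdCount, int threadCount) {
        List<String> warnings = new ArrayList<>();
        if (openFdCount > FD_WARN) {
            warnings.add("fd count " + openFdCount + " exceeds " + FD_WARN);
        }
        if (threadCount > THREAD_WARN) {
            warnings.add("thread count " + threadCount + " exceeds " + THREAD_WARN);
        }
        return warnings;
    }

    /** Lists "id name" for every live Java thread in this process. */
    static List<String> dumpJavaThreads() {
        List<String> lines = new ArrayList<>();
        for (Thread thread : Thread.getAllStackTraces().keySet()) {
            lines.add(thread.getId() + " " + thread.getName());
        }
        return lines;
    }

    public static void main(String[] args) {
        System.out.println(evaluate(850, 120));
        dumpJavaThreads().forEach(System.out::println);
    }
}
```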

Application information

In addition to the system, our application actually understands itself better and can leave a lot of relevant information.

  • Crash scene. In which Activity or Fragment, and in which business flow, did the crash occur?
  • Key operation path. Unlike the detailed tracking logs used during development, we can record key user operation paths, which are of great help in reproducing crashes.
  • Additional custom information. Different applications may have different concerns. For example, Netease Cloud Music will focus on the currently playing music, and QQ Browser will focus on the currently opened URL or video. In addition, information such as uptime, whether a patch is loaded, whether it is a new installation or an upgrade, etc. are also very important.
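
One common way to record the key operation path is a small fixed-size buffer of recent events that is attached to the crash report. The sketch below assumes this approach; the class name, capacity, and event strings are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Minimal sketch: keep only the most recent N user operations in memory.
final class Breadcrumbs {
    private final int capacity;
    private final Deque<String> events = new ArrayDeque<>();

    Breadcrumbs(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
    }

    /** Records one operation, dropping the oldest entry when the buffer is full. */
    synchronized void record(String event) {
        if (events.size() == capacity) {
            events.removeFirst();
        }
        events.addLast(event);
    }

    /** Returns a copy of the recorded path, oldest first, for the crash report. */
    synchronized List<String> snapshot() {
        return new ArrayList<>(events);
    }

    public static void main(String[] args) {
        Breadcrumbs path = new Breadcrumbs(3);
        path.record("open HomeActivity");
        path.record("tap search");
        path.record("open DetailActivity");
        path.record("tap play");
        System.out.println(path.snapshot());
    }
}
```

Bounding the buffer keeps the memory cost constant, at the price of losing older steps; the right capacity depends on how long your typical reproduction paths are.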

The above covers the information that should be collected at the crash scene. Building such a collection platform yourself is quite complicated, so in most cases we simply integrate a third-party platform such as Bugly or Sentry. But from the introduction above, we know what information to focus on when analyzing crashes, and if the platform lacks a capability, we can add custom reporting.

Crash analysis

After enough information is reported at the crash site, we can begin to analyze the crash. Below we introduce the "trilogy" of crash analysis

Step 1: Determine your focus

Determining the focus means finding the important information in the log and forming a general judgment about the problem. I suggest concentrating on the following points in this step.

  1. Confirm severity and priority. Solving crashes also comes down to cost-effectiveness. We give priority to the top crashes and to crashes that have a major impact on the business.

  2. Basic crash information. Determine the type of crash and the description of the exception, and form a rough judgment about the crash. Generally speaking, most simple crashes can be concluded after this step.

  • Java crash. The type of a Java crash is usually obvious, such as NullPointerException for a null pointer, or OutOfMemoryError for insufficient resources. In the latter case, you need to further check the "memory information" and "resource information" in the log.
  • Native crash. Look at the signal, code, fault addr, etc., as well as the Java stack at the time of the crash. For the meaning of each signal, refer to an introduction to crash signals. The more common ones are SIGSEGV and SIGABRT: the former is generally caused by null or illegal pointers, and the latter is mainly caused by an ANR or by calling abort() to exit.
  • ANR. In my experience, first look at the main thread's stack to see whether it is caused by waiting on a lock. Then look at the iowait, CPU, GC, system server, and other information in the ANR log to determine whether it is an I/O problem, a CPU contention problem, or a stall caused by heavy GC.

  3. Logcat. Logcat generally contains valuable clues, and logs at the Warning and Error levels need special attention. From Logcat we can see some of the system's behavior and the state of the phone at the time. For example, "am_anr" is logged when an ANR occurs, and "am_kill" is logged when the app is killed. The logs differ across OS versions and manufacturers. When you can't see the cause of the problem or get useful information from one crash log, don't give up; check more crash logs from the same crash point.

  4. The status of each resource. Combined with the basic crash information, check whether the crash is related to "memory information" or "resource information": for example, insufficient physical memory, insufficient virtual memory, or a file handle (fd) leak.

Whether in the resource information or in Logcat, memory- and thread-related information deserves special attention; many crashes are caused by improper use of memory and threads.

Step Two: Find Commonalities

If the above method still cannot effectively locate the problem, we can try to find out if there are any commonalities in such crashes. Once the commonality is found, the differences can be further found, and the solution to the problem will be one step closer.

Model, OS version, ROM, manufacturer, and ABI: all of this collected system information can be used as aggregation dimensions. Look for commonalities such as whether the crash only occurs on devices with Xposed installed, only on x86 phones, only on Samsung models, or only on Android 5.0. Application information can also be used as aggregation dimensions, such as the link being opened, the video being played, the country, or the region. If you find a commonality, you have much clearer guidance for the next step of reproducing the problem.
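
Aggregating by a dimension can be as simple as counting crash reports per distinct value. The sketch below assumes a hypothetical `Report` type with three of the dimensions mentioned above; a real platform would carry many more fields and do this on the back end.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

// Minimal sketch: count crash reports per value of a chosen dimension.
final class CrashAggregator {
    /** Hypothetical crash report carrying a few of the collected dimensions. */
    static final class Report {
        final String model;
        final String osVersion;
        final String abi;

        Report(String model, String osVersion, String abi) {
            this.model = model;
            this.osVersion = osVersion;
            this.abi = abi;
        }
    }

    private CrashAggregator() {}

    /** Groups reports by the given dimension and counts each group. */
    static Map<String, Integer> countBy(List<Report> reports, Function<Report, String> dimension) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Report report : reports) {
            counts.merge(dimension.apply(report), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Report> reports = Arrays.asList(
                new Report("model-a", "5.0", "x86"),
                new Report("model-b", "5.0", "armeabi-v7a"),
                new Report("model-a", "8.0", "arm64-v8a"));
        System.out.println("By OS version: " + countBy(reports, r -> r.osVersion));
        System.out.println("By ABI: " + countBy(reports, r -> r.abi));
    }
}
```

If one value dominates a dimension far beyond its share of the user base, that dimension is a good candidate for the commonality you are looking for.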

Step 3: Try to reproduce

If we already know the cause of the crash, we need to try to reproduce it in order to confirm more details. If we have no clue about the crash at all, we also try to reproduce it by following the user's operation path, and then analyze the cause.

"As long as it can be reproduced locally, I can solve it" is something many developers and testers have said. This confidence comes from the fact that, with a stable reproduction path, we can use various means such as adding logs, or tools such as a debugger or GDB, for further analysis.

System crash resolution

Sometimes a crash is caused not by our application but by the system. System crashes often leave us feeling helpless: the cause may be a bug in a particular Android version, or a bug introduced by a manufacturer's modified ROM. The crash stack in this case may not contain our own code at all, which makes it difficult to locate the problem directly.

For this difficult problem, we can try to solve it through the following methods.

  1. Look for possible causes. Using the commonality classification above, first check whether the problem is specific to a certain OS version or to a certain manufacturer's ROM. Although the crash log may not contain our own code, the operation path and the logs can reveal some suspicious points.
  2. Try to avoid it. Check the suspicious code calls, see whether an inappropriate API is being used, and whether another implementation can avoid the problem.
  3. Solve it with a hook. Once the cause is understood, as a last resort you can hook the system code and modify its logic.

For example, we found a Toast-related crash that only appeared on Android 7.0. It seemed that the window token was already invalid when the Toast was shown; the window may have been destroyed by the time the Toast needed to be displayed.

android.view.WindowManager$BadTokenException: 
  at android.view.ViewRootImpl.setView(ViewRootImpl.java)
  at android.view.WindowManagerGlobal.addView(WindowManagerGlobal.java)
  at android.view.WindowManagerImpl.addView(WindowManagerImpl.java)
  at android.widget.Toast$TN.handleShow(Toast.java)

Why doesn't Android 8.0 have this problem? After checking the Android 8.0 source code, we found the following change:

try {
  mWM.addView(mView, mParams);
  trySendAccessibilityEvent();
} catch (WindowManager.BadTokenException e) {
  /* ignore */
}

Therefore, we can follow Android 8.0's approach and simply catch this exception. The key is to find the hook point: Toast has a field called mTN, which holds a Handler, and we only need to proxy that Handler to catch the exception.
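
The proxying idea can be illustrated without any Android classes. The sketch below is only an analogue: `Widget`, `Dispatcher`, and the local `BadTokenException` are hypothetical stand-ins, not the framework's `Toast`, `Handler`, or `WindowManager.BadTokenException`. It shows the general pattern of replacing a private field via reflection with a wrapper that delegates and swallows one specific exception.

```java
import java.lang.reflect.Field;

// Minimal sketch of "proxy the handler and catch the exception".
// All nested types are stand-ins for illustration, NOT Android framework classes.
final class HandlerHookSketch {
    /** Stand-in for the framework exception. */
    static class BadTokenException extends RuntimeException {}

    /** Stand-in for a handler-like object that runs work. */
    static class Dispatcher {
        void dispatch(Runnable task) {
            task.run();
        }
    }

    /** Stand-in for a framework class that keeps its dispatcher in a private field. */
    static class Widget {
        private Dispatcher mHandler = new Dispatcher();

        void show(Runnable task) {
            mHandler.dispatch(task);
        }
    }

    /** Proxy that delegates to the original and swallows only the targeted exception. */
    static class SafeDispatcher extends Dispatcher {
        private final Dispatcher delegate;
        int swallowed;

        SafeDispatcher(Dispatcher delegate) {
            this.delegate = delegate;
        }

        @Override
        void dispatch(Runnable task) {
            try {
                delegate.dispatch(task);
            } catch (BadTokenException e) {
                swallowed++; // ignore, mirroring the try/catch shown above
            }
        }
    }

    private HandlerHookSketch() {}

    /** Replaces the widget's private dispatcher with a catching proxy via reflection. */
    static SafeDispatcher hook(Widget widget) {
        try {
            Field field = Widget.class.getDeclaredField("mHandler");
            field.setAccessible(true);
            SafeDispatcher proxy = new SafeDispatcher((Dispatcher) field.get(widget));
            field.set(widget, proxy);
            return proxy;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("hook point not found", e);
        }
    }

    public static void main(String[] args) {
        Widget widget = new Widget();
        SafeDispatcher proxy = hook(widget);
        widget.show(() -> { throw new BadTokenException(); });
        System.out.println("swallowed exceptions: " + proxy.swallowed);
    }
}
```

The proxy deliberately catches only the one exception type, so any other failure still surfaces. On a real device the field names and types depend on the Android version, so they should be verified against that version's source before hooking.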

Long-term crash governance

The above describes the general steps for handling crashes after release, but the really important stage of crash governance is before release. We need to start from the development stage and carry out systematic, long-term crash governance.

Development stage

Long-term crash governance needs to start from the development stage. In the long run, better code quality brings better stability. We can improve code quality from the following two perspectives:

  • Unified coding standards, stronger coding skills, technical reviews, and a stronger code review mechanism.
  • Architecture optimization, capability convergence (encapsulating common operations), and unified fault tolerance. For example, the network library's utilities can validate returned data uniformly in advance, and if the data is invalid, the subsequent flow is not executed at all.

Testing stage

In addition to routine testing such as functional testing, automated testing, regression testing, and upgrade-install testing, it is also necessary to test boundary cases such as special scenarios and specific device models: for example, abnormal data returned by the server, server downtime, etc.

Merge stage

  • When a feature is finished and about to be merged into the main branch, first run compilation checks and static analysis to find possible problems.
  • After the scan completes, the code still cannot be merged directly, because multiple branches may conflict. So we first do a pre-merge build: merge into a branch identical to the main branch, then package it and run automated regression tests of the main flow. Only after these pass is the code merged into the main branch. This may seem troublesome, but these steps should be automated.

Release stage

  • In the release stage, we should run multiple rounds of gray release (staged rollout), growing the rollout gradually from small to large, so that problems are exposed early at the lowest cost.
  • Gray releases should also be split by scenario and cover multiple dimensions comprehensively. Dedicated gray releases can target specific OS versions, device models, etc., to check whether the users most likely to hit problems actually do.
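
One simple way to grow a gray release from small to large is to map each user to a stable bucket and compare it with the current rollout percentage. The sketch below assumes this bucketing approach; the class name and the use of `String.hashCode()` are illustrative choices, not a description of any particular release platform.

```java
// Minimal sketch: stable percentage-based gray release.
final class GrayRelease {
    private GrayRelease() {}

    /** Maps a stable user id to a bucket in [0, 100). */
    static int bucketOf(String userId) {
        return Math.floorMod(userId.hashCode(), 100);
    }

    /** True when the user falls inside the current rollout percentage. */
    static boolean isInRollout(String userId, int percent) {
        if (percent <= 0) {
            return false;
        }
        if (percent >= 100) {
            return true;
        }
        return bucketOf(userId) < percent;
    }

    public static void main(String[] args) {
        String userId = "user-42"; // hypothetical id
        for (int percent : new int[] {1, 5, 20, 50, 100}) {
            System.out.println(percent + "% -> " + isInRollout(userId, percent));
        }
    }
}
```

Because the bucket depends only on the user id, a user who receives the new version at 5% still receives it at 20%, so each round strictly extends the previous one.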

Operation and maintenance stage

  • After release, stability still needs attention, so we rely heavily on sensitive APM monitoring and timely alerts when problems are found.
  • If an anomaly occurs, roll back or degrade the feature according to the situation.
  • If rollback or degradation is not possible, fix the problem with a hotfix. If the hotfix fails, the only recourse is the local disaster recovery solution.

Business High Availability Solution Construction

Many people think that stability optimization just means reducing the crash rate, but in fact another important dimension of stability optimization is business high availability.
When a business function is unavailable, the app may not crash, but the user experience degrades, which directly affects our revenue.

Business High Availability Solution Construction

  1. Unlike crashes, business high availability requires us to collect the data ourselves. We need to map out the project's main flow, core paths, and key nodes, and add tracking points.
  2. For data collection, we can also use AOP to reduce the cost of manual tracking.
  3. After the data is reported, we can build a data dashboard and compute the conversion rate of each step.
  4. After the data is reported, we can also set up alerting strategies, such as threshold alerts, trend alerts (compared with the same period), and alerts on specific indicators (such as payment failures).
  5. At the same time, we can do some exception monitoring, such as reporting caught exceptions and abnormal logic paths. Although these do not crash the app, they still need our attention.
  6. For problems that are hard to solve, we can pull back full logs from specific users to collect more information.
  7. After discovering an anomaly, we can resolve it through fallback strategies, such as a feature switch controlled through the configuration center. When we find a problem with a new feature, we can hide the feature directly, or configure the route to jump to another page.
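
Steps 3 and 4 above can be sketched as a small funnel calculation with a threshold alert. The class name, step names, and the 0.5 threshold below are hypothetical; a real system would compute this on the back end from reported tracking events.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch: per-step conversion rates and a threshold alert.
final class FunnelMonitor {
    // Step name -> number of users who reached it, kept in funnel order.
    private final Map<String, Long> reached = new LinkedHashMap<>();

    void record(String step, long users) {
        reached.merge(step, users, Long::sum);
    }

    /** Conversion of each step relative to the step before it. */
    Map<String, Double> stepConversion() {
        Map<String, Double> result = new LinkedHashMap<>();
        long previous = -1; // -1 marks "no previous step yet"
        for (Map.Entry<String, Long> entry : reached.entrySet()) {
            long current = entry.getValue();
            if (previous >= 0) {
                result.put(entry.getKey(), previous == 0 ? 0.0 : (double) current / previous);
            }
            previous = current;
        }
        return result;
    }

    /** Threshold alert: steps whose conversion is below the given floor. */
    List<String> alarms(double floor) {
        List<String> steps = new ArrayList<>();
        for (Map.Entry<String, Double> entry : stepConversion().entrySet()) {
            if (entry.getValue() < floor) {
                steps.add(entry.getKey());
            }
        }
        return steps;
    }

    public static void main(String[] args) {
        FunnelMonitor funnel = new FunnelMonitor();
        funnel.record("cart", 1000);
        funnel.record("checkout", 400);
        funnel.record("pay", 380);
        System.out.println(funnel.stepConversion());
        System.out.println("Below 0.5: " + funnel.alarms(0.5));
    }
}
```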

Client Disaster Recovery Solution

After a performance or business anomaly occurs, how should we resolve it? The traditional process goes through multiple steps such as user feedback, repackaging, and channel updates, which is cumbersome and slow to reach users. We can build a client-side disaster recovery solution from the following angles:

  1. For newly added features or refactored code, support a switch controlled through the configuration center, so that the feature can be turned off promptly if a problem occurs.
  2. At the same time, if all pages of our app are navigated through routing, we can dynamically configure the routes to jump to a unified error handling page, or to a temporary H5 page.
  3. Fix bugs through hotfix technology, for example by integrating Tencent's Tinker or Meituan's Robust.
  4. If your project uses RN or Weex, you can ship incremental updates directly.
  5. If the crash occurs during app startup, dynamic updates and dynamic configuration no longer work, and safe mode is needed. Safe mode recovers automatically based on the crash information, and after multiple startup failures it resets the application to its freshly installed state. For a particularly serious bug, a blocking hotfix can also be used: the user can enter the app only after the hotfix succeeds. Safe mode can be applied not only to the app as a whole but also to individual components: if a component reports errors repeatedly, a fallback page can be shown instead.
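
The safe mode in item 5 can be sketched as a counter of consecutive launches that never finished startup. The thresholds, class name, and action names below are assumptions, and the in-memory counter stands in for a value that a real app would persist across process restarts.

```java
// Minimal sketch: decide a recovery action from consecutive startup failures.
final class SafeModeGuard {
    enum Action { NORMAL, RESET_TO_INSTALL_STATE, BLOCKING_HOTFIX }

    private final int resetThreshold;
    private final int blockThreshold;
    private int pendingLaunches; // stand-in for a counter persisted on disk

    SafeModeGuard(int resetThreshold, int blockThreshold) {
        if (resetThreshold <= 0 || blockThreshold < resetThreshold) {
            throw new IllegalArgumentException("thresholds must satisfy 0 < reset <= block");
        }
        this.resetThreshold = resetThreshold;
        this.blockThreshold = blockThreshold;
    }

    /** Call at the very start of launch; returns what this launch should do. */
    Action onLaunch() {
        int failedBefore = pendingLaunches; // earlier launches that never completed
        pendingLaunches++;                  // assume failure until startup completes
        if (failedBefore >= blockThreshold) {
            return Action.BLOCKING_HOTFIX;
        }
        if (failedBefore >= resetThreshold) {
            return Action.RESET_TO_INSTALL_STATE;
        }
        return Action.NORMAL;
    }

    /** Call once startup has finished without crashing. */
    void onStartupCompleted() {
        pendingLaunches = 0;
    }

    public static void main(String[] args) {
        SafeModeGuard guard = new SafeModeGuard(3, 5);
        for (int launch = 1; launch <= 6; launch++) {
            System.out.println("launch " + launch + ": " + guard.onLaunch());
        }
    }
}
```

Incrementing before startup runs, and resetting only after it succeeds, means a crash anywhere in startup is counted even though the crashing process never gets a chance to record it.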

Common interview questions on stability optimization

The following are mock interview questions on stability optimization.

What stability optimizations have you made?

Reference answer:

As the project matured, the user base grew and DAU kept increasing. We ran into many stability problems, which posed many challenges for our engineers: users often experienced freezes or unavailable functions when using our app. So we launched a dedicated stability optimization effort, focused on three areas:

  • Crash-specific optimization
  • Performance stability optimization
  • Business stability optimization

Through optimization in these three areas, we built a high-availability platform for mobile. We also took many measures to make the app truly highly available.

How is performance stability done?

Reference answer:

  • Comprehensive performance optimization: startup speed, memory optimization, rendering optimization
  • Offline: find problems and focus on optimizing them
  • Online: focus on monitoring
  • Crash-specific optimization

We made multi-dimensional optimizations covering startup speed, memory, layout loading, jank, package size, network traffic, and power consumption.

Our optimization is divided into two levels, offline and online. Offline, we focus on finding problems and fixing them directly, aiming to solve as many problems as possible before release. Once the app is actually online, our main goal is monitoring: by monitoring each performance dimension, we can be alerted to anomalies as early as possible.

At the same time, for the most serious online performance problem, crashes, we did a dedicated optimization. We not only optimized the crash metrics themselves, but also captured as much detailed information as possible when a crash occurs, combined with back-end aggregation, alerting, and other functions, so that we can locate problems quickly.

How to ensure business stability?

Reference answer:

  • Data collection + alerting
  • Monitor the project's main flow and core paths
  • Also track how many exceptions occur at each step, so that we know the conversion rate of every business flow and of the corresponding screens
  • Combined with the overall dashboard, raise an alert if a conversion rate falls below a certain value
  • Exception monitoring + single-point tracing
  • Fallback strategies, such as Tmall's safe mode

High availability of mobile business focuses on users being able to use features completely. It mainly addresses online anomalies that involve neither crashes nor performance problems, where a feature simply does not work. We need to add tracking points to the project's main flow and core paths to calculate the real conversion rate of each step, and we also need to know how many exceptions occur at each step. This gives us the conversion rate of every business flow and of the corresponding screens. Combined with the overall dashboard data, if a conversion rate or a monitored success rate falls below a certain value, an online anomaly has very likely occurred. With the corresponding alerting in place, we don't need to wait for user feedback. This is the foundation of business stability.

At the same time, there are some special cases. For example, some code blocks are wrapped in try/catch so that the exception is caught and the program does not crash. This is actually unreasonable: although the program did not crash, the feature has already become unavailable at that point. So we also need to report these caught exceptions, so that we know which problems caused exceptions for users. In addition, there are single-point problems online. For example, a user cannot log in after tapping the login button. This is a single-point problem: we cannot find anything it has in common with other problems, so we need to retrieve its specific details.
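
Reporting caught exceptions can be made routine with a small wrapper, so that a catch block never silently hides a broken feature. The sketch below is one possible shape; the class name, the scene label, and the `Consumer<String>` reporter are hypothetical placeholders for whatever reporting channel the project uses.

```java
import java.util.function.Consumer;

// Minimal sketch: run an action, report any caught exception, and signal failure.
final class GuardedCall {
    private GuardedCall() {}

    /**
     * Runs the action. If it throws, the exception is reported together with
     * the business scene and false is returned, so the caller can show a fallback.
     */
    static boolean run(String scene, Runnable action, Consumer<String> reporter) {
        try {
            action.run();
            return true;
        } catch (RuntimeException e) {
            reporter.accept(scene + ": " + e.getClass().getSimpleName() + ": " + e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) {
        boolean ok = run("login",
                () -> { throw new IllegalStateException("token missing"); },
                System.out::println);
        System.out.println("login succeeded: " + ok);
    }
}
```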

Finally, if an abnormal situation occurs, we have also adopted a series of measures to quickly stop the loss.

If an abnormal situation occurs, how to quickly stop the loss?

Reference answer:

  • Feature switches
  • Unified routing center
  • Dynamic fixes: hotfixes, resource pack updates
  • Self-healing: safe mode

First of all, the app needs some advanced capabilities. For any new feature to be launched, we add a feature switch. The switch delivered by the configuration center determines whether the entry point of the new feature is shown. If an anomaly occurs, the entry point can be closed urgently, which keeps the app in a controllable state.

Then, we need to set up route-based navigation in the app: all page navigation is dispatched through the router. If a navigation target matches a new feature that has a bug, we either do not navigate or navigate to a unified exception handling page. If neither method works, consider a dynamic fix through hotfix. Current hotfix solutions are quite mature, and we can add hotfix capability to our project at low cost. Of course, if some features are implemented with RN or Weex, so much the better: they can be updated dynamically by updating the resource package. And if none of these are possible, consider adding a self-healing capability to the app itself. If the app fails to start multiple times in a row, consider clearing all cached data and resetting the app to its freshly installed state. At the most serious level, the main thread can be blocked: the user is allowed in only after the app's hotfix succeeds.
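
The first two measures, feature switches and route-based fallback, can be sketched together. Everything named below (the class, the route strings, the error page, and the default of treating unconfigured features as enabled) is an assumption for illustration; in practice the switch and redirect values would be delivered by the configuration center.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch: resolve a navigation target using feature switches and redirects.
final class RouteCenter {
    static final String ERROR_PAGE = "app://error"; // hypothetical unified error page

    private final Map<String, Boolean> featureSwitches = new ConcurrentHashMap<>();
    private final Map<String, String> redirects = new ConcurrentHashMap<>();

    /** Applies a switch value, as it would arrive from the configuration center. */
    void setSwitch(String feature, boolean enabled) {
        featureSwitches.put(feature, enabled);
    }

    /** Configures where a route should go while its feature is switched off. */
    void setRedirect(String route, String target) {
        redirects.put(route, target);
    }

    /** Features without a configured switch are treated as enabled (an assumption). */
    boolean isEnabled(String feature) {
        return featureSwitches.getOrDefault(feature, true);
    }

    /** Returns the route to open: the original, a configured redirect, or the error page. */
    String resolve(String route, String feature) {
        if (isEnabled(feature)) {
            return route;
        }
        return redirects.getOrDefault(route, ERROR_PAGE);
    }

    public static void main(String[] args) {
        RouteCenter router = new RouteCenter();
        System.out.println(router.resolve("app://pay", "new_pay"));
        router.setSwitch("new_pay", false);
        System.out.println(router.resolve("app://pay", "new_pay"));
        router.setRedirect("app://pay", "https://example.com/h5/pay");
        System.out.println(router.resolve("app://pay", "new_pay"));
    }
}
```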

Summary

This article introduced the correct understanding of Android stability, how to handle crashes, long-term crash governance, and building a business high-availability solution, along with some ideas and solutions for stability optimization.


When doing performance optimization and monitoring, you will find many knowledge points related to the underlying Framework. Therefore, while learning performance optimization and performance monitoring, we also need to learn and understand the underlying principles of the Framework. For reference:

Android performance tuning study notes:https://qr18.cn/FVlo89

Android Framework core notes:https://qr18.cn/AQpN4J

Origin blog.csdn.net/maniuT/article/details/129951910