Author: Programmer Jiang
Foreword
The stability of an Android app is an important performance indicator, and it is also the most basic and critical part of an app quality system. If the app crashes frequently, or if key features are unavailable, retention clearly suffers.
In order to ensure the stability of the application, we should first establish a correct understanding of stability. This article mainly includes the following contents:
- Correct Understanding of Stability Optimization
- General steps in processing Crash
- Crash long-term governance
- Business High Availability Solution Construction
- Stability optimization common interview questions
Correct Understanding of Stability Optimization
Key metrics for stability optimization
To optimize stability, the first question is what result we are aiming for. What Crash rate counts as excellent? Only once the goal is clear can we correctly judge the value of our work.
To calculate the Crash rate, we should first understand some key stability metrics.
UV Crash rate vs. PV Crash rate
PV (Page View) is the number of visits. UV (Unique Visitor) is the number of unique visitors: the same device is counted only once between 00:00 and 24:00.
- UV Crash rate: based on the number of users. It is the proportion of users who experienced a crash within a period of time, and is used to evaluate how widely Crash affects users.
- PV Crash rate: based on the number of times the app is used. It is used to evaluate how severe the impact of Crash is.
Choose the metric that fits your needs. Whichever you choose, make sure you always use the same measurement method.
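As a minimal sketch of the two metrics, the helper below divides crashed users by active users for the UV rate and crash count by launch count for the PV rate. The class name CrashRate and the choice of "launches" as the PV denominator are assumptions for illustration; a monitoring platform may define the denominators differently.

```java
/** Hypothetical helper, not part of any monitoring SDK. */
final class CrashRate {
    private CrashRate() {}

    /** UV crash rate: users who crashed at least once / active users in the same period. */
    static double uvRate(long crashedUsers, long activeUsers) {
        return ratio(crashedUsers, activeUsers);
    }

    /** PV crash rate: number of crashes / number of launches (assumed denominator). */
    static double pvRate(long crashCount, long launchCount) {
        return ratio(crashCount, launchCount);
    }

    private static double ratio(long numerator, long denominator) {
        // Avoid division by zero when there is no traffic in the period.
        if (denominator <= 0) {
            return 0.0;
        }
        return (double) numerator / denominator;
    }
}
```

For example, CrashRate.uvRate(15, 10000) gives 0.0015, that is 1.5 per thousand.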
Crash rate evaluation
So how low does our App's Crash rate need to be to count as normal or excellent?
- The combined Java and Native crash rate must be below 2 per thousand.
- A Crash rate at the per-ten-thousand level is considered excellent.
Note that the figures above refer to the UV crash rate.
Dimensions of Stability Optimization
Many people think that stability optimization just means reducing the Crash rate. But if your App does not crash and yet its key functions are unavailable, can it really be considered stable?
Therefore, application stability can be divided into three dimensions:
- 1. Crash dimension: the most important metric is the application's Crash rate.
- 2. Performance dimension: this includes optimization directions such as startup speed, memory, and drawing. These are secondary compared with Crash, but are still part of application stability.
- 3. Business high availability dimension: this is a critical one. We need to use various means to ensure the stability of our App's main flows and core paths.
General steps in processing Crash
Let's look at how to deal with a Crash: if the application crashes, how should you analyze it?
The analysis has two parts: the crash site and crash analysis.
Crash site
The crash site is our "first crime scene" and it holds many valuable clues. The more information we dig out here, the clearer the direction of the next analysis, instead of relying on blind guessing.
Next, let's take a look at what information should be collected at the crash site.
Crash information
From the basic crash information, we can form a preliminary judgment of the crash.
- Process name, thread name. Is the crashing process a foreground or background process, and did the crash occur on the UI thread?
- Crash stack and type. Is it a Java crash, a Native crash, or an ANR? We focus on different things for different crash types. In particular, look at the top of the crash stack to see whether the crash is in system code or in our own code.
System information
In addition to the crash information, the system information sometimes contains key clues that are very helpful in solving the problem.
- Logcat output. This includes logs from the application and from the system. Sometimes the stack alone tells you little, but Logcat can yield unexpected clues.
- Model, system, manufacturer, CPU, ABI, Linux version, etc. We collect as many as dozens of dimensions, which helps greatly when looking for commonalities later.
- Device status: whether it is rooted, whether it is an emulator. Some problems are caused by Xposed or multi-instance (app cloning) software, and we need to treat these problems differently.
Memory information
OOM, ANR, virtual memory exhaustion, and so on: many crashes are directly related to memory. If we divide users' phones into two buckets, "below 2GB" and "above 2GB" of memory, we find that the crash rate of "below 2GB" users is several times that of "above 2GB" users.
- System remaining memory. The system memory status can be read directly from the file /proc/meminfo. When the system's available memory is very low (less than 10% of MemTotal), problems such as OOM, heavy GC, and the system frequently killing and restarting processes become very likely.
- Application memory usage. This includes Java memory, RSS (Resident Set Size), and PSS (Proportional Set Size), from which we can learn the size and distribution of the application's own memory.
- Virtual memory. The virtual memory size can be obtained from /proc/self/status, and its detailed distribution from the file /proc/self/maps. We generally don't pay much attention to virtual memory, but many problems such as OOM and tgkill are caused by insufficient virtual memory.
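Name: com.sample.name // process name
FDSize: 800 // number of file handles allocated by the process
VmPeak: 3004628 kB // peak virtual memory size of the process
VmSize: 2997032 kB // current virtual memory size of the process
Threads: 600 // number of threads in the process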
Generally speaking, for a 32-bit process on a 32-bit CPU, memory allocation may start to fail once virtual memory reaches 3GB. On a 64-bit CPU, the virtual memory available to a 32-bit process is generally between 3 and 4GB. Of course, if we support 64-bit processes, virtual memory is no longer an issue. Therefore, our application should support 64-bit wherever possible.
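The fields shown above can be read programmatically. The following is a minimal sketch, assuming the text of /proc/self/status has already been read into a string; the class ProcStatus and its method names are hypothetical, and the limit passed to exceedsVirtualMemory is whatever budget you choose to warn at.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical parser for the "Key: value" lines of /proc/<pid>/status. */
final class ProcStatus {
    private ProcStatus() {}

    /** Splits each line at the first colon and trims both sides. */
    static Map<String, String> parse(String statusText) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : statusText.split("\n")) {
            int colon = line.indexOf(':');
            if (colon <= 0) {
                continue; // skip blank or malformed lines
            }
            fields.put(line.substring(0, colon).trim(), line.substring(colon + 1).trim());
        }
        return fields;
    }

    /** Returns the leading number of a field such as "2997032 kB", or -1 if absent or malformed. */
    static long number(Map<String, String> fields, String key) {
        String value = fields.get(key);
        if (value == null) {
            return -1;
        }
        try {
            return Long.parseLong(value.split("\\s+")[0]);
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    /** True when VmSize (in kB) has reached the given limit (in kB). */
    static boolean exceedsVirtualMemory(Map<String, String> fields, long limitKb) {
        long vmSizeKb = number(fields, "VmSize");
        return vmSizeKb >= 0 && vmSizeKb >= limitKb;
    }
}
```

On a device, the input could come from reading the file with java.nio.file.Files; a 3GB limit corresponds to 3L * 1024 * 1024 kB.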
Resource information
Sometimes the application heap and the device memory are both plentiful, yet memory allocation still fails. This is most likely related to a resource leak.
- File handles (fd). Generally, a single process may open at most 1024 file handles. Once the count exceeds 800, the situation is already dangerous: output every fd and its corresponding file name to the log, and check further for file or thread leaks.
- Threads. A single thread may occupy 2MB of virtual memory, so too many threads put pressure on both virtual memory and file handles. In my experience, more than 400 threads is dangerous. Output the id and name of every thread to the log, and check further for thread-related problems.
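A rough sketch of the thread check described above, using only standard Java APIs. Note the assumptions: ThreadSnapshot is a hypothetical name, the 400 threshold is the rule of thumb from the text, and Thread.getAllStackTraces() only sees threads attached to the Java runtime, so purely native threads would need a different source.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper that lists live Java threads for a crash report. */
final class ThreadSnapshot {
    /** Rule-of-thumb limit taken from the text above. */
    static final int DANGEROUS_THREAD_COUNT = 400;

    private ThreadSnapshot() {}

    /** Returns one "id name" line per live Java thread, sorted for stable output. */
    static List<String> describeThreads() {
        List<String> lines = new ArrayList<>();
        for (Thread thread : Thread.getAllStackTraces().keySet()) {
            lines.add(thread.getId() + " " + thread.getName());
        }
        Collections.sort(lines);
        return lines;
    }

    /** True when the thread count is above the rule-of-thumb limit. */
    static boolean isDangerous(int threadCount) {
        return threadCount > DANGEROUS_THREAD_COUNT;
    }
}
```

Grouping the resulting lines by name prefix is usually enough to spot which thread pool or component is leaking.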
Application information
Beyond the system, the application knows itself best and can leave behind a lot of relevant information.
- Crash scenario. In which Activity or Fragment, and in which business flow, did the crash occur?
- Key operation path. Unlike the detailed instrumentation logs used during development, here we record the user's key operation path, which is a great help when reproducing a crash.
- Other custom information. Different applications have different concerns. For example, NetEase Cloud Music cares about the music currently playing, and QQ Browser cares about the currently opened URL or video. In addition, information such as uptime, whether a patch is loaded, and whether it is a fresh installation or an upgrade is also very important.
The above covers the information that should be collected at the crash site. Building such a collection platform yourself is still quite complex, and in most cases it is enough to integrate a third-party platform such as Bugly or Sentry. Still, this overview tells us what to focus on when analyzing crashes, and where a platform lacks a capability we can add custom reporting.
Crash analysis
After enough information has been reported from the crash site, we can begin to analyze the crash. Below we introduce the three steps of crash analysis.
Step 1: Determine your focus
To determine the focus, the key is to find the important information in the log and form a rough judgment of the problem. In general, I suggest focusing on the following points in this step.
- Confirm severity and priority. Fixing crashes is also a matter of cost-effectiveness. We give priority to the Top crashes and to crashes with a major business impact.
- Basic crash information. Determine the crash type and the exception description, and form a rough judgment of the crash. Generally speaking, most simple crashes can be resolved after this step.
  - Java crash. The type of a Java crash is usually obvious, for example NullPointerException means a null pointer and OutOfMemoryError means insufficient resources. In the latter case, check the "memory information" and "resource information" in the log further.
  - Native crash. Look at signal, code, fault addr, etc., as well as the Java stack at the time of the crash. For the meaning of each signal, see an introduction to crash signals. The most common are SIGSEGV and SIGABRT: the former is generally caused by null or illegal pointers, and the latter mainly by ANR or by calling abort() to exit.
  - ANR. My experience is to look at the main thread's stack first, to see whether it is waiting on a lock. Then look at iowait, CPU, GC, system server and other information in the ANR log to determine whether it is an I/O problem, a CPU contention problem, or heavy GC causing the freeze.
- Logcat. Logcat generally holds valuable clues, and logs at the Warning and Error levels need special attention. From Logcat we can see some system behaviors and the phone's state at the time. For example, "am_anr" appears when an ANR occurs, and "am_kill" appears when the App is killed. The logs output by different systems and manufacturers differ. When a single crash log reveals neither the cause nor any useful information, don't give up: check more crash logs from the same crash point.
- Resource status. Combined with the basic crash information, check whether the crash is related to "memory information" or "resource information", for example insufficient physical memory, insufficient virtual memory, or leaked file handles (fd).
Whether in the resource files or in Logcat, memory- and thread-related information needs special attention, because many crashes are caused by their improper use.
Step 2: Find commonalities
If the above still fails to locate the problem, we can look for commonalities among such crashes. Once a commonality is found, the differences become easier to find, and the solution is one step closer.
Model, system, ROM, manufacturer, ABI: all of this collected system information can serve as aggregation dimensions. Typical commonalities include whether Xposed is installed, whether the crash only occurs on x86 phones, only on Samsung models, or only on Android 5.0. Application information can also serve as aggregation dimensions, such as the link being opened, the video being played, or the country and region. Finding a commonality gives clearer guidance for reproducing the problem in the next step.
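The aggregation idea can be sketched as follows: count crash reports per value of one dimension, then check whether a single value dominates. The class CrashAggregator, the representation of a report as a map of strings, and dimension names such as "osVersion" are all assumptions for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical aggregator over crash reports stored as key/value maps. */
final class CrashAggregator {
    private CrashAggregator() {}

    /** Counts reports per value of one dimension, e.g. "osVersion" or "abi". */
    static Map<String, Integer> countBy(List<Map<String, String>> reports, String dimension) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map<String, String> report : reports) {
            String value = report.getOrDefault(dimension, "unknown");
            counts.merge(value, 1, Integer::sum);
        }
        return counts;
    }

    /**
     * Returns the value whose share of all reports reaches minShare, or null if none does.
     * minShare is expected to be above 0.5 so that at most one value can qualify.
     */
    static String dominantValue(Map<String, Integer> counts, double minShare) {
        int total = 0;
        for (int count : counts.values()) {
            total += count;
        }
        if (total == 0) {
            return null;
        }
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if ((double) entry.getValue() / total >= minShare) {
                return entry.getKey();
            }
        }
        return null;
    }
}
```

A dominant value is only a hint: it should be compared with that value's share among all users before concluding that it is the cause.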
Step 3: Try to reproduce
If we already know the cause of the crash, we still need to try to reproduce it in order to confirm more details. If we have no clue at all, we hope to reproduce it by following the user's operation path, and then analyze the cause.
"As long as I can reproduce it locally, I can fix it" is something many developers and testers have said. Such confidence comes mainly from the fact that, with a stable reproduction path, we can analyze further by adding logs or by using tools such as Debugger and GDB.
System crash resolution
Sometimes a crash is caused not by our application but by the system, and system crashes often leave us feeling helpless. It may be a bug in a certain Android version, or it may be caused by a certain manufacturer's ROM modifications. In this case the crash stack may contain none of our own code, which makes the problem hard to locate directly. For such difficult problems, we can try the following approaches.
- Look for possible causes. Using the commonality classification above, first check whether the problem is specific to a certain system version or a certain manufacturer's ROM. Although the crash log may not contain our own code, the operation path and the logs can reveal some suspicious points.
- Try to avoid it. Review the suspicious code calls to see whether an inappropriate API is used and whether a different implementation can avoid the problem.
- Solve it with Hook. Once the cause is truly understood, the last resort is to Hook the system code and modify its logic.
For example, we found a Toast-related system crash that only appeared on Android 7.0. It seemed that the window token was already invalid when the Toast was shown, possibly because the window had already been destroyed by the time the Toast needed to be displayed.
android.view.WindowManager$BadTokenException:
at android.view.ViewRootImpl.setView(ViewRootImpl.java)
at android.view.WindowManagerGlobal.addView(WindowManagerGlobal.java)
at android.view.WindowManagerImpl.addView(WindowManagerImpl.java)
at android.widget.Toast$TN.handleShow(Toast.java)
Why doesn't Android 8.0 have this problem? After checking the Android 8.0 source code, we found the following modification:
try {
    mWM.addView(mView, mParams);
    trySendAccessibilityEvent();
} catch (WindowManager.BadTokenException e) {
    /* ignore */
}
Therefore, we can follow the Android 8.0 approach and catch this exception directly. The key here is to find the Hook point: Toast has a field called mTN, which relies on a handler internally, and we only need to proxy that handler to catch the exception.
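The proxy idea can be illustrated without Android classes. In the sketch below, FakeToast and ShowHandler are hypothetical stand-ins for the framework class and its internal handler, and IllegalStateException stands in for BadTokenException; this is not the real Toast API. The hook reads the private field by reflection and swaps in a wrapper that delegates to the original and swallows the exception, mirroring the try/catch added in Android 8.0.

```java
import java.lang.reflect.Field;

/** Hypothetical stand-in for the callback that the framework invokes to show a window. */
interface ShowHandler {
    void handleShow();
}

/** Hypothetical stand-in for a framework class holding a private handler that may throw. */
final class FakeToast {
    private ShowHandler handler = () -> {
        throw new IllegalStateException("window token is not valid");
    };

    void show() {
        handler.handleShow();
    }
}

/** Replaces the private handler with a proxy that catches the exception. */
final class HandlerHook {
    private HandlerHook() {}

    static void install(FakeToast toast) {
        try {
            Field field = FakeToast.class.getDeclaredField("handler");
            field.setAccessible(true);
            ShowHandler original = (ShowHandler) field.get(toast);
            ShowHandler proxy = () -> {
                try {
                    original.handleShow();
                } catch (IllegalStateException e) {
                    // Ignored on purpose: the window is gone, so there is nothing to show.
                }
            };
            field.set(toast, proxy);
        } catch (ReflectiveOperationException e) {
            // Field names differ between versions; surface the failure instead of hiding it.
            throw new RuntimeException("hook failed", e);
        }
    }
}
```

On a real device the field names and types depend on the Android version, so a hook like this should be limited to the affected versions and should fail safely when reflection does not find what it expects.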
Crash long-term governance
The above describes the general steps for handling an online Crash, but the truly important stage of Crash governance comes before release. We need to start at the development stage and carry out systematic, long-term Crash governance.
Development stage
Long-term Crash governance needs to start at the development stage. In the long run, better code quality brings better stability. We can improve code quality from the following two angles:
- Unified coding standards, stronger coding skills, technical reviews, and a stronger CodeReview mechanism.
- Architecture optimization, capability convergence (encapsulating common operations), and unified fault tolerance. For example, the network library's utils can uniformly pre-validate the returned data, and stop processing right away if the data is invalid.
Testing stage
In addition to routine testing such as functional testing, automated testing, regression testing, and overwrite-install (upgrade) testing, we also need to test boundary cases such as special scenarios and device models, for example abnormal data returned by the server or server downtime.
Merge stage
- Once a feature is developed and about to be merged into the main branch, we first run compile checks and static scanning to find possible problems.
- After the scan, the branch still cannot be merged directly, because multiple branches may conflict. So we first do a pre-merge build: merge into a branch identical to the main branch, package it, and run automated regression tests on the main flows. Only after that passes do we merge into the main branch. This may sound troublesome, but these steps should all be automated.
Release stage
- In the release stage, we should run multiple rounds of gray release, gradually expanding from a small audience to a large one, so that problems are exposed early at the lowest cost.
- Gray releases should also be segmented by scenario and cover multiple dimensions. Targeted gray releases can be run for specific versions, device models, etc., to see whether the users most likely to hit problems actually do.
Operation and maintenance stage
- After release, stability still needs attention. This relies heavily on sensitive APM monitoring that raises alerts promptly when problems are found.
- If an abnormality occurs, roll back or degrade as the situation requires.
- If neither rollback nor degradation is possible, fix it through a hot fix. If the hot fix fails, the only option left is the local disaster recovery solution.
Business High Availability Solution Construction
Many people think that stability optimization just means reducing the Crash rate, but in fact another important dimension of stability optimization is the high availability of the business.
Business unavailability may not cause a crash, but it degrades the user experience, which directly affects our revenue.
Business High Availability Solution Construction
- Unlike Crash, business high availability requires us to collect the data ourselves. We need to map out the project's main flows, core paths, and key nodes, and add tracking points to them.
- For data collection, we can also use AOP to collect data and reduce the cost of manual instrumentation.
- Once the data is reported, we can build a data dashboard and compute the conversion rate of each step.
- Once the data is reported, we can also set up alerting strategies, such as threshold alerts, trend alerts (compared with the same period), and alerts on specific metrics (such as payment failures).
- At the same time, we can do exception monitoring, such as reporting exceptions swallowed by catch and abnormal logic. These do not cause crashes, but they still need our attention.
- For hard-to-solve problems, we can retrieve the full logs of specific users to gather more information.
- After discovering an abnormality, we can resolve it through fallback strategies, for example a feature switch controlled through the configuration center. When a new feature turns out to have a problem, we can hide it directly, or reconfigure the route to jump somewhere else.
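The conversion-rate and threshold-alert steps above can be sketched as follows. FunnelMonitor is a hypothetical name, and the input is assumed to be the number of users (or events) that reached each step of one flow, in order.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical funnel check over per-step counts of one business flow. */
final class FunnelMonitor {
    private FunnelMonitor() {}

    /** rates[i] is the share of step i that also reached step i + 1. */
    static double[] stepConversion(long[] stepCounts) {
        double[] rates = new double[Math.max(0, stepCounts.length - 1)];
        for (int i = 0; i < rates.length; i++) {
            rates[i] = stepCounts[i] == 0 ? 0.0 : (double) stepCounts[i + 1] / stepCounts[i];
        }
        return rates;
    }

    /** Returns the indexes of transitions whose conversion rate is below the threshold. */
    static List<Integer> stepsBelow(double[] rates, double threshold) {
        List<Integer> alerts = new ArrayList<>();
        for (int i = 0; i < rates.length; i++) {
            if (rates[i] < threshold) {
                alerts.add(i);
            }
        }
        return alerts;
    }
}
```

A fixed threshold is the simplest strategy. A trend alert would instead compare each rate with the value from the same period earlier.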
Client Disaster Recovery Solution
After a performance or business exception occurs, how should we resolve it? The traditional process goes through user feedback, repackaging, and channel updates, which is cumbersome and slow to respond to users. We can build a client-side disaster recovery solution from the following angles:
- For new features or refactored code, support a switch configured through the configuration center, so that the feature can be turned off promptly if a problem occurs.
- If all pages in our App are navigated through routing, we can dynamically reconfigure the route to jump to a unified error handling page or to a temporary h5 page.
- Fix bugs through hot fix technology, for example by integrating Tencent's Tinker or Meituan's Robust.
- If your project uses RN or Weex, you can ship incremental updates directly.
- If the crash occurs while the App is starting up, dynamic updates and dynamic configuration no longer work, and safe mode is needed. Safe mode recovers automatically based on the Crash information, and resets the application to its freshly installed state after multiple failed startups. For a particularly serious bug, a blocking hot fix can also be used: the user may enter the App only after the hot fix succeeds. Safe mode can be applied not only to the App as a whole but also to components: if a component reports errors multiple times, it can switch to a fallback page.
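The startup part of safe mode can be sketched as a counter of consecutive failed launches. In this sketch SafeModeGuard is a hypothetical name, the Map stands in for storage that survives a crash (on Android it would have to be written synchronously before any risky initialization), and what "reset" actually clears is left to the application.

```java
import java.util.Map;

/** Hypothetical guard that decides whether a launch should reset the app. */
final class SafeModeGuard {
    enum Action { NORMAL_START, RESET_TO_CLEAN_STATE }

    private static final String KEY = "consecutive_failed_launches";

    private final Map<String, Integer> store; // stand-in for crash-safe persistent storage
    private final int maxFailures;

    SafeModeGuard(Map<String, Integer> store, int maxFailures) {
        this.store = store;
        this.maxFailures = maxFailures;
    }

    /** Call first thing on launch: every launch counts as failed until proven otherwise. */
    Action onLaunchStarted() {
        int failures = store.merge(KEY, 1, Integer::sum);
        if (failures > maxFailures) {
            store.put(KEY, 0);
            return Action.RESET_TO_CLEAN_STATE;
        }
        return Action.NORMAL_START;
    }

    /** Call once startup has clearly survived, e.g. after the first screen has been stable. */
    void onLaunchSucceeded() {
        store.put(KEY, 0);
    }
}
```

With maxFailures set to 2, two launches that never reach onLaunchSucceeded are tolerated and the third launch returns RESET_TO_CLEAN_STATE.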
Stability optimization common interview questions
The following are mock interview questions on stability optimization.
What stability optimizations have you made?
Reference answer:
As the project matured, the user base grew and DAU kept rising, and we ran into many stability problems, which posed real challenges for our engineers. Users often experienced freezes or unavailable features while using our App, so we launched a dedicated stability optimization effort covering three areas:
- Crash special optimization
- Performance stability optimization
- Business stability optimization
Through these three areas we built a high-availability platform for mobile, and took many measures so that the App truly achieves high availability.
How is performance stability done?
Reference answer:
- Comprehensive performance optimization: startup speed, memory optimization, drawing optimization
- Offline: find problems and focus on optimization
- Online: focus on monitoring
- Crash special optimization
We have made multi-dimensional optimizations covering startup speed, memory, layout loading, jank, package size, network traffic, and power consumption.
Our optimization is divided into two levels, offline and online. Offline, we focus on finding problems and fixing them directly, aiming to solve as many problems as possible before release. Once online, our main purpose is monitoring: by monitoring each performance dimension, we can be alerted to abnormal situations as early as possible.
At the same time, for the most serious online performance problem, Crash, we did a special optimization. We not only optimized the specific Crash metrics, but also captured as much detailed information as possible when a Crash occurs, combined with back-end aggregation, alerting and other features, so that we can quickly locate the problem.
How to ensure business stability?
Reference answer:
- Data collection + alerting
  - Monitor the project's main flows and core paths with tracking points
  - Track how many exceptions occur at each step, so that we know the conversion rate of every business flow and of the corresponding screens
  - Combined with the dashboard, raise an alert when a conversion rate drops below a certain value
- Exception monitoring + single-point tracking
- Fallback strategies, such as Tmall's safe mode
High availability of mobile business focuses on whether user-facing features are fully usable. It mainly addresses online abnormalities in which the user sees no crash and no performance problem, yet a feature simply does not work. We need to instrument the project's main flows and core paths with tracking points to compute the real conversion rate of each step, and also to know how many exceptions occur at each step. That gives us the conversion rate of every business flow and of the corresponding screens. With the dashboard data, if a conversion rate or a monitored success rate falls below a certain value, an online abnormality has very likely occurred. Combined with alerting, we no longer need to wait for user feedback. This is the foundation of business stability.
In addition, there are some special cases. For example, during development some code blocks use catch to swallow an exception so that the program does not crash. This is not really sound: although the program did not crash, the feature has already become unavailable. So we also need to report the exceptions caught by these catch blocks, so that we know which problems caused them for users. There are also single-point problems online, for example a user who cannot log in after tapping the login button. Such a problem shares no commonality with other problems, so we have to retrieve the detailed information for that specific case.
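One way to keep swallowed exceptions visible is to route every such catch through a single helper that records the exception before returning a fallback value. The sketch below uses a hypothetical CaughtExceptionReporter with an in-memory list standing in for the real reporting channel.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical wrapper that reports exceptions instead of silently swallowing them. */
final class CaughtExceptionReporter {
    private final List<String> reported = new ArrayList<>(); // stand-in for an upload queue

    /** Runs the action; on failure records the scene and exception, then returns the fallback. */
    <T> T runOrFallback(String scene, Supplier<T> action, T fallback) {
        try {
            return action.get();
        } catch (RuntimeException e) {
            reported.add(scene + ": " + e.getClass().getSimpleName() + ": " + e.getMessage());
            return fallback;
        }
    }

    /** Returns a copy of what has been recorded so far. */
    List<String> reported() {
        return new ArrayList<>(reported);
    }
}
```

The scene label makes it possible to aggregate these reports per business flow, in the same way as crashes.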
Finally, if an abnormal situation occurs, we have also adopted a series of measures to quickly stop the loss.
If an abnormal situation occurs, how to quickly stop the loss?
Reference answer:
- Feature switches
- Unified routing (jump center)
- Dynamic fixes: hot fixes, resource package updates
- Self-healing: safe mode
First of all, the App needs some advanced capabilities. Every new feature we launch gets a feature switch: a switch delivered by the configuration center decides whether the new feature's entry point is shown. If something goes wrong, the entry point can be closed urgently, which keeps the App in a controllable state.
Then, we need to set up routing for the App. All page navigation is dispatched through the router. If a navigation targets a new feature that has a bug, we either cancel the jump or jump to the unified exception handling page. If neither approach is possible, consider a dynamic fix through hot fix. Hot fix solutions are now fairly mature, and we can add hot fix capabilities to a project at low cost. Better still, if some features are implemented with RN or WeeX, they can be updated dynamically by updating the resource package.
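The switch and routing fallback can be combined in one resolution step, sketched below. SafeRouter, the route strings, and the error page address are hypothetical; in a real App the two maps would be filled from the configuration center.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical router that applies feature switches and redirects before navigating. */
final class SafeRouter {
    static final String ERROR_PAGE = "app://error"; // hypothetical unified error page

    private final Map<String, Boolean> featureSwitches = new ConcurrentHashMap<>();
    private final Map<String, String> redirects = new ConcurrentHashMap<>();

    /** Applies a switch pushed from the configuration center. */
    void setFeatureEnabled(String route, boolean enabled) {
        featureSwitches.put(route, enabled);
    }

    /** Applies a redirect pushed from the configuration center, e.g. to a temporary h5 page. */
    void setRedirect(String route, String target) {
        redirects.put(route, target);
    }

    /** Returns the page that should actually be opened for the requested route. */
    String resolve(String route) {
        if (!featureSwitches.getOrDefault(route, true)) {
            return ERROR_PAGE; // feature switched off: go to the unified error page
        }
        return redirects.getOrDefault(route, route);
    }
}
```

Routes are enabled by default here, so only the routes that need intervention have to appear in the configuration.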
If none of these is possible, consider giving the application a self-healing capability. If the App fails to start several times in a row, consider clearing all cached data and resetting the App to its freshly installed state. At the most serious level, the main thread can be blocked: users must wait until the App's hot fix succeeds before they are allowed in.
Summary
This article covers the correct understanding of Android stability, how to handle a Crash, long-term Crash governance, and building business high-availability solutions, and introduces some ideas and solutions for stability optimization.
When working on performance optimization and monitoring, you will find that many knowledge points relate to the underlying Framework. Therefore, while learning performance optimization and performance monitoring, we also need to learn and understand the underlying principles of the Framework. For reference:
Android performance tuning study notes: https://qr18.cn/FVlo89