Android performance optimization - ANR monitoring and resolution

Author: Drummor

1 Where does the ANR come from?

ANR (Application Not Responding): an "Application Not Responding" (ANR) error is triggered if the UI thread of an Android app is blocked for too long. If the app is in the foreground, the system displays a dialog that gives the user the option to force quit the app.

ANRs are a problem because the app's main thread, which is responsible for updating the UI, cannot process user input events or draw, which frustrates users.

The above is the official description of ANR. In essence, ANR is a system-level self-protection mechanism against behavior that blocks input handling and UI drawing, the operations users perceive most directly.

1.1 Four conditions

The system triggers ANR at only four points. In other words, even if the main thread runs a very long task, no ANR is generated unless one of these trigger conditions is met.

An ANR is triggered for your app when any of the following conditions occur:

  • Input dispatching timeout: the threshold is 5 seconds.
  • Service execution timeout: a Service that does not finish creation and startup within the threshold triggers ANR. In AOSP the threshold is 20s for foreground services and 200s for background services.
  • ContentProvider publishing timeout: the threshold is 10s.
  • BroadcastReceiver execution timeout: in AOSP the threshold is 10s for foreground broadcasts and 60s for background broadcasts.

1.2 The general flow of ANR generated by the system

Take the BroadcastReceiver timeout as an example to see how the system triggers ANR.

1.2.1 BroadcastReceiver generates ANR

The core logic for delivering broadcasts to a BroadcastReceiver is located in BroadcastQueue:

public final class BroadcastQueue {
    final void processNextBroadcastLocked(boolean fromMsg, boolean skipOomAdj) {

        setBroadcastTimeoutLocked(timeoutTime); // arm the timeout
        ...
        performReceiveLocked(...); // eventually calls BroadcastReceiver.onReceive()
        ...
        cancelBroadcastTimeoutLocked(); // disarm the timeout
        ...
    }

    // Arm the timeout
    final void setBroadcastTimeoutLocked(long timeoutTime) {
        if (!mPendingBroadcastTimeoutMessage) {
            Message msg = mHandler.obtainMessage(BROADCAST_TIMEOUT_MSG, this);
            mHandler.sendMessageAtTime(msg, timeoutTime);
            mPendingBroadcastTimeoutMessage = true;
        }
    }

    // Disarm the timeout
    final void cancelBroadcastTimeoutLocked() {
        if (mPendingBroadcastTimeoutMessage) {
            mHandler.removeMessages(BROADCAST_TIMEOUT_MSG, this);
            mPendingBroadcastTimeoutMessage = false;
        }
    }
}

The above is the core logic for arming, canceling and triggering the broadcast ANR. The system posts a delayed "ANR task" through the Handler mechanism. If the broadcast receiver finishes its work within the allowed time, the ANR task is removed; otherwise the task runs and the ANR is triggered.
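The arm-then-cancel pattern described above can be sketched in plain Java, outside the Android framework. This is only an illustration: TimeoutGuardSketch, runGuarded and sleepQuietly are hypothetical names, and a ScheduledExecutorService stands in for the system Handler. It is not AOSP code.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical sketch: post a delayed timeout task, cancel it when the work finishes. */
final class TimeoutGuardSketch {

    /**
     * Runs work on the calling thread while a timer is armed.
     * Returns true if the timer fired before the work finished (the "ANR" case).
     */
    static boolean runGuarded(Runnable work, long timeoutMs) {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        AtomicBoolean timedOut = new AtomicBoolean(false);
        try {
            // Counterpart of setBroadcastTimeoutLocked(): arm the delayed task.
            ScheduledFuture<?> pending =
                    timer.schedule(() -> timedOut.set(true), timeoutMs, TimeUnit.MILLISECONDS);
            work.run();
            // Counterpart of cancelBroadcastTimeoutLocked(): disarm it once the work is done.
            pending.cancel(false);
            return timedOut.get();
        } finally {
            timer.shutdownNow();
        }
    }

    /** Sleeps without forcing callers to handle InterruptedException. */
    static void sleepQuietly(long ms) {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        System.out.println("fast work timed out: " + runGuarded(() -> sleepQuietly(10), 2000));
        System.out.println("slow work timed out: " + runGuarded(() -> sleepQuietly(400), 50));
    }
}
```

The point of the sketch is that the timeout task is posted before the work starts, so a long-running task that never hits one of these armed checkpoints never produces an ANR.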

1.2.2 System processing ANR

No matter which condition triggers the ANR, it is handed over to AnrHelper for processing. The core logic of this class starts a thread named "AnrConsumer", which executes appNotResponding() in ProcessErrorStateRecord.

 void appNotResponding(String activityShortComponentName, ApplicationInfo aInfo,
                          String parentShortComponentName, WindowProcessController parentProcess,
                          boolean aboveSystem, String annotation, boolean onlyDumpSelf) {
        ArrayList<Integer> firstPids = new ArrayList<>(5);
        SparseArray<Boolean> lastPids = new SparseArray<>(20);
        ...
        setNotResponding(true); // set the ANR flag
        ...
        firstPids.add(pid);
        ...
        isSilentAnr = isSilentAnr(); // true when the ANR happened in a background app
        if (!isSilentAnr && !onlyDumpSelf) { // foreground process, and not limited to dumping itself
            mService.mProcessList.forEachLruProcessesLOSP(false, r -> {
                ...
                firstPids.add(pid); // add other processes
            });
        }
        ...
        StringBuilder report = new StringBuilder();
        report.append(MemoryPressureUtil.currentPsiState()); // memory pressure information
        ProcessCpuTracker processCpuTracker = new ProcessCpuTracker(true); // CPU information

        nativePids.add(..); // add native processes
        ...
        File tracesFile =  ..
        report.append(tracesFileException.getBuffer());
        info.append(processCpuTracker.printCurrentState(anrTime));

        if (tracesFile == null) {
            Process.sendSignal(pid, Process.SIGNAL_QUIT); // nothing was dumped: send SIGNAL_QUIT directly
        }
        ...
        File tracesFile = ActivityManagerService.dumpStackTraces(..); // dump the stacks
        ...
        mService.mUiHandler.sendMessageDelayed(msg, anrDialogDelayMs); // show the ANR dialog
    }

Noteworthy points

  • Silent ANR: a foreground ANR shows a not-responding dialog, while a background ANR kills the process directly.
  • When dumping, the system dumps the ANR process first, then other associated processes and native processes if conditions allow. If the system has many ANRs queued and handling is delayed by more than 60s, or if it is a silent ANR, only the ANR process itself is dumped.
  • The total time for dumping all processes cannot exceed 20 seconds. If it does, the dump returns immediately so that the ANR dialog can be shown (or the process killed) in time.
  • Process.sendSignal(pid, Process.SIGNAL_QUIT): the system sends the SIGQUIT signal. This is very important for monitoring.

(picture from WeChat team)

When an ANR occurs in an application, the system collects a set of processes and dumps their stacks to generate the ANR trace file. The first process collected, and the one that is always collected, is the process in which the ANR occurred. The system sends SIGQUIT to each collected process, and the process starts dumping its stacks after receiving the signal.

2 How to monitor ANR at the application layer

Since Android M (6.0), an application can no longer detect ANRs directly by watching the trace files under /data/anr.

2.1 WatchDog scheme

We also introduced this solution in the jank monitoring article. The main idea is timeout detection: check whether the main thread's MessageQueue processes a given message within a specified time (5s). If the message is not processed in time, an ANR is assumed to have occurred.

The disadvantages of using this scheme to detect ANR are:

  • Inaccurate: hitting the timeout does not necessarily mean an ANR was generated. The 5-second timeout is only the ANR condition for a TouchEvent that is not consumed; the other ANR conditions do not use 5s.
  • Not elegant: the scheme keeps the main thread's message scheduling permanently busy, which adds unnecessary power consumption and load.
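A minimal sketch of the watchdog idea, written in plain Java so it can run outside Android. The names WatchdogSketch, newLooperStandIn and isBlocked are hypothetical, and a single-thread executor stands in for the main thread's Looper; on Android the marker would be posted to a Handler bound to the main Looper instead.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical watchdog sketch: post a marker task and check whether it ran in time. */
final class WatchdogSketch {

    /** A single daemon thread that processes tasks in order, standing in for the main Looper. */
    static ExecutorService newLooperStandIn() {
        return Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "fake-main");
            t.setDaemon(true);
            return t;
        });
    }

    /** Returns true if the marker task was NOT processed within timeoutMs. */
    static boolean isBlocked(ExecutorService looper, long timeoutMs) {
        CountDownLatch processed = new CountDownLatch(1);
        looper.execute(processed::countDown);
        try {
            return !processed.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Sleeps without forcing callers to handle InterruptedException. */
    static void sleepQuietly(long ms) {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        ExecutorService looper = newLooperStandIn();
        System.out.println("idle looper blocked: " + isBlocked(looper, 1000));
        looper.execute(() -> sleepQuietly(800)); // simulate a long task on the "main thread"
        System.out.println("busy looper blocked: " + isBlocked(looper, 100));
        looper.shutdownNow();
    }
}
```

The sketch also makes the first drawback visible: a blocked marker only shows that the thread was busy for the chosen timeout, not that the system actually raised an ANR.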

2.2 Monitor signal scheme ( SIGQUIT )

When an ANR occurs, the system sends a SIGQUIT signal. By listening for this signal we can tell that an ANR may have occurred. This is also the mainstream ANR monitoring solution in the industry.

Except for the Zygote process, every process has a SignalCatcher thread, which catches SIGQUIT and acts on it. Android sets SIGQUIT to BLOCKED by default, which means only the SignalCatcher thread can receive SIGQUIT, and a handler we register with sigaction is never called. We therefore set SIGQUIT to UNBLOCK so that our handler can receive the signal. Note that after handling it we must resend the signal so that the system's own mechanism is not broken.

2.2.1 False Positives & Improvement

A SIGQUIT sent by the system does not necessarily mean that this application has an ANR. The signal is also sent in other cases, for example when an ANR occurs in another process.

Find the answer in the source code

    private void makeAppNotRespondingLSP(String activity, String shortMsg, String longMsg) {
        setNotResponding(true);
        // mAppErrors can be null if the AMS is constructed with injector only. This will only
        // happen in tests.
        if (mService.mAppErrors != null) {
            mNotRespondingReport = mService.mAppErrors.generateProcessError(mApp,
                    ActivityManager.ProcessErrorStateInfo.NOT_RESPONDING, // mark the process in which the ANR occurred
                    activity, shortMsg, longMsg, null);
        }
        startAppProblemLSP();
        mApp.getWindowProcessController().stopFreezingActivities();
    }

When an ANR occurs, the system marks the affected process as NOT_RESPONDING. At the application layer we can check this state through ActivityManager. The code is as follows:

private static boolean checkErrorState() {
    try {
        // `application` is the app's Application instance, held elsewhere in this class.
        ActivityManager am = (ActivityManager) application.getSystemService(Context.ACTIVITY_SERVICE);
        List<ActivityManager.ProcessErrorStateInfo> procs = am.getProcessesInErrorState();
        if (procs == null) return false; // no process is currently in an error state
        for (ActivityManager.ProcessErrorStateInfo proc : procs) {
            if (proc.pid != android.os.Process.myPid()) continue; // belongs to another process
            if (proc.condition != ActivityManager.ProcessErrorStateInfo.NOT_RESPONDING) continue; // not an ANR
            return true;
        }
        return false;
    } catch (Throwable t) {
        // Best effort: treat any failure as "flag not set".
    }
    return false;
}

After receiving the SIGQUIT signal, check this state repeatedly for a period of time. If the flag is found, an ANR can be considered to have occurred in the current process.
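The repeated check can be sketched as a small polling helper. This is a plain-Java illustration under assumptions: ErrorStatePollerSketch and pollUntil are hypothetical names, and the timeout and interval are placeholders rather than values from the article. On Android the supplier would wrap a call such as checkErrorState() and the loop would run off the main thread.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical helper: poll a condition until it holds or a deadline passes. */
final class ErrorStatePollerSketch {

    /**
     * Calls check every intervalMs until it returns true or timeoutMs has elapsed.
     * Returns true as soon as the check succeeds, false on timeout or interruption.
     */
    static boolean pollUntil(BooleanSupplier check, long timeoutMs, long intervalMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (true) {
            if (check.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() >= deadline) {
                return false;
            }
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Stand-in for checkErrorState(): the flag "appears" on the third poll.
        boolean found = pollUntil(() -> ++calls[0] >= 3, 2000, 10);
        System.out.println("flag found: " + found + " after " + calls[0] + " polls");
    }
}
```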

2.2.2 False Negative & Improvement

Some ANRs do not set the NOT_RESPONDING flag on the process:

  • Silent ANR (SilentAnr): a silent ANR kills the process directly and does not set this flag.
  • Crash-on-ANR: some OPPO and vivo models kill the app immediately after an ANR, and this flag is not set.

Solution: combine the signal with the stuck state of the main thread.

Use reflection to obtain the mMessages field of the main thread's MessageQueue. The when field of that Message is the time at which the message is expected to be processed. The difference between when and the current time is how long the message has been waiting. If that delay is too long, the main thread is "stuck".

If SIGQUIT is received while the main thread is stuck, an ANR is considered to have occurred.

2.2.3 Summary of ANR Monitoring

By listening for the system's SIGQUIT signal and combining it with the NOT_RESPONDING flag and the stuck state of the main thread, we can determine reliably whether an ANR has occurred in the process.
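The combined decision can be written down as a small pure function. The sketch below is an illustration only: AnrVerdictSketch and its methods are hypothetical names, and the 2000ms stuck threshold is an assumed placeholder, not a value from the article. On Android, Message.when uses the SystemClock.uptimeMillis() time base, so "now" should come from the same clock.

```java
/** Hypothetical sketch of the combined ANR verdict described above. */
final class AnrVerdictSketch {

    /** Assumed threshold: how long the head message may wait before the thread counts as stuck. */
    static final long STUCK_THRESHOLD_MS = 2_000;

    /**
     * headWhenMs is the expected dispatch time of the head message in the queue,
     * nowMs is the current time on the same clock.
     */
    static boolean isMainThreadStuck(long headWhenMs, long nowMs) {
        return nowMs - headWhenMs > STUCK_THRESHOLD_MS;
    }

    /** SIGQUIT alone is not enough; it must be confirmed by the flag or by a stuck main thread. */
    static boolean isAnr(boolean sigquitReceived, boolean notRespondingFlag, boolean mainThreadStuck) {
        return sigquitReceived && (notRespondingFlag || mainThreadStuck);
    }

    public static void main(String[] args) {
        boolean stuck = isMainThreadStuck(1_000, 4_500); // head message has waited 3500ms
        System.out.println("stuck: " + stuck);
        System.out.println("silent ANR detected: " + isAnr(true, false, stuck));
        System.out.println("SIGQUIT for another process: " + isAnr(true, false, false));
    }
}
```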

Knowing that an ANR occurred is only the first step. It is more important to find out what caused it, to collect context information at the time of the ANR, and to fix it.

3 Information collection and monitoring

3.1 Difficulties in ANR problem location

The difficulty with ANR diagnosis is that the collected information is often inaccurate and incomplete. What is captured at the moment the ANR fires is frequently not the real cause, so its value for troubleshooting is greatly reduced.

For example, a time-consuming task on the main thread may already have finished by the time a service start task exceeds its threshold and triggers the ANR. The information collected at that moment is the call information of a perfectly normal task.

In general, the causes of ANR are main-thread tasks that run too long and heavy system load.

Main-thread tasks that run too long can be roughly divided into the following types:

  • Several time-consuming historical messages together trigger the ANR.
  • One extremely time-consuming message exists among the historical messages.
  • A very large number of small messages together take a long time and trigger the ANR.

Heavy system load includes insufficient system memory and high CPU load, which prevent tasks from being scheduled.

If we can record the main thread's historical, current and pending message tasks, together with the system load over a period of time, we can diagnose ANR problems much more accurately.

3.2 Message scheduling monitoring

To record and monitor how the main thread executes Looper messages, the natural choice is the Looper Printer approach. It was introduced in the first two articles of the Sanbanfu series and is not expanded on here.

When the Looper dispatches a message, it prints information before and after execution. From this we can obtain the message's target, what and callback, as well as how long the message took to execute.

For message timing, both the WallTime and the ThreadTime of the main thread need to be collected.

  • WallTime: the elapsed time of the task, including time spent waiting for locks and time when the thread is asleep or not scheduled.
  • ThreadTime (CpuTime): the time the thread actually spent executing, excluding time spent waiting for locks. Comparing the two lets us infer the system load indirectly.
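The difference between the two clocks can be illustrated with a plain-Java sketch. MessageTimingSketch and measure are hypothetical names, and the JVM's ThreadMXBean is used here only so the example runs on a desktop JVM; on Android one would typically read SystemClock.uptimeMillis() and SystemClock.currentThreadTimeMillis() instead.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

/** Hypothetical sketch: measure wall time and thread CPU time around one task. */
final class MessageTimingSketch {

    /** Returns {wallMs, cpuMs}; cpuMs is -1 when the JVM cannot report thread CPU time. */
    static long[] measure(Runnable task) {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        boolean cpuSupported = bean.isCurrentThreadCpuTimeSupported();
        long wallStart = System.nanoTime();
        long cpuStart = cpuSupported ? bean.getCurrentThreadCpuTime() : 0L;
        task.run();
        long cpuNs = cpuSupported ? bean.getCurrentThreadCpuTime() - cpuStart : -1L;
        long wallNs = System.nanoTime() - wallStart;
        return new long[] {wallNs / 1_000_000L, cpuNs < 0 ? -1L : cpuNs / 1_000_000L};
    }

    /** Sleeps without forcing callers to handle InterruptedException. */
    static void sleepQuietly(long ms) {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        // A sleeping task has a large wall time but almost no CPU time.
        long[] r = measure(() -> sleepQuietly(100));
        System.out.println("wall=" + r[0] + "ms cpu=" + r[1] + "ms");
    }
}
```

A message whose wall time is much larger than its thread time was mostly waiting (for a lock, for IO, or for a CPU slice), which points toward contention or system load rather than heavy computation in the message itself.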

In most cases a message executes quickly, and the Looper also has an idle state in which no message is being executed. We need to aggregate these messages.

In addition, as described in the main-thread time-consumption article of the Sanbanfu series, the main thread does more than process messages dispatched normally by the Looper. IdleHandler and TouchEvent work must also be recorded for the statistics to be complete.

3.2.1 Message aggregation and classification

  • Continuous small messages are aggregated: consecutive messages that each take less than 50ms are merged into one record, which stores the number of messages, the total time and similar information.
  • A message that exceeds the threshold gets its own record.
  • System call messages (ActivityThread.H messages for Activity, Service and ContentProvider) are recorded separately, because they are very important for analyzing ANR problems.
  • IDLE state periods are counted separately.
  • IdleHandler executions are recorded.
  • Main-thread work triggered from the native layer, such as TouchEvent, is recorded.

In summary, messages are classified into three types: the aggregated type (Agree), for continuous messages that each take little time; the time-consuming type (Huge), for messages exceeding 50ms; and system call messages (SYSTEM).
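The aggregation rules above can be sketched in plain Java as follows. This is a simplified illustration under assumptions: MessageAggregatorSketch, Bucket and the enum names are hypothetical, a message of exactly 50ms is treated as time-consuming, and the IDLE, IdleHandler and TouchEvent records are left out for brevity.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: merge consecutive short messages, keep long and system ones separate. */
final class MessageAggregatorSketch {

    enum Type { AGGREGATE, HUGE, SYSTEM }

    static final long HUGE_THRESHOLD_MS = 50;

    /** One stored record: its type, how many messages it covers and their total wall time. */
    static final class Bucket {
        final Type type;
        int count;
        long totalMs;

        Bucket(Type type, int count, long totalMs) {
            this.type = type;
            this.count = count;
            this.totalMs = totalMs;
        }
    }

    private final List<Bucket> buckets = new ArrayList<>();

    /** Call once per finished message with its wall time and whether it came from the system. */
    void onMessageFinished(long wallMs, boolean fromSystem) {
        if (fromSystem) {
            buckets.add(new Bucket(Type.SYSTEM, 1, wallMs));
            return;
        }
        if (wallMs >= HUGE_THRESHOLD_MS) {
            buckets.add(new Bucket(Type.HUGE, 1, wallMs));
            return;
        }
        Bucket last = buckets.isEmpty() ? null : buckets.get(buckets.size() - 1);
        if (last != null && last.type == Type.AGGREGATE) {
            // Extend the current run of short messages.
            last.count++;
            last.totalMs += wallMs;
        } else {
            buckets.add(new Bucket(Type.AGGREGATE, 1, wallMs));
        }
    }

    List<Bucket> buckets() {
        return buckets;
    }

    public static void main(String[] args) {
        MessageAggregatorSketch agg = new MessageAggregatorSketch();
        agg.onMessageFinished(5, false);
        agg.onMessageFinished(10, false);
        agg.onMessageFinished(120, false);
        agg.onMessageFinished(8, true);
        for (Bucket b : agg.buckets()) {
            System.out.println(b.type + " count=" + b.count + " totalMs=" + b.totalMs);
        }
    }
}
```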

3.2.2 Message stack collection

Besides recording the what, callback and duration of each Looper Message, we also need to know what each message actually executed. This requires capturing the execution stack of the message. Capturing stacks frequently has a significant performance cost, so it must be done selectively:

  • Use a worker thread to capture the main thread's stack.
  • Do not capture stacks for messages that take little time.
  • When a message exceeds a threshold and is still running, capture its stack once. If it is still running after a further interval, capture again, and increase the interval linearly each time.
  • See also Shopee's non-blocking, efficient stack capture approach.
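One way to read the "linearly increasing interval" rule is sketched below in plain Java. The names StackSampleScheduleSketch and sampleOffsets, and the exact growth rule (each gap is one step longer than the previous one), are assumptions made for illustration; the article does not give concrete values. The function only computes when samples would be taken; the capture itself could use Thread.getStackTrace() on the main thread from a worker thread.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: compute sampling times for a long-running message. */
final class StackSampleScheduleSketch {

    /**
     * Returns the offsets (in ms since the message started) at which a stack sample
     * would be taken. The first sample is at thresholdMs; the gaps between later
     * samples grow linearly: stepMs, 2 * stepMs, 3 * stepMs, ...
     */
    static List<Long> sampleOffsets(long thresholdMs, long stepMs, long messageDurationMs) {
        if (stepMs <= 0) {
            throw new IllegalArgumentException("stepMs must be positive");
        }
        List<Long> offsets = new ArrayList<>();
        long next = thresholdMs;
        long gap = stepMs;
        while (next < messageDurationMs) {
            offsets.add(next);
            next += gap;
            gap += stepMs;
        }
        return offsets;
    }

    public static void main(String[] args) {
        // A message that runs for 1000ms, with a 100ms threshold and a 100ms step.
        System.out.println(sampleOffsets(100, 100, 1000));
    }
}
```

With these assumed values a one-second message is sampled four times (at 100, 200, 400 and 700ms) instead of ten times with a fixed 100ms interval, which is the point of backing off.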

3.2.3 Statistics of executing messages and pending messages

Besides the scheduling and timing of the main thread's historical messages before the ANR, we also need to know which message was being dispatched when the ANR occurred and how long it had been running. Then, when reading the ANR trace stack, we know clearly how long the logic shown in the trace had already taken.

The messages still waiting in the MessageQueue also need to be counted:

  • They tell us which component induced the ANR.
  • They show how long the pending messages have been waiting, which indicates how busy the message queue is.

3.3 More comprehensive information collection

So far we have monitored and recorded the timing of main-thread task scheduling comprehensively. The following information is also worth collecting.

3.3.1 Get ANRInfo

The application layer can obtain the ANR info through ActivityManager:

    val am = application.getSystemService(Context.ACTIVITY_SERVICE) as ActivityManager
    val processesInErrorStates = am.processesInErrorState

Each ProcessErrorStateInfo entry provides:

  • shortMsg: the direct cause of the ANR, such as a TouchEvent timeout or a Service startup timeout.

 shortMsg: ANR Inputdispatching timed out ...

  • longMsg: system conditions, including system load, CPU usage and IO.

Reason: Input dispatching timed out 
xxx is not responding. Waited 5003ms for FocusEvent(hasFocus=false))
Load: 3.19 / 2.46 / 2.42

----- Output from /proc/pressure/memory -----
some avg10=0.00 avg60=0.00 avg300=0.00 total=144741615
full avg10=0.00 avg60=0.00 avg300=0.00 total=29370150
----- End output from /proc/pressure/memory -----

CPU usage from 0ms to xxx ms later with 99% awake:

TOTAL: 6.4% user + 8.2% kernel + 0% iowait + 0.6% irq + 0.1% softirq

CPU usage from 0ms to xxx ms later with 99% awake:

27% TOTAL: 10% user + 14% kernel + 0.3% iowait + 0.9% irq + 0.3% softirq

3.3.2 Logcat logs

Collect logcat logs while the application is running. Take care to cap the volume with an appropriate policy and to clean up old logs regularly.

String cmd = "logcat -f " + file.getAbsolutePath();
Runtime.getRuntime().exec(cmd);

3.3.3 Stack information of other threads in the current process

Either obtain each thread's stack from the Java layer, or locate the virtual machine's internal thread-dump interface (through reflection and the function address in the memory map), call it directly, and redirect its output to a local file.
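The Java-layer option can be sketched with the standard Thread.getAllStackTraces() call, which is available both on the JVM and on Android. ThreadDumpSketch and dumpAllThreads are hypothetical names and the output format is an arbitrary choice. The sketch does not cover the second option (calling the virtual machine's internal dump interface).

```java
import java.util.Map;

/** Hypothetical sketch: dump the Java stacks of all threads in the current process. */
final class ThreadDumpSketch {

    /** Returns one text block per live thread: name, state, then its stack frames. */
    static String dumpAllThreads() {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> entry : Thread.getAllStackTraces().entrySet()) {
            Thread thread = entry.getKey();
            out.append('"').append(thread.getName()).append('"')
               .append(" state=").append(thread.getState()).append('\n');
            for (StackTraceElement frame : entry.getValue()) {
                out.append("    at ").append(frame).append('\n');
            }
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(dumpAllThreads());
    }
}
```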

In this way, when an ANR occurs we have rich information to work with: the past, present and pending scheduling information of the main thread, system information, thread information and Logcat output.

4 Problem analysis, positioning and solution

4.1 Analysis of main thread scheduling

Check the recorded main-thread task scheduling for obviously time-consuming tasks that could have induced the ANR.

  • Analyze the wallTime and cpuTime of the recorded messages.
  • Use the recorded stacks of time-consuming messages to locate the problem.
  • Note that a long run of small time-consuming messages can also cause an ANR.

4.2 Interpretation of ANR Info

4.2.1 System load

Load: xx/xx/xx

These are the CPU load averages over the 1 minute, 5 minutes and 15 minutes before the ANR occurred. The value represents the number of tasks waiting for the system to schedule them. If it is too high, the system is under heavy competition for CPU and IO, and our application process may be affected.
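When ANR reports are processed in bulk, the Load line can be parsed automatically. The sketch below is plain Java; LoadLineSketch and parse are hypothetical names, and the accepted format is assumed from the sample line shown earlier in this article.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: extract the three load averages from a "Load:" line. */
final class LoadLineSketch {

    private static final String NUM = "(\\d+(?:\\.\\d+)?)";
    private static final Pattern LOAD =
            Pattern.compile("^Load:\\s*" + NUM + "\\s*/\\s*" + NUM + "\\s*/\\s*" + NUM + "$");

    /** Returns {1min, 5min, 15min}, or null if the line is not a Load line. */
    static double[] parse(String line) {
        Matcher m = LOAD.matcher(line.trim());
        if (!m.matches()) {
            return null;
        }
        return new double[] {
            Double.parseDouble(m.group(1)),
            Double.parseDouble(m.group(2)),
            Double.parseDouble(m.group(3))
        };
    }

    public static void main(String[] args) {
        double[] load = parse("Load: 3.19 / 2.46 / 2.42");
        System.out.println("1min=" + load[0] + " 5min=" + load[1] + " 15min=" + load[2]);
    }
}
```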

4.2.2 CPU usage

CPU usage from 0ms to xxx ms later with xx% awake:
14% 1673/system_server: 8% user + 6.7% kernel / faults: 12746 minor
13% 30829/tv.danmaku.bili: 7.3% user + 6.2% kernel / faults: 24286 minor
6.6% 31147/tv.danmaku.bili:ijkservice: 3.7% user + 2.8% kernel / faults: 11880 minor
6% 574/logd: 2.1% user + 3.8% kernel / faults: 64 minor
..
TOTAL: 6.4% user + 8.2% kernel + 0% iowait + 0.6% irq + 0.1% softirq

CPU usage from xxms to xxxms later
73% 1673/system_server: 49% user + 24% kernel / faults: 1695 minor
  33% 2330/AnrConsumer: 12% user + 21% kernel
  15% 1683/HeapTaskDaemon: 15% user + 0% kernel
  9.2% 7013/Binder:1673_12: 6.1% user + 3% kernel
  6.1% 1685/ReferenceQueueD: 6.1% user + 0% kernel
  6.1% 2715/HwBinder:1673_5: 6.1% user + 0% kernel
  3% 2529/PhotonicModulat: 0% user + 3% kernel
25% 30829/tv.danmaku.bili: 4.2% user + 21% kernel / faults: 423 minor
  25% 31050/thread_ad: 4.2% user + 21% kernel
  ...
  ...                                                                                                   
27% TOTAL: 10% user + 14% kernel + 0.3% iowait + 0.9% irq + 0.3% softirq

As above, it indicates the CPU usage before and after the occurrence of ANR, and the specific usage of these processes.

  • user: user state
  • kernel: kernel mode
  • iowait: IO wait. A high value may indicate heavy file reading and writing or a memory shortage.
  • irq: hard interrupt
  • softirq: proportion of soft interrupts

Note: the CPU usage of a single process is not capped at 100%. With N cores the cap is N × 100%; for example, the upper limit on 8 cores is 800%.
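The per-process and per-thread lines in this section follow a regular shape and can be parsed for automated analysis. The following plain-Java sketch assumes the format of the sample lines above; CpuUsageLineSketch, Entry and parse are hypothetical names, and summary lines such as TOTAL are deliberately not matched.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: parse one "NN% pid/name: NN% user + NN% kernel" line. */
final class CpuUsageLineSketch {

    private static final String NUM = "(\\d+(?:\\.\\d+)?)";
    // The name may itself contain ':' (for example "pkg:service"), so it is matched lazily.
    private static final Pattern LINE = Pattern.compile(
            "^\\s*" + NUM + "% (\\d+)/(.+?): " + NUM + "% user \\+ " + NUM + "% kernel.*$");

    static final class Entry {
        final double totalPct;
        final int pid;
        final String name;
        final double userPct;
        final double kernelPct;

        Entry(double totalPct, int pid, String name, double userPct, double kernelPct) {
            this.totalPct = totalPct;
            this.pid = pid;
            this.name = name;
            this.userPct = userPct;
            this.kernelPct = kernelPct;
        }
    }

    /** Returns the parsed entry, or null if the line is not a process or thread line. */
    static Entry parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new Entry(
                Double.parseDouble(m.group(1)),
                Integer.parseInt(m.group(2)),
                m.group(3),
                Double.parseDouble(m.group(4)),
                Double.parseDouble(m.group(5)));
    }

    public static void main(String[] args) {
        Entry e = parse("6.6% 31147/tv.danmaku.bili:ijkservice: 3.7% user + 2.8% kernel / faults: 11880 minor");
        System.out.println(e.pid + " " + e.name + " total=" + e.totalPct + "%");
    }
}
```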

Also watch key system threads such as kswapd and mmcqd. High CPU usage in these threads usually means the system is reclaiming resources, which affects the application process.

Interpretation of ANR information can better and comprehensively help us locate ANR problems.

4.3 Logcat message

If Logcat output is recorded in production, focus on the following when looking for the problem:

  • onTrimMemory: repeated onTrimMemory calls often indicate that the app is short of memory, or that system resources are insufficient, which can lead to ANR.
  • Slow operation, Slow delivery: when these messages appear, system performance is constrained.
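Scanning collected log lines for these markers is easy to automate. The plain-Java sketch below is an illustration only: LogcatSignalSketch and countSignals are hypothetical names, matching is a simple substring test, and the log lines in the usage example are made up for the demonstration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: count how often each warning marker appears in collected log lines. */
final class LogcatSignalSketch {

    static final String[] KEYWORDS = {"onTrimMemory", "Slow operation", "Slow delivery"};

    /** Returns a map from keyword to the number of lines that contain it. */
    static Map<String, Integer> countSignals(List<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String keyword : KEYWORDS) {
            counts.put(keyword, 0);
        }
        for (String line : lines) {
            for (String keyword : KEYWORDS) {
                if (line.contains(keyword)) {
                    counts.put(keyword, counts.get(keyword) + 1);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample lines, not real device output.
        List<String> lines = Arrays.asList(
                "app: onTrimMemory level=15",
                "Looper: Slow delivery took 2000ms main",
                "app: unrelated line",
                "app: onTrimMemory level=80");
        System.out.println(countSignals(lines));
    }
}
```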

5 Summary

This article covered where ANRs come from, how the system handles them, how to monitor them at the application layer, how to monitor main-thread task scheduling comprehensively on the application side, and how to collect system information. It closed with a general approach to analyzing and solving ANR problems.

For ANR problems this is far from enough; it is only a general framework. I hope this article helps you tackle ANR problems more comprehensively.



Origin blog.csdn.net/maniuT/article/details/130105015