Android Advanced Guide: How to analyze the cause of problems through ANR logs

1 Learn more about ANR

When an ANR occurs, the system dumps a trace log and saves it to the file /data/anr/traces.txt (the exact file name can differ between Android versions). Once we have this log, we can start analyzing it. The log can also be pulled with the bugreport command:

adb bugreport [path to save the log]

Of course, when we interpret the logs, we need some keywords to search for in order to determine which type of problem caused the ANR.

1.1 logcat log keywords

We know that the main scenarios that cause ANR are as follows:

(1) User input event

For example, if a click, swipe, or long press is not processed within 5 seconds, an ANR occurs. For this type of problem, search for the following keyword:

input event dispatching timed out

This is the most common ANR: we click a button, the page freezes, and the ANR dialog appears. Note that this keyword appears only in the logcat log. When an ANR occurs during development, searching for this keyword locates it quickly. How to find the cause of the ANR by analyzing the trace log is described in detail later.

I/WindowManager: Input event dispatching timed out sending to com.lay.layzproject/com.lay.datastore.DataStoreActivity.  Reason: Waiting to send non-key event because the touched window has not finished processing certain input events that were delivered to it over 500.0ms ago.  Wait queue length: 3.  Wait queue head age: 5527.5ms.

(2) Broadcast receiver

Foreground broadcast receiver: if onReceive does not finish within 10s, an ANR is triggered.
Background broadcast receiver: if onReceive does not finish within 60s, an ANR is triggered.

The logcat keyword is:

timeout of broadcast BroadcastRecord

Note that during analysis, it is necessary to distinguish between foreground and background broadcasts.

(3) Service

Foreground service: if onCreate, onStart (onStartCommand), or onBind does not finish within 20s, an ANR is triggered.
Background service: if onCreate, onStart (onStartCommand), or onBind does not finish within 200s, an ANR is triggered.

The logcat keyword is:

timeout executing service

(4) ContentProvider

If a ContentProvider does not finish processing within 10 seconds, for example while performing a query, an ANR is triggered.

The logcat keyword is:

timeout publishing content providers

Of course, all of this applies during development: if we hit an ANR, we can read the logcat output directly in the IDE. In most cases, however, the problem occurs during testing or on a user's device, so we need to obtain the trace log for detailed analysis.
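The keyword lookup above can be sketched in plain Java. This is only an illustration: the class name AnrKeywordClassifier and the enum AnrType are hypothetical, and the keywords are the four listed in this section, matched case-insensitively because logcat capitalizes the first word.

```java
import java.util.Locale;

/** Maps a logcat line to the ANR category whose keyword it contains (hypothetical helper). */
final class AnrKeywordClassifier {

    enum AnrType {
        INPUT("input event dispatching timed out"),
        BROADCAST("timeout of broadcast broadcastrecord"),
        SERVICE("timeout executing service"),
        CONTENT_PROVIDER("timeout publishing content providers"),
        UNKNOWN("");

        final String keyword; // lower-case keyword searched for in the line

        AnrType(String keyword) {
            this.keyword = keyword;
        }
    }

    /** Returns the first category whose keyword occurs in the line, ignoring case. */
    static AnrType classify(String logcatLine) {
        String lower = logcatLine.toLowerCase(Locale.ROOT);
        for (AnrType type : AnrType.values()) {
            if (type != AnrType.UNKNOWN && lower.contains(type.keyword)) {
                return type;
            }
        }
        return AnrType.UNKNOWN;
    }

    public static void main(String[] args) {
        String line = "I/WindowManager: Input event dispatching timed out sending to "
                + "com.lay.layzproject/com.lay.datastore.DataStoreActivity.";
        System.out.println(classify(line)); // prints INPUT
    }
}
```

For a broadcast match, the foreground or background case still has to be checked by hand, since the keyword alone does not distinguish the 10s and 60s limits.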

1.2 Summary of reasons for the occurrence of ANR

(1) The main thread performs frequent IO operations, such as file reads and writes, SharedPreferences (SP) storage, or database reads and writes, which block the main thread;

(2) A deadlock occurs between threads and the main thread is blocked

In client development, hand-written multi-threaded code is relatively uncommon. Especially since Kotlin introduced coroutines, direct use of thread pools has become rare, and the probability of deadlock is low. However, even when using coroutines, if the main thread blocks while waiting for a result to be returned, a timeout ANR can still occur.

(3) The main thread is blocked by the Binder peer

Binder communication can be synchronous or asynchronous. With a synchronous Binder call, the calling thread blocks until the peer replies, so if the peer is slow, the main thread may stay blocked long enough to cause an ANR.

(4) System resources have been exhausted, such as CPU, IO, etc.

2 Read trace logs

Suppose that during development or testing, a QA colleague reports an ANR that appeared while using the app and exports the corresponding log file through bugreport. After we get this file, how can we quickly locate the problem?

2.1 Process of locating ANR issues

(1) Locate the time when ANR occurs;

(2) Check the trace log to see if there are CPU exceptions, lock competition, time-consuming messages, and time-consuming binder calls;

(3) Check the status of the main thread;

(4) Check the status of other threads;

Starting from these 4 points, we eliminate the possible causes listed in 1.2 one by one, confirm where the problem occurred, and arrive at a solution. Some scenarios still need to be analyzed together with the business context.

2.2 Trace log keyword analysis

----- pid 32012 at 2023-04-16 12:19:57 -----
Cmd line: com.lay.layzproject
Build fingerprint: 'google/sdk_gphone_x86_arm/generic_x86_arm:9/PSR1.180720.122/6736742:userdebug/dev-keys'
ABI: 'x86'
Build type: optimized
Zygote loaded classes=10642 post zygote classes=1095
Intern table: 74397 strong; 365 weak
JNI: CheckJNI is on; globals=609 (plus 26 weak)
Libraries: /data/app/com.lay.layzproject-ctmKoWSLQO-XwViIKfoW5Q==/lib/x86/libmmkv.so /system/lib/libandroid.so /system/lib/libcompiler_rt.so /system/lib/libjavacrypto.so /system/lib/libjnigraphics.so /system/lib/libmedia_jni.so /system/lib/libsoundpool.so /system/lib/libwebviewchromium_loader.so libjavacore.so libopenjdk.so (10)
// 3MB of heap memory has been allocated, 2MB is in use, and 43648 objects have been created
Heap: 20% free, 2MB/3MB; 43648 objects
// Some GC information; it can be ignored
Dumping cumulative Gc timings
Cumulative bytes moved 6114552
Cumulative objects moved 141676
Peak regions allocated 416 (104MB) / 1536 (384MB)
Total number of allocations 43648
Total bytes allocated 2MB
Total bytes freed 0B
Free memory 774KB
Free memory until GC 774KB
Free memory until OOME 381MB
Total memory 3MB
Max memory 384MB
Zygote space size 1308KB
Total mutator paused time: 0
Total time waiting for GC to complete: 31us
Total GC count: 0
Total GC time: 0
Total blocking GC count: 0
Total blocking GC time: 0
Registered native bytes allocated: 265097
Current JIT code cache size: 12KB
Current JIT data cache size: 10KB
Current JIT mini-debug-info size: 27KB
Current JIT capacity: 64KB
Current number of JIT JNI stub entries: 0
Current number of JIT code cache entries: 53
Total number of JIT compilations: 53
Total number of JIT compilations for on stack replacement: 0
Total number of JIT code cache collections: 0
Memory used for stack maps: Avg: 70B Max: 524B Min: 24B
Memory used for compiled code: Avg: 209B Max: 3KB Min: 1B
Memory used for profiling info: Avg: 62B Max: 1384B Min: 16B
Start Dumping histograms for 53 iterations for JIT timings
Compiling:	Sum: 124.695ms 99% C.I. 0.107ms-10.841ms Avg: 2.352ms Max: 10.934ms
TrimMaps:	Sum: 4.898ms 99% C.I. 6us-743.999us Avg: 92.415us Max: 820us
Done Dumping histograms
Memory used for compilation: Avg: 15KB Max: 159KB Min: 7KB
ProfileSaver total_bytes_written=0
ProfileSaver total_number_of_writes=0
ProfileSaver total_number_of_code_cache_queries=0
ProfileSaver total_number_of_skipped_writes=0
ProfileSaver total_number_of_failed_writes=0
ProfileSaver total_ms_of_sleep=5000
ProfileSaver total_ms_of_work=0
ProfileSaver max_number_profile_entries_cached=0
ProfileSaver total_number_of_hot_spikes=0
ProfileSaver total_number_of_wake_ups=1

suspend all histogram:	Sum: 734us 99% C.I. 0.304us-105us Avg: 38.631us Max: 105us
DALVIK THREADS (14):
"Signal Catcher" daemon prio=5 tid=3 Runnable
  | group="system" sCount=0 dsCount=0 flags=0 obj=0x13100020 self=0xe375e000
  | sysTid=32028 nice=0 cgrp=default sched=0/0 handle=0xdd37e970
  | state=R schedstat=( 9020262 11182596 24 ) utm=0 stm=0 core=2 HZ=100
  | stack=0xdd283000-0xdd285000 stackSize=1010KB
  | held mutexes= "mutator lock"(shared held)
  native: #00 pc 004151b6  /system/lib/libart.so (art::DumpNativeStack(std::__1::basic_ostream<char, std::__1::char_traits<char>>&, int, BacktraceMap*, char const*, art::ArtMethod*, void*, bool)+198)
  native: #01 pc 0051034e  /system/lib/libart.so (art::Thread::DumpStack(std::__1::basic_ostream<char, std::__1::char_traits<char>>&, bool, BacktraceMap*, bool) const+382)
  native: #02 pc 0050b603  /system/lib/libart.so (art::Thread::Dump(std::__1::basic_ostream<char, std::__1::char_traits<char>>&, bool, BacktraceMap*, bool) const+83)
  native: #03 pc 0052e424  /system/lib/libart.so (art::DumpCheckpoint::Run(art::Thread*)+916)
  native: #04 pc 00526146  /system/lib/libart.so (art::ThreadList::RunCheckpoint(art::Closure*, art::Closure*)+534)
  native: #05 pc 00525394  /system/lib/libart.so (art::ThreadList::Dump(std::__1::basic_ostream<char, std::__1::char_traits<char>>&, bool)+1316)
  native: #06 pc 00524d8d  /system/lib/libart.so (art::ThreadList::DumpForSigQuit(std::__1::basic_ostream<char, std::__1::char_traits<char>>&)+941)
  native: #07 pc 004ec186  /system/lib/libart.so (art::Runtime::DumpForSigQuit(std::__1::basic_ostream<char, std::__1::char_traits<char>>&)+214)
  native: #08 pc 004fafde  /system/lib/libart.so (art::SignalCatcher::HandleSigQuit()+1806)
  native: #09 pc 004f9a4f  /system/lib/libart.so (art::SignalCatcher::Run(void*)+431)
  native: #10 pc 0008f065  /system/lib/libc.so (__pthread_start(void*)+53)
  native: #11 pc 0002485b  /system/lib/libc.so (__start_thread+75)
  (no managed stack frames)
  
// Call stack of the main thread
"main" prio=5 tid=1 Sleeping
  | group="main" sCount=1 dsCount=0 flags=1 obj=0x7583df30 self=0xe36f4000
  | sysTid=32012 nice=-10 cgrp=default sched=0/0 handle=0xe83b5494
  | state=S schedstat=( 4837530082 1301459614 14038 ) utm=141 stm=342 core=2 HZ=100
  | stack=0xff753000-0xff755000 stackSize=8MB
  | held mutexes=
  at java.lang.Thread.sleep(Native method)
  - sleeping on <0x06bde954> (a java.lang.Object)
  at java.lang.Thread.sleep(Thread.java:373)
  - locked <0x06bde954> (a java.lang.Object)
  at java.lang.Thread.sleep(Thread.java:314)
  at com.lay.datastore.DataStoreActivity.onCreate$lambda-0(DataStoreActivity.kt:20)
  at com.lay.datastore.DataStoreActivity.$r8$lambda$afdjO_vwWNd-vtjqRlagos86bqM(DataStoreActivity.kt:-1)
  at com.lay.datastore.DataStoreActivity$$ExternalSyntheticLambda0.onClick(D8$$SyntheticClass:-1)
  at android.view.View.performClick(View.java:6597)
  at com.google.android.material.button.MaterialButton.performClick(MaterialButton.java:1219)
  at android.view.View.performClickInternal(View.java:6574)
  at android.view.View.access$3100(View.java:778)
  at android.view.View$PerformClick.run(View.java:25885)
  at android.os.Handler.handleCallback(Handler.java:873)
  at android.os.Handler.dispatchMessage(Handler.java:99)
  at android.os.Looper.loop(Looper.java:193)
  at android.app.ActivityThread.main(ActivityThread.java:6669)
  at java.lang.reflect.Method.invoke(Native method)
  at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:493)
  at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:858)

The log above comes from an ANR that I created deliberately and then exported through bugreport. Searching the exported file for the keyword "ANR" leads to the trace for our project. The meaning of some of the fields is introduced below.

2.2.1 Meaning of fields

(1) Cmd line

Shows the package name of the application, which tells us that the ANR occurred in this application.

(2) Heap: 20% free, 2MB/3MB; 43648 objects

This line means that 3MB of heap memory has been allocated, of which 2MB is in use (20% is free), and that a total of 43648 objects have been created.

(3) DALVIK THREADS (14):

The current process has a total of 14 threads.
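As a rough sketch, these process-level fields can be pulled out of a trace with a few regular expressions. The class name TraceHeader is hypothetical, and the patterns assume the exact line shapes shown in the dump in 2.2 (including the "----- pid ... at ... -----" line, which gives the time of the dump).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Process-level fields of one trace dump (hypothetical helper). */
final class TraceHeader {
    private static final Pattern PID_LINE =
            Pattern.compile("^----- pid (\\d+) at (\\S+ \\S+) -----", Pattern.MULTILINE);
    private static final Pattern CMD_LINE =
            Pattern.compile("^Cmd line: (\\S+)", Pattern.MULTILINE);
    private static final Pattern THREADS_LINE =
            Pattern.compile("^DALVIK THREADS \\((\\d+)\\):", Pattern.MULTILINE);

    final int pid;
    final String timestamp;
    final String packageName;
    final int threadCount;

    private TraceHeader(int pid, String timestamp, String packageName, int threadCount) {
        this.pid = pid;
        this.timestamp = timestamp;
        this.packageName = packageName;
        this.threadCount = threadCount;
    }

    /** Parses the first dump found in the text; throws if a required line is missing. */
    static TraceHeader parse(String trace) {
        Matcher pidLine = find(PID_LINE, trace);
        Matcher cmdLine = find(CMD_LINE, trace);
        Matcher threads = find(THREADS_LINE, trace);
        return new TraceHeader(
                Integer.parseInt(pidLine.group(1)),
                pidLine.group(2),
                cmdLine.group(1),
                Integer.parseInt(threads.group(1)));
    }

    private static Matcher find(Pattern pattern, String text) {
        Matcher matcher = pattern.matcher(text);
        if (!matcher.find()) {
            throw new IllegalArgumentException("no match for " + pattern.pattern());
        }
        return matcher;
    }

    public static void main(String[] args) {
        String trace = String.join("\n",
                "----- pid 32012 at 2023-04-16 12:19:57 -----",
                "Cmd line: com.lay.layzproject",
                "Heap: 20% free, 2MB/3MB; 43648 objects",
                "DALVIK THREADS (14):");
        TraceHeader h = parse(trace);
        System.out.println(h.packageName + " pid=" + h.pid + " at " + h.timestamp
                + ", threads=" + h.threadCount);
    }
}
```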

2.2.2 Introduction to thread call stack parameters

"main" prio=5 tid=1 Sleeping
  | group="main" sCount=1 dsCount=0 flags=1 obj=0x7583df30 self=0xe36f4000
  | sysTid=32012 nice=-10 cgrp=default sched=0/0 handle=0xe83b5494
  | state=S schedstat=( 4837530082 1301459614 14038 ) utm=141 stm=342 core=2 HZ=100
  | stack=0xff753000-0xff755000 stackSize=8MB
  | held mutexes=

Here we use the call stack of the main thread to introduce the parameters:

First line:

  • Thread name: main. If the daemon flag is present, it is a daemon thread, such as the Signal Catcher thread.
  • prio: thread priority
  • tid: the thread's id inside the runtime
  • Thread status: Sleeping. Thread states are explained later.

Second line:

  • group: the thread group to which the thread belongs
  • sCount: the number of times the thread has been suspended
  • dsCount: the number of suspensions requested for debugging
  • obj: the Java Thread object associated with the current thread
  • self: the address of the current thread's native thread object

Third line:

  • sysTid: the real (kernel) tid of the thread
  • nice: scheduling priority. The smaller the nice value, the higher the priority; -10 is already a very high priority.
  • cgrp: the scheduling group to which the thread belongs
  • sched: scheduling policy and priority
  • handle: the address of the thread's native (pthread) handle

Fourth line:

  • state: the thread's scheduling state (S for sleeping and R for running in the logs above)
  • schedstat: CPU scheduling time statistics
  • utm/stm: CPU time spent in user mode/kernel mode
  • core: the core on which the thread last ran
  • HZ: the clock tick rate (ticks per second)

Fifth line:

  • stack: the address range of the thread stack
  • stackSize: the stack size

Sixth line:

  • held mutexes: the locks held by the thread, each marked as exclusive or shared
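To make the field list concrete, here is a minimal sketch that turns one thread header into a map of the fields described above. The class ThreadHeaderParser and the map keys "name", "daemon", and "status" are names chosen for this example; the parenthesized schedstat value and the held mutexes line are deliberately skipped because they are not simple key=value tokens.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Parses the header lines of one thread block into key/value pairs (hypothetical helper). */
final class ThreadHeaderParser {
    // e.g. "main" prio=5 tid=1 Sleeping   or   "Signal Catcher" daemon prio=5 tid=3 Runnable
    private static final Pattern FIRST_LINE =
            Pattern.compile("^\"([^\"]+)\"( daemon)? prio=(\\d+) tid=(\\d+) (\\w+)$");
    // key=value where the value is a quoted string or a token without spaces or parentheses
    private static final Pattern ATTRIBUTE =
            Pattern.compile("(\\w+)=(\"[^\"]*\"|[^\\s()]+)");

    static Map<String, String> parse(String headerBlock) {
        String[] lines = headerBlock.split("\\R");
        Matcher first = FIRST_LINE.matcher(lines[0].trim());
        if (!first.matches()) {
            throw new IllegalArgumentException("not a thread header: " + lines[0]);
        }
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("name", first.group(1));
        fields.put("daemon", String.valueOf(first.group(2) != null));
        fields.put("prio", first.group(3));
        fields.put("tid", first.group(4));
        fields.put("status", first.group(5));
        for (int i = 1; i < lines.length; i++) {
            String line = lines[i].trim();
            if (!line.startsWith("|")) {
                break; // the stack frames start here
            }
            Matcher attr = ATTRIBUTE.matcher(line);
            while (attr.find()) {
                String value = attr.group(2);
                if (value.startsWith("\"")) {
                    value = value.substring(1, value.length() - 1); // drop the quotes
                }
                fields.put(attr.group(1), value);
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        String header = String.join("\n",
                "\"main\" prio=5 tid=1 Sleeping",
                "  | group=\"main\" sCount=1 dsCount=0 flags=1 obj=0x7583df30 self=0xe36f4000",
                "  | sysTid=32012 nice=-10 cgrp=default sched=0/0 handle=0xe83b5494",
                "  | state=S schedstat=( 4837530082 1301459614 14038 ) utm=141 stm=342 core=2 HZ=100",
                "  | stack=0xff753000-0xff755000 stackSize=8MB",
                "  | held mutexes=");
        System.out.println(parse(header));
    }
}
```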

The CPU scheduling time statistics deserve a closer look:

schedstat=( 4837530082 1301459614 14038 ) 

There are three values in the parentheses, namely Running, Runnable, and Switch, which describe the thread under the CPU's time-slice scheduling:

  • Running: CPU running time, in ns
  • Runnable: time spent waiting in the run queue (RQ), in ns
  • Switch: the number of CPU scheduling switches

Next are utm and stm:

  • utm: the time the thread has executed in user mode, in jiffies; with HZ=100, one jiffy is 10ms
  • stm: the time the thread has executed in kernel mode, in jiffies; with HZ=100, one jiffy is 10ms

Therefore, the CPU time spent in user mode plus kernel mode is 141 * 10 + 342 * 10 = 4830ms. The first schedstat value, the CPU running time, is 4837530082ns, or about 4837ms, which is roughly equal to utm + stm.
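The same arithmetic can be written as a small Java sketch. The class name SchedStat is hypothetical; the jiffy length is derived from the HZ field on the same line (1000 / HZ ms), which gives 10ms for HZ=100.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** CPU statistics taken from the fourth header line of a thread (hypothetical helper). */
final class SchedStat {
    private static final Pattern FIELDS = Pattern.compile(
            "schedstat=\\( (\\d+) (\\d+) (\\d+) \\) utm=(\\d+) stm=(\\d+) core=\\d+ HZ=(\\d+)");

    final long runningNs;   // time spent running on a CPU, in ns
    final long runnableNs;  // time spent waiting in the run queue, in ns
    final long switches;    // number of scheduling switches
    final long utm;         // user-mode time, in jiffies
    final long stm;         // kernel-mode time, in jiffies
    final long hz;          // ticks per second

    private SchedStat(long runningNs, long runnableNs, long switches,
                      long utm, long stm, long hz) {
        this.runningNs = runningNs;
        this.runnableNs = runnableNs;
        this.switches = switches;
        this.utm = utm;
        this.stm = stm;
        this.hz = hz;
    }

    static SchedStat parse(String line) {
        Matcher m = FIELDS.matcher(line);
        if (!m.find()) {
            throw new IllegalArgumentException("not a schedstat line: " + line);
        }
        return new SchedStat(
                Long.parseLong(m.group(1)), Long.parseLong(m.group(2)),
                Long.parseLong(m.group(3)), Long.parseLong(m.group(4)),
                Long.parseLong(m.group(5)), Long.parseLong(m.group(6)));
    }

    /** Running time from schedstat, converted from ns to whole ms. */
    long runningMs() {
        return runningNs / 1_000_000L;
    }

    /** utm + stm converted from jiffies to ms. */
    long userPlusKernelMs() {
        return (utm + stm) * (1000L / hz);
    }

    public static void main(String[] args) {
        SchedStat s = parse("  | state=S schedstat=( 4837530082 1301459614 14038 ) "
                + "utm=141 stm=342 core=2 HZ=100");
        // prints: running=4837ms, utm+stm=4830ms
        System.out.println("running=" + s.runningMs() + "ms, utm+stm="
                + s.userPlusKernelMs() + "ms");
    }
}
```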

So from a thread's header we can read the state of the thread when the ANR occurred, as well as how much CPU time it was getting. The thread state is the most important part.

2.2.3 Check the status of threads

A thread passes through several states between its creation and its destruction. A thread that is running normally is in the Runnable state. What other states can a thread have?

public enum State {
    NEW,
    RUNNABLE,
    /**
     * Thread state for a thread blocked waiting for a monitor lock.
     * A thread in the blocked state is waiting for a monitor lock
     * to enter a synchronized block/method or
     * reenter a synchronized block/method after calling
     * {@link Object#wait() Object.wait}.
     */
    BLOCKED,

    /**
     * Thread state for a waiting thread.
     * A thread is in the waiting state due to calling one of the
     * following methods:
     * <ul>
     *   <li>{@link Object#wait() Object.wait} with no timeout</li>
     *   <li>{@link #join() Thread.join} with no timeout</li>
     *   <li>{@link LockSupport#park() LockSupport.park}</li>
     * </ul>
     *
     * <p>A thread in the waiting state is waiting for another thread to
     * perform a particular action.
     *
     * For example, if the main thread calls wait(), it stays in the
     * WAITING state until another thread calls notify() to wake it up.
     */
    WAITING,

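    /**
     * Thread state for a waiting thread with a specified waiting time,
     * for example after Thread.sleep or Object.wait with a timeout. The
     * thread stops waiting on its own once the timeout expires.
     */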
    TIMED_WAITING,

    /**
     * Thread state for a terminated thread.
     * The thread has completed execution.
     */
    TERMINATED;
}
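To see two of these states outside of a trace log, the following plain-JVM sketch starts one thread that sleeps and one that contends for a monitor, then reads Thread.getState(). The class ThreadStateDemo and its helper methods are hypothetical names for this example; the polling helper exists only because a freshly started thread needs a moment to reach the expected state.

```java
/** Observes TIMED_WAITING and BLOCKED through Thread.getState() (hypothetical demo). */
final class ThreadStateDemo {

    /** Polls until the thread reports the wanted state; returns false on timeout or interrupt. */
    static boolean awaitState(Thread thread, Thread.State wanted, long timeoutMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (thread.getState() != wanted) {
            if (System.nanoTime() > deadline) {
                return false;
            }
            try {
                Thread.sleep(5);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }

    /** Starts a daemon thread that sleeps, so it ends up in TIMED_WAITING. */
    static Thread startSleeper() {
        Thread t = new Thread(() -> {
            try {
                Thread.sleep(60_000);
            } catch (InterruptedException ignored) {
                // interrupted: just let the thread finish
            }
        }, "sleeper");
        t.setDaemon(true);
        t.start();
        return t;
    }

    /** Starts a daemon thread that tries to enter the monitor of the given lock. */
    static Thread startContender(Object lock) {
        Thread t = new Thread(() -> {
            synchronized (lock) {
                // nothing to do; entering the monitor is the point
            }
        }, "contender");
        t.setDaemon(true);
        t.start();
        return t;
    }

    public static void main(String[] args) {
        Thread sleeper = startSleeper();
        System.out.println("sleeper TIMED_WAITING: "
                + awaitState(sleeper, Thread.State.TIMED_WAITING, 2_000));

        Object lock = new Object();
        synchronized (lock) { // hold the lock so the contender cannot enter
            Thread contender = startContender(lock);
            System.out.println("contender BLOCKED: "
                    + awaitState(contender, Thread.State.BLOCKED, 2_000));
        }
    }
}
```

Note that the status names printed in an ANR trace (Sleeping, Blocked, Monitor, and so on) come from the runtime and are not identical to the names in this Java enum.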

(1) Runnable / Native

UI updates are usually performed by the main thread. When the main thread is working normally and responding quickly, it is in the Runnable state. Even so, a Runnable thread may still be waiting for resources before it can update the UI. The JDK documentation for this state says:

"A thread in the runnable state is executing in the Java virtual machine but it may be waiting for other resources from the operating system such as processor."

So even though Runnable is a normal state, the thread may be waiting for operating-system resources. If those resources are slow to arrive, the main thread keeps waiting, and an ANR occurs once the 5-second timeout expires. Therefore, if the main thread is found to be Runnable during ANR analysis, consider whether it was blocked while waiting for resources.

Let's look at the following scenario: the main thread starts an asynchronous task, and the task and the main thread synchronize on the same lock. Only when the asynchronous task finishes and releases the lock can the main thread acquire it and continue.

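// mLock is the lock object shared by both blocks. It is assumed to be a field
// of the Activity, for example: private val mLock = Any()
findViewById<Button>(R.id.btn_anr).setOnClickListener {
    CostTimeTask().execute("test")
    Log.d("TAG", "execute --- ")
    // The main thread blocks here until the async task releases mLock.
    synchronized(mLock) {
        Toast.makeText(this, "Async task finished", Toast.LENGTH_SHORT).show()
    }
}

inner class CostTimeTask : AsyncTask<String, Int, String>() {
    override fun doInBackground(vararg params: String?): String {
        synchronized(mLock) {
            // Simulate a task that never finishes, so the lock is never released.
            while (true) {
            }
        }
    }
}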

Here the asynchronous task simulates a time-consuming operation. The main thread can acquire the lock only after the task has finished; until then it stays blocked.

"AsyncTask #1" prio=5 tid=15 Runnable
  | group="main" sCount=0 dsCount=0 flags=0 obj=0x12cbe0f0 self=0xc83e0400
  | sysTid=6148 nice=10 cgrp=default sched=0/0 handle=0xc454e970
  | state=R schedstat=( 6304787005 112481313 737 ) utm=629 stm=1 core=2 HZ=100
  | stack=0xc444b000-0xc444d000 stackSize=1042KB
  | held mutexes= "mutator lock"(shared held)
  at com.lay.datastore.DataStoreActivity$CostTimeTask.doInBackground(DataStoreActivity.kt:42)
  at com.lay.datastore.DataStoreActivity$CostTimeTask.doInBackground(DataStoreActivity.kt:38)
  at android.os.AsyncTask$2.call(AsyncTask.java:333)
  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
  at android.os.AsyncTask$SerialExecutor$1.run(AsyncTask.java:245)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1167)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:641)
  at java.lang.Thread.run(Thread.java:764)

"main" prio=5 tid=1 Blocked
  | group="main" sCount=1 dsCount=0 flags=1 obj=0x7583df30 self=0xe36f4000
  | sysTid=6084 nice=-10 cgrp=default sched=0/0 handle=0xe83b5494
  | state=S schedstat=( 4210489664 1169737873 12952 ) utm=123 stm=298 core=2 HZ=100
  | stack=0xff753000-0xff755000 stackSize=8MB
  | held mutexes=
  at com.lay.datastore.DataStoreActivity.onCreate$lambda-1(DataStoreActivity.kt:29)
  - waiting to lock <0x0493299a> (a java.lang.Object) held by thread 15
  at com.lay.datastore.DataStoreActivity.$r8$lambda$IFZrCDzOUja7d5eTPj5Nq-CEC-8(DataStoreActivity.kt:-1)
  at com.lay.datastore.DataStoreActivity$$ExternalSyntheticLambda0.onClick(D8$$SyntheticClass:-1)
  at android.view.View.performClick(View.java:6597)
  at com.google.android.material.button.MaterialButton.performClick(MaterialButton.java:1219)
  at android.view.View.performClickInternal(View.java:6574)
  at android.view.View.access$3100(View.java:778)
  at android.view.View$PerformClick.run(View.java:25885)
  at android.os.Handler.handleCallback(Handler.java:873)
  at android.os.Handler.dispatchMessage(Handler.java:99)
  at android.os.Looper.loop(Looper.java:193)
  at android.app.ActivityThread.main(ActivityThread.java:6669)
  at java.lang.reflect.Method.invoke(Native method)
  at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:493)
  at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:858)

(2) Blocked / Monitor

When the main thread cannot continue executing because it is waiting for a lock, it enters the Blocked state (if the status appears in the log as Monitor, it means the same as Blocked), and no events can be handled;
When the status of the main thread is Waiting or Monitor, the thread is blocked and suspended. If it stays in this state, an ANR is imminent.

(3) Sleeping

As seen in the log in 2.2, the status of the main thread there is Sleeping. The stack shows that the sleep method was called in the click handler, which caused the ANR, so this status is also abnormal for the main thread.

The status of the main thread is the lifeline of the entire app. When it is abnormal, an ANR is not far away even if one has not occurred yet. Therefore, once CPU and similar problems have been ruled out, ANR analysis should focus on the status of the main thread and of the other threads in the same group.
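A minimal sketch of that check: find the "main" thread block in a trace, read its status, and report the first stack frame that belongs to the app's own code. The class MainThreadSummary and the appPackage parameter are assumptions of this example; for the logs in this article, the prefix "com.lay." covers the app's classes.

```java
/** Summarizes the main thread of a trace dump (hypothetical helper). */
final class MainThreadSummary {

    /**
     * Returns "<status> at <first frame starting with appPackage>", just the status
     * if no such frame exists, or null when the trace has no "main" thread.
     */
    static String summarize(String trace, String appPackage) {
        String status = null;
        for (String raw : trace.split("\\R")) {
            String line = raw.trim();
            if (status == null) {
                if (line.startsWith("\"main\" ")) {
                    // the status is the last token of the first header line
                    status = line.substring(line.lastIndexOf(' ') + 1);
                }
            } else if (line.isEmpty() || line.startsWith("\"")) {
                break; // a blank line or the next thread ends the main thread block
            } else if (line.startsWith("at " + appPackage)) {
                return status + " at " + line.substring(3);
            }
        }
        return status;
    }

    public static void main(String[] args) {
        String trace = String.join("\n",
                "\"main\" prio=5 tid=1 Sleeping",
                "  | group=\"main\" sCount=1 dsCount=0 flags=1 obj=0x7583df30 self=0xe36f4000",
                "  at java.lang.Thread.sleep(Native method)",
                "  at java.lang.Thread.sleep(Thread.java:373)",
                "  at com.lay.datastore.DataStoreActivity.onCreate$lambda-0(DataStoreActivity.kt:20)",
                "  at android.view.View.performClick(View.java:6597)");
        System.out.println(summarize(trace, "com.lay."));
    }
}
```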

2.2.4 Searching for “deadlock” problems

The last header line of each thread, held mutexes, shows lock information. If you are not sure whether a deadlock has occurred, search the trace for "held by". If a matching line is found, the ANR is most likely caused by lock contention or a deadlock, although deadlocks seem to be relatively rare in client development.

Take the example in 2.2.3. The main thread keeps waiting for thread 15 to release the lock, and thread 15 never does, so the main thread is stuck just as it would be in a deadlock. Thread 15 is the AsyncTask thread, which also belongs to the main thread group.

waiting to lock <0x0493299a> (a java.lang.Object) held by thread 15

In the end, the main thread is in the Blocked state and cannot continue executing, which causes the ANR. The AsyncTask #1 thread is in the Runnable state and would release the lock only when doInBackground returns, which never happens in this example.
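The "held by" search can be automated with a short sketch like the one below. The class LockWaitFinder is a hypothetical name; the patterns assume the line shapes shown in the logs of 2.2.3, and the holder is looked up by the tid printed in each thread's first header line.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Lists which thread waits for a lock and which thread holds it (hypothetical helper). */
final class LockWaitFinder {
    private static final Pattern THREAD_LINE =
            Pattern.compile("^\"([^\"]+)\".* tid=(\\d+) (\\w+)$");
    private static final Pattern WAIT_LINE =
            Pattern.compile("^- waiting to lock <(0x[0-9a-f]+)> .* held by thread (\\d+)$");

    static List<String> findLockWaits(String trace) {
        Map<String, String> threadByTid = new HashMap<>(); // tid -> "name (status)"
        List<String[]> waits = new ArrayList<>();          // {waiter, lock id, holder tid}
        String current = null;
        for (String raw : trace.split("\\R")) {
            String line = raw.trim();
            Matcher thread = THREAD_LINE.matcher(line);
            if (thread.matches()) {
                current = thread.group(1);
                threadByTid.put(thread.group(2), current + " (" + thread.group(3) + ")");
                continue;
            }
            Matcher wait = WAIT_LINE.matcher(line);
            if (current != null && wait.matches()) {
                waits.add(new String[] {current, wait.group(1), wait.group(2)});
            }
        }
        // Resolve holders only after the whole trace is read, so thread order does not matter.
        List<String> report = new ArrayList<>();
        for (String[] wait : waits) {
            String holder = threadByTid.getOrDefault(wait[2], "unknown thread");
            report.add(wait[0] + " waits for " + wait[1]
                    + " held by tid " + wait[2] + ": " + holder);
        }
        return report;
    }

    public static void main(String[] args) {
        String trace = String.join("\n",
                "\"AsyncTask #1\" prio=5 tid=15 Runnable",
                "  at com.lay.datastore.DataStoreActivity$CostTimeTask.doInBackground(DataStoreActivity.kt:42)",
                "",
                "\"main\" prio=5 tid=1 Blocked",
                "  at com.lay.datastore.DataStoreActivity.onCreate$lambda-1(DataStoreActivity.kt:29)",
                "  - waiting to lock <0x0493299a> (a java.lang.Object) held by thread 15");
        // prints: main waits for 0x0493299a held by tid 15: AsyncTask #1 (Runnable)
        findLockWaits(trace).forEach(System.out::println);
    }
}
```

If two threads each appear as the holder of the lock the other is waiting for, that is a true circular deadlock; a single entry like the one here points to a lock that is simply held for too long.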

Origin blog.csdn.net/m0_71506521/article/details/130187138