Solutions to Common ANR Problems

Almost every Android developer runs into ANR (Application Not Responding) problems regularly. The official recommendations and the analysis approaches summarized in blog posts solve some of them, but much of the time they are hard to apply and leave people confused: the receiver's logic is simple and fast, or the application's stack looks completely normal or idle, so why does the system think the receiver has hit an ANR? This article is based on an internal sharing session at my previous company and uses several examples to analyze the root causes of ANRs from different angles. I hope it serves as a reference the next time you analyze this type of problem. If anything here is wrong, corrections are welcome.


I. ANR classification

ANRs fall into the following types:

          Broadcast ANR

          Service ANR

          ContentProvider ANR

          Input ANR

System-oriented: WatchDog

II. Causes of ANR

The causes fall into the following types:

         Time-consuming operations on the main thread

         The main thread is blocked by other threads

         System level response blocking

         The available memory of the system or process itself is tight

         CPU resources are preempted

For all of these ANRs, here is a general analysis approach and the logs involved. When an ANR occurs, first look at the corresponding trace log to see whether the main thread is still processing the broadcast (or other request) or is blocked. If it is, congratulations: you are very close to the answer. If the stack is completely idle, you unfortunately need to widen the search and analyze the other logs as well: logcat, kernel logs, cpuinfo, and meminfo, consulted in that order.

1. Analyzing logcat: first search the log for keywords such as "ANR in", "low_memory", and "Slow operation", and use the matching entries to check the system CPU load. If the CPU usage of the application process is clearly too high, it is very likely that the process is consuming too much CPU, the system cannot schedule work in time, and the application is wrongly judged to have timed out.

2. Analyzing the kernel log: search directly for lowmemorykiller. If it appears, check whether its timestamps roughly match the ANR time. If they are close, this log shows the memory situation at the operating system level at that moment. Free is the free physical memory; File is the file cache, that is, memory holding files that applications or the system have read from storage, which the kernel keeps cached after use to speed up the next read or write. When the combined Free and Other values are low, the kernel swaps memory to some degree, which makes the whole system sluggish. The same condition also shows up in logcat as "Slow operation" entries, meaning that the scheduling of system processes is affected too.

3. Analyzing cpuinfo: this log is easy to read, and it shows at a glance which processes use the most CPU. If one process is clearly high, the ANR is probably related to that process hogging the CPU. If kswapd or emmc threads appear near the top, the system is under memory pressure or heavy file I/O.

4. Analyzing meminfo: the main goal is to see which applications or system components occupy the most memory. If application memory usage is fairly normal and the system itself is not using excessive memory, a large number of cached processes have probably not been released in time, leaving overall system memory low.

5. Consider the overall system environment at the time, such as battery level (low battery may trigger CPU frequency or core limits), device temperature (high temperature may also trigger frequency limits), and operation frequency (for example, a monkey test is running).

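The keyword search described in the list above can be scripted. The following is a minimal Python sketch, not part of the original analysis: the function name `scan_log` and the keyword tuple are assumptions chosen for illustration, and the sample lines are copied from the logs shown later in this article.

```python
# Keywords from the text above; matching is case-insensitive.
KEYWORDS = ("anr in", "low_memory", "slow operation", "lowmemorykiller")


def scan_log(lines, keywords=KEYWORDS):
    """Return (line_number, keyword, line) for every line that contains a keyword."""
    hits = []
    for number, line in enumerate(lines, start=1):
        lowered = line.lower()
        for keyword in keywords:
            if keyword in lowered:
                hits.append((number, keyword, line.rstrip()))
                break  # report each line once, under the first keyword that matches
    return hits


sample = [
    "11-17 09:42:25.292  1532  1572 W ActivityManager: Slow operation: 2440ms so far, now at startProcess: returned from zygote!",
    "11-17 09:49:41.392  1532  1574 E ActivityManager: ANR in com.android.systemui",
    "11-17 09:49:41.392  1532  1574 E ActivityManager: PID: 21916",
]

for number, keyword, line in scan_log(sample):
    print(number, keyword, line)
```

Running the same function over logcat and over the kernel log makes it easy to compare the timestamps of the hits with the ANR time.
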
With that background, let's work through some examples:

Example 1: The main thread performs time-consuming operations, or is blocked by other threads in the process

Example:
first step

Looking at the main thread's stack in the trace, the main thread is blocked while allocating memory, waiting for a GC to finish. The stack also shows that the GC is not running on this thread: other threads are performing the GC, and the main thread must wait for it to complete before its allocation can proceed.

"main" prio=5 tid=1 WaitingForGcToComplete

  native: #00 pc 0000000000019980  /system/lib64/libc.so (syscall+28)
  native: #01 pc 000000000013a62c  /system/lib64/libart.so (_ZN3art17ConditionVariable4WaitEPNS_6ThreadE+136)
  native: #02 pc 0000000000237f14  /system/lib64/libart.so (_ZN3art2gc4Heap19WaitForGcToCompleteENS0_7GcCauseEPNS_6ThreadE+1376)
  native: #03 pc 000000000024798c  /system/lib64/libart.so (_ZN3art2gc4Heap22AllocateInternalWithGcEPNS_6ThreadENS0_13AllocatorTypeEmPmS5_S5_PPNS_6mirror5ClassE+168)
  native: #04 pc 000000000050394c  /system/lib64/libart.so (artAllocObjectFromCodeRosAlloc+1412)
  native: #05 pc 00000000001215d0  /system/lib64/libart.so (art_quick_alloc_object_rosalloc+64)
  native: #06 pc 00000000018e72f0  /system/framework/arm64/boot.oat (Java_android_widget_TextView__0003cinit_0003e__Landroid_content_Context_2Landroid_util_AttributeSet_2II+1156)
  at android.widget.TextView.<init>(TextView.java:727)
  at android.widget.TextView.<init>(TextView.java:682)
  at android.widget.TextView.<init>(TextView.java:678)
  at java.lang.reflect.Constructor.newInstance!(Native method)
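Checking the state of every thread in a trace file, as the next step does by hand, can also be scripted. This is a small Python sketch under the assumption that thread header lines look like the ones quoted in this article; `thread_states` is a hypothetical helper name.

```python
import re

# Matches thread header lines such as: "main" prio=5 tid=1 WaitingForGcToComplete
THREAD_HEADER = re.compile(
    r'^"(?P<name>[^"]+)"(?: daemon)? prio=(?P<prio>\d+) tid=(?P<tid>\d+) (?P<state>\w+)'
)


def thread_states(trace_text):
    """Return a list of (name, tid, state) for every thread header in a trace."""
    threads = []
    for line in trace_text.splitlines():
        match = THREAD_HEADER.match(line.strip())
        if match:
            threads.append((match["name"], int(match["tid"]), match["state"]))
    return threads


trace = '''
"main" prio=5 tid=1 WaitingForGcToComplete
"LeuiRunningState:Background" prio=5 tid=28 WaitingPerformingGc
"AsyncTask #6" prio=5 tid=20 WaitingPerformingGc
'''

# Threads that are actually running the collection.
performing_gc = [t for t in thread_states(trace) if t[2] == "WaitingPerformingGc"]
print(performing_gc)
```
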

second step

Look at the state of the other threads. Searching further shows that the following threads are performing GC:

"LeuiRunningState:Background" prio=5 tid=28 WaitingPerformingGc

"AsyncTask #6" prio=5 tid=20 WaitingPerformingGc

To sum up, we can draw a preliminary conclusion: the threads with tid=28 and tid=20 are performing GC, which blocks the main thread's memory allocation. But GC is routine, so why does it take so long this time? With that question in mind, look at the memory usage of the process:

Total number of allocations 9887486

Total bytes allocated 732MB

Total bytes freed 476MB

Free memory 5KB

Free memory until GC 5KB

Free memory until OOME 5KB

Total memory 256MB

Max memory 256MB

The application has used its full 256MB heap and is only 5KB away from an OOM. It has also made more than 9.88 million allocations in total; with a very large number of objects on the heap, the GC has to scan a huge set of objects, which takes a long time. Being only 5KB from an OOM also indicates a memory leak or unreasonable memory usage.
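To put the numbers above in proportion, the remaining headroom can be computed from the dump. The sketch below is illustrative only: the field names are the ones printed in the memory summary, while `parse_size` and `headroom_ratio` are hypothetical helpers, and the units are assumed to be binary (1KB = 1024 bytes).

```python
import re

UNITS = {"B": 1, "KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3}


def parse_size(text):
    """Convert a size such as '256MB' or '5KB' into bytes."""
    match = re.fullmatch(r"(\d+)\s*(B|KB|MB|GB)", text.strip())
    if match is None:
        raise ValueError(f"unrecognised size: {text!r}")
    return int(match.group(1)) * UNITS[match.group(2)]


def headroom_ratio(summary):
    """Fraction of the maximum heap that is still free before an OOM."""
    free = parse_size(summary["Free memory until OOME"])
    maximum = parse_size(summary["Max memory"])
    return free / maximum


summary = {
    "Total memory": "256MB",
    "Max memory": "256MB",
    "Free memory until OOME": "5KB",
}

ratio = headroom_ratio(summary)
print(f"{ratio:.6%} of the maximum heap is still free")
```

A ratio this close to zero means nearly every allocation has to wait for a collection first, which matches the blocked main thread seen in the trace.
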

In summary, the application process leaks memory or uses it improperly, which makes GC take too long and produces the ANR.

Example 2: Application internal thread logic dependency causes timeout and triggers ANR

Example:


first step

Looking at the main thread's stack in the trace, the main thread is blocked in the middle of a Binder call.

"main" prio=5 tid=1 Native
  | group="main" sCount=1 dsCount=0 obj=0x75f0eaa8 self=0x7fad046a00
  | sysTid=4298 nice=-6 cgrp=default sched=0/0 handle=0x7fb1d18fe8
  | state=S schedstat=( 79488910537 19985244611 169915 ) utm=6564 stm=1384 core=0 HZ=100
  | stack=0x7fc237c000-0x7fc237e000 stackSize=8MB
  | held mutexes=
  kernel: (couldn't read /proc/self/task/4298/stack)
  native: #00 pc 00000000000683d0  /system/lib64/libc.so (__ioctl+4)
  native: #01 pc 00000000000723f8  /system/lib64/libc.so (ioctl+100)
  native: #02 pc 000000000002d584  /system/lib64/libbinder.so (_ZN7android14IPCThreadState14talkWithDriverEb+164)
  native: #03 pc 000000000002e050  /system/lib64/libbinder.so (_ZN7android14IPCThreadState15waitForResponseEPNS_6ParcelEPi+104)
  native: #04 pc 000000000002e2c4  /system/lib64/libbinder.so (_ZN7android14IPCThreadState8transactEijRKNS_6ParcelEPS1_j+176)
  native: #05 pc 0000000000025654  /system/lib64/libbinder.so (_ZN7android8BpBinder8transactEjRKNS_6ParcelEPS1_j+64)
  native: #06 pc 00000000000e0928  /system/lib64/libandroid_runtime.so (???)
  native: #07 pc 000000000139ba24  /system/framework/arm64/boot.oat (Java_android_os_BinderProxy_transactNative__ILandroid_os_Parcel_2Landroid_os_Parcel_2I+200)
  at android.os.BinderProxy.transactNative(Native method)
  at android.os.BinderProxy.transact(Binder.java:503)
  at android.nfc.INfcAdapter$Stub$Proxy.setAppCallback(INfcAdapter.java:529)
  at android.nfc.NfcActivityManager.requestNfcServiceCallback(NfcActivityManager.java:339)
  at android.nfc.NfcActivityManager.setNdefPushMessageCallback(NfcActivityManager.java:309)

second step

Next, find out which process this thread is communicating with. Search for the keyword "setAppCallback" (by Android convention, the client-side and server-side functions have essentially the same name). The Binder_3 thread of the NFC process picked up the client request, but while handling it was blocked waiting for a lock held by thread 1. Follow the chain and look at the state of thread 1:

"Binder_3" prio=5 tid=17 Blocked

  | group="main" sCount=1 dsCount=0 obj=0x12ddf0a0 self=0x7fa670f000

  | sysTid=3183 nice=-6 cgrp=default sched=0/0 handle=0x7f93c30440

  | state=S schedstat=( 3041465858 2637156615 16961 ) utm=168 stm=136 core=3 HZ=100

  | stack=0x7f93b34000-0x7f93b36000 stackSize=1013KB

  | held mutexes=

  at com.android.nfc.P2pLinkManager.setNdefCallback(P2pLinkManager.java:420)

  - waiting to lock <0x0bed0520> (a com.android.nfc.P2pLinkManager) held by thread 1

  at com.android.nfc.NfcService$NfcAdapterService.setAppCallback(NfcService.java:1679)

  at android.nfc.INfcAdapter$Stub.onTransact(INfcAdapter.java:178)

  at android.os.Binder.execTransact(Binder.java:453)

"main" prio=5 tid=1 Native
  | group="main" sCount=1 dsCount=0 obj=0x75f0eaa8 self=0x7fad046a00
  | sysTid=2706 nice=0 cgrp=default sched=0/0 handle=0x7fb1d18fe8
  | state=S schedstat=( 115355173189 36125520701 224819 ) utm=8594 stm=2941 core=0 HZ=100
  | stack=0x7fc237c000-0x7fc237e000 stackSize=8MB
  | held mutexes=
  kernel: (couldn't read /proc/self/task/2706/stack)
  native: #00 pc 00000000000683d0  /system/lib64/libc.so (__ioctl+4)
  native: #01 pc 00000000000723f8  /system/lib64/libc.so (ioctl+100)
  native: #02 pc 000000000002d584  /system/lib64/libbinder.so (_ZN7android14IPCThreadState14talkWithDriverEb+164)
  native: #03 pc 000000000002e050  /system/lib64/libbinder.so (_ZN7android14IPCThreadState15waitForResponseEPNS_6ParcelEPi+104)
  native: #04 pc 000000000002e2c4  /system/lib64/libbinder.so (_ZN7android14IPCThreadState8transactEijRKNS_6ParcelEPS1_j+176)
  native: #05 pc 0000000000025654  /system/lib64/libbinder.so (_ZN7android8BpBinder8transactEjRKNS_6ParcelEPS1_j+64)
  native: #06 pc 00000000000e0928  /system/lib64/libandroid_runtime.so (???)
  native: #07 pc 000000000139ba24  /system/framework/arm64/boot.oat (Java_android_os_BinderProxy_transactNative__ILandroid_os_Parcel_2Landroid_os_Parcel_2I+200)
  at android.os.BinderProxy.transactNative(Native method)
  at android.os.BinderProxy.transact(Binder.java:503)
  at android.nfc.IAppCallback$Stub$Proxy.createBeamShareData(IAppCallback.java:113)
  at com.android.nfc.P2pLinkManager.prepareMessageToSend(P2pLinkManager.java:558)
  - locked <0x0bed0520> (a com.android.nfc.P2pLinkManager)

The NFC main thread is itself blocked in a Binder call. Searching for the keyword "createBeamShareData" shows that the call goes back into the browser process: the browser's Binder_6 thread picked up the request and is in the Waiting state.

"Binder_6" prio=5 tid=12 Waiting

  | group="main" sCount=1 dsCount=0 obj=0x12c13a00 self=0x7f52850e00

  | sysTid=23857 nice=0 cgrp=default sched=0/0 handle=0x7f694ff440

  | state=S schedstat=( 705897380 828401158 3677 ) utm=45 stm=25 core=1 HZ=100

  | stack=0x7f69403000-0x7f69405000 stackSize=1013KB

  | held mutexes=

  at java.lang.Object.wait!(Native method)

  - waiting on <0x08a80433> (a java.lang.Object)

  at java.lang.Thread.parkFor$(Thread.java:1220)

  - locked <0x08a80433> (a java.lang.Object)

  at sun.misc.Unsafe.park(Unsafe.java:299)

  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)

  at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:810)

  at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:970)

  at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1278)

  at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:203)

  at com.android.browser.NfcHandler.createNdefMessage(NfcHandler.java:92)

  at android.nfc.NfcActivityManager.createBeamShareData(NfcActivityManager.java:377)

  at android.nfc.IAppCallback$Stub.onTransact(IAppCallback.java:53)

  at android.os.Binder.execTransact(Binder.java:453)

Why is Binder_6 in the Waiting state? Answering this requires reading the source code to understand the logic. It turns out that the work this thread is waiting for is executed on the main thread; only when that work completes is the thread notified so that it stops waiting.

We have now found the complete chain (browser main thread ----> NFC Binder_3 ----> NFC main thread ----> browser Binder_6 ----> browser main thread), and with it the root cause: a deadlock!

To sum up, a deadlock in cross-process communication causes the ANR, and the fix is to break this circular wait.
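The monitor-lock part of such a chain can be extracted from a trace mechanically; the Binder hops still have to be matched by method name, as was done above. The following Python sketch assumes the "waiting to lock ... held by thread N" and "locked <...>" line formats shown in this example, and `lock_graph` is a hypothetical helper name.

```python
import re

HEADER = re.compile(r'^"(?P<name>[^"]+)".* tid=(?P<tid>\d+) ')
WAITING = re.compile(
    r"- waiting to lock <(?P<lock>0x[0-9a-f]+)> .*held by thread (?P<holder>\d+)"
)
LOCKED = re.compile(r"- locked <(?P<lock>0x[0-9a-f]+)>")


def lock_graph(trace_text):
    """Return (waits, holders) for the trace of a single process.

    waits:   list of (waiting_thread_name, lock_address, holder_tid)
    holders: dict mapping lock_address -> name of the thread that locked it
    """
    waits, holders = [], {}
    current = None
    for raw in trace_text.splitlines():
        line = raw.strip()
        header = HEADER.match(line)
        if header:
            current = header["name"]
        elif current is not None:
            waiting = WAITING.search(line)
            locked = LOCKED.search(line)
            if waiting:
                waits.append((current, waiting["lock"], int(waiting["holder"])))
            elif locked:
                holders[locked["lock"]] = current
    return waits, holders


# Excerpt of the NFC process trace from this example.
nfc_trace = '''
"Binder_3" prio=5 tid=17 Blocked
  at com.android.nfc.P2pLinkManager.setNdefCallback(P2pLinkManager.java:420)
  - waiting to lock <0x0bed0520> (a com.android.nfc.P2pLinkManager) held by thread 1
"main" prio=5 tid=1 Native
  at com.android.nfc.P2pLinkManager.prepareMessageToSend(P2pLinkManager.java:558)
  - locked <0x0bed0520> (a com.android.nfc.P2pLinkManager)
'''

waits, holders = lock_graph(nfc_trace)
for thread, lock, holder_tid in waits:
    print(f"{thread} waits for {lock}, held by thread {holder_tid} ({holders.get(lock)})")
```
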

The two types of problems above are relatively simple, and most people can analyze and solve them on their own. The following types involve the system or other factors and are more obscure; for these, the causes or optimization suggestions are given.

Example 3: System memory is too low, and kernel memory swapping makes the entire system slow (stuck)

Example:


first step

Looking at the main thread's stack in the trace, the main thread is in the Suspended state. This generally happens in one of two situations. Either the process itself is too busy, the time slice it gets each time is not enough, and the scheduler forcibly switches it out; or the system as a whole is busy and low-priority processes cannot get time slices. With these possibilities in mind, read on:

"main" prio=5 tid=1 Suspended

  | group="main" sCount=1 dsCount=0 obj=0x745518a0 self=0x7f86254a00

  | sysTid=21916 nice=0 cgrp=default sched=0/0 handle=0x7f8b30efc8

  | state=S schedstat=( 311762801762 96254728754 409881 ) utm=25610 stm=5566 core=0 HZ=100

  | stack=0x7fd023c000-0x7fd023e000 stackSize=8MB

  | held mutexes=

  at java.util.regex.Splitter.fastSplit(Splitter.java:73)

  at java.lang.String.split(String.java:1410)

  at java.lang.String.split(String.java:1392)

  at android.content.res.theme.LeResourceHelper.getResName(LeResourceHelper.java:193)

  at android.content.res.Resources.loadDrawable(Resources.java:2624)

  at android.content.res.Resources.getDrawable(Resources.java:862)

  at android.content.Context.getDrawable(Context.java:458)

  at android.widget.ImageView.resolveUri(ImageView.java:813)

At this point you can check whether the application logic contains busy operations that hog the time slices. You can also check the corresponding logs; logcat shows the following:

11-17 09:49:41.392  1532  1574 E ActivityManager: ANR in com.android.systemui

11-17 09:49:41.392  1532  1574 E ActivityManager: PID: 21916

11-17 09:49:41.392  1532  1574 E ActivityManager: Reason: Broadcast of Intent { act=android.intent.action.TIME_TICK flg=0x50000014 mCallingUid=1000 (has extras) }

11-17 09:49:41.392  1532  1574 E ActivityManager: Load: 22.72 / 20.06 / 15.54 (load averages over 1 / 5 / 15 minutes)

11-17 09:49:41.392  1532  1574 E ActivityManager: CPU usage from 3ms to 24033ms later:

11-17 09:49:41.392  1532  1574 E ActivityManager:   60% 134/kswapd0: 0% user + 60% kernel

11-17 09:49:41.392  1532  1574 E ActivityManager:   32% 1532/system_server: 7.4% user + 25% kernel / faults: 31214 minor 423 major
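The load averages and per-process CPU lines of an ANR report like the one above can be pulled out with a few regular expressions. This is a hedged Python sketch that assumes the line format shown in this article; `parse_anr_report` and the dictionary keys are names chosen for illustration.

```python
import re

LOAD = re.compile(r"Load: ([\d.]+) / ([\d.]+) / ([\d.]+)")
PROCESS = re.compile(
    r"(?P<total>[\d.]+)% (?P<pid>\d+)/(?P<name>\S+): "
    r"(?P<user>[\d.]+)% user \+ (?P<kernel>[\d.]+)% kernel"
)


def parse_anr_report(lines):
    """Return (load_averages, processes) parsed from ANR report lines."""
    load, processes = None, []
    for line in lines:
        load_match = LOAD.search(line)
        if load_match:
            load = tuple(float(value) for value in load_match.groups())
            continue
        proc = PROCESS.search(line)
        if proc:
            processes.append({
                "name": proc["name"],
                "pid": int(proc["pid"]),
                "total": float(proc["total"]),
                "user": float(proc["user"]),
                "kernel": float(proc["kernel"]),
            })
    return load, processes


report = [
    "11-17 09:49:41.392  1532  1574 E ActivityManager: Load: 22.72 / 20.06 / 15.54",
    "11-17 09:49:41.392  1532  1574 E ActivityManager:   60% 134/kswapd0: 0% user + 60% kernel",
    "11-17 09:49:41.392  1532  1574 E ActivityManager:   32% 1532/system_server: 7.4% user + 25% kernel / faults: 31214 minor 423 major",
]

load, processes = parse_anr_report(report)
swap_pressure = [p for p in processes if p["name"].startswith("kswapd")]
print(load, swap_pressure)
```

A kswapd entry near the top of the list, combined with a high load average, is the pattern this example goes on to investigate.
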

The overall system load is very heavy (a normal load is about 10), and the kswapd CPU usage is extremely high. Together, these suggest that system memory is low, so processes are constantly being killed and memory is being swapped. Is this really the case? Search for another keyword, "Slow operation":

11-17 09:42:25.292  1532  1572 W ActivityManager: Slow operation: 2440ms so far, now at startProcess: returned from zygote!

11-17 09:42:25.357  1532  1572 W ActivityManager: Slow operation: 2505ms so far, now at startProcess: done updating battery stats

11-17 09:42:25.357  1532  1572 I am_proc_start: [0,30188,10088,com.letv.android.usagestats,service,com.letv.android.usagestats/.UsageStatsReportService]

11-17 09:42:25.357  1532  1572 W ActivityManager: Slow operation: 2505ms so far, now at startProcess: building log message

11-17 09:42:25.357  1532  1572 I ActivityManager: Start proc 30188:com.letv.android.usagestats/u0a88 for service com.letv.android.usagestats/.UsageStatsReportService

11-17 09:42:25.357  1532  1572 W ActivityManager: Slow operation: 2505ms so far, now at startProcess: starting to update pids map

11-17 09:42:25.357  1532  1572 W ActivityManager: Slow operation: 2505ms so far, now at startProcess: done updating pids map

11-17 09:42:25.385  1532  1572 W ActivityManager: Slow operation: 2534ms so far, now at startProcess: done starting proc!

An ordinary system function takes more than 2 seconds for a single call, which shows how sluggish the system is. To confirm the memory theory, look at the meminfo log:

Total PSS by process:

  3441530 kB: com.android.mms (pid 2518 / activities)

   229272 kB: mediaserver (pid 763)

The PSS shows that the SMS process occupies more than 3GB of memory! The first reaction is a memory leak: an ordinary application, or even the system itself, cannot legitimately use that much. If you have time, look at the kernel log and search for lowmemorykiller; a large number of processes should have been killed around the time of the problem.
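Spotting an outlier in the "Total PSS by process" section is easy to automate. The Python sketch below assumes the line format shown above; `parse_pss` is a hypothetical helper, and the 1GB threshold is an arbitrary value chosen for illustration rather than a platform limit.

```python
import re

PSS_LINE = re.compile(r"^\s*(?P<kb>[\d,]+) ?kB: (?P<name>\S+) \(pid (?P<pid>\d+)")


def parse_pss(meminfo_text):
    """Return (process_name, pid, pss_kb) tuples sorted by PSS, largest first."""
    rows = []
    for line in meminfo_text.splitlines():
        match = PSS_LINE.match(line)
        if match:
            kb = int(match["kb"].replace(",", ""))
            rows.append((match["name"], int(match["pid"]), kb))
    return sorted(rows, key=lambda row: row[2], reverse=True)


meminfo = """
Total PSS by process:
  3441530 kB: com.android.mms (pid 2518 / activities)
   229272 kB: mediaserver (pid 763)
"""

THRESHOLD_KB = 1024 * 1024  # illustrative 1GB threshold, not a platform limit
rows = parse_pss(meminfo)
suspects = [row for row in rows if row[2] > THRESHOLD_KB]
for name, pid, kb in suspects:
    print(f"{name} (pid {pid}): {kb / 1024 ** 2:.2f} GB")
```
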

In summary, the application has a memory leak in the native layer (a Java-layer leak could not grow this large, because the Java heap is capped). As a result, overall system memory is tight. Because the process is persistent, it has a very high priority (adj -12) and LMK will not kill it; the system can only keep killing other applications and swapping memory. For similar problems, see XIIIM-8358.

Example 4: Binder resources are exhausted, making it difficult to respond to communication requests in a timely manner

Example:


As with low memory, the main thread stack for this type of problem looks basically normal:

"main" prio=5 tid=1 Native
  | group="main" sCount=1 dsCount=0 obj=0x76261710 self=0x7f82646a00
  | sysTid=3084 nice=0 cgrp=default sched=0/0 handle=0x7f874adfe8
  | state=S schedstat=( 83808100322 29188718104 264083 ) utm=5716 stm=2664 core=1 HZ=100
  | stack=0x7ff0f87000-0x7ff0f89000 stackSize=8MB
  | held mutexes=
  kernel: (couldn't read /proc/self/task/3084/stack)
  native: #00 pc 00000000000682e4  /system/lib64/libc.so (__epoll_pwait+8)
  native: #01 pc 000000000001f3a4  /system/lib64/libc.so (epoll_pwait+32)
  native: #02 pc 000000000001be88  /system/lib64/libutils.so (_ZN7android6Looper9pollInnerEi+144)
  native: #03 pc 000000000001c268  /system/lib64/libutils.so (_ZN7android6Looper8pollOnceEiPiS1_PPv+80)
  native: #04 pc 00000000000d3088  /system/lib64/libandroid_runtime.so (_ZN7android18NativeMessageQueue8pollOnceEP7_JNIEnvP8_jobjecti+48)
  native: #05 pc 000000000000554c  /system/framework/arm64/boot.oat (Java_android_os_MessageQueue_nativePollOnce__JI+144)
  at android.os.MessageQueue.nativePollOnce(Native method)
  at android.os.MessageQueue.next(MessageQueue.java:324)
  at android.os.Looper.loop(Looper.java:135)

In this situation, go straight to the logs and analyze them in order: logcat, kernel, cpuinfo, meminfo, and so on:

11-08 23:51:44.088  1514  1554 E ActivityManager: ANR in com.android.phone

11-08 23:51:44.088  1514  1554 E ActivityManager: PID: 3084

11-08 23:51:44.088  1514  1554 E ActivityManager: Reason: Broadcast of Intent { act=com.android.internal.telephony.data-restart-trysetup.default flg=0x10000014 mCallingUid=1001 (has extras) }

11-08 23:51:44.088  1514  1554 E ActivityManager: Load: 9.92 / 9.81 / 10.02

11-08 23:51:44.088  1514  1554 E ActivityManager: CPU usage from 0ms to 6497ms later:

11-08 23:51:44.088  1514  1554 E ActivityManager:   108% 3084/com.android.phone: 101% user + 6.7% kernel / faults: 12120 minor 179 major

11-08 23:51:44.088  1514  1554 E ActivityManager:   66% 1514/system_server: 16% user + 49% kernel / faults: 20836 minor 88 major

11-08 23:51:44.088  1514  1554 E ActivityManager:   13% 13013/ca.bellmedia.cp24: 5.3% user + 8.4% kernel / faults: 3216 minor 39 major

These logs show that the CPU usage of the ANR process itself is relatively high. Searching for keywords such as "Slow operation" and "low_memory" finds nothing in the log, and lowmemorykiller appears in the dmesg log only at a reasonable frequency, so low memory can basically be ruled out. The analysis therefore continues in the CPU direction.

The logs offer no further clues. Since the main thread is in a normal state, the high CPU usage must come from other threads, so go back to the trace and continue the analysis. Checking the other threads of the phone process shows that almost all Binder threads are waiting; only Binder_2 is working:

"Binder_1" prio=5 tid=40 TimedWaiting

"Binder_3" prio=5 tid=40 TimedWaiting

"Binder_4" prio=5 tid=40 TimedWaiting

"Binder_5" prio=5 tid=39 TimedWaiting

"Binder_6" prio=5 tid=40 TimedWaiting

"Binder_7" prio=5 tid=40 TimedWaiting

"Binder_8" prio=5 tid=40 TimedWaiting

....

"Binder_2" prio=5 tid=8 Native

  | group="main" sCount=1 dsCount=0 obj=0x12c9b0a0 self=0x7f7be14400
  | sysTid=3107 nice=0 cgrp=default sched=0/0 handle=0x7f8131d440
  | state=R schedstat=( 515275891171 40426859698 234033 ) utm=49200 stm=2327 core=2 HZ=100
  | stack=0x7f81221000-0x7f81223000 stackSize=1013KB
  | held mutexes=
  kernel: (couldn't read /proc/self/task/3107/stack)
  native: #00 pc 0000000000070f20  /system/lib64/libsqlite.so (???)
  native: #01 pc 000000000007420c  /system/lib64/libsqlite.so (sqlite3_step+652)
  native: #02 pc 00000000000ba4a4  /system/lib64/libandroid_runtime.so (???)
  native: #03 pc 00000000000ba514  /system/lib64/libandroid_runtime.so (???)
  native: #04 pc 00000000003bc578  /system/framework/arm64/boot.oat (Java_android_database_sqlite_SQLiteConnection_nativeExecuteForChangedRowCount__JJ+140)
  at android.database.sqlite.SQLiteConnection.nativeExecuteForChangedRowCount(Native method)
  at android.database.sqlite.SQLiteConnection.executeForChangedRowCount(SQLiteConnection.java:732)
  at android.database.sqlite.SQLiteSession.executeForChangedRowCount(SQLiteSession.java:754)
  at android.database.sqlite.SQLiteStatement.executeUpdateDelete(SQLiteStatement.java:64)
  at android.database.sqlite.SQLiteDatabase.delete(SQLiteDatabase.java:1499)
  at com.android.providers.telephony.SmsProvider.delete(SmsProvider.java:899)
  at android.content.ContentProvider$Transport.delete(ContentProvider.java:339)
  at android.content.ContentProviderNative.onTransact(ContentProviderNative.java:206)
  at android.os.Binder.execTransact(Binder.java:453)
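The `state=... schedstat=...` line of a thread says how much CPU time it has consumed. As a sketch, assuming the usual meaning of these fields (schedstat is time on CPU in nanoseconds, time waiting on the run queue in nanoseconds, and number of time slices; utm and stm are user and kernel time in clock ticks; HZ is ticks per second), the line for Binder_2 above can be decoded as follows. `decode_state_line` is a hypothetical helper name.

```python
import re

STATE_LINE = re.compile(
    r"state=(?P<state>\w) schedstat=\( (?P<run_ns>\d+) (?P<wait_ns>\d+) (?P<slices>\d+) \) "
    r"utm=(?P<utm>\d+) stm=(?P<stm>\d+) core=(?P<core>\d+) HZ=(?P<hz>\d+)"
)


def decode_state_line(line):
    """Convert the scheduler fields of a trace thread into seconds."""
    match = STATE_LINE.search(line)
    if match is None:
        raise ValueError("not a state/schedstat line")
    hz = int(match["hz"])
    return {
        "state": match["state"],
        "running_s": int(match["run_ns"]) / 1e9,      # time actually on a CPU
        "run_queue_wait_s": int(match["wait_ns"]) / 1e9,  # runnable but not scheduled
        "user_s": int(match["utm"]) / hz,
        "kernel_s": int(match["stm"]) / hz,
    }


binder_2 = "  | state=R schedstat=( 515275891171 40426859698 234033 ) utm=49200 stm=2327 core=2 HZ=100"
info = decode_state_line(binder_2)
print(info)
```

Decoded this way, Binder_2 has spent about 515 seconds on the CPU, almost all of it in user mode, while the same fields on the main thread of this process (utm=5716, stm=2664) add up to about 84 seconds.
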

Now look at the thread's state: state=R means it is running. The code on this stack normally prints logs, so go back to the log again, which shows the following:

11-08 23:51:14.512  3084  3289 W SQLiteConnectionPool: The connection pool for database '/data/user/0/com.android.providers.telephony/databases/mmssms.db' has been unable to grant a connection to thread 111 (Binder_3) with flags 0x1 for 30.000002 seconds.

11-08 23:51:14.512  3084  3289 W SQLiteConnectionPool: Connections: 1 active, 0 idle, 0 available.

11-08 23:51:14.512  3084  3289 W SQLiteConnectionPool:

11-08 23:51:14.512  3084  3289 W SQLiteConnectionPool: Requests in progress:

11-08 23:51:14.512  3084  3289 W SQLiteConnectionPool:   executeForChangedRowCount started 30008ms ago - running, sql="DELETE FROM sms WHERE (thread_id=2) AND (locked=0 AND date<1452658564000)"

11-08 23:51:14.513  3084  3613 W SQLiteConnectionPool: The connection pool for database '/data/user/0/com.android.providers.telephony/databases/mmssms.db' has been unable to grant a connection to thread 141 (Binder_5) with flags 0x1 for 30.009 seconds.

11-08 23:51:14.513  3084  3613 W SQLiteConnectionPool: Connections: 1 active, 0 idle, 0 available.

This means that when the Binder_3 and Binder_5 threads tried to execute SQL, another thread had already been executing a statement for more than 30 seconds without finishing. Collecting further logs shows 15 Binder threads in the waiting state, while the one actually executing is Binder_2, which has been running for more than 30 seconds.
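When there are many of these warnings, it helps to list which threads are starved and which statement is holding the connection. The Python sketch below relies only on the message text quoted above; `pool_waits` and `running_statements` are hypothetical helper names.

```python
import re

POOL_WAIT = re.compile(
    r"unable to grant a connection to thread (?P<tid>\d+) \((?P<name>[^)]+)\) "
    r"with flags (?P<flags>0x[0-9a-f]+) for (?P<seconds>[\d.]+) seconds"
)
IN_PROGRESS = re.compile(
    r'(?P<op>\w+) started (?P<ms>\d+)ms ago - running, sql="(?P<sql>.*)"'
)


def pool_waits(lines):
    """Return (thread_name, seconds_waited) for every starved thread."""
    return [
        (m["name"], float(m["seconds"]))
        for m in (POOL_WAIT.search(line) for line in lines)
        if m
    ]


def running_statements(lines):
    """Return (operation, elapsed_ms, sql) for every statement still running."""
    return [
        (m["op"], int(m["ms"]), m["sql"])
        for m in (IN_PROGRESS.search(line) for line in lines)
        if m
    ]


log = """\
11-08 23:51:14.512  3084  3289 W SQLiteConnectionPool: The connection pool for database '/data/user/0/com.android.providers.telephony/databases/mmssms.db' has been unable to grant a connection to thread 111 (Binder_3) with flags 0x1 for 30.000002 seconds.
11-08 23:51:14.512  3084  3289 W SQLiteConnectionPool:   executeForChangedRowCount started 30008ms ago - running, sql="DELETE FROM sms WHERE (thread_id=2) AND (locked=0 AND date<1452658564000)"
11-08 23:51:14.513  3084  3613 W SQLiteConnectionPool: The connection pool for database '/data/user/0/com.android.providers.telephony/databases/mmssms.db' has been unable to grant a connection to thread 141 (Binder_5) with flags 0x1 for 30.009 seconds.
""".splitlines()

waiting = pool_waits(log)
running = running_statements(log)
print(waiting)
print(running)
```
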

In summary, the process's CPU usage is high because the Binder_2 thread spends too long executing an SQL operation. This blocks all the other Binder threads, so the system's broadcast cannot be delivered to the main thread through Binder in time, and the system wrongly concludes that the Phone process timed out handling the broadcast.

Example 5: High CPU excessively preempts time slices, making it difficult for other applications or tasks to be scheduled in time

Example:
In this kind of problem the main thread is mostly in the idle or Suspended state. The latter means that the CPU time slices allocated by the system cannot meet the thread's needs and it is forcibly switched out. The cause is either activity in the underlying system or higher-priority tasks preempting the CPU.

"main" prio=5 tid=1 Suspended

  | group="main" sCount=2 dsCount=0 obj=0x75285af8 self=0x7f87a46a00

  | sysTid=9251 nice=-6 cgrp=default sched=0/0 handle=0x7f8c5f7fe8

  | state=S schedstat=( 50580737351 8433337317 81975 ) utm=4561 stm=497 core=1 HZ=100

  | stack=0x7ff8105000-0x7ff8107000 stackSize=8MB

  | held mutexes=

  at java.util.Arrays.checkOffsetAndCount(Arrays.java:1722)

  at java.nio.CharBuffer.wrap(CharBuffer.java:90)

  at java.nio.CharBuffer.wrap(CharBuffer.java:68)

  at android.text.TextDirectionHeuristics$TextDirectionHeuristicImpl.isRtl(TextDirectionHeuristics.java:149)

  at android.text.BoringLayout.isBoring(BoringLayout.java:477)

  at android.widget.TextView.onMeasure(TextView.java:7096)

  at android.view.View.measure(View.java:19138)

  at android.view.ViewGroup.measureChildWithMargins(ViewGroup.java:6064)

  at android.widget.LinearLayout.measureChildBeforeLayout(LinearLayout.java:1465)

  at android.widget.LinearLayout.measureHorizontal(LinearLayout.java:1112)

  at android.widget.LinearLayout.onMeasure(LinearLayout.java:632)

  at android.view.View.measure(View.java:19138)

When the trace offers nothing more, turn to the logs. Searching for the keyword "ANR in" finds:

11-26 11:47:16.514  1457  1490 E ActivityManager: ANR in com.android.browser (com.android.browser/.MainActivity)

11-26 11:47:16.514  1457  1490 E ActivityManager: PID: 9251

11-26 11:47:16.514  1457  1490 E ActivityManager: Reason: Input dispatching timed out (Waiting to send non-key event because the touched window has not finished processing certain input events that were delivered to it over 500.0ms ago.  Wait queue length: 10.  Wait queue head age: 8974.9ms.)

11-26 11:47:16.514  1457  1490 E ActivityManager: Load: 10.97 / 10.71 / 10.0

11-26 11:47:16.514  1457  1490 E ActivityManager: CPU usage from 0ms to 10480ms later:

11-26 11:47:16.514  1457  1490 E ActivityManager:   114% 9251/com.android.browser: 65% user + 48% kernel / faults: 10870 minor 11 major

11-26 11:47:16.514  1457  1490 E ActivityManager:   108% 1457/system_server: 33% user + 74% kernel / faults: 9584 minor 11 major

The browser's own CPU usage is relatively high. system_server also uses a lot, especially in the kernel (74% kernel) during the "CPU usage from 0ms to 10480ms later" window, but do not rush to blame system_server: its CPU usage is high here because it has to dump the information of each process.

Looking at the log entries before "ANR in", we find a large number of frequent GC operations:

11-26 11:47:05.204  1457  1467 I art     : Background partial concurrent mark sweep GC freed 842(578KB) AllocSpace objects, 455(85MB) LOS objects, 8% free, 169MB/185MB, paused 2.140ms total 245.072ms

11-26 11:47:10.493  9251 31938 W art     : Suspending all threads took: 131.446ms

11-26 11:47:10.598  9251 31938 W art     : Suspending all threads took: 88.134ms

11-26 11:47:10.699  9251 31938 W art     : Suspending all threads took: 93.939ms

11-26 11:47:10.795  9251 31938 W art     : Suspending all threads took: 75.051ms

11-26 11:47:10.821  9251 31938 W art     : Suspending all threads took: 14.536ms

11-26 11:47:10.956  9251 31938 W art     : Suspending all threads took: 114.243ms

11-26 11:47:11.101  9251 31938 W art     : Suspending all threads took: 121.775ms

11-26 11:47:11.254  9251 31938 W art     : Suspending all threads took: 93.763ms

.....

Judging by the GC type (background partial concurrent mark sweep), some task is constantly allocating and using large amounts of memory. With this in mind, go back to the trace log and analyze the state of the relevant threads. After a lot of comparison and filtering, I was lucky enough to find the following thread, which exists only while TraceView data is being collected, and it is in the R state. Its sysTid (31938) also matches the thread ID on the "Suspending all threads" log lines above. Anyone familiar with TraceView knows that this task makes the profiled process consume a lot of CPU and become very sluggish (the main thread cannot respond in time).

"Sampling Profiler" daemon prio=9 tid=162 Native

  | group="system" sCount=1 dsCount=0 obj=0x13102220 self=0x7f5a82f800

  | sysTid=31938 nice=-6 cgrp=default sched=0/0 handle=0x7f643ff440

  | state=R schedstat=( 22112458218 4449717737 10001 ) utm=2021 stm=190 core=0 HZ=100

In summary, we found the reason for the process's high CPU usage: the thread collecting TraceView data allocates a large amount of memory, which keeps triggering GC inside the process, and its own work is time-consuming as well. As a result, the main thread cannot be scheduled and respond in time, which triggers the ANR.
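How much time the process lost to thread suspension can be estimated by summing the "Suspending all threads took" entries per logging thread. The following Python sketch assumes the logcat line format shown in this example; `total_suspend_ms` is a hypothetical helper name.

```python
import re

SUSPEND = re.compile(
    r"^\d\d-\d\d (?P<time>\d\d:\d\d:\d\d\.\d+)\s+(?P<pid>\d+)\s+(?P<tid>\d+) W art\s*: "
    r"Suspending all threads took: (?P<ms>[\d.]+)ms"
)


def total_suspend_ms(lines):
    """Sum the suspension time per (pid, tid) that logged it."""
    totals = {}
    for line in lines:
        match = SUSPEND.match(line)
        if match:
            key = (int(match["pid"]), int(match["tid"]))
            totals[key] = totals.get(key, 0.0) + float(match["ms"])
    return totals


log = """\
11-26 11:47:05.204  1457  1467 I art     : Background partial concurrent mark sweep GC freed 842(578KB) AllocSpace objects, 455(85MB) LOS objects, 8% free, 169MB/185MB, paused 2.140ms total 245.072ms
11-26 11:47:10.493  9251 31938 W art     : Suspending all threads took: 131.446ms
11-26 11:47:10.598  9251 31938 W art     : Suspending all threads took: 88.134ms
11-26 11:47:10.699  9251 31938 W art     : Suspending all threads took: 93.939ms
11-26 11:47:10.795  9251 31938 W art     : Suspending all threads took: 75.051ms
11-26 11:47:10.821  9251 31938 W art     : Suspending all threads took: 14.536ms
11-26 11:47:10.956  9251 31938 W art     : Suspending all threads took: 114.243ms
11-26 11:47:11.101  9251 31938 W art     : Suspending all threads took: 121.775ms
11-26 11:47:11.254  9251 31938 W art     : Suspending all threads took: 93.763ms
""".splitlines()

totals = total_suspend_ms(log)
for (pid, tid), ms in totals.items():
    print(f"pid {pid} tid {tid}: {ms:.3f} ms suspended")
```

For the eight entries quoted in this example the sum is about 733 ms, logged within less than one second of wall-clock time, which is consistent with a main thread that can barely run.
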

Example 6: Incomplete logs, with the trace or other logs missing

Example:
Running into this kind of problem is quite frustrating. All we can do is analyze the information we have and try to find the problem or a direction for improvement. For example, the trace may be missing while the other logs are relatively complete.

For example, the event log gives the approximate time of the application's ANR: 10-14 00:40:26.010650

10-14 00:40:26.010650  1132  1172 I am_anr  : [0,19746,android.process.media,952680005,Broadcast of Intent { act=android.intent.action.MEDIA_SCANNER_SCAN_FILE dat=file:///sdcard/AutoSmoke_UI30/testSwitchLetvView_20161014_003533/1476376700108.png flg=0x10 cmp=com.android.providers.media/.MediaScannerReceiver }]

The per-process CPU usage at the time of the ANR is found in sys_log:

10-14 00:40:57.052274  1132  1172 E ANRManager: ANR in android.process.media, time=304722739
10-14 00:40:57.052274  1132  1172 E ANRManager: Reason: Broadcast of Intent { act=android.intent.action.MEDIA_SCANNER_SCAN_FILE dat=file:///sdcard/AutoSmoke_UI30/testSwitchLetvView_20161014_003533/1476376700108.png flg=0x10 cmp=com.android.providers.media/.MediaScannerReceiver }
10-14 00:40:57.052274  1132  1172 E ANRManager: Load: 37.88 / 25.54 / 20.22
10-14 00:40:57.052274  1132  1172 E ANRManager: Android time :[2016-10-14 00:40:56.95] [304754.500]
10-14 00:40:57.052274  1132  1172 E ANRManager: CPU usage from 17448ms to 0ms ago:
10-14 00:40:57.052274  1132  1172 E ANRManager:   117% 19252/com.letv.android.letvlive: 80% user + 36% kernel / faults: 684 minor
10-14 00:40:57.052274  1132  1172 E ANRManager:   110% 11620/mediaserver: 64% user + 45% kernel / faults: 23 minor
10-14 00:40:57.052274  1132  1172 E ANRManager:   41% 378/logd: 19% user + 21% kernel / faults: 17 minor
10-14 00:40:57.052274  1132  1172 E ANRManager:   22% 573/mobile_log_d: 17% user + 5.3% kernel / faults: 1123 minor
10-14 00:40:57.052274  1132  1172 E ANRManager:   18% 19286/com.letv.android.letvlive:cde: 11% user + 6.9% kernel / faults: 6029 minor
10-14 00:40:57.052274  1132  1172 E ANRManager:   18% 422/adbd: 2.1% user + 15% kernel / faults: 1722 minor
10-14 00:40:57.052274  1132  1172 E ANRManager:   17% 18392/logcat: 7.4% user + 10% kernel
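The CPU section of an ANR report like this one can be ranked mechanically. This is a rough sketch only: parse_anr_cpu is a hypothetical helper, the patterns are written against the line layout shown above (other Android versions or vendors may print it differently), and the 100% cut-off for a "hot" process is an arbitrary choice for illustration.

```python
import re

# Load line and per-process CPU lines copied from the ANR report above.
ANR_REPORT = """\
10-14 00:40:57.052274  1132  1172 E ANRManager: Load: 37.88 / 25.54 / 20.22
10-14 00:40:57.052274  1132  1172 E ANRManager: CPU usage from 17448ms to 0ms ago:
10-14 00:40:57.052274  1132  1172 E ANRManager:   117% 19252/com.letv.android.letvlive: 80% user + 36% kernel / faults: 684 minor
10-14 00:40:57.052274  1132  1172 E ANRManager:   110% 11620/mediaserver: 64% user + 45% kernel / faults: 23 minor
10-14 00:40:57.052274  1132  1172 E ANRManager:   41% 378/logd: 19% user + 21% kernel / faults: 17 minor
10-14 00:40:57.052274  1132  1172 E ANRManager:   22% 573/mobile_log_d: 17% user + 5.3% kernel / faults: 1123 minor
10-14 00:40:57.052274  1132  1172 E ANRManager:   18% 19286/com.letv.android.letvlive:cde: 11% user + 6.9% kernel / faults: 6029 minor
10-14 00:40:57.052274  1132  1172 E ANRManager:   18% 422/adbd: 2.1% user + 15% kernel / faults: 1722 minor
10-14 00:40:57.052274  1132  1172 E ANRManager:   17% 18392/logcat: 7.4% user + 10% kernel
"""

LOAD_RE = re.compile(r"Load: ([\d.]+) / ([\d.]+) / ([\d.]+)")
PROC_RE = re.compile(
    r"(?P<total>[\d.]+)% (?P<pid>\d+)/(?P<name>\S+): "
    r"(?P<user>[\d.]+)% user \+ (?P<kernel>[\d.]+)% kernel"
)


def parse_anr_cpu(text):
    """Return (load averages, processes sorted by total CPU, highest first)."""
    load, procs = None, []
    for line in text.splitlines():
        m = LOAD_RE.search(line)
        if m:
            load = tuple(float(x) for x in m.groups())
            continue
        m = PROC_RE.search(line)
        if m:
            procs.append({
                "pid": int(m["pid"]),
                "name": m["name"],
                "total": float(m["total"]),
                "user": float(m["user"]),
                "kernel": float(m["kernel"]),
            })
    procs.sort(key=lambda p: p["total"], reverse=True)
    return load, procs


load, procs = parse_anr_cpu(ANR_REPORT)
hot = [p for p in procs if p["total"] >= 100]  # arbitrary threshold for this sketch
print("load averages (1/5/15 min):", load)
for p in hot:
    print(f'{p["total"]:.0f}% {p["pid"]}/{p["name"]}')
```

Running the same script over several ANR reports from one device makes it easy to see whether the same processes keep showing up at the top.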

The log above shows two processes with high CPU usage (com.letv.android.letvlive at 117% and mediaserver at 110%), and the system had been under heavy CPU load for a long time (Load: 37.88 / 25.54 / 20.22); the 1-minute load average before the ANR reached 37. From this we can guess, with high probability, that this ANR happened because CPU usage was too high and other tasks could not be scheduled in time. Is that right? Or was it caused by memory pressure, as other colleagues think? Next, we look at the kernel log around the same time, search for the keyword "lowmemorykiller", and get the following:

<6>[302600.931727]  (4)[10628:Cam@AuxSensorCo]lowmemorykiller: Killing 'android.browser' (28649), adj 18, score_adj 1000,

<6>[302600.931727]    to free 72464kB on behalf of 'Cam@AuxSensorCo' (10628) because

<6>[302600.931727]    cache 1000628kB is below limit 322560kB for oom_score_adj 0

<6>[302600.931727]    Free memory is 235708kB above reserved

<6>[303901.663086]  (6)[16560:Cam@AuxSensorCo]lowmemorykiller: Killing 'roid.emojistore' (15854), adj 18, score_adj 1000,

<6>[303901.663086]    to free 75636kB on behalf of 'Cam@AuxSensorCo' (16560) because

<6>[303901.663086]    cache 1292884kB is below limit 322560kB for oom_score_adj 0

<6>[303901.663086]    Free memory is 285336kB above reserved

<6>[302623.705248]  (2)[10970:Cam@AuxSensorCo]lowmemorykiller: Killing 'ews:pushservice' (6186), adj 13, score_adj 764,

<6>[302623.705248]    to free 62140kB on behalf of 'Cam@AuxSensorCo' (10970) because

<6>[302623.705248]    cache 992668kB is below limit 322560kB for oom_score_adj 0

<6>[302623.705248]    Free memory is 81320kB above reserved

cache: the kernel-side file cache. To speed up I/O, the kernel selectively keeps some file contents cached in memory.

limit: the minimum threshold for the file cache, here 322560kB. When both free memory and the file cache fall below this threshold, LMK starts looking for low-priority processes to kill.

score_adj: the process priority as seen by the kernel, converted from the adj value set by the upper layer (adj --> score_adj). A score_adj of 1000 means that the process being killed has a very low priority.

Free memory: the physical memory that is currently free.

[302623.705248]: the kernel timestamp, in seconds since boot.

From the log above, the system's available memory (Free + Cache) generally stays at about 1 GB, which is healthy. The intervals between LMK scans and kills are also fairly long, so they do not add much to the system load.
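The numbers behind this conclusion can be pulled out of the kernel log with a short script. Treat it as a sketch under assumptions: parse_lmk is a hypothetical helper, it assumes that the four lines of one kill record share the same timestamp (as they do above), and it assumes that the second bracketed value in the "Android time" line of the ANR report (304754.500) is the kernel uptime in seconds.

```python
import re
from collections import defaultdict

# The three lowmemorykiller records from the kernel log above.
KERNEL_LOG = """\
<6>[302600.931727]  (4)[10628:Cam@AuxSensorCo]lowmemorykiller: Killing 'android.browser' (28649), adj 18, score_adj 1000,
<6>[302600.931727]    to free 72464kB on behalf of 'Cam@AuxSensorCo' (10628) because
<6>[302600.931727]    cache 1000628kB is below limit 322560kB for oom_score_adj 0
<6>[302600.931727]    Free memory is 235708kB above reserved
<6>[303901.663086]  (6)[16560:Cam@AuxSensorCo]lowmemorykiller: Killing 'roid.emojistore' (15854), adj 18, score_adj 1000,
<6>[303901.663086]    to free 75636kB on behalf of 'Cam@AuxSensorCo' (16560) because
<6>[303901.663086]    cache 1292884kB is below limit 322560kB for oom_score_adj 0
<6>[303901.663086]    Free memory is 285336kB above reserved
<6>[302623.705248]  (2)[10970:Cam@AuxSensorCo]lowmemorykiller: Killing 'ews:pushservice' (6186), adj 13, score_adj 764,
<6>[302623.705248]    to free 62140kB on behalf of 'Cam@AuxSensorCo' (10970) because
<6>[302623.705248]    cache 992668kB is below limit 322560kB for oom_score_adj 0
<6>[302623.705248]    Free memory is 81320kB above reserved
"""

LINE_RE = re.compile(r"^<\d+>\[\s*(?P<ts>\d+\.\d+)\]\s*(?P<body>.*)$")
RECORD_RE = re.compile(
    r"Killing '(?P<victim>[^']+)' \((?P<pid>\d+)\), adj (?P<adj>\d+), score_adj (?P<score_adj>\d+),"
    r"\s+to free (?P<freed_kb>\d+)kB .*?"
    r"cache (?P<cache_kb>\d+)kB is below limit (?P<limit_kb>\d+)kB .*?"
    r"Free memory is (?P<free_kb>-?\d+)kB"
)


def parse_lmk(text):
    """Group lines by timestamp, then extract one dict per kill record."""
    grouped = defaultdict(list)  # timestamp -> message fragments, in order
    for raw in text.splitlines():
        m = LINE_RE.match(raw.strip())
        if m:
            grouped[m["ts"]].append(m["body"])
    kills = []
    for ts, parts in grouped.items():
        m = RECORD_RE.search(" ".join(parts))
        if m:
            rec = {k: (v if k == "victim" else int(v)) for k, v in m.groupdict().items()}
            rec["ts"] = float(ts)
            kills.append(rec)
    return sorted(kills, key=lambda r: r["ts"])


kills = parse_lmk(KERNEL_LOG)
available_kb = [k["free_kb"] + k["cache_kb"] for k in kills]  # Free + Cache
gaps = [b["ts"] - a["ts"] for a, b in zip(kills, kills[1:])]  # seconds between kills

ANR_UPTIME = 304754.500  # assumed kernel uptime of the ANR, from the "Android time" line
seconds_before_anr = ANR_UPTIME - kills[-1]["ts"]

for k, avail in zip(kills, available_kb):
    print(f'[{k["ts"]:.0f}] killed {k["victim"]} (score_adj {k["score_adj"]}), available about {avail // 1024} MB')
print("gaps between kills (s):", [round(g) for g in gaps])
print("last kill before ANR (s):", round(seconds_before_anr))
```

If the uptime assumption holds, the last kill in this excerpt happened more than ten minutes before the ANR, which fits the view that memory pressure was not the trigger.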

With that, memory is basically ruled out as the cause of this ANR. Next, go back to the main log and look at the output of the high-CPU processes for further clues. Filtering on the PID of mediaserver (11620) shows that this process printed hundreds of thousands of log lines over a long period. This is a promising lead: such frequent output, much of it identical, means the process is running a large number of loops, which is a common cause of high CPU usage.

10-14 00:40:46.035707 11620 19687 D MtkOmxVdecEx: [0xe1eb7800] RemoveInputBuf frm=0xe1eb8d70, omx=0xa3b9dfe0, i=5
10-14 00:40:46.035791 11620 19687 D MtkOmxVdecEx: [0xe1eb7800] FB in (0xA3B9DFE0)
10-14 00:40:46.036599 11620 11620 D MtkOmxMVAMgr: [0xb3cca9f0] [ION][FreeBuffer] entry=0xa3bcf3c0, va=0xd30d7000, pa=0x47600000,size=0x180000, srcFd=0xFFFFFFFF, fd=0xFFFFFFFF, bufHdr=0xA3B9CAE0
10-14 00:40:46.037036 11620 11620 D MtkOmxVdecEx: [0xe1eb7800] RemoveInputBuf frm=0xe1eb8d28, omx=0xa3b9cae0, i=4
10-14 00:40:46.037125 11620 11620 D MtkOmxVdecEx: [0xe1eb7800] FB in (0xA3B9CAE0)
10-14 00:40:46.037907 11620 11655 D MtkOmxMVAMgr: [0xb3cca9f0] [ION][FreeBuffer] entry=0xabbfc4e0, va=0xd3557000, pa=0x47000000,size=0x180000, srcFd=0xFFFFFFFF, fd=0xFFFFFFFF, bufHdr=0xA3B9C0C0
10-14 00:40:46.038281 11620 11655 D MtkOmxVdecEx: [0xe1eb7800] RemoveInputBuf frm=0xe1eb8ce0, omx=0xa3b9c0c0, i=3
10-14 00:40:46.038364 11620 11655 D MtkOmxVdecEx: [0xe1eb7800] FB in (0xA3B9C0C0)
10-14 00:40:46.039097 11620 11657 D MtkOmxMVAMgr: [0xb3cca9f0] [ION][FreeBuffer] entry=0xa3bcf240, va=0xd3f80000, pa=0x46c00000,size=0x180000, srcFd=0xFFFFFFFF, fd=0xFFFFFFFF, bufHdr=0xA3B9C120
10-14 00:40:46.039734 11620 11657 D MtkOmxVdecEx: [0xe1eb7800] RemoveInputBuf frm=0xe1eb8c98, omx=0xa3b9c120, i=2
10-14 00:40:46.039829 11620 11657 D MtkOmxVdecEx: [0xe1eb7800] FB in (0xA3B9C120)
10-14 00:40:46.041510 11620 11653 D MtkOmxMVAMgr: [0xb3cca9f0] [ION][FreeBuffer] entry=0xa3bcf6f0, va=0xdb528000, pa=0x46600000,size=0x180000, srcFd=0xFFFFFFFF, fd=0xFFFFFFFF, bufHdr=0xA3B9DF20
10-14 00:40:46.041966 11620 11653 D MtkOmxVdecEx: [0xe1eb7800] RemoveInputBuf frm=0xe1eb8c50, omx=0xa3b9df20, i=1
10-14 00:40:46.042057 11620 11653 D MtkOmxVdecEx: [0xe1eb7800] FB in (0xA3B9DF20)
10-14 00:40:46.043345 11620 11654 D MtkOmxMVAMgr: [0xb3cca9f0] [ION][FreeBuffer] entry=0xa3bcf120, va=0xdb828000, pa=0x43200000,size=0x180000, srcFd=0xFFFFFFFF, fd=0xFFFFFFFF, bufHdr=0xABBC4420
10-14 00:40:46.043756 11620 11654 D MtkOmxVdecEx: [0xe1eb7800] RemoveInputBuf frm=0xe1eb8c08, omx=0xabbc4420, i=0
10-14 00:40:46.043841 11620 11654 D MtkOmxVdecEx: [0xe1eb7800] FB in (0xABBC4420)
10-14 00:40:46.044026 11620 11654 D MtkOmxVdecEx: [0xe1eb7800] MtkOmxVdec::FreeBuffer all input buffers have been freed!!! signal mInPortFreeDoneSem(1)
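Instead of highlighting a PID by hand, the volume and repetitiveness of the output can be measured. The sketch below counts lines per (PID, tag) and groups messages into templates by masking hex values and numbers. The function name profile_log and the "#" masking rule are my own choices for illustration, and the sample is just the first five lines of the excerpt above.

```python
import re
from collections import Counter

LOGCAT_RE = re.compile(
    r"^\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+\s+(?P<pid>\d+)\s+(?P<tid>\d+)\s+"
    r"(?P<level>[VDIWEF])\s+(?P<tag>[^:]+?)\s*:\s(?P<msg>.*)$"
)
# Hex values and plain numbers are masked so that repeated messages collapse together.
MASK_RE = re.compile(r"0x[0-9a-fA-F]+|\d+")


def profile_log(lines):
    """Return (lines per (pid, tag), lines per masked message template)."""
    per_tag, per_template = Counter(), Counter()
    for line in lines:
        m = LOGCAT_RE.match(line)
        if not m:
            continue
        per_tag[(int(m["pid"]), m["tag"])] += 1
        per_template[MASK_RE.sub("#", m["msg"])] += 1
    return per_tag, per_template


# First five lines of the mediaserver output shown above.
sample = [
    "10-14 00:40:46.035707 11620 19687 D MtkOmxVdecEx: [0xe1eb7800] RemoveInputBuf frm=0xe1eb8d70, omx=0xa3b9dfe0, i=5",
    "10-14 00:40:46.035791 11620 19687 D MtkOmxVdecEx: [0xe1eb7800] FB in (0xA3B9DFE0)",
    "10-14 00:40:46.036599 11620 11620 D MtkOmxMVAMgr: [0xb3cca9f0] [ION][FreeBuffer] entry=0xa3bcf3c0, va=0xd30d7000, pa=0x47600000,size=0x180000, srcFd=0xFFFFFFFF, fd=0xFFFFFFFF, bufHdr=0xA3B9CAE0",
    "10-14 00:40:46.037036 11620 11620 D MtkOmxVdecEx: [0xe1eb7800] RemoveInputBuf frm=0xe1eb8d28, omx=0xa3b9cae0, i=4",
    "10-14 00:40:46.037125 11620 11620 D MtkOmxVdecEx: [0xe1eb7800] FB in (0xA3B9CAE0)",
]

per_tag, per_template = profile_log(sample)
for (pid, tag), n in per_tag.most_common():
    print(f"pid={pid} tag={tag}: {n} lines")
for template, n in per_template.most_common(3):
    print(f"{n}x {template}")
```

On a full log, a handful of templates accounting for most of a process's output is the pattern described above: the same code path running over and over.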

This gives us a further conclusion: the application's ANR is mainly caused by the high CPU usage of the two processes above, which keeps other tasks from being scheduled in time. Why those processes use so much CPU needs further analysis by the owners of the relevant modules, based on the logs.

The six ANR examples above show that, besides the usual case of a Receiver doing time-consuming work, other factors can cause such problems: low overall memory leading to swapping (kswapd), excessive CPU usage delaying scheduling, Binder resources being exhausted so that communication cannot complete in time, and so on. The clues for these problems are fairly obscure, and you need to collect several kinds of logs and compare them repeatedly. Fortunately, the system prints key logs when such problems occur, so you can search by keyword, analyze from multiple angles, and compare the results; most of the time this leads to a valid conclusion and an optimization or fix. If the logs really are insufficient, the only option is to ask the test team to reproduce the problem and provide more useful logs. It also helps to know more about the underlying system, such as LMK, process tuning, and the Binder communication mechanism, so that you have more to draw on when analyzing and solving such problems.
————————————————
Copyright statement: This article is an original article of CSDN blogger "Monday Morning", following the CC 4.0 BY-SA copyright agreement, please attach the original source link and this statement.
Original link: https://blog.csdn.net/qzh123456/article/details/78737791


Origin blog.csdn.net/BersonKing/article/details/129529487