Android binder full -- getContentProviderImpl

1.Analysis

A watchdog reboot occurred because system_server ran out of binder threads: every binder thread was waiting for a provider to be published.

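When Object.wait() is called, the thread releases the monitor lock and sleeps until another thread calls notify() or notifyAll() on the same object. wait(), notify() and notifyAll() must all be called while holding that object's lock; otherwise an IllegalMonitorStateException is thrown. See the notes on Java synchronization for details.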
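The wait/notify rules above can be illustrated with a small stand-alone sketch. This is plain Java, not AMS code; the class names WaitNotifyDemo and Slot are made up for the example, and Slot only stands in for the role ContentProviderRecord plays in the trace below.

```java
public class WaitNotifyDemo {
    /** Returns true if calling wait() without holding the monitor throws. */
    static boolean waitWithoutLockThrows() {
        Object lock = new Object();
        try {
            lock.wait(10); // not inside synchronized (lock): illegal
            return false;
        } catch (IllegalMonitorStateException e) {
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Hypothetical holder that one thread publishes into and others wait on. */
    static final class Slot {
        private Object value; // guarded by this object's monitor

        synchronized void publish(Object v) {
            value = v;
            notifyAll(); // wake every thread blocked in await()
        }

        synchronized Object await() throws InterruptedException {
            while (value == null) {
                wait(); // releases the monitor while sleeping, reacquires on wakeup
            }
            return value;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("wait without lock throws: " + waitWithoutLockThrows());
        Slot slot = new Slot();
        Thread publisher = new Thread(() -> slot.publish("provider"));
        publisher.start();
        System.out.println("got: " + slot.await());
        publisher.join();
    }
}
```

Note that await() has no timeout: if publish() is never called, the waiting thread sleeps forever, which is the situation the binder threads in this report ended up in.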

//swt blocked thread (binder full)
at android.os.Binder.blockUntilThreadAvailable(Native method)


//system_server thread Object.wait
"Binder:971_1D" prio=5 tid=122 Waiting
  | group="main" sCount=1 dsCount=0 flags=1 obj=0x12c56460 self=0x7c823fe600
  | sysTid=9487 nice=0 cgrp=default sched=1073741824/0 handle=0x7c83c054f0
  | state=S schedstat=( 138183285126 25399460384 143755 ) utm=9529 stm=4288 core=3 HZ=100
  | stack=0x7c83b0b000-0x7c83b0d000 stackSize=1005KB
  | held mutexes=
  at java.lang.Object.wait(Native method)
  - waiting on <0x031914fd> (a com.android.server.am.ContentProviderRecord)
  at com.android.server.am.ActivityManagerService.getContentProviderImpl(ActivityManagerService.java:12103)
  - locked <0x031914fd> (a com.android.server.am.ContentProviderRecord)

The provider being waited on is com.android.providers.media/.MediaProvider, and the process requesting it is pid=13461.

//from caller=android.app.ApplicationThreadProxy@6a6d5a4 (pid=13461, userId=0)  com.android.providers.media/.MediaProvider
01-01 19:19:24.091 834 1582 D ActivityManager: getContentProviderImpl: from caller=android.app.ApplicationThreadProxy@af91f0d (pid=13461, userId=0) to get content provider media cpr=ContentProviderRecord{10948d12 u0 com.android.providers.media/.MediaProvider}
01-01 19:19:24.092 834 1580 D ActivityManager: getContentProviderImpl: from caller=android.app.ApplicationThreadProxy@24965ec2 (pid=13461, userId=0) to get content provider media cpr=ContentProviderRecord{10948d12 u0 com.android.providers.media/.MediaProvider}
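When the log contains many such lines, counting them per caller pid and provider makes the offending client obvious. The following is a minimal sketch that assumes the log lines have exactly the shape shown above; the class name ProviderWaitTally and the regular expression are illustrative and not part of any Android tool.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ProviderWaitTally {
    // Assumed line shape: "... getContentProviderImpl: from caller=X (pid=N, userId=U)
    // ... cpr=ContentProviderRecord{hash uN component}"
    private static final Pattern LINE = Pattern.compile(
            "getContentProviderImpl: from caller=\\S+ \\(pid=(\\d+), userId=\\d+\\)"
            + ".*?cpr=ContentProviderRecord\\{\\S+ \\S+ ([^}]+)\\}");

    /** Counts matching log lines, keyed by "pid=<caller pid> -> <provider component>". */
    static Map<String, Integer> tally(List<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            Matcher m = LINE.matcher(line);
            if (m.find()) {
                counts.merge("pid=" + m.group(1) + " -> " + m.group(2), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "01-01 19:19:24.091 834 1582 D ActivityManager: getContentProviderImpl: from caller=android.app.ApplicationThreadProxy@af91f0d (pid=13461, userId=0) to get content provider media cpr=ContentProviderRecord{10948d12 u0 com.android.providers.media/.MediaProvider}",
                "01-01 19:19:24.092 834 1580 D ActivityManager: getContentProviderImpl: from caller=android.app.ApplicationThreadProxy@24965ec2 (pid=13461, userId=0) to get content provider media cpr=ContentProviderRecord{10948d12 u0 com.android.providers.media/.MediaProvider}");
        tally(lines).forEach((key, count) -> System.out.println(count + "  " + key));
    }
}
```

A pid that dominates the tally is the process to look up in the ps output.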

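You can also search the binderinfo dump to find out which process is making binder calls into system_server, or search for ContentProviderRecord to see whether it gives any clues.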
from 13461:xxxxx to 834:xxx

From the logs above, the caller is com.google.android.apps.photos:
u0_a105 13461 316 1031048 44860 2 20 0 0 0 fg ffffffff f7658938 S 32 com.google.android.apps.photos

2.Fix

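This problem can be tackled from two sides. On the app side, speed up provider startup and never let multiple threads of the same app fetch a content provider concurrently. On the AMS side, introduce a timeout when getting a provider, so that an unbounded wait can no longer hang the device.
Please modify the AMS code to add the timeout mechanism: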

  private final ContentProviderHolder getContentProviderImpl(IApplicationThread caller, ......
        // Wait for the provider to be published
        synchronized (cpr) {
            //yulong modify for binder death by [email protected] 20150608
            //mtk71029 add for resolve dead binder death can not notify AMS issue.
+           int wait_count = 0;
            //mtk71029 add end.

            while (cpr.provider == null) {
                if (cpr.launchingApp == null) {
                    Slog.w(TAG, "Unable to launch app "
                            + cpi.applicationInfo.packageName + "/"
                            + cpi.applicationInfo.uid + " for provider "
                            + name + ": launching app became null");
                    EventLog.writeEvent(EventLogTags.AM_PROVIDER_LOST_PROCESS,
                            UserHandle.getUserId(cpi.applicationInfo.uid),
                            cpi.applicationInfo.packageName,
                            cpi.applicationInfo.uid, name);
                    return null;
                }

+                //mtk71029 add for resolve binder death can not notify AMS issue.
+                //if we check the process doesn't exist, return and release binder thread.
+                //then the binder death will come, and AMS clear the app state.
+                if(!mANRManager.isJavaProcess(cpr.launchingApp.pid)
+                          || Process.getUidForPid(cpr.launchingApp.pid) != cpr.launchingApp.uid){
+                          //TODO maybe more action to clean content provider state
+                          return null;
+                }
+               //if the app wait the provider more than 4*5000 = 20s, then return null and release the binder.
+                if (wait_count >= 4) {
+                        return null;
+                }
+                //mtk71029 add end
                try {
                    if (DEBUG_MU) {
                        Slog.v(TAG_MU, "Waiting to start provider " + cpr + " launchingApp="
                                + cpr.launchingApp);
                    }
                    if (conn != null) {
                        conn.waiting = true;
                    }
                    //  cpr.wait();
                    //mtk71029 update for resolve binder death can not notify AMS issue.
                    //wait 5s, then check state.
                    cpr.wait(5*1000); //yulong.zhangjian modified  MTK patch
                    wait_count ++;
                    //mtk71029 update end.
                    //cpr.wait();
                } catch (InterruptedException ex) {
                } finally {
                    if (conn != null) {
                        conn.waiting = false;
                    }
                }
            }
        }
        return cpr != null ? cpr.newHolder(conn) : null;
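In the patch above, wait_count is incremented every time cpr.wait(5*1000) returns, including early wakeups, so 4*5000 = 20s is an upper bound on the wait rather than an exact duration. As a variation (not what the MTK patch does), the same idea can be expressed with an explicit deadline. The sketch below is plain Java outside AMS; BoundedProviderWait, Record, awaitProvider and publish are hypothetical names used only for illustration.

```java
public class BoundedProviderWait {
    /** Hypothetical stand-in for a record whose provider is published later. */
    static final class Record {
        Object provider; // guarded by the Record's monitor; null until published
    }

    /** Returns the provider, or null if it was not published within timeoutMs. */
    static Object awaitProvider(Record cpr, long timeoutMs) throws InterruptedException {
        final long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        synchronized (cpr) {
            while (cpr.provider == null) {
                long remainingMs = (deadline - System.nanoTime()) / 1_000_000L;
                if (remainingMs <= 0) {
                    return null; // give up so the calling thread is released
                }
                // Never pass 0 here: wait(0) means "wait forever".
                cpr.wait(remainingMs);
            }
            return cpr.provider;
        }
    }

    /** Publishes the provider and wakes all waiters. */
    static void publish(Record cpr, Object provider) {
        synchronized (cpr) {
            cpr.provider = provider;
            cpr.notifyAll();
        }
    }

    public static void main(String[] args) throws Exception {
        Record cpr = new Record();
        System.out.println("before publish: " + awaitProvider(cpr, 100)); // times out
        publish(cpr, "MediaProvider");
        System.out.println("after publish: " + awaitProvider(cpr, 100));
    }
}
```

With a deadline, an early or spurious wakeup only shortens the next wait instead of consuming one of a fixed number of attempts.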

3.Solution

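To summarize, we have seen quite a few cases where getContentProviderImpl exhausts the binder threads. The causes are:
1.The provider host process fails to start.
2.The provider host process takes a very long time to publish the provider, while the client happens to be accessing the database frequently.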
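For the second cause, the app-side advice is to avoid having many threads of one process request the same provider at the same time. One possible shape for that, sketched here in plain Java, is to route every acquisition through a single synchronized cache so that at most one request per process is outstanding. ProviderHandleCache is a hypothetical class; on Android the acquirer function would presumably wrap something like ContentResolver.acquireContentProviderClient(authority), which is an assumption and is not shown.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

public class ProviderHandleCache<H> {
    private final Map<String, H> handles = new HashMap<>(); // guarded by this
    private final Function<String, H> acquirer;

    public ProviderHandleCache(Function<String, H> acquirer) {
        this.acquirer = acquirer;
    }

    /**
     * Returns the cached handle for the authority, acquiring it at most once.
     * Concurrent callers block here, inside the app, instead of each issuing
     * its own request to the system.
     */
    public synchronized H get(String authority) {
        H handle = handles.get(authority);
        if (handle == null) {
            handle = acquirer.apply(authority);
            if (handle != null) {
                handles.put(authority, handle); // failed acquisitions are retried later
            }
        }
        return handle;
    }

    public static void main(String[] args) throws Exception {
        AtomicInteger calls = new AtomicInteger();
        ProviderHandleCache<String> cache = new ProviderHandleCache<>(authority -> {
            calls.incrementAndGet(); // counts how often the slow path runs
            return "handle:" + authority;
        });
        Thread[] workers = new Thread[8];
        for (int i = 0; i < workers.length; i++) {
            workers[i] = new Thread(() -> cache.get("media"));
            workers[i].start();
        }
        for (Thread t : workers) {
            t.join();
        }
        System.out.println("acquisitions: " + calls.get());
    }
}
```

The single lock serializes all authorities, which keeps the sketch short; a per-authority lock would allow different providers to be acquired in parallel.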

In Android P, Google has already added a timeout patch for this.