Android O RescueParty: An Introduction

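I. Overview
An Android device can end up in a state it cannot recover from on its own: for example, it fails to boot, or a persistent system process crashes in an endless loop. At that point the phone is effectively unusable, and a non-technical user usually has no idea how to repair it and can only send it in for service. Android O adds a rescue mechanism to address exactly these situations, called RescueParty.
Roughly, RescueParty works like this: when apps under the same UID crash repeatedly, RescueParty counts the crashes per UID, and each time the count reaches the threshold it escalates the rescue level. The rescue levels are:
1. NONE
2. RESET_SETTINGS_UNTRUSTED_DEFAULTS
3. RESET_SETTINGS_UNTRUSTED_CHANGES
4. RESET_SETTINGS_TRUSTED_DEFAULTS
5. FACTORY_RESET
The final rescue level reboots the device into recovery mode.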

Which scenarios trigger this mechanism?
1. a persistent app is stuck in a crash loop
2. we're stuck in a runtime restart loop
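II. How RescueParty works
We start with the first scenario, "a persistent app is stuck in a crash loop". The app crash flow itself is not covered in detail here; a sequence diagram summarizes it:
[Figure: sequence diagram of the app crash flow]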

In Android O, the crashApplicationInner method in AppErrors.java gained a RescueParty hook:
void crashApplicationInner(ProcessRecord r, ApplicationErrorReport.CrashInfo crashInfo,
	int callingPid, int callingUid) {
	// ...
	// If a persistent app is stuck in a crash loop, the device isn't very
	// usable, so we want to consider sending out a rescue party.
	if (r != null && r.persistent) {
		RescueParty.notePersistentAppCrash(mContext, r.uid);
	}
	
	AppErrorResult result = new AppErrorResult();
	TaskRecord task;
	// ...
}


This calls RescueParty.notePersistentAppCrash, passing in the Context and the UID of the crashed process. Here is the method:
/**
* Take note of a persistent app crash. If we notice too many of these
* events happening in rapid succession, we'll send out a rescue party.
*/
public static void notePersistentAppCrash(Context context, int uid) {
	if (isDisabled()) return;
	Threshold t = sApps.get(uid);
	if (t == null) {
		t = new AppThreshold(uid);
		sApps.put(uid, t);
	}
	if (t.incrementAndTest()) {
		t.reset();
		incrementRescueLevel(t.uid);
		executeRescueLevel(context);
	}
}
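The method first checks whether RescueParty is disabled. It is disabled in the following cases:
1. The build is an eng build.
2. The build is a userdebug build and USB is currently connected.
3. The system property persist.sys.disable_rescue is true.
In all other cases it is enabled.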
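The three conditions above can be sketched as a standalone predicate. This is a minimal sketch, not the AOSP implementation: the BuildInfo class and its fields are hypothetical stand-ins for the build type, the USB state, and the persist.sys.disable_rescue property, and the order of the checks is an assumption.

```java
// Sketch of the three disable checks described above. BuildInfo and its
// fields are hypothetical so that the logic can run outside of Android.
final class RescueDisableCheck {
    /** Hypothetical snapshot of the device state the checks depend on. */
    static final class BuildInfo {
        final boolean isEng;        // eng build
        final boolean isUserdebug;  // userdebug build
        final boolean usbActive;    // USB currently connected
        final boolean disableProp;  // value of persist.sys.disable_rescue

        BuildInfo(boolean isEng, boolean isUserdebug, boolean usbActive, boolean disableProp) {
            this.isEng = isEng;
            this.isUserdebug = isUserdebug;
            this.usbActive = usbActive;
            this.disableProp = disableProp;
        }
    }

    /** Returns true if any of the three disable conditions holds. */
    static boolean isDisabled(BuildInfo b) {
        if (b.isEng) return true;                       // case 1
        if (b.isUserdebug && b.usbActive) return true;  // case 2
        if (b.disableProp) return true;                 // case 3
        return false;
    }

    public static void main(String[] args) {
        // userdebug build with USB attached
        System.out.println(isDisabled(new BuildInfo(false, true, true, false)));
        // user build with USB attached and the property unset
        System.out.println(isDisabled(new BuildInfo(false, false, true, false)));
    }
}
```

Note that in this sketch a USB connection alone does not disable the mechanism; it only matters in combination with a userdebug build.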
Back in notePersistentAppCrash: if RescueParty is not disabled, execution continues with:
Threshold t = sApps.get(uid);
if (t == null) {
	t = new AppThreshold(uid);
	sApps.put(uid, t);
}
if (t.incrementAndTest()) {
	t.reset();
	incrementRescueLevel(t.uid);
	executeRescueLevel(context);
}
First, the definition of sApps:
/** Threshold for app crash loops */
private static SparseArray<Threshold> sApps = new SparseArray<>();
Each UID maps to one Threshold object. The code looks up the Threshold for the given UID; if none exists yet, it creates one and stores it in sApps. It then calls incrementAndTest:
/**
* @return if this threshold has been triggered
*/
public boolean incrementAndTest() {
	final long now = SystemClock.elapsedRealtime();
	final long window = now - getStart();
	if (window > triggerWindow) {
		setCount(1);
		setStart(now);
		return false;
	} else {
		int count = getCount() + 1;
		setCount(count);
		EventLogTags.writeRescueNote(uid, count, window);
		Slog.w(TAG, "Noticed " + count + " events for UID " + uid + " in last "
			+ (window / 1000) + " sec");
		return (count >= triggerCount);
	}
}
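incrementAndTest relies on getStart / setStart / setCount / getCount. Here is how BootThreshold, one of the Threshold subclasses, implements them (AppThreshold, used on the per-UID path above, is another subclass):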
private static class BootThreshold extends Threshold {
	public BootThreshold() {
		// We're interested in 5 events in any 300 second period; this
		// window is super relaxed because booting can take a long time if
		// forced to dexopt things.
		super(android.os.Process.ROOT_UID, 5, 300 * DateUtils.SECOND_IN_MILLIS);
	}

	@Override
	public int getCount() {
		return SystemProperties.getInt(PROP_RESCUE_BOOT_COUNT, 0);
	}

	@Override
	public void setCount(int count) {
		SystemProperties.set(PROP_RESCUE_BOOT_COUNT, Integer.toString(count));
	}

	@Override
	public long getStart() {
		return SystemProperties.getLong(PROP_RESCUE_BOOT_START, 0);
	}

	@Override
	public void setStart(long start) {
		SystemProperties.set(PROP_RESCUE_BOOT_START, Long.toString(start));
	}
}
BootThreshold simply stores the start time and the count in system properties.
As the code above shows, BootThreshold extends Threshold and calls its constructor:
super(android.os.Process.ROOT_UID, 5, 300 * DateUtils.SECOND_IN_MILLIS);

private abstract static class Threshold {
	// ...
	public Threshold(int uid, int triggerCount, long triggerWindow) {
		this.uid = uid;
		this.triggerCount = triggerCount;
		this.triggerWindow = triggerWindow;
	}
	// ...
}
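So for BootThreshold, triggerWindow is 300000 ms (300 seconds) and triggerCount is 5; AppThreshold passes its own count and window to the same constructor.
That makes the meaning of incrementAndTest clear:
If more than triggerWindow has elapsed since the start of the current window, the count is set to 1 and the window start is set to the current time (both are restarted). Otherwise the count is incremented and saved, and the method returns true once the count has reached triggerCount (count >= triggerCount). When it returns true, the caller runs: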
t.reset();
incrementRescueLevel(t.uid);
executeRescueLevel(context);
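To make the window logic concrete, here is a standalone sketch of the same counter. It is an illustration under assumptions, not the AOSP class: WindowedThreshold is a hypothetical name, the state lives in plain fields, and the clock is passed in as a parameter instead of being read from SystemClock.elapsedRealtime(), so that the behavior can be exercised deterministically.

```java
// Standalone sketch of the windowed counter described above.
// The current time is an explicit parameter (an assumption for testability).
final class WindowedThreshold {
    private final int triggerCount;
    private final long triggerWindowMs;
    private int count;
    private long startMs;

    WindowedThreshold(int triggerCount, long triggerWindowMs) {
        this.triggerCount = triggerCount;
        this.triggerWindowMs = triggerWindowMs;
    }

    /** Returns true once triggerCount events have landed inside one window. */
    boolean incrementAndTest(long nowMs) {
        final long window = nowMs - startMs;
        if (window > triggerWindowMs) {
            // The previous window expired: this event starts a new one.
            count = 1;
            startMs = nowMs;
            return false;
        }
        count++;
        return count >= triggerCount;
    }

    /** Clears both the count and the window start. */
    void reset() {
        count = 0;
        startMs = 0;
    }

    public static void main(String[] args) {
        WindowedThreshold t = new WindowedThreshold(5, 300_000L);
        long now = 1_000_000L; // arbitrary start time, more than one window after 0
        for (int i = 1; i <= 5; i++) {
            // Events arrive 10 seconds apart; the fifth one returns true.
            System.out.println("event " + i + " -> " + t.incrementAndTest(now));
            now += 10_000L;
        }
    }
}
```

Events that are spaced further apart than the window never accumulate, because each one restarts the count at 1.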
Let's look at reset, incrementRescueLevel, and executeRescueLevel in turn:
public void reset() {
	setCount(0);
	setStart(0);
}
This sets both the count and the start time to 0.
/**
* Escalate to the next rescue level. After incrementing the level you'll
* probably want to call {@link #executeRescueLevel(Context)}.
*/
private static void incrementRescueLevel(int triggerUid) {
	final int level = MathUtils.constrain(
		SystemProperties.getInt(PROP_RESCUE_LEVEL, LEVEL_NONE) + 1,
		LEVEL_NONE, LEVEL_FACTORY_RESET);
	SystemProperties.set(PROP_RESCUE_LEVEL, Integer.toString(level));

	EventLogTags.writeRescueLevel(level, triggerUid);
	PackageManagerService.logCriticalInfo(Log.WARN, "Incremented rescue level to "
		+ levelToString(level) + " triggered by UID " + triggerUid);
}
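This code reads the current level, adds 1 (clamped to the range LEVEL_NONE to LEVEL_FACTORY_RESET), and writes it back to the system property.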
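The clamping step means the level can never move past FACTORY_RESET, however many times the threshold fires. A minimal sketch of that behavior follows; the class name, the local constrain helper, and the numeric values assigned to the levels are assumptions for illustration and are not taken from the excerpt above.

```java
// Sketch of "increment, then clamp" escalation.
// The numeric level values below are assumed for illustration only.
final class RescueLevelSketch {
    static final int LEVEL_NONE = 0;
    static final int LEVEL_FACTORY_RESET = 4;

    /** Clamps value into [min, max], the same idea as MathUtils.constrain. */
    static int constrain(int value, int min, int max) {
        return Math.max(min, Math.min(max, value));
    }

    /** Returns the level that follows current, saturating at the top level. */
    static int nextLevel(int current) {
        return constrain(current + 1, LEVEL_NONE, LEVEL_FACTORY_RESET);
    }

    public static void main(String[] args) {
        int level = LEVEL_NONE;
        // Escalate more times than there are levels; the value stops rising
        // once it reaches LEVEL_FACTORY_RESET.
        for (int i = 0; i < 6; i++) {
            level = nextLevel(level);
            System.out.println("level = " + level);
        }
    }
}
```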
private static void executeRescueLevel(Context context) {
	final int level = SystemProperties.getInt(PROP_RESCUE_LEVEL, LEVEL_NONE);
	if (level == LEVEL_NONE) return;

	Slog.w(TAG, "Attempting rescue level " + levelToString(level));
	try {
		executeRescueLevelInternal(context, level);
		EventLogTags.writeRescueSuccess(level);
		PackageManagerService.logCriticalInfo(Log.DEBUG,
			"Finished rescue level " + levelToString(level));
	} catch (Throwable t) {
		final String msg = ExceptionUtils.getCompleteMessage(t);
		EventLogTags.writeRescueFailure(level, msg);
		PackageManagerService.logCriticalInfo(Log.ERROR,
			"Failed rescue level " + levelToString(level) + ": " + msg);
	}
}


This reads the current level and returns early if it is NONE; otherwise it calls executeRescueLevelInternal:
private static void executeRescueLevelInternal(Context context, int level) throws Exception {
	switch (level) {
		case LEVEL_RESET_SETTINGS_UNTRUSTED_DEFAULTS:
			resetAllSettings(context, Settings.RESET_MODE_UNTRUSTED_DEFAULTS);
			break;
		case LEVEL_RESET_SETTINGS_UNTRUSTED_CHANGES:
			resetAllSettings(context, Settings.RESET_MODE_UNTRUSTED_CHANGES);
			break;
		case LEVEL_RESET_SETTINGS_TRUSTED_DEFAULTS:
			resetAllSettings(context, Settings.RESET_MODE_TRUSTED_DEFAULTS);
			break;
		case LEVEL_FACTORY_RESET:
			RecoverySystem.rebootPromptAndWipeUserData(context, TAG);
			break;
	}
}
Each level applies a different rescue action. Four levels take action: the three RESET_SETTINGS levels and FACTORY_RESET.
The first three all call resetAllSettings, so let's look at that first:
private static void resetAllSettings(Context context, int mode) throws Exception {
	// Try our best to reset all settings possible, and once finished
	// rethrow any exception that we encountered
	Exception res = null;
	final ContentResolver resolver = context.getContentResolver();
	try {
		Settings.Global.resetToDefaultsAsUser(resolver, null, mode, UserHandle.USER_SYSTEM);
	} catch (Throwable t) {
		res = new RuntimeException("Failed to reset global settings", t);
	}
	for (int userId : getAllUserIds()) {
		try {
			Settings.Secure.resetToDefaultsAsUser(resolver, null, mode, userId);
		} catch (Throwable t) {
			res = new RuntimeException("Failed to reset secure settings for " + userId, t);
		}
	}
	if (res != null) {
		throw res;
	}
}
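The shape of resetAllSettings (attempt every reset, remember a failure, rethrow only at the end) can be isolated in a small standalone sketch. The names BestEffortReset and ResetStep are hypothetical; only the control flow mirrors the method above.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the "try everything, then rethrow" pattern.
final class BestEffortReset {
    /** Hypothetical stand-in for one reset operation that may fail. */
    interface ResetStep {
        void run() throws Exception;
    }

    /** Runs every step even if earlier ones fail, then rethrows the last failure. */
    static void runAll(List<ResetStep> steps) throws Exception {
        Exception res = null;
        for (ResetStep step : steps) {
            try {
                step.run();
            } catch (Throwable t) {
                // Remember the failure, but keep going with the remaining steps.
                res = new RuntimeException("Reset step failed", t);
            }
        }
        if (res != null) {
            throw res;
        }
    }

    public static void main(String[] args) {
        List<ResetStep> steps = new ArrayList<>();
        steps.add(() -> System.out.println("reset global settings"));
        steps.add(() -> { throw new IllegalStateException("secure settings unavailable"); });
        steps.add(() -> System.out.println("reset secure settings for the next user"));
        try {
            runAll(steps);
        } catch (Exception e) {
            System.out.println("rethrown after all steps ran: " + e.getMessage());
        }
    }
}
```

One consequence of this design is that only the most recent failure is reported when several steps fail; earlier ones are overwritten.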


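resetAllSettings makes a best effort to reset every setting it can, using the reset mode that matches the level; interested readers can dig into the details. The last level calls rebootPromptAndWipeUserData in the RecoverySystem class, which reboots the device into recovery mode. The detailed flow is skipped here; this is the call stack: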
"Binder:1313_18@9485" prio=5 tid=0xbe nid=NA waiting
java.lang.Thread.State: WAITING
blocks Binder:1313_18@9485
waiting for android.ui@9431 to release lock on <0x2562> (a com.android.server.power.PowerManagerService$4)
at java.lang.Object.wait(Object.java:-1)
at com.android.server.power.PowerManagerService.shutdownOrRebootInternal(PowerManagerService.java:2802)
locked <0x2562> (a com.android.server.power.PowerManagerService$4)
at com.android.server.power.PowerManagerService.-wrap35(PowerManagerService.java:-1)
at com.android.server.power.PowerManagerService$BinderService.reboot(PowerManagerService.java:4483)
at android.os.PowerManager.reboot(PowerManager.java:969)
at com.android.server.RecoverySystemService$BinderService.rebootRecoveryWithCommand(RecoverySystemService.java:193)
locked <0x25e1> (a java.lang.Object)
at android.os.RecoverySystem.rebootRecoveryWithCommand(RecoverySystem.java:1146)
at android.os.RecoverySystem.bootCommand(RecoverySystem.java:925)
at android.os.RecoverySystem.rebootPromptAndWipeUserData(RecoverySystem.java:855)
at com.android.server.RescueParty.executeRescueLevelInternal(RescueParty.java:190)
at com.android.server.RescueParty.executeRescueLevel(RescueParty.java:166)
at com.android.server.RescueParty.notePersistentAppCrash(RescueParty.java:126)
at com.android.server.am.AppErrors.crashApplicationInner(AppErrors.java:343)
at com.android.server.am.AppErrors.crashApplication(AppErrors.java:322)
at com.android.server.am.ActivityManagerService.handleApplicationCrashInner(ActivityManagerService.java:14621)
at com.android.server.am.ActivityManagerService.handleApplicationCrash(ActivityManagerService.java:14603)
at android.app.IActivityManager$Stub.onTransact(IActivityManager.java:79)
at com.android.server.am.ActivityManagerService.onTransact(ActivityManagerService.java:3011)
at android.os.Binder.execTransact(Binder.java:677)
The call eventually reaches PowerManagerService.lowLevelReboot.
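III. What RescueParty monitors
As stated at the beginning, two scenarios trigger the mechanism:
  • a persistent app is stuck in a crash loop
  • we're stuck in a runtime restart loop.
The first was covered in the previous section: it fires when a persistent app crashes repeatedly. Now the second:
we're stuck in a runtime restart loop:
This detects whether the phone keeps restarting over and over. Here is how boot is monitored: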
private void startBootstrapServices() {
	// ...
	// Now that we have the bare essentials of the OS up and running, take
	// note that we just booted, which might send out a rescue party if
	// we're stuck in a runtime restart loop.
	RescueParty.noteBoot(mSystemContext);

	// Manages LEDs and display backlight so we need it to bring up the display.
	traceBeginAndSlog("StartLightsService");
	// ...
}

When system_server starts, startBootstrapServices calls noteBoot:
/**
* Take note of a boot event. If we notice too many of these events
* happening in rapid succession, we'll send out a rescue party.
*/
public static void noteBoot(Context context) {
	if (isDisabled()) return;
	if (sBoot.incrementAndTest()) {
		sBoot.reset();
		incrementRescueLevel(sBoot.uid);
		executeRescueLevel(context);
	}
}

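This should look familiar: boots are also counted within a time window, and when the count reaches the threshold the rescue level is escalated. The last level is again a reboot into recovery.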
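A boot counter only works if its state outlives the process that keeps restarting, which is why BootThreshold keeps its count and start time in system properties rather than in fields. The sketch below imitates that with a static map standing in for the property store. The class name, the map, and the key names boot_count and boot_start are hypothetical; the trigger values follow the BootThreshold constructor shown earlier.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: boot-loop detection with state kept outside the restarting component.
final class BootLoopSketch {
    // Hypothetical stand-in for system properties; it survives each simulated restart.
    static final Map<String, String> PROPS = new HashMap<>();
    static final int TRIGGER_COUNT = 5;
    static final long TRIGGER_WINDOW_MS = 300_000L;

    static long getLong(String key) {
        return Long.parseLong(PROPS.getOrDefault(key, "0"));
    }

    /** Called once per simulated boot; returns true when a restart loop is detected. */
    static boolean noteBoot(long nowMs) {
        final long window = nowMs - getLong("boot_start");
        if (window > TRIGGER_WINDOW_MS) {
            // Window expired: this boot starts a new window.
            PROPS.put("boot_count", "1");
            PROPS.put("boot_start", Long.toString(nowMs));
            return false;
        }
        final long count = getLong("boot_count") + 1;
        PROPS.put("boot_count", Long.toString(count));
        return count >= TRIGGER_COUNT;
    }

    public static void main(String[] args) {
        long now = 1_000_000L; // arbitrary start time, more than one window after 0
        for (int boot = 1; boot <= 5; boot++) {
            // Each iteration stands for a fresh runtime start, 20 seconds after the last.
            System.out.println("boot " + boot + " -> loop detected: " + noteBoot(now));
            now += 20_000L;
        }
    }
}
```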

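IV. Summary
RescueParty counts how often a persistent process crashes, or how often the runtime restarts, within a time window. Each time the threshold is reached it escalates to the next rescue level; the last level reboots into recovery mode and prompts the user to wipe data, rescuing a phone that cannot otherwise recover.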


Reposted from blog.csdn.net/aa787282301/article/details/78766058