Android heating monitoring in practice

1. Background

Now that mobile devices are ubiquitous, most people have felt some degree of battery anxiety and have had the unpleasant experience of a phone running hot. Heating is a long-lived problem that shows up in many scenarios and is influenced by many factors, including the application layer on the device, the phone manufacturer's ROM, and the external environment. How to measure heating scenarios effectively, how to capture the scene when heating occurs, and how to attribute heating problems are the three major challenges for heating monitoring at the application layer. This article describes some of the monitoring practices currently used on the Android side at Dewu. Rather than getting bogged down in power consumption calculation, it focuses first on the heating scenario itself, in the hope of offering some reference.

2. Definition of heating

Temperature is the most intuitive indicator of a heating problem. On Android we currently treat a device temperature above 37 °C as the dividing line and define one heating level for every 3 °C above it, up to 49 °C. This gives five levels: 37-40, 40-43, 43-46, 46-49, and 49+.

Phone temperature and CPU usage serve as the first and second factors in deciding whether the user is experiencing heating; the other parameters are collected to support analysis of the heating scene.
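The level split above can be sketched as a small helper. This is only an illustration: the class and method names are hypothetical, and treating the lower bound of each interval as inclusive is an assumption, since the text does not say which side the boundaries belong to.

```java
final class HeatLevels {
    static final float BASE_CELSIUS = 37f; // dividing line from the text
    static final float STEP_CELSIUS = 3f;  // width of each level
    static final int MAX_LEVEL = 5;        // 49+ is the top level

    /** Returns 0 below 37 °C, otherwise 1..5 for 37-40, 40-43, 43-46, 46-49, 49+. */
    static int levelOf(float celsius) {
        if (celsius < BASE_CELSIUS) {
            return 0;
        }
        // Assumption: the lower bound of each interval is inclusive.
        int level = (int) ((celsius - BASE_CELSIUS) / STEP_CELSIUS) + 1;
        return Math.min(level, MAX_LEVEL);
    }
}
```

For example, a reading of 41.5 °C falls into the second level (40-43).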

The specific indicators are as follows:

Phone temperature, CPU usage, and GPU usage;

Thread stacks;

System service usage frequency;

App foreground/background state and screen-on/screen-off duration;

Battery capacity and charging status;

Thermal mitigation status level;

Device model and system version;

....

3. Obtaining indicators

Temperature

  • Battery temperature

The system's BatteryManager already provides a set of built-in interfaces and sticky broadcasts for obtaining battery information.

The BatteryManager.EXTRA_TEMPERATURE extra of the battery broadcast reports the temperature in tenths of a degree Celsius, so the value must be divided by 10.

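import android.content.Context
import android.content.Intent
import android.content.IntentFilter
import android.os.BatteryManager

// Read the battery temperature from BatteryManager.EXTRA_TEMPERATURE.
// The value is in tenths of a degree Celsius, so divide by 10.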
fun getBatteryTempImmediately(context: Context): Float {
    return try {
        val batIntent = getBatteryStickyIntent(context) ?: return 0f
        batIntent.getIntExtra(BatteryManager.EXTRA_TEMPERATURE, 0) / 10F
    } catch (e: Exception) {
        0f
    }
}

private fun getBatteryStickyIntent(context: Context): Intent? {
    return try {
        context.registerReceiver(null, IntentFilter(Intent.ACTION_BATTERY_CHANGED))
    } catch (e: Exception) {
        null
    }
}

Besides the battery temperature broadcast, BatteryManager also exposes other information such as battery level and charging status, all of which is defined in its source code.

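A few that are worth noting:
// BATTERY_PROPERTY_CHARGE_COUNTER  remaining battery capacity, in microampere-hours
// BATTERY_PROPERTY_CURRENT_NOW  instantaneous battery current, in microamperes
// BATTERY_PROPERTY_CURRENT_AVERAGE  average battery current, in microamperes
// BATTERY_PROPERTY_CAPACITY  remaining battery capacity, as an integer percentage
// BATTERY_PROPERTY_ENERGY_COUNTER  remaining energy, in nanowatt-hours
// EXTRA_BATTERY_LOW  whether the battery is considered low
// EXTRA_HEALTH  integer constant describing battery health
// EXTRA_LEVEL  battery level
// EXTRA_VOLTAGE  battery voltage
// ACTION_CHARGING  the device has entered the charging state
// ACTION_DISCHARGING  the device has entered the discharging state

  • Sensor temperature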

Android is an open source operating system built on the Linux kernel. As on Linux, the /sys/class/thermal/ directory of the phone contains thermal_zoneX entries, one per sensor temperature zone, and cooling_deviceX entries, which represent cooling devices such as fans or radiators.

Taking the OnePlus 9 as an example, it exposes a total of 105 temperature sensors (thermal zones) and 48 cooling devices.

Each thermal zone directory records the zone's parameters. We focus on the type file and the temp file, which hold the name of the sensor device and its current temperature respectively. thermal_zone29, for example, represents the fifth processing unit of the first CPU core and reads 33.2 degrees Celsius. On a given device the name of each zone is fixed, so we can scan the thermal_zone directories and record the temperature of the first zone whose type name contains "CPU" as the CPU temperature.
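A minimal sketch of that scan follows. It is not the production implementation: the class name is hypothetical, the base directory is passed in so the logic can be tried off-device, and it assumes the temp file reports millidegrees Celsius, which is common but not guaranteed on every device. On some devices the files may also be unreadable to ordinary apps, in which case the sketch returns NaN.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

final class ThermalZoneReader {
    /** Returns the temperature (°C) of the lowest-numbered zone whose type contains "cpu", or NaN. */
    static float readCpuTemperature(Path thermalDir) {
        List<Path> zones = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(thermalDir, "thermal_zone*")) {
            for (Path zone : stream) {
                zones.add(zone);
            }
        } catch (IOException e) {
            return Float.NaN; // directory missing or not readable
        }
        zones.sort(Comparator.comparingInt(ThermalZoneReader::zoneIndex));
        for (Path zone : zones) {
            try {
                String type = readTrimmed(zone.resolve("type"));
                if (!type.toLowerCase(Locale.ROOT).contains("cpu")) {
                    continue;
                }
                // Assumption: temp is reported in millidegrees Celsius.
                return Integer.parseInt(readTrimmed(zone.resolve("temp"))) / 1000f;
            } catch (IOException | NumberFormatException e) {
                // Unreadable or malformed zone: try the next one.
            }
        }
        return Float.NaN;
    }

    private static int zoneIndex(Path zone) {
        String name = zone.getFileName().toString();
        try {
            return Integer.parseInt(name.substring("thermal_zone".length()));
        } catch (NumberFormatException e) {
            return Integer.MAX_VALUE;
        }
    }

    private static String readTrimmed(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8).trim();
    }
}
```

On a device the method would be called with the /sys/class/thermal directory; caching the matching zone after the first scan avoids re-reading every type file on each sample.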

  • Case temperature

In Android 10, Google officially introduced a thermal mitigation framework. It monitors the underlying hardware sensors (mainly the USB and skin sensors) through the Thermal HAL 2.0 and reports changes in the thermal status level for USB and case temperature. The system PowerManager exposes a callback for thermal status changes and a getter for the current status, with seven status levels in total, so developers can obtain the status either passively or actively.

final PowerManager powerManager = (PowerManager) mContext.getSystemService(Context.POWER_SERVICE);
powerManager.addThermalStatusListener(new PowerManager.OnThermalStatusChangedListener() {
    @Override
    public void onThermalStatusChanged(int status) {
        // Receives the new thermal status level
    }
});

Compared with a status level, however, the case temperature itself is undoubtedly the most direct reflection of how hot the phone is. The Android framework does in fact provide an AIDL interface through which one can register for thermal change events and obtain a Temperature object, but it is marked as a hidden API, so ordinary applications cannot call it directly. For compatibility across Android versions, we read it through a reflection proxy of ThermalManagerService.

Contrary to expectations, though, Chinese device manufacturers have not fully adapted to the official thermal mitigation framework, and the thermal status callback is often not accurate enough. Instead, each manufacturer's own thermal mitigation SDK has to be integrated separately to obtain the case temperature directly. The specific APIs are described in each manufacturer's internal integration documentation.

CPU usage

CPU usage is collected and calculated by reading and parsing the stat files under /proc.

/proc/[pid]/stat records the CPU information of a process, and /proc/[pid]/task/[tid]/stat records the CPU information of each thread in that process. The individual fields are not described here; for details, see: https://man7.org/linux/man-pages/man5/procfs.5.html .

We focus on fields 14 and 15 (utime and stime), which represent the time the process or thread has spent running in user mode and in kernel mode respectively.

We parse the stat file of the current process and the stat files of all threads under its task directory, take the difference in utime + stime between two consecutive samples, and divide it by the sampling interval (currently 1 s). The result can be treated as the CPU usage of that process or thread. Instantaneous thread CPU usage = ((utime + stime) - (lastUtime + lastStime)) / period, where utime and stime are reported in clock ticks and must be converted to the same time unit as period.
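The calculation above can be sketched as follows. The class and method names are hypothetical, and the number of clock ticks per second is passed in as a parameter: 100 is a common value, but it should be queried with sysconf(_SC_CLK_TCK) rather than assumed.

```java
final class ProcStatCpu {
    /** Returns utime + stime (fields 14 and 15), in clock ticks, from one stat line. */
    static long cpuTicks(String statLine) {
        // Field 2 (comm) is wrapped in parentheses and may itself contain
        // spaces or ')', so only split what follows the last ')'.
        int end = statLine.lastIndexOf(')');
        String[] rest = statLine.substring(end + 2).split(" ");
        // rest[0] is field 3 (state), so field n is rest[n - 3].
        long utime = Long.parseLong(rest[11]);
        long stime = Long.parseLong(rest[12]);
        return utime + stime;
    }

    /** CPU usage in percent of one core between two samples taken periodSeconds apart. */
    static double usagePercent(long prevTicks, long curTicks,
                               double periodSeconds, long ticksPerSecond) {
        return (curTicks - prevTicks) / (ticksPerSecond * periodSeconds) * 100.0;
    }
}
```

With this definition a thread that kept one core fully busy for the whole interval reads 100%, so a process with several busy threads can exceed 100% in total.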

GPU usage

On devices with Qualcomm chips, we can read the contents of /sys/class/kgsl/kgsl-3d0/gpubusy, following the instructions on Qualcomm's official website.

GPU usage = value 1 / value 2 * 100, where value 1 and value 2 are the two numbers read from the gpubusy file. We have verified that the result is basically consistent with the value collected by Snapdragon Profiler.

On devices with MediaTek chips, we can read the usage value directly from /d/ged/hal/gpu_utilization.

As with the CPU, sampling at a fixed interval (once per second) gives the current GPU usage for each second.
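For the Qualcomm case, parsing could look like the sketch below. It assumes the gpubusy file holds two whitespace-separated integers (value 1 and value 2 in the formula above), and it treats a zero second value as 0% usage; both points are assumptions, and the class name is hypothetical.

```java
final class GpuBusy {
    /** Returns GPU usage in percent, or -1 when the content cannot be parsed. */
    static double usagePercent(String gpubusyContent) {
        String[] parts = gpubusyContent.trim().split("\\s+");
        if (parts.length < 2) {
            return -1;
        }
        try {
            long value1 = Long.parseLong(parts[0]);
            long value2 = Long.parseLong(parts[1]);
            if (value2 <= 0) {
                return 0; // assumption: a zero denominator means no activity was recorded
            }
            return value1 * 100.0 / value2;
        } catch (NumberFormatException e) {
            return -1;
        }
    }
}
```

Returning a sentinel instead of throwing keeps the sampling loop running on devices where the file is missing or has a different layout.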

System service usage

Android system services include WakeLock, Alarm, Sensor, Wifi, Net, Location, Bluetooth, Camera, and so on.

Our approach differs little from the conventional monitoring methods in the industry: hook the system ServiceManager to intercept the Binder communication of the system services, match the relevant method names, and record the calls through callbacks in the monitoring middle layer.

Developers familiar with Android know that Zygote is one of the first processes started when the system boots. Zygote forks SystemServer, the process that hosts the system services. Its core run method registers and starts a large number of system services, which are managed through ServiceManager.

Taking LocationManager as an example, we can therefore monitor it by proxying ServiceManager through reflection, intercepting the relevant LocationManager methods, and recording the data we want.

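// Get the Class object of ServiceManager
Class<?> serviceManagerClass = Class.forName("android.os.ServiceManager");
// Get the getService method
Method getServiceMethod = serviceManagerClass.getDeclaredMethod("getService", String.class);
// Call getService through reflection to obtain the original IBinder
IBinder originalBinder = (IBinder) getServiceMethod.invoke(null, "location");
// Wrap the original ILocationManager in a dynamic proxy
Class<?> iLocationManagerStubClass = Class.forName("android.location.ILocationManager$Stub");
Method asInterfaceMethod = iLocationManagerStubClass.getDeclaredMethod("asInterface", IBinder.class);
final Object originalLocationManager = asInterfaceMethod.invoke(null, originalBinder);
Object proxyLocationManager = Proxy.newProxyInstance(context.getClassLoader(),
        new Class[]{Class.forName("android.location.ILocationManager")},
        new InvocationHandler() {
            @Override
            public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
                // Intercept and record the call here
                Log.d("LocationManagerProxy", "Intercepted method: " + method.getName());
                // Then forward to the original implementation
                return method.invoke(originalLocationManager, args);
            }
        });
// getService(String) only takes a name, so it cannot be used to install the proxy.
// Instead, wrap the IBinder so that queryLocalInterface() hands out our proxy...
IBinder proxyBinder = (IBinder) Proxy.newProxyInstance(context.getClassLoader(),
        new Class[]{IBinder.class},
        (proxy, method, args) -> "queryLocalInterface".equals(method.getName())
                ? proxyLocationManager
                : method.invoke(originalBinder, args));
// ...and replace the cached binder in ServiceManager. sCache is a hidden field, so this
// may be blocked by hidden-API restrictions on newer Android versions.
Field cacheField = serviceManagerClass.getDeclaredField("sCache");
cacheField.setAccessible(true);
@SuppressWarnings("unchecked")
Map<String, IBinder> cache = (Map<String, IBinder>) cacheField.get(null);
cache.put("location", proxyBinder);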

In the same way, we record how many times each system service is requested, and the intervals between requests, within a fixed sampling period.

The power_profile file in the Android source code defines the current drawn by each system component in each of its states.

Once we have recorded how long each component spends in each state, we can rank the components by their contribution to heating using the following calculation:

Component power consumption (heat contribution) ≈ current × running time × voltage (the voltage is generally a fixed value and can be ignored)
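As an illustration of the ranking, the sketch below multiplies a per-component current by its recorded active time and sorts the result. The names are hypothetical, the voltage is dropped as the formula allows, and the currents would come from power_profile rather than from the made-up numbers in the usage note.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class HeatContribution {
    private static final double MS_PER_HOUR = 3_600_000.0;

    /**
     * Ranks components by estimated charge consumed (mAh), highest first.
     * currentMa: current per component in mA; activeMs: active time per component in ms.
     */
    static List<Map.Entry<String, Double>> rank(Map<String, Double> currentMa,
                                                Map<String, Long> activeMs) {
        List<Map.Entry<String, Double>> ranked = new ArrayList<>();
        for (Map.Entry<String, Long> entry : activeMs.entrySet()) {
            Double ma = currentMa.get(entry.getKey());
            if (ma == null) {
                continue; // no current defined for this component
            }
            double mAh = ma * entry.getValue() / MS_PER_HOUR;
            ranked.add(new AbstractMap.SimpleEntry<>(entry.getKey(), mAh));
        }
        ranked.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
        return ranked;
    }
}
```

With illustrative inputs of a camera at 400 mA for 15 minutes, GPS at 50 mA for an hour, and Wi-Fi at 100 mA for 6 minutes, the ranking comes out as camera (100 mAh), GPS (50 mAh), Wi-Fi (10 mAh).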

Thread stack

Heating is a composite problem. Unlike a crash, where the scene tells us which thread triggered it, a heating scene does not point to a single thread. Dumping and recording the stacks of all threads is clearly unreasonable when 200+ sub-threads are running. The question becomes: how do we find the thread stacks of the hot code more precisely?

As mentioned above, when calculating CPU usage we already read the stat files of all threads in the process. This gives us the CPU usage of each sub-thread, so we can sort the threads by usage in descending order and keep only those that exceed a threshold (currently 50%) or rank in the top N. Because frequent stack collection has a performance cost, we sacrifice some sampling precision and accuracy: stacks are collected only after indicators such as temperature and CPU usage exceed their thresholds, at the time specified by the delivered configuration.
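One way to read the selection rule is "keep a thread if it is in the top N or above the threshold". The sketch below implements that reading; the type and method names are hypothetical, and the production logic may combine the two conditions differently.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class HotThreadFilter {
    static final class ThreadUsage {
        final int tid;
        final double cpuPercent;

        ThreadUsage(int tid, double cpuPercent) {
            this.tid = tid;
            this.cpuPercent = cpuPercent;
        }
    }

    /** Sorts by CPU usage, highest first, and keeps the top N plus anything above the threshold. */
    static List<ThreadUsage> select(List<ThreadUsage> samples, double thresholdPercent, int topN) {
        List<ThreadUsage> sorted = new ArrayList<>(samples);
        sorted.sort(Comparator.comparingDouble((ThreadUsage t) -> t.cpuPercent).reversed());
        List<ThreadUsage> hot = new ArrayList<>();
        for (int i = 0; i < sorted.size(); i++) {
            ThreadUsage usage = sorted.get(i);
            if (i < topN || usage.cpuPercent > thresholdPercent) {
                hot.add(usage);
            }
        }
        return hot;
    }
}
```

Only the threads returned here need their stacks captured, which keeps the cost far below dumping all 200+ threads.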

One concept needs to be made clear: the name of a thread's entry under the task directory is the kernel thread identifier (tid), whereas Thread.id is the thread ID assigned by the Java runtime.

The two are not equivalent, but native code gives us a way to establish the mapping between them.

In ART's thread.cc, the Java Thread object is converted into the corresponding C++ Thread object, and ShortDump is called to print information about the thread. We can obtain the thread's tid by string-matching the tid= field in that output.
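The string matching step could look like the sketch below. It assumes the dump contains a "tid=" marker followed by digits, as in the illustrative string in the usage note; the exact format depends on the ART version, and the class name is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TidParser {
    // \b keeps "sysTid=" or similar longer keys from matching.
    private static final Pattern TID =
            Pattern.compile("\\btid=(\\d+)", Pattern.CASE_INSENSITIVE);

    /** Returns the tid as a string, or an empty string when none is found. */
    static String findTid(String threadDump) {
        if (threadDump == null) {
            return "";
        }
        Matcher matcher = TID.matcher(threadDump);
        return matcher.find() ? matcher.group(1) : "";
    }
}
```

For an input such as `Thread[2,tid=12345,Native,Thread*=0x7a,peer=0x12c,"main"]` the method returns "12345", which can then be used as the key that joins a Java Thread to its stat file.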

The core code logic is as follows:

// Take the most recent CPU sample from the queue
val threadCpuUsageData = cpuProfileStoreQueue.last().threadUsageDataList
val hotStacks = mutableListOf<HotStack>()
if (threadCpuUsageData != null) {
    val dataCount = minOf(threadCpuUsageData.size, TOP_THREAD_COUNT)
    val traces: MutableMap<Thread, Array<StackTraceElement>> = Thread.getAllStackTraces()
    // Map from tid to the corresponding Java Thread
    val tidMap: MutableMap<String, Thread> = mutableMapOf()
    traces.keys.forEach { thread ->
        // Call the native method to get the dump info that contains the tid
        val tidInfo = hotMonitorListener?.findTidInfoByThread(thread)
        tidInfo?.let {
            val tid = findTidByTidInfo(it)
            if (tid.isNotEmpty()) {
                tidMap[tid] = thread
            }
        }
    }
    // Collect the stacks of the top N threads by CPU usage
    for (index in 0 until dataCount) {
        val singleThreadData = threadCpuUsageData[index]
        // The main thread's tid equals the pid
        val isMainThread = singleThreadData.pid == singleThreadData.tid
        val thread = tidMap[singleThreadData.tid.toString()] ?: continue
        val stackTrace = traces[thread] ?: continue
        if (stackTrace.isEmpty()) continue
        // Serialize the current thread stack
        val sb = StringBuilder()
        for (element in stackTrace) {
            sb.append(element.toString()).append("\n")
        }
        sb.append("\n")
        // Assemble the HotStack record
        val hotStack = HotStack(
            singleThreadData.pid, // process id
            singleThreadData.tid,
            singleThreadData.name,
            singleThreadData.cpuUseRate,
            sb.toString(),
            thread.state,
            isMainThread
        )
        // Log.d("HotMonitor", sb.toString())
        hotStacks.add(hotStack)
    }
}

4. Monitoring solution

Once we understand how the core indicator data is obtained, the core idea of the monitoring solution is straightforward. The remote APM configuration center delivers the sampling configuration, such as sampling thresholds, sampling periods, and the data switch for each module. A Handler on a worker thread then posts messages periodically to collect and assemble the data from each module, and the data is reported at an appropriate time. Further breakdown and analysis of the data is handled by the heating platform.

Overall module architecture

Reporting time

Core collection process

Online and offline distinction

Collecting CPU data and stacks for all sub-threads does cost performance: reading 200+ threads takes about 200 ms in total, and the sampling sub-thread itself uses about 10% CPU. Out of consideration for the online user experience, high-frequency sampling cannot be fully enabled online.

The overall plan is therefore as follows. The offline scenario focuses on discovering, troubleshooting, and fixing every problem; it reports full logs and uses CPU and GPU usage as the primary indicators.

The online scenario focuses on observing the overall heating trend across the user base and on finding potential problem scenarios; it reports only core logs and uses battery temperature as the primary indicator.

Heating platform

With support from colleagues on the platform team, the platform consumes the heating scene data, aggregates the core heating stacks after running them through the Android stack deobfuscation service, and fills in basic fields such as charging status, main-thread CPU usage, problem type, and battery temperature. This gives the platform a process-driven capability to discover, analyze, resolve, and follow up on problems.

The platform displays the stack information and heating information as follows:

Battery temperature and CPU usage are the most intuitive runtime indicators of a heating scenario. In the first phase we focus on managing heating scenarios and do not yet analyze power consumption scenarios such as component hooks in depth. The platform therefore uses battery temperature and CPU usage as the first and second indicators to build a four-quadrant view of heating problems, giving priority to scenarios with both high temperature and high CPU usage.
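The four quadrants can be expressed as a simple classification. In this sketch the names are hypothetical and both thresholds are parameters; the values in the usage note (37 °C and 50%) reuse numbers that appear earlier in the article but are only an assumption about where the platform actually draws the lines.

```java
final class HeatQuadrant {
    enum Quadrant {
        HIGH_TEMP_HIGH_CPU, // handled first
        HIGH_TEMP_LOW_CPU,
        LOW_TEMP_HIGH_CPU,
        LOW_TEMP_LOW_CPU
    }

    /** Classifies one report by battery temperature (°C) and CPU usage (%). */
    static Quadrant classify(float batteryCelsius, double cpuPercent,
                             float tempThresholdCelsius, double cpuThresholdPercent) {
        boolean hot = batteryCelsius >= tempThresholdCelsius;
        boolean busy = cpuPercent >= cpuThresholdPercent;
        if (hot) {
            return busy ? Quadrant.HIGH_TEMP_HIGH_CPU : Quadrant.HIGH_TEMP_LOW_CPU;
        }
        return busy ? Quadrant.LOW_TEMP_HIGH_CPU : Quadrant.LOW_TEMP_LOW_CPU;
    }
}
```

Under those assumed thresholds, a report at 42 °C with 75% CPU usage lands in HIGH_TEMP_HIGH_CPU, while 42 °C with 10% CPU usage lands in HIGH_TEMP_LOW_CPU and is more likely caused by something outside the app's own code, such as charging or the environment.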

During data analysis we ran into two issues: troubleshooting from the data was not efficient enough, and problem identification was not accurate enough.

  • How do we determine that a high-temperature scene arises inside the app and that the temperature rises significantly during use? We filter out scenes where the temperature is already high at launch or when the app returns from the background, and focus on scenes where the temperature rises while the app is in use.
  • Even after online sampling, more than 60,000 records are still reported in a single day. How do we filter out the most important ones? The current approach is to define a temperature span and give priority to cases with a larger temperature span inside the app.
  • Some threads have stacks blocked in wait or similar calls. They accumulate time in kernel mode but do not really consume overall CPU, which produces false positives. We therefore also record the thread's running state and the state recorded in the proc file, so that high-temperature, high-CPU problems on RUNNABLE threads can be handled first.
  • Phone temperature rises gradually, so how do we attribute a temperature rise accurately to a page? We increase the temperature sampling frequency and aggregate instantaneous data such as CPU usage and real-time stacks as supporting evidence. Given the data volume, however, we are still exploring a more reasonable way to aggregate and trim the reported data, and are trying to find a balance between the two.
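The first two points can be combined into one filter-and-sort step, sketched below. The record fields, the definition of span as peak temperature minus starting temperature, and the cut-off for "already hot at the start" are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class TemperatureSpanRanker {
    static final class Session {
        final String id;
        final float startCelsius; // temperature when the app came to the foreground
        final float peakCelsius;  // highest temperature seen while in the foreground

        Session(String id, float startCelsius, float peakCelsius) {
            this.id = id;
            this.startCelsius = startCelsius;
            this.peakCelsius = peakCelsius;
        }

        float span() {
            return peakCelsius - startCelsius;
        }
    }

    /** Drops sessions that were already hot at the start and sorts the rest by span, largest first. */
    static List<Session> prioritize(List<Session> sessions, float hotStartCelsius) {
        List<Session> kept = new ArrayList<>();
        for (Session session : sessions) {
            if (session.startCelsius < hotStartCelsius) {
                kept.add(session);
            }
        }
        kept.sort((a, b) -> Float.compare(b.span(), a.span()));
        return kept;
    }
}
```

A session that starts at 30 °C and peaks at 45 °C (span 15) is therefore examined before one that starts at 35 °C and peaks at 38 °C (span 3), while a session that was already at 41 °C when the app opened is excluded under a 40 °C cut-off.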

5. Results

Since Android heating monitoring went live, and with the support of the platform team, we have found a series of problems and worked with the developers responsible to fix and optimize the corresponding scenarios, for example:

Time-consuming standalone thread tasks were moved to unified thread pool scheduling;

Monitoring and fixing of animations stuck in infinite loops;

Optimization of file read/write strategies in high-IO scenarios;

Lock granularity optimization for highly concurrent tasks;

More efficient serialization for scenarios with frequent JSON parsing, such as the logging library;

Trialing tiered capture parameters for system devices with excessive power draw, such as the system camera;

Frame rate reduction and timely resource recycling in WebGL-based game scenes to optimize runtime memory;

....

This has built up valuable experience for technology selection and implementation in future experience work, in line with our high standards for the app experience.

6. Future Outlook

Phone heating is a gradual experience problem that involves many factors, including hardware, system services, software usage, and the external environment. For on-device troubleshooting, the current priority is unreasonable usage at the application layer. This covers strengthening the troubleshooting toolchain, attributing problems to business owners, dynamic strategy downgrades in low-battery and low-power modes, automated diagnostic reports, and more. There is still plenty worth digging into along this chain, for example:

Monitoring/Tool Enhancements

  • In-app floating analysis tool (CPU/GPU/frequency/temperature/power consumption and other information)
  • Learn from tools such as Battery Historian, Snapdragon Profiler, and Systrace to enhance the capabilities of the self-developed TeslaLab.

Business attribution

  • Automatic assignment of heating stacks
  • Call tracing and finer-grained attribution

Scenario strategy and downgrade

  • CPU tuning, dynamic frame rate, resolution downgrade
  • Exploration of an in-app low-power mode

Automated diagnostic reporting

  • Targeted automated analysis for a single user that outputs a diagnostic report

7. Summary

This article is only a rough introduction to the preliminary work we have done on heating management, along with some ideas for future work on heating and power consumption. We hope the app can deliver a better experience and give users a stronger sense of yearning for better things.

*Text/GavinX

This article is original to Dewu Technology. For more exciting articles, please see: Dewu Technology official website

Reprinting without the permission of Dewu Technology is strictly prohibited, otherwise legal liability will be pursued according to law!


Origin my.oschina.net/u/5783135/blog/10141675