Android online OOM: how to analyze and solve it

From: https://toutiao.io/shares/1013011/url

 

In Android (Java) development, almost everyone runs into java.lang.OutOfMemoryError (referred to as OOM in this article) sooner or later. This kind of error is harder to resolve than an ordinary Exception or Error, mainly because its root cause is usually not obvious. Since there is no way to obtain a memory dump file directly from users, an OOM that occurs in a release version is even harder to analyze. Starting from a concrete case, this article introduces an approach to OOM analysis and the use of the related tools.

Case background

 

During versions 7.4~7.7 of the Meituan App, the number of OOMs in the food business stayed high, well above the historical level; most of them were errors while decoding local resources.

The OOM counts in the figure are the statistics for each version during the first month after its release, covering both the newly released version and historical versions. Other businesses showed similar OOMs over the same period; since the food business accounts for a large share of the Meituan App's traffic, its OOM count is also higher than that of the other businesses.

 

Approach

 

During the 7.6~7.7 versions, when the problem was most serious, the team made various guesses about why OOMs were occurring so frequently. The author first wondered whether some business change was responsible, such as an increase in the size of the header image or a change in how page modules are loaded, but none of these coincided with the timing of the OOM problem. The next suspicion was a bug in certain ROMs, but that inference lacked strong supporting evidence. So the fundamental way to find the root cause of the OOM is to find out what occupies the most memory, and then analyze, case by case, why it occupies so much.

 

Collecting memory information from users' phones

 

Analyzing memory usage requires a memory dump file, but dump files are usually large, so it is not practical to ask users to upload them. The hope was therefore to collect a few memory characteristics at runtime and report them together with the crash log: when a user hits an OOM, dump the user's memory and analyze it on the device with com.squareup.haha:haha:2.0.3 to obtain a few key figures (the instances that occupy the most memory, their proportions, and so on). This plan quickly proved unworkable, mainly for the following reasons (a sketch of the dump-on-OOM part of the idea follows the list):

  • A new library needs to be introduced.

  • Dumping and then analyzing the memory is time-consuming, far too slow to be acceptable.

  • When an OOM happens the memory is already nearly exhausted; loading the memory dump file and analyzing it on top of that would itself trigger a second OOM, which makes the approach counterproductive.
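To make the abandoned idea concrete, here is a minimal sketch of the dump half of it, assuming an uncaught-exception handler installed from the Application; the class name and dump path are illustrative, and the on-device analysis with the haha library is omitted.

import java.io.File;

// Sketch only: write an HPROF file when an OutOfMemoryError reaches the default handler.
// OomDumpHandler and the dump location are illustrative, not Meituan's actual code.
public class OomDumpHandler implements Thread.UncaughtExceptionHandler {

    private final Thread.UncaughtExceptionHandler delegate;
    private final File dumpFile;

    public OomDumpHandler(Thread.UncaughtExceptionHandler delegate, File dumpFile) {
        this.delegate = delegate;
        this.dumpFile = dumpFile;
    }

    @Override
    public void uncaughtException(Thread thread, Throwable ex) {
        if (ex instanceof OutOfMemoryError) {
            try {
                // On a nearly exhausted heap even this write can fail.
                android.os.Debug.dumpHprofData(dumpFile.getAbsolutePath());
            } catch (Throwable ignored) {
                // Nothing sensible can be done at this point.
            }
        }
        if (delegate != null) {
            delegate.uncaughtException(thread, ex);
        }
    }
}

// Installation, e.g. in Application.onCreate():
// Thread.setDefaultUncaughtExceptionHandler(new OomDumpHandler(
//         Thread.getDefaultUncaughtExceptionHandler(), new File(getFilesDir(), "oom.hprof")));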

Reproducing the OOM by simulation

 

Since collecting memory information from users' phones is not feasible, the only option left is to reproduce the scenario users run into. Because the user's operation path at the moment of the OOM is uncertain, the online OOM cannot be reproduced exactly; instead, a simulated reproduction is used, in which the stack trace at the time of the OOM is essentially the same as the one reported online. To simulate the user's OOM scenario as closely as possible, the basic conditions also need to match, that is, the relevant parameters of the phones used by the affected users.

 

Mining OOM Features

 

Analyze the OOMs since version 7.4 and list the characteristics of the machines on which OOM occurred, mainly memory and screen resolution, while also giving due consideration to other factors such as the system version.

 

These characteristics can be summarized as: ordinary memory size, high resolution, and essentially the same OOM stack trace. Among the affected devices, the OPPO N1 (T/W) accounts for a relatively high share of OOMs, roughly 65%, so this model was chosen as the machine on which to reproduce the OOM.

 

Critical data (memory dump file)

 

We need to reproduce the OOM and then obtain a memory dump. The idea is to run a memory stress test so that the problem is exposed quickly and fully. The concrete plan is:

  • Pick a page that is relatively complex and loads many image resources, such as the food POI details page.

  • Load the page 30 times; to increase the chance of hitting an OOM, use 30 different POI IDs (a crude launch-loop sketch follows this list).
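For illustration only, a crude launch-loop harness under stated assumptions: PoiDetailActivity and the "poi_id" extra are hypothetical names standing in for the food POI details page and its ID parameter, and the launches are spaced a few seconds apart so each page has time to load its images.

import android.content.Context;
import android.content.Intent;
import android.os.Handler;
import android.os.Looper;

public class OomStressTest {

    // Sketch only: open the suspect page once per POI ID to build up memory pressure.
    public static void run(final Context context, long[] poiIds) {
        Handler handler = new Handler(Looper.getMainLooper());
        for (int i = 0; i < poiIds.length; i++) {
            final long poiId = poiIds[i];
            handler.postDelayed(new Runnable() {
                @Override
                public void run() {
                    Intent intent = new Intent(context, PoiDetailActivity.class); // hypothetical Activity
                    intent.putExtra("poi_id", poiId);                             // hypothetical extra key
                    intent.addFlags(Intent.FLAG_ACTIVITY_NEW_TASK);
                    context.startActivity(intent);
                }
            }, i * 5000L); // wait a few seconds between launches
        }
    }
}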

 

After the OOM occurs, use the Android Monitor built into Android Studio to dump an HPROF file, then use the hprof-conv tool from the SDK (located in sdk_root/platform-tools) to convert it to the standard Java heap dump format, for example: hprof-conv input.hprof converted.hprof. The converted file can then be analyzed with MAT (Eclipse Memory Analyzer).

 

Switch to the histogram view and sort in descending order by shallow heap.

 

Select byte[], right-click -> List objects -> with incoming references, sort in descending order, and you can see many byte[] instances of the same size.

 
Right-click one of the arrays -> Path To GC Roots -> exclude xxx references
 

As shown in the figure above, these byte[] instances are all held by the drawables of the system's EdgeEffect. The space occupied by the bitmap behind the drawable is 1566 * 406 * 4 = 2543184 bytes, which matches the size of the byte array.

Now look at another one:

 
These byte[] instances are held by one of the App's background images, as shown below:
 

Using the ImageView's ID (shown in the figure) and the R.txt under the build directory, the ImageView's ID name can be looked up, which in turn reveals that the background image set on it is 720 * 200 (xhdpi). Loaded into memory with density taken into account, its size is exactly 1080 * 300 * 4 = 1296000 bytes, again matching the byte array size.
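Both numbers follow from the usual bitmap memory formula, width * height * 4 bytes for ARGB_8888, with density scaling applied when an xhdpi asset (density 2.0) is decoded on an xxhdpi device (density 3.0, the bucket a 1080p screen like the OPPO N1's falls into). A small sanity-check sketch:

public class BitmapSizeCheck {
    public static void main(String[] args) {
        // EdgeEffect glow drawable, decoded at 1566 x 406 pixels, ARGB_8888.
        System.out.println(1566 * 406 * 4);   // 2543184

        // 720 x 200 asset in drawable-xhdpi scaled by 3/2 for an xxhdpi device.
        int w = 720 * 3 / 2;                  // 1080
        int h = 200 * 3 / 2;                  // 300
        System.out.println(w * h * 4);        // 1296000
    }
}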

 

Data analysis

 

Why do these identically sized byte arrays exist, or in other words, why are multiple copies of EdgeEffect's drawables being created? Looking at the EdgeEffect source code (4.2.2), its drawable members are also obtained through the Resources.getDrawable system call.
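The relevant part of the 4.2.2 EdgeEffect constructor looks roughly like this (paraphrased from the AOSP source rather than quoted verbatim):

// EdgeEffect (Android 4.2.2), paraphrased: the over-scroll edge and glow drawables
// are fetched through Resources.getDrawable just like any other drawable resource.
public EdgeEffect(Context context) {
    final Resources res = context.getResources();
    mEdge = res.getDrawable(com.android.internal.R.drawable.overscroll_edge);
    mGlow = res.getDrawable(com.android.internal.R.drawable.overscroll_glow);
    // ... the intrinsic sizes of the two drawables are read next ...
}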

 

Whether through Resources.getDrawable or TypedArray.getDrawable, the call eventually goes through Resources.loadDrawable. Reading the source of Resources.loadDrawable shows that a cache is indeed used: for a given drawable resource, the system loads it only once and serves it from the cache afterwards.
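A simplified sketch of that caching pattern (not the framework source itself), using the same cache type and key computation that appear in the 4.2.2 code, with the actual decode step stood in by Resources.getDrawable:

import android.content.res.Resources;
import android.graphics.drawable.Drawable;
import android.util.LongSparseArray;
import android.util.TypedValue;
import java.lang.ref.WeakReference;

public class DrawableCacheSketch {

    // Per Resources instance: one decode per resource key, then reuse of the cached ConstantState.
    private final LongSparseArray<WeakReference<Drawable.ConstantState>> mDrawableCache =
            new LongSparseArray<WeakReference<Drawable.ConstantState>>();

    public Drawable loadDrawable(Resources res, TypedValue value, int resId) {
        final long key = (((long) value.assetCookie) << 32) | value.data; // same key shape as the framework
        WeakReference<Drawable.ConstantState> ref = mDrawableCache.get(key);
        Drawable.ConstantState cs = (ref != null) ? ref.get() : null;
        if (cs != null) {
            return cs.newDrawable(res);          // cache hit: no new bitmap/byte[] is allocated
        }
        Drawable d = res.getDrawable(resId);     // cache miss: the bitmap is decoded here
        mDrawableCache.put(key, new WeakReference<Drawable.ConstantState>(d.getConstantState()));
        return d;
    }
}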

 

Since there is nothing wrong with the drawable loading mechanism itself, the next question is whether the cache instance that holds the drawable, or the Resources instance used to fetch it, is actually the same one everywhere. The following code prints each Activity's Resources instance and that Resources instance's drawable cache.

 

// Reflectively read Resources.mDrawableCache for the current Activity's Resources and for
// the Application's Resources, then log both so the instances can be compared
// (Hack is a reflection helper used to access the private field).
//noinspection unchecked
LongSparseArray<WeakReference<Drawable.ConstantState>> cache = (LongSparseArray<WeakReference<Drawable.ConstantState>>)
        Hack.into(Resources.class).field("mDrawableCache").get(getResources());
Object appCache = Hack.into(Resources.class).field("mDrawableCache").get(getApplication().getResources());
Log.e("oom", "Resources.mDrawableCache: {application=" + appCache + ", activity=" + cache + "}");
Log.e("oom", "Resources: {application=" + getApplication().getResources() + ", activity=" + getResources() + "}");

 

 

This also explains another phenomenon: the number of these identically sized arrays is roughly proportional to the number of Activities that have been started.

 

The data analysis shows that multiple copies of these drawables exist because the Resources instances that hold them are not the same instance. Further debugging shows that the reason there are multiple Resources instances is that the flag sCompatVectorFromResourcesEnabled was turned on.

Although the cause of the sudden increase in OOMs turned out to be nothing more than an enabled flag, it is a reminder of how important it is to read API documentation: very often the usage notes of an API already state its restrictions, and even its risks, explicitly.
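That flag is the one controlled by AppCompatDelegate.setCompatVectorFromResourcesEnabled in the support library, whose documentation warns about the memory cost of enabling it; on pre-Lollipop devices it causes each AppCompat Activity's Resources to be wrapped in its own instance, which is exactly the "one drawable cache per Activity" situation observed above. A minimal sketch of leaving it off (what version 7.8 effectively did); DemoApplication is an illustrative name:

import android.app.Application;
import android.support.v7.app.AppCompatDelegate;

public class DemoApplication extends Application {

    @Override
    public void onCreate() {
        super.onCreate();
        // Keep the compat vector flag at its default (false) so Activities on
        // pre-Lollipop devices keep sharing a single Resources drawable cache.
        AppCompatDelegate.setCompatVectorFromResourcesEnabled(false);
    }
}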

 

Version 7.8 turned this flag off. In the first month after release, the OOM count (including historical versions) was 153, as shown in the figure below.

Of these, 22 OOMs occurred on the new version.

Summary

 

For OOMs that show up online, the analysis and solution can be roughly divided into three steps:

  1. Mine the characteristics thoroughly. When doing so, consider the problem from multiple angles; this stage is largely guesswork and suspicion, so every plausible aspect should be covered, including but not limited to code changes, device characteristics, and timing, with some statistical analysis when necessary.

  2. Based on the characteristics identified, find a way to reproduce the problem reliably. A memory stress test is usually needed, since it makes it much easier to reach the OOM threshold; a few simple, ordinary operations can rarely trigger an OOM.

  3. Obtain data that can be analyzed (a memory dump file). Use MAT to analyze the dump file; MAT makes it easy to sort instances by size and to inspect the paths from specific instances to the GC roots.

 

 

 
