Exploration of Android Memory Monitoring

Background

As Cloud Music has continued to manage online crashes, its crash rate has reached a relatively low level for the industry. However, there are still many OOM crashes online. Most of them are caused by abnormal memory usage resulting from non-standard coding (memory leaks, large objects, large images, and other unreasonable memory use). Memory problems are hard to discover, reproduce, and troubleshoot, so we need monitoring methods and tools to help developers investigate them more effectively. What follows is some of Cloud Music's exploration and practice in memory monitoring, mainly covering the following aspects.

[Figure: overview of the memory monitoring areas covered below]

Memory leak monitoring

When it comes to memory issues, the first thing that comes to mind is memory leaks. Simply put, a memory leak means that objects which are no longer needed are still held, directly or indirectly, through strong references by GC roots with a longer life cycle, so their memory is not released in time and memory problems follow.

Memory leaks push up an application's memory peak and increase the probability of OOM. They are an error-type problem and also one of the easier types to monitor. However, business developers usually carry heavy development workloads and rarely take the initiative to check for leaks locally during development. This requires us to build automated tooling that monitors memory leaks, automatically generates task tickets, and dispatches them to the corresponding developers, so that leaks in the app get driven to resolution the same way crashes do.

Memory Monitoring Solution

First of all, when it comes to memory leak detection, everyone thinks of LeakCanary. LeakCanary is an open-source Java memory leak analysis tool from Square, mainly used to detect common memory leaks in Android applications during the development phase.

The advantage of LeakCanary is that it produces highly readable analysis results together with common solutions, so it is more efficient than other local analysis tools such as MAT. Its core principle is to use the Android lifecycle APIs to detect when Activities and Fragments are destroyed. Destroyed objects are passed to an ObjectWatcher, which holds weak references to them. By default it waits 5 seconds and then checks whether the weak reference has entered its associated reference queue: if it has, the object was collected and there is no leak; otherwise the object may have leaked.
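To make the weak-reference idea concrete, here is a minimal sketch in the spirit of ObjectWatcher. It is not LeakCanary's actual implementation; the class name, the fixed 5-second delay, and the log-based reaction are illustrative only.

```kotlin
import android.os.Handler
import android.os.Looper
import android.util.Log
import java.lang.ref.ReferenceQueue
import java.lang.ref.WeakReference

// Watches objects that should be unreachable (e.g. destroyed Activities).
// If a watched object is still strongly reachable a few seconds later, flag a possible leak.
class SimpleObjectWatcher {
    private val queue = ReferenceQueue<Any>()
    private val watched = mutableMapOf<String, WeakReference<Any>>()
    private val handler = Handler(Looper.getMainLooper())

    fun watch(obj: Any, key: String) {
        watched[key] = WeakReference(obj, queue)
        handler.postDelayed({ checkRetained(key) }, 5_000)
    }

    private fun checkRetained(key: String) {
        // Every reference that shows up in the queue was garbage collected: drop it.
        var collected = queue.poll()
        while (collected != null) {
            watched.entries.removeAll { it.value === collected }
            collected = queue.poll()
        }
        if (watched.containsKey(key)) {
            // Still reachable after the delay: a real tool would trigger a GC,
            // re-check, and then dump and analyze the heap to find the reference chain.
            Log.w("SimpleObjectWatcher", "Possible leak: $key")
        }
    }
}
```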

The core process of LeakCanary is as follows:

[Figure: LeakCanary core flow]

LeakCanary can basically satisfy local leak monitoring in the test environment. However, its detection actively triggers GC, which causes jank, and by default it dumps the heap with Debug.dumpHprofData(), which freezes the application for a long time during the dump, so it is not well suited to a production environment.
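For reference, the straightforward in-process dump looks roughly like this; dumpHprofData() writes the entire Java heap synchronously, which is exactly where the long freeze comes from (the file path below is only an example).

```kotlin
import android.content.Context
import android.os.Debug
import java.io.File

// In-process heap dump: simple, but it blocks the process until the whole heap
// is written out, which is why it is unsuitable for production use.
fun dumpHeapBlocking(context: Context): File {
    val file = File(context.cacheDir, "dump_${System.currentTimeMillis()}.hprof")
    Debug.dumpHprofData(file.absolutePath) // synchronous; can take many seconds on large heaps
    return file
}
```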

To address this, the Kuaishou team proposed an optimization in their open-source framework KOOM: it relies on the copy-on-write mechanism and forks a child process to dump the Java heap, which removes the long freeze in the app process during the dump. KOOM's core approach is to periodically query resource usage such as Java heap memory, thread count, and file descriptor count, then apply the strategy of suspending the virtual machine -> forking the VM process -> resuming the VM -> dumping the memory image in the child, and finally analyze the image offline with Shark to determine memory leaks, search reference chains, and generate an analysis report.

The core flow of KOOM is as follows:

[Figure: KOOM core flow]

After analyzing and comparing the two open-source libraries, and to achieve more comprehensive monitoring, we decided to build our monitoring system along both online and offline dimensions, combined with our platform: problems such as memory leaks and large objects are automatically aggregated and attributed by reference chain, sorted by aggregated problem, and then pushed to business-side developers for fixing through automatic ticket creation.

Online, we set relatively strict trigger conditions (memory staying at its peak, memory surging, the thread count or FD count reaching a threshold several times in a row, and so on, with a single user triggering at most once within a given period). When a user hits these conditions, the heap is dumped to generate an HPROF file, which is then analyzed for memory leaks and large objects (the large-object threshold can be adjusted dynamically through online configuration), as well as information such as large-image memory usage and total image memory usage. The analysis results are finally reported to the backend service. To reduce the impact on online users, we do not upload HPROF files in the early stage; later we will report cropped HPROF files on a sampling basis as needed. There is already fairly detailed material online about how Shark analyzes HPROF files, so it is not expanded on here.
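A simplified sketch of this kind of trigger check is shown below. The thresholds, the once-per-period gating, and the way threads and FDs are counted are illustrative assumptions; the real conditions also require several consecutive hits and are delivered by online configuration.

```kotlin
import java.io.File

// Illustrative thresholds; real values come from online configuration.
private const val HEAP_USAGE_THRESHOLD = 0.85
private const val THREAD_COUNT_THRESHOLD = 400
private const val FD_COUNT_THRESHOLD = 900
private const val MIN_TRIGGER_INTERVAL_MS = 24 * 60 * 60 * 1000L

private var lastTriggerTime = 0L

fun shouldTriggerDump(): Boolean {
    val now = System.currentTimeMillis()
    // A single user triggers at most once within the period.
    if (now - lastTriggerTime < MIN_TRIGGER_INTERVAL_MS) return false

    val runtime = Runtime.getRuntime()
    val heapUsage = (runtime.totalMemory() - runtime.freeMemory()).toDouble() / runtime.maxMemory()
    val threadCount = Thread.getAllStackTraces().size            // Java threads only
    val fdCount = File("/proc/self/fd").listFiles()?.size ?: 0   // open file descriptors

    val triggered = heapUsage > HEAP_USAGE_THRESHOLD ||
            threadCount > THREAD_COUNT_THRESHOLD ||
            fdCount > FD_COUNT_THRESHOLD
    if (triggered) lastTriggerTime = now
    return triggered
}
```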

Offline, we mainly combine automated testing with the test environment: when Activity or Fragment leaks reach a certain threshold, memory peaks, or similar situations occur, a dump is triggered, the HPROF analysis results are produced, and they are reported to the backend service.

Based on the problems reported by the client, the platform side aggregates and processes them over big data, sorts them by the number of leaks, the number of affected users, and the average leaked memory, and then distributes them to the corresponding developers through automatic ticket creation to drive business-side fixes.

Currently we mainly support leak detection for the following kinds of objects:

  • Activities that have been destroyed and finished
  • Fragments whose FragmentManager is null (already destroyed)
  • Windows that have been destroyed
  • Bitmaps exceeding the threshold size
  • Primitive-type arrays exceeding the threshold size
  • Any class whose instance count exceeds the threshold
  • ViewModel instances that have been cleared
  • Root views that have been removed from the WindowManager

Large image monitoring

We all know that Bitmaps have always accounted for a large share of an Android app's total memory consumption, and the shadow of large Bitmaps can be found behind many Java and native memory problems, so managing large images is an essential step in memory governance, and memory monitoring is incomplete without large image monitoring.

For large image monitoring, we mainly distinguish between large images loaded through the online image library and large local resource images.

Online large image monitoring

At present we mainly apply unified monitoring to images loaded from the network. Since all of our business code loads images through the same image framework, we only need to judge at load time whether an image exceeds a certain threshold or exceeds the size of its view, and record and report it if it does. We modified the current image library and added a callback that passes image information to the monitoring SDK. From it we can obtain the width, height, and file size of the loaded image as well as the size of the target view, check whether the image size or the memory it occupies reaches a certain threshold (online configuration is supported here), and finally report to our monitoring platform. To make analysis easier and reduce performance overhead, we do not capture stack traces online; we only collect the current view hierarchy, and to keep it from growing too large we take at most 5 levels, which is currently enough to locate the view. We also combine this with our self-developed Shuguang tracking system to calculate the large-image rate of the current Oid page, which makes it easy to monitor the large-image rate of P0-level pages.
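The check itself can be sketched as below. The LoadedImageInfo callback type, the thresholds, and the oversize factor are assumptions for illustration; in practice the values come from online configuration and the callback is wired into the image framework.

```kotlin
import android.view.View

// Hypothetical data handed over by the image framework's load callback.
data class LoadedImageInfo(
    val url: String,
    val width: Int,
    val height: Int,
    val fileSizeBytes: Long
)

// Illustrative thresholds.
private const val MAX_IMAGE_BYTES = 2L * 1024 * 1024
private const val VIEW_OVERSIZE_FACTOR = 2

fun isLargeImage(info: LoadedImageInfo, targetView: View?): Boolean {
    // Rough in-memory estimate for ARGB_8888: 4 bytes per pixel.
    val estimatedMemory = info.width.toLong() * info.height * 4
    val exceedsMemory = estimatedMemory > MAX_IMAGE_BYTES || info.fileSizeBytes > MAX_IMAGE_BYTES

    // An image far larger than the view that displays it also counts as "large".
    val exceedsView = targetView != null && targetView.width > 0 && targetView.height > 0 &&
            (info.width > targetView.width * VIEW_OVERSIZE_FACTOR ||
             info.height > targetView.height * VIEW_OVERSIZE_FACTOR)

    // If true, the caller collects up to 5 levels of the view hierarchy and reports.
    return exceedsMemory || exceedsView
}
```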

Local image resource monitoring

In addition to large online images, we also apply some controls to local resource images, which at the same time prevents oversized image resources from rapidly growing the package size. The specific implementation is to run local resource checks as a gate in the build process: after the mergeResources task, a plugin traverses the image resources, collects those exceeding the threshold, outputs a list, and reports it to the backend service. Through automatic ticket creation, the corresponding developer is found so that the issue is fixed before release.
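A minimal sketch of such a check, written as a Gradle Kotlin DSL task. The merged-resource directory, task names, and size threshold are assumptions that depend on the AGP version and build setup; our real implementation is a plugin that also reports the list to the backend.

```kotlin
// build.gradle.kts — simplified sketch, not the actual plugin.
val largeImageThresholdBytes = 100L * 1024 // illustrative threshold

val checkLargeImages = tasks.register("checkLargeImages") {
    doLast {
        // The merged-resource path varies by AGP version; adjust as needed.
        val mergedRes = layout.buildDirectory.dir("intermediates/merged_res").get().asFile
        if (!mergedRes.exists()) return@doLast
        mergedRes.walkTopDown()
            .filter { it.isFile && it.extension.lowercase() in setOf("png", "jpg", "jpeg", "webp") }
            .filter { it.length() > largeImageThresholdBytes }
            .forEach { logger.warn("Large image resource: ${it.path} (${it.length() / 1024} KB)") }
        // A real plugin would collect the list, report it, and open a ticket.
    }
}

// Run the check after each variant's resource merge step.
tasks.matching { it.name.matches(Regex("merge\\w*Resources")) }
    .configureEach { finalizedBy(checkLargeImages) }
```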

[Figure: local image resource check flow]

Memory size monitoring

Besides finding leaks and large image problems, we also need to build an overall memory dashboard so that we can better understand the app's memory usage online and monitor it more effectively. Our memory dashboard mainly covers memory at startup (PSS), running memory (PSS), Java memory, threads, and so on.

Startup memory, running memory and Java memory monitoring

We found that if the app needs too much memory at startup, the experience suffers noticeably: the system keeps reclaiming memory and startup becomes slower. So we need to monitor startup memory usage to support subsequent memory governance. On Android we need to pay attention to two kinds of memory usage, physical memory and virtual memory. Usually we use the Android Memory Profiler to inspect an app's memory usage.

[Figure: Android Memory Profiler memory breakdown]

In the Profiler we can see the total memory usage of the current process and the allocation of subdivided categories such as Java heap, native heap, Graphics, Stack, and Code. So how do we obtain the same data at runtime online? Here we mainly obtain Debug.MemoryInfo for all of our processes (note: this call can be slow on low-end devices and must not be made on the main thread). Through Debug.MemoryInfo's getMemoryStat method (API 23 and above) we can obtain the same breakdown as the Memory Profiler's default view, and thus continuously capture detailed memory usage during startup and while running.
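Collection looks roughly like this. This is a simplified sketch: it uses getMemoryStats (the map form of getMemoryStat) with the documented summary.* keys, and it must run off the main thread.

```kotlin
import android.app.ActivityManager
import android.content.Context
import android.os.Build
import android.os.Debug

// Collect a Memory Profiler-like breakdown for every process of the app.
// Call off the main thread: getProcessMemoryInfo can be slow on low-end devices.
fun collectMemoryBreakdown(context: Context): Map<String, Map<String, String>> {
    val am = context.getSystemService(Context.ACTIVITY_SERVICE) as ActivityManager
    val result = mutableMapOf<String, Map<String, String>>()
    val processes = am.runningAppProcesses ?: return result
    val pids = processes.map { it.pid }.toIntArray()
    val memoryInfos = am.getProcessMemoryInfo(pids)
    processes.forEachIndexed { index, process ->
        val info: Debug.MemoryInfo = memoryInfos[index]
        result[process.processName] = if (Build.VERSION.SDK_INT >= 23) {
            // Keys include summary.java-heap, summary.native-heap, summary.graphics,
            // summary.stack, summary.code, summary.total-pss (values in KB).
            info.memoryStats
        } else {
            mapOf("summary.total-pss" to info.totalPss.toString())
        }
    }
    return result
}
```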

To capture memory usage when startup completes, we reuse the completion time point from our existing startup monitoring and collect the memory situation at that moment. We start multiple processes at launch, and according to our earlier analysis, the more memory the app needs during startup, the more likely startup problems become. So we record the data of all processes to pave the way for subsequent process governance.

For running memory, we asynchronously sample the current usage at fixed intervals, again covering all processes of the application. All collected data is reported to the platform and all calculations are done in the backend, which keeps things flexible. The backend can compute indicators such as PSS at startup completion and average PSS while running, which reflect the overall memory situation of the app.
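A sketch of the periodic sampling, reusing the collectMemoryBreakdown function from the previous snippet; the interval and the reporter callback are placeholders.

```kotlin
import android.content.Context
import java.util.concurrent.Executors
import java.util.concurrent.TimeUnit

// Periodically sample memory off the main thread and hand the result to a reporter.
// The 5-minute default interval and the report callback are illustrative assumptions.
class MemorySampler(
    private val context: Context,
    private val report: (Map<String, Map<String, String>>) -> Unit
) {
    private val executor = Executors.newSingleThreadScheduledExecutor()

    fun start(intervalMinutes: Long = 5) {
        executor.scheduleWithFixedDelay({
            // Reuses the per-process collection sketched above.
            report(collectMemoryBreakdown(context))
        }, intervalMinutes, intervalMinutes, TimeUnit.MINUTES)
    }

    fun stop() {
        executor.shutdown()
    }
}
```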

In addition, we can obtain Java memory through Runtime. From the collected data we calculate a Java memory peak rate (by default, using more than 85% of the max heap counts as peaking), and the platform aggregates these into an overall peak rate, which reflects the app's Java memory usage well. Generally, once usage exceeds 85% of the max heap limit, GC becomes more frequent, which easily leads to OOM and jank, so the Java peak rate is a very important indicator for us to watch.

While monitoring Java memory peaking, we also added a low-Java-memory callback to the collection. System callbacks such as onLowMemory apply to the whole system; for a single process there is no callback on Java heap usage that would let us release memory in time. Since we are already computing the peak rate, we simply monitor the process's heap usage in real time and notify the relevant modules to release memory when the threshold is reached, which also reduces the probability of OOM to some extent.
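A sketch of the Runtime-based usage ratio and the threshold callback. The listener mechanism here is illustrative, not a system API, and 0.85 mirrors the 85% threshold mentioned above.

```kotlin
// Java heap usage ratio based on Runtime, with a simple "low memory" notification.
object JavaHeapMonitor {
    private const val PEAK_THRESHOLD = 0.85
    private val listeners = mutableListOf<() -> Unit>()

    fun addLowMemoryListener(listener: () -> Unit) {
        listeners += listener
    }

    fun heapUsageRatio(): Double {
        val runtime = Runtime.getRuntime()
        val used = runtime.totalMemory() - runtime.freeMemory()
        return used.toDouble() / runtime.maxMemory()
    }

    // Called from the periodic sampler; notifies modules to free caches when near the limit.
    fun check() {
        if (heapUsageRatio() > PEAK_THRESHOLD) {
            listeners.forEach { it() }
        }
    }
}
```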

Thread monitoring

Besides the common OOM problems caused by memory leaks or large allocations, we also see errors like the following:

java.lang.OutOfMemoryError: {CanCatch}{main} pthread_create (1040KB stack) failed: Out of memory

The reason should be familiar: the root cause is insufficient memory, and the direct cause is that the initial stack cannot be allocated when the thread is created. We will not walk through the pthread_create source code here. Besides the limit that the process's virtual memory size places on the maximum number of threads, Linux also limits how many threads each process can create (see /proc/pid/limits). In real tests we found that manufacturers configure this limit differently, and when the per-process thread limit is exceeded this type of OOM is also thrown. In particular, some models running Huawei's EMUI limit the maximum number of threads to 500.

To understand our current thread usage, we monitor and count Cloud Music's threads. When the count exceeds a certain threshold, the current thread information is reported to the platform. The platform also computes a thread peak rate, which measures our overall thread health and paves the way for subsequent convergence of application threads.
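A sketch of the thread-count check; the threshold and the report callback are placeholders. Reading /proc/self/status gives the total thread count of the process (native plus Java), while Thread.getAllStackTraces only covers Java threads.

```kotlin
import java.io.File

// Compare the process's total thread count against a threshold and report when exceeded.
object ThreadCountMonitor {
    private const val THREAD_THRESHOLD = 400 // illustrative threshold

    fun check(report: (count: Int, javaThreadNames: List<String>) -> Unit) {
        // /proc/self/status contains a "Threads:" line covering native + Java threads.
        val totalCount = File("/proc/self/status").readLines()
            .firstOrNull { it.startsWith("Threads:") }
            ?.removePrefix("Threads:")?.trim()?.toIntOrNull() ?: 0

        if (totalCount > THREAD_THRESHOLD) {
            // Java thread names help locate which business module created them.
            val javaThreadNames = Thread.getAllStackTraces().keys.map { it.name }
            report(totalCount, javaThreadNames)
        }
    }
}
```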

In addition, we also borrow from KOOM to monitor native thread leaks, mainly by hooking several thread lifecycle functions: pthread_create, pthread_detach, pthread_join, and pthread_exit. The hooks record each thread's lifecycle, stack, name, and other information. When a joinable thread executes pthread_exit without ever being detached or joined, its information is recorded as a leaked thread and reported at an appropriate time.

Summary


Cloud Music's memory monitoring started relatively late compared with the rest of the industry, so we can stand on the shoulders of giants and, based on Cloud Music's current situation, build the monitoring and optimization that best fit our own scenarios. Memory monitoring is a matter of continuous improvement that cannot be finished in one step. What matters more is being able to keep discovering problems and keep refining the monitoring, rather than staying in the stage of "not knowing the current memory status, filling pits while digging new ones". Our goal is to build a reasonable platform that helps developers solve or discover problems in time. Cloud Music's memory monitoring is still being explored and improved, and we will keep optimizing and iterating together with developers.


Origin blog.csdn.net/m0_71506521/article/details/130559448