Analysis of Android Camera memory problems

Recommended article:
Author: ByteDance Technical Team
Link: https://juejin.cn/post/6862508868438589447

This article examines native memory OOM crashes that occur during camera shooting on certain Android models, and analyzes them in depth with the help of a memory snapshot cropping tool and a native memory monitoring tool.

Background

Raphael is a native memory monitoring tool developed by the Xigua Video (Watermelon Video) Android team. It is widely used to monitor native memory leaks in ByteDance's internal products (such as Xigua Video, Douyin, and Toutiao). On Douyin 7.8.0-8.3.0, it collected a large number of memory logs from crashes caused by virtual memory hitting its limit (such as pthread_create failures, GL errors, and EGL_BAD_ALLOC). More than 60% of them involved camera-related memory leaks, and these accounted for more than 15% of all crashes (Java & Native). At the same time, manufacturers such as OPPO reported that the native crash rate of the Douyin app on their new models was more than 3 times higher than on other models. Analysis of the logs they provided showed that nearly all of these crashes were caused by virtual memory hitting its limit, and more than 80% of the logs showed camera-related memory allocation failures.

Problem

By aggregating stacks and computing per-.so memory usage from the logs collected by native memory monitoring, we found that the total native memory intercepted by the tool at the time of OOM had reached about 1.3 GB (the upper limit of native memory directly usable by a 32-bit application is about 2 GB). The largest share was memory indirectly referenced by CameraMetadata objects, so the native memory leak was very serious.

Because native memory is allocated very frequently and capturing the Java stack is expensive, it is not practical to capture the Java stack on every intercepted native allocation. Native memory also differs from Java memory in that it is hard to draw conclusions directly from the intercepted data alone, and problems caused by resource exhaustion, such as unreasonable memory use, are generally hard to attribute. In the intercepted data, CameraMetadata referenced the most memory and was therefore the prime suspect, so we decided to start the analysis there.

Initial analysis

Analyze the allocation and release of native memory

The intercepted stacks show that the CameraMetadata creation stack is entered from Java calls, and the memory is ultimately allocated in the native layer (boot-framework.oat & libandroid_runtime.so). A CameraMetadata object owns two pieces of memory: the object itself and the camera_metadata_t pointed to by mBuffer. The source code shows that the camera_metadata_t pointed to by each CameraMetadata object's mBuffer is independent and does not overlap with any other.

Since the tool intercepted so many unreleased allocations, something must be wrong with how this memory is released, so we first examined the release logic of CameraMetadata.mBuffer. The source of CameraMetadata.cpp shows that CameraMetadata::release() does not free the memory pointed to by mBuffer; it hands that memory over to another CameraMetadata object. CameraMetadata::clear() is the real release. clear() is called in two scenarios: when the camera_metadata_t is reused, and when the CameraMetadata object is destroyed.
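The difference between handing a buffer over and actually freeing it can be illustrated with a toy model. This is only a sketch in Java of the semantics described above, not AOSP code: the classes Metadata and Buffer, the acquire method, and the liveBuffers counter are hypothetical stand-ins for CameraMetadata, camera_metadata_t, and the reuse path.

```java
/** Toy model of buffer ownership; all names are hypothetical, not AOSP code. */
class MetadataOwnershipSketch {
    /** Number of simulated native buffers allocated but not yet freed. */
    static int liveBuffers = 0;

    /** Stand-in for camera_metadata_t. */
    static final class Buffer {
        boolean freed = false;
    }

    /** Stand-in for the C++ CameraMetadata class. */
    static final class Metadata {
        Buffer buffer;

        Metadata() {
            buffer = new Buffer();
            liveBuffers++;
        }

        /** Gives up ownership without freeing: the caller now owns the buffer. */
        Buffer release() {
            Buffer handedOver = buffer;
            buffer = null;
            return handedOver;
        }

        /** Reuse path (assumed): frees the buffer currently held, then adopts the new one. */
        void acquire(Buffer other) {
            clear();
            buffer = other;
        }

        /** The only place in this model where a buffer is actually freed. */
        void clear() {
            if (buffer != null) {
                buffer.freed = true;
                liveBuffers--;
                buffer = null;
            }
        }
    }

    public static void main(String[] args) {
        Metadata a = new Metadata();
        Metadata b = new Metadata();
        b.acquire(a.release()); // b frees its old buffer and adopts a's buffer
        System.out.println("live buffers after transfer: " + liveBuffers);
        b.clear();              // simulates destruction of b
        System.out.println("live buffers after clear: " + liveBuffers);
    }
}
```

In this model a buffer that has been released but never reaches a clear() stays counted as live, which is the kind of accumulation the monitoring tool observed.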

As established above, the camera_metadata_t buffers pointed to by CameraMetadata.mBuffer are independent of one another. Judging from the stacks and allocation counts intercepted by the tool, a large number of CameraMetadata instances must exist in memory when the native OOM occurs. C++ objects are usually destroyed by calling delete. It is hard to search AOSP for the places where a CameraMetadata object is deleted, because the variable name at the delete site is unknown. However, a basic C++ convention is that memory is usually released in the same place where it is created. By searching globally for the string new CameraMetadata, we can easily find that the CameraMetadata object is both created and released in [/frameworks/base/core/jni/android_hardware_camera2_CameraMetadata.cpp].

The registration list in android_hardware_camera2_CameraMetadata.cpp shows that the Java class associated with these functions is android/hardware/camera2/impl/CameraMetadataNative, and that the native CameraMetadata_close function corresponds to the Java nativeClose method. Further, nativeClose in CameraMetadataNative is called from the close method, and close is called from the finalize method.

From this analysis, the corresponding native memory is reclaimed only when the CameraMetadataNative object's finalize method runs, and finalize runs on the FinalizerDaemon thread. Our guess was that whenever a native OOM with the above stack occurs, there must be many Java-layer CameraMetadataNative objects whose finalize method has not yet run.
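The finalize -> close -> nativeClose chain can be sketched as follows. This is a simplified stand-in written for illustration, not the CameraMetadataNative source: the class name NativePeerSketch, the LIVE_NATIVE_BYTES counter, and the fake pointer value are assumptions, and the native free is simulated by decrementing a counter.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Stand-in for a small Java object that owns a larger native buffer (hypothetical names). */
class NativePeerSketch {
    /** Simulated number of native bytes currently held by all peers. */
    static final AtomicLong LIVE_NATIVE_BYTES = new AtomicLong();

    private long metadataPtr;      // non-zero while the simulated native buffer is alive
    private final long sizeBytes;

    NativePeerSketch(long sizeBytes) {
        this.sizeBytes = sizeBytes;
        this.metadataPtr = 1;      // pretend a native allocation returned a valid pointer
        LIVE_NATIVE_BYTES.addAndGet(sizeBytes);
    }

    /** Simulated native free. */
    private void nativeClose() {
        LIVE_NATIVE_BYTES.addAndGet(-sizeBytes);
    }

    /** Frees the native buffer once; later calls are no-ops. */
    synchronized void close() {
        if (metadataPtr != 0) {
            nativeClose();
            metadataPtr = 0;
        }
    }

    synchronized boolean isClosed() {
        return metadataPtr == 0;
    }

    /** Without an explicit close(), the buffer is freed only when the finalizer runs after a GC. */
    @SuppressWarnings({"deprecation", "removal"})
    @Override
    protected void finalize() throws Throwable {
        try {
            close();
        } finally {
            super.finalize();
        }
    }
}
```

The Java object here is a few dozen bytes while the memory it pins can be arbitrarily large, so the Java heap sees almost no pressure even as the simulated native total climbs.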

Inspect the Java heap at the crash scene

Fortunately, the memory snapshot cropping tool (Tailor) makes it easy to obtain a large number of Java heap snapshot files for this kind of native OOM. These snapshots confirm the previous conjecture: when this kind of native OOM occurs, there are indeed a large number of CameraMetadataNative objects in the Java layer. In one representative snapshot, except for 6 CameraMetadataNative objects that are referenced by other code, all the rest are in the FinalizerDaemon thread's queue, waiting for their finalize method to run. That snapshot contains 6658 such objects, and only about 600 of them have mMetadataPtr equal to 0, which means the native memory of the remaining roughly 6,000 objects is still waiting to be released in finalize. This matches the data intercepted by the tool, and also indirectly verifies the correctness and reliability of the native memory monitoring.

In-depth analysis

Troubleshoot Finalize execution

Although the analysis above reproduced the problem and confirmed the earlier conjectures, the underlying cause had not yet been found, so there was still no way to fix it. Why so many CameraMetadataNative objects are waiting for finalize was the next direction to investigate. Engineers who have worked on Java stability will know the well-known TimeoutException, whose root cause is a finalize call that takes too long. Could this case be caused by some object's finalize timing out?

The source code of FinalizerDaemon shows that each time it runs an object's finalize method, it records the current object in the finalizingObject field. If a finalize timeout were really the cause, finalizingObject would have to be non-null in at least some snapshots. After checking the state of the FinalizerDaemon thread in all relevant memory snapshots, we found that finalizingObject was null in every one of them. This result was unexpected: the problem does not seem to be caused by some object's finalize method timing out.

Looking at finalizingReference = (FinalizerReference<?>)queue.remove(), the code that follows this line does not null-check finalizingReference, which indicates that remove() never returns null here. Since it cannot return null, queue.remove() can only block while waiting. The source code of ReferenceQueue.java confirms this.

The source code shows that goToSleep is a synchronized method and may block. However, in all relevant snapshots the needToWork field is false, which proves that goToSleep had already completed (only FinalizerWatchdogDaemon.INSTANCE.goToSleep() sets it to false, and this function is private and called only from the FinalizerDaemon thread), so it is unlikely that the thread is blocked there.


In fact, the thread blocks in remove() simply because objects that need finalization are added to the FinalizerDaemon queue only during GC. If no GC happens for a while and the queue is empty, remove() stays blocked, and nothing is added to the queue until the next GC. As it happens, when this kind of native OOM occurs we actively dump a Java heap snapshot with Tailor; dumping the snapshot triggers a GC and a suspend, which causes a large number of CameraMetadataNative objects to be added to FinalizerDaemon.queue at once.
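The blocking behaviour of a reference queue can be observed on a plain JVM. The sketch below uses a WeakReference rather than the runtime-internal FinalizerReference, so it only illustrates the general mechanism that references are enqueued by the garbage collector; the class and method names are hypothetical.

```java
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;

class ReferenceQueueWaitSketch {
    /** Strong reference that keeps the referent reachable, so the GC cannot enqueue its reference. */
    static Object keepAlive;

    /**
     * Waits up to timeoutMs (must be positive) for the reference to appear in the
     * queue while its referent is still strongly reachable. Returns true only if it arrived.
     */
    static boolean enqueuedWhileReachable(long timeoutMs) {
        ReferenceQueue<Object> queue = new ReferenceQueue<>();
        keepAlive = new Object();
        WeakReference<Object> ref = new WeakReference<>(keepAlive, queue);
        return waitFor(queue, ref, timeoutMs);
    }

    /** Same wait, but the referent is unreachable and a GC is requested first. */
    static boolean enqueuedAfterGcHint(long timeoutMs) {
        ReferenceQueue<Object> queue = new ReferenceQueue<>();
        WeakReference<Object> ref = new WeakReference<>(new Object(), queue);
        System.gc(); // only a hint; a JVM is free to ignore it
        return waitFor(queue, ref, timeoutMs);
    }

    private static boolean waitFor(ReferenceQueue<Object> queue, Reference<Object> ref, long timeoutMs) {
        try {
            Reference<?> got = queue.remove(timeoutMs); // returns null when the wait times out
            return got == ref;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("enqueued without GC: " + enqueuedWhileReachable(200));
        System.out.println("enqueued after GC hint: " + enqueuedAfterGcHint(2000));
    }
}
```

The first call times out because nothing can enqueue a reference whose referent is still reachable. The second call usually succeeds on common JVMs, but that depends on the runtime honouring the GC request.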

Analyze GC strategy

The analysis above shows that without a GC, these objects are never added to FinalizerDaemon.queue. This means that no GC had happened for some time before the native OOM, so a large number of CameraMetadataNative objects did not run finalize in time, and the native OOM followed. This was also verified in an offline experiment in which we entered the shooting page and stayed idle. In that experiment, the Java heap triggered a GC only every 30-40 s or even longer. In between, native memory kept growing; after each GC it dropped significantly, and both Java and native memory returned to normal levels. So although the problem is not a blockage in the finalize path, the cause was finally pinned on the GC logic.


Readers familiar with GC may know that the ART virtual machine has many GC causes, of which kGcCauseForAlloc and kGcCauseBackground are the most frequently triggered. When the user stays on the shooting page without doing anything, the program logic is relatively simple. During this period, the camera service periodically (>= 30 times/s) creates a CameraMetadataNative object on the application side through binder, and the shooting page displays the image captured by the camera. In this process, the only objects being created on the Java heap are CameraMetadataNative objects, and each of them occupies very little Java memory. After a GC, when the Java heap has plenty of free memory, the virtual machine will not actively trigger another GC for a long time. If native memory grows too much during this period and hits its limit before the next GC, a native OOM occurs.
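A back-of-the-envelope estimate shows how quickly objects pile up between GCs. The sketch below uses only the figures quoted in this article (at least 30 objects/s, a GC gap of 30-40 s, 6658 objects in one snapshot); the per-buffer size of 64 KiB is a made-up value that only shows the shape of the calculation, and all names are hypothetical.

```java
class GcGapEstimate {
    /** Objects created between two GCs at a steady creation rate. */
    static long objectsBetweenGcs(int objectsPerSecond, int gapSeconds) {
        return (long) objectsPerSecond * gapSeconds;
    }

    /** Seconds needed to accumulate the given number of objects at a steady rate. */
    static double secondsToAccumulate(long objects, int objectsPerSecond) {
        return (double) objects / objectsPerSecond;
    }

    /** Native bytes pinned by that many objects, for an assumed per-object buffer size. */
    static long pinnedNativeBytes(long objects, long assumedBytesPerBuffer) {
        return objects * assumedBytesPerBuffer;
    }

    public static void main(String[] args) {
        long perGap = objectsBetweenGcs(30, 40);
        System.out.println("objects per 40 s gap: " + perGap);
        System.out.println("seconds to reach 6658 objects: " + secondsToAccumulate(6658, 30));
        // 64 KiB per buffer is an assumption, not a measured value.
        System.out.println("pinned bytes per gap: " + pinnedNativeBytes(perGap, 64 * 1024));
    }
}
```

At exactly 30 objects/s, a 40 s gap leaves 1200 unfinalized objects, and accumulating the 6658 objects seen in the snapshot would take roughly 222 s without a GC. Since the real rate is at least 30 objects/s, that duration should be read as a rough upper bound rather than a measurement.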

In summary, the root cause of this type of native OOM is as follows. The application's own native memory is already at a high level. After the camera is opened, the camera service keeps creating CameraMetadataNative objects on the application side through binder communication, and for each of them the application side creates or reuses, through the JNI interface, a relatively large block of native memory to hold the camera_metadata_t. Because the Java-layer CameraMetadataNative object itself is small, continuously creating such small objects can hardly trigger a Java-layer GC within a given period. The native memory they indirectly reference therefore keeps rising until it hits the virtual memory limit and the process crashes.

Solutions

Although the cause is relatively simple, deciding how to fix this type of problem is still difficult. Since the issue is that GC does not happen in time, a simple solution is to trigger GC periodically on the shooting page. But GC is expensive: if the interval is short, overly frequent GC seriously degrades the shooting experience; if the interval is long, there is still a high probability of running into the same native OOM.

A scheme that actively triggers GC can hardly balance the impact on performance. In fact, the crux of the problem is not the Java layer but the native memory referenced by the Java objects. If that memory is actively released in time, the problem is solved completely. The earlier analysis shows that this memory is normally reclaimed in the finalize step after a GC; if we can determine earlier that a CameraMetadataNative is no longer in use, we can release its memory ahead of time and solve the problem once and for all. The source code shows that a CameraMetadataNative is not used elsewhere after it has been passed to the application layer. Once the application layer has finished with the CameraMetadataNative object, the native memory it references can be released by calling its close function through reflection.
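Calling close through reflection might look like the sketch below. It is a generic helper written for illustration, not the code used in the app: closeByReflection and FakeMetadata are hypothetical names, and FakeMetadata stands in for CameraMetadataNative so the example runs on a plain JVM. How the application obtains the real CameraMetadataNative instance is not described in this article and is left out. On real devices, the restrictions on non-SDK interfaces introduced in Android 9 may block this kind of reflective call.

```java
import java.lang.reflect.Method;

class EarlyCloseSketch {
    /**
     * Invokes a no-argument close() declared directly on the target's class,
     * even if it is private. Returns true if the call succeeded.
     */
    static boolean closeByReflection(Object target) {
        if (target == null) {
            return false;
        }
        try {
            Method close = target.getClass().getDeclaredMethod("close");
            close.setAccessible(true);
            close.invoke(target);
            return true;
        } catch (ReflectiveOperationException | RuntimeException e) {
            // Fall back to the normal path: the finalizer releases the memory later.
            return false;
        }
    }

    /** Hypothetical stand-in for CameraMetadataNative. */
    static final class FakeMetadata {
        long metadataPtr = 42; // non-zero means the simulated native buffer is alive

        private void close() {
            metadataPtr = 0;   // simulated native free
        }
    }

    public static void main(String[] args) {
        FakeMetadata metadata = new FakeMetadata();
        boolean closed = closeByReflection(metadata);
        System.out.println("closed=" + closed + ", ptr=" + metadata.metadataPtr);
    }
}
```

Returning false instead of throwing keeps the early release best-effort: if reflection fails, behaviour simply falls back to what it was before the change.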

Offline experiments also show that with the active release strategy enabled, native memory grows significantly more slowly than before. Small objects still accumulate in both the Java heap and the native layer during this period, but native memory grows much more slowly than the Java heap. In this scenario the Java heap triggers a GC before native memory hits its limit, which greatly reduces the likelihood of native OOM.

After the fix shipped, the effect was obvious: crashes of this kind (more than 15% of all Java & Native crashes) were essentially eliminated. In the memory monitoring logs collected afterwards, the memory related to CameraMetadata is mostly within 2 MB.

Summary

This kind of problem has existed for a long time: at least since Android 4.4, the native memory has been released through the finalize function of CameraMetadataNative. In the past, shooting needs were simple, and most photos were taken with the camera app that shipped with the ROM. Because such apps are relatively simple and their native memory level is very low, they rarely hit the virtual memory limit, so the problem was never exposed. With the rise of short-video apps, shooting features (special effects, beautification, and so on) have become heavier and apps more complex. The app's own native memory level keeps rising, and combined with native memory leaks and other factors, this type of problem is easily triggered simply by staying on the shooting page.

In addition, a failed CameraMetadata memory allocation does not crash directly; instead, some other memory allocation request (such as thread creation or GL memory allocation) triggers the crash. This is also the root cause of many black-screen problems during shooting, so this solution incidentally fixed the long-standing camera black-screen problem as well.

This type of problem stems both from the application itself and from the design of the memory reclamation strategy. Besides reducing leaks as much as possible, applications should also strive to lower their native memory level. Using Java's finalize method in AOSP to release indirectly referenced native memory is a lazy design, and similar cases abound in AOSP. In our own development, limited resources such as memory should be reclaimed promptly, and an object's life cycle can even be actively bounded: once an object has done its job, the memory it occupies should be actively reclaimed rather than left to finalize logic.

The two tools mentioned in this article (the native memory monitoring tool Raphael and the Android heap memory snapshot cropping and compression tool Tailor) are efficient, practical basic tools developed by the Xigua Video Android team during long-term memory optimization work. They are widely used in our company's major internal apps for memory optimization and stability management. We will cover the technical details of both tools in later articles on building monitoring tools and on optimization practices, so stay tuned.

