Incredible OOM

Abstract:
 This article examines a type of OOM (OutOfMemoryError) that is thrown even though both the Java heap and the device's physical memory are sufficient at the time of the crash, and explains why it occurs.

Keywords:
 OutOfMemoryError, OOM, pthread_create failed, Could not allocate JNI Env

1. Introduction

 For every mobile developer, memory is a resource to be used carefully, and an OOM (OutOfMemoryError) in production can drive developers crazy, because the stack trace we usually rely on is rarely helpful for locating this kind of problem.
 There is plenty of material on the Internet that teaches us how to use precious heap memory frugally (for example, using smaller images, reusing bitmaps, and so on), but:

  • Are OOMs in production really all caused by a shortage of heap memory?
  • Can an OOM occur when the app's heap and the device's physical memory both have plenty of room?

An OOM crash when memory is abundant? It sounds unbelievable. However, while investigating a problem recently, I found through our in-house APM platform that most of the OOMs in one of the company's products had exactly these characteristics:

  • When the OOM crash occurs, Java heap usage is far below the upper limit set by the Android virtual machine, physical memory is sufficient, and SD card space is sufficient.

 If memory is sufficient, why does an OOM crash occur at this point?

2. Problem description

 Before describing the problem in detail, let me first clarify one question:

    What causes an OOM?

The following official Android APIs and files expose memory limits and usage:

 

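ActivityManager.getMemoryClass():      Upper limit of the VM's Java heap; an allocation that would exceed it causes an OOM
ActivityManager.getLargeMemoryClass(): Upper limit of the Java heap when android:largeHeap="true" is set in the manifest
Runtime.getRuntime().maxMemory():      Memory limit of the current VM instance; one of the two values above
Runtime.getRuntime().totalMemory():    Memory currently claimed, both used and not yet used
Runtime.getRuntime().freeMemory():     The claimed but not yet used part of the above, so used = totalMemory() - freeMemory()
ActivityManager.MemoryInfo.totalMem:   Total device memory
ActivityManager.MemoryInfo.availMem:   Memory currently available on the device
/proc/meminfo:                         Memory information recorded for the device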

        Figure 2-1 Android memory indicators
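
To make the relationship between these indicators concrete, here is a minimal sketch in plain Java that captures the three Runtime values and derives the used heap and the remaining headroom. The class name HeapStats and its methods are hypothetical helpers for illustration only; they are not part of the Android SDK or of the original article.

```java
// Hypothetical helper (not an Android API): snapshots the Runtime heap indicators.
final class HeapStats {
    final long maxBytes;   // Runtime.maxMemory(): heap ceiling of this VM instance
    final long totalBytes; // Runtime.totalMemory(): heap currently claimed
    final long freeBytes;  // Runtime.freeMemory(): claimed but not yet used

    private HeapStats(long maxBytes, long totalBytes, long freeBytes) {
        this.maxBytes = maxBytes;
        this.totalBytes = totalBytes;
        this.freeBytes = freeBytes;
    }

    /** Reads the three values once so later calculations use one consistent snapshot. */
    static HeapStats capture() {
        Runtime rt = Runtime.getRuntime();
        return new HeapStats(rt.maxMemory(), rt.totalMemory(), rt.freeMemory());
    }

    /** used = totalMemory() - freeMemory(), as described in Figure 2-1. */
    long usedBytes() {
        return totalBytes - freeBytes;
    }

    /** How much more the heap could grow before reaching maxMemory(). */
    long headroomBytes() {
        return maxBytes - usedBytes();
    }

    @Override
    public String toString() {
        return "max=" + maxBytes + " total=" + totalBytes
                + " free=" + freeBytes + " used=" + usedBytes();
    }
}
```

Logging HeapStats.capture() at crash time is one way to confirm that an OOM happened with plenty of heap headroom, which is the situation this article investigates.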

 It is generally believed that an OOM occurs because the Java heap is insufficient, that is:

 

The heap limit reported by Runtime.getRuntime().maxMemory() cannot satisfy the size of the requested heap allocation

        Figure 2-2 The cause of Java heap OOM
 This kind of OOM is easy to verify (for example, by using new byte[] to request more heap memory than the maxMemory() threshold). Its error message usually looks like this:

 

java.lang.OutOfMemoryError: Failed to allocate a XXX byte allocation with XXX free bytes and XXXKB until OOM

        Figure 2-3 OOM error message caused by insufficient heap memory
 As mentioned earlier, in the OOM cases examined in this article the heap is sufficient (a large part of the Runtime.getRuntime().maxMemory() heap is still free) and the device's memory is also abundant (ActivityManager.MemoryInfo.availMem is still large). The error messages of these OOMs fall roughly into two categories:

  1. An OOM that occurs on Android 6.0 and Android 7.0 across many device models, referred to as OOM 1 in this article, with the following error message:

 

java.lang.OutOfMemoryError: Could not allocate JNI Env

        Figure 2-4 OOM 1 error message

  2. An OOM concentrated on Huawei phones (EmotionUI_5.0 and above) running Android 7.0 and above, referred to as OOM 2, with the following error message:

 

java.lang.OutOfMemoryError: pthread_create (1040KB stack) failed: Out of memory

        Figure 2-5 Error message of OOM 2

3. Problem analysis and solution

3.1 Code Analysis

 How does the Android system throw an OutOfMemoryError? The following is a brief analysis based on the Android 6.0 source code:

  1. The code with which the Android virtual machine finally throws OutOfMemoryError is in /art/runtime/thread.cc:

 

void Thread::ThrowOutOfMemoryError(const char* msg)
The msg parameter carries the error message of the OOM

        Figure 3-1 Where the ART runtime throws OutOfMemoryError

  2. A search of the code shows that this method is called, and an OutOfMemoryError thrown, in the following places:
  • The first place is during heap operations

 

System source file:
    /art/runtime/gc/heap.cc
Function:
    void Heap::ThrowOutOfMemoryError(Thread* self, size_t byte_count, AllocatorType allocator_type)
Error message when thrown:
    oss << "Failed to allocate a " << byte_count << " byte allocation with " << total_bytes_free  << " free bytes and " << PrettySize(GetFreeMemoryUntilOOME()) << " until OOM";

        Figure 3-2 Java heap OOM
 This OOM is thrown when heap memory is insufficient, that is, when the requested heap size exceeds Runtime.getRuntime().maxMemory(), as mentioned earlier.

  • The second place is when creating a thread

 

System source file:
    /art/runtime/thread.cc
Function:
    void Thread::CreateNativeThread(JNIEnv* env, jobject java_peer, size_t stack_size, bool is_daemon)
Error message when thrown:
    "Could not allocate JNI Env"
  or
    StringPrintf("pthread_create (%s stack) failed: %s", PrettySize(stack_size).c_str(), strerror(pthread_create_result))

        Figure 3-3 OOM during thread creation
 Comparing the error messages, we can tell that the OOM crashes we encountered are thrown here, when a thread is created (Thread::CreateNativeThread).

  • There are other error messages, such as "[XXXClassName] of length XXX would overflow", caused by system limits on the length of a String/Array; they are not discussed in this article.

So what we care about is the OOM thrown in Thread::CreateNativeThread. Why does creating a thread cause an OOM?

3.2 Inference

 Since an OOM is thrown, thread creation must have hit some limit we are not aware of. Since it is not the heap limit that the ART virtual machine sets for us, it may be a lower-level limit.
 Android is based on Linux, so Linux limits also apply to Android. These limits include:

  1. /proc/pid/limits describes the limits Linux places on the corresponding process. Here is a sample:

 

Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            8388608              unlimited            bytes     
Max core file size        0                    unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             13419                13419                processes 
Max open files            1024                 4096                 files     
Max locked memory         67108864             67108864             bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       13419                13419                signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         40                   40                   
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us 

        Figure 3-4 Example of Linux process limits
 Filter the limits in the example above by elimination:

  • Max stack size and Max processes: these limits apply to the whole system rather than to a single process; excluded.
  • Max locked memory: excluded. As the later analysis shows, the mmap call that allocates the thread's private stack during thread creation does not set MAP_LOCKED, so this limit has nothing to do with thread creation.
  • Max pending signals: a threshold on the number of signals at the C layer; irrelevant, excluded.
  • Max msgqueue size: the Android IPC mechanism does not support message queues; excluded.

 Among the remaining limits, Max open files is the most suspicious.
 Max open files is the maximum number of files each process may have open. Each time a process opens a file, a file descriptor (fd) is created (recorded under /proc/pid/fd), and this limit means the number of fds cannot exceed the value specified by Max open files.
 The later analysis of thread creation shows that file descriptors are indeed involved in that process.

  2. Limits described in /proc/sys/kernel

 Among these limits, the one related to threads is /proc/sys/kernel/threads-max, which specifies the upper limit on the number of threads each process can create, so the OOM caused by thread creation may also be related to this limit.
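
The soft limit for Max open files can also be extracted programmatically from the text of /proc/pid/limits. The following is a minimal sketch under stated assumptions: the class ProcLimits and the method softLimit are hypothetical names, the input is the file content already read into a String, and the column layout is assumed to match Figure 3-4.

```java
// Hypothetical helper: parses the text of /proc/<pid>/limits (layout as in Figure 3-4).
final class ProcLimits {
    /**
     * Returns the soft limit for the row whose name is limitName,
     * or -1 if the row is missing, unparsable, or "unlimited".
     */
    static long softLimit(String limitsText, String limitName) {
        for (String line : limitsText.split("\n")) {
            if (!line.startsWith(limitName)) {
                continue;
            }
            // After the row name come the columns: soft limit, hard limit, units.
            String[] cols = line.substring(limitName.length()).trim().split("\\s+");
            if (cols[0].isEmpty() || cols[0].equals("unlimited")) {
                return -1;
            }
            try {
                return Long.parseLong(cols[0]);
            } catch (NumberFormatException e) {
                return -1; // limitName was only a prefix of a longer row name
            }
        }
        return -1;
    }
}
```

For the sample in Figure 3-4, softLimit(text, "Max open files") would yield 1024, the value the fd count is compared against in the experiments that follow.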

3.3 Verification

The above inference is verified in two steps: local verification and online verification.

  • Local verification: verify the inference locally by trying to reproduce OOMs whose error messages match those shown in Figure [2-4] (OOM 1) and Figure [2-5] (OOM 2).
  • Online verification: deliver a plug-in to production users and confirm that the OOMs they hit are indeed caused by the reasons inferred above.

Local verification

Experiment 1:
 Open a large number of network connections (each in its own thread) and keep them open. Each open socket adds one fd (one more entry under /proc/pid/fd).
 Note: this is not the only way to increase the number of fds; other methods, such as opening files or creating HandlerThreads, also work.

  • Experimental expectation:
    when the number of fds of the process (obtainable through ls /proc/pid/fd | wc -l) exceeds the Max open files value specified in /proc/pid/limits, an OOM occurs
  • Experimental results:
    when the number of fds reaches the Max open files value specified in /proc/pid/limits, starting further threads does lead to an OOM. The error message and stack are as follows:

 

E/art: ashmem_create_region failed for 'indirect ref table': Too many open files
E/AndroidRuntime: FATAL EXCEPTION: main
                  Process: com.netease.demo.oom, PID: 2435
                  java.lang.OutOfMemoryError: Could not allocate JNI Env
                      at java.lang.Thread.nativeCreate(Native Method)
                      at java.lang.Thread.start(Thread.java:730)
                      ......

        Figure 3-5 Details of the OOM caused by the number of fds exceeding the limit
 The error message of this OOM is indeed consistent with the "Could not allocate JNI Env" of the OOM found in production, so the OOM reported in production may be caused by the number of fds exceeding the limit. A final conclusion, however, requires online verification (see the next section).
 In addition, the ART virtual machine log contains another key message, "art: ashmem_create_region failed for 'indirect ref table': Too many open files", which is used later to locate and explain the problem.
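
The fd count used in Experiment 1 can also be obtained from inside the process, without a shell. Below is a minimal sketch; FdCounter, countEntries, openFdCount and nearLimit are hypothetical names, and the 0.9 ratio in the usage note is an arbitrary illustrative threshold, not a value from the article.

```java
import java.io.File;

// Hypothetical helper: counts directory entries, e.g. those under /proc/self/fd.
final class FdCounter {
    /** Returns the number of entries in dir, or -1 if it cannot be listed. */
    static int countEntries(File dir) {
        String[] names = dir.list();
        return names == null ? -1 : names.length;
    }

    /**
     * Counts the fds of the current process (Linux/Android only).
     * Listing the directory briefly uses one fd itself, so the result
     * can be one higher than the steady-state count.
     */
    static int openFdCount() {
        return countEntries(new File("/proc/self/fd"));
    }

    /** True when fdCount has reached the given fraction of the Max open files limit. */
    static boolean nearLimit(int fdCount, long maxOpenFiles, double ratio) {
        return fdCount >= 0 && maxOpenFiles > 0 && fdCount >= maxOpenFiles * ratio;
    }
}
```

A call such as FdCounter.nearLimit(FdCounter.openFdCount(), 1024, 0.9) could serve as an early warning before thread creation starts failing with "Could not allocate JNI Env".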

Experiment 2:
 Create a large number of empty threads (they do nothing but sleep).

  • Experimental expectation:
    an OOM crash occurs when the number of threads (visible in real time in the Threads field of /proc/pid/status) exceeds the upper limit specified in /proc/sys/kernel/threads-max

  • Experimental results:

  1. Huawei phones (EmotionUI_5.0 and above) running Android 7.0 and above produce the OOM. The thread limit on these phones is very small (presumably a limit specially modified in the Huawei ROM): each process may have at most 500 threads at the same time, so the problem is easy to reproduce. The error message of the OOM is as follows:

 

W libc    : pthread_create failed: clone failed: Out of memory
W art     : Throwing OutOfMemoryError "pthread_create (1040KB stack) failed: Out of memory"
E AndroidRuntime: FATAL EXCEPTION: main
                  Process: com.netease.demo.oom, PID: 4973
                  java.lang.OutOfMemoryError: pthread_create (1040KB stack) failed: Out of memory
                      at java.lang.Thread.nativeCreate(Native Method)
                      at java.lang.Thread.start(Thread.java:745)
                      ......

        Figure 3-6 Details of the OOM caused by the number of threads exceeding the limit
 The error message is consistent with the OOM we encountered in production: "pthread_create (1040KB stack) failed: Out of memory".
 In addition, the log contains a key message, "pthread_create failed: clone failed: Out of memory", which is used later to locate and explain the problem.

  2. On phones with other ROMs the thread limit is relatively large, and the problem above is not easy to reproduce. However, on a 32-bit system an OOM also occurs when the process runs out of logical address space. Each thread usually needs about 1 MB of stack space allocated with mmap (the stack size can be configured). On a 32-bit system a process has 4 GB of logical address space, of which less than 3 GB is user space. When the logical address space is exhausted (the address space in use can be checked through the VmPeak/VmSize fields in /proc/pid/status), the OOM produced by creating a thread carries the following information:

 

W/libc: pthread_create failed: couldn't allocate 1069056-bytes mapped space: Out of memory
W/art: Throwing OutOfMemoryError "pthread_create (1040KB stack) failed: Try again"
E/AndroidRuntime: FATAL EXCEPTION: main
                  Process: com.netease.demo.oom, PID: 8638
                  java.lang.OutOfMemoryError: pthread_create (1040KB stack) failed: Try again
                       at java.lang.Thread.nativeCreate(Native Method)
                       at java.lang.Thread.start(Thread.java:1063)
                       ......

        Figure 3-7 OOM caused by full logical address space
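
Both quantities watched in Experiment 2, the Threads field and the VmSize field, come from /proc/pid/status. The sketch below shows one way to extract them from the file's text; ProcStatus and its methods are hypothetical names, and the input is assumed to be the file content already read into a String in the usual "Key:<whitespace>value" layout.

```java
// Hypothetical helper: extracts numeric fields from the text of /proc/<pid>/status.
final class ProcStatus {
    /** Returns the first number on the "key:" line, or -1 if absent or not numeric. */
    static long field(String statusText, String key) {
        String prefix = key + ":";
        for (String line : statusText.split("\n")) {
            if (!line.startsWith(prefix)) {
                continue;
            }
            // Value lines look like "Threads:\t500" or "VmSize:\t 1998765 kB".
            String[] cols = line.substring(prefix.length()).trim().split("\\s+");
            try {
                return Long.parseLong(cols[0]);
            } catch (NumberFormatException e) {
                return -1;
            }
        }
        return -1;
    }

    /** Current number of threads in the process. */
    static long threadCount(String statusText) {
        return field(statusText, "Threads");
    }

    /** Virtual address space in use, in kB. */
    static long vmSizeKb(String statusText) {
        return field(statusText, "VmSize");
    }
}
```

On a 32-bit process, a vmSizeKb value approaching 3 GB would point to the address-space case in Figure 3-7, while a threadCount stuck at the ROM's limit would point to the case in Figure 3-6.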

Online verification and problem resolution

The error message reproduced locally in Figure [3-5] matches OOM 1 in production, and Figure [3-6] matches OOM 2. But is OOM 1 in production really caused by the number of fds exceeding the limit, and is OOM 2 really caused by the number of threads exceeding the limit on Huawei phones? A final conclusion requires data from production devices.

Verification method:
 Deliver a plug-in to production users and, when Thread.UncaughtExceptionHandler catches an OutOfMemoryError, record the following information from the /proc/pid directory:

  1. The number of files in the /proc/pid/fd directory (the number of fds)
  2. The Threads field in /proc/pid/status (the current number of threads)
  3. The OOM log information (in addition to the stack trace, it contains some other warning messages)
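
The collection step described above could look roughly like the following sketch. It is not the author's plug-in: OomRecorder, its sink list, and the record format are hypothetical, and a real implementation would persist or upload the record rather than keep it in memory. Note the assumption that /proc can still be read at crash time; when fds are exhausted these reads may themselves fail, in which case the sketch records -1 or "unknown".

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.List;

// Hypothetical handler: records fd and thread counts when an OutOfMemoryError is uncaught.
final class OomRecorder implements Thread.UncaughtExceptionHandler {
    private final Thread.UncaughtExceptionHandler previous; // may be null
    private final List<String> sink; // stand-in for real persistence/upload

    OomRecorder(Thread.UncaughtExceptionHandler previous, List<String> sink) {
        this.previous = previous;
        this.sink = sink;
    }

    @Override
    public void uncaughtException(Thread t, Throwable e) {
        if (e instanceof OutOfMemoryError) {
            sink.add("oom=\"" + e.getMessage() + "\" fds=" + countFds()
                    + " threads=" + readThreads());
        }
        if (previous != null) {
            previous.uncaughtException(t, e); // keep the default crash behaviour
        }
    }

    /** Number of entries under /proc/self/fd, or -1 if it cannot be listed. */
    static int countFds() {
        String[] names = new File("/proc/self/fd").list();
        return names == null ? -1 : names.length;
    }

    /** Value of the Threads field in /proc/self/status, or "unknown". */
    static String readThreads() {
        try (BufferedReader r = new BufferedReader(new FileReader("/proc/self/status"))) {
            String line;
            while ((line = r.readLine()) != null) {
                if (line.startsWith("Threads:")) {
                    return line.substring("Threads:".length()).trim();
                }
            }
        } catch (IOException ex) {
            // Unreadable (non-Linux platform, or no fd left to open the file).
        }
        return "unknown";
    }
}
```

It would be installed once at startup with Thread.setDefaultUncaughtExceptionHandler(new OomRecorder(Thread.getDefaultUncaughtExceptionHandler(), sink)), so that the previously installed handler still runs after the record is taken.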

Online verification of OOM 1
 The information collected from production devices on which OOM 1 occurred shows:

  1. The number of files in the /proc/pid/fd directory equals the Max open files value in /proc/pid/limits, which proves that the fds are exhausted
  2. The log information at the time of the crash is essentially the same as in Figure [3-5]

This proves that OOM 1 in production is indeed caused by too many fds; the inference is verified.

Locating and fixing OOM 1:
 The root cause was that the long-connection library used in the app sometimes sends a large number of HTTP requests in an instant (causing a surge in the number of fds). This has been fixed.

Online verification of OOM 2
 The following information was collected when OOM 2 crashed on Huawei systems (the device models in the collected samples include VKY-AL00, TRT-AL00A, BLN-AL20, BLN-AL10, AL10, TRT-TL10, WAS-AL00, etc.):

  1. The Threads field in /proc/pid/status has reached the upper limit in every sample: Threads: 500
  2. The log information at the time of the crash is essentially the same as in Figure [3-6]

The inference is verified: the thread limit causes clone to fail when a thread is created, which leads to OOM 2 in production.

Locating and fixing OOM 2:
 The problem lies in the app's business code and is still being located and fixed.

3.4 Explanation

Let's analyze from the code how the OOMs described in this article occur. First, a simplified flowchart of thread creation:

Figure 3-8 Thread creation process

As the figure shows, thread creation has roughly two key steps:

  • The first column creates the thread-private structure JNIEnv (the JNI execution environment, used by the C layer to call Java-layer code)
  • The second column calls pthread_create from the POSIX C library to create the thread

The key nodes in the flowchart (marked in the figure) are described below:

  1. Node ① in the figure is the function Thread::CreateNativeThread in /art/runtime/thread.cc. An excerpt:

 

    std::string msg(child_jni_env_ext.get() == nullptr ?
        "Could not allocate JNI Env" :
        StringPrintf("pthread_create (%s stack) failed: %s", PrettySize(stack_size).c_str(), strerror(pthread_create_result)));
    ScopedObjectAccess soa(env);
    soa.Self()->ThrowOutOfMemoryError(msg.c_str());

        Figure 3-9 Thread::CreateNativeThread excerpt
It can be seen that:

  • When creating the JNIEnv fails, the OOM error message is "Could not allocate JNI Env", which matches OOM 1 in this article
  • When pthread_create fails, the OOM error message is "pthread_create (%s stack) failed: %s". The detailed error text comes from the return value (error code) of pthread_create; for the mapping between error codes and descriptions, see the definitions in bionic/libc/include/sys/_errdefs.h. The specific error message of OOM 2 in this article is "Out of memory", which means pthread_create returned 12.

 

...
__BIONIC_ERRDEF( EAGAIN         ,  11, "Try again" )
__BIONIC_ERRDEF( ENOMEM         ,  12, "Out of memory" )
...
__BIONIC_ERRDEF( EMFILE         ,  24, "Too many open files" )
...

        Figure 3-10 System error definition _errdefs.h

  2. Nodes ② and ③ in the figure are the key nodes in creating the JNIEnv. Node ②, the function MemMap::MapAnonymous in /art/runtime/mem_map.cc, allocates memory for the indirect reference table in the JNIEnv structure (which the C layer uses to store JNI local/global references). The memory is allocated with the function ashmem_create_region shown in node ③ (which creates an ashmem anonymous shared memory region and returns a file descriptor). An excerpt of node ②:

 

  if (fd.get() == -1) {
      *error_msg = StringPrintf("ashmem_create_region failed for '%s': %s", name, strerror(errno));
      return nullptr;
  }

        Figure 3-11 MemMap::MapAnonymous excerpt
 The error message "ashmem_create_region failed for 'indirect ref table': Too many open files" from OOM 1 in production matches what is printed here. The error description "Too many open files" indicates that errno (the system global error flag) is 24 here (see Figure [3-10]).
 From this we can conclude that OOM 1 in production occurs because the file descriptors are exhausted and ashmem_create_region cannot return a new fd.

  3. Nodes ④ and ⑤ in the figure are the steps taken when the C library is called to create a thread. Creating a thread first calls __allocate_thread to allocate the thread's private stack memory and related structures, and then calls clone to create the thread. The stack is allocated with mmap. An excerpt of the code at node ⑤:

 

  if (space == MAP_FAILED) {
    __libc_format_log(ANDROID_LOG_WARN,
                      "libc",
                      "pthread_create failed: couldn't allocate %zu-bytes mapped space: %s",
                      mmap_size, strerror(errno));
    return NULL;
  }

        Figure 3-12 __create_thread_mapped_space excerpt
The error message printed here matches the OOM caused by a full process logical address space in Figure [3-7]. The error message "Try again" in Figure [3-7] indicates that errno is 11 (see Figure [3-10]).
 In pthread_create, the relevant code at node ④ is as follows:

 

 int rc = clone(__pthread_start, child_stack, flags, thread, &(thread->tid), tls, &(thread->tid));
  if (rc == -1) {
    int clone_errno = errno;
    // We don't have to unlock the mutex at all because clone(2) failed so there's no child waiting to
    // be unblocked, but we're about to unmap the memory the mutex is stored in, so this serves as a
    // reminder that you can't rewrite this function to use a ScopedPthreadMutexLocker.
    pthread_mutex_unlock(&thread->startup_handshake_mutex);
    if (thread->mmap_size != 0) {
      munmap(thread->attr.stack_base, thread->mmap_size);
    }
    __libc_format_log(ANDROID_LOG_WARN, "libc", "pthread_create failed: clone failed: %s", strerror(errno));
    return clone_errno;
  }

        Figure 3-13 pthread_create excerpt
The error log "pthread_create failed: clone failed: %s" output here matches the OOM we found in production. The error description "Out of memory" in Figure [3-6] indicates that errno is 12 (see Figure [3-10]).
 OOM 2 in production occurs because the thread limit causes clone to fail at node ⑤.

4. Conclusion and Monitoring

4.1 Causes of OOM

In summary, the following causes can lead to an OOM:

  1. The number of file descriptors (fds) exceeds the limit, that is, the number of files under /proc/pid/fd exceeds the limit in /proc/pid/limits. Possible scenarios:
    a burst of requests in a short period that causes a surge in socket fds, a large number of (repeatedly) opened files, and so on.
  2. The number of threads exceeds the limit, that is, the number of threads recorded in /proc/pid/status (the Threads field) exceeds the maximum specified in /proc/sys/kernel/threads-max. Possible scenarios:
    unreasonable use of multi-threading in the app, such as multiple OkHttpClient instances that do not share a thread pool.
  3. The traditional Java heap limit is exceeded, that is, the requested heap size exceeds Runtime.getRuntime().maxMemory().
  4. (Low probability) On a 32-bit system, the logical address space of the process is full.
  5. Other causes.
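
The causes listed above can be told apart by the OOM message alone as a first triage step. The sketch below maps the messages quoted in this article to the corresponding cause; OomClassifier and the Cause enum are hypothetical names, and the mapping reflects only the cases observed here (Android 6.0/7.0), so the result should be treated as a hint to be confirmed with the fd and thread counts.

```java
// Hypothetical triage helper: maps an OutOfMemoryError message to a likely cause.
final class OomClassifier {
    enum Cause { FD_LIMIT, THREAD_LIMIT, ADDRESS_SPACE, JAVA_HEAP, OTHER }

    static Cause classify(String message) {
        if (message == null) {
            return Cause.OTHER;
        }
        // OOM 1: JNIEnv creation failed; in this article that meant fds were exhausted.
        if (message.contains("Could not allocate JNI Env")) {
            return Cause.FD_LIMIT;
        }
        if (message.startsWith("pthread_create")) {
            // OOM 2: clone failed with ENOMEM (12) once the thread limit was reached.
            if (message.endsWith("Out of memory")) {
                return Cause.THREAD_LIMIT;
            }
            // mmap of the thread stack failed, reported as EAGAIN (11), see Figure 3-7.
            if (message.endsWith("Try again")) {
                return Cause.ADDRESS_SPACE;
            }
            return Cause.OTHER;
        }
        // Classic heap OOM thrown from Heap::ThrowOutOfMemoryError.
        if (message.startsWith("Failed to allocate")) {
            return Cause.JAVA_HEAP;
        }
        return Cause.OTHER;
    }
}
```

Grouping crash reports by the returned Cause gives a quick view of which limit an app hits most often in production.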

4.2 Monitoring measures

You can use the Linux inotify mechanism for monitoring:

  • watch /proc/pid/fd to monitor how the app opens files,
  • watch /proc/pid/task to monitor thread usage.

5. Demo

POC (proof of concept) code: https://github.com/piece-the-world/OOMDemo



Author: Tao Caicai
Link: https://www.jianshu.com/p/e574f0ffdb42
Source: Jianshu
The copyright belongs to the author. For commercial reprint, please contact the author for authorization, for non-commercial reprint, please indicate the source.

Origin blog.csdn.net/hi_zhengjian/article/details/106437058