Benoit Lize | 72989a31 | 2022-03-24 08:56:10

# Investigating Out of Memory crashes

A large fraction of process crashes in Chromium are due to Out Of Memory (OOM)
conditions. This page is meant to help Chromium developers understand the stack
traces of these crashes and investigate them. Note that some of the
documentation here only applies to Google Chrome, as it is specific to the way
Google's crash reporting infrastructure aggregates and reports crashes.

Some of the following also assumes that the `malloc()` implementation is
PartitionAlloc, which, as of 2022, is the case on most platforms.

[TOC]

## Identifying OOM crashes

When a process crashes due to an Out Of Memory condition, this is usually
signaled by the presence of `base::internal::OnNoMemoryInternal()` on the stack.

**Google Chrome only:** the crash reporting infrastructure tags these crashes as
"[Out of Memory]" based on this and other function names. The full list is
determined in the (internal) crash server's code.

Since Chromium configures its memory allocators to crash rather than return
`nullptr`, an OOM crash can be triggered from anywhere in the code. Most
commonly, it comes from within the allocator or from higher-level functions
such as `operator new` in C++.

## Distinguishing between underlying causes

### Different causes

A process can reach an OOM condition for several reasons:

* **The OS is truly out of memory**, regardless of how much memory the *current*
  process is using.
* **Some limit inside the OS is reached**. For instance, on Windows, there
  exists a global "commit limit", which is the amount of memory that the system
  can commit. Note that it is possible to commit more memory than what is
  actually in use. This may also happen on Linux systems configured with no or
  limited "overcommit", though the majority of systems don't have such a limit.
* **Virtual address space exhaustion**. This is most likely to happen for
  relatively large allocations on 32-bit systems, where the total addressable
  space is typically 2GiB (most Windows systems), 3GiB (e.g. some Windows
  configurations, Linux) or 4GiB (e.g. WoW64). However, it may also happen on
  64-bit systems, due to either:
  * Limited virtual address space in the CPU/OS. For instance, most Android
    ARM64 systems have only 40 bits of address space as of 2022.
  * "Cage" exhaustion. This is most likely to happen with PartitionAlloc on
    64-bit systems, where all allocations are grouped into a single contiguous
    virtual address space "cage".
* **Sandbox per-process memory limit**. For some process types (e.g. Renderers)
  and on most platforms, the sandbox enforces a maximum per-process memory
  limit. Given that this limit is typically set at the OS level, it may not be
  distinguishable from e.g. commit limit exhaustion.
* **Excessive allocation size**. Some allocators (notably PartitionAlloc)
  purposely limit the maximum allocation size.

### Identifying the cause

In the case of PartitionAlloc, it is possible to distinguish some of these cases:

* **Virtual address space exhaustion**. This is identified by the presence of
  `PartitionOutOfMemoryMappingFailure()` on the stack. It means that the
  allocator was unable to find enough address space, either for its internal
  memory allocation unit size, or for the requested size. Since memory is *not*
  committed at this step, this signals an address space issue.
* **Commit failure**. This is identified by the presence of
  `PartitionOutOfMemoryCommitFailure()` on the stack. This signals that either
  the OS or the sandbox limit has been reached.
* **Excessive allocation size**. This is identified by the presence of
  `PartitionExcessiveAllocationSize()` on the stack.

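The frame names listed above lend themselves to a simple triage pass over a symbolized stack. The following is a minimal sketch, not part of Chromium: `OomCause`, `HasFrame()` and `ClassifyOomCause()` are hypothetical names introduced for illustration, and the stack is assumed to be available as one string per frame. Only the function names being searched for come from this page.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical classification, mirroring the cases described on this page.
enum class OomCause {
  kNotOom,
  kAddressSpaceExhaustion,
  kCommitFailure,
  kExcessiveAllocationSize,
  kUnknownOom,  // OOM marker present, but no PartitionAlloc-specific frame.
};

// Returns true if any frame contains `needle` as a substring.
bool HasFrame(const std::vector<std::string>& frames,
              const std::string& needle) {
  for (const std::string& frame : frames) {
    if (frame.find(needle) != std::string::npos) {
      return true;
    }
  }
  return false;
}

// Checks the most specific PartitionAlloc frames first, then falls back to
// the generic OOM marker.
OomCause ClassifyOomCause(const std::vector<std::string>& frames) {
  if (HasFrame(frames, "PartitionOutOfMemoryMappingFailure")) {
    return OomCause::kAddressSpaceExhaustion;
  }
  if (HasFrame(frames, "PartitionOutOfMemoryCommitFailure")) {
    return OomCause::kCommitFailure;
  }
  if (HasFrame(frames, "PartitionExcessiveAllocationSize")) {
    return OomCause::kExcessiveAllocationSize;
  }
  if (HasFrame(frames, "base::internal::OnNoMemoryInternal")) {
    return OomCause::kUnknownOom;
  }
  return OomCause::kNotOom;
}
```

Substring matching is used so that the sketch tolerates namespaces, template arguments and parameter lists in the symbolized frame text.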
## What to do?

### Commit Limit Reached

The process is "truly" out of memory, or the system is. Some amount of these
crashes is expected, and the crashing location is not necessarily the
culprit. Indeed, as a rough approximation, the failing allocation is more likely
to come from a component that naturally allocates a lot of memory, e.g. V8 or
rendering.

However, if there is a spike, and many stack traces come from an unusual
location (e.g. newly added code), this may signal a memory leak in the component
on the stack, or excessive temporary allocations.

Also, if `PartitionAllocDirectMap()` is on the stack, the memory allocation was
large. It may come from a large buffer, potentially made worse by buffer
resizing. For instance, `std::vector` implementations often double their
capacity when they run out of it, in which case `reserve()`-ing the right size
ahead of time may help.

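As an illustration of the `reserve()` advice, the sketch below counts how often a `std::vector` reallocates while it is being filled, with and without reserving up front. `CountReallocations()` is a hypothetical helper written for this page, and a reallocation is detected simply as a change in `capacity()`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Appends `count` ints to a vector and returns how many times the vector
// reallocated (observed as a change in capacity()) while doing so.
std::size_t CountReallocations(std::size_t count, bool reserve_first) {
  std::vector<int> values;
  if (reserve_first) {
    values.reserve(count);  // A single allocation of the final size.
  }
  std::size_t reallocations = 0;
  std::size_t last_capacity = values.capacity();
  for (std::size_t i = 0; i < count; ++i) {
    values.push_back(static_cast<int>(i));
    if (values.capacity() != last_capacity) {
      ++reallocations;
      last_capacity = values.capacity();
    }
  }
  assert(values.size() == count);
  return reallocations;
}
```

With `reserve_first` set, the loop never reallocates. Without it, the count depends on the standard library's growth factor. Each reallocation briefly keeps both the old and the new buffer alive, which is what can push an already large buffer over a limit.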
### Excessive allocation size

Is the calling code expected to allocate more than 2GiB? Or is it an underflow
somewhere in the calling code?

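The underflow case typically comes from an unsigned subtraction that wraps around. The sketch below shows the failure mode and one way to guard against it. All three function names are hypothetical and use only the standard library; in real Chromium code, the checked arithmetic helpers under `base/numerics/` would likely be the more idiomatic choice.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>

// Unchecked: if `consumed` > `total`, the unsigned subtraction wraps around
// to a value close to SIZE_MAX instead of going negative.
std::size_t RemainingUnchecked(std::size_t total, std::size_t consumed) {
  return total - consumed;
}

// Checked: reports failure instead of wrapping. Returns true and writes to
// `*remaining` only when the subtraction cannot wrap.
bool RemainingChecked(std::size_t total, std::size_t consumed,
                      std::size_t* remaining) {
  assert(remaining != nullptr);
  if (consumed > total) {
    return false;
  }
  *remaining = total - consumed;
  return true;
}

// Heuristic for reading crash reports: a size above SIZE_MAX / 2 would be
// negative if interpreted as signed, which usually points to a wrapped
// subtraction rather than a genuine request.
bool LooksLikeUnderflow(std::size_t size) {
  return size > std::numeric_limits<std::size_t>::max() / 2;
}
```

For example, `RemainingUnchecked(4, 8)` yields `SIZE_MAX - 3`. Passed to an allocator as a size, that value would exceed any allocation size limit.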
### Virtual address space

On 32-bit systems, this is most likely to occur when overall memory usage is
high, or when the requested allocation size is large. Is the calling code
allocating a very large buffer?

## Debugging

### General

On Windows, the allocation size is added to the exception record. In Google
Chrome's crash dashboard, this is shown in "Parameter[0]" of the exception
info. On other operating systems, the allocation size is put on the stack before
crashing, and is thus visible in minidumps.

### PartitionAlloc and Google specific

1. Starting from a specific report, click on the bug icon to start a cloud lldb
   instance.
2. Locate the `PartitionRoot<true>::OutOfMemory()` frame on the stack, and move
   to it with `f <frame number>` (`f 5` in the example below).
3. Locate the stack addresses by printing the registers with `re re` (short for
   `register read`).
4. Show the stack content with `x <stack pointer> <frame pointer>` (`rsp` and
   `rbp` on x86_64).

Below is an example for a crash on x86_64:

```
( lizeb ) bt
* thread #1, stop reason = EXC_BREAKPOINT (code=EXC_I386_BPT, subcode=0x10c45912f)
  * frame #0: 0x000000010c45912f Google Chrome Framework`base::internal::OnNoMemoryInternal(unsigned long) at memory.cc:62
    frame #1: 0x000000010c459149 Google Chrome Framework`base::TerminateBecauseOutOfMemory(unsigned long) at memory.cc:69
    frame #2: 0x000000010c4f39c6 Google Chrome Framework`OnNoMemory(unsigned long) at oom.cc:17
    frame #3: 0x000000010d7e5794 Google Chrome Framework`WTF::PartitionsOutOfMemoryUsing2G(unsigned long) at partitions.cc:281
    frame #4: 0x000000010d7e4d2c Google Chrome Framework`WTF::Partitions::HandleOutOfMemory(unsigned long) at partitions.cc:415
    frame #5: 0x000000010c4f7474 Google Chrome Framework`base::PartitionRoot<true>::OutOfMemory(unsigned long) at partition_root.cc:521
    [...]
( lizeb ) f 5
frame #5: 0x000000010c4f7474 Google Chrome Framework`base::PartitionRoot<true>::OutOfMemory(unsigned long) at partition_root.cc:521
( lizeb ) re re
General Purpose Registers:
       rbp = 0x00007ffee7012c50
       rsp = 0x00007ffee7012bf0
       rip = 0x000000010c4f7474  Google Chrome Framework`base::PartitionRoot<true>::OutOfMemory(unsigned long) + 196 at partition_root.cc:522
21 registers were unavailable.
( lizeb ) x 0x00007ffee7012bf0 0x00007ffee7012c50
0x7ffee7012bf0: 76 61 5f 73 69 7a 65 00 00 00 00 07 00 00 00 00  va_size.........
0x7ffee7012c00: 61 6c 6c 6f 63 00 20 20 00 2d 2d 01 00 00 00 00  alloc.  .--.....
0x7ffee7012c10: 63 6f 6d 6d 69 74 00 20 00 a0 9d 01 00 00 00 00  commit. ........
0x7ffee7012c20: 73 69 7a 65 00 20 20 20 00 00 20 00 00 00 00 00  size.   .. .....
0x7ffee7012c30: aa aa aa aa aa aa aa aa 00 18 b0 12 01 00 00 00  ................
0x7ffee7012c40: 00 00 20 00 00 00 00 00 48 22 b0 12 01 00 00 00  .. .....H"......
```

The results here can help the PartitionAlloc team identify issues, as
important metrics from PartitionAlloc are saved on the stack, as shown above.
For instance, virtual address space usage (`va_size`) is 0x07000000, or 112MiB:
the bytes `00 00 00 07 00 00 00 00` following the label are read as a
little-endian 64-bit value.

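To make the decoding step concrete, here is a small sketch that parses the first four rows of the dump above. It assumes, based only on how that dump looks, that each 16-byte row holds an 8-byte NUL-terminated label followed by an 8-byte little-endian value. `StackEntry`, `ReadLittleEndian64()`, `DecodeEntries()` and `ExampleDump()` are hypothetical names, not PartitionAlloc APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

struct StackEntry {
  std::string label;
  std::uint64_t value;
};

// Reads 8 bytes starting at `bytes[offset]` as a little-endian integer.
std::uint64_t ReadLittleEndian64(const std::vector<std::uint8_t>& bytes,
                                 std::size_t offset) {
  assert(offset + 8 <= bytes.size());
  std::uint64_t value = 0;
  for (std::size_t i = 0; i < 8; ++i) {
    value |= static_cast<std::uint64_t>(bytes[offset + i]) << (8 * i);
  }
  return value;
}

// Splits the dump into 16-byte rows: an 8-byte label (cut at the first NUL),
// then an 8-byte little-endian value. This layout is an assumption.
std::vector<StackEntry> DecodeEntries(const std::vector<std::uint8_t>& bytes) {
  std::vector<StackEntry> entries;
  for (std::size_t row = 0; row + 16 <= bytes.size(); row += 16) {
    StackEntry entry;
    for (std::size_t i = 0; i < 8 && bytes[row + i] != 0; ++i) {
      entry.label.push_back(static_cast<char>(bytes[row + i]));
    }
    entry.value = ReadLittleEndian64(bytes, row + 8);
    entries.push_back(entry);
  }
  return entries;
}

// The first four rows of the example dump, copied byte for byte.
std::vector<std::uint8_t> ExampleDump() {
  return {
      0x76, 0x61, 0x5f, 0x73, 0x69, 0x7a, 0x65, 0x00,  // "va_size"
      0x00, 0x00, 0x00, 0x07, 0x00, 0x00, 0x00, 0x00,
      0x61, 0x6c, 0x6c, 0x6f, 0x63, 0x00, 0x20, 0x20,  // "alloc"
      0x00, 0x2d, 0x2d, 0x01, 0x00, 0x00, 0x00, 0x00,
      0x63, 0x6f, 0x6d, 0x6d, 0x69, 0x74, 0x00, 0x20,  // "commit"
      0x00, 0xa0, 0x9d, 0x01, 0x00, 0x00, 0x00, 0x00,
      0x73, 0x69, 0x7a, 0x65, 0x00, 0x20, 0x20, 0x20,  // "size"
      0x00, 0x00, 0x20, 0x00, 0x00, 0x00, 0x00, 0x00,
  };
}
```

For this dump, `DecodeEntries(ExampleDump())` yields `va_size` = 0x07000000, `alloc` = 0x012d2d00, `commit` = 0x019da000 and `size` = 0x00200000. This page only spells out the meaning of `va_size`; the other labels are reported as they appear on the stack.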