# Investigating Out of Memory crashes

A large fraction of process crashes in Chromium are due to Out Of Memory (OOM)
conditions. This page is meant to help Chromium developers understand the
resulting stack traces and investigate these crashes. Note that some of the
documentation here only applies to Google Chrome, as it is specific to the way
Google's crash reporting infrastructure aggregates and reports crashes.

Some of the following also assumes that the `malloc()` implementation is
PartitionAlloc, which, as of 2022, is the case on most platforms.

[TOC]

## Identifying OOM crashes

When a process crashes due to an Out Of Memory condition, this is usually
signaled by the presence of `base::internal::OnNoMemoryInternal()` on the stack.

**Google Chrome only:** the crash reporting infrastructure tags these crashes as
"[Out of Memory]" based on this function name and a few others. The full list is
determined in the (internal) crash server's code.

Since Chromium configures its memory allocators to crash rather than return
`nullptr`, an OOM crash can be triggered from anywhere in the code, most
commonly from within the allocator or from higher-level functions such as
`operator new` in C++.
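
The crash-instead-of-`nullptr` policy can be illustrated with the minimal sketch below. `AllocOrCrash()` and `CrashOnNoMemory()` are hypothetical names, not Chromium functions. The point is that the process terminates inside the failure handler, so the handler frames end up at the top of the crash stack, with the allocating caller further down.

```cpp
#include <cstddef>
#include <cstdio>
#include <cstdlib>

// Hypothetical OOM handler: reports the requested size, then terminates the
// process. It never returns, so callers never observe a nullptr.
[[noreturn]] void CrashOnNoMemory(std::size_t size) {
  std::fprintf(stderr, "Out of memory: failed to allocate %zu bytes\n", size);
  std::abort();
}

// Hypothetical allocation wrapper following a "crash, don't return nullptr"
// policy: any call site can become the location of an OOM crash.
void* AllocOrCrash(std::size_t size) {
  void* ptr = std::malloc(size);
  if (ptr == nullptr) {
    // On failure, the crash stack shows this handler on top of the caller.
    CrashOnNoMemory(size);
  }
  return ptr;
}
```

Callers of such a wrapper never need a `nullptr` check, which is also why the crashing call site says little about who is actually using the memory.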

## Distinguishing between underlying causes
### Different causes

A process can reach an OOM condition for several reasons:

* **The OS is truly out of memory**, regardless of how much memory the *current*
  process is using.
* **Some limit inside the OS is reached**. For instance, on Windows, there
  exists a global "commit limit", which is the amount of memory that the system
  can commit. Note that a process can commit more memory than it actually uses,
  so the commit limit can be reached while physical memory is still available.
  This may also happen on Linux systems configured with no or limited
  "overcommit", though most systems do not enforce such a limit.
* **Virtual address space exhaustion**. This is most likely to happen for
  relatively large allocations on 32-bit systems, where the total addressable
  space is typically 2GiB (most Windows systems), 3GiB (e.g. some Windows
  configurations, Linux) or 4GiB (e.g. WoW64). However, it may also happen on
  64-bit systems, either due to:
  * Limited virtual address space in the CPU/OS. For instance, as of 2022, most
    Android ARM64 systems have only 40 bits of address space.
  * "Cage" exhaustion. This is most likely to happen with PartitionAlloc on
    64-bit systems, where all allocations are grouped into a single contiguous
    virtual address space "cage".
* **Sandbox per-process memory limit**. For some process types (e.g. renderers)
  and on most platforms, the sandbox enforces a maximum per-process memory
  limit. Given that this limit is typically set at the OS level, it may not be
  distinguishable from e.g. commit limit exhaustion.
* **Excessive allocation size**. Some allocators (notably PartitionAlloc)
  purposely limit the maximum allocation size.

### Identifying the cause

In the case of PartitionAlloc, it is possible to distinguish some of these cases:

* **Virtual address space exhaustion**. This is identified by the presence of
  `PartitionOutOfMemoryMappingFailure()` on the stack. It means that the
  allocator was unable to find enough address space, either for its internal
  memory allocation unit or for the requested size. Since memory is *not*
  committed at this step, this signals an address space issue.
* **Commit failure**. This is identified by the presence of
  `PartitionOutOfMemoryCommitFailure()` on the stack. This signals that either
  the OS or the sandbox limit has been reached.
* **Excessive allocation size**. This is identified by the presence of
  `PartitionExcessiveAllocationSize()` on the stack.
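
The rules above can be turned into a small triage helper. This is a hypothetical sketch, not the logic used by Google's crash server (which is internal): `ClassifyOom()` and `OomCause` are illustrative names, and the input is assumed to be the list of symbolized function names from one crash stack.

```cpp
#include <string>
#include <vector>

// Coarse OOM causes that can be told apart from PartitionAlloc frames.
enum class OomCause {
  kNotOom,                  // No OOM marker frame found.
  kUnknownOom,              // OOM, but no PartitionAlloc-specific frame.
  kAddressSpace,            // PartitionOutOfMemoryMappingFailure()
  kCommit,                  // PartitionOutOfMemoryCommitFailure()
  kExcessiveAllocationSize  // PartitionExcessiveAllocationSize()
};

// Hypothetical helper: scans the stack for the marker frames described above.
// A PartitionAlloc-specific frame wins over the generic OOM marker.
OomCause ClassifyOom(const std::vector<std::string>& frames) {
  bool is_oom = false;
  for (const std::string& frame : frames) {
    auto has = [&frame](const char* name) {
      return frame.find(name) != std::string::npos;
    };
    if (has("PartitionOutOfMemoryMappingFailure")) return OomCause::kAddressSpace;
    if (has("PartitionOutOfMemoryCommitFailure")) return OomCause::kCommit;
    if (has("PartitionExcessiveAllocationSize")) {
      return OomCause::kExcessiveAllocationSize;
    }
    if (has("OnNoMemoryInternal")) is_oom = true;
  }
  return is_oom ? OomCause::kUnknownOom : OomCause::kNotOom;
}
```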

## What to do?

### Commit Limit Reached

The process is "truly" out of memory, or the system is. Some number of these
crashes is expected, and the crashing location is not necessarily the culprit:
the failing allocation is simply the one that happened to cross the limit. As a
rough approximation, it is more likely to come from a component that naturally
allocates a lot of memory, e.g. V8 or rendering.

However, if there is a spike and many stack traces come from an unusual
location (e.g. newly added code), this may signal a memory leak or excessive
temporary allocations in the component on the stack.

Also, if `PartitionAllocDirectMap()` is on the stack, the requested allocation
was large. It may come from a large buffer, potentially made worse by buffer
resizing. For instance, `std::vector` implementations typically grow their
capacity geometrically (e.g. doubling it) when they run out of space, and the
old and new buffers are both alive while elements are moved. In that case,
calling `reserve()` with the right size ahead of time may help.
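
The effect of `reserve()` can be checked with the sketch below, which counts how often a `std::vector`'s buffer is reallocated while appending elements. The exact growth factor is implementation-defined, so only the difference between the two cases matters; `CountReallocations()` is a hypothetical helper for illustration.

```cpp
#include <cstddef>
#include <vector>

// Counts how many times the vector's buffer changes capacity (i.e. is
// reallocated) while appending `count` elements. Each reallocation briefly
// needs the old and the new buffer at the same time.
std::size_t CountReallocations(std::size_t count, bool reserve_first) {
  std::vector<int> values;
  if (reserve_first) {
    values.reserve(count);  // One allocation of at least the final size.
  }
  std::size_t reallocations = 0;
  std::size_t capacity = values.capacity();
  for (std::size_t i = 0; i < count; ++i) {
    values.push_back(static_cast<int>(i));
    if (values.capacity() != capacity) {
      capacity = values.capacity();
      ++reallocations;
    }
  }
  return reallocations;
}
```

With `reserve_first` set, the count is zero: a single allocation of the final size replaces a series of growing ones, so two large buffers never need to coexist.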

### Excessive allocation size

Is the calling code expected to allocate more than 2GiB? Or is this the result
of an integer underflow in the calling code (e.g. a negative size converted to a
huge unsigned value)?
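
A common way to end up requesting an absurd size is unsigned wrap-around. The sketch below (hypothetical helpers `UncheckedLength()` and `CheckedLength()`) shows how an inverted range turns into a size close to `SIZE_MAX`, which an allocator that limits the maximum allocation size rejects instead of trying to satisfy.

```cpp
#include <cstddef>
#include <cstdint>  // SIZE_MAX
#include <optional>

// Buggy: if `end < begin`, the unsigned subtraction wraps around to a value
// close to SIZE_MAX. Allocating that many bytes is an "excessive allocation
// size", not a genuine need for memory.
std::size_t UncheckedLength(std::size_t begin, std::size_t end) {
  return end - begin;
}

// Safer: reject inverted ranges before the result can reach an allocator.
std::optional<std::size_t> CheckedLength(std::size_t begin, std::size_t end) {
  if (end < begin) {
    return std::nullopt;
  }
  return end - begin;
}
```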

### Virtual address space

On 32-bit systems, this is most likely to occur when overall memory usage is
high, or when the requested allocation size is large. Is the calling code
allocating a very large buffer?
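
Large allocations are the most exposed because they need a single contiguous range of address space. The toy model below illustrates this; it is a simplification (real OSes and allocators track address space differently), and `CanReserve()` / `TotalFree()` are hypothetical helpers. With free gaps of 512, 512 and 256 MiB, 1280 MiB are free in total, yet a 768 MiB request cannot be satisfied.

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

// Toy model of a fragmented address space: `free_gaps` lists the sizes, in
// bytes, of the unused virtual address ranges. A request needs one gap that
// is large enough on its own; gaps cannot be combined.
bool CanReserve(const std::vector<std::uint64_t>& free_gaps,
                std::uint64_t request) {
  return std::any_of(free_gaps.begin(), free_gaps.end(),
                     [request](std::uint64_t gap) { return gap >= request; });
}

// Total free address space, which can be misleadingly large.
std::uint64_t TotalFree(const std::vector<std::uint64_t>& free_gaps) {
  return std::accumulate(free_gaps.begin(), free_gaps.end(), std::uint64_t{0});
}
```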

## Debugging

### General

On Windows, the allocation size is added to the exception record. In Google
Chrome's crash dashboard, it is shown in "Parameter[0]" of the exception info.
On other operating systems, the allocation size is put on the stack before
crashing, and is thus visible in minidumps.

### PartitionAlloc and Google specific

1. Starting from a specific report, click on the bug icon to start a cloud lldb
   instance.
2. Locate the `PartitionRoot<true>::OutOfMemory()` frame on the stack, and
   select it with `f <frame_number>` (`f 5` in the example below).
3. Locate the stack addresses by printing the registers with `re re` (short for
   `register read`). On x86_64, the stack pointer is `rsp` and the frame pointer
   is `rbp`.
4. Show the stack content with `x <stack_pointer> <frame_pointer>`.

Below is an example for a crash on x86_64:

```
( lizeb ) bt
* thread #1, stop reason = EXC_BREAKPOINT (code=EXC_I386_BPT, subcode=0x10c45912f)
  * frame #0: 0x000000010c45912f Google Chrome Framework`base::internal::OnNoMemoryInternal(unsigned long) at memory.cc:62
    frame #1: 0x000000010c459149 Google Chrome Framework`base::TerminateBecauseOutOfMemory(unsigned long) at memory.cc:69
    frame #2: 0x000000010c4f39c6 Google Chrome Framework`OnNoMemory(unsigned long) at oom.cc:17
    frame #3: 0x000000010d7e5794 Google Chrome Framework`WTF::PartitionsOutOfMemoryUsing2G(unsigned long) at partitions.cc:281
    frame #4: 0x000000010d7e4d2c Google Chrome Framework`WTF::Partitions::HandleOutOfMemory(unsigned long) at partitions.cc:415
    frame #5: 0x000000010c4f7474 Google Chrome Framework`base::PartitionRoot<true>::OutOfMemory(unsigned long) at partition_root.cc:521
[...]
( lizeb ) f 5
frame #5: 0x000000010c4f7474 Google Chrome Framework`base::PartitionRoot<true>::OutOfMemory(unsigned long) at partition_root.cc:521
( lizeb ) re re
General Purpose Registers:
       rbp = 0x00007ffee7012c50
       rsp = 0x00007ffee7012bf0
       rip = 0x000000010c4f7474 Google Chrome Framework`base::PartitionRoot<true>::OutOfMemory(unsigned long) + 196 at partition_root.cc:522
21 registers were unavailable.
( lizeb ) x 0x00007ffee7012bf0 0x00007ffee7012c50
0x7ffee7012bf0: 76 61 5f 73 69 7a 65 00 00 00 00 07 00 00 00 00  va_size.........
0x7ffee7012c00: 61 6c 6c 6f 63 00 20 20 00 2d 2d 01 00 00 00 00  alloc.  .--.....
0x7ffee7012c10: 63 6f 6d 6d 69 74 00 20 00 a0 9d 01 00 00 00 00  commit. ........
0x7ffee7012c20: 73 69 7a 65 00 20 20 20 00 00 20 00 00 00 00 00  size.   .. .....
0x7ffee7012c30: aa aa aa aa aa aa aa aa 00 18 b0 12 01 00 00 00  ................
0x7ffee7012c40: 00 00 20 00 00 00 00 00 48 22 b0 12 01 00 00 00  .. .....H"......
```
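
The results here can help the PartitionAlloc team identify issues, as
PartitionAlloc saves important metrics on the stack before crashing. The first
four rows above each hold a short, NUL-padded label (`va_size`, `alloc`,
`commit`, `size`) followed by an 8-byte little-endian value. For instance, the
virtual address space usage is stored as the bytes `00 00 00 07 00 00 00 00`,
which read least-significant byte first give 0x07000000 (112 MiB).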
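
Decoding these rows by hand is error-prone, so here is a minimal sketch that extracts the label and value from one 16-byte row. It assumes the layout visible in the dump above (an 8-byte, NUL-terminated label followed by an 8-byte little-endian value); `Row`, `Label()` and `LittleEndianValue()` are hypothetical names, not part of PartitionAlloc.

```cpp
#include <array>
#include <cstdint>
#include <string>

// One 16-byte row of the stack dump (assumed layout: label, then value).
using Row = std::array<std::uint8_t, 16>;

// Reads the label from bytes 0..7, stopping at the first NUL byte.
std::string Label(const Row& row) {
  std::string label;
  for (int i = 0; i < 8 && row[i] != 0; ++i) {
    label += static_cast<char>(row[i]);
  }
  return label;
}

// Assembles bytes 8..15 with byte 8 as the least significant one, so the
// result does not depend on the endianness of the machine running this code.
std::uint64_t LittleEndianValue(const Row& row) {
  std::uint64_t value = 0;
  for (int i = 7; i >= 0; --i) {
    value = (value << 8) | row[8 + i];
  }
  return value;
}
```

For example, the `commit` row (`00 a0 9d 01 00 00 00 00`) decodes to 0x019da000.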