[docs] Add oom.md, documenting out of memory debugging

Change-Id: Ic8f0aeb7d2477538c887de147a540ba965e5f231
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/3515937
Reviewed-by: Bartek Nowierski <bartekn@chromium.org>
Reviewed-by: Takashi Sakamoto <tasak@google.com>
Commit-Queue: Benoit Lize <lizeb@chromium.org>
Cr-Commit-Position: refs/heads/main@{#984737}

Committed by: Chromium LUCI CQ
Parent: a5c1780ab4
Commit: 72989a31d2

docs/memory/oom.md (new file, 151 lines)

# Investigating Out of Memory crashes

A large fraction of process crashes in Chromium are due to Out Of Memory (OOM)
conditions. This page is meant to help Chromium developers understand OOM stack
traces and investigate their causes. Note that some of the documentation here
applies only to Google Chrome, as it is specific to the way Google's crash
reporting infrastructure aggregates and reports crashes.

Some of the following also assumes that the `malloc()` implementation is
PartitionAlloc, which, as of 2022, is the case on most platforms.

[TOC]

## Identifying OOM crashes

When a process crashes due to an Out Of Memory condition, this is usually
signaled by the presence of `base::internal::OnNoMemoryInternal()` on the stack.

**Google Chrome only:** the crash reporting infrastructure tags these as "[Out of
Memory]" based on this and other function names. The full list is determined in
the (internal) crash server's code.

Since Chromium configures its memory allocators to crash rather than return
`nullptr`, an OOM crash can be triggered from anywhere in the code, most
commonly from within the allocator or from higher-level functions such as
`operator new` in C++.

## Distinguishing between underlying causes
### Different causes

A process can reach an OOM condition for several reasons:

* **The OS is truly out of memory**, regardless of how much memory the *current*
  process is using.
* **Some limit inside the OS is reached**. For instance, on Windows, there
  exists a global "commit limit", which is the amount of memory that the system
  can commit. Note that it is possible to commit more memory than what is
  actually in use. This may also happen on Linux systems configured with no or
  limited "overcommit", though the majority of systems don't have a limit.
* **Virtual address space exhaustion**. This is most likely to happen for
  relatively large allocations on 32-bit systems, where total addressable space
  is typically 2GiB (most Windows systems), 3GiB (e.g. some Windows
  configurations, Linux) or 4GiB (e.g. WoW64). However, it may also happen on
  64-bit systems, due to either:
  * Limited virtual addressable space in the CPU/OS. For instance, most Android
    ARM64 systems have only 40 bits of address space as of 2022.
  * "Cage" exhaustion. This is most likely to happen with PartitionAlloc on
    64-bit systems, where all allocations are grouped into a single contiguous
    virtual address space "cage".
* **Sandbox per-process memory limit**. For some process types (e.g. Renderers)
  and on most platforms, the sandbox enforces a maximum per-process memory
  limit. Given that this limit is typically set at the OS level, it may not be
  distinguishable from e.g. commit limit exhaustion.
* **Excessive allocation size**. Some allocators (notably PartitionAlloc)
  purposely limit the maximum allocation size.

### Identifying the cause

In the case of PartitionAlloc, it is possible to distinguish some of these cases:

* **Virtual address space exhaustion**. This is identified by the presence of
  `PartitionOutOfMemoryMappingFailure()` on the stack. It means that the
  allocator was unable to find enough address space, either for its internal
  memory allocation unit size, or for the requested size. Since memory is *not*
  committed at this step, this signals an address space issue.
* **Commit failure**. This is identified by the presence of
  `PartitionOutOfMemoryCommitFailure()` on the stack. This signals that either
  the OS or the sandbox limit has been reached.
* **Excessive allocation size**. This is identified by the presence of
  `PartitionExcessiveAllocationSize()` on the stack.

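The rules above can be sketched as a small triage helper. This is only an illustration, not part of Chromium or of the crash server: `classify_oom()` and the cause strings are hypothetical names, and the matching is a plain substring search over symbolized frame strings, using only the function names mentioned on this page.

```python
# Hypothetical helper: maps the PartitionAlloc function names mentioned in
# this document to the cause they signal.
CAUSE_BY_FUNCTION = {
    "PartitionOutOfMemoryMappingFailure": "virtual address space exhaustion",
    "PartitionOutOfMemoryCommitFailure": "commit failure (OS or sandbox limit)",
    "PartitionExcessiveAllocationSize": "excessive allocation size",
}
# Function whose presence usually signals an OOM crash.
OOM_MARKER = "base::internal::OnNoMemoryInternal"


def classify_oom(frames):
    """Return a coarse cause for a list of symbolized stack frame strings."""
    for frame in frames:
        for function, cause in CAUSE_BY_FUNCTION.items():
            if function in frame:
                return cause
    if any(OOM_MARKER in frame for frame in frames):
        return "OOM, cause not identifiable from the stack"
    return "no OOM marker found"


if __name__ == "__main__":
    stack = [
        "base::internal::OnNoMemoryInternal(unsigned long)",
        "PartitionOutOfMemoryCommitFailure()",
    ]
    print(classify_oom(stack))
```

A stack that contains only the generic marker falls into the "not identifiable" bucket, which matches the fact that only some of the causes can be told apart.
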
## What to do?

### Commit Limit Reached

The process is "truly" out of memory, or the system is. Some number of these
crashes is expected, and the crashing location is not necessarily the
culprit. Indeed, as a rough approximation, the failing allocation is more likely
to come from a component that naturally allocates a lot of memory, e.g. V8 or
rendering.

However, if there is a spike, and many stack traces come from an unusual
location (e.g. newly added code), this may signal a memory leak in the component
on the stack, or excessive temporary allocations.

Also, if `PartitionAllocDirectMap()` is on the stack, the memory allocation was
large. It may come from a large buffer, and may be made worse by buffer
resizing. For instance, a `std::vector` often doubles its capacity when it runs
out of space, in which case `reserve()`-ing the right size ahead of time may
help.

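To see why resizing makes large buffers worse, here is a rough arithmetic model. It assumes a growth factor of 2 (the actual factor is implementation-defined) and that the old and new buffers are both alive while elements are copied. The function names are hypothetical and this is not how any particular `std::vector` is implemented.

```python
def peak_bytes_when_doubling(final_elements, element_size, initial_capacity=1):
    """Peak bytes held while growing a buffer by doubling (simplified model)."""
    capacity = initial_capacity
    peak = capacity * element_size
    while capacity < final_elements:
        new_capacity = capacity * 2
        # During a resize, the old and the new buffer coexist.
        peak = max(peak, (capacity + new_capacity) * element_size)
        capacity = new_capacity
    return peak


def peak_bytes_when_reserved(final_elements, element_size):
    """Peak bytes when the exact capacity is reserved up front."""
    return final_elements * element_size


if __name__ == "__main__":
    n = 2**20 + 1  # One element more than a power of two: the worst case.
    print(peak_bytes_when_doubling(n, 8))  # 25165824 bytes (24 MiB)
    print(peak_bytes_when_reserved(n, 8))  # 8388616 bytes (about 8 MiB)
```

In this worst case the transient peak is about three times the memory actually needed, which is the kind of overhead that reserving up front avoids.
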
### Excessive allocation size

Is the calling code expected to allocate more than 2GiB? Or is it an underflow
somewhere in the calling code?

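As an illustration of the underflow case, the sketch below mimics unsigned subtraction in Python. It assumes a 64-bit `size_t`; `unsigned_subtract()` and `checked_subtract()` are hypothetical names used only for this example.

```python
SIZE_MAX = 2**64 - 1  # Assumes a 64-bit size_t.
TWO_GIB = 2 * 1024**3  # The 2GiB figure mentioned above.


def unsigned_subtract(a, b):
    """Mimic `a - b` on unsigned 64-bit integers, which wraps around."""
    return (a - b) & SIZE_MAX


def checked_subtract(a, b):
    """Return a - b, or None when the unsigned subtraction would underflow."""
    return a - b if a >= b else None


if __name__ == "__main__":
    buffer_end, cursor = 100, 108  # The cursor has moved past the end.
    size = unsigned_subtract(buffer_end, cursor)
    print(size > TWO_GIB)  # True: a tiny negative difference becomes huge.
    print(checked_subtract(buffer_end, cursor))  # None
```

A requested size very close to the maximum value of the size type is therefore a strong hint of an underflow rather than a genuine large allocation.
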
### Virtual address space

On 32-bit systems, this is most likely to occur when overall memory usage is
high, or when the requested allocation size is large. Is the calling code
allocating a very large buffer?

## Debugging

### General

On Windows, the allocation size is added to the exception record. In Google
Chrome's crash dashboard, this is shown in "Parameter[0]" of the exception
info. On other operating systems, the allocation size is put on the stack before
crashing, and is thus visible in minidumps.

### PartitionAlloc and Google specific

1. Starting from a specific report, click on the bug icon to start a cloud lldb
   instance.
2. Locate the `PartitionRoot<true>::OutOfMemory()` frame on the stack, and move
   to it with `f <frame_number>` (`f 5` in the example below).
3. Locate the stack addresses by printing registers with `re re`.
4. Show the stack content with `x <stack_pointer> <frame_pointer>`.

Below is an example for a crash on x86_64:

```
( lizeb ) bt
* thread #1, stop reason = EXC_BREAKPOINT (code=EXC_I386_BPT, subcode=0x10c45912f)
* frame #0: 0x000000010c45912f Google Chrome Framework`base::internal::OnNoMemoryInternal(unsigned long) at memory.cc:62
frame #1: 0x000000010c459149 Google Chrome Framework`base::TerminateBecauseOutOfMemory(unsigned long) at memory.cc:69
frame #2: 0x000000010c4f39c6 Google Chrome Framework`OnNoMemory(unsigned long) at oom.cc:17
frame #3: 0x000000010d7e5794 Google Chrome Framework`WTF::PartitionsOutOfMemoryUsing2G(unsigned long) at partitions.cc:281
frame #4: 0x000000010d7e4d2c Google Chrome Framework`WTF::Partitions::HandleOutOfMemory(unsigned long) at partitions.cc:415
frame #5: 0x000000010c4f7474 Google Chrome Framework`base::PartitionRoot<true>::OutOfMemory(unsigned long) at partition_root.cc:521
[...]
( lizeb ) f 5
frame #5: 0x000000010c4f7474 Google Chrome Framework`base::PartitionRoot<true>::OutOfMemory(unsigned long) at partition_root.cc:521
( lizeb ) re re
General Purpose Registers:
rbp = 0x00007ffee7012c50
rsp = 0x00007ffee7012bf0
rip = 0x000000010c4f7474 Google Chrome Framework`base::PartitionRoot<true>::OutOfMemory(unsigned long) + 196 at partition_root.cc:522
21 registers were unavailable.
( lizeb ) x 0x00007ffee7012bf0 0x00007ffee7012c50
0x7ffee7012bf0: 76 61 5f 73 69 7a 65 00 00 00 00 07 00 00 00 00  va_size.........
0x7ffee7012c00: 61 6c 6c 6f 63 00 20 20 00 2d 2d 01 00 00 00 00  alloc.  .--.....
0x7ffee7012c10: 63 6f 6d 6d 69 74 00 20 00 a0 9d 01 00 00 00 00  commit. ........
0x7ffee7012c20: 73 69 7a 65 00 20 20 20 00 00 20 00 00 00 00 00  size.   .. .....
0x7ffee7012c30: aa aa aa aa aa aa aa aa 00 18 b0 12 01 00 00 00  ................
0x7ffee7012c40: 00 00 20 00 00 00 00 00 48 22 b0 12 01 00 00 00  .. .....H"......
```

The results here can help the PartitionAlloc team identify issues, as
important metrics from PartitionAlloc are saved on the stack, as shown above.
For instance, virtual address space usage is stored after the `va_size` label:
read as a little-endian integer, the bytes `00 00 00 07 00 00 00 00` give
0x07000000.
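
The labeled rows of the dump can also be decoded mechanically. The sketch below assumes that each of the first four rows holds an 8-byte NUL-terminated ASCII label followed by a little-endian 64-bit value. This layout is inferred from the dump itself rather than from the PartitionAlloc sources, and `decode_rows()` is a hypothetical helper.

```python
import struct

# Bytes copied from the first four rows of the memory dump above.
DUMP_HEX = """
76 61 5f 73 69 7a 65 00 00 00 00 07 00 00 00 00
61 6c 6c 6f 63 00 20 20 00 2d 2d 01 00 00 00 00
63 6f 6d 6d 69 74 00 20 00 a0 9d 01 00 00 00 00
73 69 7a 65 00 20 20 20 00 00 20 00 00 00 00 00
"""


def decode_rows(dump_hex):
    """Decode rows made of an 8-byte label and a little-endian uint64."""
    values = {}
    for line in dump_hex.strip().splitlines():
        row = bytes.fromhex(line)
        # The label is NUL-terminated; bytes after the NUL are padding.
        label = row[:8].split(b"\x00")[0].decode("ascii")
        (value,) = struct.unpack("<Q", row[8:16])
        values[label] = value
    return values


if __name__ == "__main__":
    for label, value in decode_rows(DUMP_HEX).items():
        print(f"{label}: {value:#x}")
```

The last two rows of the dump carry no label, so they are left out of this sketch.
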