Fall 2024 – Lecture #06: Buffer Pools
1 Introduction
This semester, our focus will be on disk-oriented database management systems. A disk-oriented architecture means
that the DBMS’s primary storage location is in persistent storage, like a hard drive (HDD) or flash storage (SSDs).
This is different from an in-memory DBMS, where data is stored in volatile memory.
In the Von Neumann architecture, data must be in memory before we can operate on it (a few exceptions apply for
running operations directly on persistent storage). Any DBMS must be able to efficiently move data back and forth
between disk and memory if it wants to operate on large amounts of data. The DBMS can achieve this with a Buffer
Pool Manager. A diagram of this interaction is shown in Figure 1.
At a high level, the buffer pool manager is responsible for moving physical pages of data back and forth from buffers
in main memory to persistent storage. It also behaves as a cache, keeping frequently used pages in memory for faster
access, and evicting unused or cold pages back out to storage.
From the perspective of the rest of the DBMS, it should “appear” as if the entire database resides in memory, even though the database may occupy more space than the available memory on the system. The rest of the system should not have to worry about how data is fetched into memory or how it is managed; it only requires valid pointers to memory locations to perform its operations.
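To make this concrete, here is a minimal sketch (in C++) of the interface a buffer pool manager might expose. The names (FetchPage, UnpinPage, page_id_t) are illustrative, not taken from any particular system.

#include <cstdint>

using page_id_t = int32_t;   // hypothetical identifier for a logical page

struct Page {
  char data[4096];           // fixed-size page contents
};

// Illustrative interface: the DBMS asks for a page by ID and receives a valid
// in-memory pointer, regardless of whether the page was already resident.
class BufferPoolManager {
 public:
  // Pin the page in memory (fetching it from disk if needed) and return a pointer.
  virtual Page *FetchPage(page_id_t page_id) = 0;
  // Release a previously fetched page; is_dirty records whether it was modified.
  virtual void UnpinPage(page_id_t page_id, bool is_dirty) = 0;
  virtual ~BufferPoolManager() = default;
};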
There are two main things that we will try to optimize for:
1. Spatial Control, which refers to where pages are physically located on disk. The goal of spatial control is to
keep pages that are used together often as physically close together as possible on disk. This can potentially
help with prefetching and other optimizations.
2. Temporal Control, which refers to when pages have been brought into memory and when they should be
written back out to disk. Temporal control aims to minimize the number of stalls from having to read data
from disk.
2 Buffer Pool
The buffer pool is an in-memory cache of pages between memory and disk. It is essentially a large memory region
allocated inside of the database to temporarily store pages. It is organized as an array of fixed-size frames. When
the DBMS requests a page, the buffer pool manager first checks if the page is already stored in a frame of memory,
and if it is not found, the page is read / copied into a free frame from disk. We consider the buffer pool manager
as a write-back cache, where dirty pages are buffered and not written to disk immediately on mutation. This is in
contrast to a write-through cache, where any changes are immediately propagated to disk.
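A minimal sketch of that lookup path follows, with illustrative names; real code would also evict a page when no frame is free and latch the page table against concurrent access.

#include <array>
#include <cstdint>
#include <numeric>
#include <unordered_map>
#include <vector>

using page_id_t = int32_t;
using frame_id_t = int32_t;
constexpr size_t PAGE_SIZE = 4096;
constexpr frame_id_t NUM_FRAMES = 64;                         // illustrative pool size

std::vector<std::array<char, PAGE_SIZE>> frames(NUM_FRAMES);  // fixed-size frames
std::unordered_map<page_id_t, frame_id_t> page_table;         // resident pages
std::vector<frame_id_t> free_list = [] {
  std::vector<frame_id_t> v(NUM_FRAMES);
  std::iota(v.begin(), v.end(), 0);                           // initially all frames free
  return v;
}();

void ReadFromDisk(page_id_t, char *) { /* stub: issue the actual disk read */ }

char *FetchPage(page_id_t pid) {
  if (auto it = page_table.find(pid); it != page_table.end()) {
    return frames[it->second].data();                         // hit: already in a frame
  }
  if (free_list.empty()) return nullptr;                      // a real BPM evicts here
  frame_id_t fid = free_list.back();                          // miss: take a free frame
  free_list.pop_back();
  ReadFromDisk(pid, frames[fid].data());                      // copy page in from disk
  page_table[pid] = fid;
  return frames[fid].data();
}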
The DBMS needs this buffer pool memory for many different things (just as most computer programs need memory
for different types of data structures):
• Tuple Storage and Indexes
• Sorting and Join Buffers
• Query and Dictionary Caches
• Maintenance and Log Buffers
Depending on the implementation, the things listed above do not always have to be backed by disk / the buffer pool
manager, and can simply rely on memory allocators like malloc.
A page directory is also maintained on disk: the mapping from page IDs to page locations in the database files. All changes to the page directory must be recorded on disk so that the DBMS can find its pages on restart. It is often (but not always) kept in memory to minimize page-access latency, since it has to be consulted before accessing any physical page.
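As a rough sketch (with made-up structure), the page directory is just a persistent map from page IDs to byte offsets in the database files, consulted before any physical page read:

#include <cstdint>
#include <cstdio>
#include <unordered_map>

using page_id_t = int32_t;
constexpr size_t PAGE_SIZE = 4096;

// Hypothetical in-memory copy of the page directory. The authoritative copy
// lives on disk so that the DBMS can rebuild this map after a restart.
std::unordered_map<page_id_t, long> page_directory;  // page ID -> file offset

// Locate and read one page from the database file via the directory.
bool ReadPage(std::FILE *db_file, page_id_t pid, char *out) {
  auto it = page_directory.find(pid);
  if (it == page_directory.end()) return false;        // unknown page ID
  if (std::fseek(db_file, it->second, SEEK_SET) != 0) return false;
  return std::fread(out, PAGE_SIZE, 1, db_file) == 1;  // copy into a frame
}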
See Figure 2 for a diagram of the buffer pool’s memory organization.
In addition to mapping page IDs to frames, the page table maintains additional meta-data per page: a dirty flag and a pin / reference counter.
The dirty flag is set by a thread whenever it modifies a page. This indicates to the storage manager that the page
must be written back to disk before eviction.
The pin / reference counter tracks the number of threads that are currently accessing a page (either reading or modifying it). A thread has to increment the counter before it accesses the page. If a page’s pin count is greater than zero, then the storage manager is not allowed to evict that page from memory. Pinning does not prevent other transactions from accessing the page concurrently. If the buffer pool is full and there are no unpinned pages left to evict, an out-of-memory error is thrown.
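A minimal sketch of this per-page meta-data, using hypothetical names; a real implementation would also guard this state with latches:

#include <atomic>

// Hypothetical per-frame meta-data tracked by the page table.
struct FrameMeta {
  std::atomic<int> pin_count{0};      // threads currently using the page
  std::atomic<bool> is_dirty{false};  // page modified since it was read in
};

// A thread must pin a page before touching its contents...
void Pin(FrameMeta &f) { f.pin_count.fetch_add(1); }

// ...and unpin it when done, recording whether it made modifications.
void Unpin(FrameMeta &f, bool modified) {
  if (modified) f.is_dirty.store(true);
  f.pin_count.fetch_sub(1);
}

// The storage manager may only evict frames with no active users.
bool Evictable(const FrameMeta &f) { return f.pin_count.load() == 0; }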
CLOCK
The CLOCK policy is an approximation of LRU that does not need a separate timestamp per page. Instead, each page is given a reference bit; when a page is accessed, its bit is set to 1. (Note: some implementations use an actual reference counter that can exceed 1.)
To visualize this, organize the pages in a circular buffer with a “clock hand”. When an eviction is requested, sweep the hand and check whether a page’s bit is set to 1. If it is, set it to zero; if not, evict that page, bring the new page in its place, and move the hand forward. The clock hand’s position is remembered between evictions.
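A sketch of the sweep, assuming a fixed set of frames that are all evictable (a real replacer would also skip pinned frames):

#include <cstddef>
#include <vector>

// One reference bit per frame; set whenever the frame's page is accessed.
struct ClockFrame {
  bool ref_bit = false;
};

class ClockReplacer {
 public:
  explicit ClockReplacer(size_t n) : frames_(n) {}

  void RecordAccess(size_t i) { frames_[i].ref_bit = true; }

  // Sweep from where the hand last stopped: give each referenced frame a
  // "second chance" (clear its bit), and evict the first unreferenced frame.
  size_t Evict() {
    while (true) {
      ClockFrame &f = frames_[hand_];
      size_t victim = hand_;
      hand_ = (hand_ + 1) % frames_.size();  // hand position persists across calls
      if (f.ref_bit) {
        f.ref_bit = false;                   // second chance
      } else {
        return victim;                       // evict this frame
      }
    }
  }

 private:
  std::vector<ClockFrame> frames_;
  size_t hand_ = 0;
};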
Issues
There are a number of problems with LRU and CLOCK replacement policies.
LRU and CLOCK are both susceptible to sequential flooding, where the buffer pool’s contents are polluted by a sequential scan. Since a sequential scan reads many pages quickly, the buffer pool fills up and pages from other queries are evicted because they have older timestamps. In this scenario, the least recently used page is not the best one to evict; in fact, the most recently read page may be the one least likely to be needed again.
LRU also does not account for the frequency of accesses. For example, a page that is frequently accessed over time should not be evicted simply because it was not accessed recently.
Alternatives
There are three solutions to address the shortcomings of LRU and CLOCK policies.
One solution is LRU-K, which tracks the history of the last K references as timestamps and computes the interval between subsequent accesses. This history is used to predict the next time a page is going to be accessed. However, this has a higher storage overhead. Additionally, the DBMS needs to maintain an in-memory cache of recently evicted pages to prevent thrashing and to retain some history for them.
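A sketch of the bookkeeping, assuming K = 2 and a logical clock for timestamps; the eviction rule shown picks the page whose K-th most recent access is oldest (its “backward K-distance”), which is one common formulation:

#include <cstdint>
#include <deque>
#include <limits>
#include <unordered_map>

constexpr size_t K = 2;  // LRU-2
using page_id_t = int32_t;

std::unordered_map<page_id_t, std::deque<uint64_t>> history;  // last K access times
uint64_t logical_clock = 0;

void RecordAccess(page_id_t pid) {
  auto &h = history[pid];
  h.push_back(++logical_clock);
  if (h.size() > K) h.pop_front();  // keep only the K most recent references
}

// Evict the page whose K-th most recent access is oldest. Pages with fewer
// than K recorded accesses are treated as having infinite backward distance
// and are preferred as victims (tie-broken by earliest recorded access).
page_id_t PickVictim() {
  page_id_t victim = -1;
  uint64_t oldest = std::numeric_limits<uint64_t>::max();
  bool seen_infinite = false;
  for (const auto &[pid, h] : history) {
    bool infinite = h.size() < K;
    if (seen_infinite && !infinite) continue;  // finite can't beat infinite
    if ((infinite && !seen_infinite) || h.front() < oldest) {
      victim = pid;
      oldest = h.front();
      seen_infinite = seen_infinite || infinite;
    }
  }
  return victim;
}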
An approximation of LRU-2 used by MySQL maintains two linked lists and only evicts from the “old” list. When an accessed page is already in the old list, it is moved to the head of the “young” list; otherwise, it is inserted at the head of the old list.
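A sketch of that two-list scheme (a simplified take; InnoDB’s actual implementation adds timing rules this sketch omits):

#include <cstdint>
#include <list>
#include <unordered_map>

using page_id_t = int32_t;

std::list<page_id_t> young_list;             // hot pages, protected from eviction
std::list<page_id_t> old_list;               // new pages; victims come from its tail
std::unordered_map<page_id_t, bool> is_old;  // which list each page currently lives on

void Access(page_id_t pid) {
  auto it = is_old.find(pid);
  if (it == is_old.end()) {
    old_list.push_front(pid);                // first access: head of the old list
    is_old[pid] = true;
  } else if (it->second) {
    old_list.remove(pid);                    // re-accessed while old: promote
    young_list.push_front(pid);
    it->second = false;
  } else {
    young_list.remove(pid);                  // already young: refresh its position
    young_list.push_front(pid);
  }
}

page_id_t Evict() {                          // assumes old_list is non-empty
  page_id_t victim = old_list.back();
  old_list.pop_back();
  is_old.erase(victim);
  return victim;
}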
Another optimization is localization per query. The DBMS chooses which pages to evict on a per query / transaction
basis. This minimizes the pollution of the buffer pool from each query.
Lastly, priority hints allow transactions to tell the buffer pool whether a page is important or not, based on the context of each page during query execution.
Dirty Pages
Each frame keeps track of a dirty flag / bit that denotes whether its page has been modified by the DBMS. There are two cases when evicting a page. The fast path is when the page is not dirty: the buffer pool manager can simply drop it. The slow path is when the page is dirty: the buffer pool manager must write the changes back to disk to keep memory and disk synchronized.
One way to avoid paying the cost of writing out pages at eviction time is background writing. With background writing, the DBMS periodically walks through the page table and writes dirty pages to disk. Once a dirty page is safely written, the DBMS can either evict the page or simply unset its dirty flag.
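A sketch of both eviction paths plus a background writer loop, using hypothetical names; the flush-to-disk call is stubbed out:

#include <atomic>
#include <chrono>
#include <thread>
#include <vector>

struct Frame {
  bool dirty = false;
  // ... page contents and other meta-data ...
};

void WriteToDisk(Frame &) { /* stub: write the page back to its disk location */ }

// Eviction: clean pages take the fast path and are simply dropped; dirty
// pages take the slow path and must be flushed first.
void Evict(Frame &f) {
  if (f.dirty) {            // slow path
    WriteToDisk(f);
    f.dirty = false;
  }
  // fast path: nothing to do; the frame can be reused immediately
}

// Background writer: periodically walk the page table and flush dirty pages,
// so that later evictions are more likely to hit the fast path.
void BackgroundWriter(std::vector<Frame> &frames, std::atomic<bool> &running) {
  while (running.load()) {
    for (Frame &f : frames) {
      if (f.dirty) {
        WriteToDisk(f);
        f.dirty = false;    // page is clean again but stays resident
      }
    }
    std::this_thread::sleep_for(std::chrono::seconds(1));
  }
}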
Multiple Buffer Pools
The DBMS can maintain multiple buffer pools for different purposes (e.g., a per-database or per-page-type buffer pool). There are two common approaches for mapping pages to a specific buffer pool. Object IDs involve extending the record IDs to include an object identifier; a mapping from objects to specific buffer pools can then be maintained via these object IDs. This allows finer-grained control over buffer pool allocations but has a storage overhead. The other approach is hashing, where the DBMS hashes the page ID to select which buffer pool to access. This is a more general and uniform approach.
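A sketch of the hashing approach, which needs no extra mapping state (the pool count is made up for illustration):

#include <array>
#include <cstdint>
#include <functional>

using page_id_t = int32_t;
constexpr size_t NUM_POOLS = 4;  // illustrative

struct BufferPool { /* frames, page table, replacement policy ... */ };

std::array<BufferPool, NUM_POOLS> pools;

// Hash the page ID to pick which buffer pool manages the page. Stateless and
// uniform, unlike the object-ID approach, which stores an explicit mapping.
BufferPool &PoolFor(page_id_t pid) {
  return pools[std::hash<page_id_t>{}(pid) % NUM_POOLS];
}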
Pre-fetching
The DBMS can also be optimized by pre-fetching pages based on the query plan. While the first set of pages is being processed, the second set can be pre-fetched into the buffer pool. This method is commonly used by DBMSs when accessing many pages sequentially, as in a sequential scan. It is also possible for a buffer pool manager to prefetch leaf pages in a tree index data structure, which benefits index scans. Note that the prefetched page does not necessarily need to be the next physical page on disk, but rather the next logical page in the leaf scan.
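A sketch of one-page lookahead during a sequential scan, with a hypothetical asynchronous read helper; for an index scan, the page IDs would instead come from following sibling pointers along the leaf level:

#include <cstdint>
#include <future>
#include <vector>

using page_id_t = int32_t;
struct Page { char data[4096]; };

// Hypothetical helper: schedule a disk read and return a future for the page.
std::future<Page> ReadPageAsync(page_id_t) {
  return std::async(std::launch::async, [] { return Page{}; });  // stub
}

void ProcessPage(const Page &) { /* consume the tuples on the page */ }

// While page i is being processed, page i + 1 is already being fetched.
void Scan(const std::vector<page_id_t> &page_ids) {
  if (page_ids.empty()) return;
  std::future<Page> next = ReadPageAsync(page_ids[0]);
  for (size_t i = 0; i < page_ids.size(); ++i) {
    Page current = next.get();                // wait only if not yet loaded
    if (i + 1 < page_ids.size()) {
      next = ReadPageAsync(page_ids[i + 1]);  // kick off the prefetch
    }
    ProcessPage(current);
  }
}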
7 Conclusion
The DBMS can almost always manage memory better than the OS. It can leverage its knowledge of the query plan’s semantics to make better decisions about:
• Evictions
• Allocations
• Pre-fetching
To reiterate, the Operating System is not your friend.