
Comparing changes

base repository: postgresql-cfbot/postgresql
base: cf/5483~1
head repository: postgresql-cfbot/postgresql
compare: cf/5483
  • 13 commits
  • 29 files changed
  • 2 contributors

Commits on Dec 13, 2025

  1. bufmgr: Optimize LockBufHdr() by delaying spin-delay setup

    Previously we always initialized the SpinDelayStatus. That is sufficiently
    expensive, and buffer header lock acquisitions are sufficiently frequent,
    that it is worthwhile to instead have a fastpath that does not initialize
    the SpinDelayStatus.
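
    A minimal sketch of the shape of this optimization, using existing bufmgr
    atomics (BM_LOCKED, pg_atomic_fetch_or_u32()); the out-of-line helper name
    is an assumption, not the patch's actual code:

    ```c
    #include "storage/buf_internals.h"	/* BufferDesc, BM_LOCKED */

    /* hypothetical out-of-line slowpath that sets up the SpinDelayStatus */
    extern uint32 LockBufHdrSlow(BufferDesc *desc);

    static inline uint32
    LockBufHdrSketch(BufferDesc *desc)
    {
        uint32		old_buf_state;

        /* fastpath: a single atomic attempt, no SpinDelayStatus initialized */
        old_buf_state = pg_atomic_fetch_or_u32(&desc->state, BM_LOCKED);
        if (likely(!(old_buf_state & BM_LOCKED)))
            return old_buf_state | BM_LOCKED;

        /* contended: only now pay for the spin-delay setup */
        return LockBufHdrSlow(desc);
    }
    ```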
    
    While this is a small gain on its own, it mainly is aimed at preventing a
    regression after a future commit, which requires additional locking to set
    hint bits.
    
    Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff
    anarazel authored and Commitfest Bot committed Dec 13, 2025
    Commit: 7abb3da
  2. bufmgr: Separate keys for private refcount infrastructure

    This makes lookups faster by allowing them to be auto-vectorized. It is
    also beneficial for an upcoming patch, independent of auto-vectorization:
    that patch wants to track more information for each pinned buffer, which
    would make the existing loop over an array of PrivateRefCountEntry more
    expensive due to the increased entry size.
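
    A hedged sketch of the separate-keys layout; the entry struct and the
    8-entry array mirror bufmgr.c, while the lookup helper's name is an
    assumption:

    ```c
    #include "storage/buf.h"		/* Buffer */

    typedef struct PrivateRefCountEntry
    {
        Buffer		buffer;
        int32		refcount;
    } PrivateRefCountEntry;

    #define REFCOUNT_ARRAY_ENTRIES 8

    /*
     * With the keys in their own dense array, the search is a simple compare
     * loop over 8 * sizeof(Buffer) bytes, which compilers can auto-vectorize.
     */
    static Buffer PrivateRefCountKeys[REFCOUNT_ARRAY_ENTRIES];
    static PrivateRefCountEntry PrivateRefCountArray[REFCOUNT_ARRAY_ENTRIES];

    static PrivateRefCountEntry *
    LookupPrivateRefCountArray(Buffer buffer)
    {
        for (int i = 0; i < REFCOUNT_ARRAY_ENTRIES; i++)
        {
            if (PrivateRefCountKeys[i] == buffer)
                return &PrivateRefCountArray[i];
        }
        return NULL;			/* caller falls back to the hash table */
    }
    ```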
    
    Reviewed-by: Melanie Plageman <[email protected]>
    Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff
    anarazel authored and Commitfest Bot committed Dec 13, 2025
    Commit: 6f0f84f
  3. bufmgr: Add one-entry cache for private refcount

    The private refcount entry for a buffer is often looked up repeatedly for the
    same buffer, e.g. to pin and then unpin a buffer. Benchmarking shows that it's
    worthwhile to have a one-entry cache for that case. With that cache in place,
    it's worth splitting GetPrivateRefCountEntry() into a small inline
    portion (for the cache hit case) and an out-of-line helper for the rest.
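
    A hedged sketch of that split; the cache variable and the out-of-line
    helper's name are assumptions:

    ```c
    static PrivateRefCountEntry *PrivateRefCountLast = NULL;

    /* out-of-line: array/hash lookup plus cache maintenance */
    extern PrivateRefCountEntry *GetPrivateRefCountEntrySlow(Buffer buffer);

    static inline PrivateRefCountEntry *
    GetPrivateRefCountEntrySketch(Buffer buffer)
    {
        /* small inline portion: just the one-entry cache check */
        if (PrivateRefCountLast && PrivateRefCountLast->buffer == buffer)
            return PrivateRefCountLast;

        return GetPrivateRefCountEntrySlow(buffer);
    }
    ```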
    
    This is helpful for some workloads today, but becomes more important in an
    upcoming patch that will utilize the private refcount infrastructure to also
    store whether the buffer is currently locked, as that increases the rate of
    lookups substantially.
    
    Reviewed-by: Melanie Plageman <[email protected]>
    Discussion: https://postgr.es/m/6rgb2nvhyvnszz4ul3wfzlf5rheb2kkwrglthnna7qhe24onwr@vw27225tkyar
    anarazel authored and Commitfest Bot committed Dec 13, 2025
    Commit: 64c4557
  4. freespace: Don't modify page without any lock

    Before this commit, fsm_vacuum_page() modified the page without holding
    any lock on it. Historically that was more or less OK, as we did not rely
    on the free space map staying consistent and we did not have checksums.
    But these days pages are checksummed, and there are ways for FSM pages to
    be included in WAL records, even though the FSM itself is still not
    WAL-logged. If an FSM page were ever modified while a WAL record
    referenced it, we would be in trouble, as the WAL CRC could end up
    corrupted.
    
    The reason to address this right now is a series of patches whose goal is
    to only allow modifications of pages while holding an appropriate lock
    level. Obviously not holding any lock is not appropriate :)
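
    A hedged sketch of the shape of such a fix, using standard bufmgr calls;
    the exact placement inside fsm_vacuum_page() may differ:

    ```c
    /* take a proper lock before recomputing values stored on the FSM page */
    LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);

    /* ... modify the page contents as before ... */

    /* FSM contents are non-critical, hint-style data */
    MarkBufferDirtyHint(buf, false);
    LockBuffer(buf, BUFFER_LOCK_UNLOCK);
    ```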
    
    Discussion: https://postgr.es/m/4wggb7purufpto6x35fd2kwhasehnzfdy3zdcu47qryubs2hdz@fa5kannykekr
    Discussion: https://postgr.es/m/[email protected]
    anarazel authored and Commitfest Bot committed Dec 13, 2025
    Commit: 2ea00d8
  5. heapam: Move logic to handle HEAP_MOVED into a helper function

    Previously we dealt with this in six near-identical copies and one very
    similar one.
    
    The helper function errors out when encountering a
    HEAP_MOVED_IN/HEAP_MOVED_OFF tuple with xvac considered current or
    in-progress. It would be preferable to make that change separately, but
    otherwise it would not be possible to deduplicate the handling in
    HeapTupleSatisfiesVacuum().
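
    A hedged sketch of what such a helper could look like; the name, the
    signature, and the exact hint bits set are assumptions, only the
    error-on-live-xvac behavior is from the description above:

    ```c
    static void
    HeapTupleCleanMovedSketch(HeapTupleHeader tuple, Buffer buffer)
    {
        TransactionId xvac;

        if (!(tuple->t_infomask & HEAP_MOVED))
            return;

        xvac = HeapTupleHeaderGetXvac(tuple);

        /* pre-9.0 VACUUM FULL is long gone, so a live xvac is corruption */
        if (TransactionIdIsCurrentTransactionId(xvac) ||
            TransactionIdIsInProgress(xvac))
            elog(ERROR, "encountered tuple with in-progress xvac %u", xvac);

        if (tuple->t_infomask & HEAP_MOVED_OFF)
        {
            /* moved off by a committed xvac: the tuple is dead here */
            if (TransactionIdDidCommit(xvac))
                HeapTupleSetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
                                     InvalidTransactionId);
            else
                HeapTupleSetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
                                     InvalidTransactionId);
        }
        else					/* HEAP_MOVED_IN */
        {
            if (TransactionIdDidCommit(xvac))
                HeapTupleSetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
                                     InvalidTransactionId);
            else
                HeapTupleSetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
                                     InvalidTransactionId);
        }
    }
    ```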
    
    Author:
    Reviewed-by:
    Discussion: https://postgr.es/m/
    Backpatch:
    anarazel authored and Commitfest Bot committed Dec 13, 2025
    Commit: 59fe2bc
  6. heapam: Use exclusive lock on old page in CLUSTER

    To be able to guarantee that we can set the hint bit, acquire an exclusive
    lock on the old buffer. We need the hint bits to be set, as otherwise
    reform_and_rewrite_tuple() -> rewrite_heap_tuple() -> heap_freeze_tuple()
    will get confused.
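
    At its core this is a one-line lock-level change where the old page is
    read (a sketch; the actual call site lives in heapam's cluster path):

    ```c
    /* was: LockBuffer(buf, BUFFER_LOCK_SHARE); */
    LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);	/* guarantees hint bits can be set */
    ```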
    
    It'd be better if we somehow could avoid setting hint bits on the old
    page. A common reason to use VACUUM FULL is very bloated tables -
    rewriting most of the old table during VACUUM FULL doesn't exactly help.
    
    Author:
    Reviewed-by:
    Discussion: https://postgr.es/m/
    Backpatch:
    anarazel authored and Commitfest Bot committed Dec 13, 2025
    Commit: dc8b0a0
  7. heapam: Add batch mode mvcc check and use it in page mode

    There are two reasons for doing so:
    
    1) It is generally faster to perform checks in a batched fashion, and
       making sequential scans faster is nice.
    
    2) We would like to stop setting hint bits while pages are being written
       out. The necessary locking becomes visible for page mode scans if done for
       every tuple. With batching the overhead can be amortized to only happen
       once per page.
    
    There are substantial further optimization opportunities along these
    lines:
    
    - Right now HeapTupleSatisfiesMVCCBatch() simply uses the single-tuple
      HeapTupleSatisfiesMVCC(), relying on the compiler to inline it. We could
      instead write an explicitly optimized version that avoids repeated xid
      tests.
    
    - Introduce a batched version of the serializability test
    
    - Introduce a batched version of HeapTupleSatisfiesVacuum()
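
    A hedged sketch of the batch entry point; the signature and the output
    convention (offsets of the visible tuples) are assumptions:

    ```c
    static int
    HeapTupleSatisfiesMVCCBatch(Snapshot snapshot, Buffer buffer,
                                int ntuples, HeapTupleData *tuples,
                                OffsetNumber *vistuples_out)
    {
        int			nvis = 0;

        for (int i = 0; i < ntuples; i++)
        {
            /* relies on the compiler inlining the single-tuple check */
            if (HeapTupleSatisfiesMVCC(&tuples[i], snapshot, buffer))
                vistuples_out[nvis++] =
                    ItemPointerGetOffsetNumber(&tuples[i].t_self);
        }

        return nvis;
    }
    ```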
    
    Author:
    Reviewed-by:
    Discussion: https://postgr.es/m/
    Backpatch:
    anarazel authored and Commitfest Bot committed Dec 13, 2025
    Commit: bb28cf9
  8. bufmgr: Change BufferDesc.state to be a 64-bit atomic

    This is motivated by wanting to merge buffer content locks into
    BufferDesc.state in a future commit, rather than having a separate lwlock (see
    commit c75ebc6 for more details). As this change is rather mechanical, it
    seems to make sense to split it out into a separate commit, for easier review.
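
    An abridged view of the struct after the change (unrelated fields
    omitted; comments are illustrative):

    ```c
    typedef struct BufferDesc
    {
        BufferTag	tag;			/* ID of page contained in buffer */
        int			buf_id;			/* buffer's index number (from 0) */

        /* state, containing flags, refcount and usagecount */
        pg_atomic_uint64 state;		/* widened from pg_atomic_uint32 */

        int			wait_backend_pgprocno;	/* backend waiting for pin */
        int			freeNext;		/* link in freelist chain */
    } BufferDesc;
    ```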
    
    Reviewed-by: Melanie Plageman <[email protected]>
    Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff
    anarazel authored and Commitfest Bot committed Dec 13, 2025
    Commit: 8aff002
  9. bufmgr: Implement buffer content locks independently of lwlocks

    Until now buffer content locks were implemented using lwlocks. That has the
    obvious advantage of not needing a separate efficient implementation of
    locks. However, the time for a dedicated buffer content lock implementation
    has come:
    
    1) Hint bits are currently set while holding only a share lock. This leads to
       having to copy pages while they are being written out if checksums are
       enabled, which is not cheap. We would like to add AIO writes; however,
       once many buffers can be written out at the same time, it gets a lot
       more expensive to copy them, particularly because the copies need to
       reside in shared buffers (for worker mode to have access to them).
    
       In addition, modifying buffers while they are being written out can cause
       issues with unbuffered/direct-IO, as some filesystems (like btrfs) do not
       like that, due to filesystem internal checksums getting corrupted.
    
       The solution to this is to require a new share-exclusive lock level for
       setting hint bits and for writing out buffers, making those operations
       mutually exclusive. We could introduce such a lock level into the
       generic lwlock implementation; however, it does not look like there
       would be other users, and it would add some overhead to important
       codepaths.
    
    2) For AIO writes we need to be able to race-freely check whether a buffer is
       undergoing IO and whether an exclusive lock on the page can be acquired. That
       is rather hard to do efficiently when the buffer state and the lock state
       are separate atomic variables. This is a major hindrance to allowing writes
       to be done asynchronously.
    
    3) Buffer locks are by far the most frequently taken locks. Optimizing them
       specifically for their use case is worth the effort. E.g. by merging
       content locks into buffer locks we will be able to release a buffer lock
       and pin in one atomic operation.
    
    4) There are more complicated optimizations, like long-lived "super pinned &
       locked" pages, that cannot realistically be implemented with the generic
       lwlock implementation.
    
    Therefore implement content locks inside bufmgr.c. The lock state is
    stored as part of BufferDesc.state. The implementation of buffer content
    locks is fairly similar to lwlocks, with a few important differences:
    
    1) An additional lock level, share-exclusive, has been added. This lock
       level conflicts with exclusive locks and with itself, but not with
       share locks (see the conflict-rule sketch after this list).
    
    2) Error recovery for content locks is implemented as part of the already
       existing private-refcount tracking mechanism, in combination with
       resowners, instead of a bespoke mechanism as is the case for lwlocks.
       This means we do not need to add dedicated error-recovery codepaths to
       release all content locks (as done with LWLockReleaseAll() for
       lwlocks).
    
    3) The lock state is embedded in BufferDesc.state instead of having its own
       struct.
    
    4) The wakeup logic is a tad more complicated due to needing to support
       the additional lock level.
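
    A hedged sketch of the conflict rules for the three lock levels; the enum
    and function names are assumptions, the matrix follows the text:

    ```c
    typedef enum BufferLockMode
    {
        BUFFER_LOCK_MODE_SHARE,
        BUFFER_LOCK_MODE_SHARE_EXCLUSIVE,	/* the new level */
        BUFFER_LOCK_MODE_EXCLUSIVE,
    } BufferLockMode;

    static bool
    BufferLockConflicts(BufferLockMode held, BufferLockMode requested)
    {
        /* exclusive conflicts with every lock level, including itself */
        if (held == BUFFER_LOCK_MODE_EXCLUSIVE ||
            requested == BUFFER_LOCK_MODE_EXCLUSIVE)
            return true;

        /* share-exclusive conflicts with itself, but not with plain share */
        return (held == BUFFER_LOCK_MODE_SHARE_EXCLUSIVE &&
                requested == BUFFER_LOCK_MODE_SHARE_EXCLUSIVE);
    }
    ```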
    
    This commit unfortunately introduces some code that is very similar to the
    code in lwlock.c, however the code is not equivalent enough to easily merge
    it. The future wins that this commit makes possible seem worth the cost.
    
    As of this commit nothing uses the new share-exclusive lock mode. It will be
    used in a future commit. It seemed too complicated to introduce the lock-level
    in a separate commit.
    
    TODO:
    - Address FIXMEs
    
    - Perhaps move the locking code into a buffer_locking.h or such? Needs to be
      inline functions for efficiency unfortunately.
    
    - reflow some comments that I didn't reflow to make the diff more readable
    
    Reviewed-by: Melanie Plageman <[email protected]>
    Reviewed-by: Greg Burd <[email protected]>
    Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff
    anarazel authored and Commitfest Bot committed Dec 13, 2025
    Commit: b012bd3
  10. Require share-exclusive lock to set hint bits and to flush

    At the moment hint bits can be set with just a share lock on a page (and in
    one place even without any lock). Because of this we need to copy pages while
    writing them out, as otherwise the checksum could be corrupted.
    
    The need to copy the page makes implementing AIO writes problematic:
    
    1) Instead of just needing a single buffer for a copied page we need one for
       each page that's potentially undergoing IO
    2) To be able to use the "worker" AIO implementation the copied page needs to
       reside in shared memory
    
    It also causes problems for using unbuffered/direct IO, independent of
    AIO: some filesystems, RAID implementations, etc. do not tolerate the
    data changing while it is being written out. E.g. they may compute
    internal checksums that can be invalidated by concurrent modifications,
    leading to filesystem errors (as is the case with btrfs).
    
    It is also just plain odd to allow modifications of buffers that are only
    share-locked.
    
    To address these issues, this commit changes the rules so that
    modifications to pages are no longer allowed while holding only a share
    lock. Instead, the new share-exclusive lock (introduced in FIXME XXXX
    TODO) allows at most one backend to modify a buffer while other backends
    hold share locks on the same page. An existing share lock can be upgraded
    to a share-exclusive lock if there are no conflicting locks. For that,
    BufferBeginSetHintBits()/BufferFinishSetHintBits() and
    BufferSetHintBits16() have been introduced.
    
    To prevent hint bits from being set while the buffer is being written out,
    writing out buffers now requires a share-exclusive lock.
    
    The use of share-exclusive to gate setting hint bits means that from now
    on only one backend can set hint bits at a time. Allowing multiple
    backends to set hint bits would require more complicated locking: we
    would need to store the count of backends currently setting hint bits,
    and we would need another lock level for I/O that conflicts with the lock
    level for setting hint bits. Given that the share-exclusive lock for
    setting hint bits is only held for a short time, that backends would
    often just set the same hint bits, and that the cost of occasionally not
    setting hint bits on hotly accessed pages is fairly low, this seems like
    an acceptable tradeoff.
    
    The biggest change needed to adapt to this is in heapam. To avoid
    performance regressions for sequential scans that need to set a lot of
    hint bits, we need to amortize the cost of BufferBeginSetHintBits() in
    cases where hint bits are set at a high frequency:
    HeapTupleSatisfiesMVCCBatch() uses the new SetHintBitsExt(), which defers
    BufferFinishSetHintBits() until all hint bits on a page have been set.
    Conversely, to avoid regressions in cases where we can't set hint bits in
    bulk (because we're looking only at individual tuples), use
    BufferSetHintBits16() when setting hint bits without batching.
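
    A hedged sketch of the amortization: acquire the right to set hint bits
    at most once per page, set many bits, release once. The loop body and the
    bool return of BufferBeginSetHintBits() are assumptions; only the
    Begin/Finish pairing and the deferral come from the description above:

    ```c
    static void
    PageModeSetHintBitsSketch(Buffer buffer, int ntuples,
                              HeapTupleData *tuples)
    {
        bool		tried = false;
        bool		can_set_hints = false;

        for (int i = 0; i < ntuples; i++)
        {
            HeapTupleHeader htup = tuples[i].t_data;

            /* try the share -> share-exclusive upgrade once per page */
            if (!tried)
            {
                can_set_hints = BufferBeginSetHintBits(buffer);
                tried = true;
            }

            /* grossly simplified stand-in for the real visibility logic */
            if (can_set_hints &&
                TransactionIdDidCommit(HeapTupleHeaderGetXmin(htup)))
                htup->t_infomask |= HEAP_XMIN_COMMITTED;
        }

        /* deferred until all hint bits on the page have been set */
        if (can_set_hints)
            BufferFinishSetHintBits(buffer);
    }
    ```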
    
    Several other places also need to be adapted, but those changes are
    comparatively simpler.
    
    After this we do not need to copy buffers to write them out anymore. That
    change is done separately however.
    
    TODO:
    - Update commit reference above
    - reflow parts of storage/buffer/README that I didn't reindent to make the
      diff more readable
    
    Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff
    Discussion: https://postgr.es/m/stj36ea6yyhoxtqkhpieia2z4krnam7qyetc57rfezgk4zgapf%40gcnactj4z56m
    anarazel authored and Commitfest Bot committed Dec 13, 2025
    Commit: f600135
  11. WIP: Make UnlockReleaseBuffer() more efficient

    Now that the buffer content lock is implemented as part of BufferDesc.state,
    releasing the lock and unpinning the buffer can be implemented as a single
    atomic operation.
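
    A hedged sketch of the combined release; BM_LOCK_EXCLUSIVE_ONE is a
    hypothetical encoding of one exclusive-lock holder, while
    BUF_REFCOUNT_ONE is the existing pin increment:

    ```c
    static inline void
    UnlockReleaseBufferSketch(BufferDesc *desc)
    {
        /* drop the content lock and one pin in a single atomic subtraction */
        pg_atomic_fetch_sub_u64(&desc->state,
                                BM_LOCK_EXCLUSIVE_ONE + BUF_REFCOUNT_ONE);

        /* a real implementation must also wake waiters when required */
    }
    ```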
    
    Author:
    Reviewed-By:
    Discussion: https://postgr.es/m/
    Backpatch:
    anarazel authored and Commitfest Bot committed Dec 13, 2025
    Commit: d0a6a48
  12. WIP: bufmgr: Don't copy pages while writing out

    After the preceding series of commits introducing and using
    BufferBeginSetHintBits()/BufferSetHintBits16(), hint bits are no longer
    set while IO is in progress. Therefore we no longer need to copy pages
    while they are being written out.
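
    A hedged sketch of the resulting simplification in the write-out path;
    the variable names are illustrative, not the actual FlushBuffer() diff:

    ```c
    /* before: bufToWrite = PageSetChecksumCopy((Page) bufBlock, blocknum); */

    /* hint bits can no longer change mid-write, so checksum in place */
    PageSetChecksumInplace((Page) bufBlock, blocknum);
    smgrwrite(reln, forknum, blocknum, bufBlock, false);
    ```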
    
    TODO: Update comments
    
    Author:
    Reviewed-by:
    Discussion: https://postgr.es/m/
    Backpatch:
    anarazel authored and Commitfest Bot committed Dec 13, 2025
    Commit: 3668689
  13. [CF 5483] v7 - Don't dirty pages while they are getting flushed out

    This branch was automatically generated by a robot using patches from an
    email thread registered at:
    
    https://commitfest.postgresql.org/patch/5483
    
    The branch will be overwritten each time a new patch version is posted to
    the thread, and also periodically to check for bitrot caused by changes
    on the master branch.
    
    Patch(es): https://www.postgresql.org/message-id/lneuyxqxamqoayd2ntau3lqjblzdckw6tjgeu4574ezwh4tzlg@noioxkquezdw
    Author(s): Andres Freund
    Commitfest Bot committed Dec 13, 2025
    Commit: 018ec7a