
Comparing changes

base repository: postgresql-cfbot/postgresql
base: cf/5483~1
head repository: postgresql-cfbot/postgresql
compare: cf/5483
  • 13 commits
  • 29 files changed
  • 2 contributors

Commits on Dec 13, 2025

  1. bufmgr: Optimize LockBufHdr() by delaying spin-delay setup

    Previously we always initialized the SpinDelayStatus. That is sufficiently
    expensive, and buffer header lock acquisitions are sufficiently frequent,
    that it is worthwhile to instead have a fastpath that does not initialize
    the SpinDelayStatus.
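
    A minimal sketch of the shape of this optimization, using existing bufmgr
    atomics (BM_LOCKED, pg_atomic_fetch_or_u32()); the out-of-line helper name
    is an assumption, not the patch's actual code:

    ```c
    #include "storage/buf_internals.h"	/* BufferDesc, BM_LOCKED */

    /* hypothetical out-of-line slowpath that sets up the SpinDelayStatus */
    extern uint32 LockBufHdrSlow(BufferDesc *desc);

    static inline uint32
    LockBufHdrSketch(BufferDesc *desc)
    {
        uint32		old_buf_state;

        /* fastpath: a single atomic attempt, no SpinDelayStatus initialized */
        old_buf_state = pg_atomic_fetch_or_u32(&desc->state, BM_LOCKED);
        if (likely(!(old_buf_state & BM_LOCKED)))
            return old_buf_state | BM_LOCKED;

        /* contended: only now pay for the spin-delay setup */
        return LockBufHdrSlow(desc);
    }
    ```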
    
    While this is a small gain on its own, it mainly is aimed at preventing a
    regression after a future commit, which requires additional locking to set
    hint bits.
    
    Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff
    anarazel authored and Commitfest Bot committed Dec 13, 2025
    Commit: 7abb3da
  2. bufmgr: Separate keys for private refcount infrastructure

    This makes lookups faster by allowing them to be auto-vectorized. It is
    also beneficial for an upcoming patch, independent of auto-vectorization:
    that patch wants to track more information for each pinned buffer, which
    would make the existing loop over an array of PrivateRefCountEntry more
    expensive due to the increased entry size.
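
    A hedged sketch of the separate-keys layout; the entry struct and the
    8-entry array mirror bufmgr.c, while the lookup helper's name is an
    assumption:

    ```c
    #include "storage/buf.h"		/* Buffer */

    typedef struct PrivateRefCountEntry
    {
        Buffer		buffer;
        int32		refcount;
    } PrivateRefCountEntry;

    #define REFCOUNT_ARRAY_ENTRIES 8

    /*
     * With the keys in their own dense array, the search is a simple compare
     * loop over 8 * sizeof(Buffer) bytes, which compilers can auto-vectorize.
     */
    static Buffer PrivateRefCountKeys[REFCOUNT_ARRAY_ENTRIES];
    static PrivateRefCountEntry PrivateRefCountArray[REFCOUNT_ARRAY_ENTRIES];

    static PrivateRefCountEntry *
    LookupPrivateRefCountArray(Buffer buffer)
    {
        for (int i = 0; i < REFCOUNT_ARRAY_ENTRIES; i++)
        {
            if (PrivateRefCountKeys[i] == buffer)
                return &PrivateRefCountArray[i];
        }
        return NULL;			/* caller falls back to the hash table */
    }
    ```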
    
    Reviewed-by: Melanie Plageman <[email protected]>
    Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff
    anarazel authored and Commitfest Bot committed Dec 13, 2025
    Commit: 6f0f84f
  3. bufmgr: Add one-entry cache for private refcount

    The private refcount entry for a buffer is often looked up repeatedly for the
    same buffer, e.g. to pin and then unpin a buffer. Benchmarking shows that it's
    worthwhile to have a one-entry cache for that case. With that cache in place,
    it's worth splitting GetPrivateRefCountEntry() into a small inline
    portion (for the cache hit case) and an out-of-line helper for the rest.
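
    A hedged sketch of that split; the cache variable and the out-of-line
    helper's name are assumptions:

    ```c
    static PrivateRefCountEntry *PrivateRefCountLast = NULL;

    /* out-of-line: array/hash lookup plus cache maintenance */
    extern PrivateRefCountEntry *GetPrivateRefCountEntrySlow(Buffer buffer);

    static inline PrivateRefCountEntry *
    GetPrivateRefCountEntrySketch(Buffer buffer)
    {
        /* small inline portion: just the one-entry cache check */
        if (PrivateRefCountLast && PrivateRefCountLast->buffer == buffer)
            return PrivateRefCountLast;

        return GetPrivateRefCountEntrySlow(buffer);
    }
    ```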
    
    This is helpful for some workloads today, but becomes more important in an
    upcoming patch that will utilize the private refcount infrastructure to also
    store whether the buffer is currently locked, as that increases the rate of
    lookups substantially.
    
    Reviewed-by: Melanie Plageman <[email protected]>
    Discussion: https://postgr.es/m/6rgb2nvhyvnszz4ul3wfzlf5rheb2kkwrglthnna7qhe24onwr@vw27225tkyar
    anarazel authored and Commitfest Bot committed Dec 13, 2025
    Commit: 64c4557
  4. freespace: Don't modify page without any lock

    Before this commit, fsm_vacuum_page() modified the page without holding
    any lock on it. Historically that was more or less OK, as we did not rely
    on the free space map staying consistent and we did not have checksums.
    But these days pages are checksummed, and there are ways for FSM pages to
    be included in WAL records, even though the FSM itself is still not
    WAL-logged. If an FSM page were ever modified while a WAL record
    referenced it, we would be in trouble, as the WAL CRC could end up
    corrupted.
    
    The reason to address this right now is a series of patches whose goal is
    to only allow modifications of pages while holding an appropriate lock
    level. Obviously not holding any lock is not appropriate :)
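
    A hedged sketch of the shape of such a fix, using standard bufmgr calls;
    the exact placement inside fsm_vacuum_page() may differ:

    ```c
    /* take a proper lock before recomputing values stored on the FSM page */
    LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);

    /* ... modify the page contents as before ... */

    /* FSM contents are non-critical, hint-style data */
    MarkBufferDirtyHint(buf, false);
    LockBuffer(buf, BUFFER_LOCK_UNLOCK);
    ```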
    
    Discussion: https://postgr.es/m/4wggb7purufpto6x35fd2kwhasehnzfdy3zdcu47qryubs2hdz@fa5kannykekr
    Discussion: https://postgr.es/m/[email protected]
    anarazel authored and Commitfest Bot committed Dec 13, 2025
    Commit: 2ea00d8
  5. heapam: Move logic to handle HEAP_MOVED into a helper function

    Previously we dealt with this in six near-identical copies and one very
    similar one.
    
    The helper function errors out when encountering a
    HEAP_MOVED_IN/HEAP_MOVED_OFF tuple with xvac considered current or
    in-progress. It would be preferable to make that change separately, but
    otherwise it would not be possible to deduplicate the handling in
    HeapTupleSatisfiesVacuum().
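
    A hedged sketch of what such a helper could look like; the name, the
    signature, and the exact hint bits set are assumptions, only the
    error-on-live-xvac behavior is from the description above:

    ```c
    static void
    HeapTupleCleanMovedSketch(HeapTupleHeader tuple, Buffer buffer)
    {
        TransactionId xvac;

        if (!(tuple->t_infomask & HEAP_MOVED))
            return;

        xvac = HeapTupleHeaderGetXvac(tuple);

        /* pre-9.0 VACUUM FULL is long gone, so a live xvac is corruption */
        if (TransactionIdIsCurrentTransactionId(xvac) ||
            TransactionIdIsInProgress(xvac))
            elog(ERROR, "encountered tuple with in-progress xvac %u", xvac);

        if (tuple->t_infomask & HEAP_MOVED_OFF)
        {
            /* moved off by a committed xvac: the tuple is dead here */
            if (TransactionIdDidCommit(xvac))
                HeapTupleSetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
                                     InvalidTransactionId);
            else
                HeapTupleSetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
                                     InvalidTransactionId);
        }
        else					/* HEAP_MOVED_IN */
        {
            if (TransactionIdDidCommit(xvac))
                HeapTupleSetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
                                     InvalidTransactionId);
            else
                HeapTupleSetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
                                     InvalidTransactionId);
        }
    }
    ```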
    
    Author:
    Reviewed-by:
    Discussion: https://postgr.es/m/
    Backpatch:
    anarazel authored and Commitfest Bot committed Dec 13, 2025
    Commit: 59fe2bc
  6. heapam: Use exclusive lock on old page in CLUSTER

    To be able to guarantee that we can set the hint bit, acquire an exclusive
    lock on the old buffer. We need the hint bits to be set, as otherwise
    reform_and_rewrite_tuple() -> rewrite_heap_tuple() -> heap_freeze_tuple()
    will get confused.
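
    At its core this is a one-line lock-level change where the old page is
    read (a sketch; the actual call site lives in heapam's cluster path):

    ```c
    /* was: LockBuffer(buf, BUFFER_LOCK_SHARE); */
    LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);	/* guarantees hint bits can be set */
    ```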
    
    It'd be better if we somehow could avoid setting hint bits on the old
    page. A common reason to use VACUUM FULL is very bloated tables -
    rewriting most of the old table during VACUUM FULL doesn't exactly help.
    
    Author:
    Reviewed-by:
    Discussion: https://postgr.es/m/
    Backpatch:
    anarazel authored and Commitfest Bot committed Dec 13, 2025
    Commit: dc8b0a0
  7. heapam: Add batch mode mvcc check and use it in page mode

    There are two reasons for doing so:
    
    1) It is generally faster to perform checks in a batched fashion, and
       making sequential scans faster is nice.
    
    2) We would like to stop setting hint bits while pages are being written
       out. The necessary locking becomes visible for page mode scans if done for
       every tuple. With batching the overhead can be amortized to only happen
       once per page.
    
    There are substantial further optimization opportunities along these
    lines:
    
    - Right now HeapTupleSatisfiesMVCCBatch() simply uses the single-tuple
      HeapTupleSatisfiesMVCC(), relying on the compiler to inline it. We could
      instead write an explicitly optimized version that avoids repeated xid
      tests.
    
    - Introduce a batched version of the serializability test
    
    - Introduce a batched version of HeapTupleSatisfiesVacuum()
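
    A hedged sketch of the batch entry point; the signature and the output
    convention (offsets of the visible tuples) are assumptions:

    ```c
    static int
    HeapTupleSatisfiesMVCCBatch(Snapshot snapshot, Buffer buffer,
                                int ntuples, HeapTupleData *tuples,
                                OffsetNumber *vistuples_out)
    {
        int			nvis = 0;

        for (int i = 0; i < ntuples; i++)
        {
            /* relies on the compiler inlining the single-tuple check */
            if (HeapTupleSatisfiesMVCC(&tuples[i], snapshot, buffer))
                vistuples_out[nvis++] =
                    ItemPointerGetOffsetNumber(&tuples[i].t_self);
        }

        return nvis;
    }
    ```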
    
    Author:
    Reviewed-by:
    Discussion: https://postgr.es/m/
    Backpatch:
    anarazel authored and Commitfest Bot committed Dec 13, 2025
    Commit: bb28cf9
  8. bufmgr: Change BufferDesc.state to be a 64-bit atomic

    This is motivated by wanting to merge buffer content locks into
    BufferDesc.state in a future commit, rather than having a separate lwlock (see
    commit c75ebc6 for more details). As this change is rather mechanical, it
    seems to make sense to split it out into a separate commit, for easier review.
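
    An abridged view of the struct after the change (unrelated fields
    omitted; comments are illustrative):

    ```c
    typedef struct BufferDesc
    {
        BufferTag	tag;			/* ID of page contained in buffer */
        int			buf_id;			/* buffer's index number (from 0) */

        /* state, containing flags, refcount and usagecount */
        pg_atomic_uint64 state;		/* widened from pg_atomic_uint32 */

        int			wait_backend_pgprocno;	/* backend waiting for pin */
        int			freeNext;		/* link in freelist chain */
    } BufferDesc;
    ```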
    
    Reviewed-by: Melanie Plageman <[email protected]>
    Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff
    anarazel authored and Commitfest Bot committed Dec 13, 2025
    Commit: 8aff002
  9. bufmgr: Implement buffer content locks independently of lwlocks

    Until now buffer content locks were implemented using lwlocks. That has the
    obvious advantage of not needing a separate efficient implementation of
    locks. However, the time for a dedicated buffer content lock implementation
    has come:
    
    1) Hint bits are currently set while holding only a share lock. This leads to
       having to copy pages while they are being written out if checksums are
       enabled, which is not cheap. We would like to add AIO writes; however,
       once many buffers can be written out at the same time, it gets a lot
       more expensive to copy them, particularly because the copies need to
       reside in shared buffers (for worker mode to have access to them).
    
       In addition, modifying buffers while they are being written out can cause
       issues with unbuffered/direct-IO, as some filesystems (like btrfs) do not
       like that, due to filesystem internal checksums getting corrupted.
    
       The solution to this is to require a new share-exclusive lock level for
       setting hint bits and for writing out buffers, making those operations
       mutually exclusive. We could introduce such a lock level into the
       generic lwlock implementation; however, it does not look like there
       would be other users, and it would add some overhead to important
       codepaths.
    
    2) For AIO writes we need to be able to race-freely check whether a buffer is
       undergoing IO and whether an exclusive lock on the page can be acquired. That
       is rather hard to do efficiently when the buffer state and the lock state
       are separate atomic variables. This is a major hindrance to allowing writes
       to be done asynchronously.
    
    3) Buffer locks are by far the most frequently taken locks. Optimizing them
       specifically for their use case is worth the effort. E.g. by merging
       content locks into buffer locks we will be able to release a buffer lock
       and pin in one atomic operation.
    
    4) There are more complicated optimizations, like long-lived "super pinned &
       locked" pages, that cannot realistically be implemented with the generic
       lwlock implementation.
    
    Therefore implement content locks inside bufmgr.c. The lock state is
    stored as part of BufferDesc.state. The implementation of buffer content
    locks is fairly similar to lwlocks, with a few important differences:
    
    1) An additional lock level, share-exclusive, has been added. This lock
       level conflicts with exclusive locks and with itself, but not with
       share locks (see the conflict-rule sketch after this list).
    
    2) Error recovery for content locks is implemented as part of the already
       existing private-refcount tracking mechanism, in combination with
       resowners, instead of a bespoke mechanism as is the case for lwlocks.
       This means we do not need to add dedicated error-recovery codepaths to
       release all content locks (as done with LWLockReleaseAll() for
       lwlocks).
    
    3) The lock state is embedded in BufferDesc.state instead of having its own
       struct.
    
    4) The wakeup logic is a tad more complicated due to needing to support
       the additional lock level.
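
    A hedged sketch of the conflict rules for the three lock levels; the enum
    and function names are assumptions, the matrix follows the text:

    ```c
    typedef enum BufferLockMode
    {
        BUFFER_LOCK_MODE_SHARE,
        BUFFER_LOCK_MODE_SHARE_EXCLUSIVE,	/* the new level */
        BUFFER_LOCK_MODE_EXCLUSIVE,
    } BufferLockMode;

    static bool
    BufferLockConflicts(BufferLockMode held, BufferLockMode requested)
    {
        /* exclusive conflicts with every lock level, including itself */
        if (held == BUFFER_LOCK_MODE_EXCLUSIVE ||
            requested == BUFFER_LOCK_MODE_EXCLUSIVE)
            return true;

        /* share-exclusive conflicts with itself, but not with plain share */
        return (held == BUFFER_LOCK_MODE_SHARE_EXCLUSIVE &&
                requested == BUFFER_LOCK_MODE_SHARE_EXCLUSIVE);
    }
    ```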
    
    This commit unfortunately introduces some code that is very similar to the
    code in lwlock.c, however the code is not equivalent enough to easily merge
    it. The future wins that this commit makes possible seem worth the cost.
    
    As of this commit nothing uses the new share-exclusive lock mode. It will be
    used in a future commit. It seemed too complicated to introduce the lock-level
    in a separate commit.
    
    TODO:
    - Address FIXMEs
    
    - Perhaps move the locking code into a buffer_locking.h or such? Needs to be
      inline functions for efficiency unfortunately.
    
    - reflow some comments that I didn't reflow to make the diff more readable
    
    Reviewed-by: Melanie Plageman <[email protected]>
    Reviewed-by: Greg Burd <[email protected]>
    Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff
    anarazel authored and Commitfest Bot committed Dec 13, 2025
    Commit: b012bd3
  10. Require share-exclusive lock to set hint bits and to flush

    At the moment hint bits can be set with just a share lock on a page (and in
    one place even without any lock). Because of this we need to copy pages while
    writing them out, as otherwise the checksum could be corrupted.
    
    The need to copy the page makes implementing AIO writes problematic:
    
    1) Instead of just needing a single buffer for a copied page we need one for
       each page that's potentially undergoing IO
    2) To be able to use the "worker" AIO implementation the copied page needs to
       reside in shared memory
    
    It also causes problems for using unbuffered/direct IO, independent of
    AIO: some filesystems, RAID implementations, etc. do not tolerate the
    data changing while it is being written out. E.g. they may compute
    internal checksums that can be invalidated by concurrent modifications,
    leading to filesystem errors (as is the case with btrfs).
    
    It is also just plain odd to allow modifications of buffers that are only
    share-locked.
    
    To address these issues, this commit changes the rules so that
    modifications to pages are no longer allowed while holding only a share
    lock. Instead, the new share-exclusive lock (introduced in FIXME XXXX
    TODO) allows at most one backend to modify a buffer while other backends
    hold share locks on the same page. An existing share lock can be upgraded
    to a share-exclusive lock if there are no conflicting locks. For that,
    BufferBeginSetHintBits()/BufferFinishSetHintBits() and
    BufferSetHintBits16() have been introduced.
    
    To prevent hint bits from being set while the buffer is being written out,
    writing out buffers now requires a share-exclusive lock.
    
    The use of share-exclusive to gate setting hint bits means that from now
    on only one backend can set hint bits at a time. Allowing multiple
    backends to set hint bits would require more complicated locking: we
    would need to store the count of backends currently setting hint bits,
    and we would need another lock level for I/O that conflicts with the lock
    level for setting hint bits. Given that the share-exclusive lock for
    setting hint bits is only held for a short time, that backends would
    often just set the same hint bits, and that the cost of occasionally not
    setting hint bits on hotly accessed pages is fairly low, this seems like
    an acceptable tradeoff.
    
    The biggest change needed to adapt to this is in heapam. To avoid
    performance regressions for sequential scans that need to set a lot of
    hint bits, we need to amortize the cost of BufferBeginSetHintBits() in
    cases where hint bits are set at a high frequency:
    HeapTupleSatisfiesMVCCBatch() uses the new SetHintBitsExt(), which defers
    BufferFinishSetHintBits() until all hint bits on a page have been set.
    Conversely, to avoid regressions in cases where we can't set hint bits in
    bulk (because we're looking only at individual tuples), use
    BufferSetHintBits16() when setting hint bits without batching.
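
    A hedged sketch of the amortization: acquire the right to set hint bits
    at most once per page, set many bits, release once. The loop body and the
    bool return of BufferBeginSetHintBits() are assumptions; only the
    Begin/Finish pairing and the deferral come from the description above:

    ```c
    static void
    PageModeSetHintBitsSketch(Buffer buffer, int ntuples,
                              HeapTupleData *tuples)
    {
        bool		tried = false;
        bool		can_set_hints = false;

        for (int i = 0; i < ntuples; i++)
        {
            HeapTupleHeader htup = tuples[i].t_data;

            /* try the share -> share-exclusive upgrade once per page */
            if (!tried)
            {
                can_set_hints = BufferBeginSetHintBits(buffer);
                tried = true;
            }

            /* grossly simplified stand-in for the real visibility logic */
            if (can_set_hints &&
                TransactionIdDidCommit(HeapTupleHeaderGetXmin(htup)))
                htup->t_infomask |= HEAP_XMIN_COMMITTED;
        }

        /* deferred until all hint bits on the page have been set */
        if (can_set_hints)
            BufferFinishSetHintBits(buffer);
    }
    ```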
    
    Several other places also need to be adapted, but those changes are
    comparatively simpler.
    
    After this we do not need to copy buffers to write them out anymore. That
    change is done separately however.
    
    TODO:
    - Update commit reference above
    - reflow parts of storage/buffer/README that I didn't reindent to make the
      diff more readable
    
    Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff
    Discussion: https://postgr.es/m/stj36ea6yyhoxtqkhpieia2z4krnam7qyetc57rfezgk4zgapf%40gcnactj4z56m
    anarazel authored and Commitfest Bot committed Dec 13, 2025
    Commit: f600135
  11. WIP: Make UnlockReleaseBuffer() more efficient

    Now that the buffer content lock is implemented as part of BufferDesc.state,
    releasing the lock and unpinning the buffer can be implemented as a single
    atomic operation.
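
    A hedged sketch of the combined release; BM_LOCK_EXCLUSIVE_ONE is a
    hypothetical encoding of one exclusive-lock holder, while
    BUF_REFCOUNT_ONE is the existing pin increment:

    ```c
    static inline void
    UnlockReleaseBufferSketch(BufferDesc *desc)
    {
        /* drop the content lock and one pin in a single atomic subtraction */
        pg_atomic_fetch_sub_u64(&desc->state,
                                BM_LOCK_EXCLUSIVE_ONE + BUF_REFCOUNT_ONE);

        /* a real implementation must also wake waiters when required */
    }
    ```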
    
    Author:
    Reviewed-By:
    Discussion: https://postgr.es/m/
    Backpatch:
    anarazel authored and Commitfest Bot committed Dec 13, 2025
    Commit: d0a6a48
  12. WIP: bufmgr: Don't copy pages while writing out

    After the preceding series of commits introducing and using
    BufferBeginSetHintBits()/BufferSetHintBits16(), hint bits are no longer
    set while IO is in progress. Therefore we no longer need to copy pages
    while they are being written out.
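
    A hedged sketch of the resulting simplification in the write-out path;
    the variable names are illustrative, not the actual FlushBuffer() diff:

    ```c
    /* before: bufToWrite = PageSetChecksumCopy((Page) bufBlock, blocknum); */

    /* hint bits can no longer change mid-write, so checksum in place */
    PageSetChecksumInplace((Page) bufBlock, blocknum);
    smgrwrite(reln, forknum, blocknum, bufBlock, false);
    ```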
    
    TODO: Update comments
    
    Author:
    Reviewed-by:
    Discussion: https://postgr.es/m/
    Backpatch:
    anarazel authored and Commitfest Bot committed Dec 13, 2025
    Commit: 3668689
  13. [CF 5483] v7 - Don't dirty pages while they are getting flushed out

    This branch was automatically generated by a robot using patches from an
    email thread registered at:
    
    https://commitfest.postgresql.org/patch/5483
    
    The branch will be overwritten each time a new patch version is posted to
    the thread, and also periodically to check for bitrot caused by changes
    on the master branch.
    
    Patch(es): https://www.postgresql.org/message-id/lneuyxqxamqoayd2ntau3lqjblzdckw6tjgeu4574ezwh4tzlg@noioxkquezdw
    Author(s): Andres Freund
    Commitfest Bot committed Dec 13, 2025
    Commit: 018ec7a