[DocDB] Post split compaction may end up too early not compacting some parent files #27426

Open
1 task done
arybochkin opened this issue May 29, 2025 · 0 comments

arybochkin commented May 29, 2025

Jira Link: DB-16966

Description

During testing of the TTL file expiration feature, it was found that post split compaction stopped too early, ignoring several parent files.
The parent tablet had the following files before the split operation was applied:

Filename		Size	Size (MB)
000798.sst.sblock.0	1.1G	1126.40
000874.sst.sblock.0	1.1G	1126.40
000950.sst.sblock.0	1.1G	1126.40
001026.sst.sblock.0	1.1G	1126.40
001102.sst.sblock.0	1.1G	1126.40
001178.sst.sblock.0	1.1G	1126.40
001254.sst.sblock.0	1.1G	1126.40
001330.sst.sblock.0	1.1G	1126.40
001406.sst.sblock.0	1.1G	1126.40
001482.sst.sblock.0	1.1G	1126.40
001558.sst.sblock.0	1.1G	1126.40
001634.sst.sblock.0	1.1G	1126.40
001710.sst.sblock.0	1.1G	1126.40
001716.sst.sblock.0	85M	85.00
001721.sst.sblock.0	68M	68.00
001722.sst.sblock.0	17M	17.00
001723.sst.sblock.0	17M	17.00
001724.sst.sblock.0	17M	17.00
001725.sst.sblock.0	17M	17.00
001725.sst.sblock.0	11M	11.00

The child tablet had the following compactions:

  1. Background compaction by size ratio took files [1727, 1725, 1724, 1723, 1722, 1721, 1716]
  2. Post split compaction ran in 5 iterations, taking 1 file per iteration [798, 874, 950, 1026, 1102]
  3. All files got expired by TTL file expiration

Configuration:

table default time to live = 6 hours
flag rocksdb_max_file_size_for_compaction = 1000000000
flag tablet_enable_ttl_file_filter = true
flag sst_files_soft_limit = 60
flag sst_files_hard_limit = 90

Issue Type

kind/bug

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
@arybochkin arybochkin self-assigned this May 29, 2025
@arybochkin arybochkin added area/docdb YugabyteDB core features priority/high High Priority status/awaiting-triage Issue awaiting triage labels May 29, 2025
@yugabyte-ci yugabyte-ci added the kind/bug This issue is a bug label May 29, 2025
arybochkin added a commit that referenced this issue Jun 5, 2025
…ting some parent files

Summary:
Post split compaction may stop too early, skipping some SST files that should be compacted, if
a background compaction is triggered while the post-split compaction is still in progress. This
issue can occur only when the post-split compaction operates in multiple iterations, and
a background compaction starts between those iterations. As a reminder: multiple iterations are
used to avoid locking too many files at once by limiting the input size per compaction iteration.
The total size of input files for one iteration should not exceed
`post_split_compaction_input_size_threshold_bytes` (default: 256 MB).
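
The iteration sizing described above can be sketched as follows. This is a minimal illustration, not the actual DocDB code: the `SstFile` type and `split_into_iterations` function are hypothetical names, and the sketch assumes that a single file larger than the threshold still forms an iteration on its own (consistent with the "1 file per iteration" behavior reported in the issue).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SstFile:
    # Hypothetical, simplified stand-in for SST file metadata.
    name: str
    size_bytes: int


def split_into_iterations(files: List[SstFile], threshold_bytes: int) -> List[List[SstFile]]:
    """Group files (earliest first) into per-iteration inputs.

    Assumption: an iteration always takes at least one file, even if that
    file alone exceeds the threshold.
    """
    iterations: List[List[SstFile]] = []
    current: List[SstFile] = []
    current_size = 0
    for f in files:
        # Close the current iteration if adding this file would exceed the threshold.
        if current and current_size + f.size_bytes > threshold_bytes:
            iterations.append(current)
            current, current_size = [], 0
        current.append(f)
        current_size += f.size_bytes
    if current:
        iterations.append(current)
    return iterations


MB = 1024 * 1024
big = [SstFile(f"SST{i}", 3100 * MB) for i in range(1, 8)]
small = [SstFile(n, 100 * MB) for n in ("a", "b", "c")]
big_plan = split_into_iterations(big, 256 * MB)
small_plan = split_into_iterations(small, 256 * MB)
```

With a 256 MB threshold, files that are each far larger than the threshold end up one per iteration, while smaller files are batched together.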

The problem comes from an incorrect assumption in the post-split compaction logic:
if the compaction picker encounters an SST file marked as `being compacted`, it stops and returns
the current set of selected files. If the returned set is empty, the post-split compaction is
considered complete. This logic breaks if a background compaction is triggered just before
the post-split compaction picker attempts to fetch the next subset of SST files and
the background compaction picks a subset of the earliest files: the very first file is then
marked as `being compacted`, so the picker stops immediately and returns an empty set,
causing the post-split compaction to terminate prematurely.
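
The faulty behavior can be modeled with a small sketch. It is a simplified simulation under stated assumptions, not the real picker: `SstFile`, `has_parent_data`, and `pick_stop_at_locked` are hypothetical names, and the rule that one oversized file can still be picked alone is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SstFile:
    # Hypothetical, simplified stand-in for SST file metadata.
    name: str
    size_bytes: int
    has_parent_data: bool = True   # assumption: file still needs post-split compaction
    being_compacted: bool = False  # locked by another compaction


def pick_stop_at_locked(files: List[SstFile], threshold_bytes: int) -> List[SstFile]:
    """Model of the faulty logic: stop at the first locked file."""
    picked: List[SstFile] = []
    total = 0
    for f in files:  # earliest first
        if f.being_compacted:
            break  # returns whatever was collected, possibly nothing
        if not f.has_parent_data:
            if picked:
                break
            continue
        if picked and total + f.size_bytes > threshold_bytes:
            break
        picked.append(f)
        total += f.size_bytes
    return picked


MB = 1024 * 1024
# Files already rewritten by post-split compaction, now locked by a background compaction.
done = [SstFile(f"SST{i}", 1500 * MB, has_parent_data=False, being_compacted=True)
        for i in range(8, 13)]
# Parent files that still have to be compacted.
pending = [SstFile("SST6", 3100 * MB), SstFile("SST7", 3100 * MB)]

picked_names = [f.name for f in pick_stop_at_locked(done + pending, 256 * MB)]
```

Here `picked_names` is empty even though SST6 and SST7 are still pending, so a caller that treats an empty result as "done" finishes too early.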

Although the issue is generic and may be observed in any configuration where tablet
splitting happens, the easiest way to reproduce it is to configure a default table TTL and
turn on the TTL file expiration feature in order to limit background compactions by SST file size.

Example:
```
1. Default table TTL is configured, TTL file expiration is configured, rocksdb max file size
   for compaction is set to 3 GB, post split compaction input threshold is 256 MB.

2. SST files before post split compaction:
   SST1 [seq1], SST2 [seq2], SST3 [seq3], SST4 [seq4], SST5 [seq5], SST6 [seq6], SST7 [seq7]
   SST1 ... SST7 are 3.1 GB each

3. The state after 5 out of 6 iterations (where only 1 SST is taken during each iteration):
   SST8 [seq1], SST9 [seq2], SST10 [seq3], SST11 [seq4], SST12 [seq5], SST6 [seq6], SST7 [seq7]
   SST8 ... SST12 are 1.5 GB each, SST6 and SST7 are 3.1 GB

4. Background compaction kicks in immediately after SST11 has been flushed.

5. Background compaction takes SST8...SST12 by size amp criteria and marks them `being compacted`.

6. Post split compaction runs the next iteration, but it sees that the first file by sequence, SST8,
   is locked and immediately stops compacting because no files got picked on this iteration.

7. Background compaction processes SST8...SST12 producing SST13 [seq1] of size 7.5 GB (5 x 1.5 GB).

8. The final set of SST files after both post split and background compactions are completed:
   SST13 [seq1, 7.5 GB], SST6 [seq6, 3.1 GB], SST7 [seq7, 3.1 GB]
```

The fix: if the post split compaction picker meets a locked file (one under compaction), it compacts
the files collected so far, or moves on to the next file if nothing has been picked yet.
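
A minimal sketch of the corrected rule is shown below. As before, this is an illustrative model rather than the actual patch: `SstFile`, `has_parent_data`, and `pick_skip_locked` are hypothetical names, and the handling of oversized files is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SstFile:
    # Hypothetical, simplified stand-in for SST file metadata.
    name: str
    size_bytes: int
    has_parent_data: bool = True   # assumption: file still needs post-split compaction
    being_compacted: bool = False  # locked by another compaction


def pick_skip_locked(files: List[SstFile], threshold_bytes: int) -> List[SstFile]:
    """Model of the fixed logic for handling locked files."""
    picked: List[SstFile] = []
    total = 0
    for f in files:  # earliest first
        if f.being_compacted or not f.has_parent_data:
            if picked:
                break   # compact what has been collected so far
            continue    # nothing picked yet: move on to the next file
        if picked and total + f.size_bytes > threshold_bytes:
            break
        picked.append(f)
        total += f.size_bytes
    return picked


MB = 1024 * 1024
# Scenario from the example: SST8...SST12 are locked, SST6 and SST7 are pending.
locked = [SstFile(f"SST{i}", 1500 * MB, has_parent_data=False, being_compacted=True)
          for i in range(8, 13)]
pending = [SstFile("SST6", 3100 * MB), SstFile("SST7", 3100 * MB)]
after_locked = [f.name for f in pick_skip_locked(locked + pending, 256 * MB)]

# A locked file in the middle ends the current selection instead of discarding it.
mixed = [SstFile("A", 10 * MB), SstFile("B", 10 * MB, being_compacted=True), SstFile("C", 10 * MB)]
before_locked = [f.name for f in pick_skip_locked(mixed, 256 * MB)]
```

In the first scenario the picker skips the locked files and selects SST6; in the second it returns the file collected before the locked one. An empty result now occurs only when no unlocked pending file remains.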

This fix must be backported to all active releases, but we don't necessarily need
to backport the unit test for the fix. Hence the unit test is being landed in a separate diff:
https://phorge.dev.yugabyte.com/D44537
Jira: DB-16966

Test Plan: Jenkins

Reviewers: timur, rthallam

Reviewed By: timur, rthallam

Subscribers: ybase

Differential Revision: https://phorge.dev.yugabyte.com/D44394