Use the system index descriptor in the snapshot blob cache cleanup task #120937

pxsalehi · 2025-01-27T16:56:00Z

Cleanup of the `.snapshot-blob-cache*` system index runs only on the node that hosts the primary of shard 0 of that index. When the index is migrated as part of an upgrade test, e.g. v7 -> v8, it is reindexed into a new index, `.snapshot-blob-cache-reindexed-for-9`. The code that schedules the cleanup task then cannot locate the shard, so it never triggers a cleanup after the upgrade. This change uses the system index descriptor to find the matching shard, which should also work for future versions.

Closes #120518
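
To illustrate the difference described above, here is a minimal sketch that contrasts an exact-name lookup with a pattern-based lookup. It does not use the Elasticsearch API: `PrimaryShard`, `localPrimaryByExactName`, `localPrimaryByPattern`, and the node IDs are hypothetical stand-ins, and the original index name `.snapshot-blob-cache` and the regex form of the descriptor pattern are assumptions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Simplified stand-ins for illustration only; these are NOT Elasticsearch classes.
final class BlobCacheShardLookupSketch {

    /** Hypothetical model of the primary copy of shard 0 of one index. */
    static final class PrimaryShard {
        final String indexName;
        final String nodeId;
        final boolean active;

        PrimaryShard(String indexName, String nodeId, boolean active) {
            this.indexName = indexName;
            this.nodeId = nodeId;
            this.active = active;
        }
    }

    // Assumed original index name and a regex equivalent of the ".snapshot-blob-cache*" pattern.
    static final String ORIGINAL_NAME = ".snapshot-blob-cache";
    static final Pattern DESCRIPTOR_PATTERN = Pattern.compile("\\.snapshot-blob-cache.*");

    /** Exact-name lookup: misses the index once migration has reindexed it under a new name. */
    static boolean localPrimaryByExactName(List<PrimaryShard> primaries, String localNodeId) {
        return primaries.stream()
            .filter(p -> p.indexName.equals(ORIGINAL_NAME))
            .anyMatch(p -> p.active && p.nodeId.equals(localNodeId));
    }

    /** Pattern lookup: still finds the index after it has been reindexed under a new name. */
    static boolean localPrimaryByPattern(List<PrimaryShard> primaries, String localNodeId) {
        return primaries.stream()
            .filter(p -> DESCRIPTOR_PATTERN.matcher(p.indexName).matches())
            .anyMatch(p -> p.active && p.nodeId.equals(localNodeId));
    }

    public static void main(String[] args) {
        List<PrimaryShard> afterMigration = Arrays.asList(
            new PrimaryShard(".snapshot-blob-cache-reindexed-for-9", "node-1", true)
        );
        System.out.println(localPrimaryByExactName(afterMigration, "node-1")); // false
        System.out.println(localPrimaryByPattern(afterMigration, "node-1"));   // true
    }
}
```

In this model the exact-name lookup never reports a local primary after migration, which corresponds to the cleanup task never being scheduled.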

elasticsearchmachine · 2025-01-27T16:56:24Z

Hi @pxsalehi, I've created a changelog YAML for you.

pxsalehi · 2025-01-28T13:41:35Z

...k/searchablesnapshots/cache/blob/SearchableSnapshotsBlobStoreCacheMaintenanceIntegTests.java

+     *  Mimics migration of the {@link SearchableSnapshots#SNAPSHOT_BLOB_CACHE_INDEX} as done in
+     *  {@link org.elasticsearch.upgrades.SystemIndexMigrator}, where the index is re-indexed, and replaced by an alias.
+     */
+    private void migrateTheSystemIndex() {


I did have a quick look at SystemIndexMigrator, hoping I could trigger this migration step by creating and running such a migrator. But that doesn't seem straightforward, and it would still only mimic a proper upgrade test, which is what is missing here. Since the lack of automated upgrade tests that include the migration steps is a bigger issue that is not currently addressed, I kept the test here short: it just does reindex/alias/remove, which is what the migrator does.

I have also verified this manually: I created and mounted a searchable snapshot index in 7.x, upgraded to 8.x (with migration) and then to 9.x, and removed the searchable snapshot index in 9.x, which cleaned up the migrated snapshot blob cache index.

So we have no actual upgrade tests that run system index migrations? That indeed sounds problematic and something we should ensure happens (but fine outside this PR, for another team too I think).

My understanding is that we don't have that, which is why we resorted to manual testing in the first place. I can follow up and ask around, and create a follow-up issue or PR if it turns out to be possible.

In #121517 Christoph reused the upgrade tests we made for N-2 support in order to test the upgrade of the async-search system index.

Thanks. I'll look into that. Back then I asked core/infra about v7->v8->v9 and they said there is currently no option. But this could be partially useful.

elasticsearchmachine · 2025-01-28T13:48:55Z

Hi @pxsalehi, I've created a changelog YAML for you.

elasticsearchmachine · 2025-01-28T13:57:45Z

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

elasticsearchmachine · 2025-01-28T13:57:45Z

Hi @pxsalehi, I've created a changelog YAML for you.

henningandersen

LGTM.

henningandersen · 2025-01-28T14:07:57Z

...k/searchablesnapshots/cache/blob/SearchableSnapshotsBlobStoreCacheMaintenanceIntegTests.java

+     *  Mimics migration of the {@link SearchableSnapshots#SNAPSHOT_BLOB_CACHE_INDEX} as done in
+     *  {@link org.elasticsearch.upgrades.SystemIndexMigrator}, where the index is re-indexed, and replaced by an alias.
+     */
+    private void migrateTheSystemIndex() {


So we have no actual upgrade tests that run system index migrations? That indeed sounds problematic and something we should ensure happens (but fine outside this PR, for another team too I think).

henningandersen · 2025-01-28T14:10:49Z

...org/elasticsearch/xpack/searchablesnapshots/cache/blob/BlobStoreCacheMaintenanceService.java

-            if (indexRoutingTable != null) {
-                return indexRoutingTable.shard(0).primaryShard();
+    private boolean systemIndexPrimaryShardActiveAndAssignedToLocalNode(final ClusterState state) {
+        for (IndexMetadata indexMetadata : state.metadata()) {


Can we use getMatchingIndices instead and then assert that it has <= 1 results? I think that simplifies the code here too.

@henningandersen I wasn't sure that's safe. During migration there is a brief period, after reindexing and before alias/remove, during which two indices match the pattern. So that assertion might not hold.
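
The window described in this thread can be sketched with a small model of the reindex/alias/remove steps. This is not the migrator's code: `MigrationWindowSketch`, `matchingIndices`, and `matchCountsDuringMigration` are hypothetical names, the original index name `.snapshot-blob-cache` is an assumption, and a prefix check stands in for the descriptor's pattern matching.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Simplified model for illustration only; these are NOT Elasticsearch classes.
final class MigrationWindowSketch {

    // Assumed names: the original index and the reindexed index named in the PR description.
    static final String OLD_INDEX = ".snapshot-blob-cache";
    static final String NEW_INDEX = ".snapshot-blob-cache-reindexed-for-9";

    /** Returns the concrete index names matching the ".snapshot-blob-cache*" pattern. */
    static List<String> matchingIndices(Set<String> concreteIndices) {
        List<String> matches = new ArrayList<>();
        for (String name : concreteIndices) {
            if (name.startsWith(OLD_INDEX)) {
                matches.add(name);
            }
        }
        return matches;
    }

    /** Number of matching indices before, during, and after the mimicked migration. */
    static int[] matchCountsDuringMigration() {
        Set<String> indices = new LinkedHashSet<>();
        indices.add(OLD_INDEX);
        int before = matchingIndices(indices).size();

        // Step 1: reindex into the new index; the old index still exists.
        indices.add(NEW_INDEX);
        int during = matchingIndices(indices).size();

        // Step 2: remove the old index; its name lives on only as an alias,
        // which is not a concrete index and so is not tracked in this set.
        indices.remove(OLD_INDEX);
        int after = matchingIndices(indices).size();

        return new int[] { before, during, after };
    }

    public static void main(String[] args) {
        int[] counts = matchCountsDuringMigration();
        System.out.println(counts[0] + " " + counts[1] + " " + counts[2]); // 1 2 1
    }
}
```

In this model an assertion of at most one match would fail between step 1 and step 2, which is the concern raised in the reply above.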

elasticsearchmachine · 2025-01-28T15:17:14Z

💚 Backport successful

Status	Branch	Result
✅	8.x

pxsalehi added the >bug, :Distributed Coordination/Snapshot/Restore, v9.0.0, and v8.18.0 labels Jan 27, 2025

pxsalehi force-pushed the ps250127-snapshot-blob-cache-cleanup branch 2 times, most recently from 4b3409b to dd132d8, January 28, 2025 11:23

Use system index descriptor to fetch the snapshot blob cache shard

fb359a3

pxsalehi force-pushed the ps250127-snapshot-blob-cache-cleanup branch from a9d07aa to fb359a3, January 28, 2025 13:55

pxsalehi requested review from tlrx and henningandersen January 28, 2025 13:57

pxsalehi marked this pull request as ready for review January 28, 2025 13:57

elasticsearchmachine added the Team:Distributed Coordination label Jan 28, 2025

Update docs/changelog/120937.yaml

1abc563

[CI] Auto commit changes from spotless

745d10f

henningandersen approved these changes Jan 28, 2025

pxsalehi added the auto-backport and auto-merge-without-approval labels Jan 28, 2025

elasticsearchmachine merged commit 0666462 into elastic:main Jan 28, 2025
16 checks passed

pxsalehi deleted the ps250127-snapshot-blob-cache-cleanup branch January 28, 2025 15:16

pxsalehi mentioned this pull request Jan 28, 2025

[8.x] Use the system index descriptor in the snapshot blob cache cleanup task (#120937) #121053

Merged
