Skip to content

Use the system index descriptor in the snapshot blob cache cleanup task #120937

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

pxsalehi
Copy link
Member

@pxsalehi pxsalehi commented Jan 27, 2025

Clean up of the .snapshot-blob-cache* system index is done only on the node that hosts the primary of the shard 0 of that index. When the index is migrated as part of an upgrade test e.g. v7 -> v8, the index is reindexed to a new index .snapshot-blob-cache-reindexed-for-9. The code scheduling this clean up task is not able to locate the shard and would never trigger a clean up after the upgrade. This change uses the system index descriptor to find the matching shard and would work for future versions too.

Closes #120518

@pxsalehi pxsalehi added >bug :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs v9.0.0 v8.18.0 labels Jan 27, 2025
@elasticsearchmachine
Copy link
Collaborator

Hi @pxsalehi, I've created a changelog YAML for you.

@pxsalehi pxsalehi force-pushed the ps250127-snapshot-blob-cache-cleanup branch 2 times, most recently from 4b3409b to dd132d8 Compare January 28, 2025 11:23
* Mimics migration of the {@link SearchableSnapshots#SNAPSHOT_BLOB_CACHE_INDEX} as done in
* {@link org.elasticsearch.upgrades.SystemIndexMigrator}, where the index is re-indexed, and replaced by an alias.
*/
private void migrateTheSystemIndex() {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did have a quick look at SystemIndexMigrator hoping I could trigger this migration step by creating/running such a migrator. But that doesn't seem straight-forward and it would still be mimicking a proper upgrade test which is what is missing here. Considering the lack of automated upgrade tests that include the migration steps is a bigger issue and not addressed currently, I kept the test here short just trying to reindex/alias/remove, which is what the migrator is doing.

I have also manually verified this by creating/mounting a searchable snapshot index in 7.x, and upgrading to 8.x (with migration) and then 9.x and removing the searchable snapshot index in 9.x which cleans up the migrated snapshot blob cache index.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we have no actual upgrade tests that run system index migrations? That indeed sounds problematic and something we should ensure happens (but fine outside this PR, for another team too I think).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is that we don't have that. which is why we resorted to manual testing in the first place. I can follow up and ask around, and create a follow up issue or PR if it turns out to be possible.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In #121517 Christoph reused the upgrade tests we made for N-2 support in order to test the upgrade of the async-search system index.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I'll look into that. back then I asked core/infra about v7->v8->v9 and they mentioned there is currently no option. But this could be partially useful.

@elasticsearchmachine
Copy link
Collaborator

Hi @pxsalehi, I've created a changelog YAML for you.

@pxsalehi pxsalehi force-pushed the ps250127-snapshot-blob-cache-cleanup branch from a9d07aa to fb359a3 Compare January 28, 2025 13:55
@pxsalehi pxsalehi marked this pull request as ready for review January 28, 2025 13:57
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Coordination Meta label for Distributed Coordination team label Jan 28, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@elasticsearchmachine
Copy link
Collaborator

Hi @pxsalehi, I've created a changelog YAML for you.

Copy link
Contributor

@henningandersen henningandersen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

* Mimics migration of the {@link SearchableSnapshots#SNAPSHOT_BLOB_CACHE_INDEX} as done in
* {@link org.elasticsearch.upgrades.SystemIndexMigrator}, where the index is re-indexed, and replaced by an alias.
*/
private void migrateTheSystemIndex() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we have no actual upgrade tests that run system index migrations? That indeed sounds problematic and something we should ensure happens (but fine outside this PR, for another team too I think).

if (indexRoutingTable != null) {
return indexRoutingTable.shard(0).primaryShard();
private boolean systemIndexPrimaryShardActiveAndAssignedToLocalNode(final ClusterState state) {
for (IndexMetadata indexMetadata : state.metadata()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use getMatchingIndices instead and then assert that it has <= 1 results? I think that simplifies the code here too.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@henningandersen I wasn't sure that's safe. During migration there will be a brief period after reindexing and before alias/remove that there will be two indices matching the pattern. So that assertion might not hold.

@pxsalehi pxsalehi added auto-backport Automatically create backport pull requests when merged auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) labels Jan 28, 2025
@elasticsearchmachine elasticsearchmachine merged commit 0666462 into elastic:main Jan 28, 2025
16 checks passed
@pxsalehi pxsalehi deleted the ps250127-snapshot-blob-cache-cleanup branch January 28, 2025 15:16
@elasticsearchmachine
Copy link
Collaborator

💚 Backport successful

Status Branch Result
8.x

pxsalehi added a commit to pxsalehi/elasticsearch that referenced this pull request Jan 28, 2025
…sk (elastic#120937)

Clean up of the `.snapshot-blob-cache*` system index is done only on the
node that hosts the primary of the shard 0 of that index. When the index
is migrated as part of an upgrade test e.g. v7 -> v8, the index is
reindexed to a new index `.snapshot-blob-cache-reindexed-for-9`. The
code scheduling this clean up task is not able to locate the shard and
would never trigger a clean up after the upgrade. This change uses the
system index descriptor to find the matching shard and would work for
future versions too.

Closes elastic#120518
elasticsearchmachine pushed a commit that referenced this pull request Jan 28, 2025
…sk (#120937) (#121053)

Clean up of the `.snapshot-blob-cache*` system index is done only on the
node that hosts the primary of the shard 0 of that index. When the index
is migrated as part of an upgrade test e.g. v7 -> v8, the index is
reindexed to a new index `.snapshot-blob-cache-reindexed-for-9`. The
code scheduling this clean up task is not able to locate the shard and
would never trigger a clean up after the upgrade. This change uses the
system index descriptor to find the matching shard and would work for
future versions too.

Closes #120518
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-backport Automatically create backport pull requests when merged auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) >bug :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs Team:Distributed Coordination Meta label for Distributed Coordination team v8.18.0 v9.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Clean up of the .snapshot-blob-cache* index not working on an upgraded 8.x cluster
4 participants