
Trigger merges after recovery #113102


Merged

Conversation

DaveCTurner (Contributor)

We may have shut a shard down while merges were still pending (or
adjusted the merge policy while the shard was down) meaning that after
recovery its segments do not reflect the desired state according to the
merge policy. With this commit we invoke `IndexWriter#maybeMerge()` at
the end of recovery to check for, and execute, any such lost merges.
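
A minimal sketch of the idea, assuming nothing beyond Lucene's public `IndexWriter#maybeMerge()`; the wrapper and failure handling here are illustrative, not the actual wiring in this PR:

```java
import org.apache.lucene.index.IndexWriter;

import java.io.IOException;
import java.util.function.Consumer;

final class PostRecoveryMergeSketch {
    // Wrap the recovery completion callback so a successful recovery also asks the
    // merge policy whether the recovered segments need merging.
    static Runnable wrapRecoveryListener(Runnable onRecoveryDone, IndexWriter writer, Consumer<Exception> onFailure) {
        return () -> {
            onRecoveryDone.run(); // complete recovery as before
            try {
                // Re-evaluates the merge policy and schedules (without waiting for)
                // any merges it selects.
                writer.maybeMerge();
            } catch (IOException e) {
                onFailure.accept(e);
            }
        };
    }
}
```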
DaveCTurner added the >enhancement, :Distributed Indexing/Recovery, and v9.0.0 labels on Sep 18, 2024
elasticsearchmachine (Collaborator)

Pinging @elastic/es-distributed (Team:Distributed)

elasticsearchmachine added the Team:Distributed (Obsolete) label on Sep 18, 2024
elasticsearchmachine (Collaborator)

Hi @DaveCTurner, I've created a changelog YAML for you.

@tlrx (Member) left a comment

LGTM, left two comments.

return recoveryListener;
}

logger.trace(Strings.format("wrapping listener for post-recovery merge of [%s]", shardId));
tlrx (Member)

Suggested change:
- logger.trace(Strings.format("wrapping listener for post-recovery merge of [%s]", shardId));
+ logger.trace(() -> Strings.format("wrapping listener for post-recovery merge of [%s]", shardId));

(same remark for other log traces)
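
For context, a minimal illustration of why the supplier form matters, assuming a Log4j 2-style logger with a `Supplier` overload (which is what the suggestion relies on); `String.format` stands in for `Strings.format` to keep the snippet self-contained:

```java
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

final class LazyTraceSketch {
    private static final Logger logger = LogManager.getLogger(LazyTraceSketch.class);

    static void onRecovery(String shardId) {
        // Eager: the message is formatted even when TRACE is disabled.
        logger.trace(String.format("wrapping listener for post-recovery merge of [%s]", shardId));

        // Lazy: the lambda runs only if TRACE is actually enabled.
        logger.trace(() -> String.format("wrapping listener for post-recovery merge of [%s]", shardId));
    }
}
```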

@henningandersen (Contributor) left a comment

Looks good, one question on how to ensure a flush happens.


ensureGreen(indexName);
assertBusy(() -> {
refresh(indexName); // pick up the result of any merges
henningandersen (Contributor)

I wonder if we need to let `afterMerge` set `active` to true to ensure that we always get a flush after a merge? Otherwise I think we risk the flush happening too early, before the merge completes (especially if processing is lagging due to running on one thread), and then no further flush after the merge.

And if we do that, we should probably change this to call `flushOnIdle` instead?

DaveCTurner (Contributor, Author)

> ensure that we get a flush after a merge always?

That sounds sensible to me but I'm not sure if it could have any bad consequences. Why don't we do it already?

DaveCTurner (Contributor, Author)

Note that we already flush just before snapshotting, and before relocating the shard in stateless ES. Not saying that we shouldn't try and get this to flush in `flushOnIdle` too, but maybe we can think about that in a follow-up?

henningandersen (Contributor)

Thanks, that helps. I also now realize that the check here:

&& System.nanoTime() - lastWriteNanos >= engineConfig.getFlushMergesAfter().nanos()) {

is more than likely to kick in if the active->inactive flush has already occurred, making this more of a benign race condition than something that would happen repeatedly.

I wonder if it would be better to set `indices.memory.shard_inactive_time=0` in this test and avoid the refresh here, demonstrating that it will indeed flush after the merge?
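
A sketch of the setting being suggested, built with the `Settings` API; how it gets injected into the test (node settings override, etc.) is assumed and depends on the test framework hook:

```java
import org.elasticsearch.common.settings.Settings;

final class InactiveShardSettingsSketch {
    static Settings zeroShardInactiveTime() {
        return Settings.builder()
            // Treat shards as inactive immediately, so the inactive-shard flush runs
            // right after the post-recovery merge without needing a manual refresh.
            .put("indices.memory.shard_inactive_time", "0s")
            .build();
    }
}
```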

DaveCTurner (Contributor, Author)

Sure, yes, that makes sense.

@ywangd (Member) left a comment

I have a question about the throttling behaviour. I'd appreciate it if you could help me understand it better. Thanks!

return;
}

indexShard.triggerPendingMerges();
ywangd (Member)

This ultimately calls `IndexWriter#maybeMerge`, which kicks off the actual merges on the "Lucene Merge Thread"s. It does not wait for the merges to complete. So my understanding is that we can kick off merges for multiple shards even though the throttled task runner has a single thread? Did I miss something, or is this intended?

@DaveCTurner (Contributor, Author) commented on Sep 19, 2024

That's intended; see the comment on ES-9313. I copied this info into a code comment in e70887c.
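
A standalone Lucene sketch of that behaviour (plain Lucene, not Elasticsearch code): `maybeMerge` hands any selected merges to the merge scheduler and returns, so a single-threaded throttled runner only limits how quickly merges are kicked off, not how many run concurrently.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.ConcurrentMergeScheduler;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.ByteBuffersDirectory;

import java.io.IOException;

final class MaybeMergeIsAsyncSketch {
    public static void main(String[] args) throws IOException {
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer())
            // ConcurrentMergeScheduler (the default) runs merges on background
            // "Lucene Merge Thread"s rather than on the calling thread.
            .setMergeScheduler(new ConcurrentMergeScheduler());
        try (IndexWriter writer = new IndexWriter(new ByteBuffersDirectory(), config)) {
            // Returns once any merges selected by the merge policy have been handed
            // to the scheduler; it does not block until they finish.
            writer.maybeMerge();
        }
    }
}
```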

@@ -1514,6 +1514,13 @@ public void forceMerge(ForceMergeRequest forceMerge) throws IOException {
engine.forceMerge(forceMerge.flush(), forceMerge.maxNumSegments(), forceMerge.onlyExpungeDeletes(), forceMerge.forceMergeUUID());
}

public void triggerPendingMerges() throws IOException {
switch (state /* single volatile read */) {
case STARTED, POST_RECOVERY -> getEngine().forceMerge(false, ForceMergeRequest.Defaults.MAX_NUM_SEGMENTS, false, null);
ywangd (Member)

Nit: can we add a comment about passing `flush=false` and `null` for the UUID? My understanding is that they do not make sense when the downstream code calls `IndexWriter#maybeMerge`, but that is not obvious from here.

Also, for my own knowledge: since we don't flush actively, I guess this relies on flushes triggered by other mechanisms, such as scheduled refreshes, the indexing disk and memory controllers, and maybe other things?
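
One possible wording for such a comment, paraphrasing this thread rather than quoting what actually landed in c07131b:

```java
// flush=false: we deliberately don't flush here; segments produced by these merges
//              are persisted by the usual mechanisms (inactive-shard flush, the
//              flush before snapshotting or relocation, etc.).
// forceMergeUUID=null: this is not a user-requested force-merge, so there is no
//              UUID to record; with the default maxNumSegments this call just ends
//              up in IndexWriter#maybeMerge.
getEngine().forceMerge(false, ForceMergeRequest.Defaults.MAX_NUM_SEGMENTS, false, null);
```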

DaveCTurner (Contributor, Author)

More comments in c07131b.

Henning and I are still discussing how to ensure we flush the result of the merge.

@henningandersen (Contributor) left a comment

LGTM.

@@ -1600,7 +1600,7 @@ protected final BroadcastResponse flush(String... indices) {
*/
protected BroadcastResponse forceMerge() {
waitForRelocation();
BroadcastResponse actionGet = indicesAdmin().prepareForceMerge().setMaxNumSegments(1).get();
BroadcastResponse actionGet = indicesAdmin().prepareForceMerge().setMaxNumSegments(1).setFlush(true).get();
henningandersen (Contributor)

I think that is the default anyway, not sure why this change is necessary?

DaveCTurner (Contributor, Author)

Ah sorry, that was left over from debugging the `DiskThresholdDeciderIT` failure; not necessary indeed. I'll remove it.

DaveCTurner added the auto-merge-without-approval label (automatically merge the pull request when CI checks pass; doesn't wait for reviews) on Sep 20, 2024
DaveCTurner merged commit 33a73a8 into elastic:main on Sep 20, 2024
15 checks passed
DaveCTurner deleted the 2024/09/18/trigger-merge-after-recovery branch on September 20, 2024 at 16:16