Retry throttled snapshot deletions #113237

Conversation

nicktindall
Contributor

@nicktindall nicktindall commented Sep 20, 2024

The change is:

  • For bulk deletes with OperationPurpose SNAPSHOT_DATA or SNAPSHOT_METADATA we will retry when throttled, with a progressive back-off, up to (by default) 8 times over about 5 seconds (a rough sketch of the retry loop follows below).
    • This is in addition to the (default 3) retries performed by the AWS client for each request.
    • The initial delay and the number of retries are configurable via repository settings.
  • If the thread is interrupted while delayed between retries, we preserve the interrupt and fail the current batch with the throttling exception. If there are subsequent batches waiting, the first of those will trigger an AWS AbortedException, which will propagate wrapped in an IOException.
  • I've added an APM histogram of how many retries were necessary for each batch; I'll create a ticket to add this to the S3 dashboard if we go ahead with it.
    • The delete publishes the number of retries if it either completes successfully or runs out of retries. Other types of failure don't publish retry metrics.

Closes ES-8562
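
As a rough illustration of the per-batch retry loop (all names and default values below are placeholders, not taken from the actual implementation):

```java
import java.util.List;
import java.util.function.Predicate;

// Illustrative sketch only; names and defaults do not match the real code.
final class ThrottledDeleteRetrySketch {

    private static final int MAX_RETRIES = 8;             // default retry budget per batch
    private static final long INITIAL_DELAY_MILLIS = 100; // first delay between attempts
    private static final long MAX_DELAY_MILLIS = 5_000;   // cap on how far the delay grows

    /** Stands in for a single bulk-delete request against the repository. */
    interface BatchDelete {
        void run(List<String> keys) throws Exception;
    }

    /**
     * Runs one batch, retrying on throttling with a progressively growing delay.
     * Returns the number of retries used on success; if the retry budget is exhausted
     * the throttling exception propagates (the real change still publishes the retry
     * count in that case).
     */
    static int deleteWithRetries(BatchDelete delete, List<String> keys, Predicate<Exception> isThrottling)
        throws Exception {
        int retriesUsed = 0;
        while (true) {
            try {
                delete.run(keys);
                return retriesUsed; // success: publish the retry count to the histogram
            } catch (Exception e) {
                if (isThrottling.test(e) == false || retriesUsed >= MAX_RETRIES) {
                    throw e; // not throttling, or out of retries: fail the batch
                }
                retriesUsed++;
                // Progressive back-off: the delay grows with each retry, capped at a maximum.
                long delay = Math.min(INITIAL_DELAY_MILLIS * retriesUsed, MAX_DELAY_MILLIS);
                Thread.sleep(delay); // interrupt handling is discussed further down the thread
            }
        }
    }
}
```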

aex.set(ExceptionsHelper.useOrSuppress(aex.get(), e));
return;
} catch (AmazonClientException e) {
if (shouldRetryDelete(purpose) && RetryUtils.isThrottlingException(e)) {
Contributor Author

I used the AWS client's isThrottlingException. It looks like throttling exceptions come in many forms, and I assume that logic will be kept up to date.

int expectedNumberOfBatches = (blobsToDelete.size() / 1_000) + (blobsToDelete.size() % 1_000 == 0 ? 0 : 1);
assertThat(numberOfDeleteAttempts.get(), equalTo(throttleTimesBeforeSuccess + expectedNumberOfBatches));
assertThat(numberOfSuccessfulDeletes.get(), equalTo(expectedNumberOfBatches));
}
Contributor Author

This test class seemed very geared towards retrying input streams, but it sounded like the most appropriate home for this test.

@nicktindall nicktindall added the :Distributed Coordination/Snapshot/Restore and >bug labels Sep 20, 2024
@elasticsearchmachine
Collaborator

Hi @nicktindall, I've created a changelog YAML for you.

@DaveCTurner DaveCTurner left a comment

> I assume we don't want to change behaviour in the event of an interrupt. I've initially written it to abort the retries for the batch in progress and any subsequent batches (via the preserved interrupt flag), but to continue attempting to delete the remaining batches.

I think preserving the interrupt flag should abort subsequent operations on this thread, since these should be interruptible IO operations. That seems like the right behaviour to me.

> Do we want to restrict this behaviour to SNAPSHOT_DATA and SNAPSHOT_METADATA (or even just SNAPSHOT_DATA), or do we think it's desirable across the board? If we're applying it more broadly, perhaps we should limit how long we retry for?

I suspect we should limit retries everywhere, so that if the repository happens to be completely wedged then this snapshot thread will eventually go and do something else on a different repository. I think we should also drop and re-acquire the clientReference on each attempt, so that if the repository is closed then we'll fail sooner too.

> Do we want the interval and back-off to be configurable via settings?

Yes.

> I've put a histogram of how many attempts it takes to delete the batches, do we want something similar for how long?

I don't have a strong opinion on this.
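
As a minimal sketch of the interrupt handling being discussed (method name and structure are illustrative, not the actual code): the wait between retries restores the interrupt flag when interrupted and tells the caller to stop retrying, so the current batch fails with the throttling exception and later interruptible IO on the same thread aborts too.

```java
final class RetryWaits {
    // If the sleeping thread is interrupted, preserve the flag and tell the caller to
    // stop retrying; the current batch then fails with the throttling exception, and the
    // preserved flag makes subsequent interruptible IO on this thread abort as well.
    static boolean waitBeforeRetry(long delayMillis) {
        try {
            Thread.sleep(delayMillis);
            return true;   // waited the full delay, the caller may retry
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // keep the interrupt visible to later operations
            return false;  // abort retries for this batch (and, via the flag, later batches)
        }
    }
}
```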

assertThrows(
Exception.class, /* ? */
() -> blobContainer.deleteBlobsIgnoringIfNotExists(randomFrom(operationPurposesThatRetryOnDelete()), blobsToDelete.iterator())
);
@nicktindall nicktindall Sep 23, 2024

The idea here was to close the S3Service (mimicking the repository being closed) during the retries and observe that the delete aborted, but it wasn't obvious why that would work. It doesn't work in the test; perhaps something has been stubbed that changes the behaviour?

The S3Service#close method releases the cached clients, but I still seem to be able to create a new one after that's occurred, and the comments appear to indicate that's by design.

Perhaps we could set a flag when the S3BlobStore is closed and check that instead?

Contributor

Hm ok I see, bit weird that we are so lenient here and let you acquire a client with an arbitrary name. I'm not sure we really need a test for this, at least not specifically for this context.

private void deletePartition(
OperationPurpose purpose,
AmazonS3Reference clientReference,
AtomicReference<AmazonS3Reference> clientReferenceHolder,
List<String> partition,
AtomicReference<Exception> aex
) {
@nicktindall nicktindall Sep 23, 2024

I don't love this change in the API, but it seemed the lightest-touch way to allow the client reference to be re-acquired when a batch is retried while still allowing re-use between batches in the happy case. Open to moving to reference-per-batch if we think the extra allocations are fine.
The clients themselves are cached so perhaps reference-per-batch is OK. We'd be building the settings object for each acquisition though.

Contributor

Yeah reference-per-batch should be fine.
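
For illustration, reference-per-batch would look roughly like the following (the types and method names here are stand-ins, not the real AmazonS3Reference/S3BlobStore API): each batch acquires its own reference in a try-with-resources block, so a repository that has been closed is noticed at the next acquisition rather than partway through the remaining batches.

```java
import java.util.List;

// Stand-in types mirroring (but not matching) the real client-reference plumbing.
interface ClientReference extends AutoCloseable {
    void deleteBatch(List<String> keys) throws Exception;

    @Override
    void close(); // releasing the reference does not throw in this sketch
}

interface ClientSupplier {
    ClientReference acquire(); // assumed to fail if the repository has been closed
}

final class ReferencePerBatchSketch {
    // Acquire a fresh reference for every batch so that a closed repository fails fast
    // at the next acquisition instead of mid-way through the remaining batches.
    static void deleteAllBatches(ClientSupplier clients, List<List<String>> batches) throws Exception {
        for (List<String> batch : batches) {
            try (ClientReference client = clients.acquire()) {
                client.deleteBatch(batch);
            }
        }
    }
}
```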

repositoryName,
Settings.builder()
.put(repositorySettings(repositoryName))
.put(S3ClientSettings.MAX_RETRIES_SETTING.getConcreteSettingForNamespace("placeholder").getKey(), 0)
@nicktindall nicktindall Sep 24, 2024

It seems as though putting client settings in the repository settings like this is deprecated, and/or I've done it wrong. Any advice on how to override AWS client internal retries on a per-test basis would be appreciated.

I tried making a different client with that setting in the node settings (set at the class level), but there are a lot of additional settings configured for the test client that I'd need to duplicate to make that work.

@nicktindall nicktindall Sep 24, 2024

It feels like the correct way to do this would be to change the client settings in S3BlobStoreRepositoryTests to be configured for the default client, allowing them to be selectively overridden in individual tests. But that's a larger change. Perhaps there's an easier/better way?

Contributor

I think it's ok in tests, we don't have a plan for removing this apparently-deprecated functionality any time soon.

@nicktindall nicktindall marked this pull request as ready for review September 24, 2024 06:09
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@elasticsearchmachine elasticsearchmachine added the Team:Distributed (Obsolete) label Sep 24, 2024
@nicktindall nicktindall changed the title Retry unsuccessful snapshot deletions Retry throttled snapshot deletions Sep 24, 2024
@DaveCTurner DaveCTurner left a comment

Looks fine to me, I left a few tiny comments. We'll learn from experience whether this is enough to keep S3 happy or whether further adjustments are needed.

assertThat(handler.numberOfSuccessfulDeletes.get(), equalTo(0));
} finally {
// Clear the interrupt (this seemed to leak between tests)
Thread.interrupted();
Contributor

nit: can we assertTrue(Thread.interrupted()); instead?

Contributor Author

Fixed in 53dad4b

logger.warn("Aborting delete retries due to interrupt");
}
} else {
logger.warn("Exceeded maximum delete retries, aborting");
Contributor

Could we have some more detail in this message about what exactly we're aborting, and the retry strategy we were using?

Contributor Author

Fixed in 52a315b

Contributor

Sorry I meant more details like the number of times we retried, and refs to the settings that can be used to influence this and/or a link to the relevant docs. Otherwise IME we eventually get support cases asking for that information.

Which reminds me, we don't document these new settings either, but we should.

Contributor Author

I had a go at this in 3ed3bc4

@@ -302,4 +382,30 @@ public HttpHandler getDelegate() {
return delegate;
}
}

static class S3ErrorResponse {
Contributor

nit: looks like a record to me?

Contributor Author

Fixed in c579c56


S3MetricErroneousHttpHandler(HttpHandler delegate, Queue<RestStatus> errorStatusQueue) {
S3MetricErroneousHttpHandler(HttpHandler delegate, Queue<S3ErrorResponse> errorStatusQueue) {
Contributor

Naming nit:

Suggested change
S3MetricErroneousHttpHandler(HttpHandler delegate, Queue<S3ErrorResponse> errorStatusQueue) {
S3MetricErroneousHttpHandler(HttpHandler delegate, Queue<S3ErrorResponse> errorResponseQueue) {

Contributor Author

Fixed in b8467af

@DaveCTurner DaveCTurner left a comment

Looks good, just a handful of nits.


`throttled_delete_retry.maximum_delay`::

(integer) This is the upper bound on how long the delays between retries will grow to. Default is 500ms, minimum is 0ms.
Contributor

Likewise

Suggested change
(integer) This is the upper bound on how long the delays between retries will grow to. Default is 500ms, minimum is 0ms.
(<<time-units,time value>>) This is the upper bound on how long the delays between retries will grow to. Default is 500ms, minimum is 0ms.

Contributor Author

Fixed in 1e207e0

@@ -77,6 +77,25 @@ public void testExponentialBackOff() {
}
}

public void testLinearBackoffWithLimit() {
Contributor

Maybe also a test without the limit?

Contributor Author

Added in ab10b44
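
For reference, the delay sequences these tests exercise look roughly like this (a plain-Java sketch of the schedule, not the BackoffPolicy API itself): the delay grows by a fixed increment each retry, and the capped variant stops growing once it reaches the limit.

```java
import java.util.Arrays;
import java.util.stream.LongStream;

final class LinearBackoffSketch {
    // Linear back-off: the n-th retry waits n * incrementMillis, optionally capped at
    // maxDelayMillis. A negative maxDelayMillis means "no limit".
    static long[] delaysMillis(long incrementMillis, int retries, long maxDelayMillis) {
        return LongStream.rangeClosed(1, retries)
            .map(n -> n * incrementMillis)
            .map(d -> maxDelayMillis < 0 ? d : Math.min(d, maxDelayMillis))
            .toArray();
    }

    public static void main(String[] args) {
        // With a limit: 100, 200, 300, 400, 500, 500, 500, 500
        System.out.println(Arrays.toString(delaysMillis(100, 8, 500)));
        // Without a limit: 100, 200, ..., 800
        System.out.println(Arrays.toString(delaysMillis(100, 8, -1)));
    }
}
```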

);
static final Setting<TimeValue> RETRY_THROTTLED_DELETE_MAXIMUM_DELAY = Setting.timeSetting(
"throttled_delete_retry.maximum_delay",
new TimeValue(500, TimeUnit.MILLISECONDS),
Contributor

500ms feels a little short for my taste, I'd expect something more like 5s here.

Contributor Author

Fixed in 8868f2b

@DaveCTurner DaveCTurner left a comment

LGTM

@nicktindall nicktindall merged commit 16864e9 into elastic:main Oct 15, 2024
16 checks passed
@nicktindall nicktindall deleted the bugfix/ES-8562_retry_unsuccessful_snapshot_deletions branch October 15, 2024 22:08
georgewallace pushed a commit to georgewallace/elasticsearch that referenced this pull request Oct 25, 2024
Labels
>bug, :Distributed Coordination/Snapshot/Restore, Team:Distributed (Obsolete), v9.0.0