Retry throttled snapshot deletions #113237
Conversation
```java
    aex.set(ExceptionsHelper.useOrSuppress(aex.get(), e));
    return;
} catch (AmazonClientException e) {
    if (shouldRetryDelete(purpose) && RetryUtils.isThrottlingException(e)) {
```
I used the AWS client's `isThrottlingException`. It looks like throttling exceptions come in many forms, and I assume the SDK will keep this logic up to date.
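For context, the retry shape being discussed is roughly the following sketch, not the exact PR code; `deleteWithRetries`, `waitFor`, and the `backoff` iterator are illustrative names, while `AmazonS3Reference`, `RetryUtils.isThrottlingException`, and `ExceptionsHelper.useOrSuppress` come from the surrounding code:

```java
// Minimal sketch: retry a bulk delete only when the failure is a throttling exception
// and the operation purpose opts in to retries.
private void deleteWithRetries(OperationPurpose purpose, AmazonS3Reference clientReference,
                               DeleteObjectsRequest deleteRequest, Iterator<TimeValue> backoff,
                               AtomicReference<Exception> aex) {
    while (true) {
        try {
            clientReference.client().deleteObjects(deleteRequest);
            return; // batch deleted successfully
        } catch (AmazonClientException e) {
            if (shouldRetryDelete(purpose) && RetryUtils.isThrottlingException(e) && backoff.hasNext()) {
                waitFor(backoff.next()); // hypothetical helper sleeping for the next back-off delay
            } else {
                // not retryable (or retries exhausted): record the failure and give up on this batch
                aex.set(ExceptionsHelper.useOrSuppress(aex.get(), e));
                return;
            }
        }
    }
}
```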
```java
int expectedNumberOfBatches = (blobsToDelete.size() / 1_000) + (blobsToDelete.size() % 1_000 == 0 ? 0 : 1);
assertThat(numberOfDeleteAttempts.get(), equalTo(throttleTimesBeforeSuccess + expectedNumberOfBatches));
assertThat(numberOfSuccessfulDeletes.get(), equalTo(expectedNumberOfBatches));
}
```
This test class seemed very geared towards retrying the input stream, but it sounded like the most appropriate home for this.
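As an aside, the expected-batch arithmetic above is just a ceiling division over S3's 1,000-keys-per-request bulk-delete limit; on a recent JDK it could equivalently be written as:

```java
// Equivalent ceiling division (Math.ceilDiv is available from Java 18 onwards)
int expectedNumberOfBatches = Math.ceilDiv(blobsToDelete.size(), 1_000);
```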
Hi @nicktindall, I've created a changelog YAML for you.
I assume we don't want to change behaviour in the event of an interrupt. I've written it initially to abort the retries for any batch in progress and any subsequent batches (via the preserved interrupt flag), but to continue attempting to delete the remaining batches. I think preserving the interrupt flag should abort subsequent operations on this thread, since these should be interruptible IO operations. That seems like the right behaviour to me.
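Roughly, the interrupt handling described here looks like the following sketch (not the exact PR code; `backoff` is an illustrative delay iterator, and the log message matches the one quoted further down):

```java
try {
    Thread.sleep(backoff.next().millis());
} catch (InterruptedException e) {
    // Restore the flag so subsequent interruptible IO on this thread fails fast,
    // then stop retrying the current batch.
    Thread.currentThread().interrupt();
    logger.warn("Aborting delete retries due to interrupt");
    return;
}
```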
Do we want to restrict this behaviour to `SNAPSHOT_DATA` and `SNAPSHOT_METADATA` (or even just `SNAPSHOT_DATA`), or do we think it's desirable across the board? If we're applying it more broadly, perhaps we should limit how long we retry for?
I suspect we should limit retries everywhere, so that if the repository happens to be completely wedged then this snapshot thread will eventually go and do something else on a different repository. I think we should also drop and re-acquire the `clientReference` on each attempt, so that if the repository is closed then we'll fail sooner too.
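For reference, the purpose-gating under discussion could look something like this (a sketch only; the actual predicate in the PR may differ):

```java
// Only snapshot-related deletes opt in to the throttled-delete retries.
private static boolean shouldRetryDelete(OperationPurpose purpose) {
    return purpose == OperationPurpose.SNAPSHOT_DATA || purpose == OperationPurpose.SNAPSHOT_METADATA;
}
```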
Do we want the interval and back-off to be configurable via settings?
Yes.
I've added a histogram of how many attempts it takes to delete the batches; do we want something similar for how long the deletes take?
I don't have a strong opinion on this.
```java
assertThrows(
    Exception.class, /* ? */
    () -> blobContainer.deleteBlobsIgnoringIfNotExists(randomFrom(operationPurposesThatRetryOnDelete()), blobsToDelete.iterator())
);
```
The idea here was to close the `S3Service` (mimicking the repository being closed) during the retries and observe that the delete aborted, but it wasn't obvious why that would work, and indeed it doesn't work in the test; perhaps something has been stubbed that changes the behaviour? The `S3Service#close` method releases the cached clients, but I still seem to be able to create a new one after that's occurred, and the comments appear to indicate that's by design. Perhaps we could set a flag when the `S3BlobStore` is closed and check that instead?
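A possible shape for that flag-based check, purely illustrative (none of this is in the PR as-is):

```java
private final AtomicBoolean closed = new AtomicBoolean();

@Override
public void close() throws IOException {
    closed.set(true);
    // ... existing close logic ...
}

// Checked between retry attempts so a closed blob store aborts the delete promptly.
private void ensureNotClosed() {
    if (closed.get()) {
        throw new IllegalStateException("S3BlobStore is closed, aborting delete retries");
    }
}
```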
Hm ok I see, bit weird that we are so lenient here and let you acquire a client with an arbitrary name. I'm not sure we really need a test for this, at least not specifically for this context.
```diff
 private void deletePartition(
     OperationPurpose purpose,
-    AmazonS3Reference clientReference,
+    AtomicReference<AmazonS3Reference> clientReferenceHolder,
     List<String> partition,
     AtomicReference<Exception> aex
 ) {
```
I don't love this change to the API, but it seemed the lightest-touch way to allow the client reference to be re-acquired when a batch is retried, while still allowing re-use between batches in the happy case. I'm open to moving to a reference-per-batch if we think the extra allocations are fine.
The clients themselves are cached, so perhaps reference-per-batch is OK. We'd be building the settings object for each acquisition, though.
Yeah reference-per-batch should be fine.
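A reference-per-batch version might look roughly like this (a sketch assuming the existing `blobStore.clientReference()` accessor and the pre-change `deletePartition` signature; not the final code):

```java
for (List<String> partition : partitions) {
    // Acquire a fresh ref-counted client for each batch so a closed repository fails fast,
    // at the cost of one extra acquisition (and settings build) per batch.
    try (AmazonS3Reference clientReference = blobStore.clientReference()) {
        deletePartition(purpose, clientReference, partition, aex);
    }
}
```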
(Resolved review threads on modules/repository-s3/src/main/java/org/elasticsearch/repositories/s3/S3BlobStore.java and modules/repository-s3/src/main/java/org/elasticsearch/repositories/s3/S3Repository.java.)
```java
repositoryName,
Settings.builder()
    .put(repositorySettings(repositoryName))
    .put(S3ClientSettings.MAX_RETRIES_SETTING.getConcreteSettingForNamespace("placeholder").getKey(), 0)
```
It seems as though putting client settings in the repository settings like this is deprecated, and/or I've done it wrong. Any advice on how to override the AWS client's internal retries on a per-test basis would be appreciated. I tried making a different client with that setting in the node settings (set at the class level), but there are a lot of additional settings configured for the test client that I'd need to duplicate to make that work.
It feels like the correct way to do this would be to change the client settings in `S3BlobStoreRepositoryTests` to be configured for the `default` client, allowing them to be selectively overridden in individual tests. But that's a larger change. Perhaps there's an easier/better way?
I think it's ok in tests, we don't have a plan for removing this apparently-deprecated functionality any time soon.
Pinging @elastic/es-distributed (Team:Distributed)
Looks fine to me, I left a few tiny comments. We'll learn from experience whether this is enough to keep S3 happy or whether further adjustments are needed.
```java
    assertThat(handler.numberOfSuccessfulDeletes.get(), equalTo(0));
} finally {
    // Clear the interrupt (this seemed to leak between tests)
    Thread.interrupted();
```
nit: can we `assertTrue(Thread.interrupted());` instead?
Fixed in 53dad4b
logger.warn("Aborting delete retries due to interrupt"); | ||
} | ||
} else { | ||
logger.warn("Exceeded maximum delete retries, aborting"); |
Could we have some more detail in this message about what exactly we're aborting, and the retry strategy we were using?
Fixed in 52a315b
Sorry, I meant more details like the number of times we retried, and references to the settings that can be used to influence this and/or a link to the relevant docs. Otherwise, in my experience, we eventually get support cases asking for that information.
Which reminds me: we don't document these new settings either, but we should.
I had a go at this in 3ed3bc4
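For illustration, a more informative abort message along the lines requested might look like this; the wording, the variables, and every setting key other than `throttled_delete_retry.maximum_delay` are assumptions rather than the committed code:

```java
logger.warn(
    "Aborting delete of [{}] blobs after [{}] throttled attempts; "
        + "retry behaviour is controlled by the [throttled_delete_retry.*] repository settings "
        + "(e.g. [throttled_delete_retry.maximum_delay])",
    partition.size(),
    retryCount
);
```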
```diff
@@ -302,4 +382,30 @@ public HttpHandler getDelegate() {
             return delegate;
         }
     }
+
+    static class S3ErrorResponse {
```
nit: looks like a `record` to me?
Fixed in c579c56
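The record form is essentially a one-liner; something like the following, though the exact fields carried alongside the `RestStatus` are an assumption here:

```java
// A status plus an optional error body is all the erroneous handler needs to replay.
record S3ErrorResponse(RestStatus status, String responseBody) {}
```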
```diff
-    S3MetricErroneousHttpHandler(HttpHandler delegate, Queue<RestStatus> errorStatusQueue) {
+    S3MetricErroneousHttpHandler(HttpHandler delegate, Queue<S3ErrorResponse> errorStatusQueue) {
```
Naming nit:
```diff
-    S3MetricErroneousHttpHandler(HttpHandler delegate, Queue<S3ErrorResponse> errorStatusQueue) {
+    S3MetricErroneousHttpHandler(HttpHandler delegate, Queue<S3ErrorResponse> errorResponseQueue) {
```
Fixed in b8467af
Looks good, just a handful of nits.
```asciidoc
`throttled_delete_retry.maximum_delay`::

    (integer) This is the upper bound on how long the delays between retries will grow to. Default is 500ms, minimum is 0ms.
```
Likewise:
```diff
-(integer) This is the upper bound on how long the delays between retries will grow to. Default is 500ms, minimum is 0ms.
+(<<time-units,time value>>) This is the upper bound on how long the delays between retries will grow to. Default is 500ms, minimum is 0ms.
```
Fixed in 1e207e0
```diff
@@ -77,6 +77,25 @@ public void testExponentialBackOff() {
         }
     }

+    public void testLinearBackoffWithLimit() {
```
Maybe also a test without the limit?
Added in ab10b44
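Something like the following could cover the unlimited case; this is only a sketch, and both the `linearBackoff` factory signature and the use of `TimeValue.MAX_VALUE` to mean "no effective cap" are assumptions rather than the actual API added in the PR:

```java
public void testLinearBackoffWithoutLimit() {
    TimeValue increment = TimeValue.timeValueMillis(50);
    Iterator<TimeValue> delays = BackoffPolicy.linearBackoff(increment, 10, TimeValue.MAX_VALUE).iterator();
    for (int i = 1; i <= 10; i++) {
        // With no effective cap the delays should keep growing linearly.
        assertEquals(increment.millis() * i, delays.next().millis());
    }
    assertFalse(delays.hasNext());
}
```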
```java
);
static final Setting<TimeValue> RETRY_THROTTLED_DELETE_MAXIMUM_DELAY = Setting.timeSetting(
    "throttled_delete_retry.maximum_delay",
    new TimeValue(500, TimeUnit.MILLISECONDS),
```
`500ms` feels a little short for my taste, I'd expect something more like `5s` here.
Fixed in 8868f2b
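With the suggested default, the setting would look roughly like this; the minimum-value argument mirrors the "minimum is 0ms" in the docs snippet above, and the exact declaration in the PR may differ:

```java
static final Setting<TimeValue> RETRY_THROTTLED_DELETE_MAXIMUM_DELAY = Setting.timeSetting(
    "throttled_delete_retry.maximum_delay",
    TimeValue.timeValueSeconds(5),   // raised from 500ms per review feedback
    TimeValue.timeValueMillis(0)     // documented minimum
);
```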
(Two outdated, now resolved, review threads on modules/repository-s3/src/main/java/org/elasticsearch/repositories/s3/S3Repository.java.)
LGTM
The change is: when the `OperationPurpose` is `SNAPSHOT_DATA` or `SNAPSHOT_METADATA`, we will retry deletes when throttled, with a progressive back-off, up to (by default) 8 times over about 5 seconds. If the retries are interrupted, we abort with an `AbortedException`, which will propagate wrapped in an `IOException`.

Closes ES-8562