
Conversation

@astefan (Contributor) commented Feb 28, 2022

In case of an exception (a partial shard failure, for example), the circuit breaker's used-bytes tracker was not cleared. The same goes for some of the data structures used by the EQL sequence algorithm.
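The change wraps the result listener so that the failure branch also cleans up, mirroring the shape visible in the diff further down. A minimal sketch of that pattern, where circuitBreaker, trackedBytes and sequenceState are illustrative names rather than the actual fields:

// hypothetical sketch: on failure, give back the bytes reserved on the breaker,
// drop the per-sequence state, then propagate the exception
ActionListener<Payload> wrapped = ActionListener.wrap(listener::onResponse, e -> {
    circuitBreaker.addWithoutBreaking(-trackedBytes);
    sequenceState.clear();
    listener.onFailure(e);
});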

@elasticmachine (Collaborator)

Pinging @elastic/es-ql (Team:QL)

@Luegg (Contributor) left a comment

👍 to the fix itself but I left some comments concerning the test.

The relative complexity of the test also made me wonder whether it might be better written as an integration test, using a search plugin similar to SearchBlockPlugin to emulate the search failure.

}

private void assertMemoryCleared(int sequenceFiltersCount, PIT_ID_TYPE failureId) {
List<BreakerSettings> eqlBreakerSettings = Collections.singletonList(
Contributor

Why do you need the original EQL circuit breaker here? Couldn't the test also use something like EqlTestCircuitBreaker that just accumulates the values? This would also allow keeping CIRCUIT_BREAKER_NAME etc. private.
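For illustration, such a test breaker could extend NoopCircuitBreaker and simply record the deltas. A rough sketch under that assumption (only the accumulating method is shown; EqlTestCircuitBreaker is a hypothetical class suggested above, not one that exists in the codebase):

import java.util.concurrent.atomic.AtomicLong;
import org.elasticsearch.common.breaker.NoopCircuitBreaker;

// hypothetical accumulating breaker for the test: never trips, only records usage
class EqlTestCircuitBreaker extends NoopCircuitBreaker {
    final AtomicLong used = new AtomicLong();

    EqlTestCircuitBreaker() {
        super("eql_test");
    }

    @Override
    public long addWithoutBreaking(long bytes) {
        // record reservations and releases; the estimate-and-break variant would be overridden similarly
        return used.addAndGet(bytes);
    }
}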

Contributor Author

Using the test breaker doesn't reflect the original breaker's behavior, and these tests depend on that behavior (not fully, though: they don't check the exact calculations it does).

assertMemoryCleared(stages, PIT_ID_TYPE.FAILURE_ID);
}

private void assertMemoryCleared(int sequenceFiltersCount, PIT_ID_TYPE failureId) {
Contributor

failureId is never used

Contributor Author

Nicely spotted.

eqlBreakerSettings,
new ClusterSettings(Settings.EMPTY, ClusterSettings.BUILT_IN_CLUSTER_SETTINGS)
);
PitMockClient esClient = new PitMockClient(getTestName(), service.getBreaker(CIRCUIT_BREAKER_NAME), PIT_ID_TYPE.FAILURE_ID)
Contributor

Controlling the behavior of the client through the PIT ID seems surprising to me. Wouldn't it be more straightforward to use a flag or different instances?

Contributor Author

That specific line you commented on should in fact be (as you spotted earlier, failureId is not used): PitMockClient esClient = new PitMockClient(getTestName(), service.getBreaker(CIRCUIT_BREAKER_NAME), failureId).

I found that the value of the PIT ID is simple and flexible enough to allow me to run two different tests (success and failure) and also to control the behavior of the client's responses across multiple requests for one of the PIT ID types.

Can you detail and/or suggest more specific code changes to use a flag or different (client, I assume) instances?

Contributor

What I'm mostly thinking of is to separate the various concerns packed into the PIT ID right now:

  • What type of client is used by the test (failing vs. succeeding)
  • Keeping track of the client state (how many requests have been answered)
  • Being the PIT ID itself

Instantiating PitMockClient with a succeeding flag (or using two different implementations) can address the first concern. I'm not entirely sure about the client state but that can probably be addressed by a field in the client. This would also resolve the issue mentioned in https://github.com/elastic/elasticsearch/pull/84451/files#r816743228.
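Something along these lines, for instance. Purely illustrative: the flag, the counter field and this constructor shape are not in the current test, they are a sketch of the separation described above:

// hypothetical shape: success/failure comes from an explicit flag, request count lives in a field
private final CircuitBreaker breaker;
private final boolean succeedSearches;
private final AtomicInteger searchRequests = new AtomicInteger();

PitMockClient(String testName, CircuitBreaker breaker, boolean succeedSearches) {
    super(testName);
    this.breaker = breaker;
    this.succeedSearches = succeedSearches;
}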

ShardSearchFailure.EMPTY_ARRAY,
SearchResponse.Clusters.EMPTY,
// copy the pit from the request and "increment" its "+" sign
requestPitId.equals(PIT_ID_TYPE.FAILURE_ID.pitId) ? requestPitId + "+" : null
Contributor

I guess this works because EQL pipes the PIT id from the previous response into the next request. That's not a requirement of the PIT APIs though. A PIT user might also use the PIT id returned from the initial OpenPointInTimeRequest for all subsequent search requests. Not that this invalidates the test but it's imposing a certain use of the PIT APIs on the implementation.

Couldn't the client internally keep track of the number of requests instead?
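For example, the mock could hand out one fixed PIT id and decide from an internal counter which response carries the shard failure. A sketch only, reusing the hypothetical succeedSearches flag and searchRequests counter from the earlier suggestion:

// hypothetical: fail only the second search request, regardless of which PIT id the caller sends back
if (request instanceof SearchRequest) {
    boolean failThisOne = succeedSearches == false && searchRequests.incrementAndGet() == 2;
    // build the SearchResponse with or without shard failures based on failThisOne,
    // always returning the same PIT id that the OpenPointInTimeRequest handed out
}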

pitContextCounter.incrementAndGet();
OpenPointInTimeResponse response = new OpenPointInTimeResponse(pitId);
listener.onResponse((Response) response);
return;
Contributor

The multiple return statements can also be avoided by moving super.doExecute(action, request, listener); below into an else branch. This would also reduce the risk of accidentally calling the listener twice.
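Roughly this shape, using the constructors already present in the test; the search-request branch is omitted and the close branch is a guess at the test's behaviour:

if (request instanceof OpenPointInTimeRequest) {
    pitContextCounter.incrementAndGet();
    listener.onResponse((Response) new OpenPointInTimeResponse(pitId));
} else if (request instanceof ClosePointInTimeRequest) {
    pitContextCounter.decrementAndGet();
    listener.onResponse((Response) new ClosePointInTimeResponse(true, 1));
} else {
    // single fall-through, no early returns: the listener is completed exactly once per branch
    super.doExecute(action, request, listener);
}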

@costin (Member) left a comment

Left a comment about centralizing the call. Please add some comments to the test describing what it tries to do and, more importantly, how it goes about it and how it is set up.

log.trace("Starting sequence window w/ fetch size [{}]", windowSize);
startTime = System.currentTimeMillis();
tumbleWindow(0, listener);
tumbleWindow(0, wrap(listener::onResponse, e -> {
Member

I prefer the close method to be called in one central place: right now it's already called in two places, and this fix adds a third one for the exception branch.
There should be only one call; consider even removing the method itself so it cannot be reused.
Something along the lines of:

tumbleWindow(0, runAfter(listener, close(listener)));

This should guarantee that after the listener executes, the clean-up is done both on onResponse and onFailure. It's worth checking whether onFailure can be called multiple times and, if not, changing the close() method to not forward the exception.
Also, I would remove the close(listener) method and simply inline its code in the lambda above, to avoid it being called from different places, since that shouldn't be the case.
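For reference, a compiling variant of that suggestion could look like the following; this is hypothetical, with the lambda body standing in for the inlined close() logic:

// wrap the listener once so the clean-up runs after both onResponse and onFailure
tumbleWindow(0, ActionListener.runAfter(listener, () -> {
    // inline the former close() body here: release breaker-tracked bytes,
    // clear the sequence state and close the point in time
}));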

@astefan (Contributor Author) commented Mar 7, 2022

@elasticmachine update branch

@astefan astefan requested review from costin and Luegg March 8, 2022 07:17
@costin (Member) left a comment

LGTM

@bpintea (Contributor) left a comment

nice testing, lgtm.

listener.onResponse((Response) response);
} else if (request instanceof ClosePointInTimeRequest) {
ClosePointInTimeResponse response = new ClosePointInTimeResponse(true, 1);
assert pitContextCounter.get() > 0;
Contributor

did you choose the assert (here and below) intentionally?

Contributor Author

I thought of it as a stronger statement about the pit opening/closing.
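For context, a plain assert in the mock only fires when the JVM runs with assertions enabled (which the Elasticsearch test runner does), whereas a Hamcrest assertion would fail unconditionally, e.g.:

assertThat(pitContextCounter.get(), greaterThan(0));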

@astefan astefan merged commit 4c243d6 into elastic:master Mar 9, 2022
@astefan astefan deleted the clear_circuit_break_mem branch March 9, 2022 12:43
@astefan astefan added the auto-backport Automatically create backport pull requests when merged label Mar 9, 2022
astefan added a commit that referenced this pull request Mar 9, 2022
…r used bytes in case of exception (#84451)

(cherry picked from commit 4c243d6)
astefan added a commit that referenced this pull request Mar 9, 2022
…r used bytes in case of exception (#84451)

(cherry picked from commit 4c243d6)
astefan added a commit that referenced this pull request Mar 9, 2022
…r used bytes in case of exception (#84451) (#84804)

(cherry picked from commit 4c243d6)
astefan added a commit that referenced this pull request Mar 9, 2022
…r used bytes in case of exception (#84451) (#84803)

(cherry picked from commit 4c243d6)
@astefan astefan added v7.17.2 and removed auto-backport Automatically create backport pull requests when merged labels Mar 9, 2022
astefan added a commit to astefan/elasticsearch that referenced this pull request Mar 9, 2022
…r used bytes in case of exception (elastic#84451)

(cherry picked from commit 4c243d6)
astefan added a commit that referenced this pull request Mar 10, 2022
…r used bytes in case of exception (#84451) (#84821)

(cherry picked from commit 4c243d6)