[CELEBORN-1917] supports celeborn.client.push.maxBytesSizeInFlight #3248
What changes were proposed in this pull request?
Add a size limit on in-flight push data by introducing new configurations:
celeborn.client.push.maxBytesInFlight.perWorker/total
They default to celeborn.client.push.buffer.max.size * celeborn.client.push.maxReqsInFlight.perWorker/total.
For backward compatibility, also add a control:
celeborn.client.push.maxReqsInFlight.enabled
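As a rough illustration of the idea, the sketch below derives a default byte limit as buffer size times request limit and gates each push on both the request count and the bytes in flight. It is not the code from this patch: the class name, method names, and the rule that admits a single oversized request when nothing is in flight are all assumptions made for the example.

```java
/**
 * Hypothetical sketch of a push limiter bounded by request count and by bytes.
 * Not Celeborn's actual implementation; all names here are illustrative.
 */
class InFlightLimiterSketch {
    private final int maxReqsInFlight;      // e.g. maxReqsInFlight.perWorker
    private final long maxBytesInFlight;    // e.g. maxBytesInFlight.perWorker
    private final boolean reqsLimitEnabled; // e.g. maxReqsInFlight.enabled

    private int reqsInFlight = 0;
    private long bytesInFlight = 0L;

    InFlightLimiterSketch(int maxReqsInFlight, long maxBytesInFlight, boolean reqsLimitEnabled) {
        this.maxReqsInFlight = maxReqsInFlight;
        this.maxBytesInFlight = maxBytesInFlight;
        this.reqsLimitEnabled = reqsLimitEnabled;
    }

    /** Default byte limit as described above: buffer max size times request limit. */
    static long defaultMaxBytesInFlight(long pushBufferMaxSize, int maxReqsInFlight) {
        return pushBufferMaxSize * maxReqsInFlight;
    }

    /** Returns true and records the request if it fits under both limits. */
    synchronized boolean tryAcquire(long requestBytes) {
        boolean overReqs = reqsLimitEnabled && reqsInFlight >= maxReqsInFlight;
        // Assumption: when nothing is in flight, admit the request even if it
        // alone exceeds the byte limit, so an oversized request cannot stall forever.
        boolean overBytes = bytesInFlight > 0
            && bytesInFlight + requestBytes > maxBytesInFlight;
        if (overReqs || overBytes) {
            return false;
        }
        reqsInFlight++;
        bytesInFlight += requestBytes;
        return true;
    }

    /** Called when a push request completes, successfully or not. */
    synchronized void release(long requestBytes) {
        reqsInFlight--;
        bytesInFlight -= requestBytes;
    }

    synchronized long bytesInFlight() {
        return bytesInFlight;
    }
}
```

With illustrative values of a 64 KiB buffer and 32 requests, the derived limit is 2 MiB, so a single 2,000,000-byte vectorized batch nearly fills the budget and further pushes are held back until it is released, even though only one request is in flight.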
Why are the changes needed?
Celeborn supports limiting the number of in-flight push requests via
celeborn.client.push.maxReqsInFlight.perWorker/total.
This is a good constraint on memory usage when most requests do not exceed celeborn.client.push.buffer.max.size.
However, in a vectorized shuffle (as in Blaze and Gluten), a request may be much larger than the max buffer size, leading to too much in-flight data and resulting in OOM.
Does this PR introduce any user-facing change?
Yes, it adds new client configurations.
How was this patch tested?
Tested in a local environment.