Skip to content

[CELEBORN-1917] supports celeborn.client.push.maxBytesSizeInFlight #3248

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

DDDominik
Copy link

What changes were proposed in this pull request?

add data size limitation to inflight data by introducing a new configuration: celeborn.client.push.maxBytesInFlight.perWorker/total and defaults to celeborn.client.push.buffer.max.size * celeborn.client.push.maxReqsInFlight.perWorker/total.
for backward compatibility, also add a control: celeborn.client.push.maxReqsInFlight.enabled.

Why are the changes needed?

celeborn do supports limiting the number of push inflight requests via celeborn.client.push.maxReqsInFlight.perWorker/total. this is a good constraint to memory usage where most requests do not exceed celeborn.client.push.buffer.max.size. however, in a vectorized shuffle (like blaze and gluten), a request might be greatly larger then the max buffer size, leading to too much inflight data and results OOM.

Does this PR introduce any user-facing change?

Yes, add new config for client

How was this patch tested?

test on local env

@@ -67,8 +67,8 @@ public int nextBatchId() {
return inFlightRequestTracker.nextBatchId();
}

public void addBatch(int batchId, String hostAndPushPort) {
inFlightRequestTracker.addBatch(batchId, hostAndPushPort);
public void addBatch(int batchId, int batchBytesSize, String hostAndPushPort) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also update the invoker in FlinkShuffleClientImpl, then check the CI result.

@mridulm
Copy link
Contributor

mridulm commented May 9, 2025

+CC @venkata91, @rmcyang

@SteNicholas
Copy link
Member

@DDDominik, any update?

@SteNicholas SteNicholas force-pushed the main branch 2 times, most recently from 5590ef0 to 0dffcf6 Compare May 26, 2025 09:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants