Skip to content

CELEBORN-1991: add retries for registerShuffle on the lifecycleManager #3238

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

shahryart
Copy link

@shahryart shahryart commented Apr 28, 2025

What changes were proposed in this pull request?

Add retries to registerShuffle for lifecycleManager to address failures from FallbackPolicyRunner and/or other manager issues. While we do retry connection failures to manager, this addresses failures such as no active manager for a short period of time as well.

We add three separate configs:

  1. retryEnabled
  2. maxRetries
  3. retryWaitTime

Why are the changes needed?

We were running into issues in production with manager not being active for a short period of time, which results in application failures due to registerShuffle failure.
Adding retries fixes these failures.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing unit tests.
Running jobs in our pre-prod environment.

@turboFei
Copy link
Member

turboFei commented Apr 28, 2025

There are already too many configs in celeborn.

We can consider to reuse them first.

Please check

  val CLIENT_REGISTER_SHUFFLE_MAX_RETRIES: ConfigEntry[Int] =
    buildConf("celeborn.client.registerShuffle.maxRetries")
      .withAlternative("celeborn.shuffle.register.maxRetries")
      .categories("client")
      .version("0.3.0")
      .doc("Max retry times for client to register shuffle.")
      .intConf
      .createWithDefault(3)

  val CLIENT_REGISTER_SHUFFLE_RETRY_WAIT: ConfigEntry[Long] =
    buildConf("celeborn.client.registerShuffle.retryWait")
      .withAlternative("celeborn.shuffle.register.retryWait")
      .categories("client")
      .version("0.3.0")
      .doc("Wait time before next retry if register shuffle failed.")
      .timeConf(TimeUnit.MILLISECONDS)
      .createWithDefaultString("3s")


  val CLIENT_RPC_MAX_RETIRES: ConfigEntry[Int] =
    buildConf("celeborn.client.rpc.maxRetries")
      .categories("client")
      .version("0.3.2")
      .doc("Max RPC retry times in client.")
      .intConf
      .createWithDefault(3)

  val CLIENT_RPC_MAX_RETIRES: ConfigEntry[Int] =
    buildConf("celeborn.client.rpc.maxRetries")
      .categories("client")
      .version("0.3.2")
      .doc("Max RPC retry times in client.")
      .intConf
      .createWithDefault(3)


Copy link

This PR is stale because it has been open 20 days with no activity. Remove stale label or comment or this will be closed in 10 days.

@github-actions github-actions bot added stale and removed stale labels May 19, 2025
@SteNicholas SteNicholas force-pushed the main branch 2 times, most recently from 5590ef0 to 0dffcf6 Compare May 26, 2025 09:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants