Skip to content

[CELEBORN-1903] Enable connection heartbeat and close idle connection #3148

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 11 commits into
base: main
Choose a base branch
from

Conversation

Austinfjq
Copy link
Contributor

What changes were proposed in this pull request?

Add new configs to enable connection heartbeat and close idle connection.

Why are the changes needed?

Enable client side config which will allow all connections to enable heartbeat and detect connection failure.

Does this PR introduce any user-facing change?

Yes. Added 2 new configs.

How was this patch tested?

Tested in a Flink batch job.

@Austinfjq Austinfjq changed the title Enable connection heartbeat and close idle connection [CELEBORN-1903] Enable connection heartbeat and close idle connection Mar 12, 2025
Copy link

github-actions bot commented Apr 1, 2025

This PR is stale because it has been open 20 days with no activity. Remove stale label or comment or this will be closed in 10 days.

@github-actions github-actions bot added the stale label Apr 1, 2025
Copy link

This issue was closed because it has been staled for 10 days with no activity.

@FMX
Copy link
Contributor

FMX commented Apr 28, 2025

@Austinfjq Hi, can you fix the UTs first? This PR will be helpful.

@Austinfjq
Copy link
Contributor Author

@FMX Hi, I pushed the fix. PTAL.

@turboFei turboFei requested a review from Copilot April 29, 2025 22:58
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR enables connection heartbeat and the closing of idle connections via new configuration options for different modules.

  • Added configuration options for idle connection closure and heartbeat enabling in the docs.
  • Introduced new methods in TransportConf and updated TransportContext to account for the new configurations.
  • Added a logging statement in TransportChannelHandler to report when heartbeat is enabled.

Reviewed Changes

Copilot reviewed 4 out of 5 changed files in this pull request and generated no comments.

File Description
docs/configuration/network.md Added new configuration options for closing idle connections and enabling heartbeat
common/src/main/java/org/apache/celeborn/common/network/util/TransportConf.java Added new methods channelHeartbeatEnabled() and closeIdleConnections() to access the new configs
common/src/main/java/org/apache/celeborn/common/network/server/TransportChannelHandler.java Inserted logging to indicate heartbeat enabling
common/src/main/java/org/apache/celeborn/common/network/TransportContext.java Updated initialization logic for idle connections and heartbeat settings
Files not reviewed (1)
  • common/src/main/scala/org/apache/celeborn/common/CelebornConf.scala: Language not supported
Comments suppressed due to low confidence (1)

docs/configuration/network.md:22

  • Typo in the description: 'by by default' should be corrected to 'by default'.
| celeborn.<module>.closeIdleConnections | false | false | Should idle connections be closed by by default for the module. "<module>" could be client, worker, etc. If you would like to provide a fallback value for all modules, please omit the ".<module>" and use "celeborn.closeIdleConnections". | 0.6.0 |  |

@SteNicholas
Copy link
Member

SteNicholas commented Apr 30, 2025

@Austinfjq, IMO, we could deprecate celeborn.worker.push.heartbeat.enabled and celeborn.worker.fetch.heartbeat.enabled and introduce celeborn.<module>.heartbeat.enabled instead. Meanwhile, it's better to introduce celeborn.<module>.closeIdleConnections to deprecate celeborn.worker.closeIdleConnections and celeborn.client.closeIdleConnections.
BTW, the heartbeat problem is fixed in #3239. PTAL.

@@ -86,12 +86,19 @@ public TransportContext(
AbstractSource source) {
this.conf = conf;
this.msgHandler = msgHandler;
this.closeIdleConnections = closeIdleConnections;
this.closeIdleConnections = closeIdleConnections || conf.closeIdleConnections();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's better to change the invoke of this contructor instead of changing here.

@SteNicholas
Copy link
Member

@Austinfjq, any update? Please resolve the conflict of the latest main branch.

@turboFei
Copy link
Member

Hi @Austinfjq We plan to release 0.6.0 soon, could you address the comments?

@SteNicholas SteNicholas force-pushed the main branch 2 times, most recently from 5590ef0 to 0dffcf6 Compare May 26, 2025 09:57
Austinfjq added 3 commits May 28, 2025 14:32
…eartbear

# Conflicts:
#	common/src/main/scala/org/apache/celeborn/common/CelebornConf.scala
#	docs/configuration/network.md
@Austinfjq
Copy link
Contributor Author

@SteNicholas Thank you for the fix in #3239. Sorry for the late reply. Updated the pull request.
cc @turboFei

@turboFei
Copy link
Member

turboFei commented May 29, 2025

could you check the CI failure?

s0nskar and others added 5 commits May 29, 2025 11:27
### What changes were proposed in this pull request?
Support min number of workers to assign slots on for a shuffle.

### Why are the changes needed?

PR apache#3039 updated the default value of `celeborn.master.slot.assign.extraSlots` to avoid skew in shuffle stage with less number of reducers. However, it will also affect the stage with large number of reducers, thus not ideal.

We are introducing a new config `celeborn.master.slot.assign.minWorkers` which will ensure that shuffle stages with less number of reducers will not cause load imbalance on few nodes.

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
NA

Closes apache#3297 from s0nskar/min_workers.

Authored-by: Sanskar Modi <[email protected]>
Signed-off-by: SteNicholas <[email protected]>
### What changes were proposed in this pull request?

Support dependencies of `spark-4.0` profile.

Follow up apache#3282.

### Why are the changes needed?

apache#3282 is lack of dependencies support of `spark-4.0` profile.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Dependencies check: maven-jdk17 (spark-4.0).

Closes apache#3298 from SteNicholas/CELEBORN-1413.

Authored-by: SteNicholas <[email protected]>
Signed-off-by: SteNicholas <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants