Set max memory of SortBasedPusher based off Spark configs #3203

helenweng-stripe · 2025-04-07T21:45:23Z

What changes were proposed in this pull request?

Set the max memory threshold to the memory actually allocated to the task. The value is reverse-calculated from how Spark determines it, since Spark's TaskMemoryManager does not expose how much memory is available to a task.

The calculation depends on whether the memory mode is on-heap or off-heap:

((spark.memory.offHeap.size or spark.executor.memory) - reservedMemory (hardcoded to )) * spark.memory.fraction * celeborn.client.spark.push.sort.memory.maxMemoryFactor / spark.executor.cores

Based on the calculations here: https://github.com/apache/spark/blob/branch-3.3/core/src/main/scala/org/apache/spark/memory/UnifiedMemoryManager.scala#L213-L235
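For readers who want to check the arithmetic, here is a minimal sketch of the formula as written in this description. It is not the PR's code: the class and method names are hypothetical, and the reserved amount is a parameter because the description does not state its value.

```java
// Illustrative sketch only. TaskMemoryEstimate and estimateMaxMemoryBytes are
// hypothetical names and do not appear in the PR.
final class TaskMemoryEstimate {
  private TaskMemoryEstimate() {}

  /**
   * Applies the formula from the description:
   * (pool - reserved) * memoryFraction * maxMemoryFactor / executorCores.
   *
   * @param poolBytes spark.memory.offHeap.size (off-heap mode) or
   *     spark.executor.memory (on-heap mode), already converted to bytes
   * @param reservedBytes reserved memory subtracted first; an input here because
   *     the description leaves the actual value unstated
   * @param memoryFraction value of spark.memory.fraction
   * @param maxMemoryFactor value of celeborn.client.spark.push.sort.memory.maxMemoryFactor
   * @param executorCores value of spark.executor.cores
   */
  static long estimateMaxMemoryBytes(
      long poolBytes,
      long reservedBytes,
      double memoryFraction,
      double maxMemoryFactor,
      int executorCores) {
    if (executorCores <= 0) {
      throw new IllegalArgumentException("executorCores must be positive");
    }
    // Clamp at zero so a pool smaller than the reservation never goes negative.
    long usableBytes = Math.max(0L, poolBytes - reservedBytes);
    return (long) (usableBytes * memoryFraction * maxMemoryFactor / executorCores);
  }

  public static void main(String[] args) {
    // Made-up inputs chosen so the arithmetic is easy to follow:
    // (4096 - 96) * 0.5 * 0.5 / 4 = 250
    System.out.println(estimateMaxMemoryBytes(4096L, 96L, 0.5, 0.5, 4));
  }
}
```

Dividing by spark.executor.cores assumes every core runs a task at the same time, so the result is a conservative per-task share.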

maxMemory can be set statically with the new setting celeborn.client.spark.push.sort.memory.maxMemoryBytes. It is calculated dynamically only when celeborn.client.spark.push.sort.memory.calculateMaxMemoryBytes is set to true (default: false).
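The order in which the two settings are consulted could look like the sketch below. This is an assumption, not the PR's code: the class and method names are hypothetical, the "static value wins when positive" rule is inferred from the `<= 0L` check in the diff, and the final fallback to the existing Runtime-based value is assumed.

```java
// Illustrative sketch only. MaxMemoryResolver and resolveMaxMemoryBytes are
// hypothetical names and do not appear in the PR.
final class MaxMemoryResolver {
  private MaxMemoryResolver() {}

  /**
   * @param configuredMaxBytes value of ...sort.memory.maxMemoryBytes (<= 0 means unset)
   * @param calculateFromSparkConf value of ...sort.memory.calculateMaxMemoryBytes
   * @param calculatedTaskBytes per-task estimate derived from the Spark configs
   * @param jvmMaxBytes Runtime.getRuntime().maxMemory()
   * @param maxMemoryFactor value of ...sort.memory.maxMemoryFactor
   */
  static long resolveMaxMemoryBytes(
      long configuredMaxBytes,
      boolean calculateFromSparkConf,
      long calculatedTaskBytes,
      long jvmMaxBytes,
      double maxMemoryFactor) {
    if (configuredMaxBytes > 0L) {
      // A statically configured limit takes precedence (assumed).
      return configuredMaxBytes;
    }
    if (calculateFromSparkConf) {
      // Opt-in dynamic calculation from the Spark configs.
      return calculatedTaskBytes;
    }
    // Assumed fallback: the pre-existing behavior based on the JVM maximum.
    return (long) (jvmMaxBytes * maxMemoryFactor);
  }

  public static void main(String[] args) {
    long jvmMax = Runtime.getRuntime().maxMemory();
    // With nothing configured, the result follows the JVM-wide limit.
    System.out.println(resolveMaxMemoryBytes(0L, false, 0L, jvmMax, 0.4));
  }
}
```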

A better solution would probably be to expose getMaxMemory on the Spark side, but that is not currently available.

Why are the changes needed?

Currently maxMemory is set to Runtime.getRuntime().maxMemory() * maxMemoryFactor, where Runtime.getRuntime().maxMemory() is the maximum memory available to the entire executor JVM, not to a single task. As a result, for some large tasks, SortBasedPusher with celeborn.client.spark.push.sort.memory.useAdaptiveThreshold enabled will always OOM.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit tests.
We are also running this in production.

turboFei · 2025-04-08T04:57:41Z

client-spark/common/src/main/java/org/apache/spark/shuffle/celeborn/SortBasedPusher.java

+      // set max task memory conf based on Spark conf
+      if (celebornConf.clientPushSortMaxMemoryBytes() <= 0L) {
+        double memoryStorageFraction =
+            sparkConf.getDouble(package$.MODULE$.MEMORY_FRACTION().key(), 0.6);


For spark-2.4 UT

Error: /home/runner/work/celeborn/celeborn/client-spark/common/src/main/java/org/apache/spark/shuffle/celeborn/SortBasedPusher.java:553:48: error: cannot find symbol

Maybe you can use the string value directly?

RexXiong · 2025-04-09

Can we get the values of maxOffHeapMemory/maxHeapMemory from the MemoryManager through reflection?

helenweng-stripe · 2025-04-15

Thank you for the feedback. I will try using reflection and resubmit.

Copilot

Copilot reviewed 5 out of 6 changed files in this pull request and generated no comments.

Files not reviewed (1)

common/src/main/scala/org/apache/celeborn/common/CelebornConf.scala: Language not supported

github-actions · 2025-05-06T08:36:59Z

This PR is stale because it has been open 20 days with no activity. Remove stale label or comment or this will be closed in 10 days.

github-actions · 2025-05-17T08:34:54Z

This issue was closed because it has been stale for 10 days with no activity.

turboFei · 2025-05-18T14:47:17Z

Gentle ping @helenweng-stripe, any update?

Set max memory of SortBasedPusher based off Spark configs

ffb8693

github-actions bot added module:client module:spark kind:documentation module:common labels Apr 7, 2025

turboFei reviewed Apr 8, 2025

SteNicholas requested a review from Copilot April 10, 2025 03:11

Copilot AI reviewed Apr 10, 2025

github-actions bot added the stale label May 6, 2025

github-actions bot closed this May 17, 2025

turboFei reopened this May 18, 2025

github-actions bot removed the stale label May 19, 2025

SteNicholas force-pushed the main branch 2 times, most recently from 5590ef0 to 0dffcf6 Compare May 26, 2025 09:57
