
[CELEBORN-1099] Check register when handleGetReducerFileGroup #2056


Closed
wants to merge 2 commits into from


onebox-li
Contributor

What changes were proposed in this pull request?

In Spark, when a stage's outputPartitioning is already satisfied and a shuffle exchange is no longer needed, there is no shuffle write and therefore no RegisterShuffle call either. Currently, the original reduce stage then hits an NPE when LifecycleManager runs handleGetReducerFileGroup.

ERROR [dispatcher-event-loop-11] Inbox: Ignoring error
java.lang.NullPointerException: null
    at org.apache.celeborn.client.commit.ReducePartitionCommitHandler.handleGetReducerFileGroup(ReducePartitionCommitHandler.scala:307)
    at org.apache.celeborn.client.CommitManager.handleGetReducerFileGroup(CommitManager.scala:266)
    at org.apache.celeborn.client.LifecycleManager.org$apache$celeborn$client$LifecycleManager$$handleGetReducerFileGroup(LifecycleManager.scala:556)
    at org.apache.celeborn.client.LifecycleManager$$anonfun$receiveAndReply$1.applyOrElse(LifecycleManager.scala:298) 
    at org.apache.celeborn.common.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
    at org.apache.celeborn.common.rpc.netty.Inbox.safelyCall(Inbox.scala:222)
    at org.apache.celeborn.common.rpc.netty.Inbox.process(Inbox.scala:110)
    at org.apache.celeborn.common.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:227)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

To reproduce, run a query such as:
select count(*) as cnt from tableA;
where tableA is empty.

This PR changes the code to handle this legitimate situation. The Flink client keeps the old behavior.
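
The idea of checking registration before looking up the file group can be sketched as follows. This is a minimal, self-contained illustration in Java, not the code from this PR: the class `FileGroupLookupSketch`, the `Status` enum, the `Reply` type, and the `addFile` helper are all hypothetical names, and the real handler lives in Celeborn's Scala client code.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: none of these names are taken from Celeborn itself.
class FileGroupLookupSketch {

    // Hypothetical status codes for the reply.
    enum Status { SUCCESS, SHUFFLE_NOT_REGISTERED }

    // Hypothetical reply: a status plus partitionId -> file locations.
    static final class Reply {
        final Status status;
        final Map<Integer, Set<String>> fileGroup;

        Reply(Status status, Map<Integer, Set<String>> fileGroup) {
            this.status = status;
            this.fileGroup = fileGroup;
        }
    }

    // shuffleId -> (partitionId -> file locations); an entry exists only after registration.
    private final Map<Integer, Map<Integer, Set<String>>> reducerFileGroups =
        new ConcurrentHashMap<>();

    void registerShuffle(int shuffleId) {
        reducerFileGroups.putIfAbsent(shuffleId, new ConcurrentHashMap<>());
    }

    void addFile(int shuffleId, int partitionId, String location) {
        Map<Integer, Set<String>> groups = reducerFileGroups.get(shuffleId);
        if (groups == null) {
            throw new IllegalStateException("shuffle " + shuffleId + " is not registered");
        }
        groups.computeIfAbsent(partitionId, k -> ConcurrentHashMap.<String>newKeySet())
              .add(location);
    }

    Reply handleGetReducerFileGroup(int shuffleId) {
        Map<Integer, Set<String>> groups = reducerFileGroups.get(shuffleId);
        if (groups == null) {
            // No shuffle write ever happened (e.g. the input table was empty),
            // so reply with an empty file group instead of dereferencing null.
            return new Reply(Status.SHUFFLE_NOT_REGISTERED, Collections.emptyMap());
        }
        return new Reply(Status.SUCCESS, groups);
    }

    public static void main(String[] args) {
        FileGroupLookupSketch handler = new FileGroupLookupSketch();
        Reply reply = handler.handleGetReducerFileGroup(42); // shuffle 42 was never registered
        System.out.println(reply.status + " " + reply.fileGroup);
    }
}
```

In this sketch a reducer asking about an unregistered shuffle receives an empty file group and can finish with no input, rather than the server-side handler failing with an NPE.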

Why are the changes needed?

They fix a possible NPE in the scenario described above.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Cluster test.

@codecov

codecov bot commented Oct 30, 2023

Codecov Report

Merging #2056 (f7ef1a3) into main (70366ed) will decrease coverage by 0.15%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##             main    #2056      +/-   ##
==========================================
- Coverage   46.79%   46.64%   -0.15%     
==========================================
  Files         165      165              
  Lines       10525    10525              
  Branches      957      957              
==========================================
- Hits         4924     4908      -16     
- Misses       5281     5294      +13     
- Partials      320      323       +3     

see 3 files with indirect coverage changes


Contributor

@waitinfuture waitinfuture left a comment


LGTM, thanks! Merging to main(v0.4.0)/0.3(v0.3.2)

waitinfuture pushed a commit that referenced this pull request Oct 31, 2023
Closes #2056 from onebox-li/fix-empty-shuffle-npe.

Authored-by: onebox-li <[email protected]>
Signed-off-by: zky.zhoukeyong <[email protected]>
(cherry picked from commit f6cc377)
Signed-off-by: zky.zhoukeyong <[email protected]>
@onebox-li onebox-li deleted the fix-empty-shuffle-npe branch November 1, 2023 01:56