HDDS-13067. Container Balancer delete commands should not be sent with an expiration time in the past #8491

Tejaskriya · 2025-05-20T10:45:49Z

What changes were proposed in this pull request?

Problem
The method that sends the delete command for container balancer is MoveManager#sendDeleteCommand().
It calculates deleteTimeout as moveTimeout - replicationTimeout, and then sends the delete command with an SCM expiration timestamp of current time + deleteTimeout. This is wrong: the delete expiration timestamp should be the time at which the move was started + moveTimeout.
This PR changes the base timestamp used to calculate the delete deadline from the current time to the time at which the move started.
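
To make the difference between the two calculations concrete, here is a minimal standalone sketch. It is not the MoveManager code. The class and method names are hypothetical, the 55-minute and 50-minute timeouts are illustrative values chosen for the example, and the step that subtracts the 6-minute datanode offset from the SCM deadline is an assumption about how the Replication Manager derives the datanode deadline.

```java
import java.time.Duration;

// Hypothetical illustration only; none of these names exist in Ozone.
public class DeleteDeadlineSketch {

  // Behaviour described under "Problem": the deadline is measured from "now".
  static long oldScmDeadline(long nowMillis, Duration moveTimeout,
      Duration replicationTimeout) {
    long deleteTimeout = moveTimeout.minus(replicationTimeout).toMillis();
    return nowMillis + deleteTimeout;
  }

  // Behaviour proposed in this PR: the deadline is anchored to the move start.
  static long newScmDeadline(long moveStartMillis, Duration moveTimeout) {
    return moveStartMillis + moveTimeout.toMillis();
  }

  // ASSUMPTION: the datanode deadline is the SCM deadline minus an offset.
  static long datanodeDeadline(long scmDeadlineMillis, Duration offset) {
    return scmDeadlineMillis - offset.toMillis();
  }

  public static void main(String[] args) {
    Duration moveTimeout = Duration.ofMinutes(55);        // illustrative value
    Duration replicationTimeout = Duration.ofMinutes(50); // illustrative value
    Duration offset = Duration.ofMinutes(6);              // offset from the review

    long moveStart = 0L;
    // Suppose replication finishes 10 minutes after the move started.
    long now = moveStart + Duration.ofMinutes(10).toMillis();

    long oldDn = datanodeDeadline(
        oldScmDeadline(now, moveTimeout, replicationTimeout), offset);
    long newDn = datanodeDeadline(
        newScmDeadline(moveStart, moveTimeout), offset);

    System.out.println("old datanode deadline in the past: " + (oldDn < now));
    System.out.println("new datanode deadline in the past: " + (newDn < now));
  }
}
```

With these illustrative numbers the old calculation leaves only 5 minutes for the delete, which is less than the 6-minute offset, while the new calculation keeps the full remainder of the move window.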

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-13067

How was this patch tested?

Existing tests, plus a new unit test in TestMoveManager.

siddhantsangwan

@Tejaskriya thanks for taking this up. I've left some comments. Please add tests to verify this change.

...hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/balancer/MoveManager.java

Tejaskriya · 2025-05-21T07:59:17Z

@siddhantsangwan thanks for the review. I have made the changes and added a test. I checked that the new test fails without the changes in this patch, which verifies that the expiration time was being set in the past.
Run of the failure without changes: https://github.com/Tejaskriya/ozone/actions/runs/15156730084/job/42613790646

siddhantsangwan

The source code looks good, but I have a comment on the test.

siddhantsangwan · 2025-05-22T12:24:27Z

.../server-scm/src/test/java/org/apache/hadoop/hdds/scm/container/balancer/TestMoveManager.java

+    // 6 minutes is the datanodeTimeoutOffset set for datanodeCommands sent by replicationManager by default
+    assertTrue((Duration.ofMillis(longCaptor.getValue()).toMillis()
+        - Duration.ofMinutes(6).toMillis()) > clock.millis());


Thanks for adding the test. We want to ensure it asserts the delete command is sent with an SCM deadline of moveStartTime + moveTimeout, so the assertion needs to be changed. The 6-minute datanodeTimeoutOffset is used later when the Replication Manager sends the command to the datanode, so it's not relevant here.

It'd also be good to have a test that reproduces the situation where a delete command was being sent with a deadline in the past, and make sure that doesn't happen with the new changes. The example I added in the jira can be a good guide for you to reproduce the error.
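
The shape of the assertion requested here can be sketched in plain Java without a mocking framework. This is only an illustration: FakeMoveManager is a hypothetical stand-in, not the real MoveManager, and the recorded deadline plays the role that the captured argument (longCaptor in the diff above) plays in the real test. The clock is advanced between starting the move and sending the delete, so the check would fail if the deadline were computed from the current time.

```java
import java.time.Duration;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical illustration only; not the TestMoveManager code.
public class MoveDeadlineAssertionSketch {

  /** Stand-in for the class under test, using the move-start-based deadline. */
  static final class FakeMoveManager {
    private final AtomicLong clockMillis;
    private final long moveTimeoutMillis;
    private long moveStartTime;
    private long lastDeleteDeadline = -1L;

    FakeMoveManager(AtomicLong clockMillis, Duration moveTimeout) {
      this.clockMillis = clockMillis;
      this.moveTimeoutMillis = moveTimeout.toMillis();
    }

    // Record when the move began.
    void startMove() {
      moveStartTime = clockMillis.get();
    }

    // Record the SCM deadline the delete command would be sent with.
    void sendDeleteCommand() {
      lastDeleteDeadline = moveStartTime + moveTimeoutMillis;
    }

    long lastDeleteDeadline() {
      return lastDeleteDeadline;
    }
  }

  public static void main(String[] args) {
    AtomicLong clock = new AtomicLong(1_000L);
    Duration moveTimeout = Duration.ofMinutes(55); // illustrative value
    FakeMoveManager manager = new FakeMoveManager(clock, moveTimeout);

    long moveStart = clock.get();
    manager.startMove();

    // Simulate replication taking 10 minutes before the delete is sent.
    clock.addAndGet(Duration.ofMinutes(10).toMillis());
    manager.sendDeleteCommand();

    long expected = moveStart + moveTimeout.toMillis();
    System.out.println("deadline == moveStartTime + moveTimeout: "
        + (manager.lastDeleteDeadline() == expected));
    System.out.println("deadline is in the future: "
        + (manager.lastDeleteDeadline() > clock.get()));
  }
}
```

Asserting exact equality against moveStartTime + moveTimeout, rather than a greater-than comparison against the current time, is what ties the test to the intended behaviour.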

siddhantsangwan · 2025-05-22T12:28:35Z

> @siddhantsangwan thanks for the review. I have made the changes and added a test. I checked that the new test fails without the changes in this patch, which verifies that the expiration time was being set in the past.
> Run of the failure without changes: https://github.com/Tejaskriya/ozone/actions/runs/15156730084/job/42613790646

Unfortunately this test is just failing because of:

moveManager.setMoveTimeout(Duration.ofMinutes(50).toMillis());

It's not really testing what we want. In general it's a good practice to use the master branch for testing the 'before' version of the code so that unintended changes don't sneak in. This test currently passes on the master branch.

HDDS-13067. Container Balancer delete commands should not be sent wit…

2264346

…h an expiration time in the past

Tejaskriya requested a review from siddhantsangwan May 20, 2025 10:46

siddhantsangwan reviewed May 20, 2025

Add test, rename to moveStartTime

17d9130

siddhantsangwan reviewed May 22, 2025
