Skip to content

HDDS-13067. Container Balancer delete commands should not be sent with an expiration time in the past #8491

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

Tejaskriya
Copy link
Contributor

@Tejaskriya Tejaskriya commented May 20, 2025

What changes were proposed in this pull request?

Problem
The method that sends the delete command for container balancer is MoveManager#sendDeleteCommand().
It calculates deleteTimeout as moveTimeout - replicationTimeout, and then sends the delete command with an SCM expiration timestamp of current time + deleteTimeout. This is wrong, the delete expiration timestamp should actually be "The time at which the move was started + moveTimeout."
This PR changes the timestamp considered for calculating the timeout duration.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-13067

How was this patch tested?

Existing tests, added a unit test in TestMoveManager

Copy link
Contributor

@siddhantsangwan siddhantsangwan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Tejaskriya thanks for taking this up. I've left some comments. Please add tests to verify this change.

@Tejaskriya
Copy link
Contributor Author

Tejaskriya commented May 21, 2025

@siddhantsangwan thanks for the review, I have made the changes and added a test, For the test added, I checked without the changes made in this patch, it fails (hence verifying that the expiration time was set in the past).
Run of the failure without changes: https://github.com/Tejaskriya/ozone/actions/runs/15156730084/job/42613790646

Copy link
Contributor

@siddhantsangwan siddhantsangwan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The source code looks good, but I have a comment on the test.

Comment on lines +528 to +530
// 6 minutes is the datanodeTimeoutOffset set for datanodeCommands sent by replicationManager by default
assertTrue((Duration.ofMillis(longCaptor.getValue()).toMillis()
- Duration.ofMinutes(6).toMillis()) > clock.millis());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding the test. We want to ensure it asserts the delete command is sent with an SCM deadline of moveStartTime + moveTimeout, so the assertion needs to be changed. The 6-minute datanodeTimeoutOffset is used later when the Replication Manager sends the command to the datanode, so it's not relevant here.

It'd also be good to have a test that reproduces the situation where a delete command was being sent with a deadline in the past, and make sure that doesn't happen with the new changes. The example I added in the jira can be a good guide for you to reproduce the error.

@siddhantsangwan
Copy link
Contributor

@siddhantsangwan thanks for the review, I have made the changes and added a test, For the test added, I checked without the changes made in this patch, it fails (hence verifying that the expiration time was set in the past).
Run of the failure without changes: https://github.com/Tejaskriya/ozone/actions/runs/15156730084/job/42613790646

Unfortunately this test is just failing because of:

moveManager.setMoveTimeout(Duration.ofMinutes(50).toMillis());

It's not really testing what we want. In general it's a good practice to use the master branch for testing the 'before' version of the code so that unintended changes don't sneak in. This test currently passes on the master branch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants