-
Notifications
You must be signed in to change notification settings - Fork 535
HDDS-13067. Container Balancer delete commands should not be sent with an expiration time in the past #8491
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
…h an expiration time in the past
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Tejaskriya thanks for taking this up. I've left some comments. Please add tests to verify this change.
...hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/balancer/MoveManager.java
Outdated
Show resolved
Hide resolved
...hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/balancer/MoveManager.java
Outdated
Show resolved
Hide resolved
...hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/balancer/MoveManager.java
Outdated
Show resolved
Hide resolved
...hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/balancer/MoveManager.java
Outdated
Show resolved
Hide resolved
@siddhantsangwan thanks for the review, I have made the changes and added a test, For the test added, I checked without the changes made in this patch, it fails (hence verifying that the expiration time was set in the past). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The source code looks good, but I have a comment on the test.
// 6 minutes is the datanodeTimeoutOffset set for datanodeCommands sent by replicationManager by default | ||
assertTrue((Duration.ofMillis(longCaptor.getValue()).toMillis() | ||
- Duration.ofMinutes(6).toMillis()) > clock.millis()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding the test. We want to ensure it asserts the delete command is sent with an SCM deadline of moveStartTime + moveTimeout
, so the assertion needs to be changed. The 6-minute datanodeTimeoutOffset
is used later when the Replication Manager sends the command to the datanode, so it's not relevant here.
It'd also be good to have a test that reproduces the situation where a delete command was being sent with a deadline in the past, and make sure that doesn't happen with the new changes. The example I added in the jira can be a good guide for you to reproduce the error.
Unfortunately this test is just failing because of:
It's not really testing what we want. In general it's a good practice to use the master branch for testing the 'before' version of the code so that unintended changes don't sneak in. This test currently passes on the master branch. |
What changes were proposed in this pull request?
Problem
The method that sends the delete command for container balancer is MoveManager#sendDeleteCommand().
It calculates deleteTimeout as moveTimeout - replicationTimeout, and then sends the delete command with an SCM expiration timestamp of current time + deleteTimeout. This is wrong, the delete expiration timestamp should actually be "The time at which the move was started + moveTimeout."
This PR changes the timestamp considered for calculating the timeout duration.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-13067
How was this patch tested?
Existing tests, added a unit test in TestMoveManager