HDDS-12760. Intermittent Timeout in testImportedContainerIsClosed #8349
base: master
Conversation
Thanks @kostacie for the patch. For verifying the fix of a flaky integration test, please trigger a 10x10 repeated run using flaky-test-check.
Thank you @adoroszlai for the review.
Thanks @kostacie for the patch
```java
dnInfos.get(1).setNodeStatus(NodeStatus.inServiceHealthyReadOnly());

Exception e =
    assertThrows(Exception.class,
```
Can be simplified to `assertThrows(SCMException.class, ...)`.
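A minimal sketch of the suggested simplification, assuming JUnit 5's `assertThrows`; the placement-policy call and its argument list are illustrative placeholders, not the actual test's code:

```java
import static org.junit.jupiter.api.Assertions.assertThrows;

// Broad form: passes for any exception type, so a regression that throws
// something unexpected would still satisfy the assertion.
Exception e = assertThrows(Exception.class,
    () -> policy.chooseDatanodes(excludedNodes, null, 3, 15, 15));

// Suggested narrow form: fails unless an SCMException (or subclass) is
// thrown, and returns it already typed, so no cast is needed afterwards.
SCMException ex = assertThrows(SCMException.class,
    () -> policy.chooseDatanodes(excludedNodes, null, 3, 15, 15));
```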
Thanks @kostacie for updating the patch, and for providing the test run.
Can you please also try running the check for the whole test class?
The test run for the whole class:
```diff
@@ -502,6 +502,11 @@ private DatanodeDetails chooseNode(List<DatanodeDetails> excludedNodes,
           " excludedNodes and affinityNode constrains.", null);
     }
 
+    if (usedNodes != null && usedNodes.contains(node)) {
+      excludedNodesForCapacity.add(node.getNetworkFullPath());
+      continue;
```
`excludedNodesForCapacity` can be null; it needs to be initialised if it is null.
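A minimal sketch of the null guard the reviewer is asking for, assuming `excludedNodesForCapacity` is a lazily created `List<String>`; the initialization shown is illustrative, not the exact fix:

```java
if (usedNodes != null && usedNodes.contains(node)) {
  // Assumption: excludedNodesForCapacity may still be null on this path,
  // so create it before recording the exclusion to avoid an NPE.
  if (excludedNodesForCapacity == null) {
    excludedNodesForCapacity = new ArrayList<>();
  }
  excludedNodesForCapacity.add(node.getNetworkFullPath());
  continue;
}
```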
What changes were proposed in this pull request?
In this PR, a new condition was added to SCMContainerPlacementRackAware#chooseNode to prevent the timeout in the test. The root cause was that a restarted datanode did not become the target for replication; instead, SCM kept choosing the source datanode, i.e. the node where the container is already located, as the target, which made the test behave incorrectly.
The fix checks whether the current candidate datanode (e.g. the source) is already in the used-nodes list. If it is, the node is added to the exclusion list and will not be selected again. A simplified sketch of the guard in context is shown below.
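For context, a simplified sketch of how the new guard sits inside the chooseNode retry loop; the loop structure and helper names such as `hasEnoughSpace` are illustrative, and only the usedNodes check comes from this PR's diff:

```java
while (true) {
  Node node = networkTopology.chooseRandom(scope, excludedNodesForCapacity);
  if (node == null) {
    throw new SCMException("No satisfied datanode to meet the"
        + " excludedNodes and affinityNode constrains.", null);
  }
  // New in this PR: skip a candidate that already holds a replica
  // (i.e. is in usedNodes), so the source datanode is never chosen
  // as its own replication target.
  if (usedNodes != null && usedNodes.contains(node)) {
    excludedNodesForCapacity.add(node.getNetworkFullPath());
    continue;  // retry with another candidate
  }
  if (hasEnoughSpace(node)) {
    return (DatanodeDetails) node;  // suitable target found
  }
  excludedNodesForCapacity.add(node.getNetworkFullPath());
}
```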
What is the link to the Apache JIRA
HDDS-12760: https://issues.apache.org/jira/browse/HDDS-12760
How was this patch tested?
CI: https://github.com/kostacie/ozone/actions/runs/14702987624