Skip to content

HDDS-12760. Intermittent Timeout in testImportedContainerIsClosed #8349

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 9 commits into
base: master
Choose a base branch
from

Conversation

kostacie
Copy link
Contributor

@kostacie kostacie commented Apr 28, 2025

What changes were proposed in this pull request?

In this PR, a new condition for SCMContainerPlacementRackAware#chooseNode was added, it prevented Timeout exception in the test.

The main problem was that a datanode, which was restarted, didn't become a target for replication and the source datanode was selected instead. The SCM was choosing the same datanode, where the container is located, as a target and it made the test work incorrectly.

The fix just checks whether a current datanode (e.g. the source) is already in the used list. In case if it is, then it'll be added as an exclusion and won't be selected in the future.

What is the link to the Apache JIRA

HDDS-12760

How was this patch tested?

CI: https://github.com/kostacie/ozone/actions/runs/14702987624

@kerneltime kerneltime requested a review from adoroszlai April 28, 2025 16:35
@adoroszlai
Copy link
Contributor

Thanks @kostacie for the patch. For verifying fix of flaky integration tests, please trigger 10x10 repeated run using flaky-test-check.

@kostacie kostacie marked this pull request as draft April 29, 2025 11:51
@kostacie
Copy link
Contributor Author

Thank you @adoroszlai for the review.
The results of check: https://github.com/kostacie/ozone/actions/runs/15109038542
Also the description and the PR were updated.

@kostacie kostacie marked this pull request as ready for review May 19, 2025 10:10
Copy link
Contributor

@myskov myskov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @kostacie for the patch

dnInfos.get(1).setNodeStatus(NodeStatus.inServiceHealthyReadOnly());

Exception e =
assertThrows(Exception.class,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can be simplified to assertThrows(SCMException.class,

@adoroszlai
Copy link
Contributor

Thanks @kostacie for updating the patch, and providing the test run.

https://github.com/kostacie/ozone/actions/runs/15109038542

Can you please also try running the check for ALL test methods from the same class? (There may be interaction among test cases, so running all methods provides more safety wrt. flakiness.)

@kostacie
Copy link
Contributor Author

The test run for the whole class:
https://github.com/kostacie/ozone/actions/runs/15169178726

@adoroszlai adoroszlai requested a review from ashishkumar50 May 27, 2025 10:21
@@ -502,6 +502,11 @@ private DatanodeDetails chooseNode(List<DatanodeDetails> excludedNodes,
" excludedNodes and affinityNode constrains.", null);
}

if (usedNodes != null && usedNodes.contains(node)) {
excludedNodesForCapacity.add(node.getNetworkFullPath());
continue;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

excludedNodesForCapacity can be null, needs to initialise if it is null.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants