fix munge mount on login failure due to slow ctrl slurmv6 setup #4702
Conversation
cboneti left a comment
Hi Federico,
Thank you for this PR. I left one comment asking about the smallest timeout that could fix the issue. Alternatively, perhaps we should not hardcode this timeout at all, and instead use 1) a new variable, or 2) one of the already existing timeouts such as controller_startup_scripts_timeout (if it is available to this script).
...heduler/schedmd-slurm-gcp-v6-controller/modules/slurm_files/scripts/setup_network_storage.py
While a variable would certainly make the code cleaner, the aim was to get this fixed quickly by leveraging the existing logic and simply extending the timeout.
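For illustration of the reviewer's suggestion, below is a minimal sketch of a login-side wait whose budget comes from a variable rather than a hardcoded constant. The names `wait_for_munge_mount` and `MUNGE_MOUNT_TIMEOUT` are hypothetical and not part of setup_network_storage.py; they only show the shape of the "configurable timeout" alternative.

```python
# Hypothetical sketch only: the real logic lives in setup_network_storage.py.
import os
import subprocess
import time


def wait_for_munge_mount(mount_cmd, timeout_s=None, interval_s=3):
    """Retry the mount command until it succeeds or the timeout elapses."""
    # Fall back to an overridable value, then to the historical 120 s default.
    if timeout_s is None:
        timeout_s = int(os.environ.get("MUNGE_MOUNT_TIMEOUT", "120"))
    deadline = time.monotonic() + timeout_s
    while True:
        if subprocess.run(mount_cmd, capture_output=True).returncode == 0:
            return  # mount succeeded
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"munge mount did not become available within {timeout_s}s"
            )
        time.sleep(interval_s)
```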
Force-pushed from 73f05a4 to 3e9e190
Adding a new commit: the older the HPC image gets, the longer the entire deployment process takes. This is particularly evident when Spack is also part of the deployment; in that context the controller's Spack install takes a lot of time. The hardcoded 40 retries on the login node do not leave enough time to complete the munge mount. With this new commit, the hardcoded retry count is much higher, set at 400. As you can see, the controller got ready in time and let the login node complete the munge mount. PS: if needed I can create a dedicated PR.
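A minimal sketch of the kind of change described above, assuming the login-side wait is a bounded retry loop; `mount_munge_with_retries`, `MAX_MOUNT_RETRIES`, and the 3-second interval are illustrative assumptions, not the actual code in setup_network_storage.py.

```python
# Illustrative only: shows a retry count raised from 40 to 400.
import subprocess
import time

MAX_MOUNT_RETRIES = 400  # previously 40, which slow controller setups
                         # (e.g. with long Spack installs) could exceed


def mount_munge_with_retries(mount_cmd, retries=MAX_MOUNT_RETRIES, interval_s=3):
    """Keep retrying the munge mount until it succeeds or retries run out."""
    for attempt in range(1, retries + 1):
        if subprocess.run(mount_cmd, capture_output=True).returncode == 0:
            return attempt  # mounted successfully on this attempt
        time.sleep(interval_s)
    raise RuntimeError(f"munge mount still unavailable after {retries} attempts")
```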
...heduler/schedmd-slurm-gcp-v6-controller/modules/slurm_files/scripts/setup_network_storage.py
...heduler/schedmd-slurm-gcp-v6-controller/modules/slurm_files/scripts/setup_network_storage.py
In my testing,
Adding updated blueprint, which includes
Given the impact of
cboneti left a comment
Directionally approved. I would like someone else on the CTK team to run the relevant Slurm tests here to make sure everything is still fine.
/gcbrun
The sample blueprint used to reproduce the original issue follows.
/gcbrun
The Slurm v6 controller setup may take quite a long time, resulting in a failed cluster deployment because of a fairly stringent hardcoded munge mount timeout. My proposal is to bump this up to 1200 s (rather than the default 120 s).
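For reference, the proposed timeout lines up with the retry-count bump described above, assuming a roughly 3-second interval per attempt (an inference from 40 retries corresponding to about 120 s, not a value taken from the script):

```python
# Assumed relationship between retry count and overall timeout budget.
RETRY_INTERVAL_S = 3                      # inferred, not confirmed in the script
OLD_TIMEOUT_S = 40 * RETRY_INTERVAL_S     # = 120 s, the previous hardcoded budget
NEW_TIMEOUT_S = 400 * RETRY_INTERVAL_S    # = 1200 s, the proposed budget
```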
I confirm that I have tested my fork and it now works as intended. Only the relevant log portion follows.
Attached are the full logs from the login and controller nodes.
login-setup.log
controller-setup.log
My blueprint follows: