A job stuck in the CG (completing) state is typically due to non-killable processes associated with the job.
Slurm will continue to attempt terminating the processes with SIGKILL,
but a process stuck performing I/O may remain non-killable.
This is typically due to a file system problem and may be addressed in a few ways.
1. Fix the file system and/or reboot the node.
2. Set the node to a DOWN state and then return it to service ("scontrol update NodeName=<node> State=down Reason=hung_proc" followed by "scontrol update NodeName=<node> State=resume"; see the first sketch after this list).
This permits other jobs to use the node, but leaves the non-killable process in place.
If the process should ever complete the I/O, the pending SIGKILL should terminate it immediately.
3. Use the UnkillableStepProgram and UnkillableStepTimeout configuration parameters to respond automatically to processes which cannot be killed, for example by sending email or rebooting the node (see the second sketch after this list).
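For option 2, the sequence looks roughly as follows; node001 is a placeholder node name, substitute the node that is stuck in the completing state:

    # Mark the node DOWN so the stuck job is cleared from it, then
    # return it to service; the non-killable process is left behind,
    # still holding a pending SIGKILL.
    scontrol update NodeName=node001 State=down Reason=hung_proc
    scontrol update NodeName=node001 State=resume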
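For option 3, a minimal slurm.conf sketch might look like the following; the program path and timeout value are illustrative choices, not defaults:

    # Run this program on the compute node when a step's processes
    # have survived SIGKILL for UnkillableStepTimeout seconds.
    UnkillableStepProgram=/usr/local/sbin/unkillable_step.sh
    UnkillableStepTimeout=120

The program itself is site-specific; a hypothetical example that only notifies the administrator is shown below (it assumes the mail command is available and that the job/step identity is exported in the environment, which should be verified against your Slurm version's documentation):

    #!/bin/bash
    # /usr/local/sbin/unkillable_step.sh (hypothetical)
    # Report the unkillable step to root; rebooting the node is
    # another common response.
    echo "Unkillable step ${SLURM_JOB_ID:-?}.${SLURM_STEP_ID:-?} on $(hostname)" \
      | mail -s "Slurm unkillable step on $(hostname)" root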