A job stuck in the CG (completing) state is typically due to non-killable processes associated with the job.
Slurm will continue to attempt terminating the processes with SIGKILL,
but a process stuck performing I/O may remain non-killable.
This is typically due to a file system problem and may be addressed in a few ways.
1. Fix the file system and/or reboot the node.
2. Set the node to a DOWN state and then return it to service ("scontrol update NodeName=<node> State=down Reason=hung_proc" followed by "scontrol update NodeName=<node> State=resume"; see the first sketch after this list).
This permits other jobs to use the node, but leaves the non-killable process in place.
If the process should ever complete the I/O, the pending SIGKILL should terminate it immediately.
3. Use the UnkillableStepProgram and UnkillableStepTimeout configuration parameters to respond automatically to processes which cannot be killed, for example by sending email or rebooting the node (see the second sketch after this list).
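For option 2, the sequence looks roughly as follows; node001 is a placeholder node name, substitute the node that is stuck in the completing state:

    # Mark the node DOWN so the stuck job is cleared from it, then
    # return it to service; the non-killable process is left behind,
    # still holding a pending SIGKILL.
    scontrol update NodeName=node001 State=down Reason=hung_proc
    scontrol update NodeName=node001 State=resume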
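For option 3, a minimal slurm.conf sketch might look like the following; the program path and timeout value are illustrative choices, not defaults:

    # Run this program on the compute node when a step's processes
    # have survived SIGKILL for UnkillableStepTimeout seconds.
    UnkillableStepProgram=/usr/local/sbin/unkillable_step.sh
    UnkillableStepTimeout=120

The program itself is site-specific; a hypothetical example that only notifies the administrator is shown below (it assumes the mail command is available and that the job/step identity is exported in the environment, which should be verified against your Slurm version's documentation):

    #!/bin/bash
    # /usr/local/sbin/unkillable_step.sh (hypothetical)
    # Report the unkillable step to root; rebooting the node is
    # another common response.
    echo "Unkillable step ${SLURM_JOB_ID:-?}.${SLURM_STEP_ID:-?} on $(hostname)" \
      | mail -s "Slurm unkillable step on $(hostname)" root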