深入理解linux内核hung_task机制，全网最全！原创！

原创已于 2024-09-02 00:04:43 修改 · 2.3k 阅读

37 ·

CC 4.0 BY-SA版权

文章标签：

#linux

于 2024-09-01 13:51:51 首次发布

背景

最近的一个项目里，发生的问题近乎多半都是hangdetect的问题，之前一直对这种问题总是一知半解，发现主要是因为对此种维测方案(hangdetect/hangtask/watchdog/hungdetect)的理解不够深刻，而更深层次的原因是对于内核的各种机(进程调度，管理，内存管理，锁，工作队列，timer)制没有深入的研究过，此个专题，我们从liunx原生的hang_task机制开始，进而对各类平台的各种hangtask(watchdog，hangdetect)的方案的原理流程图，做一个汇总，相信理解这些方案的原理之后，对于任何类hang的问题，都会手到擒来！

前言

hung_task 是linux内核原生的一个简单的检测进程卡死的维测方案；

在此篇问题开始之前我们提出以下两个问题，如果你可以快速给出以下问题的答案，那么这篇文章可能不太适合你...

问题1 我们都清楚hung_task是通过定期遍历内核进程树来达到检测D状态进程是否卡住，那hungtask是通过什么方式来确认进程已经处于D状态超时了60s呢，要知道正常状态下，进程也是会短暂进入D状态的；

问题2 当进程超时之后，打印如下log:

[338.922807] INFO: task hungtest_thread:1353 blocked for more than 120 seconds.

我们如何确认进程hungtest_thread卡住的时间范围什么（确认这个时间点的意义就就是可以看到对应时间点的上层log去分析，当前的操作的app以及具体行为,进而确认场景以及复现路径）。

一、hang_task 方案概览

1、核心流程框图

因为hung_task方案的核心是watchdog线程，此流程图仅针对watchdog线程，为求直观，简洁，明了，省略了不重要的部分；

此流程图可以回答前言中的问题1，hung_task是通过last_switch_count调度计数来确认是否进程是在一次调度中卡了120s，而不是多次调度；

2、基本配置功能

在linux内核配置以上选项使能hung_task功能:

CONFIG_DETECT_HUNG_TASK=y   // 使能hung_task功能

之后，在linux系统开始运行时，hang_task会启动一个名为 khungtaskd 的线程，该线程会循环遍历(周期默认为60s)内核的所有D状态的进程，如果这个进程持续在D状态超过60s，则hungtask进程会打印block 的进程信息(或者根据配置是否panic)，以及调用栈信息：

   338.922807] INFO: task hungtest_thread:1353 blocked for more than 30 seconds.
[  338.923036]       Not tainted 4.19.0-g29e94d5f-dirty #24
[  338.923180] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  338.923429] hungtest_thread D    0  1353      2 0x00000028
[  338.923461] Call trace:
[  338.923492]  __switch_to+0x98/0xe0
[  338.923511]  __schedule+0x254/0x838
[  338.923522]  schedule+0x3c/0xa0
[  338.923534]  schedule_preempt_disabled+0x1c/0x30
[  338.923548]  __mutex_lock.isra.1+0x17c/0x4b8
[  338.923559]  __mutex_lock_slowpath+0x24/0x30

进而根据调用栈信息确认进程当前执行状态，以及卡住的原因。

3、D状态进程卡死的时间范围分析

由于hung_task是一个简单的原生方案，因此解析此类问题一般不会分析进程实际卡死的时间范围，因为仅仅从调用栈就可以大体知道问题的原因，但是平台类的检测方案中，都会涉及对卡死进程实际卡死时间范围的分析，以求对问题可以精准定位：

由本章第一节可以知道，系统默认配置:

检测周期Tchk == 超时时长Tout == 120s

由上图，我们假设进程t在t0时刻卡死，在tc时刻hungtask检测到该进程超时并打印警告信息，那么进程t卡死的时间范围Tp就等于：

Tp = tchk+tb-t0

0 < tb-t0 < tchk

由此可以确认

120 < Tp < 240

因此，对于以下错误信息，

[338.922807] INFO: task hungtest_thread:1353 blocked for more than 120 seconds.

那么实际进程卡死的时间就在：

(338-240 ,338-120)

因此我们需要查看从 98s 开始的log确认卡死的场景以及具体原因；

同时还可以反向得出一个结论，就是根据进程卡顿时间点不同进程需要卡顿240s+ ，才能保证一定会触发hung_task机制(Tchk=Tout=120)；

二、源码解析

1、watchdog线程解析

由于hungtask方案的核心是watchdog线程，因此还是只对这个线程的源码进行分析：

int watchdog(void *dummy)
{
	unsigned long hung_last_checked = jiffies;   // 记录上次检测的时间戳
	set_user_nice(current, 0);                   // 设置当前进程的优先级

	for ( ; ; ) {
		unsigned long timeout = sysctl_hung_task_timeout_secs;  // 设置超时时长，也可以通过外部命令动态设置 
		unsigned long interval = sysctl_hung_task_check_interval_secs; //设置检测周期，也可以通过外部命令动态设置
		long t;

		if (interval == 0)
			interval = timeout;   // 超时时长== 检测周期 == 120s
		interval = min_t(unsigned long, interval, timeout);
		t = hung_timeout_jiffies(hung_last_checked, interval); 
		if (t <= 0) {             // (上次检测时间戳+检测周期 - 本次时间戳) < 0 ?
			if (!atomic_xchg(&reset_hung_task, 0))
				check_hung_uninterruptible_tasks(timeout); // 到达检测周期，开始遍历进程树
			hung_last_checked = jiffies;                   // 更新上次检测时间戳
			continue;
		}
		schedule_timeout_interruptible(t);
	}

	return 0;
}

对照第一节的流程图，watchdog线程主要就是,设置超时时长=检测周期=120s(默认配置)，然后根据上次的检测时间戳以及超时时长判断是否到达检测周期，如果到达就遍历进程树，否则就让出cpu，继续睡眠；

void check_hung_uninterruptible_tasks(unsigned long timeout)
{
	int max_count = sysctl_hung_task_check_count;  // 最大检测进程数目
	int batch_count = HUNG_TASK_BATCHING;          // 每次遍历进程数上限1024。
	struct task_struct *g, *t;

	/*
	 * If the system crashed already then all bets are off,
	 * do not report extra hung tasks:
	 */
	if (test_taint(TAINT_DIE) || did_panic)      // 如果系统处于panic的流程中，则直接返回
		return;

	hung_task_show_lock = false;
	rcu_read_lock();
	for_each_process_thread(g, t) {              // 迭代遍历每个进程
		if (!max_count--)
			goto unlock;
		if (!--batch_count) {
			batch_count = HUNG_TASK_BATCHING;
			if (!rcu_lock_break(g, t))         // 防止rcu_read_lock占用过长时间。释放rcu，并主动调度。调度回来后检查响应进程是否还在，不在则退出遍历，否则继续。
				goto unlock;
		}
		/* use "==" to skip the TASK_KILLABLE tasks waiting on NFS */
		if (t->state == TASK_UNINTERRUPTIBLE)  // 检测进程状态是否为D状态，是则继续check不是则跳过此进程
			check_hung_task(t, timeout);     // 针对D状态的进程做下一步check
	}
 unlock:
	rcu_read_unlock();
	if (hung_task_show_lock)
		debug_show_all_locks();
	if (hung_task_call_panic) {
		trigger_all_cpu_backtrace();
		panic("hung_task: blocked tasks");
	}
}

重点关注check_hung_uninterruptible_tasks(timeout) 函数，该函数遍历进程树，针对D状态进程，继续做下一步check，

static void check_hung_task(struct task_struct *t, unsigned long timeout)
{
	unsigned long switch_count = t->nvcsw + t->nivcsw; // 进程切换计数，包含主动和被动

	/*
	 * Ensure the task is not frozen.
	 * Also, skip vfork and any other user process that freezer should skip.
	 */
	if (unlikely(t->flags & (PF_FROZEN | PF_FREEZER_SKIP))) //如果进程被frozen了，则直接返回
	    return;

	/*
	 * When a freshly created task is scheduled once, changes its state to
	 * TASK_UNINTERRUPTIBLE without having ever been switched out once, it
	 * musn't be checked.
	 */
	if (unlikely(!switch_count))   //首次调度，则直接返回
		return;

	if (switch_count != t->last_switch_count) {  // 判断上次调度计数是否和本次相同，如果相同则代表，进程是真的卡了120+ 而不是刚切换到D状态
		t->last_switch_count = switch_count;
		t->last_switch_time = jiffies;
		return;
	}
	if (time_is_after_jiffies(t->last_switch_time + timeout * HZ)) // 是否卡住超过120s
		return;

	trace_sched_process_hang(t);

	if (!sysctl_hung_task_warnings && !sysctl_hung_task_panic) //如果设置了这两个变量，则直接返回让系统panic
		return;

	/*
	 * Ok, the task did not get scheduled for more than 2 minutes,
	 * complain:
	 */
	if (sysctl_hung_task_warnings) { // 进程在120+的一直卡在D状态，打印相关信息调用栈
		if (sysctl_hung_task_warnings > 0)
			sysctl_hung_task_warnings--;  // 默认最多打10次
		pr_err("INFO: task %s:%d blocked for more than %ld seconds.\n",
			t->comm, t->pid, timeout);
		pr_err("      %s %s %.*s\n",
			print_tainted(), init_utsname()->release,
			(int)strcspn(init_utsname()->version, " "),
			init_utsname()->version);
		pr_err("\"echo 0 > /proc/sys/kernel/hung_task_timeout_secs\""
			" disables this message.\n");
		sched_show_task(t);
		hung_task_show_lock = true;
	}

	touch_nmi_watchdog();

	if (sysctl_hung_task_panic) { // 使能panic
		hung_task_show_lock = true;
		hung_task_call_panic = true;
	}
}

check_hung_task，获取进程调度计数，判断本地调度计数是否和上一次相同，结合超时时长，以及上一次调度时间戳，确保进程是在一次调度中处于D状态超过120s，最终打印警告以及调用栈信息，或者根据是否配置sysctl_hung_task_panic 让系统死机。

三、实例分析

1、mutex构造hungtask实例

到此，我们理解了hungtask的原理，在此我们需要手动构造一个hungtask的例子，因为hungtask主要检测进程D状态的时长，因此只需要让进程D状态持续时长超过hungtask的超时时长就可以实现，这里采用：

两个线程竞争一个临界资源的方式，用mutex(mutex_lock阻塞时，进程就处于D状态)锁的方式构造例子，以下是源码：

/*======================================================================== 
        > File Name: hungtest.c
        > Author: zhangle
        > discription : for hung_task test
        > Mail: zhangle0320@163.com
        > Created Time: Sun 11 Aug 2024 06:05:14 AM UTC
 ========================================================================*/

#include <linux/interrupt.h>
#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/sched.h>
#include <linux/delay.h>

#define DEBUG
struct mutex zl_mutex;
struct task_struct *task_thread1;
struct task_struct *task_thread2;

int global_resource = -1;

int thread_of_hungtask_test1(void *data)
{

    pr_err("thread1  start \n");       //进程开始运行时间点
    mutex_lock(&zl_mutex);
    pr_err("thread1  got  lock \n");   //获取到锁的时间点
    mdelay(250000);    // 必须设置240+   
    global_resource += 1;              
    mutex_unlock(&zl_mutex);           // 释放锁临界资源

    return 0;
}
int thread_of_hungtask_test2(void *data)
{
    pr_err("thread2  start \n");       
    mutex_lock(&zl_mutex);
    pr_err("thread2  got  lock \n");
    mdelay(250000);
    global_resource += 1;
    mutex_unlock(&zl_mutex);

    return 0;
}
void hungtask_test_start(void)
{
    pr_debug("test start!\n");

    mutex_init(&zl_mutex);
    task_thread1 = kthread_run(thread_of_hungtask_test1, NULL, "hungtest_thread1");
    task_thread2 = kthread_run(thread_of_hungtask_test2, NULL, "hungtest_thread2");

    return;
}

void hungtask_test_stop(void)
{
    pr_debug("tasklet exit\n");
}
module_init(hungtask_test_start);
module_exit(hungtask_test_stop);
MODULE_DESCRIPTION("Tasklet driver test");
MODULE_LICENSE("GPL");

这是个非常简单例子，创建了两个线程 hungtest_thread1，hungtest_thread2并运行，初始化一个mutex锁，在进程开始时，分别获取mutex锁，去访问临界资源，使用mdelay增加线程访问临界资源的时长，从而另一个线程无法获取锁，一直处于D状态，这个时间超过120s之后，就会造成hung_task警告信息的打印。。

这个例子中也打开了内核LOCKdebug，这样hungtask触发之后也打印进程lock依赖的信息：

2、实例验证

我们将上述的例子编译进内核，并在qemu上运行内核,以下是运行输出的log

[ 1.205746] thread1 start
[ 1.205878] thread1 got lock
[ 1.216906] thread2 start
[ 1.329502] Freeing unused kernel memory: 1920K
[ 1.330991] Run /linuxrc as init process
[ 1.531400] 9pnet_virtio: no channels available for device kmod_mount
[ 121.848699] zhangle : task check time = 120000
[ 242.680213] INFO: task hungtest_thread:1365 blocked for more than 120 seconds.
[ 242.680529] Not tainted 4.19.0-gba108edc-dirty #34
[ 242.680682] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 242.680872] hungtest_thread D 0 1365 2 0x00000028
[ 242.681059] Call trace:
[ 242.681099] __switch_to+0x94/0xf0
[ 242.681143] __schedule+0x2f8/0xb58
[ 242.681158] schedule+0x3c/0xa0
[ 242.681171] schedule_preempt_disabled+0x1c/0x30
[ 242.681184] __mutex_lock+0x210/0x828
[ 242.681196] mutex_lock_nested+0x3c/0x50
[ 242.681210] thread_of_hungtask_test2+0x44/0x90
[ 242.681223] kthread+0x100/0x130
[ 242.681236] ret_from_fork+0x10/0x1c
[ 242.681307]
[ 242.681307] Showing all locks held in the system:
[ 242.681396] 1 lock held by khungtaskd/470:
[ 242.681425] #0: (____ptrval____) (rcu_read_lock){....}, at: debug_show_all_locks+0x20/0x1b0
[ 242.681611] 1 lock held by hungtest_thread/1364:
[ 242.681624] 1 lock held by hungtest_thread/1365:
[ 242.681634] #0: (____ptrval____) (&zl_mutex){+.+.}, at: thread_of_hungtask_test2+0x44/0x90
[ 242.681722]
[ 242.681737] =============================================
[ 242.681737]
[ 242.681759] zhangle : task check time = 120000
[root@user ]# [ 251.748114] thread2 got lock

依据以上的log可以看到[ 1.216906] thread2 start 进程2在这个时间点获取不到mutex进入D状态，watchdog线程在 121.848699] zhangle : task check time 时候第一次check进程状态，此时由于是第一次调度会直接返回，又过了120s，在242.680213时，满足hangtask的条件，此时输出报错信息；

以下进程信息取自 121 -242s 之间：

# ps -ef | grep hungtest
1364 0 1:35 [hungtest_thread]
1365 0 0:00 [hungtest_thread]

Name: hungtest_thread
Umask: 0022
State: R (running)
Tgid: 1364
Ngid: 0
Pid: 1364
PPid: 2
TracerPid: 0
Uid: 0 0 0 0
Gid: 0 0 0 0
FDSize: 64
Groups:
NStgid: 1364
NSpid: 1364
NSpgid: 0
NSsid: 0
Threads: 1
SigQ: 0/7784
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: ffffffffffffffff
SigCgt: 0000000000000000
CapInh: 0000000000000000
CapPrm: 0000003fffffffff
CapEff: 0000003fffffffff
CapBnd: 0000003fffffffff
CapAmb: 0000000000000000
NoNewPrivs: 0
Seccomp: 0
Speculation_Store_Bypass: unknown
Cpus_allowed: 3
Cpus_allowed_list: 0-1
Mems_allowed: 1
Mems_allowed_list: 0
voluntary_ctxt_switches: 0
nonvoluntary_ctxt_switches: 262

[root@user ]# cat /proc/1364/stack
[<0>] __switch_to+0x94/0xf0
[<0>] 0x28
[<0>] 0xffffffffffffffff

thread1 pid 1364 由于拿到了锁，处于运行状态，而下面的thread 2 pid 1365 由于拿不到锁，一直处于D状态，这个也可以从LOCKdep的信息里面确认到。

Name: hungtest_thread
Umask: 0022
State: D (disk sleep)
Tgid: 1365
Ngid: 0
Pid: 1365
PPid: 2
TracerPid: 0
Uid: 0 0 0 0
Gid: 0 0 0 0
FDSize: 64
Groups:
NStgid: 1365
NSpid: 1365
NSpgid: 0
NSsid: 0
Threads: 1
SigQ: 0/7784
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: ffffffffffffffff
SigCgt: 0000000000000000
CapInh: 0000000000000000
CapPrm: 0000003fffffffff
CapEff: 0000003fffffffff
CapBnd: 0000003fffffffff
CapAmb: 0000000000000000
NoNewPrivs: 0
Seccomp: 0
Speculation_Store_Bypass: unknown
Cpus_allowed: 3
Cpus_allowed_list: 0-1
Mems_allowed: 1
Mems_allowed_list: 0
voluntary_ctxt_switches: 1
nonvoluntary_ctxt_switches: 1

[root@user ]# cat /proc/1365/stack
[<0>] __switch_to+0x94/0xf0
[<0>] thread_of_hungtask_test2+0x44/0x90
[<0>] kthread+0x100/0x130
[<0>] ret_from_fork+0x10/0x1c
[<0>] 0xffffffffffffffff