Linux内核学习笔记——内核页表隔离KPTI机制（源码分析）

本文链接：https://blog.csdn.net/xy010902100449/article/details/128605284

KPTI（KernelPageTableIsolation）是一种安全机制，通过分离用户空间和内核空间的页表防止页表泄露。每个进程拥有独立的内核态和用户态页表，内核态页表在内核态下访问，用户态页表仅包含用户空间内容。在中断发生时，通过切换CR3寄存器来快速在两种页表间切换。KPTI通过CR3的第13位切换实现快速用户态到内核态的转换，提升了安全性并降低了提权攻击的可能性。

摘要生成于 C知道，由 DeepSeek-R1 满血版支持，前往体验 >

KPTI(Kernel PageTable Isolation)全称内核页表隔离，它通过完全分离用户空间与内核空间页表来解决页表泄露。

KPTI中每个进程有两套页表——内核态页表与用户态页表(两个地址空间)。

内核态页表只能在内核态下访问，可以创建到内核和用户的映射（不过用户空间受SMAP和SMEP保护）。

用户态页表只包含用户空间。不过由于涉及到上下文切换，所以在用户态页表中必须包含部分内核地址，用来建立到中断入口和出口的映射。

当中断在用户态发生时，就涉及到切换CR3寄存器，从用户态地址空间切换到内核态的地址空间。

中断上半部的要求是尽可能的快，从而切换CR3这个操作也要求尽可能的快。

为了达到这个目的，KPTI中将内核空间的PGD和用户空间的PGD连续的放置在一个8KB的内存空间中（内核态在低位，用户态在高位）。

这段空间必须是8K对齐的，这样将CR3的切换操作转换为将CR3值的第13位(由低到高)的置位或清零操作，提高了CR3切换的速度。

在这里插入图片描述
开启KPTI后，再想提权就比较有局限性，比如我们常用的ret2usr方式在KPTI下将成为过去时。

开启KPTI内核的entry_SYSCALL_64函数

ENTRY(entry_SYSCALL_64)
    /*
     * Interrupts are off on entry.
     * We do not frame this tiny irq-off block with TRACE_IRQS_OFF/ON,
     * it is too small to ever cause noticeable irq latency.
     */
    SWAPGS_UNSAFE_STACK
    // KPTI 进内核态需要切到内核页表
    SWITCH_KERNEL_CR3_NO_STACK
    /*
     * A hypervisor implementation might want to use a label
     * after the swapgs, so that it can do the swapgs
     * for the guest and jump here on syscall.
     */
GLOBAL(entry_SYSCALL_64_after_swapgs)
    // 将用户栈偏移保存到 per-cpu 变量 rsp_scratch 中
    movq    %rsp, PER_CPU_VAR(rsp_scratch)
    // 加载内核栈偏移
    movq    PER_CPU_VAR(cpu_current_top_of_stack), %rsp

    TRACE_IRQS_OFF

    /* Construct struct pt_regs on stack */
    pushq   $__USER_DS          /* pt_regs->ss */
    pushq   PER_CPU_VAR(rsp_scratch)    /* pt_regs->sp */
    pushq   %r11                /* pt_regs->flags */
    pushq   $__USER_CS          /* pt_regs->cs */
    pushq   %rcx                /* pt_regs->ip */
    pushq   %rax                /* pt_regs->orig_ax */
    pushq   %rdi                /* pt_regs->di */
    pushq   %rsi                /* pt_regs->si */
    pushq   %rdx                /* pt_regs->dx */
    pushq   %rcx                /* pt_regs->cx */
    pushq   $-ENOSYS            /* pt_regs->ax */
    pushq   %r8             /* pt_regs->r8 */
    pushq   %r9             /* pt_regs->r9 */
    pushq   %r10                /* pt_regs->r10 */
    pushq   %r11                /* pt_regs->r11 */
    // 为r12-r15, rbp, rbx保留位置
    sub $(6*8), %rsp            /* pt_regs->bp, bx, r12-15 not saved */

    /*
     * If we need to do entry work or if we guess we'll need to do
     * exit work, go straight to the slow path.
     */
    movq    PER_CPU_VAR(current_task), %r11
    testl   $_TIF_WORK_SYSCALL_ENTRY|_TIF_ALLWORK_MASK, TASK_TI_flags(%r11)
    jnz entry_SYSCALL64_slow_path

entry_SYSCALL_64_fastpath:
    /*
     * Easy case: enable interrupts and issue the syscall.  If the syscall
     * needs pt_regs, we'll call a stub that disables interrupts again
     * and jumps to the slow path.
     */
    TRACE_IRQS_ON
    ENABLE_INTERRUPTS(CLBR_NONE)
#if __SYSCALL_MASK == ~0
    // 确保系统调用号没超过最大值，超过了则跳转到后面的符号 1 处进行返回
    cmpq    $__NR_syscall_max, %rax
#else
    andl    $__SYSCALL_MASK, %eax
    cmpl    $__NR_syscall_max, %eax
#endif
    ja  1f              /* return -ENOSYS (already in pt_regs->ax) */
    // 除系统调用外的其他调用都通过 rcx 来传第四个参数，因此将 r10 的内容设置到 rcx
    movq    %r10, %rcx

    /*
     * This call instruction is handled specially in stub_ptregs_64.
     * It might end up jumping to the slow path.  If it jumps, RAX
     * and all argument registers are clobbered.
     */
    // 调用系统调用表中对应的函数
    call    *sys_call_table(, %rax, 8)
.Lentry_SYSCALL_64_after_fastpath_call:
    // 将函数返回值压到栈中，返回时弹出
    movq    %rax, RAX(%rsp)
1:

    /*
     * If we get here, then we know that pt_regs is clean for SYSRET64.
     * If we see that no exit work is required (which we are required
     * to check with IRQs off), then we can go straight to SYSRET64.
     */
    DISABLE_INTERRUPTS(CLBR_NONE)
    TRACE_IRQS_OFF
    movq    PER_CPU_VAR(current_task), %r11
    testl   $_TIF_ALLWORK_MASK, TASK_TI_flags(%r11)
    jnz 1f

    LOCKDEP_SYS_EXIT   // 宏的实现与 CONFIG_DEBUG_LOCK_ALLOC 内核配置选项相关，该配置允许在退出系统调用时调试锁。
    TRACE_IRQS_ON       /* user mode is traced as IRQs on */
    movq    RIP(%rsp), %rcx
    movq    EFLAGS(%rsp), %r11
    RESTORE_C_REGS_EXCEPT_RCX_R11
    // 恢复除 rxc 和 r11 外所有通用寄存器, 因为 rcx 寄存器为调用系统调用的应用程序的返回地址， r11 寄存器为老的 flags register
    /*
     * This opens a window where we have a user CR3, but are
     * running in the kernel.  This makes using the CS
     * register useless for telling whether or not we need to
     * switch CR3 in NMIs.  Normal interrupts are OK because
     * they are off here.
     */
    SWITCH_USER_CR3    // KPTI 返回用户态需要切回用户页表
    /* 根据压栈的内容，恢复 rsp 为用户态的栈顶 */
    movq    RSP(%rsp), %rsp
    USERGS_SYSRET64
   /* 调用宏 USERGS_SYSRET64 ，其扩展调用 swapgs 指令交换用户 GS 和内核GS， sysret 指令执行从系统调用处理退出 */

........
........

可以看出，在入口和结束的地方都加了SWITCH_CR3相关的宏定义，尝试着分析SWITCH_KERNEL_CR3_NO_STACK，里面汇编实现如下：

mov     rdi, cr3
nop
nop
nop
nop
nop
and     rdi, 0xFFFFFFFFFFFFE7FF
mov     cr3, rdi

拆分FFFFFFFFFFFFE7FF，它的第12和13位是零，这段代码目的就是将CR3的第12位与第13位置零（页表的第12位在CR4寄存器的PCIDE位未开启的情况下，都是保留给OS留做他用），我们只关心13位置零，就相当于CR3-0x1000，从用户态PGD转换成内核态PGD。

再看SWITCH_USER_CR3宏定义的汇编：

mov     rdi, cr3
or      rdi, 1000h
mov     cr3, rdi

同理，将CR3第13位置1，相当于CR3+0x1000，从内核态PGD切换成用户态PGD。