Kernel/User Mode Switching: Details and Design Considerations

This article takes a close look at the concepts of user mode and kernel mode in operating systems and the mechanism for switching between them. By examining the essence of instruction-level privilege control, it reveals the technical details of using exception handling to implement privilege switching.


We know that an OS (operating system) has the notions of kernel mode and user mode, whose purpose is to restrict the privilege with which instructions execute. An instruction running in user mode may not directly perform sensitive operations such as accessing hardware; an instruction running in kernel mode may.
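To see this privilege check in action, here is a minimal sketch (hypothetical demo code, assuming an ordinary x86 Linux user process): it tries to execute hlt, an instruction that requires ring 0, and the CPU refuses; the kernel turns the resulting fault into a SIGSEGV that kills the process.

#include <stdio.h>

int main(void)
{
    printf("about to execute a ring-0-only instruction...\n");
    __asm__ volatile ("hlt");    /* privileged: faults in user mode (#GP -> SIGSEGV) */
    printf("never reached\n");   /* the kernel kills the process first */
    return 0;
}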

If we don't dig into the details, user/kernel mode seems like a perfectly obvious pattern: isn't it just like calling some HTTP API? Nothing special. But once we do dig into the details and ask what exactly happens during a user/kernel mode switch, and why the switching procedure has to be designed this way, many things stop being trivial.

Looking at kernel/user mode, perhaps the most intuitive picture is this: the OS provides a set of kernel functions for users, say funcKernel(), which can be called from any user-mode function funcUser(). And the bodies of these funcKernel() functions are, in turn, implemented by code running in kernel mode.

But this account is flawed, because it does not scrutinize the terms it uses. When we talk about executing in user/kernel mode, we are talking about instructions, not functions. A function can be broken down into multiple instructions; can an instruction be broken down further?

That is, under the notation above, funcKernel and funcUser are no longer functions but individual atomic instructions (let us switch the corresponding symbols to instrKernel and instrUser). So the question becomes: can the execution of one instruction be decomposed into the execution of several other instructions?

Clearly, from the CPU's point of view, an instruction is already the smallest unit of execution and cannot be decomposed.

And here we finally meet the devil in the details: how can the indivisible instructions instrKernel and instrUser achieve something like the nested calls between funcKernel and funcUser?

The most direct, crude idea is of course to imitate calls between methods by brute force: introduce another, virtual layer of instruction set, much like a virtual machine. Every instrUser is a virtual instruction, and only instrKernel is a real hardware instruction. This is essentially the virtual machine approach, and we will set it aside here.

What if we don't introduce such a virtual instruction set layer? Then every instruction, whether it belongs to the OS kernel or to a user application, is a hardware instruction that the CPU treats identically. You cannot claim, for instance, that a user application's mov/add differs from the OS kernel's mov/add. They are all real CPU instructions and are, of course, treated uniformly.

How, then, should the CPU distinguish the "origin" of an instruction? When executing an instruction, how does it know whether it came from instrKernel or from instrUser?

An intuitive solution is to introduce state to distinguish instructions. The CPU ring is exactly such a state variable, except that it is kept in the CPU rather than in the instructions. That is, instead of embedding in every instruction a tag declaring its origin, we put the privilege state directly in the CPU and let the CPU use this ring variable to decide what it is allowed to execute.
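On x86 this ring variable is even directly observable: the Current Privilege Level (CPL) lives in the low two bits of the CS segment register. A minimal sketch (x86/x86-64 specific, hypothetical demo) that lets a user process print its own ring:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint16_t cs;
    /* read the code-segment selector; its low 2 bits are the CPL */
    __asm__ volatile ("mov %%cs, %0" : "=r"(cs));
    printf("CS = 0x%04x, ring = %u\n", cs, cs & 3);  /* a normal process prints ring 3 */
    return 0;
}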

The question that immediately follows: how do we change this state variable? That is, how do we switch between kernel and user mode?

If we put this "change operation" directly into an ordinary instruction, then what should the CPU ring be while that instruction executes?

  • If it is user mode, then couldn't a user application change the CPU's privilege at will and go on executing whatever instructions it pleases? Wouldn't that make privilege control meaningless?
  • If it is kernel mode, then user mode invokes this instruction precisely in order to change the CPU ring; if even the changing instruction is beyond its privilege, how could it ever change its own privilege?

So placing this "change operation" directly in an ordinary instruction is clearly unworkable. Is there any "exceptional" instruction left for the CPU, then? Only one candidate remains: the exception.

When user mode makes a system call, the corresponding operation is to raise an exception (called a trap). The raised exception is caught and handled by an exception handler registered in advance in the trap table.
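As a concrete illustration, here is a minimal sketch (x86-64 Linux assumed) that issues the write(2) system call directly with the syscall instruction, the modern successor of the int 0x80 software trap. This single instruction hands control to the kernel entry point registered in advance and switches the CPU to ring 0:

int main(void)
{
    static const char msg[] = "hello from a trap\n";
    long ret;
    __asm__ volatile (
        "syscall"                       /* the trap: enter the kernel */
        : "=a"(ret)                     /* rax: return value */
        : "0"(1L),                      /* rax: syscall number 1 = write */
          "D"(1L),                      /* rdi: fd 1 = stdout */
          "S"(msg),                     /* rsi: buffer */
          "d"((long)(sizeof msg - 1))   /* rdx: byte count */
        : "rcx", "r11", "memory");      /* registers the syscall insn clobbers */
    return ret < 0;
}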

At that point, the trap handler can change the CPU ring and execute the instructions it specifies. Once all of the trap handler's instructions have run, it changes the CPU ring back to user mode.
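The whole round trip can be modeled in a few lines of ordinary C. The sketch below is a toy software model (all names hypothetical), not real hardware: real hardware performs the table lookup and the ring switch itself, atomically, but the shape of the control flow is the same.

#include <stdio.h>

enum mode { KERNEL = 0, USER = 3 };
static enum mode cpu_ring = USER;         /* the "ring" state variable */

typedef long (*trap_handler)(long arg);
#define NTRAPS 256
static trap_handler trap_table[NTRAPS];   /* filled in at "boot" time */

static long sys_demo(long arg)            /* a stand-in kernel routine */
{
    printf("[ring %d] kernel handler, arg = %ld\n", (int)cpu_ring, arg);
    return 0;
}

static long raise_trap(int num, long arg) /* models the trap instruction */
{
    if (num < 0 || num >= NTRAPS || !trap_table[num])
        return -1;                        /* no registered handler: reject */
    cpu_ring = KERNEL;                    /* entering the trap raises the ring */
    long ret = trap_table[num](arg);      /* run the registered handler */
    cpu_ring = USER;                      /* return-from-trap restores user mode */
    return ret;
}

int main(void)
{
    trap_table[0x80] = sys_demo;          /* "kernel" registers its handler */
    raise_trap(0x80, 42);                 /* "user code" makes a system call */
    return 0;
}

Note the bounds check in raise_trap: a trap number with no registered handler is simply rejected, a point we will come back to at the end.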

At first glance the whole procedure may seem magical and obscure. But look closely: what this "exception + trap handler" scheme implements is precisely the nested calling between funcKernel and funcUser that we started with!

As described above, although instrUser cannot be decomposed, raising an exception forcibly transfers that instruction's flow of execution to the trap handler. Isn't that exactly the control-flow transfer of a function calling a subroutine?! In other words: raising an exception implements flow control at the instruction level. And all these system-call trap handlers are precisely the equivalent of the instrKernel we began with.

At this point we can finally breathe out. This whole series of obscure tricks, introducing the CPU ring, mapping system calls to exception handlers, having the trap handler flip the CPU ring, amounts to nothing more than implementing a "subroutine call" at the instruction level: raising an exception gives us instruction-level control-flow transfer, with no dark magic involved. Once inside the subroutine, checking privileges and changing the CPU ring is the natural next step.

The article could well end here. But a true thinker is never satisfied with merely solving the problem; they go back over how the problem arose, teasing out the meta-questions and meta-cognition behind it. Judging from the answer we reached, all these ideas and mechanisms seem quite natural and intuitive. So why were we confused at the beginning? Why didn't we understand the details of kernel/user mode along this "natural" line of thought from the start?

I think it is because we are too used to staying at the high level, so much so that we do not naturally switch to a low-level perspective. "Exposing a limited API to control privilege" is a common technique, but the high-level habit of mind assumes that "calling an API implies the caller itself can be broken into multiple steps"; once the caller function can no longer be decomposed, this classic solution stops working. From the low-level perspective, however, "whether a function can be called" is not equivalent to "whether the function can be decomposed"; it is equivalent to flow control, that is, whether the flow of execution can be transferred from one place to another.

So why couldn't we grasp this obvious solution from the start? Because while the idea behind the solution is trivial, its implementation lives at the low level. The CPU cannot decompose a caller function the way the high-level view imagines, but it can do flow control by raising an exception, and thereby implement subroutine calls. After all, what is a function call if not a transfer of control flow?

Of course, we can push one step further: if this is just flow control, why not use a plain jump? Why the roundabout route of raising an exception? In fact, the exception handler is itself one implementation of a jump; it is just that managing the handlers by registering them in a trap table gives a much cleaner structure. Moreover, invalid exceptions can simply be dropped when no corresponding trap handler is found, as the toy dispatch sketch above illustrates.

Splendor at its peak returns to plainness: after this whole loop of fancy tricks, everything can finally be understood as the most ordinary of steps.
