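KML: A Study of Its Implementation - rqdmap

This post follows the guide written by the KML author and takes a closer look at how the KML mechanism works on IA-32, along with the techniques behind it.

For how to use KML, see: <Enabling Kernel Mode Linux - rqdmap | blog>

How it works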

To execute user programs in kernel mode, Kernel Mode Linux has a special start_thread (start_kernel_thread) routine, which is called in processing execve(2) and sets registers of a user process to specified initial values. The original start_thread routine sets CS segment register to __USER_CS. The start_kernel_thread routine sets the CS register to __KERNEL_CS. Thus, a user program is started as a user process executed in kernel mode.

The file arch/x86/kernel/process_32.c contains two KML patches:

#ifndef CONFIG_KERNEL_MODE_LINUX
	/*
	 * Save away %gs. No need to save %fs, as it was saved on the
	 * stack on entry.  No need to save %es and %ds, as those are
	 * always kernel segments while inside the kernel.  Doing this
	 * before setting the new TLS descriptors avoids the situation
	 * where we temporarily have non-reloadable segments in %fs
	 * and %gs.  This could be an issue if the NMI handler ever
	 * used %fs or %gs (it does not today), or if the kernel is
	 * running inside of a hypervisor layer.
	 */
	lazy_save_gs(prev->gs);
#endif


#ifdef CONFIG_KERNEL_MODE_LINUX
void
start_kernel_thread(struct pt_regs *regs, unsigned long new_ip, unsigned long new_sp)
{
	set_user_gs(regs, 0);
	regs->fs		= __KERNEL_PERCPU;
	set_fs(KERNEL_DS);
	regs->ds		= __USER_DS;
	regs->es		= __USER_DS;
	regs->ss		= __KERNEL_DS;
	regs->cs		= __KU_CS_EXCEPTION;
	regs->ip		= new_ip;
	regs->sp		= new_sp;
	regs->flags		= X86_EFLAGS_IF;

	if (cpu_has_smap) {
		regs->flags	|= X86_EFLAGS_AC;
	}

	/*
	 * force it to the iret return path by making it look as if there was
	 * some work pending.
	 */
	set_thread_flag(TIF_NOTIFY_RESUME);
}
EXPORT_SYMBOL_GPL(start_kernel_thread);
#endif

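According to the guide, the execve system call ends up in this function, so the new process starts with a kernel code segment in CS. Let's confirm the call path with gdb (note that the trace below was captured on an x86-64 build, as the process_64.c and entry_64.S paths show):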

(gdb) bt
#0  start_kernel_thread (regs=0xffff88000720ff58, new_ip=4199856, new_sp=140731794638672) at arch/x86/kernel/process_64.c:271
#1  0xffffffff811b5f30 in load_elf_binary (bprm=0xffff8800065a4500) at fs/binfmt_elf.c:1039
#2  0xffffffff8116b76a in search_binary_handler (bprm=0xffff8800065a4500) at fs/exec.c:1425
#3  0xffffffff8116d21f in exec_binprm (bprm=<optimized out>) at fs/exec.c:1467
#4  do_execve_common (filename=0xffff880000014000, argv=..., envp=...) at fs/exec.c:1564
#5  0xffffffff8116d482 in do_execve (filename=0xffff88000720ff58, __argv=0xffff880006af9200, __envp=0x25b4354a) at fs/exec.c:1606
#6  0xffffffff8116d4ba in SYSC_execve (envp=<optimized out>, argv=<optimized out>, filename=<optimized out>) at fs/exec.c:1660
#7  SyS_execve (filename=<optimized out>, argv=140096697936064, envp=140096697936080) at fs/exec.c:1658
#8  0xffffffff818311eb in stub_execve () at arch/x86/kernel/entry_64.S:787
#9  0x00000000004d4048 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

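That doesn't seem to tell us much more…

KML's most distinctive feature is that it can run a user program in kernel mode, so the natural place to start is how it executes an executable file.

The official introduction says that KML defines its own kernel routine, start_kernel_thread, to start a user program that runs in kernel mode. So let's trace, caller by caller, where that function is invoked from!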

fs/binfmt_elf.c:elf_format

static struct linux_binfmt elf_format = {
	.module		= THIS_MODULE,
	.load_binary	= load_elf_binary,
	.load_shlib	= load_elf_library,
	.core_dump	= elf_core_dump,
	.min_coredump	= ELF_EXEC_PAGESIZE,
};

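You never know enough until you actually need it. Within the kernel, the executable-loading machinery sits quite far down the tech tree, and it touches a bit of everything… For now, let's just skim the relevant parts!

ELF (Executable and Linking Format) is ubiquitous in the Unix world. Its linux_binfmt object essentially provides three methods: load_binary, load_shlib and core_dump. Here we care mainly about load_binary, which reads the information stored in the executable file and sets up a new execution environment for the current process.

So where does running an executable sit within the kernel?

When an executable is run, every exec-family function other than execve() is a C library wrapper that ends up invoking the execve() system call. The system call entry, sys_execve, in turn calls do_execve, which does the preparation (for example, copying the file pathname, command-line arguments and environment strings into one or more newly allocated page frames, and filling in the linux_binprm structure). It then walks the formats list (a singly linked list holding all linux_binfmt objects) and passes the linux_binprm to each element's load_binary method in turn; as soon as one of them accepts the file, the scan of formats stops.

Now let's step into what the load_binary field points to, load_elf_binary, and see what it actually does: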

fs/binfmt_elf.c:load_elf_binary

This function is rather long, a bit over 400 lines… so we won't read all of it and will only look at the parts KML changes:

static int load_elf_binary(struct linux_binprm *bprm)
{
#ifdef CONFIG_KERNEL_MODE_LINUX
	kernel_mode = is_safe(bprm->file);
#endif

	...

#ifndef CONFIG_KERNEL_MODE_LINUX
	start_thread(regs, elf_entry, bprm->p);
#else
	if (kernel_mode) {
		start_kernel_thread(regs, elf_entry, bprm->p);
	} else {
		start_thread(regs, elf_entry, bprm->p);
	}
#endif
	...

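It changes two main things:

- It calls is_safe to decide whether the file qualifies to run under KML, that is, whether it lives under the /trusted directory.
- It adds KML's own start_kernel_thread and calls it instead of start_thread for trusted files. Note that both calls sit at the very end of load_elf_binary, which means that everything done before them is exactly the same as for an ordinary executable. What start_thread does is modify the user-mode eip and esp saved on the kernel stack, so that they point to the entry point (of the dynamic linker, for a dynamically linked program) and to the top of the new user-mode stack.

Let's look at the code for these two pieces: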

fs/binfmt_elf.c:is_safe

#include <linux/fs_struct.h>
/*
 * XXX : we haven't implemented safety check of user programs.
 */
#define TRUSTED_DIR_STR		"/trusted/"
#define TRUSTED_DIR_STR_LEN	9

static inline int is_safe(struct file* file)
{
	int ret;
	char* path;
	char* tmp;

#ifdef CONFIG_KML_CHECK_CHROOT
	if (current_chrooted()) {
		return 0;
	}
#endif

	tmp = (char*)__get_free_page(GFP_KERNEL);

	if (!tmp) {
		return 0;
	}

	path = d_path(&file->f_path, tmp, PAGE_SIZE);
	ret = (0 == strncmp(TRUSTED_DIR_STR, path, TRUSTED_DIR_STR_LEN));

	free_page((unsigned long)tmp);
	return ret;
}
#endif

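If the chroot check is enabled, the kernel's own current_chrooted() is used to reject processes running inside a chroot. The function then allocates one page as a buffer, checks whether the file's path starts with /trusted/, frees the page and returns the result. As the XXX comment admits, no real safety check of the program itself is performed.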
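The prefix test at the heart of is_safe can be tried out in user space. The following is only a minimal sketch, not the kernel code: path_is_trusted is a hypothetical helper name, and it takes an already-resolved path string where the kernel would call d_path.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TRUSTED_DIR_STR     "/trusted/"
#define TRUSTED_DIR_STR_LEN 9

/* The hard-coded length must match the literal (sizeof counts the NUL). */
static_assert(sizeof(TRUSTED_DIR_STR) - 1 == TRUSTED_DIR_STR_LEN,
              "TRUSTED_DIR_STR_LEN out of sync with TRUSTED_DIR_STR");

/* Hypothetical user-space stand-in for the strncmp test in is_safe().
 * Returns 1 when path starts with "/trusted/", 0 otherwise. */
int path_is_trusted(const char *path)
{
	if (path == NULL)
		return 0;
	return strncmp(TRUSTED_DIR_STR, path, TRUSTED_DIR_STR_LEN) == 0;
}
```

Because the trailing slash is part of the prefix, /trusted/bin/app passes, while /trustedx/app and /home/trusted/app do not.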

The signature of d_path, with its documentation comment:

/**
 * d_path - return the path of a dentry
 * @path: path to report
 * @buf: buffer to return value in
 * @buflen: buffer length
 *
 * Convert a dentry into an ASCII path name. If the entry has been deleted
 * the string " (deleted)" is appended. Note that this is ambiguous.
 *
 * Returns a pointer into the buffer or an error code if the path was
 * too long. Note: Callers should use the returned pointer, not the passed
 * in buffer, to use the name! The implementation often starts at an offset
 * into the buffer, and may leave 0 bytes at the start.
 *
 * "buflen" should be positive.
 */
char *d_path(const struct path *path, char *buf, int buflen)

start_kernel_thread itself lives here:

arch/x86/kernel/process_32.c:start_kernel_thread

#ifdef CONFIG_KERNEL_MODE_LINUX
void
start_kernel_thread(struct pt_regs *regs, unsigned long new_ip, unsigned long new_sp)
{
	set_user_gs(regs, 0);
	regs->fs		= __KERNEL_PERCPU;
	set_fs(KERNEL_DS);
	regs->ds		= __USER_DS;
	regs->es		= __USER_DS;
	regs->ss		= __KERNEL_DS;
	regs->cs		= __KU_CS_EXCEPTION;
	regs->ip		= new_ip;
	regs->sp		= new_sp;
	regs->flags		= X86_EFLAGS_IF;

	if (cpu_has_smap) {
		regs->flags	|= X86_EFLAGS_AC;
	}

	/*
	 * force it to the iret return path by making it look as if there was
	 * some work pending.
	 */
	set_thread_flag(TIF_NOTIFY_RESUME);
}
EXPORT_SYMBOL_GPL(start_kernel_thread);
#endif

Compare it with the stock start_thread in the same file:

void
start_thread(struct pt_regs *regs, unsigned long new_ip, unsigned long new_sp)
{
	set_user_gs(regs, 0);
	regs->fs		= 0;
	regs->ds		= __USER_DS;
	regs->es		= __USER_DS;
	regs->ss		= __USER_DS;
	regs->cs		= __USER_CS;
	regs->ip		= new_ip;
	regs->sp		= new_sp;
	regs->flags		= X86_EFLAGS_IF;
	/*
	 * force it to the iret return path by making it look as if there was
	 * some work pending.
	 */
	set_thread_flag(TIF_NOTIFY_RESUME);
}
EXPORT_SYMBOL_GPL(start_thread);

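As expected, the main job of these functions is to set the saved user esp and eip, although a few other register fields are changed as well.

The first step is set_user_gs(regs, 0), which boils down to: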

#define set_user_gs(regs, v)	loadsegment(gs, (unsigned long)(v))

I asked ChatGPT about it:

In general, set_user_gs could potentially be a function that sets the value of a user-level GS register to 0. GS is a segment register in x86 architecture that is used for thread-local storage. Setting it to 0 would effectively clear any thread-local storage associated with the current thread.

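So for now we can treat it as clearing whatever TLS (Thread-Local Storage) the current thread might have..

As for the fs register, ChatGPT says:

In Linux, fs is a segment register that holds the user-mode data segment selector of the current process. The user-mode data segment is the memory region a process uses to store its data, including code, heap, stack and so on. The value of fs is used to compute the linear address of the user-mode data segment, so that the process can access its own user-mode data.

On 64-bit Linux, fs is usually used to hold the pointer to thread-local storage (TLS). TLS is a mechanism that lets programmers safely access global variables in a multithreaded environment without worrying about race conditions. With TLS, each thread has its own copy of those variables, and threads do not interfere with one another.

Note that fs in Linux is not used to hold a function's frame pointer (FP). In Linux, the FP is usually kept in rbp.

That answer is generic and does not quite fit our case: in the 32-bit x86 kernel, __KERNEL_PERCPU is the selector of the per-CPU data segment, and the kernel keeps it in fs. So fs should be read here as pointing to an extra data segment. But why are there two operations that mention fs?

The first one, regs->fs = __KERNEL_PERCPU, makes FS point into the per-CPU address space. The second one, set_fs(KERNEL_DS), is much more puzzling, so let's chase down its definition: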

/*
 * The fs value determines whether argument validity checking should be
 * performed or not.  If get_fs() == USER_DS, checking is performed, with
 * get_fs() == KERNEL_DS, checking is bypassed.
 *
 * For historical reasons, these macros are grossly misnamed.
 */

#define MAKE_MM_SEG(s)	((mm_segment_t) { (s) })

#define KERNEL_DS	MAKE_MM_SEG(-1UL)
#define USER_DS 	MAKE_MM_SEG(TASK_SIZE_MAX)

#define get_ds()	(KERNEL_DS)
#define get_fs()	(current_thread_info()->addr_limit)
#define set_fs(x)	(current_thread_info()->addr_limit = (x))

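This comment really has to be quoted… otherwise there is no way to tell what is going on! According to it, this pair of macros has nothing to do with the fs register any more; they only control argument validity checking. So what is that check?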

In the Linux kernel, argument validity checking is typically performed by the kernel’s system call interface. The system call interface is responsible for providing a secure and reliable way for user-space applications to interact with the kernel.

When a user-space application makes a system call, the kernel first checks the validity of the arguments provided by the application. This is done to ensure that the system call will not cause any unintended behavior or security vulnerabilities.

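ChatGPT knows the kernel surprisingly well..! More precisely, the check makes sure that pointers passed in as system call arguments lie below addr_limit, i.e. inside the user address range. Once set_fs(KERNEL_DS) raises the limit to the top of the address space, that check is effectively bypassed.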
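The effect of the limit can be modelled in a few lines. This is a minimal sketch of the comparison only, assuming a 32-bit address space with the default 3G/1G split; range_ok and the two limit constants are illustrative names, not kernel API.

```c
#include <assert.h>
#include <stdint.h>

#define USER_LIMIT   0xC0000000u  /* assumed TASK_SIZE_MAX (3G/1G split) */
#define KERNEL_LIMIT 0xFFFFFFFFu  /* -1UL on a 32-bit machine */

static_assert(USER_LIMIT < KERNEL_LIMIT, "the user limit is the stricter one");

/* Returns 1 when addr + size does not exceed limit.
 * Written so that addr + size can never wrap around. */
int range_ok(uint32_t addr, uint32_t size, uint32_t limit)
{
	return size <= limit && addr <= limit - size;
}
```

With USER_LIMIT a pointer into kernel space such as 0xC0100000 is rejected; with KERNEL_LIMIT the same pointer is accepted, which is what a process started by start_kernel_thread gets.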

Next, ds and es are set, followed by the ss and cs segment registers. ss points to the kernel data segment, while cs is set to the odd-looking __KU_CS_EXCEPTION. Let's take a look:

arch/x86/include/asm/segment.h

#define __USER_DS	(GDT_ENTRY_DEFAULT_USER_DS*8+3)
#define __USER_CS	(GDT_ENTRY_DEFAULT_USER_CS*8+3)

#ifdef CONFIG_KERNEL_MODE_LINUX

#define __KU_CS_INTERRUPT	((1 << 16) | __USER_CS)
#define __KU_CS_EXCEPTION	((1 << 17) | __USER_CS)

#define kernel_mode_user_process(cs) ((cs) & 0xffff0000)
#define need_error_code_fix_on_page_fault(cs) ((cs) == __KU_CS_EXCEPTION)

#endif

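The low two bits of __USER_CS are binary 11 (the privilege level), hence the +3 in its definition. What KML does is borrow the upper 16 bits of the 32-bit slot to stash a few flags of its own.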
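The bit layout is easy to check by hand. The sketch below is illustrative only: it assumes GDT_ENTRY_DEFAULT_USER_CS is 14 (its value in the mainline 32-bit kernel, not shown in the excerpt above), and the helper names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed GDT index of the default user code segment on 32-bit x86. */
#define GDT_ENTRY_DEFAULT_USER_CS 14
#define USER_CS          (GDT_ENTRY_DEFAULT_USER_CS * 8 + 3)
#define KU_CS_INTERRUPT  ((1u << 16) | USER_CS)
#define KU_CS_EXCEPTION  ((1u << 17) | USER_CS)

static_assert(USER_CS == 0x73, "index 14, TI = 0, RPL = 3");

/* Low 16 bits: the selector the hardware actually understands. */
uint16_t selector_of(uint32_t cs_slot) { return (uint16_t)(cs_slot & 0xffffu); }

/* Requested privilege level: the low two bits of the selector. */
unsigned rpl_of(uint32_t cs_slot) { return cs_slot & 3u; }

/* Same test as kernel_mode_user_process(): any KML mark in the upper half? */
int has_kml_mark(uint32_t cs_slot) { return (cs_slot & 0xffff0000u) != 0; }
```

Under that assumption __KU_CS_EXCEPTION is 0x20073: masking with 0xffff gives back the plain user selector 0x73, and only the upper half tells the kernel that this is a KML process.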

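These flags are consumed by the two related macros kernel_mode_user_process and need_error_code_fix_on_page_fault. The first is used only in arch/x86/kernel/signal.c, the second in arch/x86/mm/fault.c when handling page faults. I won't go into those here; instead, let's see how the flag is used on the system call path:

~~This flag is actually very important and seems to affect how many kernel features are implemented..~~

System calls

Why does the 32-bit x86 code put __KU_CS_EXCEPTION into regs->cs?

On x86-64, the code simply stores __KERNEL_CS with special flags ORed into the upper bits! That is why C code running on the kernel I built really does read CPL 0 from the cs register.. On 32-bit x86, though, that is not so obvious: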

#define __KU_CS			(0x7fff0003 | __KERNEL_CS)

void
start_kernel_thread(struct pt_regs *regs, unsigned long new_ip, unsigned long new_sp)
{
	int cpu = smp_processor_id();
	loadsegment(fs, 0);
	loadsegment(es, 0);
	loadsegment(ds, 0);
	load_gs_index(0);
	current->thread.usersp	= new_sp;
	regs->ip		= new_ip;
	regs->sp		= new_sp;
	this_cpu_write(old_rsp, new_sp);
	regs->cs		= __KU_CS;
	regs->ss		= __KERNEL_DS;
	regs->flags		= X86_EFLAGS_IF;

	if (cpu_has_smap) {
		regs->flags	|= X86_EFLAGS_AC;
	}

	set_fs(KERNEL_DS);
	set_thread_flag(TIF_KU);
	wrmsrl(MSR_KERNEL_GS_BASE, (unsigned long)per_cpu(irq_stack_union.gs_base, cpu));
}
EXPORT_SYMBOL_GPL(start_kernel_thread);

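So how does 32-bit x86 make the process actually run in kernel mode? I chose to check this (at least in theory) through the way it makes system calls, that is, by tracking where the __KU_CS_EXCEPTION flag actually takes effect.

arch/x86/kernel/direct_call_32.h defines assembly macros whose purpose is to generate the system call table. The macros nest too deeply to go through in full, so let's look only at the innermost layer: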

arch/x86/kernel/direct_call_32.h:MAKE_DIRECTCALL_SPECIAL

#define MAKE_DIRECTCALL_SPECIAL(entry, argnum, syscall_num) \
.text; \
ENTRY(direct_ ## entry); \
	pushl %ebx; \
	pushl %edi; \
	pushl %esi; \
	pushl %ebp; \
	add $-4, %esp; \
\
	movl $(syscall_num), %eax; \
\
	call direct_special_work_ ## argnum; \
\
	pushfl; \
	pushl %cs; \
	pushl $direct_wrapper_int_post; \
	jmp system_call;

After various preparations it pushes EFLAGS, %cs and a return address onto the stack, and only then jumps to the real system_call:

arch/x86/kernel/entry_32.S:system_call

asm

	# system call handler stub
ENTRY(system_call)
	RING0_INT_FRAME			# can't unwind into user space anyway
	SWITCH_STACK_TO_KK_EXCEPTION
	...
#ifdef CONFIG_KERNEL_MODE_LINUX
restore_all_return:
/* Switch stack KK -> KU. */
	/* check whether if stack switch occured or not */
	cmpw $0x0, 6(%esp)
	jne ret_to_ku
#endif
	...


ENTRY(ret_to_ku)
	ASM_STAC
	cmpl $__KU_CS_EXCEPTION, 4(%esp)
	je ret_to_ku_from_exception
	jmp ret_to_ku_from_interrupt


/*
 * The stack layout for ret_to_ku_from_exception:
 *
 * %esp --> EIP
 *          __KU_CS_EXCEPTION
 *          EFLAGS
 *          ESP
 *          XXX
 *          ...
 */
ENTRY(ret_to_ku_from_exception)
	movl $__KERNEL_CS, 4(%esp)	/* XCS = __KERNEL_CS */
	pushl %ebp

	/* check whether if we can skip iret or not */
	movl 12(%esp), %ebp		/* load EFLAGS to %ebp */
	testl $~(0x240fd7), %ebp
	movl 16(%esp), %ebp		/* load ESP to %ebp */
	jz skip_iret

	addl $-16, %ebp
ret_to_ku_mov_ebp:	popl (%ebp)		/* old EBP */
ret_to_ku_mov_eip:	popl 4(%ebp)		/* EIP */
ret_to_ku_mov_cs:	popl 8(%ebp)		/* XCS */
ret_to_ku_mov_eflags:	popl 12(%ebp)		/* EFLAGS */
			movl %ebp, %esp		/* switch the stack! */
ret_to_ku_pop_ebp:	popl %ebp		/* %ebp = old EBP */
ret_to_ku_iret:		INTERRUPT_RETURN

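On the return path of system_call, the code compares the 16-bit word at offset 6 from %esp. The saved cs occupies bytes 4 to 7, so this word is its upper half, which is nonzero only if KML has put one of its special marks there; that is how the code recognises that it is dealing with KML. The mark is then examined in ret_to_ku: if it equals __KU_CS_EXCEPTION, control goes to ret_to_ku_from_exception for further processing, which overwrites the marked slot with the real __KERNEL_CS.

I won't dig into the details.. but at this point we can at least conclude that on the way back from a system call the real __KERNEL_CS is what gets loaded into the cs register, so the program runs with CPL 0.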

Stack starvation

The KML documentation also describes several problems in implementing the mechanism. Let's try to walk through them.

The biggest problem of implementing Kernel Mode Linux is a stack starvation problem. Let’s assume that a user program is executed in kernel mode and it causes a page fault on its user stack. To generate a page fault exception, an IA-32 CPU tries to push several registers (EIP, CS, and so on) to the same user stack because the program is executed in kernel mode and the IA-32 CPU doesn’t switch its stack to a kernel stack. Therefore, the IA-32 CPU cannot push the registers and generate a double fault exception and fail again. Finally, the IA-32 CPU gives up and reset itself. This is the stack starvation problem.

To solve the stack starvation problem, we use the IA-32 hardware task mechanism to handle exceptions. By using the mechanism, IA-32 CPU doesn’t push the registers to its stack. Instead, the CPU switches an execution context to another special context. Therefore, the stack starvation problem doesn’t occur. However, it is costly to handle all exceptions by the IA-32 task mechanism. So, in current Kernel Mode Linux implementation, double fault exceptions are handled by the IA-32 task. A page fault on a memory stack is not so often, so the cost of the IA-32 task mechanism is negligible for usual programs. In addition, non-maskable interrupts are also handled by the IA-32 task. The reason is described later in this document.

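Suppose a KML process takes a page fault on its user stack. To raise the page fault exception, the IA-32 CPU tries to push several registers onto that same user stack, because the program is already running in kernel mode and no stack switch takes place. The push fails, which raises a double fault; delivering the double fault fails in the same way, and in the end the CPU gives up and resets itself.

KML's solution is to use the IA-32 hardware task mechanism for exception handling. Instead of pushing registers onto the current stack, the CPU switches to a separate, dedicated context. This kind of switch is expensive, but in KML only double faults (and, as the guide explains later, non-maskable interrupts) are handled through a task, so the overhead is acceptable.

Native double-fault handling

In a stock kernel, double faults are handled in arch/x86/kernel/doublefault.c: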

#define DOUBLEFAULT_STACKSIZE (1024)
static unsigned long doublefault_stack[DOUBLEFAULT_STACKSIZE];
#define STACK_START (unsigned long)(doublefault_stack+DOUBLEFAULT_STACKSIZE)

struct tss_struct doublefault_tss __cacheline_aligned = {
	.x86_tss = {
		.sp0		= STACK_START,
		.ss0		= __KERNEL_DS,
		.ldt		= 0,
		.io_bitmap_base	= INVALID_IO_BITMAP_OFFSET,

		.ip		= (unsigned long) doublefault_fn,
		/* 0x2 bit is always set */
		.flags		= X86_EFLAGS_SF | 0x2,
		.sp		= STACK_START,
		.es		= __USER_DS,
		.cs		= __KERNEL_CS,
		.ss		= __KERNEL_DS,
		.ds		= __USER_DS,
		.fs		= __KERNEL_PERCPU,

		.__cr3		= __pa_nodebug(swapper_pg_dir),
	}
};

It runs doublefault_fn on a dedicated stack whose top is STACK_START. KML does things differently and implements its own double-fault handling:

cpu_init

cpu_init initialises two KML-related TSS segments:

#ifdef CONFIG_KERNEL_MODE_LINUX
	struct tss_struct* doublefault_tss = &per_cpu(doublefault_tsses, cpu);
	struct tss_struct* nmi_tss = &per_cpu(nmi_tsses, cpu);
#endif

#ifndef CONFIG_KERNEL_MODE_LINUX
#ifdef CONFIG_DOUBLEFAULT
	/* Set up doublefault TSS pointer in the GDT */
	__set_tss_desc(cpu, GDT_ENTRY_DOUBLEFAULT_TSS, &doublefault_tss);
#endif
#else
	init_doublefault_tss(cpu);
	init_nmi_tss(cpu);
	__set_tss_desc(cpu, GDT_ENTRY_DOUBLEFAULT_TSS, doublefault_tss);
	__set_tss_desc(cpu, GDT_ENTRY_NMI_TSS, nmi_tss);
#endif

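The first part takes pointers to two per-CPU TSS segments, which are used further down. For background on the TSS, see: ULK: Processes - rqdmap | blog.

First, though, let's see what these two per-CPU variables contain:

task.c:INIT_TSS

Both segments are defined in task.c, a file supplied entirely by KML: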

extern void nmi_task(void);
extern void double_fault_task(void);

#define INIT_DFT {						\
	.x86_tss = {						\
		.ss0		= __KERNEL_DS,			\
		.ldt		= 0,				\
		.fs		= __KERNEL_PERCPU,		\
		.gs		= 0,				\
		.io_bitmap_base	= INVALID_IO_BITMAP_OFFSET,	\
		.ip		= (unsigned long) double_fault_task,	\
		.flags		= X86_EFLAGS_SF | 0x2,		\
		.es		= __USER_DS,			\
		.cs		= __KERNEL_CS,			\
		.ss		= __KERNEL_DS,			\
		.ds		= __USER_DS			\
	}							\
}

#define INIT_NMIT {						\
	.x86_tss = {						\
		.ss0		= __KERNEL_DS,			\
		.ldt		= 0,				\
		.fs		= __KERNEL_PERCPU,		\
		.gs		= 0,				\
		.io_bitmap_base	= INVALID_IO_BITMAP_OFFSET,	\
		.ip		= (unsigned long) nmi_task,	\
		.flags		= X86_EFLAGS_SF | 0x2,		\
		.es		= __USER_DS,			\
		.cs		= __KERNEL_CS,			\
		.ss		= __KERNEL_DS,			\
		.ds		= __USER_DS			\
	}							\
}

DEFINE_PER_CPU(struct tss_struct, nmi_tsses) = INIT_NMIT;
DEFINE_PER_CPU(struct tss_struct, doublefault_tsses) = INIT_DFT;

processor.h:tss_struct

For reference, here is tss_struct. As you can see, task.c mostly fills in the fields of x86_hw_tss, the part that mirrors the hardware register state.

struct tss_struct {
	/*
	 * The hardware state:
	 */
	struct x86_hw_tss	x86_tss;

	/*
	 * The extra 1 is there because the CPU will access an
	 * additional byte beyond the end of the IO permission
	 * bitmap. The extra byte must be all 1 bits, and must
	 * be within the limit.
	 */
	unsigned long		io_bitmap[IO_BITMAP_LONGS + 1];

	/*
	 * .. and then another 0x100 bytes for the emergency kernel stack:
	 */
	unsigned long		stack[64];

#ifdef CONFIG_KERNEL_MODE_LINUX
#define KML_STACK_SIZE (8*16)
	char			kml_stack[KML_STACK_SIZE] __attribute__ ((aligned (16)));
#endif
} ____cacheline_aligned;

The hardware part is quoted as well, for reference when reading the assembly later:

#ifdef CONFIG_X86_32
/* This is the TSS defined by the hardware. */
struct x86_hw_tss {
	unsigned short		back_link, __blh;
	unsigned long		sp0;
	unsigned short		ss0, __ss0h;
	unsigned long		sp1;
	/* ss1 caches MSR_IA32_SYSENTER_CS: */
	unsigned short		ss1, __ss1h;
	unsigned long		sp2;
	unsigned short		ss2, __ss2h;
	unsigned long		__cr3;
	unsigned long		ip;
	unsigned long		flags;
	unsigned long		ax;
	unsigned long		cx;
	unsigned long		dx;
	unsigned long		bx;
	unsigned long		sp;
	unsigned long		bp;
	unsigned long		si;
	unsigned long		di;
	unsigned short		es, __esh;
	unsigned short		cs, __csh;
	unsigned short		ss, __ssh;
	unsigned short		ds, __dsh;
	unsigned short		fs, __fsh;
	unsigned short		gs, __gsh;
	unsigned short		ldt, __ldth;
	unsigned short		trace;
	unsigned short		io_bitmap_base;

} __attribute__((packed));
#else
struct x86_hw_tss {
	u32			reserved1;
	u64			sp0;
	u64			sp1;
	u64			sp2;
	u64			reserved2;
	u64			ist[7];
	u32			reserved3;
	u32			reserved4;
	u16			reserved5;
	u16			io_bitmap_base;

} __attribute__((packed)) ____cacheline_aligned;
#endif

The initialisers above also involve the TSS structure and the addresses of two assembly routines, which are covered in the next sections:

entry_32.S:double_fault_task

asm

/*
 * This is a task-handler for double fault.
 * In Kernel Mode Linux, user programs may be executed in ring 0 (kernel mode).
 * Therefore, normal interrupt handling mechanism doesn't work.
 * For example, if a page fault occurs in a stack,
 * CPU cannot generate a page fault exception because there is no stack
 * to save the CPU context. We call this problem "stack starvation".
 * To solve the stack starvation, we handle double fault with task-handler.
 *
 * Initial stack layout (dft_stack_struct)
 *
 * %esp --> error_code (<-- pushed by CPU)
 *          pointer to dft_tss
 *          pointer to normal_tss
 */
ENTRY(double_fault_task)
	movl 4(%esp), %edi		# get current TSS.
/* %edi = current_tss */
	movl 8(%esp), %ebx		# get normal TSS.
/* %ebx = prev_tss */

	# get kernel stack.
	kml_get_kernel_stack %ebx, %esi

	movl %esi, %esp
/* From now on, we can use stack. */

	# recreate stack layout as if normal interrupt occurs.
	kml_recreate_kernel_stack_layout %ebx

	call_helper prepare_fault_handler, $double_fault_fixup, %edi, %ebx

	ret_from_task_without_iret %edi, GDT_ENTRY_TSS

	jmp double_fault_task

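This is the handler routine. Its address was stored in the ip field of the TSS above, so whenever execution resumes from that TSS, this routine runs first.

Todo: who resumes it? Where is it invoked from?

Going by the comment, we take this to be the stack layout when the double fault arrives, so the two pointers are first loaded into edi and ebx.

Todo: who pushed them? When and where?

Next comes an assembly macro: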

entry_32.S:kml_get_kernel_stack

asm

.macro kml_get_kernel_stack pre_tss, ret
	cmpw $__KERNEL_CS, TSS_CS(\pre_tss)
	jne 1f

	movl TSS_ESP(\pre_tss), \ret
	# If the previous ESP points to kernel-space,
	# we used the kernel stack.
	cmpl $TASK_SIZE, \ret
	jbe 1f

	# If we were in the first instruction of
	# ia32_sysenter_target, the previous ESP points to
	# tss->esp1, so we need to reset it to tss->esp0.
	# EIP will be adjusted in task.c
	cmpl $ia32_sysenter_target, TSS_EIP(\pre_tss)
	jne 2f

	# We used the user stack, so
	# needs to load the kernel stack
	# from ESP0 field of TSS.
1:
	movl PER_CPU_VAR(esp0), \ret
/*
	movl $(__KERNEL_PERCPU), %eax
	movl %eax, %ds
	movl (ESP0_IN_PDA), \ret
	movl $(__USER_DS), %eax
	movl %eax, %ds
*/
2:
.endm

The very first line is already a hurdle. The two operands of the comparison are:

#define __KERNEL_CS (GDT_ENTRY_KERNEL_CS*8), where GDT_ENTRY_KERNEL_CS is:

#define GDT_ENTRY_KERNEL_BASE		(12)
#define GDT_ENTRY_KERNEL_CS		(GDT_ENTRY_KERNEL_BASE+0)

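This depends on where each segment sits in the GDT; see <ULK: Memory Addressing - rqdmap | blog>

TSS_CS = 76. To see why, go back to the x86_hw_tss layout quoted above. On IA-32 an unsigned long is 4 bytes and an unsigned short is 2, so back_link/__blh take bytes 0-3, the seventeen 4-byte slots from sp0 to di take bytes 4-71 (cx, for example, sits at 44-47), es/__esh sit at 72-75, and cs starts at byte 76.
- unsigned long must not be read as 8 bytes here: that is its size on x86-64, whereas this structure describes the 32-bit hardware TSS, and with 4-byte longs the offsets line up exactly.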
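The offset can be checked mechanically. The sketch below copies the 32-bit layout but spells unsigned long as uint32_t, so the result does not depend on the machine doing the compiling; the struct name hw_tss32 and the shortened padding names are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 32-bit hardware TSS with fixed-width types in place of
 * unsigned short / unsigned long. */
struct hw_tss32 {
	uint16_t back_link, blh;
	uint32_t sp0;
	uint16_t ss0, ss0h;
	uint32_t sp1;
	uint16_t ss1, ss1h;
	uint32_t sp2;
	uint16_t ss2, ss2h;
	uint32_t cr3, ip, flags;
	uint32_t ax, cx, dx, bx, sp, bp, si, di;
	uint16_t es, esh;
	uint16_t cs, csh;
	uint16_t ss, ssh;
	uint16_t ds, dsh;
	uint16_t fs, fsh;
	uint16_t gs, gsh;
	uint16_t ldt, ldth;
	uint16_t trace;
	uint16_t io_bitmap_base;
};

/* Compile-time checks: the build fails if any offset is off. */
static_assert(offsetof(struct hw_tss32, cx) == 44, "cx is a 4-byte slot");
static_assert(offsetof(struct hw_tss32, es) == 72, "es sits at 72");
static_assert(offsetof(struct hw_tss32, cs) == 76, "this is TSS_CS");
static_assert(sizeof(struct hw_tss32) == 104, "size of the hardware TSS");
```

Compiling the kernel's own definition on a 64-bit host would instead make every unsigned long 8 bytes wide, which is probably where the apparent mismatch comes from.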

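The jne 1f that follows is more familiar: 1f means the next local label 1 further down, so this is simply a jump to a label. If the comparison finds the values different, the previous task was not in ring 0; it was ordinary user-mode code, and the macro jumps to label 1, which loads the kernel stack pointer from the per-CPU esp0.

If they are equal, the previous task was running with the kernel CS, which can be either the real kernel or a KML process, so further checks are needed.

First the macro compares $TASK_SIZE with the esp saved in the TSS. TASK_SIZE is defined as: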

/* Some constant macros are used in both assembler and
 * C code.  Therefore we cannot annotate them always with
 * 'UL' and other type specifiers unilaterally.  We
 * use the following macros to deal with this.
 *
 * Similarly, _AT() will cast an expression with a type in C, but
 * leave it unchanged in asm.
 */

#ifdef __ASSEMBLY__
#define _AC(X,Y)	X
#define _AT(T,X)	X
#else
#define __AC(X,Y)	(X##Y)
#define _AC(X,Y)	__AC(X,Y)
#define _AT(T,X)	((T)(X))
#endif

/*
 * This handles the memory map.
 *
 * A __PAGE_OFFSET of 0xC0000000 means that the kernel has
 * a virtual address space of one gigabyte, which limits the
 * amount of physical memory you can use to about 950MB.
 *
 * If you want more physical memory than this then see the CONFIG_HIGHMEM4G
 * and CONFIG_HIGHMEM64G options in the kernel configuration.
 */
#define __PAGE_OFFSET		_AC(CONFIG_PAGE_OFFSET, UL)

#define TASK_SIZE               (__PAGE_OFFSET)

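In plain terms, this value is where the kernel address space begins.

So by comparing the saved esp with $TASK_SIZE, the macro decides whether the previous task's esp pointed into kernel space. If esp is at or below TASK_SIZE (jbe), the task was a KML process on its user stack, and the kernel stack is again taken from esp0. If esp is above TASK_SIZE, the task was already on a kernel stack, and the saved esp itself is returned.

As an aside, the three pairs {ss,esp}{0,1,2} in the TSS exist for locating stacks. x86 has four privilege levels, 0 to 3. When control moves from a less privileged level up to level X, the current context has to be saved, and it is saved on the stack given by ssX:espX; for example, on a transition from ring 3 to ring 0 the CPU automatically saves the state on the stack given by ss0:esp0. Why is there no ss3/esp3? Because nothing is ever raised into ring 3, so no stack switch into it is needed.

The macro also handles the case where the previous task was at the first instruction of ia32_sysenter_target. According to the comment, esp then points to tss->esp1, but I have no idea where that comes from, so I'm leaving it aside for now. Todo.

In any case, all we need to know is that the macro returns a kernel stack!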
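The decision the macro makes can be written out in C. This is only my reading of the jumps, under the assumption that local label 1 marks the esp0 load and label 2 the end of the macro; pick_kernel_stack and its parameters are hypothetical names, and TASK_SIZE is assumed to be the default 0xC0000000.

```c
#include <assert.h>
#include <stdint.h>

#define KERNEL_CS 0x60u        /* GDT_ENTRY_KERNEL_CS (12) * 8 */
#define TASK_SIZE 0xC0000000u  /* assumed default CONFIG_PAGE_OFFSET */

static_assert(KERNEL_CS == 12 * 8, "selector = GDT index * 8, RPL 0");

/* Hypothetical C rendering of kml_get_kernel_stack.
 * prev_* are the values saved in the previous task's TSS. */
uint32_t pick_kernel_stack(uint16_t prev_cs, uint32_t prev_esp,
                           uint32_t prev_eip, uint32_t esp0,
                           uint32_t sysenter_entry)
{
	if (prev_cs != KERNEL_CS)        /* jne 1f: ordinary user mode */
		return esp0;
	if (prev_esp <= TASK_SIZE)       /* jbe 1f: ring 0 on a user stack */
		return esp0;
	if (prev_eip == sysenter_entry)  /* first insn of ia32_sysenter_target */
		return esp0;
	return prev_esp;                 /* jne 2f: already on a kernel stack */
}
```

Only the last case reuses the interrupted stack pointer; every other path falls back to the per-CPU esp0.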

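Back in double_fault_task, the value produced by the macro is stored in esp, so from here on we are running on this CPU's kernel stack!

The next step invokes the kml_recreate_kernel_stack_layout macro, again passing in the previous TSS: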

entry_32.S:kml_recreate_kernel_stack_layout

asm

.macro kml_recreate_kernel_stack_layout pre_tss
	cmpw $__KERNEL_CS, TSS_CS(\pre_tss)
	jne 1f

	movl TSS_ESP(\pre_tss), %eax
	cmpl $TASK_SIZE, %eax
	ja 2f
1:
	pushl TSS_SS(\pre_tss)
	pushl TSS_ESP(\pre_tss)
2:
	pushl TSS_EFLAGS(\pre_tss)
	pushl TSS_CS(\pre_tss)
	pushl TSS_EIP(\pre_tss)
.endm

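The macro jumps straight to label 2 only when the previous TSS was using the kernel CS and its esp was above TASK_SIZE, that is, when it was already on a kernel stack. In every other case (ordinary user mode, or a KML process in ring 0 on its user stack) it first pushes the two stack-related registers ss and esp, and then runs the code at label 2, which pushes flags, cs and eip.

In other words, it rebuilds by hand the frame the CPU would have pushed for a normal interrupt: SS, ESP, EFLAGS, CS, EIP when a stack switch is involved, and only EFLAGS, CS, EIP when the interrupted code was already on the kernel stack.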
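The same branch structure, modelled in C. This is a sketch for illustration, not KML code: struct prev_ctx and recreate_frame are hypothetical names, TASK_SIZE is assumed to be the default 0xC0000000, and the array simply records the words in push order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define KERNEL_CS 0x60u        /* GDT_ENTRY_KERNEL_CS (12) * 8 */
#define TASK_SIZE 0xC0000000u  /* assumed default CONFIG_PAGE_OFFSET */

/* The five values read from the previous task's TSS. */
struct prev_ctx { uint32_t ss, esp, eflags, cs, eip; };

/* Fills frame[] in push order and returns the number of words pushed:
 * 5 when SS:ESP must be saved, 3 when the previous context was
 * already running on a kernel stack. */
int recreate_frame(const struct prev_ctx *p, uint32_t frame[5])
{
	int n = 0;

	assert(p != NULL && frame != NULL);
	/* cmpw looks only at the low 16 bits of the saved CS. */
	if ((p->cs & 0xffffu) != KERNEL_CS || p->esp <= TASK_SIZE) {
		frame[n++] = p->ss;      /* label 1 */
		frame[n++] = p->esp;
	}
	frame[n++] = p->eflags;          /* label 2 */
	frame[n++] = p->cs;
	frame[n++] = p->eip;
	return n;
}
```

A KML process interrupted on its user stack therefore gets the five-word frame even though its CS is the kernel CS, which is exactly the case the hardware could not handle on its own.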

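Next comes call_helper, which performs a function call for us:

entry_32.S:call_helper

call_helper is a helper macro for passing arguments. Besides the function it takes three arguments, and it additionally pushes esp, which ends up as the last parameter. The callee reads all four arguments from the stack, and after it returns the caller drops them again by adding 4*4 = 16 to esp. This matches the asmlinkage annotation on the prepare_*_handler functions in task.c:

So the +16 is just the caller cleaning up the four argument words; the frame rebuilt on the kernel stack stays in place.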

asm

.macro call_helper func target_address cur_tss pre_tss
	pushl %esp
	pushl \pre_tss
	pushl \cur_tss
	pushl \target_address
	call \func
	addl $16, %esp
.endm

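double_fault_task uses it to call prepare_fault_handler; prepare_nmi_handler, the NMI counterpart, is a wrapper around the same function:

task.c: prepare_*_handler

A larger chunk is quoted here because the pieces are closely related.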

struct df_stk {
	unsigned long ip;
	unsigned long cs;
	unsigned long flags;
};

struct nmi_stk {
	unsigned long gs;
	unsigned long fs;
	struct df_stk stk;
};

asmlinkage void prepare_fault_handler(unsigned long target_ip,
      struct tss_struct* cur, struct tss_struct* pre, struct df_stk* stk)
{
	unsigned int cpu = smp_processor_id();

	clear_busy_flag_in_tss_descriptor(cpu);

	stk->cs &= 0x0000ffff;

	if (pre->x86_tss.cs == __KERNEL_CS && pre->x86_tss.sp <= TASK_SIZE) {
		stk->cs = __KU_CS_EXCEPTION;
	}

	pre->x86_tss.ip = target_ip;
	pre->x86_tss.cs = __KERNEL_CS;
	pre->x86_tss.flags &= (~(X86_EFLAGS_TF | X86_EFLAGS_IF));

	pre->x86_tss.sp = (unsigned long)stk;
	pre->x86_tss.ss = __KERNEL_DS;

	return;
}

asmlinkage void prepare_nmi_handler(unsigned long target_ip,
    struct tss_struct* cur, struct tss_struct* pre, struct nmi_stk* stk)
{
	prepare_fault_handler(target_ip, cur, pre, &stk->stk);

	/*
	 * NOTE: it is unnecessary to set cs to __KU_CS_INTERRUPT
	 * because the layout of the prepared kernel stack (in entry.S) is
	 * for exceptions, not interrupts.
	 */

	stk->fs = pre->x86_tss.fs;
	stk->gs = pre->x86_tss.gs;

	pre->x86_tss.fs = 0;
	pre->x86_tss.gs = 0;
	pre->x86_tss.ldt = 0;

	pre->x86_tss.sp = (unsigned long)stk;

	/*
	 * Skip the first instruction of ia32_sysenter_target because
	 * it assumes that %esp points to tss->esp1
	 * and just loads the correct kernel stack to %esp.
	 */
	if (stk->stk.ip == (unsigned long)ia32_sysenter_target) {
		stk->stk.ip = (unsigned long)sysenter_past_esp;
	}

	return;
}

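The asmlinkage annotation on both functions tells the compiler that all arguments must be taken from the stack.

prepare_nmi_handler calls prepare_fault_handler, whose first step is to clear a particular flag in this CPU's TSS descriptor through clear_busy_flag_in_tss_descriptor: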

desc.h:clear_busy_flag_in_tss_descriptor

/*
 * FIXME: Accessing the desc_struct through its fields is more elegant,
 * and should be the one valid thing to do. However, a lot of open code
 * still touches the a and b accessors, and doing this allow us to do it
 * incrementally. We keep the signature as a struct, rather than an union,
 * so we can get rid of it transparently in the future -- glommer
 */
/* 8 byte segment descriptor */
struct desc_struct {
	union {
		struct {
			unsigned int a;
			unsigned int b;
		};
		struct {
			u16 limit0;
			u16 base0;
			unsigned base1: 8, type: 4, s: 1, dpl: 2, p: 1;
			unsigned limit: 4, avl: 1, l: 1, d: 1, g: 1, base2: 8;
		};
	};
} __attribute__((packed));

struct gdt_page {
	struct desc_struct gdt[GDT_ENTRIES];
} __attribute__((aligned(PAGE_SIZE)));


static inline struct desc_struct *get_cpu_gdt_table(unsigned int cpu)
{
	return per_cpu(gdt_page, cpu).gdt;
}

static inline void clear_busy_flag_in_tss_descriptor(unsigned int cpu)
{
	get_cpu_gdt_table(cpu)[GDT_ENTRY_TSS].b &= (~0x00000200);
}

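In essence it locates the TSS descriptor in the current CPU's gdt_page and clears one bit in its upper 32 bits. This is the B (busy) bit of the type field, bit 41 of the descriptor, which records whether the TSS is currently in use.

For the meaning of the fields, see <ULK: Memory Addressing, Segment Descriptors - rqdmap | blog>

Back in task.c, the next step modifies the structure that stk points to. This is baffling at first, but it all falls into place after reading the assembly: esp is the stack pointer, and remember what the kml_recreate_kernel_stack_layout macro pushed? Those words (eip, cs, eflags, from the top of the stack down) are exactly df_stk in task.c, and nmi_stk embeds the same frame after gs and fs!

Then look at what is done to it: cs &= 0x0000ffff. No need to be puzzled: a segment register holds a segment selector, and a selector is only a 16-bit value, so this simply filters out the garbage in the upper half of the 32-bit slot. See: <ULK: Memory Addressing - rqdmap | blog>

Several fields are then updated. In flags, the X86_EFLAGS_TF and X86_EFLAGS_IF bits are cleared; TF (Trap Flag) is used for single-stepping and IF (Interrupt Flag) enables maskable interrupts.

The ip field receives target_ip, which here is double_fault_fixup, also defined in assembly: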

entry_32.S:double_fault_fixup

asm

ENTRY(double_fault_fixup)
	pushl %eax
	pushl %edx
	pushl %ecx

	movl %cr2, %eax
	pushl %eax

	call do_interrupt_handling

	popl %eax
	movl %eax, %cr2

	popl %ecx
	popl %edx
	popl %eax

	pushl $PAGE_FAULT_ERROR_CODE
	pushl $do_page_fault
	jmp error_code

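cr2, used here, is one of the 32-bit control registers (CR); the four described below are CR0 to CR3. They hold global machine state that is independent of any particular task.

Some additional notes on the control registers:

CR0 contains six predefined flags. Bit 0 is PE (Protection Enable), which turns on protected mode: with PE = 1 the processor runs in protected mode, with PE = 0 in real mode. Bit 1 is MP (Monitor Coprocessor); together with bit 3 it determines whether the WAIT opcode raises a "coprocessor not available" fault when TS = 1. Bit 3 is TS (Task Switched), which is set to 1 automatically after each task switch; while TS = 1 the coprocessor cannot be used. Bit 2 is EM (Emulate Coprocessor): with EM = 1 the coprocessor cannot be used, with EM = 0 it can. Bit 4 is ET (Processor Extension Type), which records the type of coprocessor present: ET = 0 means a 287, ET = 1 means a 387 floating-point coprocessor. Bit 31 is PG (Paging Enable), which controls whether the on-chip paging unit is active.

CR1 is an undefined control register, reserved for future processors.

CR2 is the page-fault linear address register; it holds the full 32-bit linear address of the most recent page fault.

CR3 is the page directory base register and holds the physical address of the page directory. The page directory is always aligned to a 4 KB boundary, so the low 12 bits of its address are always 0 and have no effect; anything written there is ignored.

Quoted from <x86 control registers CR0, CR1, CR2, CR3 - ahuo - cnblogs>

The code pushes this most recent page-fault linear address onto the stack and calls do_interrupt_handling: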

task.c:do_interrupt_handling

static int NMI_is_set(void) {
	unsigned int cpu = smp_processor_id();

	if (per_cpu(nmi_stacks, cpu).need_nmi) {
		per_cpu(nmi_stacks, cpu).need_nmi = 0;
		return 1;
	}

	return 0;
}

void (*test_ISR_and_handle_interrupt)(void);

asmlinkage void do_interrupt_handling(void)
{
	if (NMI_is_set()) {
		__asm__ __volatile__ (
		"pushfl\n\t"
		"pushl %0\n\t"
		"pushl $0f\n\t"
		"jmp nmi\n\t"
		"0:\n\t"
		: : "i" (__KERNEL_CS)
		);
	}

	test_ISR_and_handle_interrupt();
}

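Todo: leaving the NMI part aside for now..

If an NMI is pending, it pushes EFLAGS, __KERNEL_CS and the address of the local label 0: (that is what $0f means, not address zero), which together form an interrupt return frame, and then jumps to nmi. nmi is the kernel's own assembly handler, so I won't read through it here; once it returns through that frame, execution resumes at label 0.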

asm

/*
 * Initial stack layout (nmi_stack_struct)
 *
 *          [ unused entry ] <-- used if NMI occurs in DF
 * %esp --> pointer to nmi_tss
 *          pointer to normal_tss
 *          pointer to the descriptor of doublefault_tss
 *          need_nmi flag
 */
ENTRY(nmi_task)
	/* Check whether if we were in the double fault task or not. */
	movl (%esp), %edi		# get current TSS.
/* %edi = current_tss */
	/* Load the previous tss selector to %ax */
	movw (%edi), %ax
	cmpw $__DOUBLEFAULT_TSS, %ax
	jne 1f

	/* We were in the double fault task. */
	/*
	 * Do not handle this NMI,
	 * and notify the double fault task.
	 */

	/* clear busy flag in DFT tss descriptor */
	movl 8(%esp), %edx
	movl (%edx), %eax
	andl $~0x00000200, %eax
	movl %eax, (%edx)

	movl $1, 12(%esp)		# need_nmi = 1

	ret_from_task_without_iret %edi, GDT_ENTRY_DOUBLEFAULT_TSS

	jmp nmi_task

entry_32.S:ret_from_task_without_iret

asm

.macro ret_from_task_without_iret cur_tss tss_desc
	/* clear NT in EFLAGS */
	pushfl
	andl $~X86_EFLAGS_NT, (%esp)
	popfl

	movl TSS_ESP0(\cur_tss), %esp

	/* We don't use iret, because it will enable NMI */
	ljmp $(\tss_desc*8), $0x0
.endm

Hmm.

清空tss的特定标志位clear_busy_flag_in_tss_descriptor:

/*
 * FIXME: Accessing the desc_struct through its fields is more elegant,
 * and should be the one valid thing to do. However, a lot of open code
 * still touches the a and b accessors, and doing this allow us to do it
 * incrementally. We keep the signature as a struct, rather than an union,
 * so we can get rid of it transparently in the future -- glommer
 */
/* 8 byte segment descriptor */
struct desc_struct {
	union {
		struct {
			unsigned int a;
			unsigned int b;
		};
		struct {
			u16 limit0;
			u16 base0;
			unsigned base1: 8, type: 4, s: 1, dpl: 2, p: 1;
			unsigned limit: 4, avl: 1, l: 1, d: 1, g: 1, base2: 8;
		};
	};
} __attribute__((packed));

struct gdt_page {
	struct desc_struct gdt[GDT_ENTRIES];
} __attribute__((aligned(PAGE_SIZE)));


static inline struct desc_struct *get_cpu_gdt_table(unsigned int cpu)
{
	return per_cpu(gdt_page, cpu).gdt;
}

static inline void clear_busy_flag_in_tss_descriptor(unsigned int cpu)
{
	get_cpu_gdt_table(cpu)[GDT_ENTRY_TSS].b &= (~0x00000200);
}

For the meaning of the fields, see <ULK 内存寻址段描述符 - rqdmap | blog>

Next, how are the members of a C struct laid out in memory?

// g++ main.c && cat main.c && ./a.out
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <linux/types.h>

struct desc_struct {
	union {
		struct {
			unsigned int a;
			unsigned int b;
		};
		struct {
			__u16 limit0;
			__u16 base0;
			unsigned base1: 8, type: 4, s: 1, dpl: 2, p: 1;
			unsigned limit: 4, avl: 1, l: 1, d: 1, g: 1, base2: 8;
		};
	};
}global_desc;

// %x expects an unsigned int, so print only the low 32 bits of each address;
// that is enough to compare the offsets of a and b.
static unsigned low32(const void *p){
	return (unsigned)(uintptr_t)p;
}

int main(){
	global_desc.a = 0x12345678;
	global_desc.b = 0x9abcdef0;
	printf("Global variable's address: %x %x %x\n", low32(&global_desc), low32(&global_desc.a), low32(&global_desc.b));

	// Despite the label printed below, this object is allocated on the heap.
	struct desc_struct * local_desc = (struct desc_struct *)malloc(sizeof(struct desc_struct));
	if (local_desc == NULL)
		return 1;

	printf("Local variable's address: %x %x %x\n", low32(local_desc), low32(&local_desc->a), low32(&local_desc->b));

	free(local_desc);
	return 0;
}
// Sample output (addresses vary from run to run):
// Global variable's address: 15c64030 15c64030 15c64034
// Local variable's address: 162a82c0 162a82c0 162a82c4

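The output shows that, whether the struct is a global or is allocated on the heap, a pointer to the struct holds its lowest address, and the member addresses increase in declaration order (a at offset 0, b at offset 4).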

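Combining this layout with little-endian storage, clear_busy_flag_in_tss_descriptor clears bit 9 of b (mask 0x00000200), that is, bit 32+9 = 41 of the descriptor counting from the low end (the 32 comes from accessing b rather than a). That bit is bit 1 of the 4-bit type field, which for a TSS descriptor is the busy flag: clearing it changes the type from 0b1011 (busy 32-bit TSS) to 0b1001 (available 32-bit TSS). It is not the S field, which sits at bit 12 of b and is already 0 because the TSS is a system segment.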

Where do cur and pre in prepare_fault_handler come from? They are passed in directly by prepare_nmi_handler, and ultimately by the assembly code:

#ifdef CONFIG_KERNEL_MODE_LINUX

/*
 * Initial stack layout (nmi_stack_struct)
 *
 *          [ unused entry ] <-- used if NMI occurs in DF
 * %esp --> pointer to nmi_tss
 *          pointer to normal_tss
 *          pointer to the descriptor of doublefault_tss
 *          need_nmi flag
 */
ENTRY(nmi_task)
	/* Check whether if we were in the double fault task or not. */
	movl (%esp), %edi		# get current TSS.
/* %edi = current_tss */
	/* Load the previous tss selector to %ax */
	movw (%edi), %ax
	cmpw $__DOUBLEFAULT_TSS, %ax
	jne 1f

	/* We were in the double fault task. */
	/*
	 * Do not handle this NMI,
	 * and notify the double fault task.
	 */

	/* clear busy flag in DFT tss descriptor */
	movl 8(%esp), %edx
	movl (%edx), %eax
	andl $~0x00000200, %eax
	movl %eax, (%edx)

	movl $1, 12(%esp)		# need_nmi = 1

	ret_from_task_without_iret %edi, GDT_ENTRY_DOUBLEFAULT_TSS

	jmp nmi_task
1:
	/* We were in the normal task. */

	movl 4(%esp), %ebx		# get normal TSS.
/* %ebx = prev_tss */

	# get kernel stack.
	kml_get_kernel_stack %ebx, %esi

	movl %esi, %esp
/* From now on, we can use stack. */

	# recreate stack layout as if normal interrupt occurs.
	kml_recreate_kernel_stack_layout %ebx

	# make room for %fs and %gs
	addl $-8, %esp

	call_helper prepare_nmi_handler, $nmi_fixup, %edi, %ebx

	ret_from_task_without_iret %edi, GDT_ENTRY_TSS

	jmp nmi_task

Who calls prepare_fault_handler?

…

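What does the whole call chain look like?

Initialization in cpu_init:

cpu_init

For interrupt handling, review or extend the blog post <ULK 中断与异常 - rqdmap | blog> …

Pay particular attention to the task gate for exception 8, the double fault. When this exception is raised, the kernel has performed a seriously illegal operation and the value in esp can no longer be trusted. The task gate therefore points to a dedicated TSS in the GDT, and esp and eip are loaded from that segment. The end result is that the processor runs the doublefault_fn handler on its own private stack.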

Stack switching

The second problem is a manual stack switching problem. In the original Linux kernel, an IA-32 CPU switches a stack from a user stack to a kernel stack on exceptions or interrupts. However, in Kernel Mode Linux, a user program may be executed in kernel mode and the CPU may not switch a stack. Therefore, in current Kernel Mode Linux implementation, the kernel switches a stack manually on exceptions and interrupts. To switch a stack, a kernel need to know a location of a kernel stack in an address space. However, on exceptions and interrupts, the kernel cannot use general registers (EAX, EBX, and so on). Therefore, it is very difficult to get the location of the kernel stack.

To solve the above problem, the current Kernel Mode Linux implementation exploits a per CPU GDT. In Kernel Mode Linux, one segment descriptor of the per CPU GDT entries directly points to the location of the per-CPU TSS (Task State Segment). Thus, by using the segment descriptor, the address of the kernel stack can be available with only one general register.

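On a conventional IA-32 setup the CPU switches from the user stack to the kernel stack when an exception or interrupt occurs, but under KML the CPU does not switch stacks for a program that is already running in kernel mode. KML therefore has to switch to the kernel stack manually when it handles exceptions and interrupts. The current KML uses the per-CPU GDT for this: one segment descriptor in each per-CPU GDT points at that CPU's TSS, and through that descriptor the kernel stack address can be obtained with only one general-purpose register.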

For more on the GDT, see ULK 内存寻址 - rqdmap | blog.

Interrupt lost

The third problem is an interrupt-lost problem on double fault exceptions. Let’s assume that a user program is executed in kernel mode, and its ESP register points to a portion of memory space that has not been mapped to its address space yet. What will happen if an external interrupt is raised just in time? First, a CPU acks the request for the interrupt from an external interrupt controller. Then, the CPU tries to interrupt its execution of the user program. However, it can’t because there is no stack to save the part of the execution context (see above “a stack starvation problem”). Then, the CPU tries to generate a double fault exception and it succeeds because the Kernel Mode Linux implementation handles the double fault by the IA-32 task. The problem is that the double fault exception handler knows only the suspended user program and it cannot know the request for the interrupt because the CPU doesn’t tell nothing about it. Therefore, the double fault handler directly resumes the user program and doesn’t handle the interrupt, that is, the same kind of interrupts never be generated because the interrupt controller thinks that the previous interrupt has not been serviced by the CPU.

To solve the interrupt-lost problem, the current Kernel Mode Linux implementation asks the interrupt controller for untreated interrupts and handles them at the end of the double fault exception handler. Asking the interrupt controller is a costly operation. However, the cost is negligible because double fault exceptions, that is, page faults on memory stacks are not so often.

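Consider a user program running in kernel mode whose esp register points at memory that has not yet been mapped into its address space. If an external interrupt arrives at that moment, the CPU first acks the interrupt request and then tries to interrupt the running user program. Because there is no usable stack on which to save the execution context, the CPU raises a double fault exception, which KML handles through the IA-32 task mechanism. The problem is that the double fault handler only knows about the suspended user program and nothing about the interrupt that arrived, so it resumes the user program without servicing the request. The interrupt controller, in turn, believes the previous interrupt has not been serviced yet and never sends that kind of interrupt again, so the interrupt is lost.

The current KML solves this by querying the interrupt controller for unhandled interrupts at the end of the double fault handler and servicing them there. The query is expensive, but double faults (that is, page faults on the stack) are rare, so the cost is tolerable.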

Non-maskable interrupt

The reason for handling non-maskable interrupts by the IA-32 tasks is closely related to the manual stack switching problem and the interrupt-lost problem. If an non-maskable interrupt occurs between when a maskable interrupt occurs and when a memory stack is switched from a user stack to a kernel stack, and the non-maskable interrupt causes a page fault on the memory stack, then the double fault exception handler handles the maskable interrupt because it has not been handled. The problem is that the double fault handler returns to the suspended interrupt handling routine and the routine tries to handle the already-handled maskable interrupt again.

The above problem can be avoided by handling non-maskable interrupts with the IA-32 tasks, because no double fault exceptions are generated. Usually, non-maskable interrupts are very rare, so the cost of the IA-32 task mechanisms doesn’t really matter. However, if an NMI watchdog is enabled for debugging purpose, performance degradation may be observed.

One problem for handling non-maskable interrupts by the IA-32 task mechanism is a descriptor-tables inconsistency problem. When the IA-32 tasks are switched back and forth, all segment registers (CS, DS, ES, SS, FS, GS) and the local descriptor table register (LDTR) are reloaded (unlike the usual IA-32 trap/interrupt mechanism). Therefore, to switch the IA-32 task, the global descriptor table and the local descriptor table should be consistent, otherwise, the invalid TSS exception is raised and it is too complex to recover from the exception. The problem is that the consistency cannot be guaranteed because non-maskable interrupts are raised anytime and anywhere, that is, when updating the global descriptor table or the local descriptor table.

To solve the above problem, the current Kernel Mode Linux implementation inserts instructions for saving and restoring FS, GS, and/or LDTR around the portion that manipulate the descriptor tables, if needed (CS, DS, ES are used exclusively by the kernel at that point, so there are no problems). Then, the non-maskable interrupt handler checks whether if FS, GS, and LDTR can be reloaded without problems, at the end of itself. If a problem is found, it reloads FS, GS, and/or LDTR with ‘0’ (reloading FS, GS, and/or LDTR with ‘0’ always succeeds). The reason why the above solution works is as follows. First, if a problem is found at reloading FS, GS, and/or LDTR, that means that a non-maskable interrupt occurs when modifying the descriptor tables. However, FS, GS, and/or LDTR are properly reloaded after the modification by the above mentioned instructions for restoring them. Therefore, just reloading FS, GS, and/or LDTR with ‘0’ works because they will be reloaded soon after. Inserting the instructions may affect performance. Fortunately, however, FS, GS, and/or LDTR are usually reloaded after modifying the descriptor tables, so there are few points at that the instructions should be inserted.

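The reason for handling non-maskable interrupts with IA-32 tasks is closely related to the stack switching and interrupt-lost problems: if a non-maskable interrupt occurs after a maskable interrupt has arrived but before the stack has been switched from the user stack to the kernel stack, and it causes a page fault on the stack,…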

Slide

Slide deck.. SCX技术研究

It is riddled with outright errors and my own misunderstandings, but I'm including it here anyway.