Windows Kernel Internals II
Virtual Machine Architecture
University of Tokyo – July 2004
Dave Probert, Ph.D.
Advanced Operating Systems Group
Windows Core Operating Systems Division
Microsoft Corporation
© Microsoft Corporation 2004 1
Hosted VM Model
Windows acts as a “host”
– Resources for each VM are allocated from the host
– All I/O with external devices is performed through the host
“Guest” code runs within a separate context
– Independent address space
– Specialized “VMM” kernel
[Diagram: host context (processes, host kernel) and guest context (virtual machine, guest code, VMM kernel), both on the host physical machine]
VM Components
VMM Kernel
– Thin layer, all in assembly
– Code executed at ring-0
– Exception handling
– External interrupt pass-through
– Page table maintenance
– Located within a 32MB area of address space known as the “VMM work area”
– Work area is relocatable
– One VMM instance per virtual processor
[Diagram: host context (Virtual PC, Virtual Server, host kernel, NDIS driver, VMM driver) and guest context (guest code, Virtual Machine “Additions”, VMM kernel), both on the host physical machine]
VM Components
VMM Driver
- Provides kernel-level VM-related services
  - CreateVirtualMachine
  - CreateVirtualProcessor
  - ExecuteVirtualProcessor
- Implements context switching mechanism between the host and guest contexts
- Loads and bootstraps the VMM kernel
- Much of the security work we’ve done recently involved repackaging the VMM kernel code into the VMM driver
[Diagram: VM component overview]
VM Components
NDIS Filter Driver
- Allows VM to send and receive Ethernet packets via physical Ethernet hardware
- Spoofs unique MAC addresses for virtual NICs
- Injects packets into host Ethernet stack for guest-to-host networking
[Diagram: VM component overview]
VM Components
Virtual PC / Virtual Server executables
- Device emulation modules
- Resource allocation
- VM configuration creation & editing
- VM control (start, stop, pause, save)
- Scripting APIs
- User interaction
- Host side of guest/host integration features
[Diagram: VM component overview]
VM Components
Virtual Machine “Additions”
– Collection of components installed within the guest environment by the user
– Implement optimizations
• Video
• SCSI
• Networking (in the future)
• Guest kernel patches
– Implement guest half of guest/host integration features
• Clipboard sharing
• File drag and drop
• Arbitrary video resizing
[Diagram: VM component overview]
VM Execution Loop
Host code repeatedly calls ExecuteVirtualProcessor
VMM acts as “co-routine” (i.e. VMM state is saved and restored each time ExecuteVirtualProcessor is called)
Cycles spent inside guest context are counted against the calling thread
– Host code can control how much time is spent in guest
Return code indicates why ExecuteVirtualProcessor returned
– Time slice complete
– IN or OUT instruction encountered
– HLT instruction encountered
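The loop described on this slide can be sketched roughly as follows. This is a toy Python simulation, not the Virtual PC code: `ExitReason`, `FakeVirtualProcessor` and the scripted exit sequence are hypothetical names introduced for illustration. Only the three return reasons come from the slide, and the sketch simply stops at HLT.

```python
from enum import Enum, auto


class ExitReason(Enum):
    """Why the (simulated) ExecuteVirtualProcessor call returned."""
    TIME_SLICE_COMPLETE = auto()
    IO_INSTRUCTION = auto()   # IN or OUT encountered
    HLT = auto()


class FakeVirtualProcessor:
    """Hypothetical stand-in for a virtual processor; replays scripted exits."""

    def __init__(self, scripted_exits):
        self._exits = iter(scripted_exits)

    def execute(self):
        # A real VMM would switch to the guest context here and resume
        # its saved state (the "co-routine" behaviour).
        return next(self._exits)


def run_until_halt(vp):
    """Host-side loop: call execute() repeatedly and react to each exit."""
    log = []
    while True:
        reason = vp.execute()
        if reason is ExitReason.TIME_SLICE_COMPLETE:
            log.append("yield to host scheduler")
        elif reason is ExitReason.IO_INSTRUCTION:
            log.append("run device emulation")
        elif reason is ExitReason.HLT:
            log.append("guest idle")
            return log


exits = [ExitReason.TIME_SLICE_COMPLETE, ExitReason.IO_INSTRUCTION, ExitReason.HLT]
trace = run_until_halt(FakeVirtualProcessor(exits))
```

Because the host thread decides when to call `execute()` again, it also decides how much CPU time the guest receives.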
Processor Virtualization
x86 Virtualization
– Processor is non-virtualizable
• Poor privileged and user state separation
– For example, EFLAGS register contains condition codes (user state) and interrupt mask (privileged state)
• Some instructions that access privileged state are non-trapping
– Overly complex and messy architecture
• Many modes, legacy protection mechanisms and general “warts”
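To make the EFLAGS point concrete, here is a simplified Python model of how POPF treats the privileged bits in protected mode. It is a sketch only: the function name is made up for illustration, and reserved bits, VM, RF and the virtual-interrupt flags are ignored. The bit positions (ZF bit 6, IF bit 9, IOPL bits 12-13) are the architectural ones.

```python
IF_MASK = 1 << 9          # interrupt-enable flag (privileged state)
IOPL_MASK = 0b11 << 12    # I/O privilege level field (privileged state)
ZF_MASK = 1 << 6          # zero flag (ordinary user state)


def popf_protected_mode(eflags, popped, cpl):
    """Return the new EFLAGS after POPF at privilege level `cpl` (simplified)."""
    iopl = (eflags & IOPL_MASK) >> 12
    preserved = 0
    if cpl > 0:
        preserved |= IOPL_MASK      # only ring 0 may change IOPL
    if cpl > iopl:
        preserved |= IF_MASK        # the IF change is silently dropped, no trap
    return (popped & ~preserved) | (eflags & preserved)


# Code at ring 1 with IOPL = 0 tries to clear IF while setting ZF.
result = popf_protected_mode(eflags=IF_MASK, popped=ZF_MASK, cpl=1)
```

The condition code is updated but IF keeps its old value, and no exception is raised that a VMM could intercept.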
Processor Emulation
In general, emulation is necessary
– VM uses a binary translation mechanism
• Most instructions are copied directly
• Non-virtualizable (“dangerous”) instructions are modified
– Binary translation execution imposes ~50% performance overhead
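The copy-or-modify idea can be illustrated with a toy translator that works on mnemonics rather than machine code. The `DANGEROUS` set and the `emulate_*` helper names are assumptions made for this sketch. A real translator also has to rewrite control flow such as `call [eax]` and `ret`, which is left out here.

```python
# Assumed subset of "dangerous" instructions, for illustration only.
DANGEROUS = {"pushfd", "popfd", "cli", "sti"}


def translate_block(instructions):
    """Copy safe instructions unchanged; rewrite dangerous ones as helper calls."""
    translated = []
    for insn in instructions:
        mnemonic = insn.split()[0]
        if mnemonic in DANGEROUS:
            translated.append(f"call emulate_{mnemonic}")  # hypothetical helper
        else:
            translated.append(insn)
    return translated


block = ["pushfd", "cli", "mov eax,[ebp+8]", "call [eax]", "popfd", "ret"]
out = translate_block(block)
```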
Direct Execution
In some processor modes it’s safe to use direct execution; others require emulation
Processor mode             Execution strategy
Real Mode                  Emulation
Virtual 8086 (v86) mode    Direct Execution
Protected Mode Ring 3      Direct Execution (with a few exceptions)
Protected Mode Ring 0      Emulation, unless known to be safe
Direct Execution
“Ring Compression”
– Guest ring-0, 1, 2 code is executed at ring 1
– Guest ring-3 code is executed at ring 3
– Provides correct MMU protection semantics (since ring 0-2 can access privileged pages)
Direct execution of ring-0 code is only allowed if the VMM is notified that it’s “safe”
– This requires patching certain “dangerous” instruction sequences in the Windows kernel and HAL
– Patching is performed at runtime in memory only
– Patches are different for each version of Windows kernel & HAL
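The ring mapping and the reason it preserves MMU semantics can be written down in a few lines. The function names are illustrative only. The supervisor/user rule (rings 0-2 count as supervisor for paging, ring 3 as user) is the standard x86 behaviour that the slide relies on.

```python
def compressed_ring(guest_ring):
    """Map a guest privilege ring to the ring it actually runs at."""
    if guest_ring not in (0, 1, 2, 3):
        raise ValueError("x86 has rings 0-3")
    return 3 if guest_ring == 3 else 1


def may_access_supervisor_page(ring):
    """x86 paging treats rings 0-2 as supervisor and ring 3 as user."""
    return ring < 3


mapping = {g: compressed_ring(g) for g in range(4)}
# Paging permissions are the same before and after compression.
semantics_preserved = all(
    may_access_supervisor_page(g) == may_access_supervisor_page(compressed_ring(g))
    for g in range(4)
)
```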
Guest OS Patching
Examples:
– PUSHFD / POPFD
– CLI / STI
– Spin lock acquisition failure (in the future)
Original Code
pushfd              ; never traps (breaks IF virtualization)
cli                 ; traps, but cannot be easily patched with a jmp because it only takes up one byte
mov eax,[ebp+8]
call [eax]
popfd               ; never traps (breaks IF virtualization)
ret
This sequence prevents correct behavior in direct execution
Guest OS Patching
Synthetic instructions
– Use an illegal instruction form (reserved for us by Intel)
– Five bytes in length (for ease in patching)
– Exhibit the same side effects as the real instruction
Original Code        With Synthetic Instructions
pushfd               vmpushfd
cli                  vmcli
mov eax,[ebp+8]      mov eax,[ebp+8]
call [eax]           call [eax]
popfd                vmpopf
ret                  ret
All synthetic instructions trap and are five bytes long so they can be replaced with jmp or call instructions at runtime
This sequence allows correct behavior in direct execution, but generates three traps
Guest OS Patching
Runtime Guest OS Patching
– Replace synthetic instructions with subroutine calls
– This technique keeps internal VMM implementation details from being exposed to OS vendors, so we can change the subroutine implementations in the future.
Original Code        With Synthetic Instructions    With Runtime Patches
pushfd               vmpushfd                       call _vmpushfd
cli                  vmcli                          call _vmcli
mov eax,[ebp+8]      mov eax,[ebp+8]                mov eax,[ebp+8]
call [eax]           call [eax]                     call [eax]
popfd                vmpopf                         call _vmpopfd
ret                  ret                            ret
This patched sequence is correct and fast
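The five-byte length matters because an x86 near call with a 32-bit displacement (opcode E8 followed by rel32) is exactly five bytes, so it can overwrite a synthetic instruction in place. A minimal sketch of such a patch follows. The placeholder bytes, addresses and function name are hypothetical, and the actual encoding of the synthetic instructions is not given in the slides.

```python
import struct

CALL_REL32 = 0xE8      # opcode of the x86 near call with a 32-bit displacement
PATCH_LEN = 5          # 1 opcode byte + 4 displacement bytes


def patch_call(code, offset, base_addr, target_addr):
    """Overwrite the 5 bytes at `offset` with `call target_addr`."""
    next_insn = base_addr + offset + PATCH_LEN
    rel = target_addr - next_insn   # displacement is relative to the next instruction
    code[offset:offset + PATCH_LEN] = bytes([CALL_REL32]) + struct.pack("<i", rel)


# Hypothetical 5-byte placeholder standing in for a synthetic instruction, then ret.
code = bytearray(b"\xCC" * 5 + b"\xC3")
patch_call(code, offset=0, base_addr=0x1000, target_addr=0x2000)
```

The surrounding instructions keep their addresses because the replacement has the same length as what it overwrites.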
Direct Execution Overhead
Necessary to trap into the VMM kernel on some instructions
– IN & OUT for I/O device emulation
– STI & CLI for interrupt mask virtualization
– INT & IRET to catch ring transitions
– INVLPG and MOV to CR3 for page table virtualization
Traps are expensive – and getting worse
– ~500 cycles on Pentium III or AMD processors; ~2000 cycles on Pentium 4
– Runtime patching of some trapping instructions is possible
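A back-of-the-envelope calculation shows why the per-trap cost matters. The trap rate and clock speed below are hypothetical values chosen for illustration. Only the cycle counts (~500 and ~2000) come from the slide.

```python
def trap_overhead_fraction(traps_per_second, cycles_per_trap, cpu_hz):
    """Fraction of CPU time consumed by trapping into the VMM."""
    return traps_per_second * cycles_per_trap / cpu_hz


# Hypothetical workload: 100,000 traps per second on a 2 GHz processor.
p3_like = trap_overhead_fraction(100_000, 500, 2_000_000_000)    # ~500-cycle traps
p4_like = trap_overhead_fraction(100_000, 2000, 2_000_000_000)   # ~2000-cycle traps
```

Under these assumed numbers the same workload loses 2.5% of the CPU to traps at 500 cycles each and 10% at 2000 cycles each.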
Physical Memory & RAM
Virtualized RAM
– User decides how much RAM is associated with each virtual machine
Physical pages
– Allocated by VMM from host OS
– Currently allocated at the time the VM starts, but could be allocated on demand
– Host physical addresses don’t match guest physical addresses
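Since host and guest physical addresses differ, something has to translate between them. A minimal sketch, assuming 4KB pages and a simple per-VM list of backing pages, is shown below. `GuestRam` and the page numbers are invented for the example.

```python
PAGE_SIZE = 4096


class GuestRam:
    """Maps guest physical page numbers to host physical page numbers."""

    def __init__(self, host_pages):
        # host_pages[i] backs guest page i; host pages need not be contiguous.
        self._host_pages = list(host_pages)

    def to_host_physical(self, guest_phys_addr):
        page, offset = divmod(guest_phys_addr, PAGE_SIZE)
        if not 0 <= page < len(self._host_pages):
            raise ValueError("address outside the RAM configured for this VM")
        return self._host_pages[page] * PAGE_SIZE + offset


# Hypothetical allocation: three scattered host pages back 12KB of guest RAM.
ram = GuestRam([0x5A3, 0x120, 0x9F0])
host_addr = ram.to_host_physical(0x1010)   # guest page 1, offset 0x10
```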
Logical Page Mappings
Logical Memory
– Logical mappings defined by guest page tables (mostly)
– VMM finds 32MB unused area for the VMM code and data (the “VMM work area”).
– VMM monitors guest OS address space usage and relocates itself if necessary
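One way to picture the search for a work area is a scan over the guest's 4GB address space for a 32MB window that overlaps nothing the guest uses. The sketch below assumes a 4MB search granularity (the span of one non-PAE page directory entry) and a list of used ranges. Both are assumptions, since the slides do not say how the search is done.

```python
WORK_AREA_SIZE = 32 * 1024 * 1024
ALIGN = 4 * 1024 * 1024            # assumed granularity, not stated in the slides
ADDRESS_SPACE = 1 << 32


def find_work_area(used_ranges):
    """Return the lowest aligned base whose 32MB window overlaps no used range."""
    for base in range(0, ADDRESS_SPACE - WORK_AREA_SIZE + 1, ALIGN):
        end = base + WORK_AREA_SIZE
        if all(end <= start or base >= stop for start, stop in used_ranges):
            return base
    return None


# Hypothetical guest usage: the low 64MB plus the top 2GB.
used = [(0x00000000, 0x04000000), (0x80000000, 0x100000000)]
base = find_work_area(used)

# The guest later starts using part of that window, so the VMM must move.
used.append((0x04000000, 0x05000000))
relocated = find_work_area(used)
```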
VMM Page Tables
VMM maintains its own private page table
– Initially, only the VMM work area is mapped
– Guest pages are mapped on demand as they are accessed
– Guest pages are unmapped when guest flushes its TLB
– VMM work area is relocated as necessary
[Diagram, initial state: the VMM page tables (page directory and page table, referenced by the physical CR3) map the VMM work area into an area that the guest page tables (page directory and page table, referenced by the virtual CR3) leave unused]
[Diagram, after relocation: the VMM work area is mapped at a new location; the previous VMM location is now in use by the guest]
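The map-on-demand and unmap-on-flush behaviour can be modelled with a small class. This is a toy sketch at page-number granularity. `ShadowPageTable` and its methods are hypothetical names, and guest pages missing from the guest's own table (which would become guest page faults) are not handled.

```python
class ShadowPageTable:
    """Toy model of a VMM-private page table filled lazily from guest mappings."""

    def __init__(self, guest_page_table, work_area_pages):
        self.guest = guest_page_table            # guest virtual page -> guest physical page
        self.work_area = set(work_area_pages)    # pages reserved for the VMM itself
        self.mapped = {}                         # guest pages currently mapped by the VMM
        self.faults = 0

    def access(self, vpage):
        if vpage in self.work_area:
            raise PermissionError("guest touched the VMM work area; relocation needed")
        if vpage not in self.mapped:
            self.faults += 1                     # page fault taken in the VMM
            self.mapped[vpage] = self.guest[vpage]
        return self.mapped[vpage]

    def guest_flushed_tlb(self):
        self.mapped.clear()                      # the work area mapping is tracked separately


spt = ShadowPageTable({0x10: 0x200, 0x11: 0x201}, work_area_pages={0x300})
spt.access(0x10)
spt.access(0x10)          # already mapped, no new fault
spt.access(0x11)
faults_before_flush = spt.faults
spt.guest_flushed_tlb()
spt.access(0x10)          # must be faulted in again after the flush
```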
Memory Sharing
Memory allocated with VMM APIs can be used in three ways
– Mapped within the VMM work area
– As guest virtual RAM (mapped into the guest address space according to the guest page tables)
– Mapped within the host context (for emulated DMA operations)
Device Emulation
Device emulation modules
- Emulate behaviors of a real hardware device
- Register “callbacks” for I/O port accesses
- Can access virtualized “RAM” for emulated DMA operations
- Communicate among themselves (e.g. Ethernet module “plugs into” the PCI bus module and communicates with the PIC module to assert interrupts)
- May call host services to perform emulation
- Can be suspended, saved and restored
Device Emulation Models
440BX chipset with PIIX4
System BIOS (AMI)
PCI Bus
ISA Bus
Power Management
SM Bus
8259 PIC
PIT
DMA Controller
CMOS
RTC
Memory Controller
RAM & VRAM
COM (Serial) Ports
LPT (Parallel) Ports
IDE/ATAPI Controllers
SCSI Adapters (Adaptec 2940)
SVGA Video Adapter (S3 Trio64)
VESA BIOS
2D Graphics Accelerator
Hardware Cursor
Ethernet Adapters (DEC 21140)
SoundBlaster Sound Card
Keyboard
Mouse
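The callback registration for I/O ports might look something like the sketch below. `IoPortBus`, `ScratchRegister` and the `on_in`/`on_out` method names are invented for illustration and are not the Virtual PC interfaces. Returning 0xFF for unclaimed ports is an assumption modelled on typical PC behaviour.

```python
class IoPortBus:
    """Dispatches IN/OUT accesses to whichever module registered the port."""

    def __init__(self):
        self._handlers = {}

    def register(self, first_port, count, module):
        for port in range(first_port, first_port + count):
            if port in self._handlers:
                raise ValueError(f"port {port:#x} already claimed")
            self._handlers[port] = module

    def port_out(self, port, value):
        module = self._handlers.get(port)
        if module is not None:
            module.on_out(port, value)

    def port_in(self, port):
        module = self._handlers.get(port)
        # Assumption: unclaimed ports read as all ones.
        return module.on_in(port) if module is not None else 0xFF


class ScratchRegister:
    """Hypothetical one-register device used only to exercise the bus."""

    def __init__(self):
        self.value = 0

    def on_out(self, port, value):
        self.value = value & 0xFF

    def on_in(self, port):
        return self.value


bus = IoPortBus()
bus.register(0x80, 1, ScratchRegister())
bus.port_out(0x80, 0x1AB)      # only the low byte is kept by the device
readback = bus.port_in(0x80)
```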
Device I/O Accesses
I/O accesses (IN & OUT instructions)
- Trap into VMM kernel
- Force a context switch back to the host context where device emulation module is invoked
- “Fast I/O handlers” can be called from within the VMM context
- Some OUTs can be batched
MMIO accesses
- Caught in VMM’s page fault handler
- Very expensive
[Diagram: an OUT instruction in the guest HAL raises a GPF trap into the VMM kernel, which context-switches to the host context, where the VMM driver passes the access to the device emulation module in Virtual PC. Ring levels shown: host kernel, host HAL, VMM driver and VMM kernel at ring 0; guest kernel and guest HAL at ring 1; Virtual PC and guest user code at ring 3]
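Batching OUTs saves context switches because several writes travel to the host in one round trip. The sketch below illustrates the idea only. `OutBatcher`, its queue depth and the delivery callback are assumptions. The slides do not say which OUTs are safe to batch, and a real implementation would need to flush before anything that depends on the writes having taken effect.

```python
class OutBatcher:
    """Queues OUT writes in the VMM and delivers them in one host round trip."""

    def __init__(self, deliver, limit=16):
        self._deliver = deliver      # called once per batch with a list of (port, value)
        self._limit = limit          # assumed queue depth, chosen arbitrarily
        self._pending = []
        self.context_switches = 0

    def out(self, port, value):
        self._pending.append((port, value))
        if len(self._pending) >= self._limit:
            self.flush()

    def flush(self):
        if self._pending:
            self.context_switches += 1
            self._deliver(self._pending)
            self._pending = []


received = []
batcher = OutBatcher(received.extend, limit=4)
for i in range(10):
    batcher.out(0x3C8, i)
batcher.flush()
```

Ten writes reach the host side in three batches here, compared with ten separate context switches without batching.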
Discussion