The Exokernel Operating System Architecture
by
Dawson R. Engler
Submitted to the Department of Electrical Engineering and Computer Science
in partial fulfillment of the requirements for the degree of
Doctor of Philosophy in Computer Science and Engineering
at the
Massachusetts Institute of Technology
Author
Department of Electrical Engineering and Computer Science
May 18, 1998
Certified by
M. Frans Kaashoek
Associate Professor
Thesis Supervisor
Accepted by
Leonard A. Gould
Chairman, Departmental Committee on Graduate Students
The Exokernel Operating System Architecture
by
Dawson R. Engler
Abstract
On traditional operating systems only trusted software such as privileged servers or the kernel can manage resources.
This thesis proposes a new approach, the exokernel architecture, which makes resource management unprivileged but
safe by separating management from protection: an exokernel protects resources, while untrusted application-level
software manages them. As a result, in an exokernel system, untrusted software (e.g., library operating systems) can
implement abstractions such as virtual memory, file systems, and networking.
The main thrusts of this thesis are: (1) how to build an exokernel system; (2) whether it is possible to build a real
one; and (3) whether doing so is a good idea. Our results, drawn from two exokernel systems [25, 48], show that the
approach yields dramatic benefits. For example, Xok, an exokernel, runs a web server an order of magnitude faster
than the closest equivalent on the same hardware, common unaltered Unix applications up to three times faster, and
improves global system performance up to a factor of five.
The thesis also discusses some of the unusual techniques we have used to remove the overhead of protection. The
most unusual technique, untrusted deterministic functions, enables an exokernel to verify that applications correctly
track the resources they own, eliminating the need for it to do so. Additionally, the thesis reflects on the subtle issues in
using downloaded code for extensibility and the sometimes painful lessons learned in building three exokernel-based
systems.
Acknowledgments
The exokernel project has been the work of many people. The basic principles of Chapter 2 and Aegis implementation
come from a paper [25] written jointly with Frans Kaashoek and James O'Toole (which descended from my master's
thesis, done under Kaashoek, with ideas initiated by [26, 27]).
In contrast to Aegis, Xok has been written largely by others. Dave Mazieres implemented the initial Xok kernel.
Thomas Pinckney, Russell Hunt, Greg Ganger, Frans Kaashoek, and Hector Briceno further developed Xok and made
ExOS into a real Unix system. Greg Ganger designed and implemented C-FFS [37] and the Cheetah webserver (based
in part on [49]), the two linchpins of most of our application performance numbers. Ganger and Kaashoek oversaw
countless modifications to the entire system. Eddie Kohler made an enormous contribution to our write-up of these
results. This work, described in [48], forms the basis for Chapter 5 and serves as a secondary source for Chapters 2-4.
Contents
1 Introduction
1.1 Relation to other OS structures
1.1.1 Recent extensible operating systems
1.2 The focusing questions of this thesis
1.2.1 How to build an exokernel?
1.2.2 Can you build a real exokernel system?
1.2.3 Are exokernels a good idea?
1.2.4 Summary
1.3 Concerns
1.4 Summary
4.3 XN: Problem and history
4.4 XN: Design and implementation
4.5 XN usage
4.6 Crash Recovery Issues
4.7 C-FFS: a library file system
4.8 Discussion
7 Conclusion
7.1 Possible Failures of the Architecture
7.2 Experience
7.2.1 Clear advantages
7.2.2 Costs
7.2.3 Lessons
7.3 Conclusion
8 XN's Interface
A Privileged system calls
A.1 XN initialization and shutdown
A.2 Reconstruction
B Public system calls
B.1 Creating types
B.2 Creating and deleting file system trees
B.3 Buffer cache operations
B.4 Metadata operations
B.5 Reading XN data structures
B.6 File system-independent navigation calls
9 Aegis' Interface
.7 CPU interface
.8 Environments
.9 Physical memory
.10 Interrupts
.11 Networking
.12 TLB manipulation
List of Figures
1-1 Possible exokernel system. Applications link against library operating systems (libOSes), which provide standard operating system abstractions (virtual memory, files, network protocols, etc.). Because libOSes are unprivileged, applications can also specialize them or write their own, as the web server in the picture has done. Because the exokernel provides protection, completely different libOSes can simultaneously run on the same system and safely share resources such as disk blocks and physical pages.
2-1 A simplified exokernel system with two applications, each linked with its own libOS and sharing pages through a buffer cache registry.
5-1 Performance of unmodified UNIX applications. Xok/ExOS and OpenBSD/C-FFS use a C-FFS file system while Free/OpenBSD use their native FFS file systems. Times are in seconds.
5-2 HTTP document throughput as a function of the document size for several HTTP/1.0 servers. NCSA/BSD represents the NCSA/1.4.2 server running on OpenBSD. Harvest/BSD represents the Harvest proxy cache running on OpenBSD. Socket/BSD represents our HTTP server using TCP sockets on OpenBSD. Socket/Xok represents our HTTP server using the TCP socket interface built on our extensible TCP/IP implementation on the Xok exokernel. Cheetah/Xok represents the Cheetah HTTP server, which exploits the TCP and file system implementations for speed.
5-3 Measured global performance of Xok/ExOS (the first bar) and FreeBSD (the second bar), using the first application pool. Times are in seconds and on a log scale. number/number refers to the total number of applications run by the script and the maximum number of jobs run concurrently. Total is the total running time of each experiment, Max is the longest runtime of any process in a given run (giving the worst latency), and Min is the minimum.
5-4 Measured global performance of Xok/ExOS (the first bar) and FreeBSD (the second bar), using the second application pool. Methodology and presentation are as described for Figure 5-3.
9-6 STLB structure
9-7 Assembly code used by Aegis to look up a mapping in the STLB (18 instructions).
9-8 Hardware-defined “low” portion of a TLB entry (i.e., the part bound to a virtual page number).
List of Tables
5.1 The I/O-intensive workload installs a large application (the lcc compiler). The size of the compressed archive file for lcc is 1.1 MByte.
Chapter 1
Introduction
And now that the legislators and the do-gooders have so futilely inflicted so many systems upon society,
may they end up where they should have begun: may they reject all systems, and try liberty... — Frédéric Bastiat
It is hard to let old beliefs go. They are familiar. We are comfortable with them and have spent years
building systems and developing habits that depend on them. Like a man who has worn eyeglasses so
long that he forgets he has them on, we forget that the world looks to us the way it does because we have
become used to seeing it that way through a particular set of lenses. Today, however, we need new lenses.
And we need to throw the old ones away. — Kenichi Ohmae
Traditional operating systems abstract and protect system resources. For example, they abstract physical memory in
terms of virtual memory, disk blocks in terms of files, and exceptions and the CPU in terms of processes. This organization
has three significant benefits. First, it provides a portable interface to the underlying machine; applications need not
care about the details of the underlying hardware. Second, it provides a large default functionality base, removing the
need for application programmers to write device drivers or other low-level operating system code. Finally, it provides
protection: because the operating system controls all application uses of resources, it can control application access
to them, preventing buggy or malicious applications from compromising the system. Empirically, the ability to have
multiple applications and users sharing the same machine is useful. Despite this organization's benefits, it has a serious
problem: only privileged servers and the kernel can manage system resources. Untrusted applications are restricted
to the interfaces and implementations of this privileged software. This organization is flawed because application
demands vary widely. An interface designed to accommodate every application must anticipate all possible needs. The
implementation of such an interface would need to resolve all tradeoffs and anticipate all ways the interface could be
used. Experience suggests that such anticipation is infeasible and that the cost of mistakes is high [3, 9, 16, 25, 43, 79].
The exokernel architecture attacks this problem by giving untrusted application code as much safe control over
resources as possible, thereby allowing orders of magnitude more programmers to innovate and use innovations,
without compromising system integrity. It does so by dividing responsibilities differently from the way conventional
systems do. Exokernels separate protection from management: they protect resources, applications manage them.
An exokernel strives to move all functionality not required for protection out of privileged kernels and servers into
unprivileged applications. For example, in the context of virtual memory, an exokernel protects physical pages and
disk blocks used for paging, but defers the rest of management to applications (i.e., paging, allocation, fault handling,
page table layout, etc.). The ideal exokernel makes untrusted software as powerful as a privileged operating system,
without sacrificing either protection or efficiency.
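The split described above can be sketched in a few lines of C. This toy model is purely illustrative: the function names and the flat ownership table are invented for exposition, not Xok's or Aegis's actual interface. The point is the division of labor, with the kernel doing only protection and the library doing all policy.

```c
/* Toy model of the protection/management split: the kernel only tracks
 * page ownership and checks access (protection); which pages to request
 * and how to use them is left to the application (management).
 * All names here are invented -- not a real exokernel API. */
#include <assert.h>

#define NPAGES 16
#define NOBODY 0                      /* environment ids start at 1 */

static int owner[NPAGES];             /* kernel state: who owns each page */

/* "Kernel" side: protection only -- allocate and check, no policy. */
int sys_page_alloc(int env, int page) {
    if (page < 0 || page >= NPAGES || owner[page] != NOBODY)
        return -1;                    /* out of range or already owned */
    owner[page] = env;
    return 0;
}

int sys_page_access_ok(int env, int page) {
    return page >= 0 && page < NPAGES && owner[page] == env;
}

/* "LibOS" side: management -- this library happens to use a trivial
 * first-fit policy; another libOS could substitute any policy it likes
 * without any change to, or cooperation from, the kernel. */
int libos_get_page(int env) {
    for (int p = 0; p < NPAGES; p++)
        if (sys_page_alloc(env, p) == 0)
            return p;
    return -1;
}
```

Note that page-table layout, paging, and fault handling would all live on the libOS side of this boundary, exactly as the paragraph above describes.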
Of course, not all applications need customized resource management. Instead of communicating with the exokernel
directly, we expect most programs to be linked with libraries that hide low-level resources behind traditional operating
system abstractions. However, unlike traditional implementations of these abstractions, library implementations are
unprivileged and can therefore be modified or replaced at will. We refer to these unprivileged libraries as library
operating systems, or libOSes. On the exokernels described in this thesis, libOSes implement virtual memory, file
systems, networking, and processes. Applications are written on top of these libraries in a way similar to how they are
written on top of current operating systems. As a result, applications can achieve appreciable speedups on an exokernel
by simply linking against an optimized library operating system. Figure 1-1 illustrates a generic exokernel system.
Figure 1-1: Possible exokernel system. Applications link against library operating systems (libOS), which provide
standard operating system abstractions (virtual memory, files, network protocols, etc.). Because libOSes are unprivileged, applications can also specialize them or write their own, as the web server in the picture has done. Because
the exokernel provides protection, completely different libOSes can simultaneously run on the same system and safely
share resources such as disk blocks and physical pages.
An exokernel retains the three benefits of traditional operating systems: default functionality and portability come
from writing applications on top of a libOS, while protection comes from the exokernel, which guards all resources.
In addition, we hope that the exokernel organization dramatically improves system innovation. We have four main
reasons for this hope:
1. Fault-isolation: an error in a libOS only affects the applications using it, in contrast to errors in privileged
operating systems, which can compromise the entire system. Thus, an exokernel significantly reduces the risk
of using operating system innovations.
2. Co-existence: by design, multiple, possibly specialized, library operating systems can co-exist on the same
exokernel system, in contrast to traditional systems, which by and large prevent more than one operating system
from running at a time. Thus, an exokernel enables innovation composition.
3. Increased implementor base: there are several orders of magnitude more systems programmers than privi-
leged implementors (of oft-times proprietary operating systems). We hope the rate of innovation increases
proportionally.
4. Increased user base: by making operating system software no different from other runtime libraries, the number
of people with discretion to use an innovation increases by an even larger factor. Similarly, using innovations
becomes a simple matter of linking in a new library rather than having to replace an entire system-wide operating
system (and forcing all other users of the system to use it, and its bugs, in the process).
These four features remove many of the practical barriers facing operating system innovation development and
deployment.
We do not assume that application programmers will modify operating system software as a matter of course.
Instead, we regard libOS modification as similar to that of compilers: both are large, relatively complex pieces of
software, not altered in the normal course of day-to-day programming. However, in the case of compilers, the enormous
implementor and user community, coupled with unprivileged, fault-isolated compiler implementations, has resulted in
thousands of languages and implementations. For example, new languages such as Java, Tcl, Perl, and C++ have swept
the implementor base every few years. Observed operating system revolutions, by contrast, happen both on a more attenuated time
scale and with less dramatic scope. We hope that by making OS software more similar to compilers in these ways,
its evolution will become more similar as well.
An exokernel's success does not depend on a panoply of different operating systems. In our view, similar again
to compilers and languages, there will be a few dominant operating systems but, importantly, the fact that they are
unprivileged will enable them to evolve more readily than traditional systems.
computer system (e.g., what security threats to resist, what resources to protect). In the context of extensible operating systems, policy refers to the
algorithm used to make security or resource management decisions, while mechanism refers to the machinery used to implement a particular policy.
For example, a virtual memory paging policy is to evict least recently used pages to disk. The data structures used to do so, such as page tables and a
sorted page list, would be mechanism. As with the concepts of code and data, there is no clear fundamental difference between these two notions.
to share resources (such as physical memory and disk blocks) without sacrificing a single view of the machine,
Downloading code into the kernel is another approach to extensibility. In many systems only trusted users can
download code, either through dynamically-loaded kernel extensions or static configuration [33, 43]. In the SPIN
and Vino systems, any user can safely download code into the kernel [9, 78]. Safe downloading of code through
type-safety [9, 75] and software fault-isolation [78, 89] is complementary to the exokernel approach of separating
protection from management. Exokernels use downloading of code to let the kernel leave decisions to untrusted
software [25].
In addition to these structural approaches, much work has been done on better OS abstractions that give more
control to applications, such as user-level networking [88, 82], lottery scheduling [90], application-controlled virtual
memory [44, 55] and file systems [13, 72]. All of this work is directly applicable to libOSes.
2. Whether it is possible to build a real one.
3. Whether doing so is a good idea.
The first half of the thesis, Chapters 2 and 4, focuses on how to build an exokernel system; the later chapters focus
on the approach's efficacy.
The text below provides an overview of the questions each issue raises, along with a sketch of their associated
answers.
The most general solution is to apply the exokernel precepts recursively: divide libraries into privileged and
unprivileged parts, where the “privileged” part contains all code required for protection, and must be used by all
applications using a piece of state, while the unprivileged part contains all management code and can be replaced by
anyone. Forcing applications to use the privileged portion of a libOS can be done by placing it in a server, using a
restricted language and trusted compiler, or downloading code into the kernel. We have used all three.
More commonly, library operating systems can be designed for localization of state and to use
standard fault-isolation techniques (such as type-safe languages and memory protection).
In many cases, the desire to protect shared state with mechanisms more elaborate than read and write access checks
makes little technical sense: if the raw data itself can be corrupted, it is unlikely that there is any reason to preserve
high-level invariants on the resultant garbage. One of the most common situations where this dynamic arises is in the
protection of shared bookkeeping data structures. Consider the case of a shared buffer (say, a Unix pipe) that uses a
buffer record to track the number of bytes in the buffer. At one level, one would like to guarantee that an application
decrements the byte count on data removal, and increments it on insertion. However, while this desire is sensible from
a software engineering perspective, it is not a protection issue. Since there are no guarantees on the sensibleness of the
inserted data, knowing how many bytes of garbage are in the buffer gains no security. In practice, social, rather than
technical, methods guarantee these sorts of invariants — e.g., assemblers adhere to standard object code formats rather
than going through a set of system-enforced methods for laying out debugging and linking information.
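The pipe example above can be made concrete with a few lines of C. This is a purely illustrative sketch (the struct layout and function names are invented, not taken from any exokernel): the nbytes field is shared bookkeeping maintained by convention, and guarding it buys no security, because the very parties who could corrupt the count can already corrupt the data bytes themselves.

```c
/* Illustrative sketch of the pipe example: the byte count in the
 * buffer record is bookkeeping, not protection.  A writer that lies
 * about the count hands garbage to readers of this one pipe -- but
 * those same parties can already write the data bytes directly, so a
 * kernel-enforced count would preserve no additional security. */
#include <assert.h>
#include <string.h>

#define PIPE_CAP 64

struct pipe_buf {
    char data[PIPE_CAP];
    int  nbytes;          /* shared bookkeeping: bytes currently buffered */
};

int pipe_write(struct pipe_buf *p, const char *src, int n) {
    if (n < 0 || p->nbytes + n > PIPE_CAP)
        return -1;
    memcpy(p->data + p->nbytes, src, n);
    p->nbytes += n;       /* maintained by convention, not by the kernel */
    return n;
}

int pipe_read(struct pipe_buf *p, char *dst, int n) {
    if (n > p->nbytes)
        n = p->nbytes;    /* return at most what the record claims exists */
    memcpy(dst, p->data, n);
    memmove(p->data, p->data + n, p->nbytes - n);
    p->nbytes -= n;
    return n;
}
```

In this sketch the invariant "nbytes equals the number of buffered bytes" is exactly the kind of property the text argues should be left to social convention rather than kernel enforcement.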
operations, inter-process communication, system calls, and even operations with large fixed costs, such as message round trip times over a 10Mb/s
Ethernet. [25]
gcc, telnet, and csh) without modification. This can be seen in the fact that the bulk of our performance data comes from
end-to-end application numbers rather than micro-benchmarks.
operating system can. The single new challenge it faces is deriving information lost by dislocating abstractions into
application space. For example, traditional operating systems manage both virtual memory and file caching. As a
result, they can perform global resource management of pages that takes into account the manner in which a page is
being used. In contrast, if an exokernel dislocates virtual memory and file buffer management into library operating
systems, it can no longer make such distinctions. While such information matters in theory, in practice we have found
it either unnecessary or coarse enough that no special methods have been needed to derive it. However, whether this
happy situation always holds is an open question.
More concretely, our experiments demonstrate that: (1) given the same workload, an exokernel performs comparably to widely used monolithic systems, and (2) when local optimizations are performed, whole-system performance improves, and can do so significantly.
1.2.4 Summary
Taken as a whole, these experiments show that, compared to a traditional system, an exokernel provides performance
ranging from comparable to significantly superior, while simultaneously giving tremendous flexibility.
1.3 Concerns
Below, we discuss five concerns about the exokernel architecture: (1) that it will compromise portability, (2) that it
will lead to a Babel of incompatible libOSes, (3) that libOSes are too costly to specialize, (4) that they consume too
much space per application, and (5) that the modification of a “free software” OS provides a “good enough” way to
innovate.
First, it is a virtue for an exokernel to expose the details of the system's hardware and software. Our exokernels
expose information such as the number of network and scatter/gather DMA buffers available and TLB entry format.
Naively handled, this exposure can compromise portability. Fortunately, traditional techniques for retaining portability
work just as well on exokernel systems. While applications that use an exokernel interface directly will not be portable,
those that use library operating systems that implement standard interfaces (e.g., POSIX) are portable across any
system that provides the same interface. Library operating systems themselves can be made portable by designing
them to use a low-level machine-independent layer to hide hardware details. This technique has been widely used in
the context of more traditional operating systems [74, 21].
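The machine-independent layer mentioned above can be sketched as a small table of function pointers that portable libOS code calls through, with one backend per machine. Everything in this sketch (the struct, the backend names, the 4 KB page size) is a hypothetical illustration, not the actual interface of any exokernel port:

```c
/* Sketch of the portability technique described above: a libOS keeps
 * its paging code machine-independent by calling through a small
 * hardware-abstraction interface; only the per-machine backend knows
 * TLB formats and page sizes.  All names here are hypothetical. */
#include <assert.h>
#include <stdint.h>

struct machdep_ops {                 /* the machine-independent layer */
    int (*page_size)(void);
    int (*map_page)(uintptr_t va, uintptr_t pa, int writable);
};

/* One toy backend; a MIPS or x86 port would supply its own versions. */
static int toy_page_size(void) { return 4096; }
static int toy_map_page(uintptr_t va, uintptr_t pa, int writable) {
    (void)va; (void)pa; (void)writable;
    return 0;                        /* a real backend would program the TLB */
}

static const struct machdep_ops toy_ops = { toy_page_size, toy_map_page };

/* Portable libOS code: rounds an address down to a page boundary and
 * maps it, without knowing anything about the machine underneath. */
int libos_map(const struct machdep_ops *md, uintptr_t addr) {
    uintptr_t va = addr & ~(uintptr_t)(md->page_size() - 1);
    return md->map_page(va, va, 1);
}
```

Porting the libOS then means writing a new machdep_ops table rather than touching the portable paging code, which is the design choice the paragraph above attributes to traditional systems as well [74, 21].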
As in traditional systems, an exokernel can provide backward compatibility in three ways: one, binary emulation
of the operating system and its programs; two, implementing its hardware abstraction layer on top of an exokernel;
and three, re-implementing the operating system's abstractions on top of an exokernel.
Second, since library operating systems are unprivileged, anyone can write one. An obvious concern is what
happens to system coherence under an onslaught of “hand-rolled” operating systems. As with the previous concern,
applications already solve the problem of chaos from freedom through the use of standards and good programming
practices. The advantage of standards on paper rather than hard-wired in the kernel is that they can be re-implemented
and specialized, selected amongst, or, in the case of the worst ones, ignored.
Third, does specialization require writing an entire library operating system? Our experience shows that it does
not. All of the application performance improvements in this thesis, even the most dramatic, were achieved by
inserting the specialized subsystem within the default libOS, rather than rewriting an entire libOS from scratch. Given
a modular design, other library operating systems should allow similar specialization. It is possible that object-oriented
programming methods, overloading, and inheritance can provide useful operating system service implementations that
can be easily specialized and extended, as in the VM++ library [51].
Most libOS code is no more difficult to modify than other systems software such as memory allocators. It tends to
be conceptually shallow, and concerned chiefly with managing various mappings (page tables, metadata, file tables,
etc.). As such, once operating system software is removed from the harsh environment of the kernel proper, modification
does not require extraordinary competence.
Fourth, if each application has its own library operating system linked into it, what about space? As in other domains
with large libraries, exokernel systems can use dynamic linking and shared libraries to reduce space overhead. In our
experience, these techniques reduce overheads to negligible levels. For example, our global performance experiments
place a severe strain on system memory resources, yet an exokernel is comparable to or surpasses the conventional
systems we compare against.
Finally, what about Linux, FreeBSD, and the rest of the public-domain operating systems? Do they not evolve as
quickly as an exokernel will? We believe that an exokernel structure has important software engineering benefits over
these systems. First, libraries are fault-isolated from each other, allowing different operating systems to co-exist on the
same system, something not possible with the current crop of operating systems. Second, library development is far
easier than kernel hacking. For example, standard debugging tools work, and errors crash only the application
using the library instead of the system, allowing easy post-mortem examination. Third, all users have the power to use
an innovation, simply by linking an application against a libOS.
1.4 Summary
The exokernel in a nutshell:
1. What: anyone can safely manage resources.
2. How: separate protection from management. An exokernel protects resources, applications manage them.
Everything not required for protection is moved out of privileged kernels and servers into untrusted application
libraries. This is the one idea to remember about an exokernel; all other features are details derived from it.
The exokernel ideal: a libOS made as powerful as a privileged OS, without sacrificing performance or protection.
3. Why? Innovation. An exokernel makes operating systems software co-exist, unprivileged, and modifiable by
orders of magnitude more programmers.
The following chapter articulates the principles of the exokernel methodology. Chapter 3 draws on examples from
two exokernel systems, Aegis and Xok, to show how these principles apply in practice. The most challenging problem
for protection, disk multiplexing, is discussed in detail in Chapter 4. Chapter 5 presents application performance results
that show that such separation is profitable. We reflect on using downloaded code for extensibility in Chapter 6.
Finally, Chapter 7 discusses future work, reflects on general lessons learned while building exokernel systems, and
concludes.
Chapter 2
We all live in the protection of certain cowardices which we call our principles. — Mark Twain (1835-1910)
It is impossible to design a system so perfect that no one needs to be good. — T. S. Eliot
The goal of an exokernel is to give as much safe, efficient control over system resources as possible. The challenge
for an exokernel is to give library operating systems maximum freedom in managing resources while protecting them
from each other; a programming error in one library operating system should not affect another library operating
system. To achieve this goal, an exokernel separates protection, which must be privileged, from management, which
should be unprivileged.
Figure 2-1 shows a simplified exokernel system that is running two applications: an unmodified UNIX application
linked against the ExOS libOS and a specialized exokernel application using its own TCP and file system libraries.
Applications communicate with the kernel using low-level physical names (e.g., block numbers); the kernel interface is
as close to the hardware as possible. LibOSes handle higher-level names (e.g., file descriptors) and supply abstractions.
Protection in this figure consists of forcing applications to only read or write (cached) disk blocks that they have access
rights to. The kernel does this by guarding all operations on both the disk blocks themselves and on the physical pages
that hold cached copies.
In general, resource protection consists of three tasks: (1) tracking ownership of resources, (2) ensuring protection
by guarding all resource usage or binding points, and (3) revoking access to resources. To prevent rogue applications
from destroying the system, most exokernels protect physical resources such as physical memory, the CPU, and
network devices (to prevent theft of received messages and corruption of not-yet-transmitted ones). Additionally, they
must guard operations that change privileged state such as the ability to execute privileged instructions, and writes to
“special” memory locations that control devices. Finally, they may guard more abstract resources such as bandwidth.
To make this discussion concrete, consider the protection of physical pages. The core requirement is that the
exokernel must check that every read and write to any page has sufficient access rights. On modern architectures,
encapsulation involves forcing all memory operations to either go through the kernel (which checks permissions
manually) or through translation hardware (the TLB), which checks them against a set of cached pages. To ensure that
malicious processes do not corrupt the TLB, the kernel must check modifications of it as well. In practice this means
that applications cannot issue privileged instructions directly, but must go through the kernel. The next three chapters
give many examples of resources to protect and ways to do so.
Applications cannot override privileged software. This inflexibility makes protection possible. However, if it spills
over into non-protection areas, applications will needlessly lose the ability to control important decisions. Thus, an
exokernel only overrides application decisions to the extent necessary to build a secure system.
This chapter describes five principles that give this declarative goal operational teeth. These principles illustrate
the mechanics of exokernel systems and provide important motivation for many design decisions discussed later in
this thesis. To better understand what the architecture is, we then show what it is not. The remainder of the chapter
discusses how an exokernel protects the state of higher-level abstractions, and performs resource revocation, and then
presents a general technique, secure bindings, we have used to make protection efficient.
Figure 2-1: A simplified exokernel system with two applications, each linked with its own libOS and sharing pages
through a buffer cache registry.
internal fragmentation through coarse-grained allocation. Coarse-grained protection also makes sharing clumsy, since
resources cannot be shared at a finer granularity than the exokernel protects.
Expose names. Exokernels use physical names wherever possible. Physical names capture useful information
and do not require potentially costly or race-prone translations from virtual names. For example, physical disk blocks
encode position, one of the most important variables for file system performance. Virtual names, in contrast, would
not necessarily encode this information and would require on-disk translation tables. Using these tables increases disk
accesses (extra writes for persistence, extra reads for name translation) and protecting them across reboots degrades
performance (e.g., by ordering writes to disk, and causing multiple writes).
In practice, physical names also provide a uniform mechanism for optimistic synchronization. They allow locks to
be replaced with checks that “expected” conditions hold (that a directory name maps to a specific disk block and has
not been deleted or moved, etc.).
Expose information. Exokernels expose system information, and collect data that applications cannot easily
derive locally. For example, applications can determine how many hardware network buffers there are or which
pages cache file blocks. An exokernel might also record an approximate least-recently-used ordering of all physical
pages, something individual applications cannot do without global information. Additionally, exokernels export book-
keeping data structures such as free lists, disk arm positions, and cached TLB entries so that applications can tailor
their allocation requests to available resources.
We have found it useful in practice to provide space for protected application-specific data in kernel data structures.
For example, the structures describing an execution environment are improved if application-specific data can be
associated with them (e.g., to record the numeric id of a process's parent, its running status, program name, etc.).
These principles apply not just to the kernel, but to any component of an exokernel system. Privileged servers
should provide an interface boiled down to just what is required for protection.
2.1.1 Policy
An exokernel hands over resource policy decisions to library operating systems. Using this control over resources,
an application or collection of cooperating applications can make decisions about how best to use these resources.
However, as in all systems, an exokernel must include policy to arbitrate between competing library operating systems:
it must determine the absolute importance of different applications, their share of resources, etc. This situation is no
different than in traditional kernels. Appropriate mechanisms are determined more by the environment than by the
operating system architecture. For instance, while an exokernel cedes management of resources to library operating
systems, it controls the allocation and revocation of these resources. By deciding which allocation requests to grant
and from which applications to revoke resources, an exokernel can enforce traditional partitioning strategies, such as
quotas or reservation schemes. Since policy conflicts boil down to resource allocation decisions (e.g., allocation of
seek time, physical memory, or disk blocks), an exokernel handles them in a similar manner.
oldest use of downloaded code is packet filters. Packet filters are application-written code fragments that applications
download into the kernel to select what incoming network packets they want.
However, while these software abstractions reside in the kernel, they could also be implemented in trusted user-
level servers. This microkernel organization would cost many additional (potentially expensive) context switches.
Furthermore, partitioning functionality in user-level servers tends to be more complex.
From one perspective, downloaded code provides a conduit for pushing semantics across the user-kernel trust
boundary. This activity is more important for exokernels than traditional systems. By dislodging OS code into
libraries, an exokernel also removes the code that understands resource semantics and, thus, loses the ability to protect
these resources. For networking, because an exokernel ejects the code that understands network protocol semantics, it
no longer understands how to bind incoming packets to applications. Packet filters provide a way to recapture these
semantics by encapsulating them in a black box (the filter), which is then injected into the kernel. The exokernel thus
trades an intractable problem — anticipating all possible invariants that must be enforced — for one that is tractable:
testing the downloaded code for trustworthiness (in this case, that one filter will not steal another's packets). Chapter 6
discusses this perspective in more detail.
The key to these exokernel software abstractions is that they neither hinder low-level access to hardware resources
nor unduly restrict the semantics of the protected abstractions they enable. Given these properties, a kernel software
abstraction does not violate the exokernel principles.
Protected sharing
The low-level exokernel interface gives libOSes enough hardware control to implement all traditional operating system
abstractions. Library implementations of abstractions have the advantage that they can trust the applications they link
with and need not defend against malicious use. The flip side, however, is that a libOS cannot necessarily trust all
other libOSes with access to a particular resource. When libOSes guarantee invariants about their abstractions, they
must be aware of exactly which resources are involved, what other processes have access to those resources, and what
level of trust they place in those other processes.
As an example, consider the semantics of the UNIX fork system call. It spawns a new process initially identical to
the currently running one. This involves copying the entire virtual address space of the parent process, a task operating
systems typically perform lazily through copy-on-write to avoid unnecessary page copies. While copy-on-write can
always be done in a trusted, in-kernel virtual memory system, a libOS must exercise care to avoid compromising the
semantics of fork when sharing pages with potentially untrusted processes. This section details some of the approaches
we have used to allow a libOS to maintain invariants when sharing resources with other libOSes.
Our latest exokernel, Xok, provides three mechanisms libOSes can use to maintain invariants in shared abstractions.
First, software regions, areas of memory that can only be read or written through system calls, provide sub-page
protection and fault isolation. Second, Xok allows on-the-fly creation of hierarchically-named capabilities and
requires that these capabilities be specified explicitly on each system call [62]. Thus, a buggy child process accidentally
requesting write access to a page or software region of its parent will likely provide the wrong capability and be denied
permission. Third, Xok provides robust critical sections: inexpensive critical sections that are implemented by logically
disabling software interrupts [10]. Using critical sections instead of locks eliminates the need to trust other processes.
Three levels of trust determine what optimizations can be used by the implementation of a shared abstraction.
Optimize for the common case: Mutual trust. It is often the case that applications sharing resources place a
considerable amount of trust in each other. For instance, any two UNIX programs run by the same user can arbitrarily
modify each other's memory through the debugger system call, ptrace. When two exokernel processes can write
each other's memory, their libOSes can clearly trust each other not to be malicious. This reduces the problem of
guaranteeing invariants from one of security to one of fault-isolation, and consequently allows libOS code to resemble
that of monolithic kernels implementing the same abstraction.
Unidirectional trust. Another common scenario occurs when two processes share resources and one trusts the
other, but the trust is not mutual. Network servers often follow this organization: a privileged process accepts network
connections, forks, and then drops privileges to perform actions on behalf of a particular user. Many abstractions
implemented for mutual trust can also function under unidirectional trust with only slight modification. In the example
of copy-on-write, for instance, the trusted parent process must retain exclusive control of shared pages and its own
page tables, preventing a child from making copied pages writable in the parent. While this requires more page
faults in the parent, it does not increase the number of page copies or seriously complicate the code.
Mutual distrust. Finally, there are situations where mutually distrustful processes must share high-level abstrac-
tions with each other. For instance, two unrelated processes may wish to communicate over a UNIX domain socket,
and neither may have any trust in the other. For OS abstractions that can be shared by mutually distrustful processes,
the most general solution is to apply the exokernel precepts recursively: divide libraries into privileged and unprivi-
leged parts, where the “privileged” part contains all code required for protection, and must be used by all applications
using a piece of state, while the unprivileged part contains all management code and can be replaced by anyone. Forcing
applications to use the privileged portion of a libOS can be done by placing it in a server, using compiler techniques,
or downloading it into the kernel. We have used all three.
Fortunately, sharing with mutual distrust occurs very infrequently for many abstractions. Many types of sharing
occur only between child and parent processes, where mutual or unidirectional trust almost always holds.
pagers [2, 77], most operating systems deallocate (and allocate) physical memory without informing applications. This
form of revocation has lower latency and is simpler than visible revocation since it requires no application involvement.
Its disadvantages are that library operating systems cannot guide deallocation and have no knowledge that resources
are scarce.
An exokernel uses visible revocation for most resources. Even the processor is explicitly revoked at the end of
a time slice; a library operating system might react by saving only the required processor state. For example, a
library operating system could avoid saving the floating point state or other registers that are not live. However, since
visible revocation requires interaction with a library operating system, invisible revocation can perform better when
revocations occur very frequently. Processor addressing-context identifiers are a stateless resource that may be revoked
very frequently and are best handled by invisible revocation.
A secure binding is a protection mechanism that decouples authorization from the actual use of a resource. Secure
bindings improve performance in two ways. First, the protection checks involved in enforcing a secure binding are
expressed in terms of simple operations that the kernel (or hardware) can implement quickly. Second, a secure binding
performs authorization only at bind time, which allows management to be decoupled from protection. Application-
level software is responsible for many resources with complex semantics (e.g., network connections). By isolating
the need to understand these semantics to bind time, the kernel can efficiently implement access checks at access time
because it need no longer involve an application. Simply put, a secure binding allows the kernel to protect resources
without understanding them.
Operationally, the one requirement needed to support secure bindings is a set of primitives that application-level
software can use to express protection checks. The primitives can be implemented either in hardware or software.
A simple hardware secure binding is a TLB entry: on a TLB fault, the complex mapping from a virtual to a physical
address is computed from a library operating system's page table and loaded into the hardware TLB once (bind time),
then used on many subsequent references (access time). Another example is the packet filter [65], which allows predicates to be
downloaded into the kernel (bind time) and then run on every incoming packet to determine which application the
packet is for (access time). Without a packet filter, the kernel would need to query every application or network server
on every packet reception to determine who the packet was for. By separating protection (determining who the packet
is for) from authorization and management (setting up connections, sessions, managing retransmissions, etc.) very
fast network multiplexing is possible while still supporting complete application-level flexibility.
We use three basic techniques to implement secure bindings: hardware mechanisms, software caching, and
downloading application code.
Appropriate hardware support allows secure bindings to be couched as low-level protection operations such that
later operations can be efficiently checked without recourse to high-level authorization information. For example, a file
server can buffer data in memory pages and grant access to authorized applications by providing them with capabilities
for the physical pages. An exokernel would enforce capability checking without needing any information about the
file system's authorization mechanisms. As another example, some Silicon Graphics frame buffer hardware associates
an ownership tag with each pixel. This mechanism can be used by the window manager to set up a binding between
a library operating system and a portion of the frame buffer. The application can access the frame buffer hardware
directly, because the hardware checks the ownership tag when I/O takes place.
Secure bindings can be cached in an exokernel. For instance, an exokernel can use a large software TLB [7, 46]
to cache address translations that do not fit in the hardware TLB. The software TLB can be viewed as a cache of
frequently-used secure bindings.
Secure bindings can be implemented by downloading code into the kernel. This code is invoked on every resource
access or event to determine ownership and the actions that the kernel should perform. Downloading code into
the kernel allows an application thread of control to be immediately executed on kernel events. The advantages of
downloading code are that potentially expensive crossings can be avoided and that this code can run without requiring
the application itself to be scheduled. Type-safe languages [9, 75], interpretation, and sandboxing [89] can be used to
execute untrusted application code safely [26].
As an example, consider the problem of writing cached disk blocks to stable storage in a way that guarantees
consistency across reboots. Rather than an exokernel deciding on a particular write ordering and having to struggle
with the associated tradeoffs in scheduling heuristics and caching decisions required, it can instead allow the application
to construct schedules, retaining for itself the much simplified task of merely checking that any application schedule gives
appropriate consistency guarantees. Application of this methodology enables an exokernel to leave library operating
systems to decide on tradeoffs themselves rather than forcing a particular set, a crucial shift of labor.
While this style of design may seem obvious, in practice it has been depressingly easy to slide into “the old way”
of deciding how to implement a particular feature rather than asking the unusual question of how one can get out of
doing so, safely. A useful heuristic to catch such slips is to note when one is making many tradeoffs. A plethora of
tradeoffs almost invariably signals a decision that could be implemented or optimized in different important ways and,
thus, should be left to applications.
Because libOSes understand the higher-level semantics of their abstractions, they “just know” many things that an
exokernel does not. For example, a libOS may know that related files are likely clustered together, so that if one block
is fetched, the succeeding eight blocks should be fetched as well. Thus, a variant of the above methodology is designing interfaces where
actions do not require justification. As an example, consider the act of fetching a disk block in core. If a file system
must present credentials before the kernel will allow the fetch to be initiated, then it faces serious problems doing
the prefetching it needs, since it would first have to read in all file directory entries, show each file's capability to the
kernel, and only then initiate the fetches. In contrast, by only performing access control when the actual data in the
block is read or written, rather than when the block is fetched, these disk reads can be initiated all at once.
Chapter 3
When a man says he approves of something in principle, it means he hasn't the slightest intention of
putting it into practice. — Otto von Bismarck (1815-1898)
To illustrate the exokernel principles, this chapter discusses how to export low-level primitives such as exceptions
and inter-process communication, and multiplex physical resources such as memory, CPU, and the network. To make
this discussion concrete, we draw on examples from two exokernel systems: Aegis [25], and Xok [48]. Aegis is the
first exokernel we built. It runs on the MIPS [50] DECstation family. Xok is the second. It runs on the x86 architecture
and is the more mature of the two. Construction of these systems spanned four years: two for Aegis, two (and counting)
for Xok.
Xok and Aegis differ in important ways, helping to show how the application of exokernel principles changes
in the face of different constraints. Most of these differences result from the fact that the x86 and MIPS hardware
have radically different architectural interfaces (e.g., software page tables for the MIPS, hardware for x86) and
implementation performance (an order of magnitude in favor of the x86).
The implementations we discuss are merely examples of how resources can be multiplexed, not how they must be.
Other implementations are possible.
Stylistically, the discussion of each resource focuses on two questions: (1) how to give applications control over
the resource and (2) how to protect it. Some of the resources we discuss have little to do with protection due to the
lack of “sharing.” For example, besides memory protection issues, exceptions are contained within a single process.
Lack of multiplexing makes protection a non-issue. Others, such as networking, focus more heavily on it.
We provide a quick overview of Aegis and Xok below. The remainder of the chapter discusses how they multiplex
resources.
The micro-benchmarks discussed in this section compare Aegis and ExOS with the performance of Ultrix 4.2 (a
mature monolithic UNIX operating system) on the same hardware. While Aegis and ExOS do not offer the same level
of functionality as Ultrix, we do not expect that adding such functionality would cause large increases in our timing measurements.
Ultrix, despite its poor performance relative to Aegis, is not a poorly tuned system; it is a mature monolithic system
that performs quite well in comparison to other operating systems [69]. For example, it performs two to three times
better than Mach 3.0 in a set of I/O benchmarks [67]. Also, its virtual memory performance is approximately twice
that of Mach 2.5 and three times that of Mach 3.0 [5].
Our measurements were taken on a DECstation 5000/125 (25MHz), with an R3000 processor and a SPECint92
rating of 25.
In order to support application-level virtual memory efficiently, TLB refills must be fast. To this end, Aegis caches
TLB entries (a form of secure bindings) in the kernel by overlaying the hardware TLB with a large software TLB
(STLB) to absorb capacity misses [7, 46]. On a TLB miss, Aegis first checks to see whether the required mapping is
in the STLB. If so, Aegis installs it and resumes execution; otherwise, the miss is forwarded to the application.
The STLB contains 4096 entries of 8 bytes each. It is direct-mapped and resides in unmapped physical memory.
An STLB “hit” takes 18 instructions (approximately one to two microseconds). In contrast, performing an upcall to
application level on a TLB miss, followed by a system call to install a new mapping is at least three to six microseconds
more expensive.
As dictated by the exokernel principle of exposing kernel book-keeping structures, the STLB can be mapped using
a well-known capability, which allows applications to efficiently probe for entries.
Using the control given by libOS-defined page tables, ExOS has implemented a variety of different page table
structures, 1 high-performance network paging, and application-specific page coloring (for improved cache performance).
No other protected operating system architecture allows this degree of freedom.
structure in a week, while taking his final exams. As a testament to the difficulty of modifying current operating systems, the proposers of this page
table structure were only able to simulate it [83].
Protection for networking amounts to answering the question: given a message, who owns it? On a connection-oriented
network, answering this question is easy: whichever connection (or flow) it belongs to. Binding of application to flow
can happen at connection initiation, removing the need to make decisions based on packet contents. An example of a
hardware-based mechanism is the use of the virtual circuit in ATM cells to securely bind streams to applications [23].
However, answering this question is more difficult on a connectionless network such as Ethernet, where message
ownership requires understanding header semantics. Since the exokernel has dislocated all network protocol code
that understands packet semantics into library operating systems, it lacks the information necessary to decide which
application owns what message.
To solve this problem, an exokernel requires that networking libraries download packet filters [65] to select the
messages they want. 2 Conceptually, every filter is invoked on every arriving packet and returns “accept” (the filter
wants the message) or “reject” (the filter does not want the message). The operating system thus need not understand
the actual bits in a message in order to bind an arriving packet to its owner.
For protection, the exokernel must ensure that a filter does not “lie” and accept packets destined to another
process. To prevent this theft we intentionally designed our filter language to make “overlap” detection simple.
(Alternatively, simple security precautions such as only allowing a trusted server to install filters could be used to
address this problem.) Finally, we ensure fault isolation through a combination of language design (to bound runtime)
and runtime checks (to protect against wild memory references and unsafe operations).
Both Aegis and Xok use packet filters, because our current network does not provide hardware mechanisms
for message demultiplexing. One challenge with a language-based approach is to make filters fast. Traditionally,
packet filters have been interpreted, making them less efficient than in-kernel demultiplexing routines. One of the
distinguishing features of the packet filter engine used by our prototype exokernel is that it compiles packet filters to
machine code at runtime, increasing demultiplexing performance by more than an order of magnitude [28, 31]. 3
Sharing the network interface for outgoing messages is easier. Messages are simply copied from application space
into a transmit buffer. In fact, with appropriate hardware support, transmission buffers can be mapped into application
space just as easily as physical memory pages [23].
An exokernel defers message construction to applications. A protection problem this creates is how to prevent
applications from “spoofing” other applications by sending bogus messages. Our two exokernel systems do not prevent
this attack, since they were designed for an insecure networking environment. However, within the context of a trusted
network, an exokernel could use “inverse” packet filters to reject messages that do not fit a specified pattern (or,
alternatively, reject messages that do). The techniques that worked for DPF could be applied here as well.
which, because they moved networking code out of the operating system into privileged servers, rendered an operating system too ignorant to
demultiplex packets.
3 However, recently we have experimented with the use of an aggressive interpreter that, by preprocessing filters and exploiting “super-operator”
instructions, runs roughly within a factor of four of hand-tuned code — a perfectly adequate speed given that demultiplexing is coupled to a
high-latency I/O event (packet reception).
the time slice will be run. Position can be used to meet deadlines and to trade off latency for throughput. For example,
a long-running scientific application could allocate contiguous time slices in order to minimize the overhead of context
switching, while an interactive application could allocate several equidistant time slices to maximize responsiveness.
The kernel provides a yield primitive to donate the remainder of a process' current time slice to another (specific)
process. Applications can use this simple mechanism to implement their own scheduling algorithms.
Timer interrupts denote the beginning and end of time slices, and are delivered in a manner similar to exceptions
(discussed below): a register is saved in the “interrupt save area,” the exception program counter is loaded, and the kernel
jumps to user-specified interrupt handling code with interrupts re-enabled. The application's handlers are responsible
for general-purpose context switching: saving and restoring live registers, releasing locks, etc. This framework gives
applications a large degree of control over context switching. For example, because it notifies applications of clock
interrupts, and gives them control over context switching's state saving and restoration, it can be used to implement
scheduler activations [4].
Fairness is achieved by bounding the time an application takes to save its context: each subsequent timer interrupt
(which demarcates a time slice) is recorded in an excess time counter. Applications pay for each excess time slice
consumed by forfeiting a subsequent time slice. If the excess time counter exceeds a predetermined threshold, the
environment is destroyed. In a more friendly implementation, the kernel could perform a complete context switch for
the application.
This simple scheduler can support a wide range of higher-level scheduling policies. As we demonstrate below, an
application can enforce proportional sharing on a collection of sub-processes.
3.3.2 Aegis Processor Environments
An Aegis processor environment is a structure that stores the information needed to deliver events to applications. All
resource consumption is associated with an environment because Aegis must deliver events associated with a resource
(such as revocation exceptions) to its designated owner.
Four kinds of events are delivered by Aegis: exceptions, interrupts, protected control transfers, and address
translations. Processor environments contain the four contexts required to support these events:
Exception context: for each exception, an exception context contains a program counter specifying where to jump
and a pointer to physical memory for saving registers.
Interrupt context: for each interrupt, an interrupt context includes a program counter and a register-save region.
In the case of timer interrupts, the interrupt context specifies separate program counters for start-time-slice and
end-time-slice cases, as well as status register values that control co-processor and interrupt-enable flags.
Protected Entry context: a protected entry context specifies program counters for synchronous and asynchronous
protected control transfers from other applications. Aegis allows any processor environment to transfer control into
any other; access control is managed by the application itself.
Addressing context: an addressing context consists of a set of guaranteed mappings. A TLB miss on a virtual
address that is mapped by a guaranteed mapping is handled by Aegis. Library operating systems rely on guaranteed
mappings for bootstrapping page-tables, exception handling code, and exception stacks. The addressing context also
includes an address space identifier, a status register, and a tag used to hash into the Aegis software TLB. To switch
from one environment to another, Aegis must install these three values.
These are the event-handling contexts required to define a process. Each context depends on the others for validity:
for example, an addressing context does not make sense without an exception context, since it does not define any
action to take when an exception or interrupt occurs.
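The four contexts above can be pictured as fields of a single kernel structure. A minimal C sketch follows; the field names, types, and sizes are illustrative guesses, not Aegis's actual declarations:

```c
/*
 * Hypothetical sketch of an Aegis processor environment.
 * All names and array sizes here are illustrative.
 */
#include <assert.h>
#include <stdint.h>

#define N_EXCEPTIONS 16            /* assumed number of exception vectors */
#define N_MAPPINGS    8            /* assumed number of guaranteed mappings */

struct exc_ctx {                   /* exception context */
    uint32_t pc[N_EXCEPTIONS];     /* per-exception handler program counters */
    uint32_t save_area;            /* physical address of register-save region */
};

struct intr_ctx {                  /* interrupt context */
    uint32_t start_slice_pc;       /* timer: start-of-time-slice handler */
    uint32_t end_slice_pc;         /* timer: end-of-time-slice handler */
    uint32_t status;               /* co-processor / interrupt-enable flags */
    uint32_t save_area;
};

struct entry_ctx {                 /* protected entry context */
    uint32_t sync_pc;              /* synchronous protected control transfer */
    uint32_t async_pc;             /* asynchronous protected control transfer */
};

struct addr_ctx {                  /* addressing context */
    uint32_t guaranteed[N_MAPPINGS][2]; /* (vpage, ppage) guaranteed mappings */
    uint32_t asid;                 /* address space identifier */
    uint32_t status;               /* status register */
    uint32_t stlb_tag;             /* tag used to hash into the software TLB */
};

struct aegis_env {                 /* an environment bundles the four contexts */
    struct exc_ctx   exc;
    struct intr_ctx  intr;
    struct entry_ctx entry;
    struct addr_ctx  addr;
};
```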
When an exception occurs, Aegis performs the following steps:
1. It saves three scratch registers into an agreed-upon “save area.” (To avoid TLB exceptions, Aegis does this
operation using physical addresses.)
2. It loads the exception program counter, the last virtual address that failed to have a valid translation, and the
cause of the exception.
3. It uses the cause of the exception to perform an indirect jump to an application-specified program counter value,
where execution resumes with the appropriate permissions set (i.e., in user-mode with interrupts re-enabled).
After processing an exception, applications can immediately resume execution without entering the kernel. Ensuring
that applications can return from their own exceptions (without kernel intervention) requires that all exception state
be available for user reconstruction. This means that all registers that are saved must be in user-accessible memory
locations.
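As a rough illustration, the three dispatch steps can be simulated in C. The cause codes, handler table, and helper names below are hypothetical, and the real dispatch path is 18 machine instructions, not C:

```c
/*
 * Hypothetical simulation of Aegis's three-step exception dispatch:
 * (1) save scratch registers to a physical save area, (2) load the
 * exception state, (3) indirect-jump through an application-installed
 * handler table. All names are illustrative.
 */
#include <assert.h>
#include <stdint.h>

#define N_CAUSES 4
enum { CAUSE_TLB_MISS = 0, CAUSE_DIVIDE = 1 };

typedef void (*handler_t)(uint32_t epc, uint32_t badvaddr);

static uint32_t save_area[3];          /* agreed-upon register-save area */
static handler_t handlers[N_CAUSES];   /* application-installed handler PCs */
static uint32_t last_epc, last_badvaddr;

static void user_tlb_handler(uint32_t epc, uint32_t badvaddr) {
    last_epc = epc;                    /* user-level handler runs here and */
    last_badvaddr = badvaddr;          /* resumes without re-entering the kernel */
}

static void dispatch(uint32_t cause, uint32_t epc, uint32_t badvaddr,
                     const uint32_t scratch[3]) {
    for (int i = 0; i < 3; i++)        /* 1. save scratch registers */
        save_area[i] = scratch[i];
    /* 2. the epc, faulting address, and cause are "loaded" as arguments */
    handlers[cause](epc, badvaddr);    /* 3. indirect jump to user handler */
}
```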
Currently, Aegis dispatches exceptions in 18 instructions (1.5 microseconds on our 25MHz DECstation5000/125),
roughly a factor of a hundred faster than Ultrix, a monolithic OS running on the same hardware, and over five times
faster than the most highly-tuned implementation in the literature. This improvement comes mainly from two facts:
(1) exokernel primitives are low-level, which means applications do not pay for unnecessary functionality, and (2)
the exokernel concentrates on doing a few things well. For example, Aegis is small and does not need to page its data
structures. Because all kernel addresses are physical, they never miss in the TLB, and Aegis does not have to separate
kernel TLB misses from the more general class of exceptions in its exception demultiplexing routine.
Fast exceptions enable a number of intriguing applications: efficient page-protection traps can be used by applica-
tions such as distributed shared memory systems, persistent object stores, and garbage collectors [5, 87]. Exception
times are discussed further in [25].
3.4.3 Implementing Unix IPC on Xok
UNIX defines a variety of interprocess communication primitives: signals (software interrupts that can be sent between
processes or to a process itself), pipes (producer-consumer untyped message queues), and sockets (differing from pipes
in that they can be established between non-related processes, potentially executing on different machines).
Signals are layered on top of Xok IPC. Pipes are implemented using Xok's software regions, which provide sub-
page memory protection, coupled with a “directed yield” to the other party when it is required to do work (i.e., if the
queue is full or empty). Sockets communicating on the same machine are currently implemented using a shared buffer.
Inter-machine sockets are implemented through user-level network libraries for UDP and TCP. The network libraries
are implemented using Xok's timers, upcalls, and packet rings, which allow protected buffering of received network
packets.
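A single-process sketch of this pipe scheme follows, assuming a shared ring in a software region and a stubbed-out directed yield; on Xok the yield would donate the time slice to the peer environment, and all names here are illustrative:

```c
/*
 * Hypothetical sketch of a pipe built on a shared software region:
 * a producer-consumer ring, with a directed yield to the other party
 * whenever the queue is full or empty. yield_to() is a stub here.
 */
#include <assert.h>
#include <string.h>

#define RING_SZ 8

struct pipe_ring {                    /* lives in a shared software region */
    char buf[RING_SZ];
    unsigned head, tail;              /* head: next read, tail: next write */
};

static void yield_to(int peer_env) { (void)peer_env; /* directed-yield stub */ }

static void pipe_write(struct pipe_ring *p, char c, int reader_env) {
    while (p->tail - p->head == RING_SZ)   /* queue full: let reader run */
        yield_to(reader_env);
    p->buf[p->tail++ % RING_SZ] = c;
}

static char pipe_read(struct pipe_ring *p, int writer_env) {
    while (p->tail == p->head)             /* queue empty: let writer run */
        yield_to(writer_env);
    return p->buf[p->head++ % RING_SZ];
}
```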
3.5 Discussion
This chapter has discussed how an exokernel can safely export a variety of resources — physical memory, the CPU,
the network, and hardware events — to applications. The low-level nature of the exokernel interface allows library operating
systems, working above the exokernel interface, to implement higher-level abstractions and define special-purpose
implementations that best meet the performance and functionality goals of applications. This organization allows the
redefinition of fundamental operating system abstractions by simply changing application-level libraries. Furthermore,
these different versions can co-exist on the same machine and are fully protected by Aegis.
The next chapter presents the resource we have found most challenging, disk, along with our solution to it (and
three failed solutions).
Chapter 4
An exokernel must provide a means to safely multiplex disks among multiple library file systems (libFSes). Each
libOS contains one or more libFSes. Multiple libFSes can be used to share the same files with different semantics.
In addition to accessing existing files, libFSes can define new on-disk file types with arbitrary meta data formats.
An exokernel must give libFSes as much control over file management as possible while still protecting files from
unauthorized access. It therefore cannot rely on simple-minded solutions like partitioning to multiplex a disk: each
file would require its own partition.
To allow libFSes to perform their own file management, an exokernel stable storage system must satisfy four
requirements. First, creating new file formats should be simple and lightweight. It should not require any special
privilege. Second, the protection substrate should allow multiple libFSes to safely share files at the raw disk block
and meta data level. Third, the storage system must be efficient—as close to raw hardware performance as possible.
Fourth, the storage system should facilitate cache sharing among libFSes, and allow them to easily address problems
of cache coherence, security, and concurrency.
The goal of disk multiplexing is to make untrusted library file systems (libFSes) as powerful as privileged file
systems. As listed above, there are a number of engineering challenges to reaching this goal. The hardest challenge
by far, however, is access control: i.e., answering the deceptively simple question “who can use a given disk block?”
Inventing a satisfactory solution to this problem took us three years and four systems. This chapter describes how
we multiplex stable storage, both to show how we address these problems and to provide a concrete example of the
exokernel principles in practice. We first describe our solution to efficient access control. An interesting aspect of this
solution is that it uses the libFS's own data structures to track what the libFS owns. Another is the code verification
technique we invented to allow such reuse without impinging on libFS flexibility. We then give an overview of XN, Xok's
extensible, low-level in-kernel stable storage system. We also describe the general interface between XN and libFSes
and present one particular libFS, C-FFS, the co-locating fast file system [37].
If the kernel can reuse the bookkeeping data structures of the libFS, then we can eliminate this redundancy. For example, a
Unix file system tracks which blocks are associated with it (and which principals are allowed to use them) using “meta
data” consisting of directory blocks, inodes, and single, double, and triple indirect blocks: in other words,
everything the kernel needs in order to perform access control. If the kernel can reuse libFS meta data, then it does not need to
perform duplicate tracking of which blocks are associated with which libFS.
Our new problem is that reusing libFS bookkeeping structures requires that we understand them, both so that we
can force them to be correct, and so that we can extract ownership information from them. One traditional solution
would be to provide a fixed set of components (e.g., pointers to disk blocks, disk block extents, etc.) from which
libFSes could build their meta data. However, creating a set of universal building blocks for file system meta
data is so hard as to be infeasible: file system research is still an active area even after three decades. A component set
capable of describing all results of this research does not seem possible. Instead, we solve this problem by having the
libFS interpret its meta data for us in a way that we can test for correctness.
Determinism gives persistence and, as a result, allows us to use induction to verify UDF correctness. Inductive
testing has two phases: initialization and modification. Initialization tests that the UDF's initial state is a valid one.
Modification tests that when a UDF's state is mutated, the mutation leaves the UDF in a valid state.
Thus, to trust libFS meta data and its associated owns function, we need to check that owns(meta) produces a valid
output when meta is initialized, and then retest it after each modification. Once we can force all libFS meta data to
produce valid results, we have accomplished the goal of this section: libFSes can now track their disk blocks without
the kernel having to duplicate this bookkeeping.
More specifically, initialization checks that when the libFS allocates a piece of meta data, it does not yet control any
disk blocks.
If owns(meta) does not yield the empty set then we know that either owns is incorrect or meta is not properly
initialized. In either case we reject meta. (We do not actually care if owns itself is correct — i.e., that it works on all
possible inputs — just that it work on the current set of used inputs.) Otherwise, we can accept meta. Because owns is
deterministic we are guaranteed that until meta is modified owns(meta) will always produce the empty set.
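A minimal sketch of this initialization test follows, modeling owned sets as 64-bit block bitmasks and the sandboxed UDF interpreter as a plain call (safe_eval); all names and the set representation are illustrative:

```c
/*
 * Hypothetical sketch of the initialization test: when a libFS installs
 * a fresh piece of meta data, the safely-evaluated owns(meta) must yield
 * the empty set. Sets are modeled as one bit per disk block.
 */
#include <assert.h>
#include <stdint.h>

typedef uint64_t set_t;                        /* one bit per disk block */
typedef set_t (*owns_fn)(const void *meta);

/* safe_eval: run the UDF in a sandboxed interpreter; modeled as a call. */
static set_t safe_eval(owns_fn owns, const void *meta) { return owns(meta); }

static int init_meta(owns_fn owns, const void *meta) {
    if (safe_eval(owns, meta) != 0)            /* must control no blocks */
        return -1;                             /* reject: owns or meta bogus */
    return 0;                                  /* accept; determinism makes this
                                                  result stable until meta is
                                                  next modified */
}
```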
Next, when a file system wants to modify its meta data (say to allocate a disk block to a file) we verify that meta
goes from its current valid state to a new valid state, where, because of determinism, it must remain.
Allocation is checked as follows. Because owns is deterministic, we can find its current state by simply querying it;
this is the crucial step that frees us from duplicating any bookkeeping data structures. Using that current state, we
ensure that the modification goes to a valid next state with a straightforward computation of set union and equality.
Deallocation is checked similarly to allocation.
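The allocation and deallocation checks can be sketched the same way; the bitmask set representation and safe_eval are illustrative stand-ins for XN's trusted set implementation and sandboxed interpreter:

```c
/*
 * Hypothetical sketch of the modification tests. On allocation of block b,
 * the new owned set must be exactly old ∪ {b}; on deallocation it must be
 * exactly old \ {b}. Sets are 64-bit block bitmasks.
 */
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint64_t set_t;
typedef set_t (*owns_fn)(const void *meta);

static set_t safe_eval(owns_fn owns, const void *meta) { return owns(meta); }

static int alloc_check(owns_fn owns, void *meta, const void *new_meta,
                       size_t sz, unsigned b) {
    set_t old_set = owns(meta);               /* deterministic: may run unchecked */
    set_t new_set = safe_eval(owns, new_meta);/* first run: must be sandboxed */
    if (new_set != (old_set | (1ull << b)))   /* exactly old ∪ {b}? */
        return -1;                            /* reject the modification */
    memcpy(meta, new_meta, sz);               /* commit */
    return 0;
}

static int dealloc_check(owns_fn owns, void *meta, const void *new_meta,
                         size_t sz, unsigned b) {
    set_t old_set = owns(meta);
    set_t new_set = safe_eval(owns, new_meta);
    if (new_set != (old_set & ~(1ull << b)))  /* exactly old \ {b}? */
        return -1;
    memcpy(meta, new_meta, sz);
    return 0;
}
```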
While we must run owns in a safe evaluation context after meta has been modified, any subsequent invocation of
owns(meta) can run without safety checks since owns is deterministic and, after testing, we know that it executes safely
on this value of meta. In the pseudo-code above, while new set = owns(meta) must be run in a safe context, the statement
old set = owns(meta) can run unchecked. One way to look at this fact is that halting problem is trivial for deterministic
Turing machines that one knows have halted in the past.
Naively, UDF induction might appear to be nothing more than a simplistic application of testing pre- and post-
conditions. The crucial difference is that the pre-condition is supplied by the untrusted UDF implementor, who has
been rendered a trustworthy partner through determinism.
Determinism and XN's trusted set implementation (used to check UDF output) prevent UDFs from “cheating” by
producing bogus output after testing.
At this point we can incrementally verify that owns implements its specification correctly. As a result, the operating
system can now rely on potentially malicious libFSes to track what disk blocks they own, in ways that the operating
system does not understand, while still being able to guarantee correctness. Figure 4-1 sketches how the above
verification steps are embedded in the context of a block allocation system call.
Because our approach merely verifies what a UDF did rather than how it did so, it is both more robust and simpler
than traditional verification approaches. Only a handful of lines of pseudo-code are required to verify the correctness
of code that is written in an untyped, general-purpose assembly language that allows pointers (including casts between
integers and pointers), aliasing, stores, dynamic memory allocation, arbitrary loops, and unstructured control flow.
Automated theorem provers, in contrast, are both complex and unable to verify such code.
    set = {};
    foreach partition id
        if (set_overlap(set, access(id)))
            error "Partitions cannot overlap!";
        set = set U access(id);
Allocation checks that the given partition id is valid, and then adds the block to the set.
/*
 * C pseudo-code sketch of how to allocate block 'req' and stuff a pointer to
 * it in the parent. For simplicity we assume meta is BLOCKSIZE big and that
 * setting a block to all zeros is a valid initialization.
 */
int sys_alloc_blk(blk_t req, blk_t parent, void *new_meta) {
    set_t old_owns, new_owns;
    set_t (*owns)(void *);                  /* pointer to owns function */
    void *old_meta, *kid_meta;

    /* ... lookup of owns, old_meta, and kid_meta elided ... */

    old_owns = owns(old_meta);              /* compute current and potential owned sets. */
    new_owns = safe_eval(owns(new_meta));
    /* verify that the new set is exactly the old set plus 'req'. */
    if (!set_eq(new_owns, set_add(old_owns, req)))
        return ERROR;

    memcpy(old_meta, new_meta, BLOCKSIZE);  /* overwrite parent with new value. */
    memset(kid_meta, 0, BLOCKSIZE);         /* initialize kid buffer-cache entry to all zeros. */
    set_dirty(parent);                      /* ensure parent will be flushed back to disk. */
    return SUCCESS;
}
Figure 4-1: Implementation sketch of a disk block allocation system call that uses UDFs to let untrusted file systems
track the blocks they control.
/* Unix file system indirect block. */
struct indirect_block {
    unsigned blocks[1024];
};
Figure 4-2: Unix indirect block and its associated state partitioning UDFs, indirect_access and indirect_npartitions. The
UDFs are “constant” in that they do not use the meta data block to compute partitions. Non-constant UDFs can be
used as long as they are retested when the state they depend on has been modified.
Partitioning of state is controlled by UDFs, since it is their implementor that knows the natural partitions of the
problem. For instance, in the simple case of the Unix indirect block given in Figure 4-2, which is simply a vector
of 1024 disk block pointers, a partition corresponds to the 32 bits in which a single block pointer is stored. In most
situations we have encountered, state partitioning can reduce output sets to singleton entries.
In some sense, we have already been doing coarse-grain state partitioning, since we only rerun a UDF when the
disk block it is associated with is modified rather than requiring that a file system have a single UDF that is run on
modification to any block on disk. Partitioning simply takes this subdivision to finer levels.
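For the indirect block of Figure 4-2, the constant partitioning UDFs could look like the following sketch; the C signatures are illustrative, since real UDFs are written in XN's pseudo-RISC language:

```c
/*
 * Hypothetical constant partitioning UDFs for a Unix indirect block:
 * each 4-byte block pointer is its own partition, so modifying one
 * pointer forces a retest of only that partition, not the whole block.
 */
#include <assert.h>

struct range { unsigned start, len; };   /* a byte range of meta data state */

static unsigned indirect_npartitions(void) {
    return 1024;                         /* one partition per block pointer */
}

static struct range indirect_access(unsigned id) {
    struct range r = { 4 * id, 4 };      /* the 32-bit word holding pointer id */
    return r;
}
```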
A natural question is how to track the state in each partition. Our original implementation used a static partition
table for each type, where partition_table(i) gave the set of state sets associated with partition i. Such tables can be large.
In the case of file systems, this size is mitigated by the fact that file systems have only a few templates (one for each
meta data type). Unfortunately, partitioning via a data structure means that, in many cases, the partitioning cannot
be adjusted dynamically, as is required for dynamically resized structures. Of course, the obvious solution is to use
UDFs: they were developed, after all, to describe meta data layout. This solution turns out to be quite workable, and
has significantly reduced the size of the templates that we use to store file system meta data descriptions.
To simplify writing UDFs we can refine the partition scheme to make the enumeration of a partition's “read set”
implicit rather than explicit. Instead of requiring that the UDF client supply an access function, we have the safe
UDF evaluator trace the memory locations examined by a particular UDF invocation. Initialization checks that these
read sets do not overlap by tracing owns on each partition id rather than calling access. Modification checks that the
read set of the modified meta data on a given id is the same as the original read set. One complication of this model
is handling the case where the read set grows or shrinks. Due to space limitations we elide a discussion of how to do
this gracefully; interested readers can refer to [30].
The core problem partitioning must solve is how to bind state to inputs: i.e., given a modification to a piece of
meta data f's state, we must test all inputs that depend on that state. State partitioning does this by grouping all inputs
to owns within a single partition. A more sophisticated alternative is to allow overlap between partitions. Naively, we
might expect that allowing overlap requires that we have a function enum that, given a piece of state, enumerates all
inputs that depend on it. The need for such a function would severely restrict the data structures we could verify, since
most do not naturally support the ability to enumerate all inputs given an arbitrary byte of state. However, if we exploit
the libFS, we can have it give us the inputs that could be affected by a state modification and yet still be able to test
that it does so correctly.
We do this using reference counting: if we can reliably determine how many inputs use a piece of state, we can
trust the libFS to find which inputs are dependent. Reference counting enables this split because, given the number of
inputs that use a piece of state s and a libFS-supplied set of inputs, we can verify that the set is sufficient by invoking
owns on each element and verifying that it indeed touches s. If every element does, then we know we have all inputs that
depend on s; otherwise we return an error. Reference counts are computed by the libFS using a refcnt UDF, which associates
each byte of data with a reference count. Note that a correct libFS already tracks roughly the information that refcnt
needs in order to implement memory management: it cannot free memory that is still needed by inputs to f. Thus,
it does not seem that the requirement that the libFS write refcnt restricts its freedom in any real way. Lack of space
prevents further discussion; interested readers can refer to [30].
4.2 Overview of XN
Designing a flexible exokernel stable storage system has proven difficult: XN is our fourth design. This section
provides an overview of how we use UDFs, the cornerstone of XN; the following sections describe some earlier
approaches (and why they failed), and aspects of XN in greater depth.
XN provides access to stable storage at the level of disk blocks, exporting a buffer cache registry (Section 4.4) as
well as free maps and other on-disk structures. The main purpose of XN is to determine the access rights of a given
principal to a given disk block as efficiently as possible. XN must prevent a malicious user from claiming another
user's disk blocks as part of her own files. On a conventional OS, this task is easy, since the kernel itself knows the
file's meta data format. On an exokernel, where files have application-defined meta data layouts, the task is more
difficult. On XN, libFSes provide UDFs that act as meta data translation functions specific to each file type. XN uses
UDFs to analyze meta data and translate it into a simple form the kernel understands. A libFS developer can install
UDFs to introduce new on-disk meta data formats. UDFs allow the kernel to safely and efficiently handle any meta
data layout without understanding the layout itself.
UDFs are stored on disk in structures called templates. Each template corresponds to a particular meta data format;
for example, a UNIX file system would have templates for data blocks, inode blocks, inodes, indirect blocks, etc. Each
template T has two UDFs: owns-udfT and refcnt-udfT , and two untrusted and potentially nondeterministic functions:
acl-ufT and size-ufT . All four functions are specified in the same language but only owns-udfT and refcnt-udfT must
be deterministic. The other two can have access to, for example, the time of day. The limited language used to write
these functions is a pseudo-RISC assembly language, checked by the kernel to ensure determinacy. Once a template
is specified, it cannot be changed.
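A hypothetical in-kernel view of a template follows, with the four functions modeled as C function pointers; the signatures are guesses from the text, not XN's actual interface:

```c
/*
 * Hypothetical sketch of an XN template: two deterministic UDFs
 * (owns-udf, refcnt-udf) and two unrestricted untrusted functions
 * (acl-uf, size-uf), here modeled as C function pointers.
 */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t set_t;                              /* set of disk blocks */

struct xn_template {
    char name[32];                                   /* e.g. "FFS Inode" */
    set_t    (*owns_udf)(const void *meta);          /* deterministic */
    unsigned (*refcnt_udf)(const void *meta,
                           size_t byte);             /* deterministic */
    int      (*acl_uf)(const void *meta, const void *mod,
                       const void *creds);           /* may be nondeterministic */
    size_t   (*size_uf)(const void *meta);           /* size in bytes */
};
```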
The owns-udf function allows XN to check the correctness of libFS meta data modifications (specified as a list of
byte range modifications) using techniques from the previous subsection.
The refcnt-udf function allows libFSes to represent reference counts however they wish. For simplicity, reference
counts are stored in the block that is pointed to (and, thus, they are persistent). Whenever an edge to this block is
formed, XN verifies that the count refcnt-udf computes increases by one; when an edge is deleted, it verifies that the
count is decremented.
The acl-uf function implements template-specific access control and semantics; its input is a piece of meta data,
a proposed modification to that meta data, and a set of credentials (e.g., capabilities). Its output is a Boolean value
approving or disapproving of the modification. XN runs the proper acl-uf function before any meta data modification.
acl-ufs can implement access control lists, as well as providing certain other guarantees; for example, an acl-uf could
ensure that inode modification times are kept current by rejecting any meta data changes that do not update them.
The size-uf function simply returns the size of a data structure in bytes.
4.3 XN: Problem and history
The most difficult requirement for XN is efficiently determining the access rights of a given principal to a given disk
block. We discuss the successive approaches that we have pursued.
Disk-block-level multiplexing. One approach is to associate with each block or extent a capability (or access
control list) that guards it. Unfortunately, if the capability is spatially separated from the disk block (e.g., stored
separately in a table), accessing a block can require two disk accesses (one to fetch the capability and one to fetch
the block). While caching can mitigate this problem to a degree, we are nervous about its overhead on disk-intensive
workloads. An alternative approach is to co-locate capabilities with disk blocks by placing them immediately before
a disk block's data [54]. Unfortunately, on common hardware, reserving space for a capability would prevent blocks
from being multiples of the page size, adding overhead and complexity to disk operations.
Self-descriptive meta data. Our first serious attempt at efficient disk multiplexing provided a means for each
instance of meta data to describe itself. For example, a disk block would start with some number of bytes of
application-specific data and then say “the next ten integers are disk block pointers.” The complexity of space-efficient
self-description caused us to limit what meta data could be described. We discovered that this approach both caused
unacceptable amounts of space overhead and required excessive effort to modify existing file system code, because it
was difficult to shoehorn existing file system data structures into a universal format.
Template-based description. Self-description and its problems were eliminated by the insight that each file system
is built from only a handful of different on-disk data structures, each of which can be considered a type. Since the
number of types is small, it is feasible to describe each type only once per file system—rather than once per instance
of a type—using a template.
Originally, templates were written in a declarative description language (similar to that used in self-descriptive
meta data) rather than UDFs. This system was simple and better than self-descriptive meta data, but still exhibited
what we have come to appreciate as an indication that applications do not have enough control: the system made too
many tradeoffs. We had to make a myriad of decisions about which base types were available and how they were
represented (how large disk block pointers could be, how the type layout could change, how extents were specified).
Given the variety of on-disk data structures described in the file system literature, it seems unlikely that any fixed set
of components will ever be enough to describe all useful meta data.
Our current solution uses templates, but trades the declarative description language for a more expressive, interpreted
language—UDFs. This lets libFSes track their own access rights without XN understanding how they do so; XN merely
verifies that libFSes track block ownership correctly.
XN enforces three protection rules:
1. To prevent unauthorized access, every operation on disk data must be guarded. For speed, XN uses secure
bindings [25] to move access checks to bind time rather than checking at every access. For example, the
permission to read a cached disk block is checked when the page is inserted into the page table of the libFS's
environment, rather than on every access.
2. XN must be able to determine unambiguously what access rights a principal has to a given disk block. For
speed, it uses the UDF mechanism to protect disk blocks using the libFS's own meta data rather than guarding
each block individually.
3. XN must guarantee that disk updates are ordered such that a crash will not incorrectly grant a libFS access to
data it either has freed or has not allocated. This requirement means that meta data that is persistent across
crashes cannot be written when it contains pointers to uninitialized meta data, and that reallocation of a freed
block must be delayed until all persistent pointers to it have been removed.
While isolation allows separate libFSes to coexist safely, protected sharing of file system state by mutually distrustful
libFSes requires three additional features:
1. Coherent caching of disk blocks. Distributed, per-application disk block caches create a consistency problem:
if two applications obliviously cache the same disk block in two different physical pages, then modifications
will not be shared. XN solves this problem with an in-kernel, system-wide, protected cache registry that maps
cached disk blocks to the physical pages holding them.
2. Atomic meta data updates. Many file system updates have multiple steps. To ensure that shared state always
ends up in a consistent and correct state, libFSes can lock cache registry entries. (Future work will explore
optimistic concurrency control based on versioning.)
3. Well-formed updates. File abstractions above the XN interface may require that meta data modifications satisfy
invariants (e.g., that link counts in inodes match the number of associated directory entries). UDFs allow XN to
guarantee such invariants in a file-system-specific manner, allowing mutually distrustful applications to safely
share meta data.
XN controls only what is necessary to enforce these protection rules. All other abilities—I/O initiation, disk block
layout and allocation policies, recovery semantics, and consistency guarantees—are left to untrusted libFSes.
Registry entries are installed in two ways. First, an application that has write access to a block can directly install
a mapping to it into the registry. Second, applications that do not have write access to a block can indirectly install
an entry for it by performing a “read and insert,” which tells the kernel to read a disk block, associate it with an
application-provided physical page, set the protection of that page appropriately, and insert this mapping into
the registry. This latter mechanism is used to prevent applications that do not have permission to write a block from
modifying it by installing a bogus in-core copy.
XN does not replace physical pages from the registry (except for those freed by applications), allowing applications
to determine the most appropriate caching policy. Because applications also manage virtual memory paging, the
partitioning of disk cache and virtual memory backing store is under application control. To simplify the application's
task, and because it is inexpensive to provide, XN maintains an LRU list of unused but valid buffers. By default, when
libOSes need pages and none are free, they recycle the oldest buffer on this LRU list.
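A sketch of this default recycling policy follows, with the LRU list modeled as a simple queue of buffers; the structure and names are illustrative:

```c
/*
 * Hypothetical sketch of the default buffer-recycling policy: unused but
 * still-valid buffers sit on an LRU list (head = oldest); when a libOS
 * needs a page and none are free, it recycles the oldest buffer.
 */
#include <assert.h>
#include <stddef.h>

struct buf {
    struct buf *next;
    int ppage;                      /* physical page holding the cached block */
};

struct lru { struct buf *head, *tail; };

static void lru_insert(struct lru *l, struct buf *b) {
    b->next = NULL;
    if (l->tail) l->tail->next = b; else l->head = b;
    l->tail = b;
}

static int recycle_oldest(struct lru *l) {
    struct buf *b = l->head;
    if (!b) return -1;              /* nothing to recycle */
    l->head = b->next;
    if (!l->head) l->tail = NULL;
    return b->ppage;                /* caller reuses this physical page */
}
```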
XN allows any process to write “unowned” dirty blocks to disk (i.e., blocks not associated with a running process),
even if that process does not have write permission for the dirty blocks. This allows the construction of daemons that
asynchronously write dirty blocks. LibFSes do not have to trust daemons with write access to their files, only to flush
the blocks. This ability has three benefits. First, the contents of the registry can be safely retained across process
invocations rather than having to be brought in and paged out on creation and exit. Second, this design simplifies the
implementations of libFSes, since a libFS can rely on a daemon of its choice to flush dirty blocks even in difficult
situations (e.g., if the application containing the libFS is swapped out). Third, this design allows different write-back
policies.
4.5 XN usage
To illustrate how XN is used, we sketch how a libFS can implement common file system operations. These two setup
operations are used to install a libFS:
Type creation. The libFS describes its types by storing templates, described above in Section 4.2, into a type
catalogue. Each template is identified by a unique string (e.g., “FFS Inode”). Once installed, types are persistent
across reboots.
LibFS persistence. To ensure that libFS data is persistent across reboots, a libFS can register the root of its tree
in XN's root catalogue. A root entry consists of a disk extent and corresponding template type, identified by a unique
string (e.g., “mylibFS”).
After a crash, XN uses these roots to garbage-collect the disk by reconstructing the free map. It does so by logically
traversing all roots and all blocks reachable from them: it marks reachable blocks as allocated, non-reachable blocks
as free. If this step is too expensive, it could be eliminated by ordering writes to the free map (so that the map always
held a conservative picture of what blocks were free) or using a log.
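The reconstruction pass is essentially a mark phase over the on-disk reachability graph. A toy sketch follows, with the child lists standing in for what XN would derive from each block's meta data via its owns UDF:

```c
/*
 * Hypothetical sketch of free-map reconstruction after a crash: traverse
 * every root, mark all reachable blocks as allocated, and treat the rest
 * as free. children[] models what owns UDFs would yield per block.
 */
#include <assert.h>
#include <string.h>

#define NBLOCKS 16
#define MAXKIDS 4

static int children[NBLOCKS][MAXKIDS];   /* -1-terminated child lists */
static unsigned char allocated[NBLOCKS]; /* the reconstructed free map */

static void mark(int blk) {
    if (allocated[blk]) return;          /* already visited */
    allocated[blk] = 1;
    for (int i = 0; i < MAXKIDS && children[blk][i] >= 0; i++)
        mark(children[blk][i]);
}

static void reconstruct(const int *roots, int nroots) {
    memset(allocated, 0, sizeof allocated);  /* assume everything free */
    for (int i = 0; i < nroots; i++)
        mark(roots[i]);                      /* reachable => allocated */
}
```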
During reconstruction XN also checks for errors in meta data reference counts by counting all pointers to all meta
data instances. If this count does not match a meta data's reference count, XN records this violation in an error log.
After finding all errors, it then runs libFS-supplied patch programs to fix these and any libFS-specific errors. If errors
remain after this process, XN marks the root of any tree that contains a bogus reference count as “tainted.” These
errors must be fixed before the tree can be used. We discuss reconstruction further in the next section.
After initialization, the new libFS can use XN. We describe a simplified version of the most common operations.
Startup. To start using XN, a libFS loads its root(s) and any types it needs from the root catalogue into the buffer
cache registry. Usually both will already be cached.
Read. Reading a block from disk is a two-stage process, where the stages can be combined or separated. First, the
libFS creates entries in the registry by passing block addresses for the requested disk blocks and the meta data blocks
controlling them (their parents). The parents must already exist in the registry—libFSes are responsible for loading
them. XN uses owns-udf to determine if the requested blocks are controlled by the supplied meta data blocks and, if
so, installs registry entries.
In the second stage, the libFS initiates a read request, optionally supplying pages to place the data in. Access
control through acl-uf is performed at the parent (e.g., if the data loaded is a bare disk block), at the child (e.g., if the
data is an inode), or both.
A libFS can load any block in its tree by traversing from its root entry, or optionally by starting from any intermediate
node cached in the registry. Note that XN specifically disallows meta data blocks from being mapped read/write.
To speculatively read a block before its parent is known, a libFS can issue a raw read command. If the block is not
in the registry, it will be marked as “unknown type” and a disk request initiated. The block cannot be used until after
it is bound to a parent by the first stage of the read process, which will determine its type and allow access control to
be performed.
Allocate. A libFS selects blocks to allocate by reading XN's map of free blocks, allowing libFSes to control file
layout and grouping. Free blocks are allocated to a given meta data node by calling XN with the meta data node, the
blocks to allocate, and the proposed modification to the meta data node. XN checks that the requested blocks are free,
runs the appropriate acl-uf to see if the libFS has permission to allocate, and runs owns-udf, as described in Section 4.2,
to see that the correct block is being allocated. If these checks all succeed, the meta data is changed, the allocated
blocks are removed from the free list, and any allocated meta data blocks are marked tainted (see Section 4.4).
Write. A libFS writes dirty blocks to disk by passing the blocks to write to XN. If the blocks are not in memory,
or they have been pinned in memory by some other application, the write is prevented. The write also fails if any of
the blocks are tainted and reachable from a persistent root. Otherwise, the write succeeds. If the block was previously
tainted and now is not (either by eliminating pointers to uninitialized meta data or by becoming initialized itself), XN
modifies its state and removes it from the tainted list.
Since applications control what is fetched and what is paged out when (and in what order), they can control many
disk management policies and can enforce strong stability guarantees.
Deallocate. XN uses UDFs to check deallocate operations analogously to allocate operations. If there are no
on-disk pointers to a deallocated disk block, XN places it on the free list. Otherwise, XN enqueues the block on a
“will free” list until the block's reference count is zero. Reference counts are decremented when a parent that had an
on-disk pointer to the block deletes that pointer via a write.
have the libFS signal violations: rather than forcing libFSes to write blocks out in a certain order to prevent violations, or using a log to track them persistently, XN has libFSes provide a "reboot UDF" (reboot) that, when given a file system tree, walks down the tree emitting any violations. How it detects violations is not XN's concern; XN need only verify that reboot will detect them. XN performs this verification as follows. It first records all violations in volatile memory. Then, when a libFS wants to do a write that would violate some invariant (e.g., stably writing a pointer to an uninitialized piece of meta data), XN checks that the associated libFS's reboot UDF would inform it about this violation. XN does so in a way similar to owns verification: it (conceptually) runs the reboot UDF on the pre-write stable state, performs the disk write, then runs the reboot UDF on the post-write stable state. The obvious problem with this approach
is efficiency. It is clearly not practical to examine the entire disk on every disk write. Fortunately, it appears that
by using both a variation of state partitioning (based on continuation passing [29]) and checkable hints from the file
system, we can checkpoint the UDF precisely, to the point that verification need only examine a few bytes in one or
two cached disk blocks. We are currently designing this scheme.
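The pre-write/post-write verification step might be sketched as follows; the function and the toy state encoding are assumptions for illustration, not XN's representation.

```python
def verify_reboot_udf(reboot_udf, stable_state, write, expected_violation):
    """Check that a libFS's reboot UDF would report the violation a
    proposed write introduces (toy model; state is a flat dict)."""
    before = set(reboot_udf(stable_state))
    after_state = dict(stable_state)
    after_state.update(write)          # apply the write to a copy
    after = set(reboot_udf(after_state))
    # The UDF must report the new violation on the post-write state.
    return expected_violation in (after - before)

# A toy reboot UDF: report blocks whose contents are a pointer to
# uninitialized meta data (the encoding is ours, purely illustrative).
def toy_reboot(state):
    return [b for b, v in state.items() if v == "ptr-to-uninitialized"]

assert verify_reboot_udf(toy_reboot, {"blk1": "ok"},
                         {"blk2": "ptr-to-uninitialized"}, "blk2")
```

A reboot UDF that silently dropped the violation would fail this check, and XN would refuse the write.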
4.8 Discussion
Similar to how XN uses UDFs to interpret libFS meta data, libFSes can use UDFs to traverse other file systems' meta data, even when they do not understand its syntax. (Of course, modification of this meta data typically requires such
understanding.)
UDFs add a negligible cost to XN. As the next section discusses, we have run file system benchmarks with and
without XN enabled and its overhead (and, thus, the overhead of UDFs) is lost in experimental noise. There are two
reasons for this. First, protection tends to be off the critical path. Second, XN operations tend to be embedded in heavy-weight disk operations. For example, most block operations on cached blocks can simply interact with the buffer cache registry rather than using UDFs. Even if neither of these two conditions held, UDFs would not add significant overhead: most UDFs are fairly simple and, if directly executed rather than interpreted, would cost only a few tens of instructions.
XN allows “nestable” extension of file systems. Because XN verifies the correctness of reference counts, pointers,
and meta data interpreters, it allows untrusted implementors to extend an existing file system without compromising
its integrity. Thus, it is possible to add an entirely new directory type to a file system and have it point to old types, perform access control on them, and so on, without leaving the existing implementation open to malice. We do not know of any other way to achieve this same result.
Stable storage is the most challenging resource we have multiplexed. Future work will focus on two areas. First,
we plan to implement a range of file systems (log-structured file systems, RAID, and memory-based file systems), thus
testing whether the XN interface is powerful enough to support concurrent use by radically different file systems. Second, we
will investigate using lightweight protected methods like UDFs to implement the simple protection checks required by
higher-level abstractions.
Chapter 5
The fundamental principle of science, the definition almost, is this: the sole test of the validity of any idea
is experiment. – Richard P. Feynman
Now, here, you see, it takes all the running you can do, to keep in the same place. If you want to get somewhere else, you must run at least twice as fast as that! – Charles Lutwidge Dodgson (Lewis Carroll) (1832–1898), "Through the Looking Glass"
3. Are aggressive applications significantly faster? Among other experiments, we measure the performance of an optimized web server, which runs eight times faster than on traditional systems.
4. Does local control lead to bad global performance? An exokernel gives applications significantly more control
than traditional operating systems do. Can it reconcile strong local control with good global performance?
To answer this question we measure aggressive workloads on Xok and FreeBSD. The results show that: (1) given the same workload, an exokernel performs comparably to widely used monolithic systems, and (2) when local optimizations are performed, whole-system performance improves, sometimes significantly.
We describe our experimental environment below. The remainder of the chapter addresses each question in turn.
Benchmark: Description (application)
Copy small file: copy the compressed archived source tree (cp)
Uncompress: uncompress the archive (gunzip)
Copy large file: copy the uncompressed archive (cp)
Unpack file: unpack the archive (pax)
Copy large tree: recursively copy the created directories (cp)
Diff large tree: compute the difference between the trees (diff)
Compile: compile the source code (gcc)
Delete files: delete the binary files (rm)
Pack tree: archive the tree (pax)
Compress: compress the archive tree (gzip)
Delete: delete the created source tree (rm)
Table 5.1: The I/O-intensive workload installs a large application (the lcc compiler). The size of the compressed
archive file for lcc is 1.1 MByte.
Figure 5-1: Performance of unmodified UNIX applications. Xok/ExOS and OpenBSD/C-FFS use a C-FFS file system
while Free/OpenBSD use their native FFS file systems. Times are in seconds.
cost of protection. All measurements reported in this thesis include these extra calls. In practice, protection is off the
critical path, and these overheads are lost in experimental noise.
It is important to note that a sufficiently motivated kernel programmer can implement any optimization that is
implemented in an extensible system. In fact, a member of our research group, Costa Sapuntzakis, has implemented
a version of C-FFS within OpenBSD. Extensible systems (and we believe exokernels in particular) make these
optimizations significantly easier to implement than centralized systems do. For example, porting C-FFS to OpenBSD
took more effort than designing C-FFS and implementing it as a library file system. The experiments below demonstrate
that by using unprivileged application-level resource management, any skilled programmer can implement useful OS
optimizations. The extra layer of protection required to make this application-level management safe costs little.
without modifications, they still profit from an exokernel. While an exokernel allows experimentation with completely
different OS interfaces, a more important result may be improving the rate of innovation of implementations of existent
interfaces.
We also ran the Modified Andrew Benchmark (MAB) [69]. On this benchmark, Xok/ExOS takes 11.5 seconds,
OpenBSD/C-FFS takes 12.5 seconds, OpenBSD takes 14.2 seconds, and FreeBSD takes 11.5 seconds. The difference
in performance on MAB is less pronounced than on the I/O-intensive benchmark because MAB stresses fork, an expensive
function in Xok/ExOS. ExOS's fork performance suffers because Xok does not yet allow environments to share page
tables. Fork takes six milliseconds on ExOS, compared to less than one millisecond on OpenBSD.
calls. Thus, for some uses an exokernel is intrinsically faster than a more traditional system. For example, an OpenBSD emulator written on our most recent exokernel, Xok, sometimes runs OpenBSD application binaries faster than their non-emulated performance on OpenBSD, for the simple reason that many system calls (e.g., to read data structures) become function calls into the emulator's libOS.
further for the non-issue of the cost of having operating system code in libraries.
2 In fact, the original exokernel paper [25] focuses almost exclusively on application-specific optimizations rather than improving the rate of whole-
Figure 5-2: HTTP document throughput as a function of the document size for several HTTP/1.0 servers. NCSA/BSD
represents the NCSA/1.4.2 server running on OpenBSD. Harvest/BSD represents the Harvest proxy cache running
on OpenBSD. Socket/BSD represents our HTTP server using TCP sockets on OpenBSD. Socket/Xok represents our
HTTP server using the TCP socket interface built on our extensible TCP/IP implementation on the Xok exokernel.
Cheetah/Xok represents the Cheetah HTTP server, which exploits the TCP and file system implementations for speed.
file checksums (which are stored with each file). Data are transmitted (and retransmitted, if necessary) to the client
directly from the file cache without CPU copy operations. (Cao et al. have also used this technique [70].)
Knowledge-based Packet Merging. Cheetah exploits knowledge of its per-request state transitions to reduce the
number of I/O actions it initiates. For example, it avoids sending redundant control packets by delaying ACKs on client
HTTP requests, since it knows it will be able to piggy-back them on the response. This optimization is particularly
valuable for small document sizes, where the reduction represents a substantial fraction (e.g., 20%) of the total number
of packets.
HTML-based File Grouping. Cheetah co-locates files included in an HTML document by allocating them in disk
blocks adjacent to that file when possible. When the file cache does not capture the majority of client requests, this
extension can improve HTTP throughput by up to a factor of two.
Figure 5-2 shows HTTP request throughput as a function of the requested document size for five servers: the NCSA
1.4.2 server [68] running on OpenBSD 2.0, the Harvest cache [14] running on OpenBSD 2.0, the base socket-based
server running on OpenBSD 2.0 (i.e., our HTTP server without any optimizations), the base socket-based server
running on the Xok exokernel system (i.e., our HTTP server without any optimizations with vanilla socket and file
descriptor implementations layered over XIO), and the Cheetah server running on the Xok exokernel (i.e., our HTTP
server with all optimizations enabled).
Figure 5-2 provides several important pieces of information. First, our base HTTP server performs roughly
as well as the Harvest cache, which has been shown to outperform many other HTTP server implementations on
general-purpose operating systems. Both outperform the NCSA server. This gives us a reasonable starting point
for evaluating extensions that improve performance. Second, the default socket and file system implementations
built on top of XIO perform significantly better than the OpenBSD implementations of the same interfaces (by 80–
100%). The improvement comes mainly from simple (though generally valuable) extensions, such as packet merging,
application-level caching of pointers to file cache blocks, and protocol control block reuse.
Third, and most importantly, Cheetah significantly outperforms the servers that use traditional interfaces. By
exploiting Xok's extensibility, Cheetah gains a four times performance improvement for small documents (1 KByte
and smaller), making it eight times faster than the best performance we could achieve on OpenBSD. Furthermore, the
large document performance for Cheetah is limited by the available network bandwidth (three 100Mbit/s Ethernets)
rather than by the server hardware. While the socket-based implementation is limited to only 16.5 MByte/s with 100%
CPU utilization, Cheetah delivers over 29.3 MByte/s with the CPU idle over 30% of the time. The extensibility of
ExOS's default unprivileged TCP/IP and file system implementations made it possible to achieve these performance
improvements incrementally and with low complexity.
The optimizations performed by Cheetah are architecture independent. On Aegis, Cheetah obtained similar performance improvements over Ultrix web servers [49].
Figure 5-3: Measured global performance of Xok/ExOS (the first bar) and FreeBSD (the second bar), using the first
application pool. Times are in seconds and on a log scale. Number/number refers to the total number of applications
run by the script and the maximum number of jobs run concurrently. Total is the total running time of each experiment,
Max is the longest runtime of any process in a given run (giving the worst latency). Min is the minimum.
5.5.1 Experiments
Global performance has not been extensively studied. We use the total time to complete a set of concurrent tasks as
a measure of system throughput, and the minimum and the maximum latency of individual applications as a measure of
interactive performance. For simplicity we compare Xok/ExOS's performance under high load to that of FreeBSD; in
these experiments, FreeBSD always performs better than OpenBSD, because of OpenBSD's small, non-unified buffer
cache. While this methodology does not guarantee that an exokernel compares well to every centralized system, it does
offer a useful relative metric.
The space of possible combinations of applications to run is large. The experiments use randomization to ensure
we get a reasonable sample of this space. The inputs are a set of applications to pick from, the total number to run,
and the maximum number that can be running concurrently. Each experiment maintains the number of concurrent
processes at the specified maximum. The outputs are the total running time, giving throughput, and the time to run
each application. Poor interactive performance will show up as a high minimum latency.
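A driver embodying this methodology could look like the sketch below. The harness is ours (the thesis's actual scripts are not shown), and it ignores contention between concurrent jobs, so it models only the schedule bookkeeping: fixed-seed random job selection, a constant concurrency level, and total/max/min reporting.

```python
import heapq
import random

def run_experiment(pool, total_jobs, max_concurrent, runtime_of, seed=0):
    """Keep exactly `max_concurrent` jobs running until `total_jobs` have
    been started, drawing jobs pseudo-randomly from `pool`."""
    rng = random.Random(seed)     # fixed seed -> identical schedules
    clock, latencies = 0.0, []
    running = []                  # min-heap of (finish_time, id, job)
    started = 0
    while started < total_jobs or running:
        # Top up to the concurrency cap.
        while started < total_jobs and len(running) < max_concurrent:
            job = rng.choice(pool)
            t = runtime_of(job)
            heapq.heappush(running, (clock + t, started, job))
            latencies.append(t)
            started += 1
        clock, _, _ = heapq.heappop(running)  # next completion frees a slot
    return {"total": clock, "max": max(latencies), "min": min(latencies)}
```

With seven two-second jobs and a concurrency cap of one, the driver reports a total of 14 seconds; raising the cap to two overlaps the jobs and drops the total to 8.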
The first application pool includes a mix of I/O-intensive and CPU-intensive programs: pack archive (pax -w),
Figure 5-4: Measured global performance of Xok/ExOS (the first bar) and FreeBSD (the second bar), using the second
application pool. Methodology and presentation are as described for Figure 5-3.
search for a word in a large file (grep), compute a checksum many times over a small set of files (cksum), solve a
traveling salesman problem (tsp), solve iteratively a large discrete Laplace equation using successive overrelaxation
(sor), count words (wc), compile (gcc), compress (gzip), and uncompress (gunzip). For this experiment, we chose
applications on which both Xok/ExOS and FreeBSD run roughly equivalently. Each application runs for at least several
seconds and is run in a separate directory from the others (to avoid cooperative buffer cache reuse). The pseudo-random
number generators are identical and start with the same seed, thus producing identical schedules. The applications we
chose compete for the CPU, memory, and the disk.
Figure 5-3 shows on a log scale the results for five different experiments: seven jobs with a maximum concurrency
of one job through 35 jobs with a maximum concurrency of five jobs. The results show that an exokernel system can
achieve performance roughly comparable to UNIX, despite being mostly untuned for global performance.
With a second application pool, we examine global performance when specialized applications (emulated by appli-
cations that benefit from C-FFS's performance advantages) compete with each other and non-specialized applications.
This pool includes tsp and sor from above, unpack archive (pax -r) from Section 5.2, recursive copy (cp -r) from
Section 5.2, and comparison (diff) of two identical 5 MB files. The pax and cp applications represent the specialized
applications.
Figure 5-4 shows on a log scale the results for five experiments: seven jobs with a maximum concurrency of one
job through 35 jobs with a maximum concurrency of five jobs. The results show that global performance on an exokernel
system does not degrade even when some applications use resources aggressively. In fact, the relative performance
difference between FreeBSD and Xok/ExOS increases with job concurrency.
5.5.2 Discussion
The central challenge in an exokernel system is not enforcing a global system policy but, rather, deriving the information
needed to decide what enforcement involves and doing so in such a way that application flexibility is minimally curtailed.
Since an exokernel controls resource allocation and revocation, it has the power to enforce global policies. Quota-based
schemes, for instance, can be trivially enforced using only allocation denial and revocation. Fortunately, the crudeness
of successful global optimizations allows global schemes to be readily implemented by an exokernel. For example,
Xok currently tracks global LRU information that applications can use when deallocating resources.
We believe that an exokernel can provide global performance superior to current systems. First, effective local
optimization can mean there are more resources for the entire system. Second, an exokernel gives application writers
machinery to orchestrate inter-application resource management, allowing them to perform domain-specific global
optimizations not possible on current centralized systems (e.g., the UNIX “make” program could be modified to
orchestrate the complete build process). Third, an exokernel can unify the many space-partitioned caches in current
systems (e.g., the buffer cache, network buffers, etc.). Fourth, since applications can know when resources are scarce,
they can make better use of resources when layering abstractions. For example, a web server that caches documents
in virtual memory could stop caching documents when its cache does not fit in main memory. Future research will
pursue these issues.
5.5.3 Summary
Our experiments show that even common, unaltered applications can benefit on exokernels, simply by being linked
against an optimized library operating system. Importantly, libOS optimizations appear just as effective as their
equivalent in-kernel implementation. Aggressive applications that want to manage their resources show even greater
improvements. The improvement is especially dramatic for I/O-centric applications, such as our web server, which
runs up to a factor of 8 faster than its closest equivalent. Finally, the power that an exokernel gives to applications
does not lead to poor global performance. In fact, when this control is used to improve application speed, an exokernel
system can have dramatically improved global performance, since there are more resources to go around. Based on
these experiments, the exokernel architecture appears to be a promising alternative to traditional systems.
Chapter 6
The individual's whole experience is built upon the plan of his language. — Henri Delacroix
Extensibility refers roughly to how easily a system's functionality can be augmented. A strong thread in computer
science has been developing techniques to enhance extensibility, ranging from programming methodologies such as
structured programming to assist program modification, to dynamic linking of device drivers to add new functions to
an operating system kernel. Using languages to build extensible systems is a venerable tradition: the widely-used
text editor emacs has done so since its inception [11, 81], database systems exploit it to enrich queries and extend data
types, and more recently web browsers and servers have used it to extend their base functionality.
A variety of operating systems have allowed applications to download untrusted code into them as a way to extend
their functionality [9, 22, 25, 32, 48, 71, 79, 80, 92]. This chapter documents experiences drawn from the exokernel
systems described in this thesis. These experiences cover a period of four years, and span numerous rethinkings of the
role of downloaded code, as well as much belated realization of its implications and misuses.
The ability to download code has subtle implications. This chapter's central contribution is its perspective on the
abilities downloaded code grants and removes, as well as its concrete examples of how these gained and lost abilities
matter in practice. Some specific insights include:
1. “Infinite” extensibility requires Turing completeness, and Turing completeness gives infinite extensibility.
Solving the negative problem of extensibility requires supporting all unanticipated uses. A guaranteed solution
is to let applications inject general-purpose code into the system, thereby granting them the ability to implement
any computable policy or mechanism. (An alternative code motion, uploading operating system code into the
application, provides the same guarantee.)
Conversely, an interface striving for infinite generality is implicitly attempting to provide Turing completeness.
Explicitly realizing this fact leads to the obvious solution of having clients pass in general-purpose code.
The following point provides an example:
2. Correct applications track what resources they have access to, rendering the operating system's bookkeeping
redundant. Thus, if the operating system can reuse the application's data structures, it can eliminate this
redundancy.
To ensure that the operating system can understand these structures without restricting their implementation, we
use the previous point: clients provide a data structure interpreter, written in a Turing complete language, that
the operating system uses to extract the bookkeeping information it needs.
To force these interpreters to be correct, we use the following technique:
3. Inductive incremental testing provides a practical way to verify the correctness (not just mere safety) of deter-
ministic code. We call functions amenable to this approach untrusted deterministic functions (UDFs).
Using UDFs, operating systems can avoid pre-determining implementation tradeoffs by leaving implementation
decisions to client UDFs. Our most interesting use of UDFs, verifying the resource interpreters described above,
lets untrusted file systems track what disk blocks they own without the operating system understanding how, yet
without being vulnerable to malice.
4. There are practical nuances between using code or data to orchestrate actions between entities. Two examples
follow.
Data is not a Turing machine: compared to downloaded code, data is transparent, and its operations (read and
write) trivially bounded in cost and guaranteed to terminate. Code is not, necessarily, any of these things.
Injecting potentially non-terminating black boxes into an operating system does not help simplicity. We only
use downloaded code in a few specific instances, and have removed it more than once.
Code requires indirection. Imposing code between the operating system and its data forces it to go through a
layer of potentially non-terminating indirection. For example, since our disk subsystem relies on client UDFs to
interpret file system structures, it cannot modify them directly, but requires assistance for operations as simple
as changing one disk block pointer to another in order to compact the disk.
5. The main benefit of downloaded code is not execution speed, but rather trust and consequently power.
While we started with the view that downloading code was useful for speed (e.g., to eliminate the cost of
kernel/user boundary crossings) [25] it has turned out to be far more crucial for power: because downloaded
code can be controlled, it can be safely granted abilities that external, unrestricted application code cannot.
For example, since downloaded file system UDFs can be verified, they can be trusted to track that file system's
disk blocks. Obviously, unrestricted applications cannot similarly be trusted.
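Points 2 and 3 can be made concrete with a toy interpreter check. The names and the meta data encoding are invented; the point is only the inductive step: the kernel compares the interpreter's output before and after a proposed modification against the blocks the libFS claims to allocate.

```python
def check_allocation(owns_udf, meta_before, meta_after, claimed_blocks):
    """Inductive incremental test: the client's interpreter must report
    exactly the claimed new blocks, and must not lose any old ones."""
    before = owns_udf(meta_before)
    after = owns_udf(meta_after)
    return before <= after and after - before == set(claimed_blocks)

# A toy libFS interpreter; the meta data format is the libFS's own business,
# and the kernel never parses it directly.
def inode_owns(meta):
    return set(meta["direct"])

assert check_allocation(inode_owns, {"direct": [10, 11]},
                        {"direct": [10, 11, 12]}, [12])
```

An interpreter that lied about which blocks a modification adds would fail this check, which is how the kernel trusts bookkeeping it cannot itself read.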
The chapter is organized as follows. The next four sections discuss different language-based subsystems, which
form the spine for our experiences and lessons: DPF [31], our packet filter engine; application specific message
handlers (ASHs) [25, 92, 93], a networking system which invokes downloaded code on message arrival; XN [48], the
disk protection subsystem described in Chapter 4, which contains our most interesting use of downloaded code; and
finally, protected methods, which applications use to enforce invariants on shared state. Section 6.5 explores some of
the slippery implications of using computational building blocks in lieu of passive procedural interfaces. Section 6.6
links our experiences to the existent literature on extensible systems.
comparisons of adjacent message values, estimating alignment of message loads to eliminate the pessimal use of
unaligned memory loads and, finally, eliminating bounds checks.
The most unusual optimization DPF performs is hash table compilation. When filters compare the same message offset to the same value, DPF merges them. When they compare it to different values, DPF creates a hash table holding the constant each filter compares against, and uses dynamic code generation to compile this hash table to executable code. For example, if the hash table has no collisions, the lookup can elide collision checks. If there are a small number of keys (say, eight or fewer), DPF instead generates a specialized binary search that hard codes each key as an immediate value in the instruction stream. Otherwise, it creates a jump table. Additionally, since the number and values of the keys are known at run time, DPF can select among several hash functions to obtain the best distribution, and then encode the chosen function in the instruction stream.
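As an interpretive sketch of these strategy choices (DPF itself emits machine code; the threshold and names here are illustrative), consider:

```python
import bisect

def compile_dispatch(keys_to_filters):
    """Return a matcher specialized to a fixed key set, mimicking DPF's
    strategy choice (dict values stand in for filter actions)."""
    keys = sorted(keys_to_filters)
    if len(keys) <= 8:
        # Few keys: binary search with the keys "hard coded" (captured).
        def match(value):
            i = bisect.bisect_left(keys, value)
            if i < len(keys) and keys[i] == value:
                return keys_to_filters[keys[i]]
            return None
        return match
    # Try table sizes until the keys hash collision-free, standing in for
    # DPF's trying several hash functions to get the best distribution.
    for size in range(len(keys), 4 * len(keys)):
        if len({k % size for k in keys}) == len(keys):
            table = [None] * size
            for k in keys:
                table[k % size] = (k, keys_to_filters[k])
            def match(value):
                entry = table[value % size]
                # No collision chain to walk: one compare suffices.
                return entry[1] if entry and entry[0] == value else None
            return match
    # Otherwise: a dense jump table indexed directly by the value.
    lo, hi = keys[0], keys[-1]
    table = [keys_to_filters.get(lo + i) for i in range(hi - lo + 1)]
    return lambda value: table[value - lo] if lo <= value <= hi else None
```

The interesting property is that the strategy is chosen once per filter set, at "compile" time, so each message dispatch pays only for the specialized matcher.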
cache.) Another problem, banal but real, is that threading requires that pointers be loaded as immediates, which on
64-bit machines can be expensive. Again, fixes are not intellectually deep, but can require tedious bookkeeping.
In contrast, dynamically compiling code is much simpler. The code changes infrequently, if ever, and tends to be
used many times before being discarded. As a result, compilation is a “one off” affair and its cost easily recouped.
While current dynamic compilation techniques work reasonably well for compiling languages, they must mature
further before they can be readily used to compile data structures.
To summarize: (1) tamed code can be made lightweight and, thus, it can be run in situations where application
scheduling is infeasible, and (2) this decoupling allows application semantics to be reactively incorporated into event
processing.
6.2.2 Discussion
Source level versus object code level sandboxing. Software fault isolation (SFI) can be done either at the source
or at the object code level [59, 89]. ASHs were built using object code SFI, which has the theoretical advantage that
it works across languages and compilers, and with pre-compiled code. In retrospect, a source level implementation
would have been better. Object-level modification is difficult and extremely non-portable, obviously varying across
different architectures and object code formats. Even worse, it also varies across different compiler releases, due to the practice of commercial vendors deliberately changing object code formats in undocumented ways in order to stifle
third-party competitors [59].
Source-level SFI, because it is tightly integrated within a compiler, is much simpler to develop. It requires
the addition of modest, mostly portable operations done at the level of a compiler's intermediate representation.
A secondary benefit of integration is that SFI operations are optimized by the host compiler, unlike object SFI
implementations. In contrast, object SFI must constantly fight the fact that it has lost much semantic
information. For example, object code modification requires that compiler-generated jump tables be relocated, which
can be challenging, since simply finding these tables is difficult, typically requiring compiler-specific heuristics.
An obvious disadvantage of using source-level SFI is that it is specific to one compiler back end and whatever
languages its front end(s) consume. In theory, this is important. In practice, operating system software is written in the
C programming language. On those rare systems where other languages are used, special support would be required
even with object code SFI, since these languages typically use a runtime system, which must be adapted to run in an
operating system's restrictive context.
A more serious problem is that a source SFI system typically increases the size of the trusted computing base more
than an object code SFI system does. A specious counter to this problem is the belief that if the trusted compiler is the
same as that used to compile the operating system, then the trusted computing base has not really increased since the
correctness of that compiler must already be trusted. However, there is a large difference between compiling a kernel
correctly and resisting the malice of clever hackers trying to find a hole in 10-100K lines of compiler, assembler, and
dynamic linker code.
The lack of a widely available object code SFI system forced us to “roll our own.” Realistically, the reliability of a
trusted compiler is most likely better than an object code SFI module we have implemented ourselves.
We no longer use ASHs. In theory, coupling packet arrival to application semantics is profitable. Separate from
our work, Edwards et al. [24] provide a way to use simple application-provided scripts to direct message placement, while
Fiuczynski and Bershad [32] provide a fully general messaging system. However, practical technology tradeoffs have
made us eliminate ASHs. The three main benefits of ASHs are (1) elimination of kernel crossings, (2) integrated data
copying, and (3) fast upcalls to unscheduled processes, thereby reducing processing latency (e.g., of send-response
style network messages). On current generation chips, however, the latency of I/O devices is large compared to the
overhead of kernel crossings, making the first benefit negligible. The second does not require downloading code, only
an upcall mechanism [19]. In practice, it is the latter ability that gives speed. Finally, the presence of DMA hardware
makes the data integration ASHs provide irrelevant, since there is no way to add user extensions to the hardware's brute
copy.
This section explores language issues in the exokernel's disk multiplexing system, XN, our most “language heavy”
subsystem, which contains our most interesting use of downloaded code. We first discuss how to use UDFs to let
untrusted file systems track what disk blocks they own, hopefully in enough detail that other practitioners can see how
to use UDFs in their domain. We then present a series of insights that have arisen in XN and close with lessons.
6.3.1 Language Evolution
As discussed in Chapter 4, the languages we use to describe client meta data have gone through four iterations, becoming increasingly general, lower-level, and abstract. We went from an approach that did not use a language at all, to one with an expressive declarative description language (which a reading of the file system literature showed was not expressive enough), to our quasi-Turing complete UDFs. One view of this evolution is as a struggle to define a universal
data layout language. A universal language for most domains requires Turing completeness. Once this fact is realized,
it becomes obvious that one needs to provide general-purpose computational primitives. Unfortunately, we only made
this connection in hindsight after several years of struggles with disk multiplexing.
6.3.2 Insights
Infinite generality requires Turing completeness. A designer attempting to build an infinitely (or even very) general
interface or component set is implicitly striving for Turing completeness. Explicit articulation of this fact makes it clear
that one solution is to allow clients to customize policies and interface implementations using a Turing complete language
rather than, say, a “jumble of procedure flags.”
The author belatedly had this insight after struggling with the problem of how to define a completely general set of
meta data building blocks; in fact, only months after arriving at the solution (UDFs) did it become clear why they
solved the problem.
Turing completeness guarantees infinite extensibility. “Solving” extensibility requires showing that any unan-
ticipated use of a system can be implemented. Proving a negative property is hard. A key realization of this chapter is
that, when appropriate, Turing completeness provides a way to guarantee that the extensibility problem is solved. For
example, the fact that UDFs are (roughly) Turing complete guarantees that they can describe any computable data
layout, anticipated or not.
Transmuting the imperative to the declarative. System-implemented functionality is imperative: it determines
how to resolve tradeoffs in interface construction, i.e., whether to optimize for latency, throughput, or space. Such
predetermination can cause problems when many tradeoffs exist. Downloaded code can be used by a system designer
to defer such tradeoffs to clients. By allowing client code to implement functions, the system builder switches from
imperatively deciding how to implement a function to declaratively testing that client code did so correctly. For
example, rather than imperatively deciding how to represent meta data, XN declaratively tests that a UDF produces the
correct output. This approach has been noticeably easier than the previous struggle to construct a universal data layout
language.
The cost of this approach is that testing can be more complex than implementing the functionality (though it can also be
simpler) and more expensive. Even so, deferral is a net win when the tested algorithm is off the critical path or when it
grants clients sufficient power or speed.
Code enables semantic compression. Data representation is important. In a sense, UDFs can be viewed as
semantics-exploiting meta data compressors. One could, after all, define a space-inefficient and inflexible but fully
general meta data layout. However, UDFs allow representation to be more succinct. For example, much of the meaning
of a libFS's meta data is encoded in its code, eliminating the need to duplicate this information in the meta data itself
(e.g., a libFS “just knows” that certain types of block pointers point to four contiguous blocks rather than one). As a
more sophisticated example, consider an algebraic relation between blocks, such as a file system that allocates blocks
at the beginning of every cylinder group. While a predefined data structure would have to list every block, a UDF can
encode this knowledge in a function that reads the base block from an instance of meta data and constructs the set:
proc owns(meta)
    base = meta->base_block;
    set = {};
    for i = 0 to number of partitions
        set = set U { blocks in cylinder group i + base };
    return set;
3 While the language used to write UDFs is more-or-less Turing complete, their execution environment is restricted (since UDFs cannot run “too
long”), as are the UDFs themselves (since they must be practical to verify). Similar restrictions will hold for any code downloaded into the kernel, since it must
be prevented from, at the least, corrupting kernel data structures. For linguistic convenience we will still refer to such extensions as Turing complete despite this
restricted execution environment.
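The owns pseudocode above can be made concrete in a short sketch. The field names, block counts, and calling convention below are assumptions for illustration; a real owns UDF is written in XN's restricted pseudo-assembly, not Python:

```python
# Illustrative sketch of the owns() UDF above: given one piece of meta
# data, compute the set of disk blocks it claims.  Constants and the
# meta data layout are hypothetical.

BLOCKS_PER_GROUP = 8      # blocks claimed at the start of each group
GROUP_SIZE = 1024         # disk blocks per cylinder group
NUM_GROUPS = 4            # cylinder groups covered by this meta data

def owns(meta):
    base = meta["base_block"]
    owned = set()
    for g in range(NUM_GROUPS):
        start = base + g * GROUP_SIZE
        owned |= {start + i for i in range(BLOCKS_PER_GROUP)}
    return owned
```

A predefined data structure would have to list all NUM_GROUPS * BLOCKS_PER_GROUP pointers explicitly; the UDF instead encodes the algebraic pattern in a few lines of code.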
UDFs can make code transparent. A strength of downloaded code is that it can compute its results however
it wishes, in ways the underlying system did not anticipate. However, mysterious result computation can also be a
liability: users of the code may want to know when it computes a certain output. For example, consider a library file
system function, access, that given a principal and an inode indicates whether that principal is allowed to use the inode
(access(inode, pid) → bool). Given this function, it is not obvious what values of pid will cause it to return true for a piece
of meta data. Thus, an application creating a file controlled using access cannot determine whether there exists a special “back
door” value of pid that would give others access to its files. UDFs can eliminate this problem. First, we transform this
function into one that, given an inode, produces the set of principals allowed to use it (access(inode) → {set of principals}).
(In a sense, we transform access into a function that returns the set of values for which the original access returned
true.) Second, we ensure that access is deterministic. Now, at each modification of an inode we can use online testing
to ensure that the set of principals associated with it grows or shrinks exactly as it should.
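The transformation above can be sketched in a few lines. The inode representation and the wrapper name are invented for illustration; the point is the shape of the check, not the API:

```python
# Sketch: replace an opaque access(inode, pid) -> bool predicate with a
# transparent access(inode) -> set-of-principals function, then verify
# online that each modification changes the set exactly as requested.
# The inode layout ({"acl": [...]}) is hypothetical.

def access(inode):
    """Untrusted but deterministic: enumerate every allowed principal.
    Because the whole set is produced, no hidden 'back door' pid can be
    quietly accepted the way a boolean predicate could accept one."""
    return set(inode["acl"])

def checked_add_principal(inode, pid):
    """Trusted wrapper: apply the untrusted modification, then test that
    the principal set grew by exactly {pid}."""
    before = access(inode)
    inode["acl"].append(pid)          # untrusted code mutates meta data
    after = access(inode)
    if after != before | {pid}:
        raise ValueError("access set changed in an unauthorized way")
    return inode
```

The same pattern checks removals: after a delete, the set must equal the old set minus the removed principal, and nothing else.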
Program verification enables “nestable” extensibility. Because XN verifies the correctness of reference counts,
pointers, and meta data interpreters, it allows untrusted implementors to extend an existing file system without
compromising its integrity. Thus, it is possible to add an entirely new directory type to a file system and have it point
to old types, perform access control on them, etc., without leaving the existing implementation open to malice. We do
not know of any other way to achieve this result.
6.3.3 Lessons
Provide reasonable defaults. UDFs are written in a pseudo-assembly language. A simple virtual machine can be
both easy to implement (ours took roughly a day) and small (ours was a few hundred lines of code). The cost, of
course, is the unpleasantness of writing assembly-level code. Fortunately, for limited domains, such as packet filters or
meta data interpreters, this drawback can be eliminated by hiding such code behind higher-level procedural interfaces,
which clients use instead.
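A virtual machine of the scale described (a day's work, a few hundred lines) can be sketched compactly. The opcodes below are invented; the essential feature is the instruction budget, which enforces the restriction that downloaded code cannot run "too long":

```python
# Toy stack-machine interpreter of the sort used for UDFs and packet
# filters.  The instruction set is hypothetical; the step budget models
# the kernel's guarantee that downloaded code terminates quickly.

MAX_STEPS = 1000

def run(program, env):
    """Execute a list of (opcode, arg) pairs against a read-only
    environment (e.g., a piece of meta data)."""
    stack, pc, steps = [], 0, 0
    while pc < len(program):
        steps += 1
        if steps > MAX_STEPS:
            raise RuntimeError("instruction budget exceeded")
        op, arg = program[pc]
        if op == "push":            # push an immediate value
            stack.append(arg)
        elif op == "load":          # read a named field of the meta data
            stack.append(env[arg])
        elif op == "add":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "ret":           # return top of stack to the kernel
            return stack.pop()
        pc += 1
    raise RuntimeError("fell off end of program")
```

Hiding such programs behind higher-level procedural interfaces, as the text recommends, means clients call a library that emits these instruction lists rather than writing them by hand.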
Unfortunately, we repeatedly neglected to construct good default libraries after building the base extensible system.
As a result, clients typically wrote their code in terms of raw (albeit portable) assembly language, leading to impenetrable
code scattered throughout programs. This cycle was self-reinforcing: new programmers who wanted to use the system
would look at existing clients, typically fail to understand what they did, and so cut-and-paste the original code with ad
hoc modifications.
The observation that an extensible system must provide good default libraries is neither deep nor unique [11].
Nonetheless it was frequently violated: we did so first with packet filters, then with wakeup predicates, then with UDFs.
Low-level type systems are useful. From one perspective XN can be viewed as a dynamic type system placed
below a file system to catch errors. It served this role well, catching several errors in the C-FFS file system that had
escaped the notice of an experienced implementor. In this sense, XN has benefits similar to a low-level typing system
such as the Til assembly language developed to catch low-level compiler bugs [85]. One possible use of exokernel
technology is to place an exokernel below existing operating systems as an efficient runtime type checker.
Fast languages are unnecessary. Writing downloaded code using an efficient language does not hurt. But, at
least in the context of an exokernel, it does not seem to help overly much either. UDFs, for example, are written in
an interpreted assembly code and wakeup predicates have numerous excess manual address translations. The reason
for this is that code used for protection is usually off the critical path, while non-protection code can be placed in the
application itself at (perhaps) the cost of an extra system call.
is provided in a library, with the admonishment to use it when modifying a directory. Nothing prevents a malicious
application from jumping into the middle of the procedure to skip over any invariant checks or, more simply, just
writing to the directory block directly.
Mutually mistrusting applications can safely share state by agreeing on code to use, downloading it in the kernel,
and accessing the state through this code. Method invocation happens via a system call, forcing execution to begin
at a well-defined program counter value. This prevents applications from jumping over guards. Method execution
cannot be “hijacked” by the application manipulating its state via debugging system calls, signals, or page mapping
tricks. State exists only in the method's address space. Applications cannot modify it by forging pointers or aliasing
virtual addresses. A benefit of this separation is that protection at granularities other than a page can be readily
implemented, whereas unrestricted application code can be protected only at page granularity. The method cannot write outside its address
space; this protects the application from buggy or malicious method code. (Though the cases where one would trust
a method's output, but not trust it to avoid corrupting the application's state, are rare.)
Methods can be used to force the coupling of state modifications, such as forcing the invalidation of a “negative
name cache” when a directory entry is allocated. They also help the modification of data structures that span trust
boundaries. For example, they can be used to repair file system data structures after a file system crash. Finally, they
can provide an easy way to get atomicity.
Protected methods are only one of many ways to provide extensible protection. An alternative is to force all
applications to be written in a restricted language and compiled with a trusted compiler. With the advent of languages
such as Java, such alternatives may become more palatable.
Another, more traditional, way is to use servers to encapsulate sensitive state. However, because server code
is not controlled by the trusted kernel, the kernel cannot enforce invariants on it. Thus, server functionality must be trusted
completely and cannot be nested in the way that methods can. For example, the kernel can use testing to
verify that a method touches only a specific range of bytes in its guarded state, that its modifications preserve pre- and
post-conditions, that it is correct, etc.
6.5 Discussion
Data is not a Turing machine. Data is inflexible, but transparent, and its operations (read and write) trivially bounded
in cost and guaranteed to terminate. Code is not necessarily any of these three things. The benign characteristics of
data can be a relief compared to the uncertainty induced by injecting potentially non-terminating black boxes into a
complex operating system. Two specific examples follow.
Information can be communicated by memory (e.g., a flag set when the operating system is allowed to write a disk
block) or by code (e.g., a routine called with the disk block, asking if it can be written). XN explicitly uses pointers
rather than code to track block write orders, despite the loss of flexibility: code would have made dependency cycles more
difficult to check and, thus, the cost of sharing higher.
It is relatively simple and well understood how to decouple application actions from application execution using
the stylized method of buffering. An application that wishes to send a large message on the network can give the
buffer for the message to the OS. The OS (or DMA engine) can in turn send it across the network at whatever rate the
network supports, irrespective of whether the application that provided the data is currently running. As a result, this
pattern of using buffering rather than downloaded code to decouple application actions from scheduling can be seen in
all operating systems the author is aware of.
The alternative, having the application explicitly cooperate with the OS via tamed downloaded code, can be
accomplished [32, 93] but requires far more machinery than the simple buffer management routines above.
Data has visible transitions. Data, because it is passive, can only be changed by an active entity. Given the right
framework (e.g., if applications can only write to data via system calls), this characteristic makes transitions clear and
easily coupled to actions or checks. For example, to allow applications to control the order of disk block writes, XN
allows them to create dependency chains. Since the pointers used to form chains can only be added using the OS, it is
simple to check for cycles. If code were used instead (e.g., a boolean procedure, can_write?, associated with
each block), and that code were in any way opaque, such checking would become more difficult.
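Because dependency edges can only be added through the OS, the cycle check reduces to a simple walk over passive data. The representation below (a dictionary mapping each block to its successors) is an assumption for illustration:

```python
# Sketch: reject a new write-order dependency edge if it would close a
# cycle.  'deps' maps a block to the set of blocks that depend on it.
# With passive data the check is a plain reachability test; an opaque
# can_write? procedure per block would make it far harder.

def creates_cycle(deps, src, dst):
    """Would adding the edge src -> dst close a cycle?
    Yes iff src is already reachable from dst."""
    seen, frontier = set(), [dst]
    while frontier:
        b = frontier.pop()
        if b == src:
            return True
        if b not in seen:
            seen.add(b)
            frontier.extend(deps.get(b, ()))
    return False

def add_dependency(deps, src, dst):
    """OS entry point: the only way edges enter the graph."""
    if creates_cycle(deps, src, dst):
        raise ValueError("dependency cycle rejected")
    deps.setdefault(src, set()).add(dst)
```

Funneling every mutation through `add_dependency` is exactly the "visible transition" the text describes: the OS sees each change and can couple a check to it.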
Data is passive, control active. From the application's perspective, the passivity of data can be a significant
problem: it makes state transitions invisible, requiring polling to track them. Our exokernel provides wakeup
predicates 4 as a way to conceptually (if not actually) transmute passive data into active events. Wakeup predicates
4 The exokernel's system for this was conceived, designed, and implemented by Thomas Pinckney.
are application code snippets downloaded into the kernel, bound to useful memory locations (block I/O flags, timer
counters, etc.) and evaluated on various interrupts. When they evaluate to true, the process is awakened.
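The wakeup-predicate mechanism can be sketched as follows. The kernel-state layout and function names are invented; the shape (a small bounded check over exposed kernel memory, evaluated on interrupts) follows the description above:

```python
# Sketch of wakeup predicates: small application-supplied checks over
# kernel memory, evaluated on interrupts; a process whose predicate
# evaluates to True is awakened.  The kernel-state layout is hypothetical.

def make_io_predicate(kstate, block_id):
    """Build a predicate that fires when a block's I/O-done flag is set."""
    def pred():
        return kstate["io_done"].get(block_id, False)
    return pred

def on_interrupt(sleepers, runnable):
    """Kernel side: evaluate each sleeper's predicate; move the processes
    whose predicates hold onto the run queue, keep the rest asleep."""
    still_asleep = []
    for proc, pred in sleepers:
        if pred():
            runnable.append(proc)
        else:
            still_asleep.append((proc, pred))
    return still_asleep
```

This is how passive data is conceptually transmuted into active events: the application never polls; the kernel re-evaluates the predicate whenever the underlying memory may have changed.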
Code imposes indirection. Placing code between the operating system and its data forces the OS to go through
a layer of potentially non-terminating indirection. For example, rather than meta data traversal routines that simply
walk down a vector of block pointers, XN requires the use of untrusted iterators and must make provisions to ensure
that they do not run too long. Additionally, the OS can no longer simply modify client meta data, since it does not
understand its semantics, and, as a result, cannot necessarily perform an operation as simple as changing one disk block
pointer to another in order to compact the disk.
Further, the indirecting code forces the OS to plan for failure. It must have a contingency plan for when the
application code does not update data appropriately or even terminate. Having to rely on a non-trustworthy opponent
to do crucial operations can be a practical irritant.
Code hides information. Information can be exchanged from operating system to application using either mapped
OS data structures or system calls. The latter interface shields applications from implementation details. However,
if the OS does not anticipate the need for a piece of information and encapsulate it within a system call, the client
cannot recover it. By ripping away this procedural layer and exporting data structures (read only) to
applications, however, they can obtain all information, anticipated by the kernel implementor or not. (The potential cost is being
tied to a specific implementation.)
Our library operating system's reliance on “wakeup predicates” has driven home the advantages of exposing kernel
data structures. Frequently, we have required unusual information about the system. In all cases, this information was
already provided by the kernel data structures.
Understanding. A practical problem in using downloaded code is that historically there has been a schism between
the compiler and operating system communities. As a result, OS implementors frequently do not understand compilers.
They have no equivalent difficulty with data structures.
Alternatives to downloading code for semantics. DPF, ASHs and XN can be viewed as systems to pull application
semantics into resource management decisions. However, this is not the only way to get the same effect. The easiest
is to upload operating system code into the application. It can then decide how to implement whatever decision it
desires. Additionally, it can do so in a Turing complete way, in an unrestricted environment, and without much concern
in the operating system about termination and opaqueness issues.
In non-protection situations, the semantics of resources need never be imported into the operating system. The
application can instead determine what actions to do, using whatever domain-specific knowledge is important, and
then just tell the operating system what to do. This requires constructing interfaces that do declarative checking of an
operation rather than imperatively deciding how to do it. For example, consider the problem of writing cached disk
blocks to stable storage in a way that guarantees consistency across reboots. Rather than an exokernel deciding on
a particular write ordering itself and thus having to struggle with the tradeoffs in scheduling heuristics and caching
decisions required to do so well, it can instead allow the application to construct schedules, retaining for itself the much
simplified task of merely checking that any application schedule gives appropriate consistency guarantees. Application
of this methodology enables an exokernel to leave library operating systems to decide on tradeoffs themselves rather
than forcing a particular set, a crucial shift of labor.
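The declarative check described above can be sketched directly. The schedule and constraint representations are assumptions for illustration; the kernel's only job is the test, not the schedule:

```python
# Sketch: the application proposes a disk write schedule; the kernel
# declaratively checks that it respects every must-write-before
# constraint (e.g., write an inode before the directory entry naming it)
# rather than imperatively choosing an order itself.

def schedule_ok(schedule, constraints):
    """schedule: list of block ids in proposed write order.
    constraints: iterable of (a, b) pairs meaning block a must reach
    disk before block b."""
    position = {blk: i for i, blk in enumerate(schedule)}
    for a, b in constraints:
        # Violated if b is scheduled while a is missing or comes later.
        if b in position and (a not in position or position[a] > position[b]):
            return False
    return True
```

Any schedule that passes is acceptable, however the library OS chose to weigh latency, throughput, or clustering; the tradeoff stays with the application, the guarantee with the kernel.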
One of the games we have played frequently in building an exokernel is determining how to construct interfaces in which an
application's “just knowing” what is appropriate can be expressed as a kernel action.
Interestingly, this is the single feature they change about the OS: all hardware details are emulated faithfully. Most
modern operating systems provide ways to dynamically load device drivers.
Several hints for when to download code can be found in Lampson [52]. A useful insight is that downloading is
simply an example of higher-order programming (or, in systems languages, the use of function pointers rather
than flags as parameters). The main difference is that code is being shipped across trust boundaries rather than, for
instance, library interfaces. Thus, it appears possible to take some of the ideas from these mature areas “whole cloth.”
Sussman and Abelson [1] is a classic text.
The parallel community has long considered the idea of “function shipping” for speed — e.g., to bring computation
closer to data, to be able to migrate computations for load-balancing, etc. Some of the insights from this use can be
applied to operating systems. Similarly, the distributed systems community has shipped code as well. Java applets are
a topical example [40]. Tennenhouse and Weatherall have proposed to use mobile code to build Active Networks [86];
in an active network, protocols are replaced by programs, which are safely executed in the operating system on message
arrival. Curiously, in contrast to our experience, most uses of mobile code in an Active Network seem to be to improve
efficiency rather than to increase power.
Chapter 7
Conclusion
But in our enthusiasm, we could not resist a radical overhaul of the system, in which all of its major
weaknesses have been exposed, analyzed, and replaced with new weaknesses. — Bruce Leverett, “Register
Allocation in Optimizing Compilers”
This chapter discusses possible ways that an exokernel approach could fail, lessons learned in building our exokernel
systems, and conclusions.
In our mind, the remaining serious questions about the exokernel architecture are sociological ones rather than
technical. We list five possible failures of the architecture once it moves from our coddling laboratory into the “real
world:”
1. Application writers do not deal well with the freedom they have and become tied too closely to a particular
exokernel implementation, preventing upgrades and slowing, rather than improving, system evolution. While
adherence to standard interfaces and good programming practices should prevent this type of failure (given that
these techniques have a venerable track record of doing so in other domains), it remains to be demonstrated whether
they suffice for exokernels.
2. Commoditization of operating system software makes operating system research irrelevant. Commoditization
has already severely restricted the viability of new OS interfaces. The dominance of a few OSes, and the
increasing cost of implementing them, may restrict the viability of new implementations of these interfaces as
well. If so, then much of the innovation potential of an exokernel will be lost.
3. The technical ability to innovate does not lead to any more innovation than on traditional systems. The exokernel
architecture is based on a partially-sociological assumption: that making OS innovation easier and less costly
will lead to a vast improvement in innovation. This may well be a false assumption. For instance, innovation
may already be “easy enough” for those who care to do it.
4. An exokernel, by migrating most OS code to libraries, removes the “single point of upgrade” characteristic of
current systems. This can be an advantage, since applications do not have to wait for the central OS to upgrade
but can instead do so themselves. However, it can also impede progress by making improvements harder to
disseminate.
5. Users will not switch operating systems. An OS forms the primal mud on which systems are built. Changes to it
have far reaching impact. Computer system users have thus demonstrated an understandable reluctance to alter
it. It may be that the advantages of an exokernel system do not prove sufficient to lead to such a switch.
Fortunately, an exokernel does not require “whole cloth” adoption for success. It appears that many exokernel
interfaces, especially those related to I/O, can be grafted on to existing systems, with little loss in performance.
Normal applications would use the existing OS interfaces as a default, while more aggressive applications would
have the power to control important decisions. 1
While we have confidence that the exokernel's technical advantages will allow it to transcend these potential pitfalls,
this remains to be demonstrated.
7.2 Experience
Over the past three years, we have built three exokernel systems. We distill our experience by discussing the clear
advantages, the costs, and lessons learned from building exokernel systems.
7.2.2 Costs
Exokernels are not a panacea. This section lists some of the costs we have encountered.
Exokernel interface design is not simple. The goal of an exokernel system is for privileged software to export
interfaces that let unprivileged applications manage their own resources. At the same time, these interfaces must offer
rich enough protection that libOSes can assure themselves of invariants on high-level abstractions. It generally takes
several iterations to obtain a satisfactory interface, as the designer struggles to increase power and remove unnecessary
1 John Jannotti in our group has begun the process of integrating exokernel disk and network interfaces into Linux.
functionality while still providing the necessary level of protection. Most of our major exokernel interfaces have gone
through multiple designs over several years.
Information loss. Valuable information can be lost by implementing OS abstractions at application level. For
instance, if virtual memory and the file system are completely at application level, the exokernel may be unable to
distinguish pages used to cache disk blocks from pages used for virtual memory. Glaze, the Fugu exokernel, has the
additional complication that it cannot distinguish such uses from the physical pages used for buffering messages [60].
Frequently-used information can often be derived with little effort. For example, if page tables are managed by the
application, the exokernel can approximate LRU page ordering by tracking the insertion of translations into the TLB.
However, at the very least, this inference requires thought.
Self-paging libOSes. Self-paging is difficult (only a few commercial operating systems page their kernel). Self-
paging libOSes are even more difficult because paging can be caused by external entities (e.g., the kernel touching a
paged-out buffer that a libOS provided). Careful planning is necessary to ensure that libOSes can quickly select and
return a page to the exokernel, and that there is a facility to swap in processes without knowledge of their internals
(otherwise virtual memory customization will be infeasible).
7.2.3 Lessons
Provide space for application data in kernel structures. LibOSes are often easier to develop if they can store shared
state in kernel data structures. In particular, this ability can simplify the task of locating shared state and often avoids
awkward (and complex) replication of indexing structures at the application level. For example, Xok lets libOSes use
the software-only bits of page tables, greatly simplifying the implementation of copy on write.
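The software-only-bits lesson can be illustrated with a small sketch. The bit positions below follow the x86 convention that bits 9-11 of a page-table entry are ignored by hardware and available to software; the function names and the single-bit COW encoding are assumptions for illustration, not Xok's actual code:

```python
# Sketch: a libOS uses a software-only page-table-entry bit to mark
# copy-on-write pages.  Bit layout follows x86 (bit 1 = writable,
# bits 9-11 software-available); the COW protocol here is illustrative.

PTE_WRITABLE = 1 << 1
SW_COW       = 1 << 9      # software-only bit: page is copy-on-write

def mark_cow(pte):
    """Share the page read-only, remembering it was logically writable."""
    return (pte & ~PTE_WRITABLE) | SW_COW

def on_write_fault(pte, copy_page):
    """Write-fault path: a COW fault triggers a copy and re-enables
    writes; anything else is a genuine protection violation."""
    if pte & SW_COW:
        copy_page()
        return (pte | PTE_WRITABLE) & ~SW_COW
    raise MemoryError("write to non-writable, non-COW page")
```

Storing the COW mark in the kernel-resident page table itself, rather than in a parallel application-level structure, is exactly the simplification the lesson describes: the shared state lives where both the fault path and the libOS can find it.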
Fast applications do not require good microbenchmark performance. The main benefit of an exokernel is not
that it makes primitive operations efficient, but that it gives applications control over expensive operations such as
I/O. It is this control that gives order of magnitude performance improvements to applications, not fast system calls.
We heavily tuned Aegis to achieve excellent microbenchmark performance. Xok, on the other hand, is completely
untuned. Nevertheless, applications perform well.
Inexpensive critical sections are useful for LibOSes. In traditional OSes, inexpensive critical sections can be
implemented by disabling interrupts [10]. ExOS implements such critical sections by disabling software interrupts
(e.g., time slice termination upcalls). Using critical sections instead of locks removes the need to communicate to
manage a lock, to trust software to acquire and release locks correctly, and to use complex algorithms to reclaim a lock
when a process dies while still holding it. This approach has proven to be similarly useful on the Fugu multiprocessor;
it is the basis of Fugu's fast message passing.
User-level page tables are complex. If page tables are migrated to user level (as on Aegis), a concerted effort
must be made to ensure that the user's TLB refill handler can run in unusual situations. The reason is not performance,
but that the naming context provided by virtual memory mappings is a requirement for most useful operations. For
example, in the case of downloaded code run in an interrupt handler, if the kernel is not willing to allow application
code to service TLB misses then there are many situations where the code will be unable to make progress. User-level
page tables made the implementation of libOSes tricky on Aegis; since the x86 has hardware page tables, this issue
disappeared on Xok/ExOS.
7.3 Conclusion
Our inventions mirror our secret wishes. — Lawrence Durrell (1912-1990), “Mountolive” (1959)
This thesis proposes and evaluates the exokernel operating system architecture. An exokernel gives untrusted
application code as much safe control over resources as possible. It does so by separating management from protection.
All functionality necessary for protection resides in the exokernel; control over all other aspects is given to applications.
Ideally, applications can safely and efficiently perform any operation that a privileged operating system can. Thus,
unlike traditional systems, on an exokernel system, OS software becomes: (1) unprivileged, (2) able to co-exist with
other implementations, (3) modifiable and deployable by orders of magnitude more programmers. We hope that this
organization significantly improves operating system innovation.
This thesis has discussed both the exokernel architecture, and how to apply its principles in practice, drawing
upon examples from two exokernel systems. These systems give significant performance advantages to aggressively-
specialized applications while maintaining competitive performance on unmodified UNIX applications, even under
heavily multi-tasked workloads. Exokernels also simplify the job of operating system development by allowing one
library operating system to be developed and debugged from another one running on the same machine. The advantages
of rapid operating system development extend beyond specialized niche applications. Thus, while some questions
about the full implications of the exokernel architecture remain to be answered, it is a viable approach that offers many
advantages over conventional systems.
Chapter 8
XN's Interface
This chapter describes the public and privileged XN system call interface.
A number of routines expect a device number (an integer of type dev_t), which names an active XN-controlled disk.
This name is implicit in the disk address type, da_t, a 64-bit integer that encodes the device name, disk block, and byte
offset within the disk block. A device corresponds to some range of disk blocks, a freemap that tracks these blocks,
and a “root catalogue,” which is a persistent table into which libFSes install types and file system roots.
In general, any XN system call fails if: (1) a libFS-supplied capability is insufficient, (2) a libFS-supplied pointer
is bogus (i.e., not readable or, for modifications, not writeable), or (3) a libFS-supplied name is bogus (e.g., an invalid
disk block name, an invalid root catalogue entry character string, etc.). To save space, we do not mention these errors
further.
We elide the details of mapping XN data structures and buffer cache entries, since they are specific to the hosting
OS's virtual memory interface rather than XN itself.
A.2 Reconstruction
The following two functions are used by privileged reconstruction programs:
db_t sys_db_alloc(dev_t dev, db_t db, size_t n);
    Forces the extent [db, db+n) on disk dev to be taken off of the freelist.
xn_err_t sys_xn_mod_refcnt(da_t da, int delta);
    Alters da's reference count by delta.
Currently, the error log routines are in flux. We elide them here.
B Public system calls
The following system calls are intended for use by any libFS.
Allocate the extent [db, db+nelem*sizeof(t)) on disk dev and associate it with name in the root catalogue.
If db's value is 0, then the kernel decides what extent to allocate and writes the allocated block into
db. LibFSes use this system call to install both file system roots and types. On reboot, the XN
reconstructor loads the catalogue and traverses file systems from these roots, garbage collecting and
performing consistency checks. The entry can be modified by applications that possess cap. The call fails
if any block in the extent has already been allocated, if type is invalid, or if name or db is not readable.
xn_err_t sys_type_import(dev_t dev, char *type_name);
    Converts the root catalogue entry name in dev's root catalogue from a disk block extent to an actual
type. It fails if name does not exist, the blocks are not “raw” disk blocks (i.e., of type XN_DB), or the
contents of the extent form an invalid type.
xn_err_t sys_reserve_type(dev_t dev, char *name);
    Reserve a slot for an as-yet-unspecified type in dev's root catalogue. This call is used to construct
mutually recursive types.
To create a new type, the libFS performs the following four steps:
1. Allocates space for it on disk and installs it in the root catalogue using sys_install_mount. The value of type in
this installation is XN_DB, indicating the blocks are simple disk blocks. If the type is part of a mutually
recursive specification (which can happen when the type is a composite type), the libFS can use sys_reserve_type
to reserve the type names for these other types, as well as to obtain their type ids (integers assigned by the kernel),
which are needed by the type's owns UDF to compute the typed block set that it outputs.
2. Initializes these blocks, using sys_xn_writeb (discussed below), with the contents of the type's “type structure,” which
holds the owns UDF, the refcnt UDF, and other information.
3. Writes these blocks back to disk (the following step fails if they are dirty).
4. Finally, it uses sys_type_import to convert the blocks from generic blocks to an XN-recognized type. sys_type_import
performs consistency checks on the type data structure (e.g., that the contained UDFs are deterministic, that the
partitions are sensible) and, if the checks succeed, changes the type's catalogue entry to be of type XN_TYPE rather
than XN_DB.
At this point, the libFS can create and point to blocks of the new type.
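The four-step protocol above can be sketched as straight-line code. The sys_* signatures below are illustrative stubs, not the real kernel entry points (the actual calls take device, capability, and extent arguments elided here); the stubs merely record call order so the required sequencing is visible:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-ins for the XN calls named in the text; each stub
 * just logs its name so we can check the protocol order. */
static char call_log[4][32];
static int ncalls;

static int sys_install_mount(const char *name) {
    (void)name; strcpy(call_log[ncalls++], "install_mount"); return 0;
}
static int sys_xn_writeb(const void *type_struct) {
    (void)type_struct; strcpy(call_log[ncalls++], "writeb"); return 0;
}
static int sys_xn_writeback(void) {
    strcpy(call_log[ncalls++], "writeback"); return 0;
}
static int sys_type_import(const char *name) {
    (void)name; strcpy(call_log[ncalls++], "type_import"); return 0;
}

/* The four steps from the text: install raw XN_DB blocks, write the
 * type structure into them, flush them (they must be clean on disk),
 * then import them as a recognized type. */
static int make_type(const char *name, const void *type_struct) {
    if (sys_install_mount(name) < 0) return -1;   /* 1. raw blocks      */
    if (sys_xn_writeb(type_struct) < 0) return -1;/* 2. type structure  */
    if (sys_xn_writeback() < 0) return -1;        /* 3. flush to disk   */
    return sys_type_import(name);                 /* 4. XN_DB -> XN_TYPE */
}
```

The order matters: step 4 fails if step 3 left any of the blocks dirty.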
Bootstrap the file system tree by inserting the named block extent into the buffer cache: the types
for the rest of the file system tree can then be obtained by recursively applying the associated owns UDF to
the root and its children. The call writes the root catalogue entry into r, and then annotates the associated
entries in the buffer cache with the given type. Fails if name does not exist, or if the extent pointed to
by the root catalogue entry is not in the buffer cache registry.
/* Specify XN operations that involve UDFs. */
struct xn_op {
    /* Specify what extent will be allocated/freed/read. */
    struct xn_update {
        size_t own_id;          /* id of the UDF to run */
        cap_t cap;              /* capability to use for access */
        /*
         * Specify an update to a piece of meta-data.  Semantics:
         *      memcpy((char *)meta + offset, addr, nbytes);
         * Ignored for reads.
         */
        struct xn_m_vec {
            size_t offset;      /* what offset in the type */
            void *addr;         /* pointer to the value to copy there */
            size_t nbytes;
        } mv;
        size_t n;               /* number of elements in mv */
    } u;
};
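The modification-vector semantics quoted in the xn_m_vec comment, memcpy((char *)meta + offset, addr, nbytes), can be exercised directly. The struct below mirrors the figure's fields; apply_m_vec is a hypothetical helper for illustration, not part of the XN interface:

```c
#include <assert.h>
#include <string.h>

/* Mirrors the xn_m_vec fields from the figure. */
struct xn_m_vec {
    size_t offset;    /* offset into the cached metadata block */
    void  *addr;      /* value to copy there */
    size_t nbytes;
};

/* Apply each modification entry to the metadata exactly as the
 * comment specifies: memcpy((char *)meta + offset, addr, nbytes). */
static void apply_m_vec(void *meta, const struct xn_m_vec *mv, size_t n)
{
    for (size_t i = 0; i < n; i++)
        memcpy((char *)meta + mv[i].offset, mv[i].addr, mv[i].nbytes);
}
```

For example, a libFS patching a 4-byte block pointer at offset 4 of an inode would pass a single entry with offset 4 and nbytes 4.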
Similar to sys_xn_mount, except that it annotates a type name's cached blocks as holding an XN type.
It fails if the given name does not exist in the root catalogue, is not a type, or if the blocks are not in the
buffer cache.
To use a file system tree, the file system root and any types it needs must first be loaded into the buffer cache using
the read and insert calls discussed below. Then, sys_xn_mount annotates these disk blocks with the type given by their
root catalogue entry (for the root, its synthetic libFS type; for types themselves, XN_TYPE).
Types and roots can be removed from the root catalogue using:
xn_err_t sys_uninstall_mount(dev_t dev, char *name, cap_t c);
Delete root catalogue entry name on dev. Fails if name is a file system root that contains child pointers,
or is a type with cached blocks. If a type is deleted that is used by some non-cached metadata instance,
then that metadata's owns function will fail (as will the system call using owns). A better solution would
be to count how many blocks of a given type exist and only allow the type to be deleted when no blocks
of that type remain.
struct xn_iov {
    xn_cnt_t cnt;           /* decremented on every successful I/O */
    struct xn_ioe {
        db_t db;
        size_t nblocks;
        void *addr;         /* memory to write from/to */
    } iov[1];
    size_t n_io;            /* number of entries */
};
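One plausible way a libFS might size such a gather list is the classic one-element-array idiom that the iov[1] declaration suggests. Everything below is illustrative, not the real XN layout: the typedefs are guesses, and n_io is placed before the array so the trailing array can grow with the allocation:

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative typedefs; the real ones live in the XN headers. */
typedef unsigned db_t;
typedef int xn_cnt_t;

struct xn_ioe { db_t db; size_t nblocks; void *addr; };

struct xn_iov {
    xn_cnt_t cnt;            /* decremented on every successful I/O */
    size_t   n_io;           /* number of entries */
    struct xn_ioe iov[1];    /* grows with the allocation (n_io entries) */
};

/* Allocate an xn_iov with room for n >= 1 extent descriptors, using
 * the one-element-array trick: the struct already contains one entry,
 * so we add space for n - 1 more. */
static struct xn_iov *iov_alloc(size_t n)
{
    struct xn_iov *v = calloc(1, sizeof *v + (n - 1) * sizeof(struct xn_ioe));
    if (v)
        v->n_io = n;
    return v;
}
```

A libFS would fill each iov[i] with a disk extent and buffer address, then hand the whole structure to the gather writeback call.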
Read [db, db+nblocks) from disk dev into the buffer cache. cnt is incremented each time a block is
brought in. Fails if the extent does not fit in the buffer cache.
xn_err_t sys_xn_insert_attr(da_t parent, struct xn_op *op);
Install the type of the buffer cache entry using the parent block named by parent. Allocation, deletion, reading,
and writing all perform this call implicitly, since they require registry entries. Fails if the extent specified in op
is not guarded by the sub-partition specified in op.
xn_err_t sys_xn_delete_attr(dev_t dev, db_t db, size_t nblocks, cap_t cap);
Remove the buffer cache entries for [db, db+nblocks) associated with disk dev. Fails if removing the entry would
lead to an invariant violation or if the entry is in use by some other application.
Importantly, libFSes can fetch any block extent into the buffer cache. They only need to locate the block's parent (and
thus its type and access control mechanism) before actually reading or writing the cached blocks.
The following two functions write buffer cache entries back to disk:
xn_err_t sys_xn_writeback(dev_t dev, db_t db, size_t nblocks, xn_cnt_t *cnt);
Writes [db, db + nblocks) back to disk dev. Each write decrements cnt.
Writes a list of these extents back to disk dev. On some hardware, this “gather” mechanism enables more
sophisticated disk scheduling. Figure 8-2 gives the ANSI C representation of xn_iov. Note that neither of these two
system calls requires access checks: they only fail if the extent is invalid or contains blocks tainted with some violation.
The following two routines read and write the bytes in buffer cache entries. No byte range read by a UDF can be
modified using these routines.
xn_err_t sys_xn_writeb(da_t da, void *src, size_t nbytes, cap_t cap);
Writes [src, src + nbytes) into the cached blocks for [da, da + nbytes). All of the bytes must be in memory. Fails if
any block of the extent is not in core.
Reads [da, da + nbytes) into [dst, dst + nbytes). Fails for reasons identical to above, with the additional restriction
that [dst, dst + nbytes) must be writable.
B.4 Metadata operations
The following routines are used to allocate blocks, form edges to existing blocks, delete edges to existing blocks, and
change a block's type:
xn_err_t sys_xn_alloc(da_t da, struct xn_op *op, unsigned alloc);
Allocates the extent [db, db + nelem * sizeof(type)) given by op and writes the modifications contained
in op into the parent metadata, da. alloc indicates whether the block should be zero-filled. The call fails
if: (1) the extent cannot be allocated; (2) da is not in core; (3) op contains bogus data (its partition id is
invalid or the modification vector has bogus byte ranges); or (4) the UDF returns the wrong data.
xn_err_t sys_xn_free(da_t da, struct xn_op *op);
Frees the extent [db, db + nelem * sizeof(type)) given by op. If the extent has a reference count greater
than one, then the reference count is decremented; otherwise it is placed on the free list. Fails for reasons
similar to sys_xn_alloc.
xn_err_t sys_xn_add_edge(da_t src, struct xn_op *src_op, da_t dst);
Forms an edge from src to dst and increments dst's reference count. Fails if: (1) either node is not in core
(src must be modified to add a pointer, dst to increment its reference count); (2) src's owns UDF
fails, or dst's refcnt UDF fails; or (3) src_op specifies bogus modifications.
xn_err_t sys_xn_set_type(da_t da, int ty, void *src, size_t nbytes, cap_t cap);
Change the type of a type union instance: XN metadata types can be “unions,” which means they can be
dynamically converted among a series of listed types. Currently, the type field must be “nil” (i.e., the type has just
been initialized). Extending the system call to convert between existing types would not be difficult (it only requires
retesting that the new type's owns function emits the same blocks as the old one).
Reads the registry attribute for db on disk dev into dst. Fails if db is not in core.
Stores the enumerated list of active devices into devl, which is ndevs big. Fails if the destination is not large
enough.
Reads the block address that holds the root catalogue into r_db, the block address of the freemap into f_db, and
the size of the freemap into nbytes.
Returns the first free extent [hint, hint + nblocks) found on dev. If hint is 0, then XN makes its own
decision about where to start searching.
Reads the list of dirty but writable entries for device dev in the buffer cache into dbv (whose size is given by
n). Typically this is used by two programs: a “syncer” daemon that flushes dirty blocks back, and any shutdown
application that needs to flush all blocks back to disk. The latter locks the buffer cache and iteratively flushes blocks
until the empty list is returned.
Return the block address of dev’s XN-created “superblock” (i.e., the block that holds the pointers to XN’s
per-device bookkeeping data structures).
Computes a list of all blocks controlled by da's partitions owns_b through owns_e. These are written
into the vector ups (whose size is given by n). To ensure that the list does not get “too big” and the system
call does not run “too long” (both of which are OS dependent), the enumeration may be broken up by XN
into multiple passes.
Chapter 9
Aegis' Interface
In general, any system call fails if: (1) a libOS lacks permissions for the operation, (2) a libOS-supplied pointer is
bogus (i.e., not readable or, for modifications, not writeable), or (3) a libOS-supplied name is bogus (e.g., an invalid
page name, packet filter id, etc.). To save space, we do not mention these errors further.
9.1 CPU interface
Figure 9-1 presents the ANSI C representation of an Aegis time slice. The routines to allocate, yield, and free time
slices are given below.
int ae_s_alloc(int pid, int n);
Allocate time slice n and give environment pid access to it. Fails if n is not free, or the current process
lacks write access to pid.
int ae_s_free(int slice);
Free time slice slice. Fails if the current process lacks write access to the owning environment.
int ae_yield(int pid);
9.2 Environments
Figure 9-2 presents the ANSI C representation of an Aegis environment, which is used to store the hardware information
needed to execute a thread of control in a virtual address space, along with resource accounting information.
int ae_e_alloc(int n, struct env *e);
Allocate environment n. The application fills in the guaranteed mappings and exception handlers in the
environment structure (given in e). Aegis checks that any given physical addresses are allowed and
sensible.
int ae_e_ref(int pid);
Add a reference from the current process to environment pid. Fails if the current process lacks
permissions.
int ae_e_unref(int pid);
Unreference environment pid. If there are no other outstanding references, the env is freed. Fails if the
current process does not have read permission to the environment.
/*
 * Time-slice: controls an ordered scheduling quantum.  It is donated
 * permanently by synchronous IPC.  It can be donated temporarily by
 * yielding (via ae_yield) or by asynchronous IPCs (via ae_asynch_ipc).
 * Ownership of the time-slice can be changed by any process that has
 * the owning environment's capability.
 *
 * Time-slices are initiated via an upcall to application space and
 * revoked in the same manner; this interaction allows applications
 * to control important context-switching operations (e.g., this
 * functionality is sufficient to implement scheduler activations).
 * The price of this functionality is that an application must be
 * prevented from ignoring revocation interrupts.  The current method
 * is to record the number of timer interrupts a process has ignored
 * in 'ticks' and, when this value exceeds a predefined threshold,
 * to kill the process.  In a more mature implementation we would
 * simply context switch the application by hand.
 */
struct slice {
    char pad[12];
    struct env *e;              /* associated environment (null if no one) */
    unsigned short next, prev;  /* forward and back pointers */
    int ticks;                  /* ticks consumed in interrupts */
    int epc;
};
/*
 * Environment: this is the most complicated entity in Aegis.  An
 * environment is basically a process: it defines the program counters
 * to vector events to and serves as a resource accounting point.
 */
struct env {
    /* Pointers to exception "save areas" where Aegis stores
       active registers. */
    addr_t tlbx_save_area,
           genx_save_area,
           intx_save_area;
    /* Exception handlers. */
    handler_t xh[NEX],          /* sync exception handlers */
              epi,              /* epilogue code */
              pro,              /* prologue code */
              init;             /* initial code that is jumped to */
    /*
     * The following four fields are clustered to be on
     * the same cache line.
     */
    signed char cid;            /* address space identifier */
    unsigned char envn;         /* environment number */
    short tag;                  /* 11-bit tag */
    handler_t gate;             /* IPC entry point */
    unsigned status;            /* status register */
    struct tlb xl[MAXXL];       /* guaranteed translations */
    struct ae_intr_queue iq;    /* interrupt queue */
};
int ae_e_add(int pid, int n);
Give pid access to environment n. Only gives write access at the moment. Fails if the current process
does not have access to pid.
int ae_e_free(int pid);
Modify the byte range [offset, offset+nbytes) in environment n's application-specific data region.
9.3 Physical memory
The following three routines allocate, deallocate, and share physical pages.
int ae_p_alloc(int n);
Remove a reference to page n. If no other process has a reference to the page, it is deallocated. Fails if
the current process lacks permissions.
int ae_p_add(int prot, int pid, int n);
Give process pid access to page n with protections prot. Fails if the current process lacks appropriate
permissions.
9.4 Interrupts
To make interrupt handling efficient, Aegis places interrupt notifications in a user-space interrupt queue that the libOS
can read and modify directly. The structures used for this are given in Figure 9-3.
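The queue discipline this enables can be sketched in plain C. AE_INTQ_SZ and the field names follow the figure's comments (pending counts entries, tail is the next entry to dequeue, head is the first empty entry); the post/consume helpers and the stubbed re-enable step are illustrative, not the real kernel or libOS code:

```c
#include <assert.h>

#define AE_INTQ_SZ 32   /* illustrative queue size */

/* Just the bookkeeping fields of the figure's queue. */
struct intq {
    unsigned short pending;   /* entries waiting; 0 => queue empty */
    unsigned char head;       /* first empty entry (kernel enqueues here) */
    unsigned char tail;       /* first entry to dequeue (libOS side) */
};

static int intq_empty(const struct intq *q) { return q->pending == 0; }
static int intq_full(const struct intq *q)  { return q->pending == AE_INTQ_SZ; }

/* Kernel side: enqueue at head; on overflow the kernel would instead
 * bump the per-type overflow counter. */
static int intq_post(struct intq *q)
{
    if (intq_full(q))
        return -1;
    q->head = (q->head + 1) % AE_INTQ_SZ;
    q->pending++;
    return 0;
}

/* libOS side: the three consume steps from the figure's comments. */
static void intq_consume(struct intq *q)
{
    q->tail = (q->tail + 1) % AE_INTQ_SZ;   /* 1. increment tail mod size */
    q->pending--;                           /* 2. decrement pending       */
    /* 3. re-enable the interrupt type (privileged; elided here). */
}
```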
int ae_read_exposed_info(struct exposed_info *i);
Read out the offsets and lengths of exposed kernel data structures in preparation for mapping their
associated pages. Figure 9-4 presents the data structures that can be mapped.
int ae_getrate(void);
Copy pages [src_pfn, src_pfn + sz) to the contiguous page range starting at dst_pfn, bypassing the TLB. cache
indicates whether the copy should also bypass the cache.
int ae_memset(int dst_pfn, int offset, char b, int sz);
struct ae_interrupt {
    handler_t h;                /* handler to jump to */
};

/*
 * Circular interrupt queue: provided at user level so that applications
 * can control interrupt handling efficiently.
 *
 * Useful facts:
 *   1. pending == 0 -> q is empty.
 *   2. tail points to the first entry to dequeue.
 *   3. full: pending == sz.
 *   4. head points to the first empty entry.
 *
 * To consume an interrupt:
 *   1. increment tail % AE_INTQ_SZ.
 *   2. decrement pending.
 *   3. re-enable the interrupt type.
 */
struct ae_intr_queue {
    /*
     * Counts the number of interrupts pending while the
     * process is running with interrupts disabled.
     */
    unsigned short pending;
    /*
     * Number of overflow interrupts for each type (a simple way to
     * allow resource-specific recovery).
     */
    unsigned short overflow[AE_NINTS];
    unsigned short overflow_p;  /* set to 1 if there is overflow */
    /*
     * Handlers, one for each interrupt.  We make the application
     * explicitly check whether it was running or not.  Could instead
     * have two handlers, where handler 0 is set when the process
     * was not executing and handler 1 is set when it was.
     */
    handler_t h[AE_NINTS];
    /*
     * Circular queue; when it runs out, we write an overflow
     * interrupt.
     */
    unsigned char head, tail;
    struct ae_interrupt q[AE_INTQ_SZ];
};
struct exposed_info {
    unsigned int start_page;    /* which phys page the info starts on */
    int num;                    /* how many pages */
    /* page info */
    char p_refcnt_start, p_acl_start, p_map_start;
    int p_refcnt_size, p_acl_size, p_map_size;
    /* env info */
    char e_refcnt_start, e_acl_start, e_map_start, env_start;
    int e_refcnt_size, e_acl_size, e_map_size, env_size;
    /* stlb */
    char stlb_start;
    int stlb_size;
};
Figure 9-4: Structure used to hold where each exposed kernel data structure begins and its size. By default, each
environment has read access to the pages containing these structures.
/* Structure to hold received messages. */
struct ae_recv {
    int n;                      /* number of entries */
    struct rec {
        int sz;
        void *data;
    } r[MAXPKTS];
};
9.5 Networking
Aegis provides calls to insert and delete packet filters. It also provides support for Ethernet and AN2 OTTO chips. For
simplicity, we only provide the former's interface.
The following two functions install and delete packet filters. Filters can be bound to an ASH (libOS messaging code
downloaded into the kernel), which is run when the filter matches. Otherwise the packet will be handled by the libOS.
int ae_dpf_install(void *filter, int sz, int ash_id);
Installs filter (of sz bytes) and binds it to the ASH ash id. Fails if the size is too large, or the filter overlaps
with another filter and the current process has insufficient permissions to override it.
int ae_dpf_delete(int id);
Install a receive structure, recv, to handle packets arriving for filter fid; receive notification via polling.
addr points to a counter that is incremented by the received message size. Figure 9-5 shows the ANSI C
representation of the receive structure. Fails if there are already too many enqueued receive structures.
int ae_eth_send(void *msg, int sz);
“Gather” send of the message specified by recv: Aegis copies the message into a contiguous outgoing
buffer (the hardware we run on lacks support for DMA).
int ae_eth_info(addr_t addr);
struct stlb {
    /* STLB tag: the contents of the TLB context register (c0_tlbctx). */
    unsigned :2,
        vpn:19,     /* bad virtual page number */
        tag:11;     /* 11-bit tag associated (pseudo-randomly) with each process */
    /* TLB entry. */
    unsigned :8,    /* reserved */
        g:1,        /* Global: TLB ignores the PID match requirement */
        v:1,        /* Valid: if not set, a TLBL or TLBS miss occurs */
        d:1,        /* Dirty */
        n:1,        /* Non-cacheable */
        pfn:20;     /* page frame number */
};
Insert TLB entry for virtual address va. lo holds the physical page number, protection information, and
various software tags. Figure 9-8 gives its ANSI C representation.
void ae_tlbrprotn(addr_t va, int len);
Make the region [va, va+len*PAGESIZ) writable. Fails if the current process does not have write access to any
page.
void ae_tlbflush(void);
# Software refill of TLB: uses an STLB cache.
# Our hash function combines the process tag with the lower bits of the
# virtual page number (VPN) that missed:
#       (((c0_tlbctx << 17) ^ c0_tlbctx) >> 16) & STLB_MASK & ~7
        sll     k0, k0, 17              # move VPN up
        xor     k1, k0, k1              # combine with process tag
        srl     k1, k1, 16              # move down (8-byte align)
        andi    k0, k1, STLB_MASK & ~7  # remove upper and lower bits
# 4. Load TLB: we first load the fetched TLB entry into tlblo (but do not
#    write this register into the TLB); we then re-fetch tlbctx in
#    preparation for its comparison to the STLB tag.
        mtc0    k0, c0_tlblo            # (optimistically) load TLB entry
        mfc0    k0, c0_tlbctx           # get context again
        nop                             # delay slot
Figure 9-7: Assembly code used by Aegis to look up a mapping in the STLB (18 instructions).
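Rendered in C, the hash in the figure's comment looks as follows. STLB_MASK's value is a hypothetical placeholder here (the real mask depends on the STLB's size); the & ~7 keeps the result 8-byte aligned, matching the srl/andi pair in the assembly:

```c
#include <assert.h>

#define STLB_MASK 0xfff8u   /* hypothetical; determined by STLB size */

/* C version of the STLB hash from Figure 9-7: combine the process tag
 * with the low bits of the faulting VPN (both packed into the TLB
 * context register), then mask to an 8-byte-aligned STLB offset. */
static unsigned stlb_hash(unsigned tlbctx)
{
    return (((tlbctx << 17) ^ tlbctx) >> 16) & STLB_MASK & ~7u;
}
```

Every result is both 8-byte aligned and within the table, so it can be used directly as a byte offset into the STLB array.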
/* LO portion of TLB mapping. */
struct lo {
    unsigned
        w:1,        /* write permission? */
        :7,         /* reserved for software (libOS) */
        g:1,        /* Global: TLB ignores the PID match requirement */
        v:1,        /* Valid: if not set, a TLBL or TLBS miss occurs */
        d:1,        /* Dirty */
        n:1,        /* Non-cacheable */
        pfn:20;     /* Page Frame Number: bits 31..12 of the pa */
};
Figure 9-8: Hardware defined “low” portion of a TLB entry (i.e., the part bound to a virtual page number).
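Since C bitfield layout is compiler-specific, a portable way to build the same word is with explicit shifts, using the bit positions the figure's comments imply: pfn in bits 31..12, n = 11, d = 10, v = 9, g = 8, the software bits in 7..1, and w in bit 0. make_lo is an illustrative helper, not part of the Aegis interface:

```c
#include <assert.h>

/* Pack the "low" TLB word from its fields.  Bit positions follow the
 * Figure 9-8 comments; pfn is masked to its 20 bits. */
static unsigned make_lo(unsigned pfn, int n, int d, int v, int g,
                        unsigned sw, int w)
{
    return ((pfn & 0xfffffu) << 12) |
           ((unsigned)(n & 1) << 11) |   /* non-cacheable */
           ((unsigned)(d & 1) << 10) |   /* dirty */
           ((unsigned)(v & 1) << 9)  |   /* valid */
           ((unsigned)(g & 1) << 8)  |   /* global */
           ((sw & 0x7fu) << 1)       |   /* libOS software bits */
           (unsigned)(w & 1);            /* write permission */
}
```

A libOS could use the low software bits to tag its own per-mapping state without disturbing what the hardware sees.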
Bibliography
[1] H. Abelson, G. J. Sussman, and J. Sussman. Structure and Interpretation of Computer Programs. MIT Press,
1996.
[2] M. Accetta, R. Baron, W. Bolosky, D. Golub, R. Rashid, A. Tevanian, and M. Young. Mach: a new kernel
foundation for UNIX development. In Proceedings of the Summer 1986 USENIX Conference, pages 93–112,
July 1986.
[3] T.E. Anderson. The case for application-specific operating systems. In Third Workshop on Workstation Operating
Systems, pages 92–94, 1992.
[4] T.E. Anderson, B.N. Bershad, E.D. Lazowska, and H.M. Levy. Scheduler activations: Effective kernel support
for the user-level management of parallelism. In Proceedings of the Thirteenth ACM Symposium on Operating
Systems Principles, pages 95–109, October 1991.
[5] A.W. Appel and K. Li. Virtual memory primitives for user programs. In Fourth International Conference on
Architecture Support for Programming Languages and Operating Systems, pages 96–107, Santa Clara, CA, April
1991.
[6] M. L. Bailey, B. Gopal, M. A. Pagels, L. L. Peterson, and P. Sarkar. PATHFINDER: A pattern-based packet
classifier. In Proceedings of the First Symposium on Operating Systems Design and Implementation, pages
115–123, Monterey, CA, USA, November 1994.
[7] K. Bala, M.F. Kaashoek, and W.E. Weihl. Software prefetching and caching for translation lookaside buffers.
In Proceedings of the First Symposium on Operating Systems Design and Implementation, pages 243–253,
November 1994.
[8] J. Barrera. Invocation chaining: manipulating light-weight objects across heavy-weight boundaries. In Proc. of
4th IEEE Workshop on Workstation Operating Systems, pages 191–193, October 1993.
[9] B. N. Bershad, S. Savage, P. Pardyak, E. G. Sirer, M. Fiuczynski, D. Becker, S. Eggers, and C. Chambers.
Extensibility, safety and performance in the SPIN operating system. In Proceedings of the Fifteenth ACM
Symposium on Operating Systems Principles, pages 267–284, Copper Mountain Resort, CO, USA, December
1995.
[10] B.N. Bershad, D.D. Redell, and J.R. Ellis. Fast mutual exclusion for uniprocessors. In Proc. of the Conf. on
Architectural Support for Programming Languages and Operating Systems, pages 223–237, October 1992.
[11] Nathaniel Borenstein and James Gosling. Unix emacs: A retrospective. In ACM SIGGRAPH Symposium on User
Interface Software, October 1988.
[12] E. Bugnion, S. Devine, and M. Rosenblum. Disco: running commodity operating systems on scalable multipro-
cessors. In Proceedings of the Sixteenth ACM Symposium on Operating Systems Principles, 1997.
[13] P. Cao, E. W. Felten, and K. Li. Implementation and performance of application-controlled file caching. In
Proceedings of the First Symposium on Operating Systems Design and Implementation, pages 165–178, November
1994.
[14] A. Chankhunthod, P. B. Danzig, C. Neerdaels, M. F. Schwartz, and K. J. Worrell. A hierarchical internet object
cache. In Proceedings of 1996 Usenix Technical Conference, pages 153–163, January 1996.
[15] D. L. Chaum and R. S. Fabry. Implementing capability-based protection using encryption. Technical Report
UCB/ERL M78/46, University of California at Berkeley, July 1978.
[16] D. Cheriton and K. Duda. A caching model of operating system kernel functionality. In Proceedings of the First
Symposium on Operating Systems Design and Implementation, pages 179–193, November 1994.
[17] D. R. Cheriton. An experiment using registers for fast message-based interprocess communication. Operating
Systems Review, 18:12–20, October 1984.
[18] D. D. Clark and D. L. Tennenhouse. Architectural considerations for a new generation of protocols. In ACM
Communication Architectures, Protocols, and Applications (SIGCOMM) 1990, pages 200–208, Philadelphia, PA,
USA, September 1990.
[19] D.D. Clark. On the structuring of systems using upcalls. In Proceedings of the Tenth ACM Symposium on
Operating Systems Principles, pages 171–180, December 1985.
[20] R. J. Creasy. The origin of the VM/370 time-sharing system. IBM J. Research and Development, 25(5):483–490,
September 1981.
[21] H. Custer. Inside Windows/NT. Microsoft Press, Redmond, WA, 1993.
[22] P. Deutsch and C. A. Grant. A flexible measurement tool for software systems. Information Processing 71, 1971.
[23] P. Druschel, L. L. Peterson, and B. S. Davie. Experiences with a high-speed network adaptor: A software
perspective. In ACM Communication Architectures, Protocols, and Applications (SIGCOMM) 1994, pages 2–13,
London, UK, August 1994.
[24] A. Edwards, G. Watson, J. Lumley, D. Banks, C. Clamvokis, and C. Dalton. User-space protocols deliver high
performance to applications on a low-cost Gb/s LAN. In ACM Communication Architectures, Protocols, and
Applications (SIGCOMM) 1994, pages 14–24, London, UK, August 1994.
[25] D. R. Engler, M. F. Kaashoek, and J. O'Toole Jr. Exokernel: an operating system architecture for application-
specific resource management. In Proceedings of the Fifteenth ACM Symposium on Operating Systems Principles,
pages 251–266, Copper Mountain Resort, Colorado, December 1995.
[26] D. R. Engler, M. F. Kaashoek, and J. O'Toole. The operating system kernel as a secure programmable machine.
In Proceedings of the Sixth SIGOPS European Workshop, pages 62–67, September 1994.
[27] D. R. Engler, M. F. Kaashoek, and J. O'Toole. The operating system kernel as a secure programmable machine.
In Operating systems review, January 1995.
[28] D. R. Engler, D. Wallach, and M. F. Kaashoek. Efficient, safe, application-specific message processing. Technical
Memorandum MIT/LCS/TM533, MIT, March 1995.
[29] Dawson R. Engler. Simple, robust online verification of program correctness. Submitted for publication.
[30] Dawson R. Engler. Efficient verification of demonically-implemented integer functions (or, demonic determinism,
trusted results). available on request, December 1997.
[31] D.R. Engler and M.F. Kaashoek. DPF: fast, flexible message demultiplexing using dynamic code generation.
In ACM Communication Architectures, Protocols, and Applications (SIGCOMM) 1996, pages 53–59, Stanford,
CA, USA, August 1996.
[32] Marc Fiuczynski and Brian Bershad. An extensible protocol architecture for application-specific networking. In
Proceedings of the 1996 Winter USENIX Conference, pages 55–64, January 1996.
[33] B. Ford, K. Van Maren, J. Lepreau, S. Clawson, B. Robinson, and Jeff Turner. The FLUX OS toolkit: Reusable
components for OS implementation. In Proc. of Sixth Workshop on Hot Topics in Operating Systems, pages
14–19, May 1997.
[34] Bryan Ford, Mike Hibler, Jay Lepreau, Patrick Tullman, Godmar Back, and Steven Clawson. Microkernels
meet recursive virtual machines. In Proceedings of the Second Symposium on Operating System Design and
Implementation (OSDI 1996), October 1996.
[35] Bryan Ford and Sai R. Susarla. CPU inheritance scheduling. In Proceedings of the Second Symposium on
Operating System Design and Implementation (OSDI 1996), October 1996.
[36] G. Ganger and Y. Patt. Metadata update performance in file systems. In Proceedings of the First Symposium on
Operating Systems Design and Implementation, pages 49–60, November 1994.
[37] Gregory R. Ganger and M. Frans Kaashoek. Embedded inodes and explicit grouping: Exploiting disk bandwidth
for small files. In Proceedings of the 1997 USENIX Technical Conference, 1997.
[38] R. P. Goldberg. Survey of virtual machine research. IEEE Computer, pages 34–45, June 1974.
[39] D. Golub, R. Dean, A. Forin, and R. Rashid. UNIX as an application program. In USENIX 1990 Summer
Conference, pages 87–95, June 1990.
[40] J. Gosling. Java intermediate bytecodes. In Proc. of ACM SIGPLAN workshop on Intermediate Representations,
pages 111–118, March 1995.
[41] P. Brinch Hansen. The nucleus of a multiprogramming system. Communications of the ACM, 13(4):238–241,
April 1970.
[42] H. Härtig, M. Hohmuth, J. Liedtke, S. Schönberg, and J. Wolter. The performance of µ-kernel-based systems.
In Proceedings of the Sixteenth ACM Symposium on Operating Systems Principles, 1997.
[43] J.H. Hartman, A.B. Montz, D. Mosberger, S.W. O'Malley, L.L. Peterson, and T.A. Proebsting. Scout: A
communication-oriented operating system. Technical Report TR 94-20, University of Arizona, Tucson, AZ, June
1994.
[44] K. Harty and D.R. Cheriton. Application-controlled physical memory using external page-cache management.
In Fifth International Conference on Architecture Support for Programming Languages and Operating Systems,
pages 187–199, October 1992.
[45] D. Hitz. An NFS file server appliance. Technical Report 3001, Network Appliance Corporation, March 1995.
[46] J. Huck and J. Hays. Architectural support for translation table management in large address space machines. In
Proceedings of the 19th International Symposium on Computer Architecture, pages 39–51, May 1992.
[47] Norman P. Jouppi. Improving direct-mapped cache performance by the addition of a small fully-associative cache
and prefetch buffers. In 17th Annual International Symposium on Computer Architecture, pages 364–373, May
1990.
[48] M. Frans Kaashoek, Dawson R. Engler, Gregory R. Ganger, Hector M. Briceno, Russell Hunt, David Mazieres,
Thomas Pinckney, Robert Grimm, John Jannotti, and Kenneth Mackenzie. Application performance and flexibility
on exokernel systems. In Proceedings of the Sixteenth ACM Symposium on Operating Systems Principles, October
1997.
[49] M.F. Kaashoek, D.R. Engler, D.H. Wallach, and G. Ganger. Server operating systems. In SIGOPS European
Workshop, pages 141–148, September 1996.
[50] Gerry Kane and Joe Heinrich. MIPS RISC Architecture. Prentice Hall, 1992.
[51] K. Krueger, D. Loftesness, A. Vahdat, and T. Anderson. Tools for development of application-specific virtual
memory management. In Conference on Object-Oriented Programming Systems, Languages, and Applications
(OOPSLA) 1993, pages 48–64, October 1993.
[52] B. W. Lampson. Hints for computer system design. In Proceedings of the Eighth ACM Symposium on Operating
Systems Principles, pages 33–48, December 1983.
[53] B.W. Lampson. On reliable and extendable operating systems. State of the Art Report, Infotech, 1, 1971.
[54] B.W. Lampson and R.F. Sproull. An open operating system for a single-user machine. Proceedings of the Seventh
ACM Symposium on Operating Systems Principles, pages 98–105, December 1979.
[55] C.H. Lee, M.C. Chen, and R.C. Chang. HiPEC: high performance external virtual memory caching. In Proceedings
of the First Symposium on Operating Systems Design and Implementation, pages 153–164, 1994.
[56] Ian Leslie, Derek McAuley, Richard Black, Timothy Roscoe, Paul Barham, David Evers, Robin Fairbairns,
and Eoin Hyden. The design and implementation of an operating system to support distributed multimedia
applications. IEEE Journal on selected areas in communication, 14(7):1280–1297, September 1996.
[57] J. Liedtke. Improving IPC by kernel design. In Proceedings of the Fourteenth ACM Symposium on Operating
Systems Principles, pages 175–188, December 1993.
[58] J. Liedtke. On micro-kernel construction. In Proceedings of the Fifteenth ACM Symposium on Operating Systems
Principles, December 1995.
[59] Steve Lucco. Personal communication. Use of undocumented proprietary formats as a technique to impede
third-party additions, August 1997.
[60] Kenneth Mackenzie, John Kubiatowicz, Matthew Frank, Walter Lee, Victor Lee, Anant Agarwal, and M. Frans
Kaashoek. UDM: User Direct Messaging for General-Purpose Multiprocessing. Technical Memo MIT/LCS/TM-
556, March 1996.
[61] C. Maeda and B. N. Bershad. Protocol service decomposition for high-performance networking. In Proceedings
of the Fourteenth ACM Symposium on Operating Systems Principles, pages 244–255, Asheville, NC, USA, 1993.
[62] David Mazieres and M. Frans Kaashoek. Secure applications need flexible operating systems. In HotOS-VI,
1997.
[63] David Mazieres and M. Frans Kaashoek. Secure applications need flexible operating systems. In Proceedings of
the 6th Workshop on Hot Topics in Operating Systems, May 1997.
[64] S. McCanne and V. Jacobson. The BSD packet filter: A new architecture for user-level packet capture. In USENIX
Technical Conference Proceedings, pages 259–269, San Diego, CA, Winter 1993. USENIX.
[65] J.C. Mogul, R.F. Rashid, and M.J. Accetta. The packet filter: An efficient mechanism for user-level network
code. In Proceedings of the Eleventh ACM Symposium on Operating Systems Principles, pages 39–51, Austin,
TX, USA, November 1987.
[66] A. C. Myers and B. Liskov. Decentralized model for information flow control. In Proceedings of the Sixteenth
ACM Symposium on Operating Systems Principles, October 1997.
[67] D. Nagle, R. Uhlig, T. Stanley, S. Sechrest, T. Mudge, and R. Brown. Design tradeoffs for software-managed
TLBs. In 20th Annual International Symposium on Computer Architecture, pages 27–38, May 1993.
[68] NCSA, University of Illinois, Urbana-Champaign. NCSA HTTPd. http://hoohoo.ncsa.uiuc.edu/index.html.
[69] J. K. Ousterhout. Why aren't operating systems getting faster as fast as hardware? In Proceedings of the Summer
1990 USENIX Conference, pages 247–256, June 1990.
[70] V. Pai, P. Druschel, and W. Zwaenepoel. I/O-lite: a unified I/O buffering and caching system. Technical
Report http://www.cs.rice.edu/ vivek/IO-lite.html, Rice University, 1997.
[71] Przemyslaw Pardyak and Brian Bershad. Dynamic binding for an extensible system. In Proceedings of the
Second USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 201–212, October
1996.
[72] R. Hugo Patterson, Garth A. Gibson, Eka Ginting, Daniel Stodolsky, and Jim Zelenka. Informed prefetching and
caching. In Proceedings of the Fifteenth ACM Symposium on Operating Systems Principles, Copper Mountain
Resort, CO, December 1995.
[73] D. Probert, J.L. Bruno, and M. Karzaorman. SPACE: A new approach to operating system abstraction. In
International Workshop on Object Orientation in Operating Systems, pages 133–137, October 1991.
[74] J.S. Quarterman, A. Silberschatz, and J.L. Peterson. 4.2BSD and 4.3BSD as examples of the UNIX system.
Computing Surveys, 17(4):379–418, December 1985.
[75] D.D. Redell, Y.K. Dalal, T.R. Horsley, H.C. Lauer, W.C. Lynch, P.R. McJones, H.G. Murray, and S.C. Purcell.
Pilot: An operating system for a personal computer. Communications of the ACM, 23(2):81–92, February 1980.
[76] Timothy Roscoe. The Structure of a Multi-Service Operating System. PhD thesis, Technical Report 376,
University of Cambridge, 1995.
[77] M. Rozier, V. Abrossimov, F. Armand, I. Boule, M. Gien, M. Guillemont, F. Herrmann, C. Kaiser, S. Langlois,
P. Leonard, and W. Neuhauser. Chorus distributed operating system. Computing Systems, 1(4):305–370, 1988.
[78] M. Seltzer, Y. Endo, C. Small, and K. Smith. Dealing with disaster: Surviving misbehaved kernel extensions.
In Proceedings of the Second Symposium on Operating Systems Design and Implementation, pages 213–228,
October 1996.
[79] C. Small and M. Seltzer. Vino: an integrated platform for operating systems and database research. Technical
Report TR-30-94, Harvard, 1994.
[80] Christopher Small and Margo Seltzer. A comparison of OS extension technologies. In Proceedings of the 1996
USENIX Conference, 1996.
[81] Richard Stallman. Emacs, the extensible, customizable self-documenting display editor. In ACM SIGPLAN
SIGOA Symposium on Text Manipulation, June 1981.
[82] T. von Eicken, A. Basu, V. Buch, and W. Vogels. U-Net: A user-level network interface for parallel and distributed
computing. In Proceedings of the Fifteenth ACM Symposium on Operating Systems Principles, pages 40–53,
Copper Mountain Resort, CO, USA, 1995.
[83] Madhusudhan Talluri, Mark D. Hill, and Yousef A. Khalidi. A new page table for 64-bit address spaces.
In Proceedings of the Fifteenth ACM Symposium on Operating Systems Principles, Copper Mountain Resort,
Colorado, December 1995.
[84] A.S. Tanenbaum, R. van Renesse, H. van Staveren, G. Sharp, S.J. Mullender, A. Jansen, and G. van Rossum.
Experiences with the Amoeba distributed operating system. Communications of the ACM, 33(12):46–63, December
1990.
[85] D. Tarditi, G. Morrisett, P. Cheng, C. Stone, R. Harper, and P. Lee. TIL: A type-directed optimizing compiler
for ML. In Proceedings of the SIGPLAN '96 Conference on Programming Language Design and Implementation,
Philadelphia, PA, May 1996.
[86] D.L. Tennenhouse and David J. Wetherall. Towards an active network architecture. In Proceedings of Multimedia
Computing and Networking 1996, January 1996.
[87] C. A. Thekkath and H. M. Levy. Hardware and software support for efficient exception handling. In Sixth
International Conference on Architectural Support for Programming Languages and Operating Systems, pages
110–121, October 1994.
[88] C. A. Thekkath, H. M. Levy, and E. D. Lazowska. Separating data and control transfer in distributed operating
systems. In Sixth International Conference on Architectural Support for Programming Languages and Operating
Systems, pages 2–11, San Francisco, California, October 1994.
[89] R. Wahbe, S. Lucco, T. Anderson, and S. Graham. Efficient software-based fault isolation. In Proceedings of the
Fourteenth ACM Symposium on Operating Systems Principles, pages 203–216, Asheville, NC, USA, December
1993.
[90] C. A. Waldspurger and W. E. Weihl. Lottery scheduling: Flexible proportional-share resource management. In
Proceedings of the First Symposium on Operating Systems Design and Implementation, pages 1–11, November
1994.
[91] C. A. Waldspurger and W. E. Weihl. Stride scheduling: Deterministic proportional-share resource management.
Technical Memorandum MIT/LCS/TM-528, MIT, June 1995.
[92] D. A. Wallach, D. R. Engler, and M. F. Kaashoek. ASHs: Application-specific handlers for high-performance
messaging. In ACM Communication Architectures, Protocols, and Applications (SIGCOMM '96), Stanford,
California, August 1996.
[93] Deborah A. Wallach. Supporting application-specific libraries for communication. PhD thesis, M.I.T., 1996.
[94] W. Wulf, E. Cohen, W. Corwin, A. Jones, R. Levin, C. Pierson, and F. Pollack. HYDRA: The kernel of a
multiprocessing operating system. Communications of the ACM, 17(6):337–345, July 1974.
[95] M. Yuhara, B. Bershad, C. Maeda, and E. Moss. Efficient packet demultiplexing for multiple endpoints and large
messages. In Proceedings of the Winter 1994 USENIX Conference, pages 153–165, San Francisco, CA, USA,
January 1994.