Computer Science from the Bottom Up
Copyright © 2004–2022 Ian Wienand

Table of Contents

Introduction
    Welcome
        Philosophy
        Why from the bottom up?
        Enabling Technologies
Chapter 1. General Unix and Advanced C
    1 Everything is a file!
    2 Implementing abstraction
        2.1 Implementing abstraction with C
        2.2 Libraries
    3 File Descriptors
        3.1 The Shell
            3.1.1 Redirection
            3.1.2 Implementing pipe
Chapter 2. Binary and Number Representation
    1 Binary — the basis of computing
        1.1 Binary Theory
            1.1.1 Introduction
            1.1.2 The basis of computing
            1.1.3 Bits and Bytes
                1.1.3.1 ASCII
                1.1.3.2 Parity
                1.1.3.3 16, 32 and 64 bit computers
                1.1.3.4 Kilo, Mega and Giga Bytes
                1.1.3.5 Kilo, Mega and Giga Bits
                1.1.3.6 Conversion
            1.1.4 Boolean Operations
                1.1.4.1 Not
                1.1.4.2 And
                1.1.4.3 Or
                1.1.4.4 Exclusive Or (xor)
            1.1.5 How computers use boolean operations
            1.1.6 Working with binary in C
        1.2 Hexadecimal
        1.3 Practical Implications
            1.3.1 Use of binary in code
            1.3.2 Masking and Flags
                1.3.2.1 Masking
                1.3.2.2 Flags
    2 Types and Number Representation
        2.1 C Standards
            2.1.1 GNU C
        2.2 Types
            2.2.1 64 bit
            2.2.2 Type qualifiers
            2.2.3 Standard Types
            2.2.4 Types in action
        2.3 Number Representation
            2.3.1 Negative Values
                2.3.1.1 Sign Bit
                2.3.1.2 One's Complement
                2.3.1.3 Two's Complement
                    2.3.1.3.1 Sign-extension
            2.3.2 Floating Point
                2.3.2.1 Normalised Values
                    2.3.2.1.1 Normalisation Tricks
                2.3.2.2 Bringing it together
Chapter 3. Computer Architecture
    1 The CPU
        1.1 Branching
        1.2 Cycles
        1.3 Fetch, Decode, Execute, Store
            1.3.1 Looking inside a CPU
            1.3.2 Pipelining
                1.3.2.1 Branch Prediction
            1.3.3 Reordering
        1.4 CISC v RISC
            1.4.1 EPIC
    2 Memory
        2.1 Memory Hierarchy
        2.2 Cache in depth
            2.2.1 Cache Addressing
    3 Peripherals and buses
        3.1 Peripheral Bus concepts
            3.1.1 Interrupts
                3.1.1.1 Saving state
                3.1.1.2 Interrupts v traps and exceptions
                3.1.1.3 Types of interrupts
                3.1.1.4 Non-maskable interrupts
            3.1.2 IO Space
        3.2 DMA
        3.3 Other Buses
            3.3.1 USB
    4 Small to big systems
        4.1 Symmetric Multi-Processing
            4.1.1 Cache Coherency
                4.1.1.1 Cache exclusivity in SMP systems
            4.1.2 Hyperthreading
            4.1.3 Multi Core
        4.2 Clusters
        4.3 Non-Uniform Memory Access
List of Figures

1.1 Abstraction
3.1 Default Unix Files
3.2 Abstraction
3.1.2.1 A pipe in action
1.3.2.1.1 Masking
2.2.1 Types
1.1 The CPU
1.3.1.1 Inside the CPU
1.3.3.1 Reorder buffer example
2.2.1 Cache Associativity
2.2.1.1 Cache tags
3.1.1.1 Overview of handling an interrupt
3.3.1.1 Overview of a UHCI controller operation
4.3.1.1 A Hypercube
4.4.1 Acquire and Release semantics
2.1 The Operating System
2.1.2.1 The Operating System
4.1.1.1 Rings
4.1.1.3.1 x86 Segmentation Addressing
4.1.1.3.2 x86 segments
2.1 The Elements of a Process
List of Tables

3.1 Standard Files Provided by Unix
3.1.1.1 Standard Shell Redirection Facilities
1.1.1.1 Binary
1.1.1.2 203 in base 10
1.1.1.3 203 in base 2
1.1.3.4.1 Base 2 and 10 factors related to bytes
1.1.3.6.1 Convert 203 to binary
1.1.4.1.1 Truth table for not
1.1.4.2.1 Truth table for and
1.1.4.3.1 Truth table for or
1.1.4.4.1 Truth table for xor
1.1.6.1 Boolean operations in C
1.2.1 Hexadecimal, Binary and Decimal
1.2.2 Convert 203 to hexadecimal
2.2.1 Standard Integer Types and Sizes
2.2.1.1 Standard Scalar Types and Sizes
2.3.1.2.1 One's Complement Addition
2.3.1.3.1 Two's Complement Addition
2.3.2.1 IEEE Floating Point
2.3.2.2 Scientific Notation for 1.98765x10^6
2.3.2.3 Significands in binary
2.3.2.1.1 Example of normalising 0.375
2.1.1 Memory Hierarchy
2.1.1 Relocation Example
5.2.1.1 ELF symbol fields
List of Examples
Introduction
Welcome
Welcome to Computer Science from the Bottom Up
Philosophy
In a nutshell, what you are reading is intended to be a shop class for
computer science. Young computer science students are taught to
"drive" the computer; but where do you go to learn what is under the
hood? Trying to understand the operating system is unfortunately not
as easy as just opening the bonnet. The current Linux kernel runs into
the millions of lines of code; add to that the other critical parts of a
modern operating system (the compiler, assembler and system
libraries) and the code base becomes unimaginably large. Further still,
add a university-level operating systems course (or four), some good
reference manuals, two or three years of C experience and, just
maybe, you might be able to figure out where to start looking to make
sense of it all.
To keep with the car analogy, the prospective student is starting out
Enabling Technologies
This book is only possible thanks to the development of Open Source
technologies. Before Linux it was like taking a shop course with a car
that had its bonnet welded shut; today we are in a position to open
that bonnet, poke around with the insides and, better still, take that
engine and use it to do whatever we want.
1. The screen
2. The keyboard
3. A printer
4. A CD-ROM
The screen and printer are both like a write-only file, but instead of
being stored as bits on a disk the information is displayed as dots on a
screen or lines on a page. The keyboard is like a read-only file, with
the data coming from keystrokes provided by the user. The CD-ROM
is similar, but rather than coming from the user, the data is read
directly from the disc.
2 Implementing abstraction
In general, abstraction is implemented by what is generically termed
an Application Programming Interface (API). API is a somewhat
nebulous term that means different things in the context of various
programming endeavours. Fundamentally, a programmer designs a
set of functions and documents their interface and functionality with
the principle that the actual implementation providing the API is
opaque.
1 #include <stdio.h>
40 exit(0);
}
We start out with a structure that defines the API (struct greet_api).
The functions whose names are encased in parentheses with a pointer
marker describe a function pointer1. The function pointer describes
the prototype of the function it must point to; pointing it at a function
without the correct return type or parameters will generate a
compiler warning at least, and if left in the code will likely lead to
incorrect operation or crashes.
We then have our implementation of the API. Often, for more complex
functionality, you will see an idiom where API implementation
functions are only a wrapper around other functions that are
conventionally prepended with one or two underscores2 (i.e.
say_hello_fn() would call another function _say_hello_function() ).
This has several uses; generally it relates to having simpler and
smaller parts of the API (marshalling or checking arguments, for
example) separate from the more complex implementation, which often
eases the path to significant changes in the internal workings whilst
keeping the external API stable.
1. Often you will see that the names of the parameters are omitted, and
only the type of the parameter is specified. This allows the implementer
to specify their own parameter names avoiding warnings from the
compiler.
2. A double-underscore function __foo may conversationally be referred
to as "dunder foo".
Finally we can call the API functions through the structure in main .
You will see this idiom constantly when navigating the source code.
The tiny example below is taken from include/linux/virtio.h in the
Linux kernel source to illustrate:
/**
 * virtio_driver - operations for a virtio I/O driver
 * @driver: underlying device driver (populate name and owner).
 * @id_table: the ids serviced by this driver.
 * @feature_table: an array of feature numbers supported by this driver.
 * @feature_table_size: number of entries in the feature table array.
 * @probe: the function to call when a device is found. Returns 0 or -errno.
 * @remove: the function to call when a device is removed.
 * @config_changed: optional function to call when the device configuration
 *    changes; may be called in interrupt context.
 */
struct virtio_driver {
        struct device_driver driver;
        const struct virtio_device_id *id_table;
        const unsigned int *feature_table;
        unsigned int feature_table_size;
        int (*probe)(struct virtio_device *dev);
        void (*scan)(struct virtio_device *dev);
        void (*remove)(struct virtio_device *dev);
        void (*config_changed)(struct virtio_device *dev);
#ifdef CONFIG_PM
        int (*freeze)(struct virtio_device *dev);
        int (*restore)(struct virtio_device *dev);
#endif
};
relevant data.
Starting with descriptors like this is usually the easiest way to begin
understanding the various layers of kernel code.
2.2 Libraries
Libraries have two roles which illustrate abstraction.
3 File Descriptors
One of the first things a UNIX programmer learns is that every
running program starts with three files already opened:
This raises the question of what an open file represents. The value
returned by an open call is termed a file descriptor and is essentially
an index into an array of open files kept by the kernel.
[Figure: File Descriptors. (1) A process calls open("/dev/sr0"), transferring control to the kernel; (2) the kernel associates a free entry in the process's file descriptor table (indices 0 through MAX_FD) with the device layer object for /dev/sr0, giving the process a file descriptor; (3) further references to that descriptor are routed to the device /dev/sr0.]
File descriptors are an index into a file descriptor table stored by the
kernel. The kernel creates a file descriptor in response to an open call and
associates the file descriptor with some abstraction of an underlying file-
like object, be that an actual hardware device, or a file system or
something else entirely. Consequently a process's read or write calls that
reference that file descriptor are routed to the correct place by the kernel
to ultimately do something useful.
There are indeed many other layers that complicate the picture in
real-life. For example, the kernel will go to great efforts to cache as
much data from disks as possible in otherwise-free memory; this
provides many speed advantages. It will also try to organise device
access in the most efficient ways possible; for example trying to order
disk-access to ensure data stored physically close together is
retrieved together, even if the requests did not arrive in sequential
order. Further, many devices are of a more generic class such as USB
or SCSI devices which provide their own abstraction layers to write
to. Thus, rather than writing directly to devices, file systems will go
through these many layers. To understand the kernel is to
understand how these many APIs interrelate and coexist.
3.1 The Shell

But shells do much more than allow you to simply execute a program.
They have powerful abilities to redirect files, allow you to execute
multiple programs simultaneously and script complete programs.
These all come back to the everything is a file idiom.
3.1.1 Redirection
Often we do not want the standard file descriptors mentioned in
Section 3, File Descriptors to point to their default places. For
example, you may wish to capture all the output of a program into a
file on disk or, alternatively, have it read its commands from a file you
prepared earlier. Another useful task might be to pass the output of
one program to the input of another. Together with the operating
system, the shell facilitates all this and more.
3.1.2 Implementing pipe

$ ls | more
[Figure: A pipe in action. The shell connects file descriptor 1 (standard output) of ls to the write() end of a kernel pipe buffer, and file descriptor 0 (standard input) of more to the read() end.]

The pipe is an in-memory buffer that connects two processes together. File
descriptors point to the pipe object, which buffers data sent to it (via a
write) to be drained (via a read).
Writes to the pipe are stored by the kernel until a corresponding read
from the other side drains the buffer. This is a very powerful concept
and is one of the fundamental forms of inter-process communication
or IPC in UNIX-like operating systems. The pipe allows more than just
a data transfer; it can act as a signaling channel. If a process reads
an empty pipe, it will by default block or be put into hibernation until
there is some data available (this is discussed in much greater depth
in Chapter 5, The Process). Thus two processes may use a pipe to
communicate that some action has been taken just by writing a byte
of data; rather than the actual data being important, the mere
presence of any data in the pipe can signal a message. Say for
example one process requests that another print a file — something
that will take some time. The two processes may set up a pipe
between themselves where the requesting process does a read on
the empty pipe; being empty, that call blocks and the process does not
continue. Once the print is done, the other process can write a
message into the pipe, which effectively wakes up the requesting
process.
Allowing processes to pass data between each other like this gives
rise to another common UNIX idiom of small tools doing one
particular thing well. Chaining these small tools gives a flexibility
that a single monolithic tool often cannot.
1.1.1 Introduction
Binary is a base-2 number system that uses two mutually exclusive
states to represent information. A binary number is made up of
elements called bits where each bit can be in one of the two possible
states. Generally, we represent them with the numerals 1 and 0 . We
also talk about them being true and false. Electrically, the two states
might be represented by high and low voltages or some form of
switch turned on or off.
...  2^6  2^5  2^4  2^3  2^2  2^1  2^0
...   64   32   16    8    4    2    1
2^7  2^6  2^5  2^4  2^3  2^2  2^1  2^0
  1    1    0    0    1    0    1    1
In the days of punch-cards, one could see with the naked eye the ones
and zeros that make up the program stream by looking at the holes
present on the card. Of course this moved rather quickly to storage
via the polarity of small magnetic particles (tapes, disks), and on to
the point today where we can carry unimaginable amounts of data in
our pocket.
1.1.3.1 ASCII
Given that a byte can represent any of the values 0 through 255,
anyone could arbitrarily make up a mapping between characters and
numbers. For example, a video card manufacturer could decide that
1 represents A , so when value 1 is sent to the video card it displays
a capital 'A' on the screen. A printer manufacturer might decide for
some obscure reason that 1 represented a lower-case 'z', meaning
that complex conversions would be required to display and print the
same thing.
The range of codes is divided up into two major parts: the non-printable
and the printable. Printable characters are things like letters
(upper and lower case), numbers and punctuation. Non-printable codes
are for control, and do things like make a carriage-return or ring the
terminal bell; the special NULL code represents nothing at all.
1.1.3.2 Parity
ASCII, being only a 7-bit code, leaves one bit of the byte spare. This
can be used to implement parity which is a simple form of error
checking. Consider a computer using punch-cards for input, where a
hole represents 1 and no hole represents 0. Any inadvertent covering
of a hole will cause an incorrect value to be read, causing undefined
behaviour.
Parity allows a simple check of the bits of a byte to ensure they were
read correctly. We can implement either odd or even parity by using
the extra bit as a parity bit.
In this way, the flipping of one bit will cause a parity error, which can
be detected.
1.1.3.3 16, 32 and 64 bit computers

Numbers do not fit into bytes; hopefully your bank balance in dollars
will need more range than can fit into one byte! Almost all general-purpose
architectures are at least 32 bit computers. This means that
their internal registers are 32 bits (or 4 bytes) wide, and that
operations generally work on 32-bit values. We refer to 4 bytes as a
word; this is analogous to language, where letters (bits) make up
words in a sentence, except in computing every word has the same
size! The size of a C int variable is conventionally 32 bits. Modern
architectures are 64 bits, which doubles the size the processor works
with to 8 bytes.
Name       Base 2   Bytes                    Base 10   Bytes
Kilobyte   2^10     1,024                    10^3      1,000
Megabyte   2^20     1,048,576                10^6      1,000,000
Gigabyte   2^30     1,073,741,824            10^9      1,000,000,000
Terabyte   2^40     1,099,511,627,776        10^12     1,000,000,000,000
Petabyte   2^50     1,125,899,906,842,624    10^15     1,000,000,000,000,000
The SI standardisation body has recognised these dual uses and has
specified unique prefixes for binary usage. Under the standard, 1024
bytes is a kibibyte, short for kilo binary byte (shortened to KiB). The
other prefixes have similar binary forms (mebibyte, MiB, for example).
Tradition largely prevents use of these terms, but you may see them
in some literature.
1.1.3.6 Conversion
Reading from the bottom and appending to the right each time gives
11001011 , which we saw from the previous example was 203.
1.1.4.1 Not
1.1.4.2 And
To remember how the and operation works, think of it as "if one input
and the other input are true, the result is true".
1.1.4.3 Or
1.2 Hexadecimal
Hexadecimal refers to a base 16 number system. We use this in
computer science for one reason only: it makes it easy for humans to
think about binary numbers. Computers only ever deal in binary;
hexadecimal is simply a shortcut for us humans trying to work with
the computer.
So why base 16? Well, the most natural choice is base 10, since we
are used to thinking in base 10 from our every day number system.
But base 10 does not work well with binary -- to represent 10
different elements in binary, we need four bits. Four bits, however,
gives us sixteen possible combinations. So we can either take the very
tricky road of trying to convert between base 10 and binary, or take
the easy road and make up a base 16 number system -- hexadecimal!
We can also use the same repeated division scheme to change the
base of a number. For example, to find 203 in hexadecimal
We can easily do this by the process of masking. This uses the rules of
logical operations to extract values.
  1 0 1 0 0 1 0 1
& 0 0 0 0 1 1 1 1    0x0F
-----------------
  0 0 0 0 0 1 0 1    0x05
To get the top four bits, we would invert the mask; in other
words, set the top 4 bits to 1 and the lower 4 bits to 0. You will note
this gives a result of 1010 0000 (or, in hexadecimal, 0xA0) when really
we want to consider this as a unique 4-bit value, 1010 (0x0A). To get
the bits into the right position we use the right shift operation 4
times, giving a final value of 0000 1010.
1 #include <stdio.h>
1.3.2.2 Flags
Often a program will have a large number of variables that only exist
as flags for some condition. For example, a state machine is an
algorithm that transitions through a number of different states but
may only be in one at a time. Say it has 8 different states; we could
easily declare 8 different variables, one for each state. But in many
cases it is better to declare one 8 bit variable and assign each bit to
flag a particular state.
#include <stdio.h>

/*
 * define all 8 possible flags for an 8 bit variable
 *    name  hex    binary
 */
#define FLAG1 0x01 /* 00000001 */
#define FLAG2 0x02 /* 00000010 */
#define FLAG3 0x04 /* 00000100 */
#define FLAG4 0x08 /* 00001000 */
/* ... and so on */
#define FLAG8 0x80 /* 10000000 */
return 0;
}
2.1.1 GNU C
The GNU C Compiler, more commonly referred to as gcc, almost
completely implements the C99 standard. However, it also implements
a range of extensions to the standard which programmers will often
use to gain extra functionality, at the expense of portability to other
compilers. These extensions are usually related to very low level code
and are much more common in the system programming field; the
most common extension used in this area is inline assembly code.
Programmers should read the gcc documentation and understand
when they may be using features that diverge from the standard.
2.2 Types
As programmers, we are familiar with using variables to represent an
area of memory to hold a value. In a typed language, such as C, every
variable must be declared with a type. The type tells the compiler
about what we expect to store in a variable; the compiler can then
both allocate sufficient space for this usage and check that the
programmer does not violate the rules of the type. In the example
below, we see an example of the space allocated for some common
types of variables.
[Figure: Types. The space allocated in system memory for some common variables: char c (1 byte), int a (4 bytes), int b[2] (2 x 4 bytes, with b[0] at *b and b[1] at *(b+1)), and char *h = "hello" (6 bytes, including the trailing \0).]
The C99 standard purposely only mentions the smallest possible size
of each of the types.
Above we can see the only divergence from the standard is that int
is commonly a 32 bit quantity, which is twice the strict minimum 16
bit size that C99 requires.
Pointers are really just an address (i.e. their value is an address and
thus "points" somewhere else in memory) therefore a pointer needs to
be sufficient in size to be able to address any memory in the system.
2.2.1 64 bit
One area that causes confusion is the introduction of 64 bit
computing. This means that the processor can handle addresses 64
bits in length (specifically the registers are 64 bits wide; a topic we
discuss in Chapter 3, Computer Architecture).
This firstly means that all pointers are required to be 64 bits wide
so they can represent any possible address in the system. However,
system implementers must then make decisions about the size of the
other types. Two common models are widely used, as shown below.
You can see that in the LP64 (long-pointer 64) model long values are
defined to be 64 bits wide. This is different to the 32 bit model we
showed previously. The LP64 model is widely used on UNIX systems.
There are good reasons why the size of int was not increased to 64
bits in either model. Consider that if the size of int were increased
to 64 bits, programmers would have no way to obtain a 32 bit variable;
the only possibility would be redefining short to be a larger 32 bit type.
signed and unsigned are probably the two most important qualifiers;
they say whether a variable can take on a negative value or not. We
examine this in more detail below.
Qualifiers are all intended to pass extra information about how the
variable will be used to the compiler. This means two things; the
compiler can check if you are violating your own rules (e.g. writing to
a const value) and it can make optimisations based upon the extra
knowledge (examined in later chapters).
1. Note that C99 also has portability helpers for printf . The PRI macros
in <inttypes.h> can be used as specifiers for types of specified sizes.
Again see the standard or pull apart the headers for full information.
1 /*
* types.c
*/
5 #include <stdio.h>
#include <stdint.h>
int main(void)
{
10 char a;
char *p = "hello";
int i;
return 0;
30 }
1 $ uname -m
i686
$ ./types
i is 52
10 p is 0x80484e8
p is 0x80484e8
$ uname -m
ia64
15
$ gcc -Wall -o types types.c
types.c: In function 'main':
types.c:19: warning: assignment makes integer from pointer with
types.c:21: warning: cast from pointer to integer of different
20 types.c:22: warning: cast to pointer from integer of different
$ ./types
i is 52
p is 0x40000000000009e0
25 p is 0x9e0
2.3.1.1 Sign Bit

The most straightforward method is to simply say that one bit of the
number indicates either a negative or positive value depending on it
being set or not.
However, notice that the value 0 now has two equivalent values; one
with the sign bit set and one without. Sometimes these values are
referred to as +0 and -0 respectively.
2.3.1.2 One's Complement

     1 00001001    (intermediate result 9, with a carry bit of 1)
   +         1
   -----------
       00001010    (10)

If you add the bits one by one, you find you end up with a carry bit at
the end. By adding this carry back to the original result we end up
with the correct value, 10.
Again we still have the problem with two zeros being represented.
Again no modern computer uses one's complement, mostly because
there is a better scheme.
2.3.1.3 Two's Complement

This means there is a slightly odd asymmetry in the numbers that can
be represented; for example with an 8 bit integer we have 2^8 = 256
possible values, and with our sign bit representation we could represent
-127 through 127, but with two's complement we can represent -128
through 127. This is because we have removed the problem of having
two zeros; consider that "negative zero" is (~00000000
+1)=(11111111+1)=00000000 (note the discarded carry bit).
2.3.1.3.1 Sign-extension
Each bit of the significand adds a little more precision to the values
we can represent. Consider the scientific notation representation of
the value 198765: we could write this as 1.98765x10^6.
With only one bit of precision, our fractional precision is not very big;
we can only say that the fraction is either 0 or 0.5 . If we add
another bit of precision, we can now say that the decimal value is one
of either 0,0.25,0.5,0.75 . With another bit of precision we can now
represent the values 0,0.125,0.25,0.375,0.5,0.625,0.75,0.875 .
If we want to represent the value 0.3, we can only say that it is closest
to 0.25; obviously this is insufficient for almost any application. With
the 23 bits of significand in a single precision float we have a much
finer resolution, but it is still not enough for many applications. A
double value increases the number of significand bits to 52 (it also
increases the range of exponent values). Some hardware has an 80-bit
extended float, with a full 64 bits of significand; 64 bits allows a
tremendous precision and should be suitable for all but the most
demanding of applications.
$ cat float.c
#include <stdio.h>

int main(void)
{
        float a = 0.45;
        float b = 8.0;

        double ad = 0.45;
        double bd = 8.0;

        printf("float+float, 6dp    : %f\n", a + b);
        printf("double+double, 6dp  : %f\n", ad + bd);
        printf("float+float, 20dp   : %.20f\n", a + b);
        printf("double+double, 20dp : %.20f\n", ad + bd);

        return 0;
}
$ ./float
float+float, 6dp    : 8.450000
double+double, 6dp  : 8.450000
float+float, 20dp   : 8.44999998807907104492
double+double, 20dp : 8.44999999999999928946
As you can see above, we can make the value normalised by moving
the bits upwards as long as we compensate by increasing the
exponent.
The standard way to find this value is to shift right, check if the
#include <stdio.h>

int main(void)
{
        // in binary = 1000 0000 0000 0000
        // bit num     5432 1098 7654 3210
        int i = 0x8000;
        int count = 0;

        while ( !(i & 0x1) ) {
                count++;
                i = i >> 1;
        }

        printf("First non-zero (slow) is %d\n", count);

        return 0;
}
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

/* return 2^n */
int two_to_pos(int n)
{
        if (n == 0)
                return 1;
        return 2 * two_to_pos(n - 1);
}

double two_to_neg(int n)
{
        if (n == 0)
                return 1;
        return 1.0 / (two_to_pos(abs(n)));
}

double two_to(int n)
{
        if (n >= 0)
                return two_to_pos(n);
        if (n < 0)
                return two_to_neg(n);
        return 0;
}
        return 0;
}
1 $ ./float 8.45
8.450000 = 1 * (1/2^0 + 1/2^5 + 1/2^6 + 1/2^7 + 1/2^10 + 1/2^11
8.450000 = 1 * (1/1 + 1/32 + 1/64 + 1/128 + 1/1024 + 1/2048 + 1
8.450000 = 1 * 1.05624997616 * 8.000000
8.450000 = 8.44999980927
From this example, we get some idea of how the inaccuracies creep
into our floating point numbers.
Chapter 3. Computer
Architecture
1 The CPU
[Figure: The CPU. The CPU's registers are connected to memory (0x090: 0, 0x100: 10, 0x110: 110, 0x120: 0) and execute instructions such as R1=100; R2=LOAD 0x100; R3=ADD R1,R2; STORE 0x110=R3.]
The CPU executes instructions read from memory. There are two
categories of instructions:
1. Those that load values from memory into registers and store
values from registers to memory.
1.1 Branching
Apart from loading or storing, the other important operation of a CPU
is branching. Internally, the CPU keeps a record of the next
instruction to be executed in the instruction pointer. Usually, the
instruction pointer is incremented to point to the next instruction
sequentially; the branch instruction will usually check if a specific
register is zero or if a flag is set and, if so, will modify the pointer to a
different address. Thus the next instruction to execute will be from a
different part of the program; this is how loops and decision statements
work.
1.2 Cycles
We are all familiar with the speed of the computer, given in
Megahertz or Gigahertz (millions or thousands of millions of cycles per
second). This is called the clock speed, since it is the speed at which an
internal clock within the computer pulses.
3. Execute : take the values from the registers, actually add them
together
4. Store : store the result back into another register. You might
also see the term retiring the instruction.
[Figure 1.3.1.1: Inside the CPU. Program code is fed to the instruction decoder; the register file feeds the AGU (address generation unit), the ALUs and the floating point units (+ - and * /); loads and stores go through the cache to RAM.]
Figure 1.3.1.1, Inside the CPU shows a very simple block diagram
illustrating some of the main parts of a modern CPU.
You can see the instructions come in and are decoded by the
processor. The CPU has two main types of registers, those for integer
calculations and those for floating point calculations. Floating point is
a way of representing numbers with a decimal place in binary form,
and is handled differently within the CPU. MMX (multimedia
extension) and SSE (Streaming SIMD Extensions) or
Altivec registers are similar to floating point registers.
A register file is the collective name for the registers inside the CPU.
Below that we have the parts of the CPU which really do all the work.
The Arithmetic Logic Unit (ALU) is the heart of the CPU operation. It
takes values in registers and performs any of the multitude of
operations the CPU is capable of. All modern processors have a
number of ALUs so each can be working independently. In fact,
processors such as the Pentium have both fast and slow ALUs; the
fast ones are smaller (so you can fit more on the CPU) but can do only
the most common operations, slow ALUs can do all operations but are
bigger.
Floating point registers have the same concepts, but use slightly
different terminology for their components.
1.3.2 Pipelining
As we can see above, the ALU adding registers together is
completely separate from the AGU writing values back to memory, so
there is no reason why the CPU can not be doing both at once. We
also have multiple ALUs in the system, each of which can be working on
separate instructions. Finally, the CPU could be doing some floating
point operations with its floating point logic whilst integer
instructions are in flight too. This process is called pipelining1, and a
processor that can do this is referred to as a superscalar architecture.
All modern processors are superscalar.
Branch instructions play havoc with this model however, since they
may or may not cause execution to start from a different place. If you
are pipelining, you will have to basically guess which way the branch
will go, so you know which instructions to bring into the pipeline. If
1. In fact, any modern processor has many more than four stages it can
pipeline, above we have only shown a very simplified view. The more
stages that can be executed at the same time, the deeper the pipeline.
The key terms here are the pipeline flush, predict taken, predict not taken and branch delay slots.
1.3.3 Reordering
In fact, if the CPU is the hose, it is free to reorder the marbles within
the hose, as long as they pop out the end in the same order you put
them in. We call this program order since this is the order that
instructions are given in the computer program.
1: r3 = r1 * r2
2: r4 = r2 + r3
3: r7 = r5 * r6
4: r8 = r1 + r7
However, when writing very low level code some instructions may
require guarantees about how operations are ordered. We call this
requirement memory semantics. If you require acquire semantics, you
are saying that all instructions after this one must see its result. If
you require release semantics, you must ensure that the results of all
previous instructions have been completed before this one. Another
even stricter semantic is a memory barrier or memory fence, which
requires that operations have been committed to memory before
continuing.
1. Even the most common architecture, the Intel Pentium, whilst having an
instruction set that is categorised as CISC, internally breaks down
instructions to RISC style sub-instructions inside the chip before
executing.
1.4.1 EPIC
The Itanium processor, which is used in many examples throughout this
book, is an example of a modified architecture called Explicitly
Parallel Instruction Computing.
Another term often used around EPIC is Very Long Instruction Word
(VLIW), which is where each instruction to the processor is extended
2 Memory
2.1 Memory Hierarchy
The CPU can only directly fetch instructions and data from cache
memory, located directly on the processor chip. Cache memory must
be loaded in from the main system memory (the Random Access
Memory, or RAM). RAM, however, only retains its contents when the
power is on, so data also needs to be kept on more permanent storage.
The important point to know about the memory hierarchy is the
trade-offs between speed and size: the faster the memory, the smaller it is.
Of course, if you can find a way to change this equation, you'll end up
a billionaire!
2. Temporal locality suggests that data that was used recently will
likely be used again shortly.
The cache is a very fast copy of the slower main system memory.
Cache is much smaller than main memory because it is included
inside the processor chip alongside the registers and processor logic.
This is prime real estate in computing terms, and there are both
economic and physical limits to its maximum size. As manufacturers
find more and more ways to cram more and more transistors onto a
chip, cache sizes grow considerably, but even the largest caches are
tens of megabytes, rather than the gigabytes of main memory or
terabytes of hard disk otherwise common.
Caches have their own hierarchy, commonly termed L1, L2 and L3.
L1 caches are generally further split into instruction and data caches;
this is known as the "Harvard Architecture" after the relay-based Harvard
Mark-1 computer which introduced it. Split caches help to reduce
pipeline bottlenecks, as earlier pipeline stages tend to reference the
instruction cache and later stages the data cache. Apart from
reducing contention for a shared resource, providing separate caches
for instructions also allows for alternate implementations which may
take advantage of the nature of instruction streaming; they are
read-only, so do not need expensive on-chip features such as multi-porting,
nor need to handle sub-block reads, because the instruction
stream generally uses more regularly sized accesses.
Once the cache is full the processor needs to get rid of a line to make
room for a new line. There are many algorithms by which the
processor can choose which line to evict; for example least recently
used (LRU) is an algorithm where the oldest unused line is discarded
to make room for the new line.
When data is only read from the cache there is no need to ensure
consistency with main memory. However, when the processor starts
writing to cache lines it needs to make some decisions about how to
update the underlying main memory. A write-through cache will write
the changes directly into the main system memory as the processor
updates the cache. This is slower, since the process of writing to the
main memory is, as we have seen, slower. Alternatively, a write-back
cache delays writing the changes to RAM until absolutely necessary.
The obvious advantage is that less main memory access is required
when cache entries are written. Cache lines that have been written
but not committed to memory are referred to as dirty. The
disadvantage is that when a cache entry is evicted, it may require two
memory accesses (one to write the dirty data back to main memory, and
another to load the new data).
[Figure 2.2.1.1: Cache tags. The address is divided into TAG, INDEX and Offset fields; the tags of each way (Way 1, Way 2, ...) are compared against the address tag and a MUX selects the matching way. Less set-associativity means more index bits.]
Tags need to be checked in parallel to keep latency times low; more tag
bits (i.e. less set associativity) requires more complex hardware to achieve
this. Alternatively more set associativity means less tags, but the processor
now needs hardware to multiplex the output of the many sets, which can
also add latency.
The offset bits depend on the line size of the cache. For example, a
32-byte line size would use the last 5 bits (i.e. 2^5 = 32) of the address as
the offset into the line.
The index is the particular cache line that an entry may reside in. As
an example, let us consider a cache with 256 entries. If this is a
direct-mapped cache, we know the data may reside in only one
possible line, so the next 8 bits (2^8 = 256) after the offset describe the line
to check - between 0 and 255.
Now, consider the same 256-element cache, but divided into two
ways. This means there are two groups of 128 lines, and the given
address may reside in either of these groups. Consequently only
7 bits are required as an index into the 128-entry ways. For a
given cache size, as we increase the number of ways, we decrease the
number of bits required as an index, since each way gets smaller.
Finally, the processor needs a way to check that the data
stored in the cache is the one it is interested in. Thus the remaining
bits of the address are the tag bits, which the cache directory checks
against the incoming address tag bits to determine if there is a cache
hit or not. This relationship is illustrated in Figure 2.2.1.1, Cache
tags.
When there are multiple ways, this check must happen in parallel
within each way, which then passes its result into a multiplexor which
outputs a final hit or miss result. As described above, the more
associative a cache is, the fewer bits are required for the index and the
more are used as tag bits; in the extreme of a fully-associative cache,
no bits are used as index bits. The parallel matching of tag bits is the
expensive component of cache design, and generally the limiting
factor on how many lines (i.e. how big) a cache may grow.
3.1.1 Interrupts
An interrupt allows the device to literally interrupt the processor to
flag some information. For example, when a key is pressed, an
interrupt is generated to deliver the key-press event to the operating
system. Each device is assigned an interrupt by some combination of
the operating system and BIOS.
Writing this interrupt handler is the job of the device driver author in
conjunction with the operating system.
[Figure: Overview of an interrupt. The device raises an interrupt to the PIC (programmable interrupt controller), which signals the CPU; the CPU uses the IDT (interrupt descriptor table) to find the handler to run.]
Most drivers will split the handling of interrupts into top and bottom
halves. The top half will acknowledge the interrupt, queue actions
for processing and return the processor to what it was doing quickly.
The bottom half will then run later, when the CPU is free, and do the more
intensive processing. This is to stop an interrupt hogging the entire
CPU.
There are two main ways of signalling interrupts on a line — level and
edge triggered.
3.1.2 IO Space
Obviously the processor will need to communicate with the peripheral
device, and it does this via IO operations. The most common form of
IO is so-called memory-mapped IO, where registers on the device are
mapped into memory.
This means that to communicate with the device, you simply need to
read or write to a specific address in memory. TODO: expand
3.2 DMA
Since the speed of devices is far below the speed of processors, there
needs to be some way to avoid making the CPU wait around for data
from devices.
Once the device is finished, it will raise an interrupt and signal to the
driver the transfer is complete. From this time the data from the
device (say a file from a disk, or frames from a video capture card) is
in memory and ready to be used.
3.3.1 USB
From an operating system point of view, a USB device is a group of
end-points grouped together into an interface. An end-point can be
either in or out, and hence transfers data in one direction only.
End-points can have a number of different types:
[Figure: Overview of a USB host controller frame list. Each frame pointer in the frame list refers first to the transfer descriptors (TDs) for isochronous data, which are executed by breadth (horizontal execution) and belong to that frame only. These link on to queue heads (QHs) for interrupt, control and bulk data, whose transfer descriptors are executed by depth (vertical execution). In each entry, T is a terminate bit and Q selects between a transfer descriptor and a queue head.]
As you can see from the diagram, the way the data is linked means
that transfer descriptors for isochronous data are associated with
only one particular frame pointer — in other words only one
particular time period — and after that will be discarded. However,
the interrupt, control and bulk data are all queued after the
isochronous data and thus if not transmitted in one frame (time
period) will be done in the next.
The symmetric term refers to the fact that all the CPUs in the system
are the same (e.g. architecture, clock speed). In an SMP system there
are multiple processors that share all other system resources
(memory, disk, etc).
This is the CPU cache; remember the cache is a small area of quickly
accessible memory that mirrors values stored in main system
memory. If one CPU modifies data in main memory and another CPU
has an old copy of that memory in its cache the system will obviously
not be in a consistent state. Note that the problem only occurs when
processors are writing to memory, since if a value is only read the
data will be consistent.
One protocol for doing this is the MOESI protocol; standing for
Modified, Owner, Exclusive, Shared, Invalid. Each of these is a state
that a cache line can be in on a processor in the system. There are
other protocols for doing as much, however they all share similar
concepts. Below we examine MOESI so you have an idea of what the
process entails.
The other case is where the processor snoops and finds that the value
is in another processor's cache. If this value has already been marked
as modified, it will copy the data into its own cache and mark it as
shared. It will send a message for the other processor (that we got
the data from) to mark its cache line as owner. Now imagine that a
third processor in the system wants to use that memory too. It will
snoop and find both a shared and an owner copy; it will thus take its
value from the owner copy. While all the other processors are only
reading the value, the cache line stays shared in the system.
However, when one processor needs to update the value it sends an
invalidate message through the system. Any processors with that
cache line must then mark it as invalid, because it no longer reflects
the "true" value. When the processor sends the invalidate message, it
marks the cache line as modified in its cache and all others will mark
theirs as invalid (note that if the cache line is exclusive the processor knows
that no other processor is depending on it, so can avoid sending an
invalidate message).
From this point the process starts all over. Thus whichever processor
has the modified value has the responsibility of writing the true value
back to RAM when it is evicted from the cache. By thinking through
the protocol you can see that this ensures consistency of cache lines
between processors.
There are several issues with this system as the number of processors
starts to increase. With only a few processors, the overhead of
Having the processors all on the same bus starts to present physical
problems as well. Physical properties of wires only allow them to be
laid out at certain distances from each other and to only have certain
lengths. With processors that run at many gigahertz the speed of light
starts to become a real consideration in how long it takes messages to
move around a system.
4.1.2 Hyperthreading
Much of the time of a modern processor is spent waiting for much
slower devices in the memory hierarchy to deliver data for
processing.
While each CPU has its own registers, they still have to share the
core logic, cache, and input and output bandwidth from the CPU to
memory. So while two instruction streams can keep the core logic of
the processor busier, the performance increase will not be as great
as having two physically separate CPUs. Typically the performance
improvement is below 20% (XXX check), however it can be drastically
better or worse depending on the workloads.
While generally the processors have their own L1 cache, they do have
to share the bus connecting to main memory and other devices. Thus
performance is not as great as a full SMP system, but considerably
better than a hyperthreading system (in fact, each core can still
implement hyperthreading for an additional enhancement).
4.2 Clusters
Many applications require systems much larger than the number of
processors a SMP system can scale to. One way of scaling up the
system further is a cluster.
software has no (well, less) knowledge about the layout of the system
and the hardware does all the work to link the nodes together.
The term non uniform memory access comes from the fact that RAM
may not be local to the CPU and so data may need to be accessed
from a node some distance away. This obviously takes longer, and is in
contrast to a single processor or SMP system where RAM is directly
attached and always takes a constant (uniform) time to access.
Above we can see the outer cube contains 8 nodes. The
maximum number of paths required for any node to talk to another
node is 3. When another cube is placed inside this cube, we now have
double the number of processors but the maximum path cost has only
increased to 4. This means as the number of processors grows by 2^n,
the maximum path cost grows only linearly.
After this should any other processor try to read the memory block
the directory will find the dirty bit set. The directory will need to get
the updated cache line from the processor with the valid bit currently
set, write the dirty data back to main memory and then provide that
data back to the requesting processor, setting the valid bit for the
requesting processor in the process. Note that this is transparent to
the requesting processor and the directory may need to get that data
from somewhere very close or somewhere very far away.
typedef struct {
        int a;
        int b;
} a_struct;

/*
 * Pass in a pointer to be allocated as a new structure
 */
void get_struct(a_struct **new_struct)
{
        a_struct *p = malloc(sizeof(a_struct));

        /* two stores that the processor may do in any order
         * (the values are illustrative) */
        p->a = 100;
        p->b = 150;

        /* the pointer must only be updated after both stores are done */
        *new_struct = p;
}
In this example, we have two stores that can be done in any particular
order, as it suits the processor. However, in the final case, the pointer
must only be updated once the two previous stores are known to have
been done. Otherwise another processor might look at the value of p ,
follow the pointer to the memory, load it, and get some completely
incorrect value!
Acquire semantics is like a fence that only allows loads and stores to
move downwards through it. That is, when this load or store is
complete you can be guaranteed that any later loads or stores will see
the value (since they can not be moved above it).
Release semantics is the opposite; that is, a fence that allows any loads
or stores to be done before it (move upwards), but nothing before it to
move downwards past it. Thus, when a load or store with release
semantics is processed, you can be sure that any earlier loads or
stores will have been completed.
[Figure: Acquire and release semantics. Acquire: all later operations must be able to see the result of this operation. Release: all earlier operations must be complete before this operation completes. Moving a load or store across the fence in the forbidden direction is an invalid reordering.]
The strictest memory model would use a full memory fence for every
operation. The weakest model would leave every load and store as a
The x86 (and AMD64) processor has a quite strict memory model; all
stores have release semantics (that is, the result of a store is
guaranteed to be seen by any later load or store) but all loads have
normal semantics. The lock prefix gives a memory fence.
Itanium allows all load and stores to be normal, unless explicitly told.
XXX
4.4.2 Locking
Knowing the memory ordering requirements of each architecture is
not practical for all programmers, and would make programs difficult
to port and debug across different processor types.
You can see how this is tied into the naming of the memory ordering
semantics in the previous section. We want to ensure that before we
acquire a lock, no operations that should be protected by the lock are
re-ordered before it. This is how acquire semantics works.
lock the first processor holds before unlocking the second lock, we
have a deadlock situation. Each processor is waiting for the other and
neither can continue without the other's lock.
1.2 Multitasking
We expect modern computers to do many different things at once,
and we need some way to arbitrate between all the different
programs running on the system. It is the operating system's job to
allow this to happen seamlessly.
The X comes from Unix, from which the standard grew. Today, POSIX
is the same thing as the Single UNIX Specification Version 3 or
ISO/IEC 9945:2002. This is a free standard, available online.
Once upon a time, the Single UNIX specification and the POSIX
Standards were separate entities. The Single UNIX specification was
released by a consortium called the "Open Group", and was freely
available as per their requirements. The latest version is The Single
Unix Specification Version 3.
Thus finally the two separate standards were merged into what is
known as the Single UNIX Specification Version 3, which is also
1.4 Security
On multi-user systems, security is very important. As the arbitrator of
access to the system the operating system is responsible for ensuring
that only those with the correct permissions can access resources.
For example if a file is owned by one user, another user should not be
allowed to open and read it. However there also need to be
mechanisms to share that file safely between the users should they
want it.
1.5 Performance
As the operating system provides so many services to the computer,
its performance is critical. Many parts of the operating system run
extremely frequently, so even an overhead of just a few processor
cycles can add up to a big decrease in overall system performance.
2 Operating System
Organisation
The operating system is roughly organised as in the figure below.
[Figure: Operating system organisation. Userspace tasks (Task 1 through Task n) run on top of the kernel; the kernel contains the drivers, which sit directly above the hardware.]
Whilst this sounds like the most obvious idea, the problem comes
back to two main issues:
This is a lot of steps for what might be a fairly simple request from a
foreign component. Obviously one request might make the other
component do more requests of even more components, and the
problem can multiply. Slow message passing implementations were
largely responsible for the poor performance of early microkernel
systems, and the concepts of passing messages are slightly harder for
programmers to program for. The enhanced protection from having
components run separately was not sufficient to overcome these
hurdles in early microkernel systems, so they fell out of fashion.
2.1.1.1 Modules
However, the modules are loaded directly in the privileged kernel and
operate at the same privilege level as the rest of the kernel, so the
system is still considered a monolithic kernel.
2.1.2 Virtualisation
Closely related to the kernel is the concept of virtualisation of hardware.
Modern computers are very powerful, and often it is useful to not
think of them as one whole system but split a single physical
computer up into separate "virtual" machines. Each of these virtual
machines looks for all intents and purposes as a completely separate
machine, although physically they are all in the same box, in the same
place.
[Figure: Some virtualisation methods. Guest operating systems and their applications run on top of virtual hardware, with a hypervisor multiplexing the physical memory, CPUs and disk between the guests.]
This is often used on large machines (with many CPUs and much
RAM) to implement partitioning. This means the machine can be split
up into smaller virtual machines. Often you can allocate more
resources to running systems on the fly, as requirements dictate. The
hypervisors on many large IBM machines are actually quite
complicated affairs, with many millions of lines of code, providing a
multitude of system management services.
When one takes a large amount of memory there is less available for
the other. If both keep track of the maximum allocations, a bit of
information can be transferred. Say they make a pact to check every
second if they can allocate this large amount of memory. If the target
can, that is considered binary 0, and if it can not (the other machine
has all the memory), that is considered binary 1. A data rate of one bit
per second is not astounding, but information is flowing.
2.2 Userspace
We call the theoretical place where programs are run by the user
3 System Calls
3.1 Overview
System calls are how userspace programs interact with the kernel.
The general principle behind how they work is described below.
3.1.2 Arguments
System calls are no good without arguments; for example open()
needs to tell the kernel exactly what file to open. Once again the ABI
will define which registers arguments should be put into for the
system call.
The rest of the operation is fairly straightforward. The kernel looks in
the predefined register for the system call number, and looks it up in
a table to see which function it should call. This function is called,
does what it needs to do, and places its return value into another
register defined by the ABI as the return register.
The final step is for the kernel to make a jump instruction back to the
userspace program, so it can continue from where it left off. The
userspace program gets the data it needs from the return register,
and continues happily on its way!
Although the details of the process can get quite hairy, this is
basically all there is to a system call.
3.1.4 libc
Although you can do all of the above by hand for each system call,
system libraries usually do most of the work for you. The standard
library that deals with system calls on UNIX like systems is libc ; we
will learn more about its roles in future weeks.
#include <stdio.h>
/* for syscall() */
#include <sys/syscall.h>
#include <unistd.h>

void function(void)
{
        int pid;

        pid = syscall(SYS_getpid);
        printf("pid is %d\n", pid);
}
under the hood. We're going to look at real code, so things can get
quite hairy. But stick with it -- this is exactly how your system works!
3.2.1 PowerPC
PowerPC is a RISC architecture common in older Apple computers,
and the core of devices such as the latest version of the Xbox.
/* On powerpc a system call basically clobbers the same registers like a
 * function call, with the exception of LR (which is needed for the
 * "sc; bnslr" sequence) and CR (where only CR0.SO is clobbered to signal
 * an error return status).
 */

        __sc_loadargs_##nr(name, args); \
        __asm__ __volatile__ \
                ("sc           \n\t" \
                 "mfcr %0      " \
                : "=&r" (__sc_0), \
                  "=&r" (__sc_3),  "=&r" (__sc_4), \
                  "=&r" (__sc_5),  "=&r" (__sc_6), \
                  "=&r" (__sc_7) \
                : __sc_asm_input_##nr \
                : "cr0", "ctr", "memory", \
                  "r8", "r9", "r10", "r11", "r12"); \
        __sc_ret = __sc_3; \
        __sc_err = __sc_0; \
        } \
        if (__sc_err & 0x10000000) \
        { \
                errno = __sc_ret; \
                __sc_ret = -1; \
        } \
        return (type) __sc_ret
#define _syscall0(type,name) \
type name(void) \
{ \
        __syscall_nr(0, type, name); \
}

#define _syscall1(type,name,type1,arg1) \
type name(type1 arg1) \
{ \
        __syscall_nr(1, type, name, arg1); \
}

#define _syscall2(type,name,type1,arg1,type2,arg2) \
type name(type1 arg1, type2 arg2) \
{ \
        __syscall_nr(2, type, name, arg1, arg2); \
}

#define _syscall3(type,name,type1,arg1,type2,arg2,type3,arg3) \
type name(type1 arg1, type2 arg2, type3 arg3) \
{ \
        __syscall_nr(3, type, name, arg1, arg2, arg3); \
}

#define _syscall4(type,name,type1,arg1,type2,arg2,type3,arg3,type4,arg4) \
type name(type1 arg1, type2 arg2, type3 arg3, type4 arg4) \
{ \
        __syscall_nr(4, type, name, arg1, arg2, arg3, arg4); \
}

#define _syscall5(type,name,type1,arg1,type2,arg2,type3,arg3,type4,arg4,type5,arg5) \
type name(type1 arg1, type2 arg2, type3 arg3, type4 arg4, type5 arg5) \
{ \
        __syscall_nr(5, type, name, arg1, arg2, arg3, arg4, arg5); \
}
This code snippet from the kernel header file asm/unistd.h shows
how we can implement system calls on PowerPC. It looks very
complicated, but it can be broken down step by step.
Firstly, jump to the end of the example where the _syscallN macros
are defined. You can see there are many macros, each one taking one
more argument than the last. We'll concentrate on the simplest
version, _syscall0, to start with. It only takes two arguments: the
return type of the system call (e.g. a C int or char) and the
name of the system call. For getpid this would be done as
_syscall0(int,getpid) .
The first step is declaring some names for registers. What this
essentially does is say that __sc_0 refers to r0 (i.e. register 0). The
compiler will usually use registers as it sees fit, so it is important we
give it constraints so that it doesn't decide to go using a register we
need in some ad-hoc manner.
So, all this tricky looking code actually does is put the system call
number in register 0! Following the code through, we can see that the
other macros will place the system call arguments into r3 through
r7 (you can only have a maximum of 5 arguments to your system
call).
Now we are ready to tackle the __asm__ section. What we have here
is called inline assembly because it is assembler code mixed right in
with the C source. The exact syntax is a little too complicated to go
into right here, but we can point out the important parts.
Just ignore the __volatile__ bit for now; it is telling the compiler that
this code is unpredictable so it shouldn't try and be clever with it.
Again we'll start at the end and work backwards. All the stuff after
the colons is a way of communicating to the compiler about what the
inline assembly is doing to the CPU registers. The compiler needs to
know so that it doesn't try using any of these registers in ways that
might cause a crash.
But the interesting part is the two assembly statements in the first
argument. The one that does all the work is the sc call. That's all you
need to do to make your system call!
So, what happens once the system call handler runs and completes?
Control returns to the next instruction after the sc; in this case
mfcr %0, which copies the condition register into a general purpose
register so that the error status can be checked. Note also the
"memory" entry in the clobber list: it acts as a fence, telling the
compiler that memory may have changed across the call, so any values
it thinks have been written to memory actually have been, and are not
being held in a register somewhere.
Well, we're almost done! The only thing left is to return the value
from the system call. We see that __sc_ret is set from r3 and
__sc_err is set from r0. This is interesting; what are these two values
all about?
One is the return value, and one is the error value. Why do we need
two variables? System calls can fail, just as any other function. The
problem is that a system call can return any possible value; we can
not say "a negative value indicates failure" since a negative value
might be perfectly acceptable for some particular system call.
After the call, the code checks __sc_err to see if the error bit is
set; this would indicate the call failed. If so, we set the global
errno value to the returned error code (errno being the standard
variable for getting error information on call failure) and set the
return value to -1. Of course, if a valid result is received we return
it directly.
#define _syscall1(type,name,type1,arg1) \
type name(type1 arg1) \
{ \
long __res; \
__asm__ volatile ("int $0x80" \
	: "=a" (__res) \
	: "0" (__NR_##name),"b" ((long)(arg1))); \
__syscall_return(type,__res); \
}
#define _syscall2(type,name,type1,arg1,type2,arg2) \
type name(type1 arg1,type2 arg2) \
{ \
long __res; \
__asm__ volatile ("int $0x80" \
	: "=a" (__res) \
	: "0" (__NR_##name),"b" ((long)(arg1)),"c" ((long)(arg2))); \
__syscall_return(type,__res); \
}

#define _syscall3(type,name,type1,arg1,type2,arg2,type3,arg3) \
type name(type1 arg1,type2 arg2,type3 arg3) \
{ \
long __res; \
__asm__ volatile ("int $0x80" \
	: "=a" (__res) \
	: "0" (__NR_##name),"b" ((long)(arg1)),"c" ((long)(arg2)), \
	  "d" ((long)(arg3))); \
__syscall_return(type,__res); \
}

#define _syscall4(type,name,type1,arg1,type2,arg2,type3,arg3,type4,arg4) \
type name (type1 arg1, type2 arg2, type3 arg3, type4 arg4) \
{ \
long __res; \
__asm__ volatile ("int $0x80" \
	: "=a" (__res) \
	: "0" (__NR_##name),"b" ((long)(arg1)),"c" ((long)(arg2)), \
	  "d" ((long)(arg3)),"S" ((long)(arg4))); \
__syscall_return(type,__res); \
}

#define _syscall5(type,name,type1,arg1,type2,arg2,type3,arg3,type4,arg4, \
	  type5,arg5) \
type name (type1 arg1,type2 arg2,type3 arg3,type4 arg4,type5 arg5) \
{ \
long __res; \
__asm__ volatile ("int $0x80" \
	: "=a" (__res) \
	: "0" (__NR_##name),"b" ((long)(arg1)),"c" ((long)(arg2)), \
	  "d" ((long)(arg3)),"S" ((long)(arg4)),"D" ((long)(arg5))); \
__syscall_return(type,__res); \
}

#define _syscall6(type,name,type1,arg1,type2,arg2,type3,arg3,type4,arg4, \
	  type5,arg5,type6,arg6) \
type name (type1 arg1,type2 arg2,type3 arg3,type4 arg4,type5 arg5,type6 arg6) \
{ \
long __res; \
__asm__ volatile ("push %%ebp ; movl %%eax,%%ebp ; movl %1,%%eax ; int $0x80 ; pop %%ebp" \
	: "=a" (__res) \
	: "i" (__NR_##name),"b" ((long)(arg1)),"c" ((long)(arg2)), \
	  "d" ((long)(arg3)),"S" ((long)(arg4)),"D" ((long)(arg5)), \
	  "0" ((long)(arg6))); \
__syscall_return(type,__res); \
}
x86 register names are based around letters, rather than the
numerically named registers of PowerPC. We can see from the
zero argument macro that only the A register gets loaded; from this
we can tell that the system call number is expected in the EAX
register. As we start loading registers in the other macros you can see
the short names of the registers in the arguments to the __asm__ call.
Another thing you might notice is that there is nothing like the
memory fence handling we saw previously with the PowerPC. This is
because on x86 the effects of all instructions are guaranteed to be
visible when they complete. This is easier for the compiler (and
programmer) to program for, but offers less flexibility.
The only thing left to contrast is the return value. On the PowerPC we
had two registers with return values from the kernel, one with the
value and one with an error code. However on x86 we only have one
return value that is passed into __syscall_return . That macro casts
the return value to unsigned long and compares it to an (architecture
and kernel dependent) range of negative values that might represent
error codes (note that the errno value is positive, so the negative
result from the kernel is negated). However, this means that system
calls can not return small negative values, since they are
indistinguishable from error codes. Some system calls that have this
requirement, such as getpriority() , add an offset to their return
value to force it to always be positive; it is up to the userspace to
realise this and subtract this constant value to get back to the "real"
value.
4 Privileges
4.1 Hardware
We mentioned how one of the major tasks of the operating system is
to implement security; that is, to not allow one application or user to
interfere with any other running in the system. This means
applications should not be able to overwrite each other's memory or
files, and may only access system resources as dictated by system policy.
[Figure: privilege rings — concentric rings of protection, from Ring 0 (innermost, most privileged) out through Ring 1, Ring 2 and Ring n.]
In the innermost ring are the most protected instructions; those that
only the kernel should be allowed to call. For example, the HLT
instruction to halt the processor should not be allowed to be run by a
user application, since it would stop the entire computer from
working. However, the kernel needs to be able to call this instruction
when the computer is legitimately shut down.1
Each inner ring can access any instructions protected by a further out
ring, but not any protected by a further in ring. Not all architectures
have multiple levels of rings as above, but most will provide for
at least a "kernel" and a "user" level.
The 386 protection model has four rings, though most operating
systems (such as Linux and Windows) only use two of the rings, to
maintain compatibility with other architectures that do not allow as
many.
Applications may only raise their privilege level by specific calls that
allow it, such as the instruction to implement a system call. These are
usually referred to as a call gate because they function just as a
physical gate; a small entry through an otherwise impenetrable wall.
When that instruction is called we have seen how the hardware
completely stops the running application and hands control over to
the kernel. The kernel must act as a gatekeeper; ensuring that
nothing nasty is coming through the gate. This means it must check
system call arguments carefully to make sure it will not be fooled into
doing anything it shouldn't (if it can be, that is a security bug). As the
kernel runs in the innermost ring, it has permissions to do any
operation it wants; when it is finished it will return control back to
the application which will again be running with its lower privilege
level.
One problem with traps as described above is that they are very
expensive for the processor to implement. There is a lot of state to be
saved before the context can switch. Designers of modern processors
have recognised this overhead and strive to reduce it.
[Figure: x86 segmentation — 16-bit segment registers (CS:0x1000, DS:0x4000, SS:0x10000) select 64 KiB CODE, DATA and stack segments within a 20-bit (2^20, 1 MiB) address space.]
[Figure: x86 protected-mode segment descriptors — each descriptor records a segment's start (e.g. 0x1000, 0x4000, 0x5000), size (0x1000), ring (0 or 3) and type (CODE, DATA, TSS or GATE); a "far" call from ring 3 process code invokes a call gate that redirects into a ring 0 protected code segment.]
x86 segments in action. Notice how a "far-call" passes via a call-gate which
redirects to a segment of code running at a lower ring level. The only way
to modify the code-segment selector, implicitly used for all code addresses,
is via the call mechanism. Thus the call-gate mechanism ensures that to
choose a new segment descriptor, and hence possibly change protection
levels, you must transition via a known entry point.
The problem with this scheme is that it is slow. It takes a lot of effort
to do all this checking, and many registers need to be saved to get
into the new code. And on the way back out, it all needs to be
restored again.
Because the general nature has been replaced with so much prior-
known information, the whole process can be sped up, and hence we
have the aforementioned fast system call. The other thing to note is
that state is not preserved when the kernel gets control. The kernel
has to be careful not to destroy state, but it also means it is free to
save only as little state as is required to do the job, so it can be much
more efficient about it. This is a very RISC philosophy, and illustrates
how the line blurs between RISC and CISC processors.
4.2.1 ioctl
The ioctl system call is a catch-all for device-specific operations that
do not fit the standard read/write model. Each driver can define its
own ioctl numbers, which a process passes (along with an argument) to
configure or query the underlying device.
2 Elements of a process
• Process ID
• Memory
• Files
• Registers
• Kernel State
2.1 Process ID
The process ID (or the PID) is assigned by the operating system and is
unique to each running process.
2.2 Memory
We will learn exactly how a process gets its memory in the following
weeks -- it is one of the most fundamental parts of how the operating
system works. However, for now it is sufficient to know that each
process gets its own section of memory.
In this memory all the program code is stored, along with the variables
and data the code operates on.
By convention, stacks usually grow down.2 This means that the stack
starts at a high address in memory and grows towards lower addresses
as it is used.
1. Not all architectures support this, however. This has led to a wide
range of security problems on many architectures.
2. Some architectures, such as PA-RISC from HP, have stacks that grow
upwards.
[Figure: stack frames during a function call. function1(int x, int y) { int z; z = function2(x+y); } calls int function2(int a) { return a + 100; }. Each frame, starting from the high addresses and growing down, holds the return address, the input arguments (x and y, then a) and the local variables (z).]
We can see how having a stack brings about many of the features of
functions.
• Each function has its own copy of its input arguments. This is
because each function is allocated a new stack frame with its
arguments in a fresh area of memory.
You can see how the way functions work fits exactly into the
nature of a stack. Any function can call any other function,
which then becomes the top-most function (put on top of the
stack). Eventually that function will return to the function that
called it (taking itself off the stack).
the process.
$ cat sp.c
void function(void)
{
	int i = 100;
	int j = 200;
	int k = 300;
}
When compiled, the first thing the function does is reserve space on
the stack for its local variables. Since the stack grows down, we
subtract from the value held in the stack pointer. The value 16 is
large enough to hold our local variables, but may not be exactly the
size required (for example, with three 4-byte int values we really
only need 12 bytes, not 16); the extra keeps the stack aligned in
memory on certain boundaries as the compiler requires.
Then we move the values into the stack memory (and in a real
function, use them). Finally, before returning to our parent function
we "pop" the values off the stack by moving the stack pointer back to
where it was before we started.
The top of the heap is known as the brk, so called for the system
call which modifies it. By using the brk call to move this point
upwards the process can request the kernel allocate more memory
for it to use.
The heap is most commonly managed by the malloc library call. This
makes managing the heap easy for the programmer by allowing them
to simply allocate and free (via the free call) heap memory. malloc
can use schemes like a buddy allocator to manage the heap memory
for the user. malloc can also be smarter about allocation and
potentially use anonymous mmaps for extra process memory. This is
where instead of mmaping a file into the process memory it directly
maps an area of system RAM. This can be more efficient. Due to the
complexity of managing memory correctly, it is very uncommon for
any modern program to have a reason to call brk directly.
Many compilers also keep a frame pointer, a register that records
the start of the stack frame. Having this pointer helps debuggers to
walk upwards through the stack frames; however, it makes one less
register available for other uses.
[Figure: process memory layout — from the bottom up: the program image (Code, Data, BSS), the Heap growing up via malloc() to the brk, the mmap area and Shared Libraries, the Stack growing down, and the Kernel at the top of the address space.]
Thus, file descriptors are kept by the kernel individually for each
process.
File descriptors also have permissions. For example, you may be able
to read from a file but not write to it. When the file is opened, the
operating system keeps a record of the process's permissions to that
file in the file descriptor and doesn't allow the process to do anything
it shouldn't.
2.4 Registers
We know from the previous chapter that the processor essentially
performs simple operations on values in registers, and that these
values are read from (and written to) memory -- we mentioned above that
each process is allocated memory which the kernel keeps track of.
2.5.2 Priority
Some processes are more important than others, and get a higher
priority. See the discussion on the scheduler below.
2.5.3 Statistics
The kernel can keep statistics on each process's behaviour which can
help it make decisions about how to treat the process; for example,
does it mostly read from disk, or does it mostly do CPU-intensive
operations?
3 Process Hierarchy
Whilst the operating system can run many processes at the same
time, in fact it only ever directly starts one process, called the init
(short for initial) process. This isn't a particularly special process
except that its PID is always 1 and it will always be running.
1. The term spawn is often used when talking about parent processes
creating children; as in "the process spawned a child".
init-+-apmd
     |-atd
     |-cron
...
     |-dhclient
     |-firefox-bin-+-firefox-bin---2*[firefox-bin]
     |             |-java_vm---java_vm---13*[java_vm]
     |             `-swf_play
4.1 Fork
When you come to a metaphorical "fork in the road" you generally have
two options to take, and your decision affects your future. Computer
programs reach this fork in the road when they hit the fork() system
call.
At this point, the operating system will create a new process that is
exactly the same as the parent process. This means all the state that
was talked about previously is copied, including open files, register
state and all memory allocations, which includes the program code.
The return value from the system call is the only way the process can
determine if it was the existing process or a new one. The return
value to the parent process will be the Process ID (PID) of the child,
whilst the child will get a return value of 0.
At this point, we say the process has forked and we have the parent-
child relationship as described above.
4.2 Exec
Forking provides a way for an existing process to start a new one, but
what about the case where the new process is not part of the same
program as the parent process? This is the case in the shell; when a user
starts a command it needs to run in a new process, but one unrelated
to the shell.
This is where the exec system call comes into play. exec will replace
the contents of the currently running process with the information
from a program binary.
Thus the process the shell follows when launching a new program is
to firstly fork , creating a new process, and then exec (i.e. load into
memory and execute) the program binary it is supposed to run.
4.3.1 clone
In the kernel, fork is actually implemented by a clone system call.
This clone interface effectively provides a level of abstraction in
how the Linux kernel can create processes.
clone allows you to explicitly specify which parts of the parent
are copied into the new process, and which parts are shared between
the two processes. This may seem a bit strange at first, but allows us
to easily implement threads with one very simple interface.
4.3.1.1 Threads
[Figure: a multi-threaded process — the Process ID, Memory and Files are shared by the whole process, while each thread has its own Thread ID and Registers.]
1. Separate processes can not see each others memory. They can
only communicate with each other via other system calls.
The problem that this raises is that threads can very easily step
on each other's toes. One thread might increment a variable,
and another may decrease it without informing the first thread.
These types of problems are called concurrency problems and
they are many and varied.
Thus the other method is that the kernel has full knowledge of the
thread. Under Linux, this is established by making all processes able
to share resources via the clone system call. Each thread still has
associated kernel resources, so the kernel can take it into account
when doing resource allocations.
Copy on write also has a big advantage for exec . Since exec will
simply be overwriting all the memory with the new program, actually
copying the memory would waste a lot of time. Copy on write saves us
actually doing the copy.
On boot the kernel starts the init process, which then forks and execs
the system's boot scripts. These fork and exec more programs,
eventually ending up forking a login process.
The other job of the init process is "reaping". When a process calls
exit with a return code, the parent usually wants to check this code
to see if the child exited correctly or not.
However, this exit code is part of the process which has just called
exit . So the process is "dead" (i.e. not running) but still needs to
stay around until the return code is collected. A process in this state
is called a zombie (the traits of which you can contrast with a
mystical zombie!)
A process stays as a zombie until the parent collects the return code
with the wait call. However, if the parent exits before collecting this
return code, the zombie process is still around, waiting aimlessly to
give its status to someone.
In this case, the zombie child will be reparented to the init process,
which has a special handler that reaps the return value. Thus the
process is finally free and the descriptor can be removed from the
kernel's process table.
$ cat zombie.c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	pid_t pid;

	pid = fork();

	if (pid == 0) {
		/* in child */
		printf("child : %d\n", getpid());
		sleep(2);
		printf("child exit\n");
		exit(1);
	}

	/* in parent */
	while (1)
	{
		sleep(1);
	}
}
$ ps ax | grep [z]ombie
16168 pts/9 S 0:00 ./zombie
16169 pts/9 Z 0:00 [zombie] <defunct>
Below the code you can see the results of running the program. The
parent process (16168) is in state S for sleep (as we expect) and the
child is in state Z for zombie. The ps output also tells us that the
process is defunct in the process description.1
5 Context Switching
Context switching refers to the process the kernel undertakes to
switch from one process to another: the state of the currently running
process (registers, stack pointer and so on) is saved away, and the
previously saved state of the next process is restored so it can resume
from where it left off.
6 Scheduling
A running system has many processes, maybe even into the hundreds
or thousands. The part of the kernel that keeps track of all these
processes is called the scheduler because it schedules which process
should be run next.
People are always coming up with new algorithms, and you can
probably think of your own fairly easily. But there are a number of
different components of scheduling.
1. The square brackets around the "z" of "zombie" are a little trick to
remove the grep process itself from the ps output. grep interprets
everything between the square brackets as a character class, but
because the process name will be "grep [z]ombie" (with the brackets)
this will not match!
6.2 Realtime
Some processes need to know exactly how long their time-slice will
be, and how long it will be before they get another time-slice to run.
Say you have a system running a heart-lung machine; you don't want
the next pulse to be delayed because something else decided to run in
the system!
[Figure: priority scheduling with a bitmap — the scheduler keeps a queue of processes per priority, and a bitmap with one bit per priority lets it quickly find the highest priority that has a runnable process.]
7 The Shell
On a UNIX system, the shell is the standard interface for handling
processes on your system. The shell was once the primary interface;
however, modern Linux systems have a GUI and provide a shell via a
"terminal application" or similar. The primary job of the shell is still to
help the user start, stop and otherwise control processes running in
the system.
When you type a command at the prompt of the shell, it will fork a
copy of itself and exec the command that you have specified.
The shell then, by default, waits for that process to finish running
before returning to a prompt to start the whole process over again.
As an alternative, you can ask the shell to run a command in the
background (usually by appending & to it). The new process then runs
in the background, and the shell is ready and waiting to start a new
process should you desire. You can usually tell the shell to
foreground a backgrounded process, which means we do actually want
to wait for it to finish.
8 Signals
Processes running in the system require a way to be told about events
that influence them. On UNIX there is infrastructure between the
kernel and processes called signals which allows a process to receive
notification about events important to it.
As a process uses the read system call to read input from the
keyboard, the kernel will be watching the input stream looking for
special characters. Should it see a ctrl-c it will jump into signal
handling mode. The kernel will look to see if the process has
registered a handler for this signal. If it has, then execution will be
passed to that function, which will handle it. Should the
process not have registered a handler for this particular signal, then
the kernel will take some default action. With ctrl-c the default
action is to terminate the process.
A process can choose to ignore some signals, but other signals are
not allowed to be ignored. For example, SIGKILL is the signal sent
when a process should be terminated. The kernel will see that the
process has been sent this signal and terminate the process,
no questions asked. The process can not ask the kernel to
ignore this signal, and the kernel is very careful about which process
is allowed to send this signal to another process; you may only send it
to processes owned by you unless you are the root user. You may have
seen the command kill -9 ; this comes from the fact that SIGKILL is
defined to be 0x9 , so when specified as an argument to the kill
program it means the specified process is going to be stopped
immediately. Since the process can not choose to ignore or handle
this signal, it is seen as an avenue of last resort, since the program
will have no chance to clean up or exit cleanly. It is considered better
to first send a SIGTERM (for terminate) to the process, and only if it
has crashed or otherwise will not exit to resort to the SIGKILL .
This raises the question of what happens after the signal is received.
Once the signal handler has finished running, control is returned to
the process which continues on from where it left off.
8.1 Example
Running the following simple program sets off a surprising number of
signals!
$ cat signal.c
#include <stdio.h>
#include <unistd.h>
#include <signal.h>

void sigint_handler(int signum)
{
	printf("got SIGINT\n");
}

int main(void)
{
	signal(SIGINT, sigint_handler);
	printf("pid is %d\n", getpid());
	while (1)
		sleep(1);
}
$ gcc -Wall -o signal signal.c
$ ./signal
pid is 2859
got SIGINT      # press ctrl-c
                # press ctrl-z
[1]+  Stopped                 ./signal
We have a simple program that simply defines a handler for the SIGINT
signal, which is sent when the user presses ctrl-c . All the signals for
the system are defined in signal.h , along with the signal function
which allows us to register the handling function.
The program simply sits in a tight loop doing nothing until it quits.
When we start the program, we try pressing ctrl-c to make it quit.
Rather than taking the default action, our handler is invoked and we
get the output as expected.
You guessed it, more signals! When a child process dies, its parent
gets a SIGCHLD signal back. In this case the shell was the
parent process and so it got the signal. Remember the
zombie process that needs to be reaped with the wait call to get the
return code from the child process? Well, another thing wait also gives
the parent is the signal number that the child may have died from.
Thus the shell knows that the child process died from a SIGABRT and,
as an informational service, prints as much for the user (the same
process happens to print out "Segmentation fault" when a child dies
from a SIGSEGV ).
You can see how, in even a simple program, around five different signals
were used to communicate between processes and the kernel and
keep things running. There are many other signals, but these are
certainly amongst the most common. Most have system functions
defined by the kernel, but there are a few signals reserved for users
to use for their own purposes within their programs ( SIGUSR1 and
SIGUSR2 ).
64-bit computing does have some trade-offs against using smaller bit-
width processors. Every program compiled in 64-bit mode requires
8-byte pointers, which can increase code and data size, and hence
impact both instruction and data cache performance. However, 64-bit
processors tend to have more registers, which means less need to
save temporary variables to memory when the compiler is under
register pressure.
[Figure: canonical 64-bit addresses — only the lower, implemented bits of the full address are used; the remaining upper bits must be a sign-extension of the top implemented bit, i.e. all ones (1111111...1) or all zeros (0000000...0).]
The exact most-significant bit value for the processor can usually be
found by querying the processor itself using its informational
instructions. Although the exact value is implementation dependent, a
typical value would be 48, providing 2^48 = 256 TiB of usable address-
space.
So to this end, we say that all addresses a program uses are virtual.
The operating system keeps track of virtual addresses and how they
are allocated to physical addresses. When a program does a load or
store from an address, the processor and operating system work
together to convert this virtual address to the actual address in the
system memory chips.
3 Pages
The total address-space is divided into individual pages. Pages can be
many different sizes; generally they are around 4 KiB, but this is not a
hard and fast rule and they can be much larger but generally not any
smaller. The page is the smallest unit of memory that the operating
system and hardware can deal with.
[Figure: the address space divided into equal-sized pages.]
4 Physical Memory
Just as the operating system divides the possible address space up
into pages, it divides the available physical memory up into frames. A
frame is just the conventional name for a hunk of physical memory
the same size as the system page size.
How does the operating system know what memory is available? This
information about where memory is located, how much, attributes
and so forth is passed to the operating system by the BIOS during
initialisation.
Page tables can have many different structures and are highly
optimised, as the process of finding a page in the page table can be a
lengthy process. We will examine page-tables in more depth later.
another process.
6 Virtual Addresses
When a program accesses memory, it does not know or care where
the physical memory backing the address is stored. It knows it is up
to the operating system and hardware to work together to locate
the right physical address and thus provide access to the data it
wants. We therefore term the address a program is using to access
memory a virtual address. A virtual address consists of two parts: the
page and an offset into that page.
6.1 Page
Since the entire possible address space is divided up into regular
sized pages, every possible address resides within a page. The page
component of the virtual address acts as an index into the page table.
Since the page is the smallest unit of memory allocation within the
system there is a trade-off between making pages very small, and
thus having very many pages for the operating-system to manage,
and making pages larger but potentially wasting memory.
6.2 Offset
The last bits of the virtual address are called the offset which is the
location difference between the byte address you want and the start
of the page. You require enough bits in the offset to be able to get to
any byte in the page. For a 4K page you require (4K == (4 * 1024) ==
4096 == 212 ==) 12 bits of offset. Remember that the smallest
amount of memory that the operating system or hardware deals with
is a page, so each of these 4096 bytes reside within a single page and
are dealt with as "one".
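The arithmetic above can be sketched in C with a shift and a mask. This is an illustrative fragment assuming 4 KiB pages; the helper names are invented for this example, not part of any real API.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12                          /* 4 KiB pages: 2^12 bytes */
#define PAGE_OFFSET_MASK ((1UL << PAGE_SHIFT) - 1)

/* The page component: everything above the offset bits. */
static inline uint64_t page_number(uint64_t vaddr)
{
        return vaddr >> PAGE_SHIFT;
}

/* The offset component: the low 12 bits. */
static inline uint64_t page_offset(uint64_t vaddr)
{
        return vaddr & PAGE_OFFSET_MASK;
}
```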
Since the page-tables are under the control of the operating system, if
the virtual-address doesn't exist in the page-table then the operating-
system knows the process is trying to access memory that has not
been allocated to it and the access will not be allowed.
[Figure: translating a virtual address to a physical address via the page table]
We can follow this through for our previous example of a simple linear
page-table. We calculated that a 32-bit address-space would require a
table of 1048576 entries when using 4KiB pages. Thus to map a
theoretical address of 0x80001234, the first step would be to remove
the offset bits. In this case, with 4KiB pages, we know we have 12-bits
(2^12 == 4096) of offset. So we would right-shift out 12-bits of the
virtual address, leaving us with 0x80001. Thus (in decimal) the value
in row 524289 of the linear page table would be the physical frame
corresponding to this page.
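This lookup can be sketched as follows. The structure is a deliberate simplification for illustration; real page-table entries also carry permission and status bits alongside the frame number.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define NUM_PAGES  (1UL << (32 - PAGE_SHIFT))   /* 1048576 entries */

/* A linear page table: one entry per possible page, holding the
 * physical frame number for that page. */
static uint32_t page_table[NUM_PAGES];

/* Shift out the offset bits to index the table, then re-attach the
 * offset to the physical frame number found there. */
static uint32_t translate(uint32_t vaddr)
{
        uint32_t page   = vaddr >> PAGE_SHIFT;
        uint32_t offset = vaddr & ((1UL << PAGE_SHIFT) - 1);
        uint32_t frame  = page_table[page];

        return (frame << PAGE_SHIFT) | offset;
}
```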
You might see a problem with a linear page-table: since every page
must be accounted for, whether in use or not, a physically linear
page-table is completely impractical with a 64-bit address space.
Consider a 64-bit address space divided into 64 KiB pages: this creates
2^64/2^16 = 2^48 pages to be managed. Assuming each page requires an
8-byte pointer to a physical location, a total of 2^48 * 2^3 = 2^51 bytes,
or 2 PiB, of contiguous memory would be required just for the page table!
There are ways to split addressing up that avoid this which we will
discuss later.
7 Consequences of virtual
addresses, pages and page
tables
Virtual addressing, pages and page-tables are the basis of every
modern operating system. They underpin most of the things we use our
systems for.
7.2 Protection
We previously mentioned that the virtual mode of the 386 processor is
called protected mode, and this name arises from the protection that
virtual memory can offer to processes running on it.
Since each page has extra attributes, a page can be set read only,
write only or have any number of other interesting properties. When
the process tries to access the page, the operating system can check
if it has sufficient permissions and stop it if it does not (writing to a
read only page, for example).
Systems that use virtual memory are inherently more stable because,
assuming the perfect operating system, a process can only crash itself
and not the entire system (of course, humans write operating systems
and we inevitably overlook bugs that can still cause entire systems to
crash).
7.3 Swap
We can also now see how swap memory is implemented. Instead of
pointing to an area of system memory, the page-table entry can be
changed to point to a location on a disk.
This can be a major issue for swap memory. Loading from the hard
disk is very slow (compared to operations done in memory) and most
people will be familiar with sitting in front of the computer whilst the
hard disk churns and churns whilst the system remains unresponsive.
7.3.1 mmap
A different but related process is the memory map, or mmap (from the
system call name). If instead of the page table pointing to physical
memory or swap the page table points to a file, on disk, we say the
file is mmaped.
You can see now how threads are implemented. In Section 4.3.1,
clone, we said that the Linux clone() function could share as much
or as little of a new process with the old process as it required. If a
process calls clone() to create a new process, but requests that the
two processes share the same page table, then you effectively have a
thread as both processes see the same underlying physical memory.
You can also see now how copy on write is done. If you set the
permissions of a page to be read-only, when a process tries to write to
the page the operating system will be notified. If it knows that this
page is a copy-on-write page, then it needs to make a new copy of the
page in system memory and point the page in the page table to this
new page. This can then have its attributes updated to have write
permissions and the process has its own unique copy of the page.
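A toy sketch of this fault-handling logic is below. The structure and names are invented for illustration; a real kernel manipulates hardware page-table entries, not heap buffers.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TOY_PAGE_SIZE 16

/* A hypothetical page-table entry for illustrating copy-on-write. */
struct pte {
        char *frame;      /* the backing "physical" memory */
        int   writable;
        int   cow;        /* marked copy-on-write? */
};

/* What the fault handler does when a write hits a read-only
 * copy-on-write page: copy the frame, point the entry at the
 * private copy, and make it writable. */
static void write_byte(struct pte *p, int off, char val)
{
        if (!p->writable && p->cow) {
                char *copy = malloc(TOY_PAGE_SIZE);
                memcpy(copy, p->frame, TOY_PAGE_SIZE);
                p->frame = copy;
                p->writable = 1;
                p->cow = 0;
        }
        p->frame[off] = val;
}
```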
The memory hierarchy tells us that disk access is much slower than
memory access, so it makes sense to move as much data from disk
into system memory as possible.
Linux, and many other systems, will copy data from files on disk into
memory when they are used. Even if a program only initially requests
a small part of the file, it is highly likely that as it continues
processing it will want to access the rest of the file. When the operating
system has to read or write to a file, it first checks if the file is in its
memory cache.
The page cache refers to a list of pages the kernel keeps that refer to
files on disk. From above, swap pages, mmaped pages and disk cache
pages all fall into this category. The kernel keeps this list because it
needs to be able to look them up quickly in response to read and
write requests.
8 Hardware Support
So far, we have only mentioned that hardware works with the
operating system to implement virtual memory. However we have
glossed over the details of exactly how this happens.
1. Imagine that the maximum offset was 32 bits; in this case the entire
address space could be accessed as an offset from a segment at
0x00000000 and you would essentially have a flat layout -- but it still isn't
as good as virtual memory as you will see. In fact, the only reason it is
16 bits is because the original Intel processors were limited to this, and
the chips maintain backwards compatibility.
[Figure: three segment registers pointing to segments in memory, with the maximum offset of each shown by shading]
In the above figure, there are three segment registers which are all
pointing to segments. The maximum offset (constrained by the
number of bits available) is shown by shading. If the program wants
an address outside this range, the segment registers must be
reconfigured. This quickly becomes a major annoyance. Virtual
memory, on the other hand, allows the program to specify any
address and the operating system and hardware do the hard work of
translating to a physical address.
However, if the processor can not find a translation in the TLB, the
processor must raise a page fault. This is similar to an interrupt (as
discussed before) which the operating system must handle.
In the case that the operating system can not find a translation in the
page table, or alternatively if the operating system checks the
permissions of the page in question and the process is not authorised
to access it, the operating system must kill the process. If you have
ever seen a segmentation fault (or a segfault) this is the operating
system killing a process that has overstepped its bounds.
Should the translation be found, and the TLB currently be full, then
one translation needs to be removed before another can be inserted.
It does not make sense to remove a translation that is likely to be
used in the future, as you will incur the cost of finding the entry in the
page-tables all over again. TLBs usually use something like a Least
Recently Used or LRU algorithm, where the oldest translation that
has not been used is ejected in favour of the new one.
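The replacement policy just described can be sketched with a toy software TLB. Real TLBs implement this in hardware; the entry layout and logical timestamps here are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define TLB_ENTRIES 4

struct tlb_entry {
        uint64_t page, frame;
        uint64_t last_used;   /* logical timestamp for LRU */
        int      valid;
};

static struct tlb_entry tlb[TLB_ENTRIES];
static uint64_t clock_tick;

/* Insert a translation, evicting the least recently used entry
 * if the TLB is full. */
static void tlb_insert(uint64_t page, uint64_t frame)
{
        int victim = 0;
        for (int i = 0; i < TLB_ENTRIES; i++) {
                if (!tlb[i].valid) { victim = i; break; }
                if (tlb[i].last_used < tlb[victim].last_used)
                        victim = i;
        }
        tlb[victim] = (struct tlb_entry){ page, frame, ++clock_tick, 1 };
}

/* Look up a page; on a hit, refresh its timestamp so it is not
 * the next eviction candidate. */
static int tlb_lookup(uint64_t page, uint64_t *frame)
{
        for (int i = 0; i < TLB_ENTRIES; i++)
                if (tlb[i].valid && tlb[i].page == page) {
                        tlb[i].last_used = ++clock_tick;
                        *frame = tlb[i].frame;
                        return 1;
                }
        return 0;
}
```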
The access can then be tried again, and, all going well, should be
found in the TLB and translated correctly.
When we say that the operating system finds the translation in the
page table, it is logical to ask how the operating system finds the
memory that has the page table.
The base of the page table will be kept in a register associated with
each process. This is usually called the page-table base-register or
similar. By taking the address in this register and adding the page
number to it, the correct entry can be located.
An accessed page is simply any page that has been accessed. When a
page translation is initially loaded into the TLB the page can be
marked as having been accessed (else why were you loading it in?1)
The operating system can periodically go through all the pages and
clear the accessed bit to get an idea of what pages are currently in
use. When system memory becomes full and it comes time for the
operating system to choose pages to be swapped out to disk,
obviously those pages whose accessed bit has not been reset are the
best candidates for removal, because they have not been used the
longest.
A dirty page is one that has data written to it, and so does not match
any data already on disk. For example, if a page is loaded in from
swap and then written to by a process, before it can be moved out of
swap it needs to have its on disk copy updated. A page that is clean
has had no changes, so we do not need the overhead of copying the
page back to disk.
Both are similar in that they help the operating system to manage
pages. The general concept is that a page has two extra bits; the dirty
bit and the accessed bit. When the page is put into the TLB, these bits
are set to indicate that the CPU should raise a fault.
9 Linux Specifics
Although the basic concepts of virtual memory remain constant, the
specifics of implementations are highly dependent on the operating
system and hardware.
[Figure: the Linux address-space layout, showing shared and private kernel pages mapped via page tables to physical memory]
As the page tables use a hierarchy that is three levels deep, the Linux
scheme is most commonly referred to as the three level page table.
The three level page table has proven to be a robust choice, although it
is not without its critics. The details of the virtual memory
implementation of each processor vary widely, meaning that the
generic page table Linux chooses must be portable and relatively
generic.
The concept of the three level page table is not difficult. We already
know that a virtual address consists of a page number and an offset in
the physical memory page. In a three level page table, the virtual
address is further split up into a number of levels.
Each level is a page table in its own right; i.e. it maps a page number
to a physical frame. In a single level page table the "level 1" entry
would directly map to the physical frame. In the multilevel version
each of the upper levels gives the address of the physical memory
frame holding the next lower level's page table.
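A sketch of such a walk in C is below, using two levels for brevity (the three level case simply adds one more indirection). The 10/10/12 bit split of a 32-bit address is an assumption for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define OFFSET_BITS 12                 /* 4 KiB pages */
#define LEVEL_BITS  10
#define LEVEL_SIZE  (1u << LEVEL_BITS)

/* The top-level table holds pointers to next-level tables; the
 * last-level table holds physical frame numbers. */
static uint32_t walk(uint32_t **pgd, uint32_t vaddr)
{
        uint32_t top    = vaddr >> (OFFSET_BITS + LEVEL_BITS);
        uint32_t middle = (vaddr >> OFFSET_BITS) & (LEVEL_SIZE - 1);
        uint32_t offset = vaddr & ((1u << OFFSET_BITS) - 1);

        uint32_t *pte_table = pgd[top];     /* level 1 -> level 2 table */
        uint32_t frame = pte_table[middle]; /* level 2 -> physical frame */

        return (frame << OFFSET_BITS) | offset;
}
```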
[Figure: a virtual address split across the levels of a multi-level page table]
In a three level system, the first level is only one physical frame of
memory. This maps to a second level, which is again only a single
frame of memory, and again with the third. Consequently, the three
level system reduces the number of pages required to only a fraction
of those required for the single level system.
The part of the processor that deals with virtual memory is generally
referred to as the Memory Management Unit, or MMU.
10.1 x86-64
XXX
10.2 Itanium
The Itanium MMU provides many interesting features for the
operating system to work with virtual memory.
[Figure: two processes each mapping address 0x1000 into a shared region, protected by a shared key]
The Itanium MMU considers these problems and provides the ability to share translations between processes.
[Figure: Itanium address translation — the virtual region number (VRN) indexes a region register, whose region ID is searched along with the virtual page number (VPN)]
Further to this, the top three bits (the region bits) are not considered
in virtual address translation. Therefore, if two processes share a RID
(i.e., hold the same value in one of their region registers) then they
have an aliased view of that region. For example, if process-A holds
RID 0x100 in region-register 3 and process-B holds the same RID in
one of its region registers, both processes will see the same region.
To allow for even finer grained sharing, each TLB entry on the
Itanium is also tagged with a protection key. Each process has an
additional number of protection key registers under operating-system
control.
The key can also enforce permissions; for example, one process may
have a key which grants write permissions and another may have a
read-only key. This allows for sharing of translation entries in a much
wider range of situations with granularity right down to a single-page
level, leading to large potential improvements in TLB performance.
[Figure: splitting the virtual address 0x123400 into a page number and an offset, for a given page size]
[Figure: conceptual view of a hierarchical page table (PGD, PMD, PTE), with the PTEs for a contiguous region of virtual addresses laid out linearly and mapped via the TLB]
[Figure 10.2.2.1.3: Itanium PTE entry formats — the 64-bit short format, and the 4 x 64-bit long format with hash, tag, chain, protection key (PKEY), page size (psize) and PPN fields]
The major drawback is that the VLPT now requires TLB entries which
causes an increase in TLB pressure. Since each address space
requires its own page table the overheads become greater as the
system becomes more active. However, any increase in TLB capacity
misses should be more than regained in lower refill costs from the
efficient hardware walker. Note that a pathological case could skip
over page_size ÷ translation_size entries, causing repeated nested
faults, but this is a very unlikely access pattern.
Using TLB entries in an effort to reduce TLB refill costs, as done with
the SF-VHPT, may or may not be an effective trade-off. Itanium also
implements a hashed page-table with the potential to lower TLB
overheads. In this scheme, the processor hashes a virtual address to
find an offset into a contiguous table.
The extra information required for each translation entry gives rise to
the moniker long-format VHPT (LF-VHPT). Translation entries grow
to 32-bytes as illustrated on the right hand side of Figure 10.2.2.1.3,
Itanium PTE entry formats.
The main advantage of this approach is the global hash table can be
pinned with a single TLB entry. Since all processes share the table it
should scale better than the SF-VHPT, where each process requires
increasing numbers of TLB entries for VLPT pages. However, the
larger entries are less cache friendly; consider we can fit four 8-byte
short-format entries for every 32-byte long-format entry. The very
large caches on the Itanium processor may help mitigate this impact,
however.
One advantage of the SF-VHPT is that the operating system can keep
translations in a hierarchical page-table and, as long as the hardware
translation format is maintained, can map leaf pages directly to the
VLPT. With the LF-VHPT the OS must either use the hash table as the
primary source of translation entries or otherwise keep the hash table
as a cache of its own translation information. Keeping the LF-VHPT
hash table as a cache is somewhat sub-optimal because of increased
overheads on time-critical fault paths; however, advantages are
gained from the table requiring only a single TLB entry.
2 Building an executable
When we talk about the compiler, there are actually three separate
steps involved in creating the executable file.
1. Compiling
2. Assembling
3. Linking
Each link in the chain takes the source code progressively closer to
being binary code suitable for execution.
3 Compiling
3.1 The process of compiling
The first step of compiling a source file to an executable file is
converting the code from the high level, human understandable
language to assembly code. We know from previous chapters that
assembly code works directly with the instructions and registers
provided by the processor.
3.1.1 C code
With C code, there is actually a step before parsing the source code
called the pre-processor. The pre-processor is at its core a text
replacement program. For example, any variable declared as #define
variable text will have variable replaced with text . This
preprocessed code is then passed into the compiler.
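For example, after pre-processing the fragment below, the compiler proper only ever sees the replaced tokens (you can inspect the expanded output yourself with gcc -E):

```c
#include <assert.h>

/* The pre-processor replaces every later occurrence of BUFFER_SIZE
 * with the token 512 before the compiler proper sees the code. */
#define BUFFER_SIZE 512
#define DOUBLE(x) ((x) * 2)

static int expanded_size(void)
{
        /* after pre-processing this line reads: return ((512) * 2); */
        return DOUBLE(BUFFER_SIZE);
}
```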
3.2 Syntax
Any computing language has a particular syntax that describes the
rules of the language. Both you and the compiler know the syntax
rules, and all going well you will understand each other. Humans,
being as they are, often forget the rules or break them, leading the
compiler to be unable to understand your intentions. For example, if
you were to leave the closing bracket off an if condition, the compiler
does not know where the actual conditional is.
3.3.1 Alignment
[Figure: loading aligned and unaligned values from memory into CPU registers]
CPUs can generally not load a value into a register from an arbitrary
memory location. They require that variables be aligned on certain
boundaries. In the example above, we can see how a 32 bit (4 byte)
value is loaded into a register on a machine that requires 4 byte
alignment of variables.
The C99 standard only says that structures will be ordered in memory
in the same order as they are specified in the declaration, and that in
an array of structures all elements will be the same size.
$ cat struct.c
#include <stdio.h>

struct a_struct {
	char char_one;
	char char_two;
	int int_one;
};

int main(void)
{
	struct a_struct s;

	printf("%p : s.char_one\n" \
	       "%p : s.char_two\n" \
	       "%p : s.int_one\n",
	       &s.char_one, &s.char_two, &s.int_one);

	return 0;
}

$ ./struct
0x7fdf6798 : s.char_one
0x7fdf6799 : s.char_two
0x7fdf679c : s.int_one

$ ./struct-packed
0x7fcd2778 : s.char_one
0x7fcd2779 : s.char_two
0x7fcd277a : s.int_one

[Figure: the layout of a_struct in memory — s.char_one, s.char_two, then s.int_one, with padding before the int in the unpacked case]
In the other example we direct the compiler not to pad structures and
correspondingly we can see that the integer starts directly after the
two chars.
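One way such a packed binary could be built is with GCC's packed attribute, sketched below; whether the original example used this attribute or a command-line option is not shown above, so treat this as an illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Default layout: the compiler may insert padding after char_two
 * so that int_one starts on a 4-byte boundary. */
struct padded {
        char char_one;
        char char_two;
        int  int_one;
};

/* GCC's packed attribute removes the padding, so int_one starts
 * directly after the two chars. */
struct packed {
        char char_one;
        char char_two;
        int  int_one;
} __attribute__((packed));
```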
One possible way to detect this sort of situation is profiling. When you
profile your code you "watch" it to analyse what code paths are taken
and how long they take to execute. With profile guided optimisation
(PGO) the compiler can put special extra bits of code in the first
binary it builds, which runs and makes a record of the branches
taken, etc. You can then recompile the binary with the extra
information to possibly create a better performing binary. Otherwise
the programmer can look at the output of the profile and possibly
detect situations such as cache line bouncing.
What the compiler has done above is traded off using some extra
memory to gain a speed improvement in running our code. The
compiler knows the rules of the architecture and can make decisions
about the best way to align data, possibly by trading off small
amounts of wasted memory for increased (or perhaps even just
correct) performance.
$ cat stack.c
#include <stdio.h>

struct a_struct {
	int a;
	int b;
};

int main(void)
{
	int i;
	struct a_struct s;
	printf("%p\n%p\ndiff %ld\n", &i, &s,
	       (unsigned long)&s - (unsigned long)&i);
	return 0;
}

$ gcc-3.3 -Wall -o stack-3.3 ./stack.c
$ gcc-4.0 -o stack-4.0 stack.c

$ ./stack-3.3
0x60000fffffc2b510
0x60000fffffc2b520
diff 16

$ ./stack-4.0
0x60000fffff89b520
0x60000fffff89b524
diff 4
Generally you should ensure that you do not make assumptions about
the size of types or alignment rules.
There are a few common sequences of code that deal with alignment;
generally most programs will consider it in some way. You may see
these "code idioms" in many places outside the kernel when dealing
with programs that deal with chunks of data in some form or another,
so it is worth investigating.
We can take some examples from the Linux kernel, which often has to
deal with alignment of pages of memory within the system.
[ include/asm-ia64/page.h ]

/*
 * PAGE_SHIFT determines the actual kernel page size.
 */
#if defined(CONFIG_IA64_PAGE_SIZE_4KB)
# define PAGE_SHIFT	12
#elif defined(CONFIG_IA64_PAGE_SIZE_8KB)
# define PAGE_SHIFT	13
#elif defined(CONFIG_IA64_PAGE_SIZE_16KB)
# define PAGE_SHIFT	14
#elif defined(CONFIG_IA64_PAGE_SIZE_64KB)
# define PAGE_SHIFT	16
#else
# error Unsupported page size!
#endif
Above we can see that there are a number of different options for
page sizes within the kernel, ranging from 4KB through 64KB.
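From the shift value, the page size and common alignment helpers fall out directly. The sketch below mirrors the style of the kernel's macros but is a simplified illustration, not the kernel's actual definitions.

```c
#include <assert.h>

#define PAGE_SHIFT 12                       /* 4 KiB pages */
#define PAGE_SIZE  (1UL << PAGE_SHIFT)
#define PAGE_MASK  (~(PAGE_SIZE - 1))

/* Round an address up to the next page boundary. */
#define PAGE_ALIGN(addr) (((addr) + PAGE_SIZE - 1) & PAGE_MASK)
```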
3.4 Optimisation
Once the compiler has an internal representation of the code, the
really interesting part of the compiler starts. The compiler wants to
find the most optimised assembly language output for the given input
code. This is a large and varied problem and requires knowledge of
everything from efficient algorithms based in computer science to
deep knowledge about the particular processor the code is to be run
on.
There are some common optimisations the compiler can look at when
generating output. There are many, many more strategies for
generating the best code, and it is always an active research area.
Whilst this increases the size of the code, it may allow the processor
to work through the instructions more efficiently as branches can
cause inefficiencies in the pipeline of instructions coming into the
processor.
Thus the compiler can make a prediction about what way the test is
likely to go. There are some simple rules the compiler can use to
guess things like this, for example if (val == -1) is probably not
likely to be true, since -1 usually indicates an error code and
hopefully that will not be triggered too often.
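GCC also lets the programmer state the expectation explicitly with __builtin_expect; the likely/unlikely wrappers below follow a common kernel-style convention and are a sketch, not part of the example above.

```c
#include <assert.h>

/* Hints to the compiler about the probable branch direction.
 * They do not change behaviour, only the generated code layout. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

static int process(int val)
{
        if (unlikely(val == -1))
                return -1;     /* rare error path */
        return val * 2;        /* hot path */
}
```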
Some compilers can actually compile the program, have the user run
it and take note of which way the branches go under real conditions.
It can then re-compile it based on what it has seen.
4 Assembler
The assembly code output by the compiler is still in a human
readable form, should you know the specifics of the assembly code for
the processor. Developers will often take a peek at the assembly
output to manually check that the code is the most optimised or to
discover any bugs in the compiler (this is more common than one
might think, especially when the compiler is being very aggressive
with optimisations).
The assembler's output is called object code; at this stage, it is not
executable. Object code is simply a binary representation of a specific
input source code file. Good programming practice dictates that a
programmer should not "put all their eggs in one basket" by placing all
their source code in one file.
5 Linker
Often in a large program, you will separate out code into multiple
files to keep related functions together. Each of these files can be
compiled into object code: but your final goal is to create a single
executable! There needs to be some way of combining each of these
object files into a single executable. We call this linking.
Note that even if your program does fit in one file it still needs to be
linked against certain system libraries to operate correctly. For
example, the printf call is kept in a library which must be combined
with your executable to work. So although you do not explicitly have
to worry about linking in this case, there is most certainly still a
linking process happening to create your executable.
5.1 Symbols
5.1.1 Symbols
Variables and functions all have names in source code which we refer
to them by. One way of thinking of a statement declaring a variable
int a is that you are telling the compiler "set aside some memory of
sizeof(int) and from now on when I use a it will refer to this
allocated memory". Similarly a function says "store this code in
memory, and when I call function() jump to and execute this code".
Imagine you have split up your program in two files, but some
functions need to share a variable. You only want one definition (i.e.
memory location) of the shared variable (otherwise it wouldn't be
shared!), but both files need to reference it.
To enable this, we declare the variable in one file, and then in the
other file declare a variable of the same name but with the prefix
extern. extern stands for external and to a human means that this
variable is declared somewhere else.
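A minimal sketch is below; the two "files" are shown together here for brevity, but in a real build each would be a separate compilation unit combined by the linker.

```c
#include <assert.h>

/* --- file1.c: the one real definition of the shared variable --- */
int shared = 42;

/* --- file2.c: a reference to it --- */
extern int shared;     /* "declared somewhere else" */

static int read_shared(void)
{
        return shared;
}
```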
6 A practical example
We can walk through the steps taken to build a simple application
step by step.
Note that when you type gcc that actually runs a driver program that
hides most of the steps from you. Under normal circumstances this is
exactly what you want, because the exact commands and options to
get a real life working executable on a real system can be quite
complicated and architecture specific.
#include <stdio.h>
#include <stdlib.h>

/* defined in function.c */
int function(char *input);

int main(void)
{
	/* function() should return the value of global */
	int ret = function("Hello, World!");
	exit(ret);
}
6.1 Compiling
All compilers have an option to only execute the first step of
compilation. Usually this is something like -S and the output will
generally be put into a file with the same name as the input file but
with a .s extension.
Thus we can show the first step with gcc -S as illustrated in the
example below.
$ gcc -S hello.c
$ gcc -S function.c
$ cat function.s
	.file	"function.c"
	.pred.safe_across_calls p1-p5,p16-p63
	.section	.sdata,"aw",@progbits
	.align 4
	.type	i#, @object
	.size	i#, 4
i:
	data4	100
	.section	.rodata
	.align 8
.LC0:
	stringz	"%s\n"
	.text
	.align 16
	.global function#
	.proc function#
function:
	.prologue 14, 33
	.save ar.pfs, r34
	alloc r34 = ar.pfs, 1, 4, 2, 0
	.vframe r35
	mov r35 = r12
	adds r12 = -16, r12
	mov r36 = r1
	.save rp, r33
	mov r33 = b0
	.body
	;;
	st8 [r35] = r32
	addl r14 = @ltoffx(.LC0), r1
	;;
	ld8.mov r37 = [r14], .LC0
	ld8 r38 = [r35]
	br.call.sptk.many b0 = printf#
	mov r1 = r36
	;;
	addl r15 = @ltoffx(global#), r1
	;;
	ld8.mov r14 = [r15], global#
	;;
	ld4 r14 = [r14]
	;;
	mov r8 = r14
	mov ar.pfs = r34
	mov b0 = r33
	.restore sp
	mov r12 = r35
	br.ret.sptk.many b0
	;;
	.endp function#
	.ident	"GCC: (GNU) 3.3.5 (Debian 1:3.3.5-11)"
6.2 Assembly
Assembly is a fairly straightforward process. The assembler is
usually called as and takes arguments in a similar fashion to gcc:
$ as -o function.o function.s
$ as -o hello.o hello.s
$ ls
function.c function.o function.s hello.c hello.o hello.s
• Have a look at the symbols in the function.c file and how they
fit into the output.
6.3 Linking
Actually invoking the linker, called ld , is a very complicated process
on a real system (are you sick of hearing this yet?). This is why we
leave the linking process up to gcc .
But of course we can spy on what gcc is doing under the hood with
the -v (verbose) flag.
/usr/lib/gcc-lib/ia64-linux/3.3.5/collect2 -static
/usr/lib/gcc-lib/ia64-linux/3.3.5/../../../crt1.o
/usr/lib/gcc-lib/ia64-linux/3.3.5/../../../crti.o
/usr/lib/gcc-lib/ia64-linux/3.3.5/crtbegin.o
-L/usr/lib/gcc-lib/ia64-linux/3.3.5
-L/usr/lib/gcc-lib/ia64-linux/3.3.5/../../..
hello.o
function.o
--start-group
-lgcc
-lgcc_eh
-lunwind
-lc
--end-group
/usr/lib/gcc-lib/ia64-linux/3.3.5/crtend.o
/usr/lib/gcc-lib/ia64-linux/3.3.5/../../../crtn.o
The first thing you notice is that a program called collect2 is being
called. This is a simple wrapper around ld that is used internally by
gcc.
The next thing you notice is object files starting with crt being
specified to the linker. These functions are provided by gcc and the
system libraries and contain code required to start the program. In
actuality, the main() function is not the first one called when a
program runs, but a function called _start which is in the crt object
files. This function does some generic setup which application
programmers do not need to worry about.
crtbegin.o
crtsaveres.o
crtend.o
crtn.o
We discuss how these are used to start the program a little later.
Next you can see that we link in our two object files, hello.o and
function.o . After that we specify some extra libraries with -l flags.
These libraries are system specific and required for every program.
The major one is -lc which brings in the C library, which has all
common functions like printf() .
After that we again link in some more system object files which do
some cleanup after programs exit.
• See there are two symbol tables; the dynsym and symtab ones.
We explain how the dynsym symbols work soon.
• Note the many symbols that have been included from the extra
object files. Many of them start with __ to avoid clashing with
any names the programmer might choose. Read through and
pick out the symbols we mentioned before from the object files
and see if they have changed in any way.
2.3.1 a.out
ELF was not always the standard; original UNIX systems used a file
format called a.out . We can see the vestiges of this if you compile a
program without the -o option to specify an output file name; the
executable will be created with a default name of a.out 1.
a.out is a very simple header format that only allows a single data,
code and BSS section. As you will come to see, this is insufficient for
modern systems with dynamic libraries.
1. In fact, a.out is the default output filename from the linker. The
compiler generally uses randomly generated file names as intermediate
files for assembly and object code.
2.3.2 COFF
The Common Object File Format, or COFF, was the precursor to ELF.
Its header format was more flexible, allowing more (but limited)
sections in the file.
3 ELF
ELF is an extremely flexible format for representing binary code in a
system. By following the ELF standard you can represent a kernel
binary just as easily as a normal executable or a system library. The
same tools can be used to inspect and operate on all ELF files and
developers who understand the ELF file format can translate their
skills to most modern UNIX systems.
[Figure: the layout of an ELF file — a file header followed by a series of sections, each with its own header and data]
typedef struct {
	unsigned char	e_ident[EI_NIDENT];
	Elf32_Half	e_type;
	Elf32_Half	e_machine;
	Elf32_Word	e_version;
	Elf32_Addr	e_entry;
	Elf32_Off	e_phoff;
	Elf32_Off	e_shoff;
	Elf32_Word	e_flags;
	Elf32_Half	e_ehsize;
	Elf32_Half	e_phentsize;
	Elf32_Half	e_phnum;
	Elf32_Half	e_shentsize;
	Elf32_Half	e_shnum;
	Elf32_Half	e_shstrndx;
} Elf32_Ehdr;
ELF Header:
  Magic:   7f 45 4c 46 01 02 01 00 00 00 00 00 00 00 00 00
  Class:                             ELF32
  Data:                              2's complement, big endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              EXEC (Executable file)
  Machine:                           PowerPC
  Version:                           0x1
  Entry point address:               0x10002640
  Start of program headers:          52 (bytes into file)
  Start of section headers:          87460 (bytes into file)
  Flags:                             0x0
  Size of this header:               52 (bytes)
  Size of program headers:           32 (bytes)
  Number of program headers:         8
  Size of section headers:           40 (bytes)
  Number of section headers:         29
  Section header string table index: 28
[...]
The e_ident array is the first thing at the start of any ELF file, and
always starts with a few "magic" bytes. The first byte is 0x7F and
then the next three bytes are "ELF". You can inspect an ELF binary to
see this for yourself with something like the hexdump command.
Note the 0x7F to start, then the ASCII encoded "ELF" string. Have a
look at the standard and see what the rest of the array defines and
what the values are in a binary.
Next we have some flags describing the type of machine this binary was
created for. The first thing we can see is that ELF defines two
differently sized versions of its structures, one for 32-bit and one
for 64-bit systems; here we inspect the 32-bit version. The difference
is mostly that on 64-bit machines addresses obviously need to be held
in 64-bit variables. We can see that the binary has been created for a
big endian machine that uses 2's complement to represent negative
numbers. Skipping down a bit, the Machine field tells us this is a
PowerPC binary.
$ cat test.c
#include <stdio.h>

int main(void)
{
        printf("main is : %p\n", &main);
        return 0;
}

$ ./test
main is : 0x10000430
In Example 3.1.4, Investigating the entry point we can see that the
entry point is actually a function called _start . Our program didn't
define this at all, and the leading underscore suggests that it is in a
separate namespace. We examine how a program starts up in detail in
Section 8.2, Starting the program.
After that the header contains pointers to where in the file other
important parts of the ELF file start, like a table of contents.
3.3.1 Segments
As we have done before, it is sometimes easier to look at the higher
level of abstraction (segments) before inspecting the lower layers.
As we mentioned, the ELF file has a header that describes the overall
layout of the file. The ELF header actually points to another group of
headers called the program headers. These describe to the operating
system anything that might be required for it to load the binary into
memory and execute it. Segments are described by program headers, but
so are some other things required to get the executable running.
typedef struct {
        Elf32_Word      p_type;
        Elf32_Off       p_offset;
        Elf32_Addr      p_vaddr;
        Elf32_Addr      p_paddr;
        Elf32_Word      p_filesz;
        Elf32_Word      p_memsz;
        Elf32_Word      p_flags;
        Elf32_Word      p_align;
} Elf32_Phdr;
Program headers describe more than just segments. The p_type field
defines just what the program header is describing. For example, if
this field is PT_INTERP the header points to a string naming an
interpreter for the binary file. We discussed compiled versus
interpreted languages previously and made the distinction that a
compiler builds a binary which can be run in a stand alone fashion.
Why should it need an interpreter? As always, the true picture is a
little more complicated. There are several reasons why a modern
system wants flexibility when loading executable files, and to do this
some information can only be adequately acquired at the actual time
the program is set up to run. We see this in future chapters where we
look into dynamic linking. Consequently some minor changes might
need to be made to the binary to allow it to work properly at runtime.
Thus the usual interpreter of a binary file is the dynamic loader, so
called because it takes the final steps to complete loading of the
executable and prepare the binary image for running.
There are a few other segment types defined in the program headers;
they are described more fully in the standards specification.
3.3.2 Sections
As we have mentioned, sections make up segments. Sections are a
way to organise the binary into logical areas to communicate
information between the compiler and the linker. In some special
binaries, such as the Linux kernel, sections are used in more specific
ways (see Section 6.2, Custom sections).
typedef struct {
        Elf32_Word      sh_name;
        Elf32_Word      sh_type;
        Elf32_Word      sh_flags;
        Elf32_Addr      sh_addr;
        Elf32_Off       sh_offset;
        Elf32_Word      sh_size;
        Elf32_Word      sh_link;
        Elf32_Word      sh_info;
        Elf32_Word      sh_addralign;
        Elf32_Word      sh_entsize;
} Elf32_Shdr;
Sections have a few more types defined for the sh_type field; for
example a section of type SHT_PROGBITS is defined as a section that
holds binary data for use by the program. Other flags say if this section
is a symbol table (used by the linker or debugger for example) or
maybe something for the dynamic loader. There are also more
attributes, such as the allocate attribute which flags that this section
will need memory allocated for it.
#include <stdio.h>

int big_big_array[10*1024*1024];

char *a_string = "Hello, World!";

int a_var_with_value = 0x100;

int main(void)
{
        big_big_array[0] = 100;
        printf("%s\n", a_string);
        a_var_with_value += 20;
}
Section Headers:
  [Nr] Name              Type            Addr     Off    Size
  [ 0]                   NULL            00000000 000000 000000
  [ 1] .interp           PROGBITS        10000114 000114 00000d
  [ 2] .note.ABI-tag     NOTE            10000124 000124 000020
  [ 3] .hash             HASH            10000144 000144 00002c
  [ 4] .dynsym           DYNSYM          10000170 000170 000060
  [ 5] .dynstr           STRTAB          100001d0 0001d0 00005e
  [ 6] .gnu.version      VERSYM          1000022e 00022e 00000c
  [ 7] .gnu.version_r    VERNEED         1000023c 00023c 000020
  [ 8] .rela.dyn         RELA            1000025c 00025c 00000c
  [ 9] .rela.plt         RELA            10000268 000268 000018
  [10] .init             PROGBITS        10000280 000280 000028
  [11] .text             PROGBITS        100002b0 0002b0 000560
  [12] .fini             PROGBITS        10000810 000810 000020
  [13] .rodata           PROGBITS        10000830 000830 000024
  [14] .sdata2           PROGBITS        10000854 000854 000000
  [15] .eh_frame         PROGBITS        10000854 000854 000004
  [16] .ctors            PROGBITS        10010858 000858 000008
  [17] .dtors            PROGBITS        10010860 000860 000008
  [18] .jcr              PROGBITS        10010868 000868 000004
  [19] .got2             PROGBITS        1001086c 00086c 000010
  [20] .dynamic          DYNAMIC         1001087c 00087c 0000c8
...
130: 10010964 4 OBJECT GLOBAL DEFAULT 23 a_var_with_v
...
144: 10000430 96 FUNC GLOBAL DEFAULT 11 main
Thus the .bss section is defined for global variables whose value
should be zero when the program starts. We have seen how the
memory size can be different to the on disk size in our discussion of
segments; variables being in the .bss section are an indication that
they will be given zero value on program start.
The a_string variable lives in the .sdata section, which stands for
small data. Small data (and the corresponding .sbss section) are
sections available on some architectures where data can be reached
by an offset from some known pointer. This means a fixed value can be
added to the base address, making it faster to get to data in these
sections, as there are no extra lookups or loads of addresses into
memory required. Most architectures limit the size of the immediate
values you can add to a register (e.g. when performing the instruction
r1 = add r2, 70; , 70 is an immediate value, as opposed to, say,
adding two values stored in registers, r1 = add r2,r3 ) and can thus
only offset a certain "small" distance from an address. We can also
see that our a_var_with_value lives in the same place.
Program Headers:
  Type           Offset   VirtAddr   PhysAddr   FileSiz MemSiz
  PHDR           0x000034 0x10000034 0x10000034 0x00100 0x00100
  INTERP         0x000154 0x10000154 0x10000154 0x0000d 0x0000d
      [Requesting program interpreter: /lib/ld.so.1]
  LOAD           0x000000 0x10000000 0x10000000 0x14d5c 0x14d5c
  LOAD           0x014d60 0x10024d60 0x10024d60 0x002b0 0x00b7c
  DYNAMIC        0x014f00 0x10024f00 0x10024f00 0x000d8 0x000d8
  NOTE           0x000164 0x10000164 0x10000164 0x00020 0x00020
  GNU_EH_FRAME   0x014d30 0x10014d30 0x10014d30 0x0002c 0x0002c
  GNU_STACK      0x000000 0x00000000 0x00000000 0x00000 0x00000
Skipping to the bottom of the output, we can see what sections have
been moved into what segments. So, for example the .interp section
is placed into an INTERP flagged segment. Notice that readelf tells us
it is requesting the interpreter /lib/ld.so.1 ; this is the dynamic
linker which is run to prepare the binary for execution.
The other interesting thing to note is that the file size is the same as
the memory size for the code segment, however memory size is
greater than the file size for the data segment. This comes from the
BSS section which holds zeroed global variables.
4 ELF Executables
Executables are of course one of the primary uses of the ELF format.
Contained within the binary is everything required for the operating
system to execute the code as intended.
1. For those that are curious, the PowerPC ABI calls stubs for functions in
dynamic libraries directly in the GOT, rather than having them bounce
through a separate PLT entry. Thus the processor needs execute
permissions for the GOT section, which you can see is embedded in the
data segment. This should make sense after reading the dynamic linking
chapter!
Looking at the program headers below, we can see the virtual addresses
the segments are required to be placed at. We can further see that one
segment is for code (it has read and execute permissions only) and one
is for data, unsurprisingly with read and write permissions, but
importantly no execute permissions. Without execute permissions, even
if a bug allowed an attacker to introduce arbitrary data, the pages
backing it would not be marked executable, and most processors will
hence disallow any execution of code in those pages.
Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags
  PHDR           0x0000000000000040 0x0000000000400040 0x000000
                 0x00000000000001c0 0x00000000000001c0  R E
  INTERP         0x0000000000000200 0x0000000000400200 0x000000
                 0x000000000000001c 0x000000000000001c  R
      [Requesting program interpreter: /lib64/ld-linux-x86-64.s
  LOAD           0x0000000000000000 0x0000000000400000 0x000000
                 0x0000000000019ef4 0x0000000000019ef4  R E
  LOAD           0x000000000001a000 0x000000000061a000 0x000000
                 0x000000000000077c 0x0000000000001500  RW
  DYNAMIC        0x000000000001a028 0x000000000061a028 0x000000
                 0x00000000000001d0 0x00000000000001d0  RW
  NOTE           0x000000000000021c 0x000000000040021c 0x000000
                 0x0000000000000044 0x0000000000000044  R
  GNU_EH_FRAME   0x0000000000017768 0x0000000000417768 0x000000
                 0x00000000000006fc 0x00000000000006fc  R
  GNU_STACK      0x0000000000000000 0x0000000000000000 0x000000
                 0x0000000000000000 0x0000000000000000  RW
01 .interp
02 .interp .note.ABI-tag .note.gnu.build-id .hash .gnu.h
03 .ctors .dtors .jcr .dynamic .got .got.plt .data .bss
04 .dynamic
05 .note.ABI-tag .note.gnu.build-id
06 .eh_frame_hdr
07
5 Libraries
Developers soon tired of having to write everything from scratch, so
one of the first inventions of computer science was libraries.
The simplest scheme has the object files from the library linked
directly into your final executable, just as with those you have
compiled yourself. When linked like this the library is called a
static library, because the library will remain unchanged unless the
program is recompiled.
This is the most straightforward way of using a library, as the final
result is a simple executable with no dependencies.
Below we show the creation of a basic static library and introduce
some common tools for working with libraries.
$ cat library.c
/* Library Function */
int function(int input)
{
        return input + 10;
}

$ cat library.h
/* Function Definition */
int function(int);

$ cat program.c
#include <stdio.h>

/* Library header file */
#include "library.h"

int main(void)
{
        int d = function(100);

        printf("%d\n", d);
}

$ gcc -c library.c
$ ar rc libtest.a library.o
$ ranlib ./libtest.a
$ nm --print-armap ./libtest.a
Archive index:
function in library.o

library.o:
00000000 T function

$ gcc -o program program.c -L . -ltest
$ ./program
110
Notice that we define the library API in the header file. The API
consists of the function prototypes for the functions in the library;
this is so that the compiler knows what types the functions take when
building object files that reference the library (e.g. program.c ,
which #include s the header file).
You then specify the library to the compiler with -lname , where name
is the filename of the library without the lib prefix (or the .a
extension). We also provide
an extra search directory for libraries, namely the current directory
( -L . ), since by default the current directory is not searched for
libraries.
The final result is a single executable with our new library included.
$ cat coredump.c
int main(void) {
        char *foo = (char*)0x12345;
        *foo = 'a';

        return 0;
}

$ file ./core
./core: ELF 32-bit LSB core file Intel 80386, version 1 (SYSV),

$ gdb ./coredump
...
(gdb) core core
[New LWP 31614]
Core was generated by `./coredump'.
Program terminated with signal 11, Segmentation fault.
#0  0x080483c4 in main () at coredump.c:3
3               *foo = 'a';
(gdb)
Symbols take up much less space, but are also targets for removal
from final output. Once the individual object files of an executable are
linked into the single final image there is generally no need for most
symbols to remain. As discussed in Section 3.2, Symbols and
Relocations, symbols are required to fix up relocation entries, but
once this is done the symbols are not strictly necessary for running
the final program. On Linux the GNU toolchain strip program
provides options to remove symbols. Note that some symbols are
required to be resolved at run-time (for dynamic linking, the focus of
Chapter 9, Dynamic Linking) but these are put in separate dynamic
symbol tables so they will not be removed and render the final output
useless.
Program Headers:
Type Offset VirtAddr PhysAddr FileSiz MemSiz
NOTE 0x000214 0x00000000 0x00000000 0x0022c 0x00000
LOAD 0x001000 0x08048000 0x00000000 0x01000 0x01000
$ eu-readelf -n ./core
ENTRY: 0x8048300
UID: 1000
EUID: 1000
GID: 1000
EGID: 1000
SECURE: 0
RANDOM: 0xbfecba1b
EXECFN: 0xbfecdff1
PLATFORM: 0xbfecba2b
NULL
LINUX 48 386_TLS
index: 6, base: 0xb771c8d0, limit: 0x000fffff, flags: 0x000
index: 7, base: 0x00000000, limit: 0x00000000, flags: 0x000
index: 8, base: 0x00000000, limit: 0x00000000, flags: 0x000
The other component of the core dump is the NOTE sections, which
contain data necessary for debugging but not necessarily captured in a
straight snapshot of the memory allocations. The eu-readelf program
used in the second part of the figure provides a more complete view of
the data by decoding it.
The kernel creates the core dump file within the bounds of the
current ulimit settings — since a program using a lot of memory
could result in a very large dump, potentially filling up disk and
making problems even worse, generally the ulimit is set low or even
at zero, since most non-developers have little use for a core dump file.
However the core dump remains the single most useful way to debug
an unexpected situation in a postmortem fashion.
The modinfo tool can inspect this information within a module and
present it to the user. Below we use the example of the FUSE Linux
kernel module, which allows user-space libraries to provide file-
system implementations to the kernel.
$ cd /lib/modules/$(uname -r)
/*
 * Start at the bottom, and work your way up!
 */

#define __stringify_1(x...)     #x
#define __stringify(x...)       __stringify_1(x)

/*
 * Author(s), use "Name <email>" or just "Name", for multiple
 * authors use multiple MODULE_AUTHOR() statements/lines.
 */
#define MODULE_AUTHOR(_author) MODULE_INFO(author, _author)

/* ---- */

MODULE_AUTHOR("Your Name <[email protected]>");
We can inspect the symbols placed in the final module to see the end
result:
...
You can roughly see how the linker script specifies things like
starting locations and what sections to group into various segments.
In the same way that -Wl was used to pass --verbose through gcc to the
linker, a customised linker script can be provided by flags. Regular
user-space developers are unlikely to need to override the default
linker script. However, very customised applications such as kernel
builds often require customised linker scripts.
7 ABIs
An ABI is a term you will hear a lot about when working with systems
programming. We have talked extensively about APIs, which are the
interfaces the programmer sees to your code.
This may seem like a strange way to do things, but it has very useful
practical implications, as you will see in the next chapter about
global offset tables. On IA64 an add instruction can only take a
maximum 22-bit immediate value. An immediate value is one that is
specified directly, rather than in a register (e.g. in add r1 + 100 ,
100 is the immediate value).
8 Starting a process
We mentioned before that simply saying the program starts with the
main() function is not quite true. Below we examine what happens to a
typical dynamically linked program when it is loaded and run
(statically linked programs follow a similar but simpler path).
In this case, the kernel also reads in the dynamic linker code, and
starts the program from the entry point address as specified by it. We
examine the role of the dynamic linker in depth in the next chapter,
but suffice to say it does some setup like loading any libraries
required by the application (as specified in the dynamic section of the
binary) and then starts execution of the program binary at its entry
point address (i.e. the _start function).
When the kernel starts the dynamic linker it adds an entry to the auxv
called AT_SYSINFO_EHDR , which is the address in memory that the
special kernel library lives in. When the dynamic linker starts it can
look for the AT_SYSINFO_EHDR pointer, and if found load that library for
the program. The program has no idea this library exists; this is a
private arrangement between the dynamic linker and the kernel.
$ cat test.c

int main(void)
{
        return 0;
}

[...]

080482b0 <_start>:
 80482b0:       31 ed                   xor    %ebp,%ebp
 80482b2:       5e                      pop    %esi
 80482b3:       89 e1                   mov    %esp,%ecx
 80482b5:       83 e4 f0                and    $0xfffffff0,%esp
 80482b8:       50                      push   %eax
 80482b9:       54                      push   %esp
 80482ba:       52                      push   %edx
 80482bb:       68 00 84 04 08          push   $0x8048400
 80482c0:       68 90 83 04 08          push   $0x8048390
 80482c5:       51                      push   %ecx
 80482c6:       56                      push   %esi
 80482c7:       68 68 83 04 08          push   $0x8048368
 80482cc:       e8 b3 ff ff ff          call   8048284 <__libc_
 80482d1:       f4                      hlt
 80482d2:       90                      nop
 80482d3:       90                      nop

08048368 <main>:
 8048368:       55                      push   %ebp
 8048369:       89 e5                   mov    %esp,%ebp
 804836b:       83 ec 08                sub    $0x8,%esp
 804836e:       83 e4 f0                and    $0xfffffff0,%esp
 8048371:       b8 00 00 00 00          mov    $0x0,%eax
 8048376:       83 c0 0f                add    $0xf,%eax
 8048379:       83 c0 0f                add    $0xf,%eax
 804837c:       c1 e8 04                shr    $0x4,%eax
 804837f:       c1 e0 04                shl    $0x4,%eax
 8048382:       29 c4                   sub    %eax,%esp
 8048384:       b8 00 00 00 00          mov    $0x0,%eax
 8048389:       c9                      leave
 804838a:       c3                      ret
 804838b:       90                      nop
 804838c:       90                      nop
 804838d:       90                      nop
 804838e:       90                      nop
 804838f:       90                      nop

08048390 <__libc_csu_init>:
 8048390:       55                      push   %ebp
 8048391:       89 e5                   mov    %esp,%ebp
[...]

08048400 <__libc_csu_fini>:
init and fini are two special concepts: code in shared libraries that
may need to be called when the library is loaded or unloaded,
respectively. You can see how this might be useful for library
programmers, to set up variables when the library is started or to
clean up at the end. Originally the functions _init and _fini were
looked for in the library; however this became somewhat limiting, as
everything was required to be in those functions. Below we will
examine just how the init / fini process works.
$ cat test.c
#include <stdio.h>

[...]

int main(void)
{
        return 0;
}

$ ./test
init
fini
[...]

08048280 <_init>:
 8048280:       55                      push   %ebp
 8048281:       89 e5                   mov    %esp,%ebp
 8048283:       83 ec 08                sub    $0x8,%esp
 8048286:       e8 79 00 00 00          call   8048304 <call_gm
 804828b:       e8 e0 00 00 00          call   8048370 <frame_d
 8048290:       e8 2b 02 00 00          call   80484c0 <__do_gl
 8048295:       c9                      leave
 8048296:       c3                      ret
[...]

080484c0 <__do_global_ctors_aux>:
 80484c0:       55                      push   %ebp
 80484c1:       89 e5                   mov    %esp,%ebp
 80484c3:       53                      push   %ebx
 80484c4:       52                      push   %edx
 80484c5:       a1 2c 95 04 08          mov    0x804952c,%eax
 80484ca:       83 f8 ff                cmp    $0xffffffff,%eax
 80484cd:       74 1e                   je     80484ed <__do_gl
 80484cf:       bb 2c 95 04 08          mov    $0x804952c,%ebx
 80484d4:       8d b6 00 00 00 00       lea    0x0(%esi),%esi
 80484da:       8d bf 00 00 00 00       lea    0x0(%edi),%edi
 80484e0:       ff d0                   call   *%eax
 80484e2:       8b 43 fc                mov    0xfffffffc(%ebx)
 80484e5:       83 eb 04                sub    $0x4,%ebx
 80484e8:       83 f8 ff                cmp    $0xffffffff,%eax
 80484eb:       75 f3                   jne    80484e0 <__do_gl
 80484ed:       58                      pop    %eax
 80484ee:       5b                      pop    %ebx
The last value pushed onto the stack for the __libc_start_main was
the initialisation function __libc_csu_init . If we follow the call chain
through from __libc_csu_init we can see it does some setup and
then calls the _init function in the executable. The _init function
eventually calls a function called __do_global_ctors_aux . Looking at
the disassembly of this function we can see that it appears to start
at address 0x804952c and loop along, reading a value and calling it.
We can see that this starting address is in the .ctors section of the
file; if we have a look inside this we see that it contains the first
value -1 , a function address (in big endian format) and the value
zero.
A similar process is enacted with the .dtors for destructors when the
program exits. __libc_start_main calls these when the main()
function completes.
As you can see, a lot is done before the program gets to start, and
even a little after you think it is finished!
With virtual memory this can be easily done. The physical pages of
memory the library code is loaded into can be easily referenced by
any number of virtual pages in any number of address spaces. So
while you only have one physical copy of the library code in system
memory, every process can have access to that library code at any
virtual address it likes.
Thus people quickly came up with the idea of a shared library which,
as the name suggests, is shared by multiple executables.
The ELF header has two mutually exclusive flags, ET_EXEC and ET_DYN ,
to mark an ELF file as either an executable or a shared object file.
1.2.1 Compilation
When you compile your program that uses a dynamic library, object
files are left with references to the library functions just as for any
other external reference.
You need to include the header for the library so that the compiler
knows the specific types of the functions you are calling. Note the
compiler only needs to know the types associated with a function
(such as, it takes an int and returns a char * ) so that it can
correctly allocate space for the function call.1
1. This has not always been the case with the C standard. Previously,
compilers would assume that any function it did not know about
returned an int . On a 32 bit system, the size of a pointer is the
same size as an int , so there was no problem. However, with a 64 bit
system, the size of a pointer is generally twice the size of an int ,
so if the function actually returns a pointer, its value will be
destroyed. This is clearly not acceptable, as the pointer will thus
not point to valid memory. The C99 standard has changed such that you
are required to specify the types of included functions.
1.2.2 Linking
Even though the dynamic linker does a lot of the work for shared
libraries, the traditional linker still has a role to play in creating
the executable.
Again, we can inspect these fields with the readelf program. Below we
have a look at a very standard binary, /bin/ls .
You can see that it specifies three libraries. The most common
library, shared by most, if not all, programs on the system, is libc .
There are also some other libraries that the program needs to run
correctly.
Reading the ELF file directly is sometimes useful, but the usual way
to inspect a dynamically linked executable is via ldd . ldd "walks"
the dependencies of libraries for you; that is, if a library depends
on another library, it will show it to you.
$ ldd /bin/ls
        librt.so.1 => /lib/tls/librt.so.1 (0x2000000000058000)
        libacl.so.1 => /lib/libacl.so.1 (0x2000000000078000)
        libc.so.6.1 => /lib/tls/libc.so.6.1 (0x2000000000098000
        libpthread.so.0 => /lib/tls/libpthread.so.0 (0x20000000
        /lib/ld-linux-ia64.so.2 => /lib/ld-linux-ia64.so.2 (0x2
        libattr.so.1 => /lib/libattr.so.1 (0x2000000000310000)

$ readelf --dynamic /lib/librt.so.1
Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags
  PHDR           0x0000000000000040 0x4000000000000040 0x400000
                 0x0000000000000188 0x0000000000000188  R E
  INTERP         0x00000000000001c8 0x40000000000001c8 0x400000
                 0x0000000000000018 0x0000000000000018  R
      [Requesting program interpreter: /lib/ld-linux-ia64.so.2]
  LOAD           0x0000000000000000 0x4000000000000000 0x400000
                 0x0000000000022e40 0x0000000000022e40  R E
  LOAD           0x0000000000022e40 0x6000000000002e40 0x600000
                 0x0000000000001138 0x00000000000017b8  RW
  DYNAMIC        0x0000000000022f78 0x6000000000002f78 0x600000
                 0x0000000000000200 0x0000000000000200  RW
  NOTE           0x00000000000001e0 0x40000000000001e0 0x400000
                 0x0000000000000020 0x0000000000000020  R
  IA_64_UNWIND   0x0000000000022018 0x4000000000022018 0x400000
                 0x0000000000000e28 0x0000000000000e28  R
2.1 Relocations
The essential part of the dynamic linker is fixing up addresses at
runtime, which is the only time you can know for certain where you
are loaded in memory. A relocation can simply be thought of as a note
that a particular address will need to be fixed at load time. Before the
code is ready to run you will need to go through and read all the
relocations and fix the addresses they refer to so they point to the right
place.
There are many types of relocation for each architecture, and each
type's exact behaviour is documented as part of the ABI for the
system. The definition of a relocation is quite straightforward.
typedef struct {
        Elf32_Addr      r_offset;  <--- address to fix
        Elf32_Word      r_info;    <--- symbol table pointer and relocation type
} Elf32_Rel;

typedef struct {
        Elf32_Addr      r_offset;
        Elf32_Word      r_info;
        Elf32_Sword     r_addend;
} Elf32_Rela;
The r_offset field refers to the offset in the file that needs to be fixed
up. The r_info field specifies the type of relocation which describes
what exactly must be done to fix this code up. The simplest relocation
usually defined for an architecture is simply the value of the symbol.
In this case you simply substitute the address of the symbol at the
location specified, and the relocation has been "fixed-up".
The two types, one with an addend and one without, specify different
ways for the relocation to operate. An addend is simply something
that should be added to the fixed up address to find the correct
address. For example, if the relocation is for the symbol i because
the original code is doing something like i[8] the addend will be set
to 8. This means "find the address of i , and go 8 past it".
The trade offs of each approach should be clear. With REL you need
to do an extra memory reference to find the addend before the fixup,
but you don't waste space in the binary because you use relocation
target memory. With RELA you keep the addend with the relocation,
but waste that space in the on disk binary. Most modern systems use
RELA relocations.
$ cat addendtest.c
extern int i[4];
int *j = i + 2;

$ cat addendtest2.c
int i[4];
3. LSB : Since IA64 can operate in big and little endian modes,
this relocation is little endian (least significant byte).
The ABI continues to say that the relocation means "the value of the
symbol pointed to by the relocation, plus any addend". We can see we
have an addend of 8, since sizeof(int) == 4 and we have moved two
ints into the array ( *j = i + 2 ). So at runtime, to fix this relocation
you need to find the address of symbol i and put its value, plus 8
into 0x104f8 .
Libraries have no such guarantee. They can know that their data
section will be a specified offset from the base address; but exactly
where that base address is can only be known at runtime.
The problem stems from the fact that libraries have no guarantee
about where they will be put into memory. The dynamic linker will
find the most convenient place in virtual memory for each library
required and place it there. Think about the alternative if this were
not to happen; every library in the system would require its own
chunk of virtual memory so that no two overlapped. Every time a new
library were added to the system it would require a new allocation.
Someone could potentially be a hog and write a huge library, not
leaving enough space for other libraries! And chances are, your
program doesn't ever want to use that library anyway.
Thus, if you modify the code of a shared library with a relocation, that
code no longer becomes sharable. We've lost the advantage of our
shared library.
The solution is to set aside space to hold the memory address of that
symbol, and to re-write the code to load that address.
The area that is set aside for these addresses is called the Global
Offset Table, or GOT. The GOT lives in a section of the ELF file called
.got .
[Figure: two processes share a single physical copy of the library code, each mapping it at its own virtual address; each process has its own private GOT/PLT and its own copy of shared variables.]
The GOT is private to each process, and the process must have write
permissions to it. Conversely the library code is shared and the
process should have only read and execute permissions on the code;
it would be a serious security breach if the process could modify
code.
$ cat got.c
extern int i;

void test(void)
{
        i = 100;
}

0000000000000410 <test>:
 410:   0d 10 00 18 00 21       [MFI]       mov r2=r12
 416:   00 00 00 02 00 c0                   nop.f 0x0
 41c:   81 09 00 90                         addl r14=24,r1;;
 420:   0d 78 00 1c 18 10       [MFI]       ld8 r15=[r14]
 426:   00 00 00 02 00 c0                   nop.f 0x0
 42c:   41 06 00 90                         mov r14=100;;
 430:   11 00 38 1e 90 11       [MIB]       st4 [r15]=r14
 436:   c0 00 08 00 42 80                   mov r12=r2
 43c:   08 00 84 00                         br.ret.sptk.many b0
Section Headers:
  [Nr] Name              Type             Address           Off
       Size              EntSize          Flags  Link  Info  Al
  [ 0]                   NULL             0000000000000000  000
       0000000000000000  0000000000000000           0     0
  [ 1] .hash             HASH             0000000000000120  000
       00000000000000a0  0000000000000004   A       2     0
  [ 2] .dynsym           DYNSYM           00000000000001c0  000
       00000000000001f8  0000000000000018   A       3     e
  [ 3] .dynstr           STRTAB           00000000000003b8  000
       000000000000003f  0000000000000000   A       0     0
  [ 4] .rela.dyn         RELA             00000000000003f8  000
       0000000000000018  0000000000000018   A       2     0
  [ 5] .text             PROGBITS         0000000000000410  000
       0000000000000030  0000000000000000   AX      0     0
  [ 6] .IA_64.unwind_inf PROGBITS         0000000000000440  000
       0000000000000018  0000000000000000   A       0     0
  [ 7] .IA_64.unwind     IA_64_UNWIND     0000000000000458  000
       0000000000000018  0000000000000000   AL      5     5
  [ 8] .data             PROGBITS         0000000000010470  000
       0000000000000000  0000000000000000   WA      0     0
  [ 9] .dynamic          DYNAMIC          0000000000010470  000
       0000000000000100  0000000000000010   WA      3     0
  [10] .got              PROGBITS         0000000000010570  000
       0000000000000020  0000000000000000   WAp     0     0
  [11] .sbss             NOBITS           0000000000010590  000
       0000000000000000  0000000000000000   W       0     0
  [12] .bss              NOBITS           0000000000010590  000
       0000000000000000  0000000000000000   WA      0     0
  [13] .comment          PROGBITS         0000000000000000  000
       0000000000000026  0000000000000000           0     0
  [14] .shstrtab         STRTAB           0000000000000000  000
       000000000000008a  0000000000000000           0     0
  [15] .symtab           SYMTAB           0000000000000000  000
       0000000000000258  0000000000000018          16    12
  [16] .strtab           STRTAB           0000000000000000  000
       0000000000000045  0000000000000000           0     0
Key to Flags:
  W (write), A (alloc), X (execute), M (merge), S (strings)
  I (info), L (link order), G (group), x (unknown)
  O (extra OS processing required) o (OS specific), p (processo
The disassembly reveals just how we do this with the .got. On IA64
(the architecture for which the library was compiled) the register r1
is known as the global pointer and always points to where the .got
section is loaded into memory.
If we have a look at the readelf output we can see that the .got
section starts 0x10570 bytes past where the library was loaded into
memory. Thus if the library were to be loaded into memory at address
0x6000000000000000 the .got would be at 0x6000000000010570,
and register r1 would always point to this address.
We can also check out the relocation for this entry. The relocation
says "replace the value at offset 10588 with the memory location at
which symbol i is stored".
We know that the .got starts at offset 0x10570 from the previous
output. We have also seen how the code loads an address 0x18 (24 in
decimal) past this, giving us an address of 0x10570 + 0x18 =
0x10588 ... the address which the relocation is for!
So before the program begins, the dynamic linker will have fixed up
the relocation to ensure that the value of the memory at offset
0x10588 is the address of the global variable i !
4 Libraries
4.1 The Procedure Lookup Table
Libraries may contain many functions, and a program may end up
including many libraries to get its work done. A program may only
use one or two functions from each library of the many available, and
depending on the run-time path through the code may use some
functions and not others.
Resolving every library function at load time would be an unnecessary
expense; lazy binding defers this expense until the actual function is
called, by using a PLT.
Each library function has an entry in the PLT, which initially points to
some special dummy code. When the program calls the function, it
actually calls the PLT entry (in the same way as variables are
referenced through the GOT). This dummy code asks the dynamic
linker to resolve the function's real address and write it back, so the
PLT entry no longer points to the dummy code.
Thus, the next time the function is called the address can be loaded
without having to go back into the dynamic loader again. If a function
is never called, then the PLT entry will never be modified but there
will be no runtime overhead.
Let us consider the simple "Hello, World" application. This will make
only one library call, to printf, to output the string to the user.
$ cat hello.c
#include <stdio.h>

int main(void)
{
        printf("Hello, World!\n");
        return 0;
}

$ gcc -o hello hello.c
4000000000000790 <main>:
4000000000000790:       00 08 15 08 80 05       [MII]       all
4000000000000796:       20 02 30 00 42 60                   mov
400000000000079c:       04 08 00 84                         mov
40000000000007a0:       01 00 00 00 01 00       [MII]       nop
40000000000007a6:       00 02 00 62 00 c0                   mov
40000000000007ac:       81 0c 00 90                         add
40000000000007b0:       1c 20 01 1c 18 10       [MFB]       ld8
40000000000007b6:       00 00 00 02 00 00                   nop
40000000000007bc:       78 fd ff 58                         br.
40000000000007c0:       02 08 00 46 00 21       [MII]       mov
40000000000007c6:       e0 00 00 00 42 00                   mov
40000000000007cc:       01 70 00 84                         mov
40000000000007d0:       00 00 00 00 01 00       [MII]       nop
40000000000007d6:       00 08 01 55 00 00                   mov
40000000000007dc:       00 0a 00 07                         mov
40000000000007e0:       1d 60 00 44 00 21       [MFB]       mov
40000000000007e6:       00 00 00 02 00 80                   nop
40000000000007ec:       08 00 84 00                         br.
$ readelf --sections ./hello
There are 40 section headers, starting at offset 0x25c0:

Section Headers:
  [Nr] Name              Type             Address           Off
       Size              EntSize          Flags  Link  Info  Al
  [ 0]                   NULL             0000000000000000  000
       0000000000000000  0000000000000000           0     0
  ...
  [11] .plt              PROGBITS         40000000000004c0  000
       00000000000000c0  0000000000000000   AX      0     0
  [12] .text             PROGBITS         4000000000000580  000
       00000000000004a0  0000000000000000   AX      0     0
  [13] .fini             PROGBITS         4000000000000a20  000
       0000000000000040  0000000000000000   AX      0     0
  [14] .rodata           PROGBITS         4000000000000a60  000
       000000000000000f  0000000000000000   A       0     0
  [15] .opd              PROGBITS         4000000000000a70  000
       0000000000000070  0000000000000000   A       0     0
  [16] .IA_64.unwind_inf PROGBITS         4000000000000ae0  000
       00000000000000f0  0000000000000000   A       0     0
  [17] .IA_64.unwind     IA_64_UNWIND     4000000000000bd0  000
       00000000000000c0  0000000000000000   AL     12     c
  [18] .init_array       INIT_ARRAY       6000000000000c90  000
       0000000000000018  0000000000000000   WA      0     0
  [19] .fini_array       FINI_ARRAY       6000000000000ca8  000
       0000000000000008  0000000000000000   WA      0     0
  [20] .data             PROGBITS         6000000000000cb0  000
       0000000000000004  0000000000000000   WA      0     0
40000000000004c0 <.plt>:
40000000000004c0:       0b 10 00 1c 00 21       [MMI]       mov
40000000000004c6:       e0 00 08 00 48 00                   add
40000000000004cc:       00 00 04 00                         nop
40000000000004d0:       0b 80 20 1c 18 14       [MMI]       ld8
40000000000004d6:       10 41 38 30 28 00                   ld8
40000000000004dc:       00 00 04 00                         nop
40000000000004e0:       11 08 00 1c 18 10       [MIB]       ld8
40000000000004e6:       60 88 04 80 03 00                   mov
40000000000004ec:       60 00 80 00                         br.
40000000000004f0:       11 78 00 00 00 24       [MIB]       mov
40000000000004f6:       00 00 00 02 00 00                   nop
40000000000004fc:       d0 ff ff 48                         br.
4000000000000500:       11 78 04 00 00 24       [MIB]       mov
4000000000000506:       00 00 00 02 00 00                   nop
400000000000050c:       c0 ff ff 48                         br.
4000000000000510:       11 78 08 00 00 24       [MIB]       mov
4000000000000516:       00 00 00 02 00 00                   nop
400000000000051c:       b0 ff ff 48                         br.
4000000000000520:       0b 78 40 03 00 24       [MMI]       add
4000000000000526:       00 41 3c 70 29 c0                   ld8
400000000000052c:       01 08 00 84                         mov
4000000000000530:       11 08 00 1e 18 10       [MIB]       ld8
4000000000000536:       60 80 04 80 03 00                   mov
400000000000053c:       60 00 80 00                         br.
4000000000000540:       0b 78 80 03 00 24       [MMI]       add
4000000000000546:       00 41 3c 70 29 c0                   ld8
400000000000054c:       01 08 00 84                         mov
4000000000000550:       11 08 00 1e 18 10       [MIB]       ld8
4000000000000556:       60 80 04 80 03 00                   mov
400000000000055c:       60 00 80 00                         br.
4000000000000560:       0b 78 c0 03 00 24       [MMI]       add
4000000000000566:       00 41 3c 70 29 c0                   ld8
400000000000056c:       01 08 00 84                         mov
4000000000000570:       11 08 00 1e 18 10       [MIB]       ld8
4000000000000576:       60 80 04 80 03 00                   mov
400000000000057c:       60 00 80 00                         br.
$ objdump --disassemble-all ./hello

Disassembly of section .got:

6000000000000ec0 <.got>:
        ...
6000000000000ee8:       80 0a 00 00 00 00                   dat
6000000000000eee:       00 40 90 0a                         dep
6000000000000ef2:       00 00 00 00 00 40       [MIB] (p20) bre
6000000000000ef8:       a0 0a 00 00 00 00                   dat
6000000000000efe:       00 40 50 0f                         br.
6000000000000f02:       00 00 00 00 00 60       [MIB] (p58) bre
6000000000000f08:       60 0a 00 00 00 00                   dat
6000000000000f0e:       00 40 90 06                         br.

Disassembly of section .IA_64.pltoff:

6000000000000f10 <.IA_64.pltoff>:
6000000000000f10:       f0 04 00 00 00 00       [MIB] (p39) bre
6000000000000f16:       00 40 c0 0e 00 00                   dat
6000000000000f1c:       00 00 00 60                         dat
6000000000000f20:       00 05 00 00 00 00       [MII] (p40) bre
6000000000000f26:       00 40 c0 0e 00 00                   dat
6000000000000f2c:       00 00 00 60                         dat
6000000000000f30:       10 05 00 00 00 00       [MIB] (p40) bre
6000000000000f36:       00 40 c0 0e 00 00                   dat
6000000000000f3c:       00 00 00 60                         dat
being loaded here. Swapping the byte order of the first 8 bytes f0 04
00 00 00 00 00 40 we end up with 0x40000000000004f0. Now that
address looks familiar! Looking back up at the assembly output of the
PLT we see that address.
The code at 0x40000000000004f0 firstly puts a zero value into r15, and
then branches back to 0x40000000000004c0. Wait a minute! That's the
start of our PLT section.
We can trace this code through too. Firstly we save the value of the
global pointer ( r2 ) then we load three 8 byte values into r16 , r17
and finally, r1 . We then branch to the address in r17 . What we are
seeing here is the actual call into the dynamic linker!
Do you notice anything about it? It's the same value as the GOT. This
means that the first three 8-byte entries in the GOT are actually the
reserved area, and thus will always be pointed to by the global pointer.
When the dynamic linker starts, it is its duty to fill these values in. The
ABI says that the first value will be filled in by the dynamic linker
giving this module a unique ID. The second value is the global pointer
value for the dynamic linker, and the third value is the address of the
function that finds and fixes up the symbol.
/* Set up the loaded object described by L so its unrelocated PLT
   entries will jump to the on-demand fixup code in dl-runtime.c.  */
      reserve[1] = doit;
      reserve[2] = gp;
    }

  return lazy;
}
We can see how this gets set up by the dynamic linker by looking at
the function that does this for the binary. The reserve variable is set
from the PLT_RESERVE section pointer in the binary. The unique
value (put into reserve[0]) is the address of the link map for this
object. Link maps are the internal representation within glibc for
shared objects. We then put the address of _dl_runtime_resolve into
the second value (assuming we are not using profiling). reserve[2] is
finally set to gp, which has been found from r2 with the __asm__ call.
Looking back at the ABI, we see that the relocation index for the
entry must be placed in r15 and the unique identifier must be passed
in r16 .
r15 has previously been set in the stub code, before we jumped back
to the start of the PLT. Have a look down the entries, and notice how
each PLT entry loads r15 with an incremented value. It should come
as no surprise, if you look at the relocations, that the printf relocation
is number zero.
r16 we load up from the values that have been initialised by the
dynamic linker in the reserved area of the GOT.
The relocation record provides the dynamic linker with the address it
needs to "fix up"; remember, it was in the GOT and loaded by the
initial PLT stub? This means that after the first time the function is
called, every subsequent call loads the direct address of the function,
short-circuiting the dynamic linker.
4.1.2 Summary
You've seen the exact mechanism behind the PLT, and consequently
the inner workings of the dynamic linker. The important points to
remember are:
• The dynamic linker re-writes the address that the stub code
reads, so that the next time the function is called it will go
straight to the right address.
However, changes in the way the dynamic library works could cause
multiple problems. In the best case, the modifications are completely
compatible and nothing externally visible is changed. On the other
hand, the changes might cause the application to crash; for example if
a function that used to take an int changes to take an int *. Worse,
the new library version could have changed semantics and suddenly
start silently returning different, possibly wrong values. This can be a
very nasty bug to try and track down; when an application crashes
you can use a debugger to isolate where the error occurs whilst data
corruption or modification may only show up in seemingly unrelated
parts of the application.
5.1.1 sonames
Using sonames we can add some extra information to a library to help
identify versions.
Thus each minor version library file on disc can specify the same
major version number in its DT_SONAME field, allowing the dynamic
linker to know that this particular library file implements a particular
major revision of the library API and ABI.
The final piece of the hierarchy is the compile name for the library.
When you compile your program, to link against a library you use the
-lNAME flag, which goes off searching for the libNAME.so file in the
library search path. Notice however, we have not specified any
version number; we just want to link against the latest library on the
system. It is up to the installation procedure for the library to create
the symbolic link between the compile libNAME.so name and the
latest library code on the system. Usually this is handled by your
package management system (dpkg or rpm). This is not an automated
process because it is possible that the latest library on the system
may not be the one you wish to always compile against.

1. You can optionally have a release as a final identifier after the minor
number. Generally this is enough to distinguish all the various versions
of a library.
[Figure: sonames in /usr/lib. The file libfoo.so.2.0 (major revision 2, minor revision 0) has soname libfoo.so.2; a new build ($ gcc -o test test.c -lfoo) links against it via the libfoo.so symbolic link. The older files libfoo.so.1.1 (major 1, minor 1) and libfoo.so.1.2 (major 1, minor 2) share soname libfoo.so.1; an old application with DT_NEEDED libfoo.so.1 resolves through the libfoo.so.1 link to the newest minor revision, libfoo.so.1.2.]
To load a library, the dynamic linker first needs to search through all
the libraries to find those that implement the given soname. Secondly,
the file names for the minor revisions need to be compared to find the
latest version, which is then ready to be loaded.
The dynamic linker has a few pieces of information; firstly the symbol
that it is searching for, and secondly a list of libraries that the symbol
might be in, as defined by the DT_NEEDED fields in the binary.
typedef struct {
        Elf32_Word    st_name;
        Elf32_Addr    st_value;
        Elf32_Word    st_size;
        unsigned char st_info;
        unsigned char st_other;
        Elf32_Half    st_shndx;
} Elf32_Sym;
As you can see, the actual string of the symbol name is held in a
separate section (.dynstr); the entry in the .dynsym section only holds
an index into the string section. This creates some level of overhead
for the dynamic linker; the dynamic linker must read all of the symbol
entries in the .dynsym section and then follow the index pointer to
find the symbol name for comparison.
While the process of finding the symbol a reference refers to is the
process of binding that symbol, the term symbol binding has a
separate meaning.
$ cat test.c
static int static_variable;

int function(void)
{
        return external_function();
}

$ gcc -c test.c

$ objdump --syms test.o

SYMBOL TABLE:
00000000 l    df *ABS*  00000000 test.c
$ nm test.o
         U external_function
00000000 T function
00000038 t static_function
00000000 s static_variable
0000005c W weak_function
We inspect the symbols with two different tools; in both cases the
binding is shown in the second column. The codes should be quite
straightforward, and are documented in each tool's man page.
$ cat override.c
#define _GNU_SOURCE 1
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <dlfcn.h>

pid_t getpid(void)
{
        pid_t (*orig_getpid)(void) = dlsym(RTLD_NEXT, "getpid");
        printf("Calling GETPID\n");
        return orig_getpid();
}

$ cat test.c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
        printf("%d\n", getpid());
}
15187
The logical extension of this for the dynamic loader is that all libraries
should be loaded, and any weak symbols in those libraries should be
overridden by normal symbols in any other library. This was indeed
how weak symbol handling was originally implemented in Linux by glibc.
However, this was actually incorrect to the letter of the Unix standard
at the time (SysVr4). The standard actually dictates that weak
symbols should only be handled by the static linker; they should
remain irrelevant to the dynamic linker (see the section on binding
order below).
Libraries are loaded in the order they are specified in the DT_NEEDED
fields of the binary. This in turn is decided from the order that libraries
are passed in on the command line when the object is built. When
symbols are to be located, the dynamic linker starts at the last loaded
library and works backwards until the symbol is found.
$ cat Makefile
all: test testsym

clean:
	rm -f *.so test testsym

liboverride.so : override.c
	$(CC) -shared -fPIC -o liboverride.so override.c

libtest.so : libtest.c
	$(CC) -shared -fPIC -o libtest.so libtest.c

libtestsym.so : libtest.c
	$(CC) -shared -fPIC -Wl,-Bsymbolic -o libtestsym.so libtest.c

test : test.c libtest.so liboverride.so
	$(CC) -L. -ltest -o test test.c

$ cat libtest.c
#include <stdio.h>

int foo(void) {
	printf("libtest foo called\n");
	return 1;
}

int test_foo(void)
{
	return foo();
}

$ cat override.c
#include <stdio.h>

int foo(void)
{
	printf("override foo called\n");
	return 0;
}

$ cat test.c
#include <stdio.h>

int main(void)
{
	printf("%d\n", test_foo());
}

$ cat Versions
{global: test_foo; local: *; };
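The Versions script can be applied with the linker's --version-script option, which hides foo from the dynamic symbol table entirely (a sketch; the Makefile above does not show this step, and libtestver.so is our own name for the result):

```shell
cat > libtest.c <<'EOF'
#include <stdio.h>

int foo(void)
{
	printf("libtest foo called\n");
	return 1;
}

int test_foo(void)
{
	return foo();
}
EOF

cat > Versions <<'EOF'
{global: test_foo; local: *; };
EOF

gcc -shared -fPIC -Wl,--version-script=Versions -o libtestver.so libtest.c

# Only test_foo remains dynamically visible; foo is now local and
# can no longer be overridden from outside the library:
nm -D libtestver.so | grep foo
```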
If the developer implements a function that has the same name but a
possibly binary- or programmatically-different implementation, they
can increase the version number. When new applications are built
against the shared library, they will pick up the latest version of the
symbol. However, applications built against earlier versions of the
same library will be requesting older versions (e.g. will have older
@VER strings in the symbol name they request) and thus get the
original implementation.