Making Plain Binary Files Using A C Comp
Cornelis Frank
April 10, 2000
I wrote this article because there isn’t much information on the Internet concerning this topic
and I needed this for the EduOS project.
So if you blow up your computer because of my bad “English” that’s your problem not mine.
GNU GCC compiler. This C compiler usually comes with Linux. To check whether GCC is
installed, type the following at the prompt:
gcc --version
2.7.2.3
The number probably will not match the above one, but that doesn’t really matter.
NASM Version 0.97 or higher. The Netwide Assembler, NASM, is an 80x86 assembler
designed for portability and modularity. It supports a range of object file formats, including
Linux ‘a.out’ and ELF, NetBSD/FreeBSD, COFF, Microsoft 16-bit OBJ and Win32. It will
also output plain binary files. Its syntax is designed to be simple and easy to understand,
similar to Intel’s but less complex. It supports Pentium, P6 and MMX opcodes, and has
macro capability.
Normally you don’t have NASM on your system. Download it from:
http://sunsite.unc.edu/pub/Linux/devel/lang/assemblers/
GCC IA-32 COMPILER CORNELIS FRANK
int main () {
int i; /* declaration of an int */
i = 0x12345678; /* hexadecimal */
}
Compile this by typing:
gcc -c test.c
ld -o test -Ttext 0x0 -e main test.o
objcopy -R .note -R .comment -S -O binary test test.bin
After compiling, we get the following binary file:
gcc -c test.c
ld -o test.bin -Ttext 0x0 -e main --oformat binary test.o
This gives us the same binary file as before.
int i;
i = 0x12345678;
1 See also: Intel Architecture Software Developer’s Manual, Volume 1: Basic Architecture, 1.4.1. Bit and Byte
Order
into,
int i = 0x12345678;
we get exactly the same binary file. This is very important to notice as it is not so when we use
global variables.
Now the global variable is stored at our given address 0x1234. Thus, if we use the parameter
-Tdata with ld, we can specify the location of the data segment ourselves. Otherwise the data
segment is located right after the code. By storing the variable somewhere in the data memory, it
remains accessible even outside the main function. This is why int i is called a global variable.
We can also create the binary file directly using ld with the parameter --oformat binary.
gcc -c test.c
ld -o test.bin -Ttext 0x0 -e main -N --oformat binary test.o
We can see that there are some extra bytes at the end of our binary file. This is a read-only data
section aligned on 4 bytes which contains our global constant.
00000000 <main>:
0: 55 pushl %ebp
1: 89 e5 movl %esp,%ebp
3: c9 leave
4: c3 ret
Disassembly of section .data:
Disassembly of section .rodata:
00000000 <c>:
0: 78 56 js 58 <main+0x58>
2: 34 12 xorb $0x12,%al
We can clearly see the read-only data section containing our global constant c. Now take a look at
the next program:
int i = 0x12345678;
const int c = 0x12345678;
int main () {
}
00000000 <main>:
0: 55 pushl %ebp
1: 89 e5 movl %esp,%ebp
3: c9 leave
4: c3 ret
Disassembly of section .data:
00000000 <i>:
0: 78 56 js 58 <main+0x58>
2: 34 12 xorb $0x12,%al
Disassembly of section .rodata:
00000000 <c>:
0: 78 56 js 58 <main+0x58>
2: 34 12 xorb $0x12,%al
We can see our int i in the data section and our constant c in the read-only data section. So
ld automatically uses the data section to store global variables and the read-only data section to
store global constants.
5 Pointers
Now let’s see how GCC handles pointers to variables. For this we will use the following
program.
int main () {
int i;
int *p; /* a pointer to an integer */
p = &i; /* let pointer p point to integer i */
*p = 0x12345678; /* makes i = 0x12345678 */
}
This instruction will reserve 8 bytes on the stack for local variables. It seems that a pointer is
stored using 4 bytes. At this point the stack looks like figure 1. As you can see, the lea instruction
will load the effective address of int i. Next, this value is stored in int *p. After this, the
value of int *p is used as a pointer to a dword in which the value 0x12345678 is stored.
[Figure 1: the stack. int i occupies 4 bytes at ebp-0x4; int *p occupies 4 bytes at esp = ebp-0x8.]
6 Calling a function
Now let’s take a look at how GCC handles function calls. Take a look at the next example:
void f (); /* function prototype */
int main () {
f (); /* function call */
}
00000000 <main>:
0: 55 pushl %ebp
1: 89 e5 movl %esp,%ebp
3: e8 04 00 00 00 call c <f>
8: c9 leave
9: c3 ret
a: 89 f6 movl %esi,%esi
0000000c <f>:
c: 55 pushl %ebp
d: 89 e5 movl %esp,%ebp
f: c9 leave
10: c3 ret
Disassembly of section .data:
Again, this is very useful when you want to study the binary code that GCC creates. Notice that
the listing does not use the Intel syntax for displaying the instructions. It uses instruction
representations like pushl and movl. The l at the end of the instructions indicates that the instructions
perform operations on 32-bit (long) operands. Another important difference from Intel’s
syntax is that the order of the operands is reversed. For example, the instruction that moves the
data from register EBX to register EAX is written mov eax,ebx in Intel syntax and movl %ebx,%eax
in GNU syntax.
In Intel syntax the first operand is the destination and the second operand is the source.
7 Return codes
You probably noticed that I always use int main () as my function definition, but I never actually
return an int. So, let us try it.
int main () {
return 0x12345678;
}
printf (...);
While printf actually returns an int to the caller. Of course, the compiler can’t use this method if
the type of the return value is bigger than 4 bytes. In the next paragraph we will demonstrate
a situation in which this occurs.
typedef struct {
int a,b,c,d;
int i [10];
} MyDef;
Dissection of test.bin
At address 0x3 of the function main we see that the compiler reserves 0x38 bytes on the stack.
This is the size of the structure MyDef. At address 0x6 to 0x9 we see the solution to “the problem”.
Since MyDef is bigger than 4 bytes, the compiler passes a pointer to d to the function MyFunc at
address 0x14. This function can then use that pointer to fill up d with data. Please notice that
a parameter is being passed to the function MyFunc while this function actually doesn’t have any
parameters at all in its C function declaration. To fill the data structure, MyFunc uses a 32-bit data
movement instruction:
typedef struct {
int a,b,c,d;
int i [10];
} MyDef;
MyFunc ();
}
Dissection
This code shows us that, although there aren’t any local variables in the entry function main at
address 0x0, the function reserves some space on the stack for a variable of exactly 0x38 bytes in
size. Then a pointer to this data structure is passed to the function MyFunc at address 0x14,
just as in the previous example. Also notice that the function MyFunc hasn’t changed internally.
- The caller pushes the function’s parameters on the stack, one after another, in reverse order
  (right to left, so that the first argument specified to the function is pushed last).
- The caller then executes a near CALL instruction to pass control to the callee.
- The callee receives control, and typically (although this is not actually necessary, in
  functions which do not need to access their parameters) starts by saving the value of ESP in EBP
  so as to be able to use EBP as a base pointer to find its parameters on the stack. However, the
  caller was probably doing this too, so part of the calling convention states that EBP must be
  preserved by any C function. Hence the callee, if it is going to set up EBP as a frame pointer,
  must push the previous value first.
- The callee may then access its parameters relative to EBP. The doubleword at [EBP] holds
  the previous value of EBP as it was pushed; the next doubleword, at [EBP+4], holds the
  return address, pushed implicitly by CALL. The parameters start after that, at [EBP+8]. The
  leftmost parameter of the function, since it was pushed last, is accessible at this offset from
  EBP; the others follow, at successively greater offsets. Thus, in a function such as printf
  which takes a variable number of parameters, the pushing of the parameters in reverse order
  means that the function knows where to find its first parameter, which tells it the number and
  type of the remaining ones.
- The callee may also wish to decrease ESP further, so as to allocate space on the stack for
  local variables, which will then be accessible at negative offsets from EBP.
- The callee, if it wishes to return a value to the caller, should leave the value in AL, AX or EAX
  depending on the size of the value. Floating-point results are typically returned in ST0.
- Once the callee has finished processing, it restores ESP from EBP if it had allocated local
  stack space, then pops the previous value of EBP, and returns via RET (equivalently, RETN).
- When the caller regains control from the callee, the function parameters are still on the
  stack, so it typically adds an immediate constant to ESP to remove them (instead of
  executing a number of slow POP instructions). Thus, if a function is accidentally called with the
  wrong number of parameters due to a prototype mismatch, the stack will still be returned
  to a sensible state since the caller, which knows how many parameters it pushed, does the
  removing.
8.2 Dissection
So after the two bytes are pushed onto the stack, there is a call to the function f at address 0x1c.
This function first decreases esp by 4 bytes for local use. Next, the function makes local copies
of its function parameters. After that, a + b is calculated and returned in register eax.
10 Other statements
Of course we could also look at how GCC handles for loops, while loops, if-else statements
and case constructions, but this doesn’t really matter when you want to write them yourself. And
if you don’t want to write them yourself, it doesn’t matter either, since you don’t have to bother
about it.
First we will have a look at how the computer handles signed data types.
Range
unsigned    128  ...  255    0    1  ...  127
signed     -128  ...   -1    0    1  ...  127
Table 1: The two’s complement of a char
of the two’s complement notation is that you can calculate with negative numbers the same way as
with positive numbers.
11.2 Assignments
Here we will take a look at some C assignments and their result in assembly. The C program
used is shown below.
main () {
unsigned int i = 251;
}
int i = 251;
results in
int i = -5;
results in
Seems like signed and unsigned assignments are treated the same way.
main () {
char c = -5;
int i;
i = c;
}
Dissection
First we see at address 0x3 the reservation of 8 bytes on the stack for the local variables c and i.
The compiler takes 8 bytes to make it possible to align the integer i. Next we see that the char c
at [ebp-0x1] is being filled with 0xfb, which of course represents -5 (0xfb = 251, 251 - 256 =
-5). Notice also that the compiler uses [ebp-0x1] instead of [ebp-0x4]. This is because of the
little-endian representation. The next instruction, movsx, does the actual conversion from a signed char
to a signed integer. MOVSX sign-extends its source (second) operand to the length of its destination
(first) operand, and copies the result into the destination operand³. The last instruction (before
leave) then writes the signed integer stored in eax to int i.
main () {
char c;
int i = -5;
c = i;
}
Notice that the statement c = i only makes sense when the value in i is between -128 and 127,
because it has to be in the range of the signed char. Compilation results in the following binary file
Dissection
0xfffffffb is indeed -5. When we only look at the least significant byte 0xfb and we move this
to a signed char, we also get -5. So for the conversion from a signed int to a signed char we can
main () {
unsigned char c = 5;
unsigned int i;
i = c;
}
Dissection
We get the same binary file as for the conversion from signed char to signed int except for the
instruction at address 0xA. Here we have the instruction movzx. MOVZX zero-extends its source
(second) operand to the length of its destination (first) operand, and copies the result into the
destination operand.
main () {
unsigned char c;
unsigned int i = 251;
c = i;
}
Please notice again that the integer value is restricted to the range 0 to 255. This is because an
unsigned char can’t handle any bigger numbers. The accompanying binary file
Dissection
The actual conversion instruction, the mov instruction at address 0xD, is the same as for the con-
version from signed integers to signed chars.
main () {
int i = -5;
unsigned int u;
u = i;
}
The binary
Dissection
There is no specific conversion between signed and unsigned integers. The only difference is when
you perform operations on the integers. Signed integers will have to use instructions like idiv and
imul, whereas unsigned integers will use the unsigned versions of these instructions, div and mul.
- 32-bit mode, that is, protected mode with the 32-bit code flag enabled in the GDT or LDT.
- Segment registers CS, DS, ES, FS, GS and SS have to point to the same memory area
  (aliases).
- Because uninitialised global variables are stored “right” after the code, you have to keep a
  little area free. This area is called the BSS section. Notice that initialised global variables
  are stored in the DATA section in the binary file itself, right after the code section. Variables
  declared with const are stored in the RODATA (read-only) section, which is also part of the
  binary file itself.
- Make sure the stack can’t overwrite the code and global variables.
In the Intel documentation [2] this is referred to as the Basic Flat Model⁴. Don’t misunderstand this:
we don’t have to use the Basic Flat Model. As long as the C-compiled binary has its CS, DS and
SS pointing to the same memory area (using aliases), everything will work. So we can use the full
multisegment protected paging model as long as every C-compiled binary has its own local basic flat
memory model⁵.
int myVar = 5;
int main () {
}
gcc -c test.c
ld -Map memmap.txt -Ttext 0x0 -e main --oformat binary -N \
-o test.bin test.o
ndisasm -b 32 test.bin
As you can see, the variable myVar is stored at location 0x8. Now we have to get that address
from ld using its memory map file memmap.txt, which we created using the parameter -Map.
For this we use the command:
0x00000008
When we put this value in an environment variable (UNIX) MYVAR, we can use this to tell nasm
where to look for the global C variable myVar. Example:
...
mov ax,CProgramSelector
mov es,ax
mov eax,[TheValueThatMyVarShouldContain]
mov [es:MYVAR_ADDR],eax
...
0x0
We can pass this value the same way we did for the global variables.
#define va_rounded_size(type) \
(((sizeof (type) + sizeof (int) - 1) / sizeof (int)) * sizeof (int))
6 Source: A Book on C, fourth edition, A.10. Variable Arguments
#define va_start(ap, v) \
((void) (ap = (va_list) &v + va_rounded_size (v)))
#endif
In the macro va_start, the variable v is the last argument that is declared in the header of your
variable argument function definition. This variable cannot be of storage class register, and
it cannot be an array type or a type such as char that is widened by automatic conversions. The
macro va_start initializes the argument pointer ap. The macro va_arg accesses the next argument
in the list. The macro va_end performs any cleanup that may be required before function exit.
In the given implementation we’re using a macro va_rounded_size. This macro is needed since
the IA-32 aligns the stack (which is used to pass us the variables of a function) on 32-bit
boundaries, indicated by the expression sizeof (int). The macro va_start will let the argument
pointer ap point to the variable after the given (first) variable v. This macro doesn’t return anything
(indicated by the leading (void)).
[Figure: the stack on function entry, from higher to lower addresses: arg 1 (4 bytes); arg 0 (4 bytes,
at ebp + 0x8); eip (4 bytes, at ebp + 0x4); ebp (4 bytes, at ebp).]
The macro va_arg first increases the argument pointer ap by the size of the given type type.
After that it returns the next argument of type type on the stack (actually the previous argument,
since the argument pointer ap was increased first). At first sight this way of handling seems very
weird, but it’s the only way since we have to put the value we want to return at the end of a macro
definition, after the last comma.
Finally, the macro va_end will reset the argument pointer ap without returning anything.
References
[1] A Book on C
Programming in C, fourth edition
Addison-Wesley — ISBN 0-201-18399-4