Let's Build A Simple Interpreter. Part 1
"If you don't know how compilers work, then you don't know how computers work. If
you're not 100% sure whether you know how compilers work, then you don't know how
they work." - Steve Yegge
There you have it. Think about it. It doesn't really matter whether you're a newbie or a seasoned
software developer: if you don't know how compilers and interpreters work, then you don't know how
computers work. It's that simple.
So, do you know how compilers and interpreters work? And I mean, are you 100% sure that you know
how they work?
Do not worry. If you stick around, work through the series, and build an interpreter and a compiler
with me, you will know how they work in the end. And you will become a confident happy camper too.
var
    n: integer;
begin
    for n := 0 to 16 do
        writeln(n, '! = ', factorial(n));
end.
The implementation language of the Pascal interpreter will be Python, but you can use any language
you want because the ideas presented don't depend on any particular implementation language. Okay,
let's get down to business. Ready, set, go!
You will start your first foray into interpreters and compilers by writing a simple interpreter of
arithmetic expressions, also known as a calculator. Today the goal is pretty minimalistic: to make your
calculator handle the addition of two single-digit integers, like 3+5. Here is the source code for your
calculator, sorry, interpreter:
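# Token types
#
# EOF (end-of-file) token is used to indicate that
# there is no more input left for lexical analysis
INTEGER, PLUS, EOF = 'INTEGER', 'PLUS', 'EOF'


class Token(object):
    def __init__(self, type, value):
        # token type: INTEGER, PLUS, or EOF
        self.type = type
        # token value: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, '+', or None
        self.value = value

    def __str__(self):
        """String representation of the class instance.

        Examples:
            Token(INTEGER, 3)
            Token(PLUS, '+')
        """
        return 'Token({type}, {value})'.format(
            type=self.type,
            value=repr(self.value)
        )

    def __repr__(self):
        return self.__str__()


class Interpreter(object):
    def __init__(self, text):
        # client string input, e.g. "3+5"
        self.text = text
        # self.pos is an index into self.text
        self.pos = 0
        # current token instance
        self.current_token = None

    def error(self):
        raise Exception('Error parsing input')

    def get_next_token(self):
        """Lexical analyzer (also known as scanner or tokenizer)"""
        text = self.text
        # pos is past the end of the input: no more tokens, return EOF
        if self.pos > len(text) - 1:
            return Token(EOF, None)
        current_char = text[self.pos]
        # a digit becomes an INTEGER token holding its integer value
        if current_char.isdigit():
            self.pos += 1
            return Token(INTEGER, int(current_char))
        # a plus sign becomes a PLUS token
        if current_char == '+':
            self.pos += 1
            return Token(PLUS, current_char)
        self.error()

    def eat(self, token_type):
        # if the current token has the expected type, "eat" it by
        # moving on to the next token; otherwise raise an exception
        if self.current_token.type == token_type:
            self.current_token = self.get_next_token()
        else:
            self.error()

    def expr(self):
        """expr -> INTEGER PLUS INTEGER"""
        # set current token to the first token taken from the input
        self.current_token = self.get_next_token()
        # expect a single-digit integer
        left = self.current_token
        self.eat(INTEGER)
        # expect a '+' token
        self.eat(PLUS)
        # expect a single-digit integer
        right = self.current_token
        self.eat(INTEGER)
        # INTEGER PLUS INTEGER has been found: add the two values
        return left.value + right.value


def main():
    while True:
        try:
            # To run under Python3 replace 'raw_input' call
            # with 'input'
            text = raw_input('calc> ')
        except EOFError:
            break
        if not text:
            continue
        interpreter = Interpreter(text)
        result = interpreter.expr()
        print(result)


if __name__ == '__main__':
    main()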
Save the above code into a file named calc1.py or download it directly from GitHub. Before you start
digging deeper into the code, run the calculator on the command line and see it in action. Play with it!
Here is a sample session on my laptop (if you want to run the calculator under Python3 you will need to
replace raw_input with input):
$ python calc1.py
calc> 3+4
7
calc> 3+5
8
calc> 3+9
12
calc>
For your simple calculator to work properly without throwing an exception, your input needs to follow
certain rules:
- Only single-digit integers are allowed in the input
- The only arithmetic operation supported at the moment is addition
- No whitespace characters are allowed anywhere in the input
Those restrictions are necessary to make the calculator simple. Don't worry, you'll make it pretty
complex pretty soon.
Okay, now let's dive in and see how your interpreter works and how it evaluates arithmetic expressions.
When you enter the expression 3+5 on the command line, your interpreter gets the string "3+5". In order
for the interpreter to actually understand what to do with that string, it first needs to break the input
"3+5" into components called tokens. A token is an object that has a type and a value. For example,
for the string "3" the type of the token will be INTEGER and the corresponding value will be the
integer 3.
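As a minimal illustration of the type/value idea (this is not the code from calc1.py), a token can be modeled as a plain pair. The name SimpleToken is hypothetical and used only for this sketch:

```python
from collections import namedtuple

# Hypothetical stand-in for the article's Token class: a type plus a value.
SimpleToken = namedtuple('SimpleToken', ['type', 'value'])

# The character '3' becomes an INTEGER token whose value is the integer 3.
three = SimpleToken(type='INTEGER', value=int('3'))

# The character '+' becomes a PLUS token whose value is the character itself.
plus = SimpleToken(type='PLUS', value='+')

print(three)  # SimpleToken(type='INTEGER', value=3)
print(plus)   # SimpleToken(type='PLUS', value='+')
```

Note that the value of the INTEGER token is a real integer, not the one-character string it was read from. That conversion is what later lets the interpreter add two token values directly.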
The process of breaking the input string into tokens is called lexical analysis. So, the first step your
interpreter needs to do is read the input of characters and convert it into a stream of tokens. The part of
the interpreter that does it is called a lexical analyzer, or lexer for short. You might also encounter
other names for the same component, like scanner or tokenizer. They all mean the same: the part of
your interpreter or compiler that turns the input of characters into a stream of tokens.
The method get_next_token of the Interpreter class is your lexical analyzer. Every time you call it, you
get the next token created from the input of characters passed to the interpreter. Let's take a closer look
at the method itself and see how it actually does its job of converting characters into tokens. The input
is stored in the variable text that holds the input string, and pos is an index into that string (think of the
string as an array of characters). pos is initially set to 0 and points to the character '3'. The method first
checks whether the character is a digit and, if so, it increments pos and returns a token instance with the
type INTEGER and the value set to the integer value of the string '3', which is the integer 3.
The pos now points to the '+' character in the text. The next time you call the method, it first tests
whether the character at the position pos is a digit (it is not) and then tests whether the character is a
plus sign, which it is. As a result the method increments pos and returns a newly created token with the
type PLUS and value '+'.
The pos now points to the character '5'. When you call the get_next_token method again, the method
checks if it's a digit, which it is, so it increments pos and returns a new INTEGER token with the value
of the token set to the integer 5.
Because the pos index is now past the end of the string "3+5", the get_next_token method returns the
EOF token every time you call it from now on.
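The walk-through above can be sketched as a small standalone function. This is an illustration under stated assumptions, not the method from calc1.py: the helper names next_token and trace are hypothetical, tokens are plain (type, value) tuples, and text and pos are passed as arguments, whereas the Interpreter class keeps them as attributes.

```python
INTEGER, PLUS, EOF = 'INTEGER', 'PLUS', 'EOF'


def next_token(text, pos):
    """Return ((type, value), new_pos) for the character at text[pos].

    Hypothetical helper for illustration only.
    """
    if pos > len(text) - 1:
        # Past the end of the input: report EOF and leave pos unchanged,
        # so every later call reports EOF again.
        return (EOF, None), pos
    char = text[pos]
    if char.isdigit():
        # A digit: build an INTEGER token and move past the digit.
        return (INTEGER, int(char)), pos + 1
    if char == '+':
        # A plus sign: build a PLUS token and move past it.
        return (PLUS, char), pos + 1
    # Anything else (whitespace, letters, other operators) is an error.
    raise Exception('Error parsing input')


def trace(text):
    """Collect tokens from text until the first EOF token (inclusive)."""
    tokens = []
    pos = 0
    while True:
        token, pos = next_token(text, pos)
        tokens.append(token)
        if token[0] == EOF:
            return tokens


print(trace('3+5'))
# [('INTEGER', 3), ('PLUS', '+'), ('INTEGER', 5), ('EOF', None)]
```

Returning the new position alongside the token makes the "pos now points to..." steps explicit: each successful call moves pos forward by exactly one character, and the EOF case does not move it at all.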
Try it out and see for yourself how the lexer component of your calculator works:
>>> from calc1 import Interpreter
>>>
>>> interpreter = Interpreter('3+5')
>>> interpreter.get_next_token()
Token(INTEGER, 3)
>>>
>>> interpreter.get_next_token()
Token(PLUS, '+')
>>>
>>> interpreter.get_next_token()
Token(INTEGER, 5)
>>>
>>> interpreter.get_next_token()
Token(EOF, None)
>>>
So now that your interpreter has access to the stream of tokens made from the input characters, the
interpreter needs to do something with it: it needs to find the structure in the flat stream of tokens it gets
from the lexer get_next_token. Your interpreter expects to find the following structure in that stream:
INTEGER -> PLUS -> INTEGER. That is, it tries to find a sequence of tokens: integer followed by a
plus sign followed by an integer.
The method responsible for finding and interpreting that structure is expr. This method verifies that the
sequence of tokens does indeed correspond to the expected sequence of tokens, i.e., INTEGER -> PLUS
-> INTEGER. After it has successfully confirmed the structure, it generates the result by adding the value
of the token on the left side of the PLUS and the value of the token on the right side of the PLUS, thus
successfully interpreting the arithmetic expression you passed to the interpreter.
The expr method itself uses the helper method eat to verify that the token type passed to the eat method
matches the current token type. After matching the passed token type, the eat method gets the next
token and assigns it to the current_token variable, thus effectively "eating" the currently matched token
and advancing the imaginary pointer in the stream of tokens. If the structure in the stream of tokens
doesn't correspond to the expected INTEGER PLUS INTEGER sequence of tokens, the eat method
throws an exception.
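Here is a minimal sketch of that match-and-advance idea, separated from the lexer. It assumes the tokens have already been collected into a list of (type, value) tuples ending with an EOF token; the TokenStream class is hypothetical, and unlike the eat method described above, this eat also returns the value of the token it consumed to keep the example short.

```python
INTEGER, PLUS, EOF = 'INTEGER', 'PLUS', 'EOF'


class TokenStream(object):
    """Hypothetical helper: walks a pre-built list of (type, value) tuples."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.index = 0
        self.current = tokens[0]

    def eat(self, token_type):
        # Accept the current token only if it has the expected type.
        if self.current[0] != token_type:
            raise Exception('Error parsing input')
        value = self.current[1]
        # Advance the pointer, but never move past the last token (EOF).
        if self.index < len(self.tokens) - 1:
            self.index += 1
        self.current = self.tokens[self.index]
        return value


def expr(tokens):
    """expr -> INTEGER PLUS INTEGER"""
    stream = TokenStream(tokens)
    left = stream.eat(INTEGER)
    stream.eat(PLUS)
    right = stream.eat(INTEGER)
    return left + right


good = [(INTEGER, 3), (PLUS, '+'), (INTEGER, 5), (EOF, None)]
print(expr(good))  # 8

bad = [(INTEGER, 3), (INTEGER, 5), (EOF, None)]
try:
    expr(bad)
except Exception as exc:
    print(exc)  # Error parsing input
```

The second input fails at the second eat call: the structure check expects a PLUS token there, finds an INTEGER instead, and raises before any addition happens.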
Let's recap what your interpreter does to evaluate an arithmetic expression:
- The interpreter accepts an input string, let's say "3+5"
- The interpreter calls the expr method to find a structure in the stream of tokens returned by the
  lexical analyzer get_next_token. The structure it tries to find is of the form INTEGER PLUS
  INTEGER. After it has confirmed the structure, it interprets the input by adding the values of the
  two INTEGER tokens, because it's clear to the interpreter at that point that what it needs to do is
  add two integers, 3 and 5.
Congratulate yourself. You've just learned how to build your very first interpreter!
Now it's time for exercises.
You didn't think you would just read this article and that would be enough, did you? Okay, get your
hands dirty and do the following exercises:
1. Modify the code to allow multiple-digit integers in the input, for example 12+3
2. Add a method that skips whitespace characters so that your calculator can handle inputs with
whitespace characters like 12 + 3
3. Modify the code and instead of + handle - to evaluate subtractions like 7-5
Check your understanding
1. What is an interpreter?
2. What is a compiler?
3. What's the difference between an interpreter and a compiler?
4. What is a token?
5. What is the name of the process that breaks input apart into tokens?
6. What is the part of the interpreter that does lexical analysis called?
7. What are the other common names for that part of an interpreter or a compiler?
Before I finish this article, I really want you to commit to studying interpreters and compilers. And I
want you to do it right now. Don't put it on the back burner. Don't wait. If you've skimmed the article,
start over. If you've read it carefully but haven't done the exercises, do them now. If you've done only
some of them, finish the rest. You get the idea. And you know what? Sign the commitment pledge to
start learning about interpreters and compilers today!
I, ________, being of sound mind and body, do hereby pledge to commit to studying interpreters and
compilers starting today and get to a point where I know 100% how they work!
Signature:
Date:
Sign it, date it, and put it somewhere where you can see it every day to make sure that you stick to your
commitment. And keep in mind the definition of commitment:
"Commitment is doing the thing you said you were going to do long after the mood you
said it in has left you." - Darren Hardy
Okay, that's it for today. In the next article of the mini-series you will extend your calculator to handle
more arithmetic expressions. Stay tuned.