Unit-1,2
The document provides an overview of compilers, detailing the roles of translators, compilers, and interpreters in converting source code from one programming language to another. It outlines the phases of compilation, including lexical analysis, syntax analysis, semantic analysis, intermediate code generation, code optimization, and code generation, along with the importance of symbol table management and error handling. Additionally, it discusses the grouping of compiler phases into front end and back end, as well as tools used in compiler construction.
INTRODUCTION TO COMPILERS
1.0 Translators
A translator in a programming language is a way of converting a program written in a given programming language into a functionally equivalent program in a different computer language.
Compiler:
A compiler is a program that reads a program written in one language (the source language) and translates it into an equivalent program in another language (the target language).

Fig. 1.1 Compiler (source program → compiler → target program; the compiler also reports error messages)
The source program may be written in any programming language such as Fortran, C, or C++. The target program may be in another programming language or in the machine language of any computer.
Interpreter:
An interpreter is also a translator which converts the source program into a target program. It is a program in which the code is directly interpreted by microcode residing in the control memory of a machine, executing the code line by line. This microcode generates the control signals for execution. The target program from a compiler is translated into assembly code, which is then translated by an assembler into machine code.
A linker in turn links the library routines with the machine code, which is then loaded into primary memory using a loader.
Fig. 1.2 A language processing system: source program with preprocessor directives → preprocessor → source program → compiler → target assembly program → assembler → relocatable machine code → loader/linker (linking library routines and relocatable object files) → absolute machine code

The input to a compiler may be produced by one or more preprocessors, and further processing of the compiler's output may be needed before running machine code. The tools that make up the context in which the compiler typically operates are:
+ Preprocessor
+ Assembler
+ Loader and Link Editor
The compilation tool chain:
[Preprocessor] → [Compiler] → [Assembler] → [Linker] → [Loader]
Preprocessor:
A preprocessor in a program that processes its input data to produce output that
ram. The output is said to be a preprocessed form of
is used as input to another pre
the input data, whieh is often used by some subsequent programs like compilers.
They may perform the following functions:
i) Macro processing
ii) File Inclusion
iii) Rational Preprocessors
iv) Language extension
i) Macro processing
A macro is a rule or pattern that specifies how a certain input sequence should be mapped to an output sequence according to a defined procedure. The mapping process that instantiates a macro into a specific output sequence is known as macro expansion.
ii) File Inclusion
The preprocessor includes header files into the program text. When the preprocessor finds an #include directive, it replaces it by the entire content of the specified file, as sketched below.
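A minimal sketch of both functions; the header name area.h and the macro CIRCLE_AREA are invented for illustration:

/* area.h -- a header pulled in by file inclusion */
#define PI 3.14159
#define CIRCLE_AREA(r) (PI * (r) * (r))   /* function-like macro */

/* main.c */
#include <stdio.h>
#include "area.h"    /* the preprocessor replaces this line with area.h's text */

int main(void) {
    /* macro expansion: CIRCLE_AREA(2.0) becomes (3.14159 * (2.0) * (2.0)) */
    printf("%f\n", CIRCLE_AREA(2.0));
    return 0;
}

Running only the preprocessor (for example, cc -E main.c) shows the expanded text that the rest of the compiler actually sees.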
These processors augment older languages with more modern flow-of-control and data-structuring facilities.
iv) Language extension
These processors attempt to add capabilities to the language by what amounts to built-in macros. For example, the language Equel is a database query language embedded in C.
Assembler:
It converts assembly language into machine language. Typically a modern assembler creates object code by translating assembly instruction mnemonics into opcodes, and by resolving symbolic names for memory locations and other entities. There are two types of assemblers, based on how many passes through the source are needed to produce the executable program:
i) One-pass
ii) Two-pass
+ A one-pass assembler goes through the source code once and assumes that all symbols will be defined before any instruction that references them.
+ Two-pass assemblers create a table with all symbols and their values in the first pass, then use the table in a second pass to generate code. A toy sketch of the two-pass idea follows.
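A minimal sketch of the two-pass idea, assuming a toy assembly in which each line is either a label definition or a one-word jump to a label (the instruction format and all names are invented for illustration):

#include <stdio.h>
#include <string.h>

/* Toy source: labels end with ':', "JMP name" references a label,
   possibly before it is defined (a forward reference). */
static const char *src[] = { "JMP end", "start:", "JMP start", "end:" };
enum { N = 4, MAXSYM = 16 };

static char names[MAXSYM][32];   /* toy limits: short names, few symbols */
static int  addrs[MAXSYM];
static int  nsym = 0;

static int lookup(const char *name) {
    for (int i = 0; i < nsym; i++)
        if (strcmp(names[i], name) == 0) return addrs[i];
    return -1;
}

int main(void) {
    int pc = 0;
    /* Pass 1: record the address of every label; emit nothing. */
    for (int i = 0; i < N; i++) {
        size_t len = strlen(src[i]);
        if (src[i][len - 1] == ':') {           /* label definition */
            strncpy(names[nsym], src[i], len - 1);
            names[nsym][len - 1] = '\0';
            addrs[nsym++] = pc;
        } else {
            pc++;                               /* instruction takes one word */
        }
    }
    /* Pass 2: every symbol is now known, so forward references resolve. */
    pc = 0;
    for (int i = 0; i < N; i++) {
        if (src[i][strlen(src[i]) - 1] == ':') continue;
        printf("%02d: JMP %d\n", pc++, lookup(src[i] + 4));
    }
    return 0;
}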
Loader and Link Editor:
The process of loading consists of taking relocatable machine code, altering the relocatable addresses, and placing the altered instructions and data in memory at the proper locations.
A linker or link editor is a program that takes one or more objects generated by a compiler and combines them into a single executable program.

1.3 PHASES OF A COMPILER
In each phase of the compiler, the source program is transformed from one representation to another.
The six phases of the compiler are lexical analysis, syntax analysis, semantic analysis, intermediate code generation, code optimization and code generation.
Two other activities, symbol-table management and error handling, interact with all six phases of the compiler.
Consider the statement
total = balance + i * 50;
1.3.1 Lexical Analysis
The lexical analysis phase reads the characters in the source program and groups them into a stream of tokens, in which each token represents a logically cohesive sequence of characters such as an identifier, a keyword, a punctuation symbol, or a multi-character operator like := (in the PASCAL programming language).
This phase is also called the scanner. The character sequence forming a token is called a lexeme; e.g., balance is a lexeme.
total = balance + i * 50;
This code can be represented in lexical form as
id1 = id2 + id3 * 50

Fig. 1.4 Translation of total = balance + i * 50 through the phases:

total = balance + i * 50
    → lexical analyzer
id1 = id2 + id3 * 50
    → syntax analyzer
(syntax tree: id1 = id2 + (id3 * 50))
    → semantic analyzer
(the integer 50 is converted: inttoreal(50) = 50.0)
    → intermediate code generator
temp1 := inttoreal(50)
temp2 := id3 * temp1
temp3 := id2 + temp2
id1 := temp3
    → code optimizer
temp1 := id3 * 50.0
id1 := id2 + temp1
    → code generator
MOVF id3, R2
MULF #50.0, R2
MOVF id2, R1
ADDF R2, R1
MOVF R1, id1

1.3.2 Syntax Analysis
In this phase, the scanned input symbols, called tokens, are hierarchically structured to generate a valid syntax tree.

Fig. 1.5 Syntax tree

1.3.3 Semantic Analysis
In this phase, the compiler performs type checking and attaches rules or actions to the grammar, which is then converted into a form suitable for intermediate code generation.

Fig. 1.7 Semantic tree (the integer operand 50 is wrapped by inttoreal, converting it to 50.0)
1.3.4 Intermediate Code Generation
In this phase, the compiler receives valid syntactic constructs and converts them into an intermediate form, which may be a syntax tree, postfix notation or three-address code.
Three-address code looks like assembly language in which every memory location can act like a register. Three-address code consists of a sequence of instructions, each of which has at most three operands.
The three-address code generated for the syntax tree is as follows:
temp1 := inttoreal(50)
temp2 := id3 * temp1
temp3 := id2 + temp2
id1 := temp3
Before starting the conversion, the compiler has to decide the order in which the operations are to be done.
The intermediate form has the following properties:
i) Each three-address instruction has at most one operator in addition to the assignment operator.
ii) The compiler must generate a temporary name to hold the value computed by each instruction.
iii) Some instructions have fewer than three operands, e.g., id1 := temp3. A compact way to store such instructions is sketched below.
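A minimal sketch of how such instructions might be stored as quadruples (operator, two arguments, result); the structure and field names are invented for illustration:

#include <stdio.h>

/* One three-address instruction stored as a quadruple:
   (operator, argument 1, argument 2, result). */
typedef struct {
    const char *op;
    const char *arg1;
    const char *arg2;   /* empty for unary operators and plain copies */
    const char *result;
} Quad;

int main(void) {
    /* The sequence generated above for id1 := id2 + id3 * 50 */
    Quad code[] = {
        { "inttoreal", "50",    "",      "temp1" },
        { "*",         "id3",   "temp1", "temp2" },
        { "+",         "id2",   "temp2", "temp3" },
        { ":=",        "temp3", "",      "id1"   },
    };
    for (int i = 0; i < 4; i++)
        printf("(%s, %s, %s, %s)\n",
               code[i].op, code[i].arg1, code[i].arg2, code[i].result);
    return 0;
}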
1.3.5 Code Optimization
In order to improve the intermediate code so that it runs faster, the code optimizer performs code-improving transformations like redundant code elimination, unreachable code elimination, etc. It is an optional phase.
The optimized code for the intermediate code is
temp1 := id3 * 50.0
id1 := id2 + temp1
1.3.6 Code Generation
The final phase of the compiler is target code generation. The target code may be either relocatable machine code or assembly code. The memory location for each variable has to be selected by the compiler in this phase.
A complex task here is to assign variables to registers. The target code generated for this is as follows:
MOVF id3, R2      ; R2 := id3
MULF #50.0, R2    ; R2 := 50.0 * R2
MOVF id2, R1      ; R1 := id2
ADDF R2, R1       ; R1 := R1 + R2
MOVF R1, id1      ; id1 := R1
All these phases work in association with symbol-table management and error handling.
1.3.7 Symbol-Table Management
A symbol table is a data structure maintained by the compiler to record the identifiers used in the source program and to collect information about various attributes of each identifier.
The attributes provide information about
i) the storage allocated for an identifier,
ii) its type,
iii) its scope (where in the program it is valid),
iv) the number and types of arguments used in procedures.
Using this symbol table, an identifier can be fetched quickly or stored easily.
The code generator enters and uses detailed information about the storage assigned to identifiers. A minimal symbol-table sketch follows.
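A minimal sketch of such a table using open hashing, assuming only a name and a type attribute per entry (all names are invented for illustration; strdup is POSIX):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 64

typedef struct Symbol {
    char *name;
    char *type;            /* e.g. "int", "real" */
    struct Symbol *next;   /* chain of entries that hash alike */
} Symbol;

static Symbol *table[NBUCKETS];

static unsigned hash(const char *s) {
    unsigned h = 0;
    while (*s) h = h * 31 + (unsigned char)*s++;
    return h % NBUCKETS;
}

/* Store: insert at the head of the bucket's chain. */
void put(const char *name, const char *type) {
    Symbol *sym = malloc(sizeof *sym);
    sym->name = strdup(name);
    sym->type = strdup(type);
    sym->next = table[hash(name)];
    table[hash(name)] = sym;
}

/* Fetch: walk one short chain, so lookups stay fast. */
Symbol *get(const char *name) {
    for (Symbol *s = table[hash(name)]; s; s = s->next)
        if (strcmp(s->name, name) == 0) return s;
    return NULL;
}

int main(void) {
    put("total", "real");
    put("balance", "real");
    Symbol *s = get("total");
    if (s) printf("%s : %s\n", s->name, s->type);
    return 0;
}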
1.3.8 Error Handling
Each phase of the compiler can encounter errors. Each error is handled by the error-handling mechanism of the respective phase.
The syntax analysis phase detects errors in which the token stream violates the structure rules, or syntax rules, of the language.
The semantic analysis phase detects constructs that have the right syntactic structure but no meaning for the operation involved (e.g., adding an array name and a procedure name).

Fig. Phases of a compiler: source program → lexical analyzer → syntax analyzer → semantic analyzer (analysis phase) → intermediate code generator → code optimizer → code generator (synthesis phase) → target program, with symbol-table management and error detection and handling interacting with every phase.

1.6 GROUPING OF PHASES
Depending on the relationship between phases, the phases are grouped together as front end and back end.
1.6.1 Front and Back Ends
The front end consists of those phases or parts of phases which depend primarily on the source language and are largely independent of the target machine.
This includes phases like:
i) Lexical analysis
ii) Syntactic analysis
iii) Symbol table creation
iv) Semantic analysis
v) Generation of intermediate code
together with the error handling that goes along with them.
Some part of the code optimization can be performed at the front end as well.
The back end includes those phases of the compiler which depend mainly on the target machine and not on the source language:
i) Part of the code optimization phase
ii) Code generation
1.6.2 Passes
All the phases of a compiler are implemented using either a single-pass or a multi-pass compiler, where a pass consists of reading an input file and writing an output file.
It is common for several phases to be grouped into one pass and for the activity of these phases to be interleaved during the pass.
Compiling involves performing a lot of work, and early computers did not have enough memory to contain one program that performed all the work.
Front End and Back End
The different phases of the compiler are grouped into two parts. The first three phases are grouped into the front end; the last two phases are grouped into the back end. In between these two groups the intermediate code generator is placed.

Fig. 1.12 Grouping of phases (front end: scanner → parser → semantic analyzer; intermediate code generator in between; back end: code optimizer → code generator)
Front End:
+ The front end is dependent on the source language but independent of the target language.
+ The phases of analysis (machine independent) are grouped as the front end.
+ These normally include lexical and syntactic analysis, the creation of the symbol table, semantic analysis and the generation of intermediate code. It also includes the error handling that goes along with each of these phases.
Back End:
+ The back end is dependent on the target language but independent of the source language.
+ The phases of synthesis (machine dependent) are grouped as the back end.
+ It includes the code optimization phase and code generation along with the necessary error handling and symbol table operations.
Advantages of Grouping of Phases
+ By keeping the same front end and attaching different back ends, we can produce compilers for the same source language on different machines.
+ By keeping different front ends and the same back end, we can compile several different languages on the same machine.
A pass is one complete scan of the source language, i.e., reading one input file and writing an output file. A collection of phases is done only once (single pass) or multiple times (multi-pass).
+ Single pass: usually requires everything to be defined before being used in the source program.
+ Multi-pass: the compiler may have to keep the entire program representation in memory.
Several phases can be grouped into a single pass and the activities of these phases are interleaved during the pass. For example, lexical analysis, syntax analysis, semantic analysis and intermediate code generation might be grouped into one pass.
Compilation of the source language is done in six phases, whereas in a pass the phases are grouped logically.
1.9 COMPILER CONSTRUCTION TOOLS
Compiler writers use software tools such as debuggers, version managers, profilers, and so on. The following is a list of some useful compiler-construction tools:
1. Scanner generators
2. Parser generators
3. Syntax-directed translation engines
4. Automatic code generators
5. Data-flow engines
Scanner Generator
+ These generate lexical analyzers, normally from a specification based on regular expressions.
+ The basic organization of the resulting lexical analyzer is based on finite automata.
+ LEX is a scanner generator used in compilers.
Parser Generator
+ These produce syntax analyzers, normally from input that is based on a context-free grammar.
+ Syntax analysis consumes a large fraction of the running time of a compiler.
+ Example: YACC (Yet Another Compiler-Compiler).
Syntax-directed Translation Engines
+ These produce routines that walk the parse tree and as a result generate intermediate code.
+ Each translation is defined in terms of translations at its neighbor nodes in the tree.
+ This construction tool converts the parse tree into intermediate code as output.
Automatic Code Generator
+ It takes a collection of rules to translate intermediate language into machine language. The rules must include sufficient detail to handle different possible access methods for data. It converts the intermediate code into assembly language.
Data-Flow Engine
+ It does code optimization using data-flow analysis, that is, the gathering of information about how values are transmitted from one part of a program to each other part. It converts the assembly code into optimized assembly code.

1.4 Errors Encountered in Different Phases
A good compiler should detect errors early and should have good diagnostic facilities. A complex compiler should repair the error and transform the erroneous input into a normal input so that the compiler can resume its operations.
Errors are encountered in all the phases of a compiler. For example, the errors that are encountered in each of the phases are as follows:
i) Lexical analyzer - misspelling of tokens
ii) Syntax analyzer - missing parenthesis
iii) Intermediate code generator - incompatible types
iv) Code optimizer - unreachable statements
v) Code generator - compiler-created constants are incompatible with target memory.
A good error-diagnostic routine will reduce debugging and maintenance activities. It should possess the following properties:
i) The error messages should specify the error in the source program in a way that is understandable by the user.
ii) The error messages should be specific and clear, e.g., "Missing right parenthesis in line 4."
iii) The error message should locate the problem line with the corresponding message, e.g., "rate not declared."
iv) The error messages should not be redundant.
1.4.1 Sources of Error
There is no simple way of classifying programming errors. A simple scheme is classifying errors according to how they are introduced. A program being compiled may encounter different errors in different stages of compilation. An error may be:
i) Due to inconsistency in the design specifications.
ii) Due to inadequate or incorrect design, called 'algorithmic errors'.
iii) Due to logical errors while implementing the programs using programming languages.
iv) Due to keypunching or transcription errors.
v) Due to exceeding a compiler or machine limit (e.g., an array may be too large to be allocated at run time).
vi) Due to the compiler during translation.
Some examples are listed below:
i) Insertion of an extraneous character or token.
ii) Deletion of a required character or token.
iii) Replacement of a correct character or token by an incorrect character or token.
iv) Transposition of two adjacent characters or tokens.
Normally the compiler designer classifies errors as either syntactic or semantic. A syntactic error is an error detectable by the lexical or syntactic phase of the compiler. Other kinds of errors are called semantic errors.
1.4.2 Syntactic Errors
These errors are identified in the syntax analysis phase.
i) Missing right parenthesis
x = (x + y) * (y + z
ii) Extraneous comma (in a Fortran program)
DO 20, I = 1, 200
iii) Colon in place of semicolon
iv) Misspelled keyword
wihle (x > y)
v) Extra blank
/ * Area of circle * /
From these examples, it is easy to understand the type of error and the position of the error.
1.4.3 Semantic Errors
These types of errors can be detected both at compile time and at run time.
Most of the semantic errors detected at compile time are errors of declaration and scope, e.g.:
i) Undeclared identifier.
ii) Type incompatibility between operators and operands.
iii) Type incompatibility between formal and actual arguments.
Depending upon the language used, proper type checking needs to be performed. Some languages support automatic type conversions.
1.4.4 Errors in each phases of Compiler
All the phases ofa compiler should follow certain specifications for
proper execution of the source code. If it does not complaint with the
specification, it encounters an error. A good compiler should recover from
cach error and continue processin, g the input. The recovery routines detect
the error and print the error diagnostic messages. The diagnostic routines
communicate with the symbol table to avoid redundant messages.
Ifa particular phase transmits errors without repair, then all the }
Subsequent phases deal with the
errorenous inputs passed. The errors —
can be classifi
ied i i
broadly as lexical error, syntactic error or semantic error.11.2 APPROACHES TO COMPILER DEVELOPMENT
There are several general approaches that a compiler writer can adopt to implement a compiler. The simplest is to retarget or rehost an existing compiler. If there is no suitable existing compiler, the compiler writer might adopt the organization of a known compiler for a similar language and implement the corresponding components, using component-generation tools or implementing them by hand. It is relatively rare that a completely new compiler organization is required.
No matter what approach is adopted, compiler writing is an exercise in software engineering. Lessons from other software efforts (see for example Brooks [1975]) can be applied to enhance the reliability and maintainability of the final product. A design that readily accommodates change will allow the compiler to evolve with the language. The use of compiler-building tools can be a significant help in this regard.
Bootstrapping
A compiler is a complex enough program that we would like to write it in a friendlier language than assembly language. In the UNIX programming environment, compilers are usually written in C. Even C compilers are written in C. Using the facilities offered by a language to compile itself is the essence of bootstrapping. Here we shall look at the use of bootstrapping to create compilers and to move them from one machine to another by modifying the back end. The basic ideas of bootstrapping have been known since the mid 1950's (Strong et al. [1958]).
Bootstrapping may raise the question, "How was the first compiler compiled?" which sounds like, "What came first, the chicken or the egg?" but is easier to answer. For an answer we consider how Lisp became a programming language. McCarthy [1981] notes that in late 1958 Lisp was used as a notation for writing functions; they were then hand-translated into assembly language and run. The implementation of an interpreter for Lisp occurred unexpectedly. McCarthy wanted to show that Lisp was a notation for describing functions "much neater than Turing machines or the general recursive definitions used in recursive function theory," so he wrote a function eval[e, a] in Lisp that took a Lisp expression e as an argument. S. R. Russell noticed that eval could serve as an interpreter for Lisp, hand-coded it, and thus created a programming language with an interpreter. As mentioned in Section 1.1, rather than generating target code, an interpreter actually performs the operations of the source program.
For bootstrapping purposes, a compiler is characterized by three languages: the source language S that it compiles, the target language T that it generates code for, and the implementation language I that it is written in. We represent the three languages using a diagram called a T-diagram, because of its shape (Bratman [1961]). Within text, we abbreviate the T-diagram as S I T, writing the implementation language between the source and target languages. The three languages S, I, and T may all be quite different. For example, a compiler may run on one machine and produce target code for another machine. Such a compiler is often called a cross-compiler.
Suppose we write a cross-compiler for a new language L in implementation language S to generate code for machine N; that is, we create L S N. If an existing compiler for S runs on machine M and generates code for M, it is characterized by S M M. If L S N is run through S M M, we get a compiler L M N, that is, a compiler from L to N that runs on M. This process is illustrated in Fig. 11.1 by putting together the T-diagrams for these compilers.
Fig. 11.1 Compiling a compiler.
When T-diagrams are put together as in Fig. 11.1, note that the implementation language S of the compiler L S N must be the same as the source language of the existing compiler S M M, and that the target language M of the existing compiler must be the same as the implementation language of the translated form L M N. A trio of T-diagrams such as Fig. 11.1 can be thought of as an equation:
L S N + S M M = L M N
Example 11.1. The first version of the EQN compiler (see Section 12.1) had C as the implementation language and generated commands for the text formatter TROFF. As shown in the following diagram, a cross-compiler for EQN, running on a PDP-11, was obtained by running EQN C TROFF through the C compiler C 11 11 on the PDP-11.

2.4 SPECIFICATION OF TOKENS
We can specify tokens in the following ways:
1. Strings
2. Languages
3. Regular expressions
Strings and Languages
+ An alphabet or character class is a finite set of symbols.
+ A string over an alphabet is a finite sequence of symbols drawn from that alphabet.
+ The empty string, denoted ε, has length zero.
+ A language is any countable set of strings over some fixed alphabet.
In language theory, the terms "sentence" and "word" are often used as synonyms for "string." The length of a string s, usually written |s|, is the number of occurrences of symbols in s. For example, compiler is a string of length eight; the empty string ε is the string of length zero.
Operations on Strings
The following terms relate to the parts of a string.
A prefix of string s is any string obtained by removing zero or more symbols from the end of s. For example, com is a prefix of compiler.
A suffix of string s is any string obtained by removing zero or more symbols from the beginning of s. For example, piler is a suffix of compiler.
A substring of s is obtained by deleting any prefix and any suffix from s. For example, pil is a substring of compiler.
The proper prefixes, suffixes, and substrings of a string s are those prefixes, suffixes, and substrings, respectively, of s that are not ε and not equal to s itself.
A subsequence of s is any string formed by deleting zero or more not necessarily consecutive symbols of s.
Operations on Languages
The following are the operations that can be applied to languages:
1. Union
2. Concatenation
3. Kleene closure
4. Positive closure
The following example shows the operations on languages:
Let L = {0, 1} and S = {a, b, c}
1. Union: L ∪ S = {0, 1, a, b, c}
2. Concatenation: L.S = {0a, 1a, 0b, 1b, 0c, 1c}
3. Kleene closure: L* = {ε, 0, 1, 00, 01, ...} (zero or more concatenations of L)
4. Positive closure: L+ = {0, 1, 00, 01, ...} (one or more concatenations of L)
Regular Expressions
+ Each regular expression r denotes a language L(r).
+ Here are the rules that define the regular expressions over some alphabet Σ and the languages that those expressions denote:
1. ε is a regular expression, and L(ε) is {ε}, that is, the language whose sole member is the empty string.
2. If 'a' is a symbol in Σ, then 'a' is a regular expression, and L(a) = {a}, that is, the language with one string, of length one, with 'a' in its one position.
3. Suppose r and s are regular expressions denoting the languages L(r) and L(s). Then,
a. (r)|(s) is a regular expression denoting the language L(r) ∪ L(s).
b. (r)(s) is a regular expression denoting the language L(r)L(s).
c. (r)* is a regular expression denoting (L(r))*.
d. (r) is a regular expression denoting L(r).
4. The unary operator * has highest precedence and is left associative.
5. Concatenation has second highest precedence and is left associative.
6. | has lowest precedence and is left associative.
AXIOM                                  DESCRIPTION
r | s = s | r                          | is commutative
r | (s | t) = (r | s) | t              | is associative
(rs)t = r(st)                          concatenation is associative
r(s | t) = rs | rt ; (s | t)r = sr | tr  concatenation distributes over |
εr = rε = r                            ε is the identity element for concatenation
r* = (r | ε)*                          relation between * and ε
r** = r*                               * is idempotent
Table 2.2 Algebraic properties of regular expressions

Regular Set
A language that can be defined by a regular expression is called a regular set. If two regular expressions r and s denote the same regular set, we say they are equivalent and write r = s.
There are a number of algebraic laws for regular expressions that can be used to manipulate them into equivalent forms. For instance, r | s = s | r says that | is commutative, and r | (s | t) = (r | s) | t says that | is associative.
Regular Definitions
Giving names to regular expressions is referred to as a regular definition. If Σ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the form
d1 → r1
d2 → r2
...
dn → rn
where:
1. Each di is a distinct name.
2. Each ri is a regular expression over the alphabet Σ ∪ {d1, d2, ..., di-1}.
Example: Identifiers form the set of strings of letters and digits beginning with a letter. A regular definition for this set is:
letter → A | B | ... | Z | a | b | ... | z
digit → 0 | 1 | ... | 9
id → letter ( letter | digit )*
Notational Shorthands
Certain constructs occur so frequently in regular expressions that it is convenient to introduce notational shorthands for them.
1. One or more instances (+)
+ The unary postfix operator + means "one or more instances of".
+ If r is a regular expression that denotes the language L(r), then (r)+ is a regular expression that denotes the language (L(r))+.
+ Thus the regular expression a+ denotes the set of all strings of one or more a's.
+ The operator + has the same precedence and associativity as the operator *.
2. Zero or one instance (?)
+ The unary postfix operator ? means "zero or one instance of".
+ The notation r? is a shorthand for r | ε.
+ If 'r' is a regular expression, then (r)? is a regular expression that denotes the language L(r) ∪ {ε}.
3. Character Classes
+ The notation [abc], where a, b and c are alphabet symbols, denotes the regular expression a | b | c.
+ A character class such as [a-z] denotes the regular expression a | b | c | ... | z.
+ We can describe identifiers as being strings generated by the regular expression [A-Za-z][A-Za-z0-9]*; a small test of this pattern is sketched below.
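A small test of this pattern using the POSIX regular-expression routines regcomp and regexec; the anchors ^ and $ force the whole input to match:

#include <stdio.h>
#include <regex.h>

int main(void) {
    regex_t re;
    /* Compile the identifier pattern; REG_EXTENDED selects ERE syntax. */
    if (regcomp(&re, "^[A-Za-z][A-Za-z0-9]*$", REG_EXTENDED) != 0)
        return 1;
    const char *samples[] = { "rate", "x2", "2x", "_tmp" };
    for (int i = 0; i < 4; i++)
        printf("%-5s %s\n", samples[i],
               regexec(&re, samples[i], 0, NULL, 0) == 0
                   ? "identifier" : "no match");
    regfree(&re);
    return 0;
}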
Non-regular Sets
A language which cannot be described by any regular expression is a non-regular set. Example: the set of all strings of balanced parentheses and the set of repeating strings cannot be described by a regular expression. Such sets can be specified by a context-free grammar.
2.5 RECOGNITION OF TOKENS
For a programming language there are various types of tokens, such as identifiers, keywords, constants, operators and so on. A token is usually represented as a pair: token type and token value.
Fig. 2.6 Token representation
+ Token type: identifies the class of the token; for variables and constants it refers to the symbol table.
+ Token value: information about the token.
Consider the following grammar fragment:
stmt → if expr then stmt
     | if expr then stmt else stmt
     | ε
expr → term relop term
     | term
term → id
     | num
where the terminals if, then, else, relop, id and num generate sets of strings given by the following regular definitions:
if → if
then → then
else → else
relop → < | <= | = | <> | > | >=
id → letter ( letter | digit )*
num → digit+ ( . digit+ )? ( E ( + | - )? digit+ )?
For this language fragment the lexical analyzer will recognize the keywords if, then, else, as well as the lexemes denoted by relop, id, and num. To simplify matters, we assume keywords are reserved; that is, they cannot be used as identifiers. A hand-written recognizer for this fragment is sketched below.
keyw2.3.3 Expressing tokens by Regular Expressions
A regular expression is built out of ‘simpler regular expressions using
aset of definite rules. Each regular expression r denotes a language L(r).
i)
itl)
Rules that define the regular expressions
« isa regular expression denotes the language { ¢}
Ifa isasymbol in ¥, thenais the regular expression denotes the
language {a}.
Suppose r and s are regular expression denoting the languages L(t)
and L(s). Then
a) (r)| (s) is aregular expression denoting L(r) U L(s)
b) (x). (s) isa regular expression senoting L(r). L(s)
c) (x)* isaregular expression denoting (L(r))’
d) — (r) isa regular expression denoting L (r)
The precedence and associativity of operators are as follows.
i) Unary operator * has the highest precedence and is left
associative.
ii) | Concatenation operator has the second highest precedence
and is left associative.
i) — 'has the lowest precedence and is left associative.2.1, ROLE OF LEXICAL ANALYZER
Lexical Analyzer is also called as scanner. It is the first phase ofa
compiler. The main task of the lexical analyzer is to read the input characters
and produce as output a sequence of tokens to the parser used for syntax
analysis.
Synrat
tree
Fig.2.1 Lexical analyzer
Lexical analyzer reads the source program and removes comments,
white spaces in the form of blank, tab and new line characters. The lexical
analyzer keeps track of the number of new line characters in order to
associate the message with the line number.
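A minimal sketch of this bookkeeping, assuming only blanks, tabs and newlines are skipped (comment removal is omitted for brevity):

#include <stdio.h>

int lineno = 1;   /* current line number, used in error messages */

/* Skip white space, counting newlines so errors can cite a line. */
int next_nonblank(FILE *fp) {
    int c;
    while ((c = fgetc(fp)) == ' ' || c == '\t' || c == '\n')
        if (c == '\n') lineno++;
    return c;    /* first significant character, or EOF */
}

int main(void) {
    int c;
    while ((c = next_nonblank(stdin)) != EOF)
        printf("line %d: '%c'\n", lineno, c);
    return 0;
}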
2.1.1 Issues in Lexical Analysis
There are certain reasons for separating the analysis phase of
compiling into lexical analysis and parsing.
i) Tomake the design simpler. The separation of lexical analysis from
syntax analysis allows the other phases to be simpler. For eg. Parsing
adocument with comments and white spaces is more complex than
it is removed in the previous phase itself.ii) Toimprove the efficiency of the compiler. A separate lexical analyze,
allows us to construct an efficient processor. A large amount of time
is spent in reading the source program and partitioning it into tokens,
Specialized buffering techniques speed up the performance. 4
ii) | To enhance the compiler portability. Input alphabets and device
specific anomalies can be restricted to the lexical analyzer.
2.1.2 Tokens, Patterns, Lexemes
A token is an atomic unit which represents a logically cohesive sequence of characters such as an identifier, a keyword, an operator, a constant, a literal string, or a punctuation symbol such as a parenthesis, comma or semicolon.
e.g., rate -- identifier
      +    -- operator
      if   -- keyword
A pattern is a set of strings in the input for which the same token is produced as output. It is a rule describing the lexemes of a token.
A lexeme is a sequence of characters in the source program which is matched by the pattern for a token.
e.g., in the declaration float rate; the tokens are float and an identifier. The pattern for the token float is the character 'f' followed by 'l', 'o', 'a', 't'; the lexeme for the identifier token is rate.
In a grammar, tokens are considered as terminal symbols.
2.1.3 Attributes for Tokens
When more than one pattern matches a lexeme, the lexical analyzer must provide additional information about the particular lexeme that matched to the subsequent phases of the compiler.
For example, the pattern relop matches the operators <, <=, =, <>, >, >=; it is necessary to identify which operator was actually matched.
The lexical analyzer collects other information about tokens as their attributes. In practice a token has a single attribute: a pointer to the symbol-table entry in which the information about the token is kept.
For example, the tokens and attribute values for the statement
x = y * 10
are
<id, pointer to symbol-table entry for x>
<assign_op>
<id, pointer to symbol-table entry for y>
<mult_op>
<num, integer value 10>
For certain token-attribute pairs there is no need for an attribute value; for others, the compiler stores the character string that forms a value in a symbol table.
2.1.4 Lexical Errors
A lexical analyzer has a very localized view of a source program. The possible error-recovery actions are:
i) deleting an extraneous character
ii) inserting a missing character
iii) replacing an incorrect character by a correct character
iv) transposing two adjacent characters
For example, in
fi (a == b) ...
'fi' is a valid identifier, but the open parenthesis following the identifier may indicate that 'fi' is a misspelling of the keyword 'if' or an undeclared function identifier.

LEX
Lex is a computer program that generates lexical analyzers. Lex is commonly used with the yacc parser generator.
Creating a Lexical Analyzer
+ First, a specification of a lexical analyzer is prepared by creating a program lex.l in the Lex language. Then, lex.l is run through the Lex compiler to produce a C program lex.yy.c.
+ Finally, lex.yy.c is run through the C compiler to produce an object program a.out, which is the lexical analyzer that transforms an input stream into a sequence of tokens.

lex.l → Lex compiler → lex.yy.c
lex.yy.c → C compiler → a.out
input stream → a.out → sequence of tokens
Fig. 2.9 Creating a lexical analyzer with Lex
Lex Specification
A Lex program consists of three parts:
{ definitions }
%%
{ rules }
%%
{ user subroutines }
+ Definitions include declarations of variables, constants, and regular definitions.
+ Rules are statements of the form
p1 { action1 }
p2 { action2 }
...
pn { actionn }
where each pi is a regular expression and actioni describes what action the lexical analyzer should take when pattern pi matches a lexeme. Actions are written in C code.
+ User subroutines are auxiliary procedures needed by the actions. These can be compiled separately and loaded with the lexical analyzer.
The implementation of a Lex compiler can be based on either non-deterministic or deterministic automata. A small complete specification is sketched below.
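As a sketch of the three-part layout, here is a small hypothetical specification, count.l, that recognizes the keyword if, the identifiers and the integers from Section 2.5 and prints one line per token:

%{
/* definitions part: C declarations copied into lex.yy.c */
#include <stdio.h>
%}
digit   [0-9]
letter  [A-Za-z]
%%
if                          { printf("keyword IF\n"); }
{letter}({letter}|{digit})* { printf("ID %s\n", yytext); }
{digit}+                    { printf("NUM %s\n", yytext); }
[ \t\n]                     { /* skip white space */ }
.                           { printf("other: %s\n", yytext); }
%%
/* user subroutines part */
int main(void) { yylex(); return 0; }
int yywrap(void) { return 1; }

With a typical Lex implementation such as flex, this might be built with lex count.l followed by cc lex.yy.c; the resulting a.out reads an input stream and emits the token sequence.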