
APJ ABDUL KALAM TECHNOLOGICAL UNIVERSITY
KTU STUDY MATERIALS

COMPILER DESIGN
CS304 Compiler Design / B.Tech / S6

MODULE 1

Introduction to compilers – Analysis of the source program, Phases of a compiler, Grouping of phases, compiler writing tools – bootstrapping.

Lexical Analysis: The role of Lexical Analyzer, Input Buffering, Specification of Tokens using Regular Expressions, Review of Finite Automata, Recognition of Tokens.

Ms. Anakha Satheesh P, Assistant Professor/CSE

1.1 INTRODUCTION TO COMPILERS


A compiler is a program that can read a program in one language (the source
language) and translate it into an equivalent program in another language (the target
language).

An important role of the compiler is to report any errors in the source program that it
detects during the translation process.

Source Program --> Compiler --> Target Program
                      |
                      v
               Error Messages

Fig: Compiler

Compilers are sometimes classified as single-pass, multi-pass, load-and-go, debugging, or optimizing, depending on how they have been constructed or on what function they are supposed to perform.

1.1.1 ANALYSIS OF THE SOURCE PROGRAM


In compiling, analysis consists of three phases:
 Lexical Analysis
 Syntax Analysis
 Semantic Analysis

Lexical Analysis
In a compiler, linear analysis is called lexical analysis or scanning. The lexical analysis phase reads the characters in the source program and groups them into tokens: sequences of characters having a collective meaning.

EXAMPLE

position := initial + rate * 60


This can be grouped into the following tokens:

1. The identifier position.

2. The assignment symbol :=

3. The identifier initial

4. The plus sign

5. The identifier rate

6. The multiplication sign

7. The number 60

Blanks separating characters of these tokens are normally eliminated during lexical
analysis.
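
A small Python sketch of this grouping (the token names ASSIGN, ID, NUM, PLUS, MUL are illustrative assumptions, not fixed by the notes):

import re

# Sketch: group the characters of the statement into the seven tokens above.
TOKEN_SPEC = [
    ("ASSIGN", r":="),                    # assignment symbol
    ("ID",     r"[A-Za-z][A-Za-z0-9]*"),  # identifiers
    ("NUM",    r"\d+"),                   # numbers
    ("PLUS",   r"\+"),                    # plus sign
    ("MUL",    r"\*"),                    # multiplication sign
    ("SKIP",   r"\s+"),                   # blanks, eliminated during scanning
]
MASTER = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC))

def tokenize(text):
    for m in MASTER.finditer(text):
        if m.lastgroup != "SKIP":         # whitespace is not returned
            yield (m.lastgroup, m.group())

print(list(tokenize("position := initial + rate * 60")))
# [('ID', 'position'), ('ASSIGN', ':='), ('ID', 'initial'), ('PLUS', '+'),
#  ('ID', 'rate'), ('MUL', '*'), ('NUM', '60')]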

Syntax Analysis

Hierarchical analysis is called parsing or syntax analysis.

It involves grouping the tokens of the source program into grammatical phrases that are used by the compiler to synthesize output. They are represented using a syntax tree.

A syntax tree is the tree generated as a result of syntax analysis, in which the interior nodes are the operators and the exterior nodes are the operands. This analysis reports an error when the syntax is incorrect.


Semantic Analysis
This phase checks the source program for semantic errors and gathers type information for the subsequent code-generation phase.

An important component of semantic analysis is type checking.

Here the compiler checks that each operator has operands that are permitted by the
source language specification.

1.1.2 PHASES OF A COMPILER


The phases include:

 Lexical Analysis
 Syntax Analysis
 Semantic Analysis
 Intermediate Code Generation
 Code Optimization
 Target Code Generation


Lexical Analysis
The first phase of a compiler is called lexical analysis or scanning.

The lexical analyzer reads the stream of characters making up the source program and
groups the characters into meaningful sequences called lexemes.

For each lexeme, the lexical analyzer produces as output a token of the form

<token-name, attribute-value>


that it passes on to the subsequent phase, syntax analysis.

In the token, the first component token-name is an abstract symbol that is used during syntax analysis, and the second component attribute-value points to an entry in the symbol table for this token.

Information from the symbol-table entry is needed for semantic analysis and code generation.

For example, suppose a source program contains the assignment statement


position = initial + rate * 60

The characters in this assignment could be grouped into the following lexemes and mapped into the following tokens passed on to the syntax analyzer:

1. position is a lexeme that would be mapped into a token <id, 1>, where id is an abstract symbol standing for identifier and 1 points to the symbol-table entry for position. The symbol-table entry for an identifier holds information about the identifier, such as its name and type.

2. The assignment symbol = is a lexeme that is mapped into the token <=>. Since this token needs no attribute-value, we have omitted the second component.

3. initial is a lexeme that is mapped into the token <id, 2>, where 2 points to the symbol-table entry for initial.

4. + is a lexeme that is mapped into the token <+>.


5. rate is a lexeme that is mapped into the token <id, 3>, where 3 points to the symbol-table entry for rate.

6. * is a lexeme that is mapped into the token <*>.

7. 60 is a lexeme that is mapped into the token <60>.

Blanks separating the lexemes would be discarded by the lexical analyzer. The representation of the assignment statement position = initial + rate * 60 after lexical analysis is the sequence of tokens:

<id, 1> <=> <id, 2> <+> <id, 3> <*> <60>


Token : Token is a sequence of characters that can be treated as a single logical entity. Typical
tokens are,

 Identifiers
 keywords
 operators
 special symbols
 constants

Pattern : A set of strings in the input for which the same token is produced as output. This
set of strings is described by a rule called a pattern associated with the token.

Lexeme : A lexeme is a sequence of characters in the source program that is matched by the
pattern for a token.


Syntax Analysis
The second phase of the compiler is syntax analysis or parsing.

The parser uses the first components of the tokens produced by the lexical analyzer to create a tree-like intermediate representation that depicts the grammatical structure of the token stream.

A typical representation is a syntax tree in which each interior node represents an operation and the children of the node represent the arguments of the operation.

The syntax tree for the above token stream is:
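
Sketched from the description that follows, the tree is:

                 =
               /   \
            id1     +
                  /   \
               id2     *
                     /   \
                  id3     60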


The tree has an interior node labeled * with <id, 3> as its left child and the integer 60 as its right child.

The node (id, 3) represents the identifier rate.

The node labeled * makes it explicit that we must first multiply the value of rate by 60.

The node labeled + indicates that we must add the result of this multiplication to the
value of initial.

The root of the tree, labeled =, indicates that we must store the result of this addition
into the location for the identifier position.

Semantic Analysis
The semantic analyzer uses the syntax tree and the information in the symbol table to
check the source program for semantic consistency with the language definition.

It also gathers type information and saves it in either the syntax tree or the symbol table, for subsequent use during intermediate-code generation.

An important part of semantic analysis is type checking, where the compiler checks that each operator has matching operands.

For example, many programming language definitions require an array index to be an integer; the compiler must report an error if a floating-point number is used to index an array.

Some sort of type conversion is also done by the semantic analyzer. For example, if an operator is applied to a floating-point number and an integer, the compiler may convert the integer into a floating-point number.

In our example, suppose that position, initial, and rate have been declared to be floating-point numbers, and that the lexeme 60 by itself forms an integer.

The semantic analyzer discovers that the operator * is applied to a floating-point number rate and an integer 60. In this case, the integer may be converted into a floating-point number.

In the following figure, notice that the output of the semantic analyzer has an extra node for the operator inttofloat, which explicitly converts its integer argument into a floating-point number.


Intermediate Code Generation


In the process of translating a source program into target code, a compiler may
construct one or more intermediate representations, which can have a variety of forms.

Syntax trees are a form of intermediate representation; they are commonly used
during syntax and semantic analysis.

After syntax and semantic analysis of the source program, many compilers generate
an explicit low-level or machine-like intermediate representation, which we can think
of as a program for an abstract machine.

This intermediate representation should have two important properties:

 It should be simple and easy to produce

 It should be easy to translate into the target machine.

In our example, the intermediate representation used is three-address code, which consists of a sequence of assembly-like instructions with three operands per instruction.

t1 = inttofloat(60)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3
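
As an illustration of how such code can be produced, the following Python sketch emits three-address instructions by a postorder walk of the syntax tree (the tuple encoding of tree nodes and the helper names are assumptions, not the notes' own scheme):

# Sketch: emit three-address code by a postorder walk of the syntax tree.
temp_count = 0
code = []

def new_temp():
    global temp_count
    temp_count += 1
    return f"t{temp_count}"

def gen(node):
    """Return the address holding node's value, emitting instructions as needed."""
    if isinstance(node, str):              # leaf: identifier or literal
        return node
    op, *args = node
    addrs = [gen(a) for a in args]         # children first (postorder)
    t = new_temp()
    if op == "inttofloat":                 # unary conversion operator
        code.append(f"{t} = inttofloat({addrs[0]})")
    else:                                  # binary arithmetic operator
        code.append(f"{t} = {addrs[0]} {op} {addrs[1]}")
    return t

rhs = gen(("+", "id2", ("*", "id3", ("inttofloat", "60"))))
code.append(f"id1 = {rhs}")
print("\n".join(code))                     # prints the four instructions above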
n

Code Optimization
The machine-independent code-optimization phase attempts to improve the intermediate code so that better target code will result.

The objectives for performing optimization are faster execution, shorter code, or target code that consumes less power.

In our example, the optimized code is:

t1 = id3 * 60.0
id1 = id2 + t1

Here the conversion of 60 to the floating-point value 60.0 can be done once at compile time, so the inttofloat operation is eliminated; and t3 is used only once, so its computation can be folded into the final assignment.

Code Generator
The code generator takes as input an intermediate representation of the source
program and maps it into the target language.

If the target language is machine code, registers or memory locations are selected for
each of the variables used by the program.

Then, the intermediate instructions are translated into sequences of machine instructions that perform the same task.


A crucial aspect of code generation is the judicious assignment of registers to hold variables.

If the target language is assembly language, this phase generates the assembly code as
its output.

In our example, the code generated is:

LDF  R2, id3
MULF R2, #60.0
LDF  R1, id2
ADDF R1, R2
STF  id1, R1

The first operand of each instruction specifies a destination.

The F in each instruction tells us that it deals with floating-point numbers.

The above code loads the contents of address id3 into register R2, then multiplies it with the floating-point constant 60.0.

The # signifies that 60.0 is to be treated as an immediate constant.

The third instruction moves id2 into register R1 and the fourth adds to it the value previously computed in register R2.

Finally, the value in register R1 is stored into the address of id1, so the code correctly implements the assignment statement position = initial + rate * 60.
Symbol Table

An essential function of a compiler is to record the variable names used in the source program and collect information about various attributes of each name.

These attributes may provide information about the storage allocated for a name, its type, its scope (where in the program its value may be used), and, in the case of procedure names, such things as the number and types of its arguments, the method of passing each argument (for example, by value or by reference), and the type returned.

The symbol table is a data structure containing a record for each variable name, with fields for the attributes of the name.

The data structure should be designed to allow the compiler to find the record for each name quickly and to store or retrieve data from that record quickly.
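
A minimal Python sketch of such a structure (the attribute fields shown are assumptions; a hash map provides the fast lookup the text calls for):

# Sketch: one record per name, with fields for the attributes of the name.
class SymbolTable:
    def __init__(self):
        self._table = {}                  # name -> record (a dict of attributes)

    def enter(self, name, **attrs):
        """Create or extend the record for a name."""
        self._table.setdefault(name, {}).update(attrs)

    def lookup(self, name):
        """Find the record for a name quickly, or None if absent."""
        return self._table.get(name)

st = SymbolTable()
st.enter("position", type="float", scope="global")
print(st.lookup("position"))              # {'type': 'float', 'scope': 'global'}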

Error Detection And Reporting


Each phase can encounter errors.

However, after detecting an error, a phase must somehow deal with that error, so that
compilation can proceed, allowing further errors in the source program to be detected.


A compiler that stops when it finds the first error is not a helpful one.

position = initial + rate * 60
        |
        v
LEXICAL ANALYZER
        |
        v
<id, 1> <=> <id, 2> <+> <id, 3> <*> <60>
        |
        v
SYNTAX ANALYZER
        |
        v
SEMANTIC ANALYZER
        |
        v
INTERMEDIATE CODE GENERATOR
        |
        v
t1 = inttofloat(60)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3
        |
        v
CODE OPTIMIZER
        |
        v
t1 = id3 * 60.0
id1 = id2 + t1
        |
        v
CODE GENERATOR
        |
        v
LDF  R2, id3
MULF R2, #60.0
LDF  R1, id2
ADDF R1, R2
STF  id1, R1

Figure: Translation of an assignment statement


1.1.3 GROUPING OF PHASES


The process of compilation is split up into the following phases:

 Analysis Phase
 Synthesis phase
Analysis Phase
The analysis phase performs four actions, namely:

a. Lexical analysis
b. Syntax Analysis
c. Semantic analysis
d. Intermediate Code Generation

The analysis part breaks up the source program into constituent pieces and imposes a grammatical structure on them.

It then uses this structure to create an intermediate representation of the source program.

If the analysis part detects that the source program is either syntactically ill formed or semantically unsound, then it must provide informative messages, so the user can take corrective action.

The analysis part also collects information about the source program and stores it in a data structure called a symbol table, which is passed along with the intermediate representation to the synthesis part.

Synthesis Phase
The synthesis phase performs two actions, namely:

a. Code Optimization
b. Code Generation


The synthesis part constructs the desired target program from the intermediate
representation and the information in the symbol table.

The analysis part is often called the front end of the compiler; the synthesis part is the
back end.

1.1.4 COMPILER WRITING TOOLS


Compiler writers use software development tools and more specialized tools for
implementing various phases of a compiler. Some commonly used compiler
construction tools include the following.

 Parser Generators
 Scanner Generators
 Syntax-directed translation engine
 Automatic code generators

 Data-flow analysis Engines
 Compiler Construction toolkits
Parser Generators

Input: Grammatical description of a programming language
Output: Syntax analyzers

These produce syntax analyzers, normally from input that is based on a context-free grammar.

In early compilers, syntax analysis consumed not only a large fraction of the running time of a compiler, but a large fraction of the intellectual effort of writing a compiler. With parser generators, this phase is now one of the easiest to implement.

Scanner Generators
Input: Regular expression description of the tokens of a language
Output: Lexical analyzers

These automatically generate lexical analyzers, normally from a specification based on regular expressions.

The basic organization of the resulting lexical analyzer is in effect a finite automaton.

Syntax-directed Translation Engines


Input : Parse tree.

Output : Intermediate code.

These produce collections of routines that walk the parse tree, generating intermediate
code.


The basic idea is that one or more "translations" are associated with each node of the
parse tree, and each translation is defined in terms of translations at its neighbour
nodes in the tree.

Automatic Code Generators


Input : Intermediate language.

Output : Machine language.

Such a tool takes a collection of rules that define the translation of each operation of
the intermediate language into the machine language for the target machine.

The rules must include sufficient detail that we can handle the different possible access
methods for data.

Data-flow Analysis Engines

A data-flow analysis engine gathers information about the values transmitted from one part of a program to each of the other parts.

Data-flow analysis is a key part of code optimization.


1.1.4.1 BOOTSTRAPPING

Bootstrapping is widely used in compiler development.


Bootstrapping is used to produce a self-hosting compiler: a compiler that can compile its own source code.

A bootstrap compiler is used to compile the compiler, and the compiled compiler can then be used to compile everything else, as well as future versions of itself.

A compiler is characterized by three languages:



 Source Language
 Target Language
 Implementation Language
Notation: SIT represents a compiler for source language S and target language T, implemented in language I. The T-diagram is used to depict the same compiler, with S on the left arm, T on the right arm, and the implementation language I at the base.

To create a new language, L, for machine A:

1. Create SAA, a compiler for a subset, S, of the desired language, L, written in language A, which runs on machine A. (Language A may be assembly language.)


2. Create LSA, a compiler for language L written in a subset of L (the subset S), producing code for machine A.

3. Compile LSA using SAA to obtain LAA, a compiler for language L which runs on machine A and produces code for machine A.

The process illustrated by the T-diagrams is called bootstrapping and can be summarized by the equation: compiling LSA with SAA yields LAA.


1.2 LEXICAL ANALYSIS


1.2.1 ROLE OF LEXICAL ANALYSIS
As the first phase of a compiler, the main task of the lexical analyzer is to read the
input characters of the source program, group them into lexemes, and produce as
output a sequence of tokens for each lexeme in the source program.

The stream of tokens is sent to the parser for syntax analysis.

Source Program --> LEXICAL ANALYZER --> Sequence of Tokens

Lexical Analyzer also interacts with the symbol table.

When the lexical analyzer discovers a lexeme constituting an identifier, it needs to enter that lexeme into the symbol table.

In some cases, information regarding the kind of identifier may be read from the symbol table by the lexical analyzer to assist it in determining the proper token it must pass to the parser.

These interactions are given in the following figure.

Commonly, the interaction is implemented by having the parser call the lexical analyzer.

The call, suggested by the getNextToken command, causes the lexical analyzer to read characters from its input until it can identify the next lexeme and produce for it the next token, which it returns to the parser.

Source Program --> Lexical Analyzer --token--> Parser --> to semantic analysis
                        ^       <--getNextToken--  ^
                        |                          |
                        +------ Symbol Table ------+

Figure: Interactions between lexical analyzer and parser
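
A Python sketch of this call pattern (get_next_token mirrors the getNextToken command of the figure; the class shapes are assumptions):

# Sketch: the parser drives the lexical analyzer one token at a time.
class Lexer:
    def __init__(self, tokens):
        self._tokens = iter(tokens)        # stand-in for real character scanning

    def get_next_token(self):              # the getNextToken command
        return next(self._tokens, ("EOF", None))

class Parser:
    def __init__(self, lexer):
        self.lexer = lexer

    def parse(self):
        token = self.lexer.get_next_token()
        while token[0] != "EOF":
            print("parser received", token)
            token = self.lexer.get_next_token()

Parser(Lexer([("id", "position"), ("assign", ":="), ("id", "initial")])).parse()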


Other tasks of Lexical Analyzer


1. Stripping out comments and whitespace (blank, newline, tab, and perhaps other
characters that are used to separate tokens in the input).

2. Correlating error messages generated by the compiler with the source program. For instance, the lexical analyzer may keep track of the number of newline characters seen, so it can associate a line number with each error message.

3. If the source program uses a macro-pre-processor, the expansion of macros may also
be performed by the lexical analyzer.

Issues In Lexical Analysis


Following are the reasons why lexical analysis is separated from syntax analysis

Simplicity Of Design

The separation of lexical analysis and syntactic analysis often allows us to simplify at least one of these tasks. The syntax analyzer can be smaller and cleaner by removing the low-level details of lexical analysis.
Efficiency

Compiler efficiency is improved. A separate lexical analyzer allows us to apply specialized techniques that serve only the lexical task, not the job of parsing. In addition, specialized buffering techniques for reading input characters can speed up the compiler significantly.
Portability

Compiler portability is enhanced. Input-device-specific peculiarities can be restricted to the lexical analyzer.

Attributes For Tokens


Sometimes a token needs to be associated with several pieces of information.

The most important example is the token id, where we need to associate with the token
a great deal of information.

Normally, information about an identifier - e.g., its lexeme, its type, and the location
at which it is first found (in case an error message about that identifier must be issued)
- is kept in the symbol table.

Thus, the appropriate attribute value for an identifier is a pointer to the symbol-table
entry for that identifier.

Lexical Errors
A character sequence that can’t be scanned into any valid token is a lexical error.


Suppose a situation arises in which the lexical analyzer is unable to proceed because
none of the patterns for tokens matches any prefix of the remaining input.

The simplest recovery strategy is "panic mode" recovery.

We delete successive characters from the remaining input, until the lexical analyzer
can find a well-formed token at the beginning of what input is left.

This recovery technique may confuse the parser, but in an interactive computing
environment it may be quite adequate.
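
A Python sketch of panic-mode recovery (the token patterns are illustrative assumptions):

import re

# Panic mode: delete successive characters until some token pattern
# matches at the beginning of what input is left.
TOKEN_RE = re.compile(r"[A-Za-z][A-Za-z0-9]*|\d+|[+*]|:=")

def panic_mode(remaining):
    while remaining and not TOKEN_RE.match(remaining):
        remaining = remaining[1:]          # delete one offending character
    return remaining

print(panic_mode("@#rate*60"))             # '@' and '#' deleted -> 'rate*60'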

Other possible error-recovery actions are:

1. Delete one character from the remaining input.

2. Insert a missing character into the remaining input.

3. Replace a character by another character.

4. Transpose two adjacent characters.
Transformations like these may be tried in an attempt to repair the input.
The simplest such strategy is to see whether a prefix of the remaining input can be transformed into a valid lexeme by a single transformation.

A more general correction strategy is to find the smallest number of transformations needed to convert the source program into one that consists only of valid lexemes, but this approach is considered too expensive in practice to be worth the effort.


n

1.2.2 INPUT BUFFERING


To ensure that the right lexeme is found, one or more characters have to be looked ahead beyond the next lexeme.

Hence a two-buffer scheme is introduced to handle large lookaheads safely.

Techniques for speeding up the lexical analyzer, such as the use of sentinels to mark the buffer end, have been adopted.

There are three general approaches for the implementation of a lexical analyzer:

a. By using a lexical-analyzer generator, such as the Lex compiler, to produce the lexical analyzer from a regular-expression-based specification. In this case, the generator provides routines for reading and buffering the input.

b. By writing the lexical analyzer in a conventional systems-programming


language, using I/O facilities of that language to read the input.
c. By writing the lexical analyzer in assembly language and explicitly managing
the reading of input.

The three choices are listed in order of increasing difficulty for the implementer.


Buffer Pairs
Because moving characters consumes a large amount of time, specialized buffering techniques have been developed to reduce the overhead required to process an input character.

The figure shows the buffer pairs which are used to hold the input data.

Figure: An input buffer in two halves
Scheme

The scheme consists of two buffers, each of N-character size, which are reloaded alternately.

N is the number of characters in one disk block.

N characters are read from the input file into the buffer using one system read command.

eof is inserted at the end if the number of characters read is less than N.


Pointers

Two pointers, lexemeBegin and forward, are maintained.

lexemeBegin points to the beginning of the current lexeme, which is yet to be found.

forward scans ahead until a match for a pattern is found.

Once a lexeme is found, lexemeBegin is set to the character immediately after the lexeme just found, and forward is set to the character at its right end.

The current lexeme is the set of characters between the two pointers.

if forward at end of first half then begin
    reload second half;
    forward := forward + 1
end
else if forward at end of second half then begin
    reload first half;
    move forward to beginning of first half
end
else forward := forward + 1;

Code to advance forward pointer
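
A simplified Python sketch of the buffer-pair scheme and the advance logic above (N, the class name, and the eof marker are illustrative assumptions):

import io

N = 4096                                   # characters in one disk block

class TwoBufferInput:
    def __init__(self, f):
        self.f = f
        self.halves = [self._read(), ""]   # second half loaded on demand
        self.half, self.pos = 0, -1        # forward pointer, before first char

    def _read(self):
        data = self.f.read(N)              # one system read command
        return data if len(data) == N else data + "\x00"   # eof if short read

    def advance(self):
        """Advance forward; reload the other half when a boundary is reached."""
        self.pos += 1
        if self.pos == len(self.halves[self.half]):        # end of this half
            other = 1 - self.half
            self.halves[other] = self._read()              # reload other half
            self.half, self.pos = other, 0
        return self.halves[self.half][self.pos]

buf = TwoBufferInput(io.StringIO("position := initial + rate * 60"))
ch = buf.advance()
while ch != "\x00":                        # stop at the eof marker
    ch = buf.advance()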


Disadvantages Of This Scheme


This scheme works well most of the time, but the amount of lookahead is limited.

This limited lookahead may make it impossible to recognize tokens in situations where
the distance that the forward pointer must travel is more than the length of the buffer.

e.g., DECLARE (ARG1, ARG2, . . . , ARGn) in a PL/I program.

The lexical analyzer cannot determine whether DECLARE is a keyword or an array name until it sees the character that follows the right parenthesis.

Sentinels
In the previous scheme, each time the forward pointer is moved, a check is done to ensure that one half of the buffer has not been moved off; if it has, then the other half must be reloaded.

Therefore the ends of the buffer halves require two tests for each advance of the forward pointer:

o Test 1: For end of buffer.

o Test 2: To determine what character is read.

The use of a sentinel reduces the two tests to one by extending each buffer half to hold a sentinel character at the end.

The sentinel is a special character that cannot be part of the source program. (The eof character is used as the sentinel.)

Figure: Sentinels at end of each buffer half

Advantages
Most of the time, it performs only one test to see whether the forward pointer points to an eof.

Only when it reaches the end of the buffer half or eof, it performs more tests.

Since N input characters are encountered between eofs, the average number of tests
per input character is very close to 1.


forward := forward + 1;
if forward↑ = eof then begin
    if forward at end of first half then begin
        reload second half;
        forward := forward + 1
    end
    else if forward at end of second half then begin
        reload first half;
        move forward to beginning of first half
    end
    else /* eof within a buffer signifying end of input */
        terminate lexical analysis
end

Lookahead code with sentinels
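
A Python sketch of the sentinel variant (the chunk-based loader stands in for disk reads; names are assumptions). Note that the common case performs only the single ch == EOF comparison:

EOF = "\x00"                               # sentinel: cannot appear in the source

class SentinelBuffer:
    def __init__(self, chunks):
        self.chunks = iter(chunks)         # stand-in for disk-block reads
        self.halves = [self._reload(), EOF]
        self.half, self.pos = 0, -1

    def _reload(self):
        return next(self.chunks, "") + EOF # a sentinel ends every half

    def advance(self):
        """Return the next character; one test per character in the common case."""
        self.pos += 1
        ch = self.halves[self.half][self.pos]
        if ch == EOF and self.pos == len(self.halves[self.half]) - 1:
            self.half = 1 - self.half
            self.halves[self.half] = self._reload()        # reload other half
            self.pos = 0
            ch = self.halves[self.half][self.pos]          # EOF here: true end
        return ch

buf = SentinelBuffer(["posi", "tion"])
s, c = "", buf.advance()
while c != EOF:
    s, c = s + c, buf.advance()
print(s)                                   # position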

1.2.3 SPECIFICATION OF TOKENS

There are 3 specifications of tokens:

 Strings
 Language
 Regular expression
Strings and Languages
An alphabet or character class is a finite set of symbols.

A string over an alphabet is a finite sequence of symbols drawn from that alphabet.

A language is any countable set of strings over some fixed alphabet.

In language theory, the terms "sentence" and "word" are often used as synonyms for "string." The length of a string s, usually written |s|, is the number of occurrences of symbols in s. For example, banana is a string of length six. The empty string, denoted ε, is the string of length zero.

Operations On Strings
The following string-related terms are commonly used:

Table: Terms for parts of a string


Operations On Languages:
The following are the operations that can be applied to languages:

 Union
 Concatenation
 Kleene closure
 Positive closure

Regular Expressions
Regular expressions allow us to define precisely the sets of strings that form tokens.

E.g., letter ( letter | digit )* defines a Pascal identifier: the identifier is formed by a letter followed by zero or more letters or digits.

A regular expression is built up out of simpler regular expressions using a set of defining rules.

Each regular expression r denotes a language L(r).

The Rules That Define Regular Expressions Over Alphabet Σ

(Associated with each rule is a specification of the language denoted by the regular
expression being defined)

1. ε is a regular expression that denotes {ε}, i.e. the set containing the empty string.
2. If a is a symbol in Σ, then a is a regular expression that denotes {a}, i.e. the set containing the string a.
3. Suppose r and s are regular expressions denoting the languages L(r) and L(s). Then

a. (r) | (s) is a regular expression denoting the language L(r) U L(s).

b. (r)(s) is a regular expression denoting the language L(r)L(s).

c. (r)* is a regular expression denoting the language (L(r))*.

d. (r) is a regular expression denoting the language L(r).



A language denoted by a regular expression is said to be a regular set.

The specification of a regular expression is an example of a recursive definition.

Rule (1) and (2) form the basis of the definition.

Rule (3) provides the inductive step.

Table: Algebraic properties of regular expressions

Regular Definition
If Σ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the form

d1 → r1
d2 → r2
...
dn → rn

where each di is a distinct name, and each ri is a regular expression over the symbols in Σ U {d1, d2, …, di-1}, i.e., the basic symbols and the previously defined names.

Example: Identifiers are the set of strings of letters and digits beginning with a letter. A regular definition for this set is:

letter → A|B|…|Z|a|b|…|z

digit → 0|1|…|9

id → letter ( letter | digit ) *
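
The definition can be tried out with Python's re module (a sketch; Python regex syntax stands in for the abstract notation):

import re

letter = "[A-Za-z]"
digit = "[0-9]"
ident = re.compile(f"{letter}({letter}|{digit})*")   # id = letter (letter|digit)*

for s in ["rate", "x1", "9lives"]:
    print(s, bool(ident.fullmatch(s)))     # rate True, x1 True, 9lives False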


Notational Shorthand
Certain constructs occur so frequently in regular expressions that it is convenient to introduce notational shorthands for them.

1. One or more instances (+)

The unary postfix operator + means "one or more instances of".

If r is a regular expression that denotes the language L(r), then (r)+ is a regular expression that denotes the language (L(r))+.

Thus the regular expression a+ denotes the set of all strings of one or more a’s.

The operator + has the same precedence and associativity as the operator *.

2. Zero or one instance ( ?)


The unary postfix operator ? means “zero or one instance of”.

The notation r? is a shorthand for r | ε.

If r is a regular expression, then (r)? is a regular expression that denotes the language L(r) ∪ {ε}.
3. Character Classes
The notation [abc], where a, b and c are alphabet symbols, denotes the regular expression a | b | c.

A character class such as [a-z] denotes the regular expression a | b | c | … | z.

We can describe identifiers as being strings generated by the regular expression [A-Za-z][A-Za-z0-9]*.
ra

Non-regular Set

A language which cannot be described by any regular expression is a non-regular set.


Example: The set of all strings of balanced parentheses and repeating strings cannot be
described by a regular expression. This set can be specified by a context-free grammar.

1.2.4 REVIEW OF FINITE AUTOMATA


Refer to the Theory of Computation (TOC) course for:

Finite Automata

1. Deterministic Finite automata

2. Non Deterministic Finite automata


NFA to DFA Conversion


1.2.5 RECOGNITION OF TOKENS


The question is how to recognize the tokens?

EXAMPLE

Assume the following grammar fragment to generate a specific language

stmt → if expr then stmt
     | if expr then stmt else stmt
     | ε
expr → term relop term | term
term → id | number

where the terminals if, then, else, relop, id and num generate sets of strings given by the following regular definitions.

if  if
then  then
else  else

om
rebop <|<=|< >|> |> =
id letter ( letter|digit )*
num digits optional-fraction optional-exponent
.c
where letter and digits are defined previously
es

For this language, the lexical analyzer will recognize the keywords if, then, and else, as well as lexemes that match the patterns for relop, id, and num.
ot

To simplify matters, we make the common assumption that keywords are also reserved words: that is, they cannot be used as identifiers.


num represents the unsigned integer and real numbers of Pascal.
In addition, we assume lexemes are separated by white space, consisting of nonnull sequences of blanks, tabs, and newlines.

Our lexical analyzer will strip out white space. It will do so by comparing a string
against the regular definition ws, below.

delim → blank | tab | newline
ws → delim+

If a match for ws is found, the lexical analyzer does not return a token to the parser.
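
A Python sketch of a recognizer for this fragment (patterns and token names are assumptions): keywords are treated as reserved words by reclassifying matched identifiers, and a ws match returns no token to the parser:

import re

RESERVED = {"if", "then", "else"}
PATTERNS = [
    ("ws",    r"[ \t\n]+"),                      # ws: blanks, tabs, newlines
    ("relop", r"<=|<>|>=|<|>|="),                # longer alternatives first
    ("num",   r"\d+(\.\d+)?(E[+-]?\d+)?"),       # digits, fraction, exponent
    ("id",    r"[A-Za-z][A-Za-z0-9]*"),
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in PATTERNS))

def tokens(text):
    for m in MASTER.finditer(text):
        kind, lexeme = m.lastgroup, m.group()
        if kind == "ws":
            continue                              # stripped, not returned
        if kind == "id" and lexeme in RESERVED:
            kind = lexeme                         # reserved word => keyword token
        yield (kind, lexeme)

print(list(tokens("if x1 <= 60 then y else z")))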

Transition Diagram
As an intermediate step in the construction of a lexical analyzer, we first produce a flowchart, called a transition diagram.

Transition diagrams depict the actions that take place when a lexical analyzer is called by the parser to get the next token.


The transition diagram keeps track of information about characters that are seen as the forward pointer scans the input. It does so by moving from position to position in the diagram as characters are read.

COMPONENTS OF TRANSITION DIAGRAM

1. One state is labelled the start state. It is the initial state of the transition diagram, where control resides when we begin to recognize a token.

2. Positions in a transition diagram are drawn as circles and are called states.

3. The states are connected by arrows called edges. Labels on edges indicate the input characters.

4. Accepting states are states in which a token has been found.

5. A * marks states at which one input character must be retracted (the forward pointer has moved one character beyond the lexeme).
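
The relop transition diagram can be rendered directly as code; a Python sketch (the LE/NE/... attribute names are assumptions, and the retract count mirrors the * states):

# Sketch: hand-coded scanner for relop. Returns (token, attribute, retract),
# where retract says how many lookahead characters to give back.
def relop(s, i):
    c = s[i] if i < len(s) else ""
    nxt = s[i + 1] if i + 1 < len(s) else ""
    if c == "<":
        if nxt == "=": return ("relop", "LE", 0)
        if nxt == ">": return ("relop", "NE", 0)
        return ("relop", "LT", 1)          # * state: retract one character
    if c == "=":
        return ("relop", "EQ", 0)
    if c == ">":
        if nxt == "=": return ("relop", "GE", 0)
        return ("relop", "GT", 1)          # * state: retract one character
    return None                            # not a relop: try another diagram

print(relop("<=", 0))                      # ('relop', 'LE', 0)
print(relop("<5", 0))                      # ('relop', 'LT', 1)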

Figure: Transition diagrams for the tokens described above
**********
