The document outlines the structure and function of a compiler, focusing on the role of the lexical analyzer (scanner) in processing source code into tokens for further analysis. It explains concepts such as regular expressions, finite automata, and the creation of a symbol table, which are essential for tokenization and syntax analysis. Additionally, it describes operations on strings and languages, providing examples of how tokens are defined and recognized within programming languages.

Uploaded by

Zooz 24
Copyright
© © All Rights Reserved

The Structure of a Compiler

[Figure: Source Program (Character Stream) → Scanner → Tokens → Parser → Syntactic Structure → Semantic Routines → Intermediate Representation → Optimizer → Code Generator → Target machine code. Symbol and Attribute Tables are used by all phases of the compiler.]

Scanner (Lexical Analyzer)

➢ The scanner begins the analysis of the source program by reading the input, character by character, and grouping characters into individual words and symbols (tokens).
• RE (Regular Expression)
• NFA (Non-deterministic Finite Automata)
• DFA (Deterministic Finite Automata)
• LEX


1
The Role of the Lexical Analyzer
➢ The lexical analyzer is the first phase of a compiler. A program or function which performs lexical analysis is called a lexical analyzer, lexer, or scanner.
➢ The main task is to read the input characters and produce as output a sequence of tokens that the parser uses for syntax analysis.

[Figure: source program → lexical analyzer → parser. The parser sends "get next token" requests to the lexical analyzer, which returns a token; both consult the symbol table.]

2
The Role of the Lexical Analyzer
The Secondary Tasks:
1. Eliminating the following from the source program:
   a. comments
   b. whitespace
      1. tab
      2. newline characters
   Example source fragment containing a comment, blanks and newlines:
      // global variables
      a=1 + 4;
      write ( a);
      write (a,
      a*2);

2. Correlating error messages from the compiler with the source program. It may keep
track of the number of newline characters seen, so that a line number can be
associated with an error message.
3. Making a copy of source program with errors marked (in some compilers)
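
The first two secondary tasks can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions that are not in the slides: comments are of the `//` kind only, no string literal contains `//`, and the helper name `strip_comments_and_whitespace` is made up for this example.

```python
def strip_comments_and_whitespace(source):
    """Yield (line_number, text) for every line that still has code on it.

    Assumptions of this sketch: only // comments exist and
    no string literal contains the characters //.
    """
    for line_number, line in enumerate(source.split("\n"), start=1):
        code = line.split("//", 1)[0]   # drop the comment part, if any
        code = " ".join(code.split())   # collapse tabs and runs of blanks
        if code:                        # skip lines that became empty
            yield line_number, code


source = "// global variables\na=1 + 4;\nwrite ( a);\nwrite (a,\n\ta*2);"
cleaned = list(strip_comments_and_whitespace(source))
for line_number, code in cleaned:
    print(line_number, code)
```

Keeping the original line number next to each cleaned line is what lets a later error message point back at the right place in the source program.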

3
What is a Language? What is going to be analysed
An alphabet (Σ) is a finite set of symbols. Example: {a, b, c}
A symbol is an element of an alphabet. Example: a
A word is a finite sequence of symbols drawn from the alphabet Σ. Example: abcaa
A language (over alphabet Σ) is a set of words. Example: {abcaa, abc, b, caa}
Σ* denotes the set of all words over the alphabet Σ.
|s| denotes the length of string s. Example: |UQU| = 3, since UQU is a string of length 3
ε denotes the word of length 0, the empty word.
∅ denotes the empty set { }. Note that ∅ differs from {ε}, the language whose only word is the empty word.
Note 1: In language theory the terms sentence and word are often used as synonyms for the term string.
Note 2: A language (over alphabet Σ) is a set of strings (over alphabet Σ).
For example: Σ = {a}; one possible language is L = {ε, a, aa, aaa}.

4
Operations on Strings: Terms for parts of a string
TERM: DEFINITION (examples use s = banana)

prefix of s: A string obtained by removing zero or more trailing symbols of string s.
   Examples: ε, b, ba, ban, ..., banana

suffix of s: A string formed by deleting zero or more of the leading symbols of s.
   Examples: ε, a, na, ana, ..., banana

substring of s: A string obtained by deleting a prefix and a suffix from s. Every prefix and every suffix of s is a substring of s, but not every substring of s is a prefix or a suffix of s.
   Examples: ε, b, a, n, ba, an, na, nan, ..., banana

subsequence of s: Any string formed by deleting zero or more not necessarily contiguous symbols from s.
   Examples: ε, b, a, n, ba, bn, an, aa, na, nn, ...
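
The four terms can be checked mechanically. The following Python sketch enumerates prefixes, suffixes and substrings and tests the subsequence relation; the function names are chosen for this illustration only.

```python
def prefixes(s):
    """All prefixes of s, shortest first (the first one is the empty string)."""
    return [s[:i] for i in range(len(s) + 1)]


def suffixes(s):
    """All suffixes of s, shortest first (the first one is the empty string)."""
    return [s[i:] for i in range(len(s), -1, -1)]


def substrings(s):
    """Set of all substrings: remove some prefix and some suffix."""
    return {s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)}


def is_subsequence(t, s):
    """True if t can be obtained from s by deleting symbols anywhere."""
    remaining = iter(s)
    # 'ch in remaining' advances the iterator, so the order of symbols is respected
    return all(ch in remaining for ch in t)


print(prefixes("banana"))
print(suffixes("banana"))
print(sorted(substrings("banana"), key=len)[:8])
print(is_subsequence("bn", "banana"), is_subsequence("nb", "banana"))
```

Note that `bn` is a subsequence of banana but not a substring, because the deleted symbol lies between b and n.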

5
Operations on Strings

Concatenation: Concatenation of words is denoted by juxtaposition.

If x and y are strings, then the concatenation of x and y is xy
e.g. If x = dog and y = house, then xy = doghouse

Concatenation is not commutative: in general xy ≠ yx (doghouse ≠ housedog)

Exponentiation
s^0 = ε
s^1 = s
s^2 = ss

6
Operations on Languages or Strings
Let us say we have two languages L and M. Then:
Union of L and M, L ∪ M:
$L \cup M = \{\, s \mid s \in L \text{ or } s \in M \,\}$
Concatenation of L and M, LM:
$LM = \{\, st \mid s \in L \text{ and } t \in M \,\}$
Kleene closure of L, L*:
$L^{*} = \bigcup_{i=0}^{\infty} L^{i}$
Positive closure of L, L+ (Kleene plus):
$L^{+} = \bigcup_{i=1}^{\infty} L^{i}$
7
Operations on Languages : Example
L is the set {A, B, . . ., Z, a, b, . . . , z}
D the set {0, 1, . . . , 9}
Since a symbol can be regarded as a string of length one, the sets L and D are each finite
languages. The following are some examples of new languages created from L and D
1. L ∪ D is the set of letters and digits.
2. LD is the set of strings consisting of a letter followed by a digit.
3. L^4 is the set of all four-letter strings.
4. L* is the set of all strings of letters, including ε, the empty string.
5. L(L ∪ D)* is the set of all strings of letters and digits beginning with a letter.
6. D+ is the set of all strings of one or more digits.
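
Since L and D are finite, the union, concatenation and powers in this example can be computed directly with Python sets. The closures L* and D+ are infinite, so the sketch below only builds a finite approximation up to a chosen power; `closure_up_to` and its `max_power` parameter are assumptions of this illustration.

```python
import string


def concat(L, M):
    """LM: every string of L followed by every string of M."""
    return {s + t for s in L for t in M}


def power(L, n):
    """L^n, with L^0 = {ε} represented by the empty Python string."""
    result = {""}
    for _ in range(n):
        result = concat(result, L)
    return result


def closure_up_to(L, max_power):
    """Finite slice of L*: the union of L^0, L^1, ..., L^max_power."""
    result = set()
    for i in range(max_power + 1):
        result |= power(L, i)
    return result


L = set(string.ascii_letters)   # {A, ..., Z, a, ..., z}
D = set(string.digits)          # {0, ..., 9}

letters_and_digits = L | D      # example 1: L ∪ D
letter_then_digit = concat(L, D)  # example 2: LD
short_digit_strings = closure_up_to(D, 2) - {""}  # strings of D+ up to length 2

print(len(letters_and_digits), len(letter_then_digit), len(short_digit_strings))
```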

8
➢ A token is a string of characters, categorized according to the rules as a symbol (e.g., Identifier, Number, Comma, and so on).
➢ The process of forming tokens from an input stream of characters is called tokenization, and the lexer categorizes them according to a symbol type.
➢ Tokenization is frequently defined by regular expressions, which are understood by a lexical analyzer generator such as lex.

➢ For each lexeme, the lexical analyzer produces as output a token of the form: (Token-name, Attribute-value)
➢ Token-name: a symbol that is used during syntax analysis
➢ Attribute-value: points to an entry in the symbol table for this token.

9
Pattern: a rule associated with a token that describes the set of strings (lexemes) the token can match
Lexeme: a string matched by the pattern of a token
Token: a name standing for a set of strings (the lexemes matched by its pattern)

TOKEN      SAMPLE LEXEMES         INFORMAL DESCRIPTION OF PATTERN
const      const                  const
if         if                     if
relation   <, <=, =, <>, >, >=    < or <= or = or <> or >= or >
id         pi, count, D2          letter followed by letters and digits
num        3.1416, 0, 6.02E23     any numeric constant
literal    “core dumped”          any characters between “ and ” except ”

➢ Symbol Table: a data structure used to store information about various source language constructs. The character string or lexeme forming an identifier is saved in a symbol table entry. Later phases of the compiler might add to this entry information such as the type of the identifier, its usage (variable or label) and its position in storage (address).

10
Attributes are used to distinguish different lexemes of the same token.
Example: E = M * C ** 2

Token (name, attribute)                        Lexeme
<id, pointer to symbol-table entry for E>      E
<assign_op, >                                  =
<id, pointer to symbol-table entry for M>      M
<mult_op, >                                    *
<id, pointer to symbol-table entry for C>      C
<exp_op, >                                     **
<num, integer value 2>                         2

Tokens affect syntax analysis; attributes affect semantic analysis.
11
Consider this expression in the C++ programming language:
Position = initial + rate * 60

Lexeme     Token type      Symbol
Position   Identifier      (id, 1)
=          Assignment OP   =
initial    Identifier      (id, 2)
+          Addition OP     +
rate       Identifier      (id, 3)
*          Multi OP        *
60         Integer         60

After lexical analysis the expression becomes:
(id, 1) (=) (id, 2) (+) (id, 3) (*) (60)
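
A minimal Python sketch of a scanner that produces this kind of token stream for the expression above. It is an illustration, not a full C++ lexer: it knows only identifiers, unsigned integers and the operators =, + and *, and the symbol table is a plain list whose positions (counted from 1) serve as the id attributes. All names in the code are assumptions of this example.

```python
import re

# One alternative per token class; leading whitespace is skipped.
TOKEN_PATTERN = re.compile(
    r"\s*(?:(?P<id>[A-Za-z][A-Za-z0-9]*)"
    r"|(?P<num>[0-9]+)"
    r"|(?P<op>[=+*]))"
)


def tokenize(text):
    """Return (tokens, symbol_table) for a small expression language."""
    symbol_table = []          # entry i-1 holds the lexeme of (id, i)
    tokens = []
    text = text.rstrip()
    pos = 0
    while pos < len(text):
        match = TOKEN_PATTERN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character at offset {pos}")
        pos = match.end()
        if match.lastgroup == "id":
            lexeme = match.group("id")
            if lexeme not in symbol_table:
                symbol_table.append(lexeme)      # first time seen: new entry
            tokens.append(("id", symbol_table.index(lexeme) + 1))
        elif match.lastgroup == "num":
            tokens.append(("num", int(match.group("num"))))
        else:
            tokens.append((match.group("op"), None))  # operators carry no attribute
    return tokens, symbol_table


tokens, symbol_table = tokenize("Position = initial + rate * 60")
print(tokens)
print(symbol_table)
```

Each identifier gets a symbol-table entry the first time it is seen, so a repeated identifier would map to the same (id, n) token.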

12
➢ The specification of a programming language often includes a set of rules which defines the lexer. These rules usually consist of regular expressions, and they define the set of possible character sequences that are used to form individual tokens or lexemes.

➢ Note that the lexer is usually based on a Finite-State Machine (FSM), obtained by applying:
   ➢ Regular Expressions
   ➢ Finite Automata (Deterministic, Non-deterministic)

➢ The FSM has encoded within it information on the possible sequences of characters that can be contained within any of the tokens it handles.

13
Describing Tokens using Regular Expression (1)
➢ We use regular expressions to describe programming language tokens.
➢ A regular expression is built up out of simpler regular expressions using a set of defining rules.
➢ A regular expression (RE) is defined inductively:
   a      an ordinary character stands for itself
   ε      the empty string
   R|S    either R or S (alternation), where R and S are REs
   RS     R followed by S (concatenation)
   R*     concatenation of R zero or more times
➢ A regular expression R describes a set of strings of characters denoted L(R).
➢ L(R) = the language defined by R:
   L(abc) = { abc }
   L(hello|goodbye) = { hello, goodbye }
   L(1(0|1)*) = all binary numbers that start with a 1
➢ Each token can be defined using a regular expression.
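
The three example languages can be tried out with Python's `re` module, whose syntax includes the basic operators used here (concatenation, `|`, `*` and parentheses). `re.fullmatch` is used so that the whole string must belong to L(R); the helper name `in_language` is made up for this sketch.

```python
import re


def in_language(regex, s):
    """True if the whole string s belongs to L(regex)."""
    return re.fullmatch(regex, s) is not None


print(in_language("abc", "abc"))                 # abc is the only member of L(abc)
print(in_language("hello|goodbye", "goodbye"))
print(in_language("1(0|1)*", "1010"))            # binary number starting with 1
print(in_language("1(0|1)*", "0101"))            # starts with 0, so not in the language
```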

14
Describing Tokens using Regular Expression (2)
➢ A language denoted by a regular expression is said to be a regular set.
➢ Unnecessary parentheses can be avoided in regular expressions if we adopt the
conventions that:

1. The unary operator * has the highest precedence and is left associative,
2. Concatenation has the second highest precedence and is left associative,
3. | has the lowest precedence and is left associative.

(a)|((b)*(c)) is equivalent to a | b*c

Both expressions denote the set of strings that are either a single a or zero or more b’s followed by one c.
15
Describing Tokens using Regular Expression (3)
Examples of regular expressions over the alphabet Σ = {a, b}:
a|b              {a, b}
(a | b)(a | b)   {aa, ab, ba, bb}
a*               {ε, a, aa, aaa, ... }
(a | b)*         The set of all strings of a’s and b’s
a | a*b          The set containing the string a and all strings consisting of zero or more a’s followed by a b
16
Describing Tokens using Regular Expression (5)
Notational Shorthands:
One or more instances
   (r)+ denotes (L(r))+
   r* = r+ | ε
   r+ = r r*
Zero or one instance
   r? = r | ε
Character classes
   [abc] = a | b | c
   [a-z] = a | b | ... | z
   [^a-z] = any character except [a-z]

18
Describing Tokens using Regular Expression (6)

Example numeric constants: 25, 12.55, 1.4E10

digit → 0 | 1 | … | 9
digits → digit+
op_f → ( . digits)?
op_e → ( E ( + | - ) ? digits )?
num → digits op_f op_e
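
The regular definition of num translates almost line by line into Python regular expressions. In this sketch each named definition becomes a string constant that is spliced into the next one; the constant names mirror the definition above, while `is_num` is a helper invented for the example.

```python
import re

DIGIT = r"[0-9]"                       # digit  → 0 | 1 | … | 9
DIGITS = rf"{DIGIT}+"                  # digits → digit+
OP_F = rf"(\.{DIGITS})?"               # op_f   → ( . digits )?
OP_E = rf"(E[+-]?{DIGITS})?"           # op_e   → ( E ( + | - )? digits )?
NUM = re.compile(rf"{DIGITS}{OP_F}{OP_E}")  # num → digits op_f op_e


def is_num(s):
    """True if the whole string s matches the num definition."""
    return NUM.fullmatch(s) is not None


for candidate in ["25", "12.55", "1.4E10", "3E-2", ".5", "1.", "1E"]:
    print(candidate, is_num(candidate))
```

Strings such as `.5` or `1.` are rejected because the definition requires digits on both sides of the decimal point.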

19
Given a regular expression R, a scanner generator produces a generated program P (a finite state machine).

Given a string S, the program P answers:
Yes, if S ∈ L(R)
No, if S ∉ L(R)

20
An FSM is a recognizer program for a language: it takes a string x as input and answers:
YES: if x is a sentence in the language
NO: if x is not a sentence in the language.

Types of FSM:
1. Nondeterministic Finite Automata (NFA)
2. Deterministic Finite Automata (DFA)

21
• A finite automaton can be drawn as a TRANSITION DIAGRAM.
• Positions in a transition diagram are drawn as circles and are called states.
• The states are connected by arrows, called transitions.
• One state is labeled the start state; it is the initial state of the transition diagram, where control resides when we begin to recognize a token.
• One or more states are labeled final states; control is in a final state when we stop recognizing a token.
[Legend: symbols for a state, a transition, the start state and a final state]

22
• Finite automaton (FA)
   • can be used to recognize the tokens specified by a regular expression

FA = (Q, Σ, s0, F, move)

• An FA consists of
   • A finite set of states Q
   • A set of input symbols Σ (the input symbol alphabet)
   • A set of transitions (or moves) from one state to another, labeled with characters in Σ
   • A special start state s0 (only one)
   • A set of final, or accepting, states F

23
• Example
• This machine accepts (abc+)+

[Figure: transition diagram for (abc+)+ with edges labeled a, b and c]

24
NFA for RE: (a | b)*abb
[Figure: start → 0 -a→ 1 -b→ 2 -b→ 3, with loops on state 0 labeled a and b]
States: Q = {0, 1, 2, 3}
Input symbols: Σ = {a, b}
Transition function: move(0,a) = {0,1}, move(0,b) = {0}, move(1,b) = {2}, move(2,b) = {3}
Start state: 0
Final states: {3}

Transition Table (rows: state, columns: input symbol)
State   a        b
0       {0, 1}   {0}
1       -        {2}
2       -        {3}
3       -        -
25
Running a token on the FA for (a | b)*abb
[Figure: start → 0 -a→ 1 -b→ 2 -b→ 3, with loops on state 0 labeled a and b]

abb: {0} -a→ {0, 1} -b→ {0, 2} -b→ {0, 3}

aabb: {0} -a→ {0, 1} -a→ {0, 1} -b→ {0, 2} -b→ {0, 3}
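
The two traces above can be reproduced by keeping track of the set of states the NFA could be in after each input symbol. The Python sketch below encodes the transition table of the (a | b)*abb NFA as a dictionary; the names `MOVE` and `run_nfa` are assumptions of this illustration.

```python
# Transition table of the NFA for (a | b)*abb; missing entries mean "no move".
MOVE = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
START, FINAL = 0, {3}


def run_nfa(text):
    """Return (accepted, trace) where trace lists the state set after each step."""
    current = {START}
    trace = [current]
    for symbol in text:
        # follow every possible move from every state we might be in
        current = {t for s in current for t in MOVE.get((s, symbol), set())}
        trace.append(current)
    return bool(current & FINAL), trace


for word in ["abb", "aabb", "ab"]:
    accepted, trace = run_nfa(word)
    print(word, accepted, trace)
```

The string is accepted when the final set contains state 3, which is exactly the "some path reaches a final state" rule for NFAs.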

26
An FA accepts an input string s if there is some path in the
transition diagram from the start state to some final state such
that the edge labels along this path spell out s
Alphabet = {a}
[Figure: NFA with states q0, q1, q2, q3 and transitions q0 -a→ q1, q1 -a→ q2, q0 -a→ q3; q2 is the accepting state. From q0 there are two choices on input a; q2 and q3 have no outgoing transition.]

Input aa:
First choice: q0 -a→ q1 -a→ q2. All input is consumed in an accepting state: “accept”.
Second choice: q0 -a→ q3. The remaining input cannot be consumed, so the automaton halts: “reject”.
aa is accepted by the NFA because the first computation accepts aa; the rejecting computation is ignored.

Input a:
First choice: q0 -a→ q1. All input is consumed, but q1 is not an accepting state: “reject”.
Second choice: q0 -a→ q3. All input is consumed, but q3 is not an accepting state: “reject”.

Another rejection example, input aaa:
First choice: q0 -a→ q1 -a→ q2. The third a cannot be consumed, so the automaton halts: “reject”.
Second choice: q0 -a→ q3. The remaining input cannot be consumed, so the automaton halts: “reject”.

Language accepted: L = {aa}

43
Lambda Transitions (λ, the empty transition, also written ε)
[Figure: NFA q0 -a→ q1 -λ→ q2 -a→ q3, with q3 the accepting state]
• Note: the λ symbol never appears on the input tape

Input aa:
q0 -a→ q1, then the λ-move q1 → q2 (the input tape head does not move), then q2 -a→ q3.
All input is consumed: “accept”. String aa is accepted.

Rejection example, input aaa:
q0 -a→ q1, λ-move to q2 (the read head doesn’t move), q2 -a→ q3.
The third a cannot be consumed, so the automaton halts: “reject”. String aaa is rejected.

Language accepted: L = {aa}

53
Another NFA Example
[Figure: NFA with states q0, q1, q2, q3; recoverable edge labels: q0 -a→ q1 -b→ q2 -λ→ q3]

Input ab: after reading a and b all input is consumed: “accept”.

Another string, abab: after reading a, b, a, b all input is consumed: “accept”.


[Figure: NFA M with states q0, q1, q2; edge labels 0, 1 and “0, 1”; q2 is marked as a redundant state]
Language accepted:
L(M) = {λ, 10, 1010, 101010, ...} = {10}*

66
• Simple automata:
M1: a single state q0 that is not accepting, L(M1) = {} (the empty language)
M2: a single state q0 that is accepting, L(M2) = {λ}


67

δ(q, x) = {q1, q2, ..., qk}
The resulting states q1, ..., qk are the states reached from q by following one transition with symbol x.
[Figure: state q with k outgoing edges labeled x, leading to q1, ..., qk]
68
Examples of the Transition Function δ
[Figure: NFA with states q0, q1, q2 and edge labels 0, 1 and “0, 1”]

δ(q0, 1) = {q1}
δ(q1, 0) = {q0, q2}
δ(q0, λ) = {q2}
δ(q2, 1) = ∅

72
Extended Transition Function δ*
Same as δ but applied on strings:
δ*(q0, a) = {q1}
δ*(q0, aa) = {q4, q5}
δ*(q0, ab) = {q2, q3, q0}
[Figure: NFA with states q0 to q5; recoverable edge labels: q0 -a→ q1 -b→ q2 -λ→ q3, plus two a-edges involving q4 and q5]
75
76
RE → (Thompson’s construction) → NFA → (Subset construction) → DFA

77
* We can construct an NFA from a regular expression
* Thompson’s construction algorithm
   1. Build the NFA inductively
   2. Define rules for each base RE
   3. Combine for more complex REs

[Figure: general machine for an expression E, with start state s and final state f]

78
[Figure: start → i -ε→ f] empty string transition

[Figure: start → i -a→ f] alphabet symbol transition

79
– Suppose N(s) and N(t) are NFAs for REs s and t
   • for s | t, construct
[Figure: new start state i with ε-transitions into N(s) and N(t), and ε-transitions from N(s) and N(t) to the new final state f]

Alternation: (E1 | E2)
[Figure: S -ε→ E1 -ε→ F and S -ε→ E2 -ε→ F]
• New start state S with ε-transitions to the start states of E1 and E2
• ε-transitions from the final/accepting states of E1 and E2 to the new final state F
80
– Suppose N(s) and N(t) are NFAs for REs s and t
   • for st, construct
[Figure: start → i, N(s) followed by N(t), ending in f]

Concatenation: (E1 E2)
[Figure: S -ε→ E1 -ε→ A -ε→ E2 -ε→ F]
• New start state S with an ε-transition to the start state of E1
• ε-transition from the final/accepting state of E1 to A, and an ε-transition from A to the start state of E2
• ε-transition from the final/accepting state of E2 to the new final state F
81
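   • for s*, construct
[Figure: start → i, N(s) and f connected by ε-transitions]

Closure: (E*)
[Figure: states S, A and F connected by ε-transitions, with E attached to A by ε-transitions]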

82
Develop an NFA for the RE: (x | y)*

First create the NFA for (x | y):
[Figure: A -ε→ B -x→ C -ε→ F and A -ε→ D -y→ E -ε→ F]

Then add in the closure operator:
[Figure: the NFA above extended with the new states S, G, H and additional ε-transitions]
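
The inductive rules can be sketched as a small recursive Python function. Several things here are assumptions of the illustration rather than part of the slides: the regular expression is given as a nested tuple instead of being parsed, ε is represented by `None`, concatenation links the two fragments with a single ε-edge (a common simplification of the S, A, F variant drawn above), and a tiny simulator is included only to check the result.

```python
from itertools import count

EPS = None  # label used for ε-transitions in this sketch


def thompson(node, edges, fresh):
    """Build an NFA fragment for node and return its (start, final) states.

    node is ("sym", c), ("cat", l, r), ("alt", l, r) or ("star", e);
    edges collects (source, label, target) triples; fresh yields new state ids.
    """
    kind = node[0]
    if kind == "sym":                      # base case: one labeled edge
        s, f = next(fresh), next(fresh)
        edges.append((s, node[1], f))
        return s, f
    if kind == "cat":                      # E1 E2: glue with an ε-edge
        s1, f1 = thompson(node[1], edges, fresh)
        s2, f2 = thompson(node[2], edges, fresh)
        edges.append((f1, EPS, s2))
        return s1, f2
    if kind == "alt":                      # E1 | E2: new start and final state
        s1, f1 = thompson(node[1], edges, fresh)
        s2, f2 = thompson(node[2], edges, fresh)
        s, f = next(fresh), next(fresh)
        edges.extend([(s, EPS, s1), (s, EPS, s2), (f1, EPS, f), (f2, EPS, f)])
        return s, f
    if kind == "star":                     # E*: skip edge and loop-back edge
        s1, f1 = thompson(node[1], edges, fresh)
        s, f = next(fresh), next(fresh)
        edges.extend([(s, EPS, s1), (f1, EPS, s1), (f1, EPS, f), (s, EPS, f)])
        return s, f
    raise ValueError(f"unknown node kind: {kind}")


def eps_closure(states, edges):
    """All states reachable from the given states using only ε-edges."""
    stack, seen = list(states), set(states)
    while stack:
        state = stack.pop()
        for src, label, dst in edges:
            if src == state and label is EPS and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen


def accepts(node, text):
    """Simulate the Thompson NFA of node on text."""
    edges = []
    start, final = thompson(node, edges, count())
    current = eps_closure({start}, edges)
    for ch in text:
        step = {dst for src, label, dst in edges if src in current and label == ch}
        current = eps_closure(step, edges)
    return final in current


x_or_y_star = ("star", ("alt", ("sym", "x"), ("sym", "y")))   # (x | y)*
print(accepts(x_or_y_star, ""), accepts(x_or_y_star, "xyx"), accepts(x_or_y_star, "xz"))
```

The state numbering produced by this sketch will not match the letters in the figure; only the accepted language is meant to agree.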
83
NFA for RE: aa* | bb*
[Figure: start → 0; 0 -ε→ 1 -a→ 2 with an a-loop on 2; 0 -ε→ 3 -b→ 4 with a b-loop on 4]
States: {0, 1, 2, 3, 4}
Input symbols: {a, b}
Transition function:
move(0, ε) = {1, 3}, move(1, a) = {2}, move(2, a) = {2}
move(3, b) = {4}, move(4, b) = {4}
Start state: 0
Final states: {2, 4}
84

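[Figure: NFA built by Thompson’s construction for (a | b)*abb, with states 0 to 10: start at 0, an a-edge 2 → 3 and a b-edge 4 → 5 joined by ε-edges for the (a | b)* part, followed by 7 -a→ 8 -b→ 9 -b→ 10]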


85
RE → (Thompson’s construction) → NFA → (Subset construction) → DFA

86
A DFA is a special case of an NFA in which
1. No state has an ε-transition
2. For each state q and input symbol a, there is at most one edge labeled a leaving q
Formal Definition of DFAs: M = (Q, Σ, δ, q0, F)
Q: set of states, e.g. {q0, q1, q2}
Σ: input alphabet, e.g. {a, b}
δ: transition function
q0: initial state
F: set of accepting states
87
DFA for RE: (a | b)*abb
[Figure: DFA with states 0, 1, 2, 3; start state 0; edges labeled a and b as listed below]
States: {0, 1, 2, 3}
Input symbols: {a, b}
Transition function:
move(0,a) = 1, move(1,a) = 1, move(2,a) = 1, move(3,a) = 1
move(0,b) = 0, move(1,b) = 2, move(2,b) = 3, move(3,b) = 0
Start state: 0
Final states: {3}
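
Because every (state, symbol) pair has exactly one target, this DFA can be run with a simple table lookup per input character. A minimal Python sketch follows, with the transition function above stored as a nested dictionary; the names `DFA` and `dfa_accepts` are assumptions of the example.

```python
# Transition table of the DFA for (a | b)*abb: DFA[state][symbol] -> next state
DFA = {
    0: {"a": 1, "b": 0},
    1: {"a": 1, "b": 2},
    2: {"a": 1, "b": 3},
    3: {"a": 1, "b": 0},
}
START, FINAL = 0, {3}


def dfa_accepts(text):
    """Run the table-driven DFA; symbols outside {a, b} cause rejection."""
    state = START
    for symbol in text:
        if symbol not in DFA[state]:
            return False
        state = DFA[state][symbol]      # exactly one move per symbol
    return state in FINAL


for word in ["abb", "aabb", "babb", "abba", ""]:
    print(repr(word), dfa_accepts(word))
```

Compared with the NFA simulation, no sets of states are needed at run time: one current state is enough.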
88
Finding NFA States
RE: aa* | b | ab
[Figure: NFA with start state 0 and states 1 to 5; edges labeled a, b and ε]
Transition table:
State   ε           a    b
0       {1, 2, 3}   -    -
1       -           4    -
2       -           -    5
3       -           2    -
4       -           4    -
5       -           -    -
89
Is there an NFA States

RE: aa* | b | ab
[Figure: automaton with start state 0 and states 1, 2, 3; edges labeled a and b]
Transition table:
State   a    b
0       1    3
1       2    3
2       2    -
3       -    -

90
DFA
* Action on each input is fully determined
* Implement using table-driven approach
* More states generally required to implement RE
NFA
* May have a choice at each step
* Accepts string if there is any path to an accepting state
* Not obvious how to implement this

91
A set of NFA states ≡ a DFA state: each DFA state stands for a set of NFA states
• Find the initial state of the DFA
• Find all the states in the DFA
• Construct the transition table
• Find the final states of the DFA
We can do that by removing every non-deterministic case
* Non-deterministic cases:
1- States with multiple outgoing edges on the same input
2- ε transitions

92
• Solving 1: Multiple transitions
   – Solve by subset construction
   – Build a new DFA based upon the power set of states of the NFA
   – move(S, a) is relabeled to target a new state whenever a single input goes to multiple states

Example: a+b*
[Figure: NFA with start state 1 and state 2; a-loop on 1, 1 -a→ 2, b-loop on 2]
(1,a) → 1 or 2, create new state 1/2
(1/2,a) → 1/2
(1/2,b) → 2
(2,a) → -
(2,b) → 2
Any state with “2” in its name is a final state
[Figure: resulting DFA start → 1 -a→ 1/2 -b→ 2, with an a-loop on 1/2 and a b-loop on 2]

94
• Solving 2: ε transitions
   – Any state reachable by an ε transition is “part of the state”
   – ε-closure: any state reachable from S by ε transitions is in the ε-closure; treat the ε-closure as 1 big state, and always include the ε-closure as part of the state

Example:
[Figure: NFA start → 1 -a→ 2 -ε→ 3, with an a-loop on 2 and a b-loop on 3]
ε-closure(2) = {2,3}
(1, a) → 2/3, create new state 2/3
(1, b) → -
(2/3, a) → 2/3
(2/3, b) → 3
(3, a) → -
(3, b) → 3
[Figure: resulting DFA start → 1 -a→ 2/3 -b→ 3, with an a-loop on 2/3 and a b-loop on 3]

95
Example: converting NFA M to a DFA
[Figure: NFA M with states q0, q1, q2: q0 -a→ q1 -λ→ q2, an a-loop on q1, and a b-edge from q2 back to q0]

The initial state of the DFA is {q0}.
δ*(q0, a) = {q1, q2}: add the DFA state {q1, q2} and the edge {q0} -a→ {q1, q2}.
δ*(q0, b) = ∅ (empty set): add the trap state ∅ and the edge {q0} -b→ ∅.
δ*(q1, a) = {q1, q2} and δ*(q2, a) = ∅; the union is {q1, q2}: add the edge {q1, q2} -a→ {q1, q2}.
δ*(q1, b) = {q0} and δ*(q2, b) = {q0}; the union is {q0}: add the edge {q1, q2} -b→ {q0}.
The trap state ∅ loops to itself on a and b.

END OF CONSTRUCTION
q1 ∈ F in the NFA, so {q1, q2} is an accepting state of the DFA.

102
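
The conversion of NFA M walked through above can be sketched as a subset construction in Python. The encoding is an assumption of this illustration: the NFA is a dictionary keyed by (state, symbol), λ-transitions use the empty string as their label, and DFA states are `frozenset` objects so that they can serve as dictionary keys.

```python
LAMBDA = ""  # label for λ-transitions (a convention of this sketch)

# NFA M: q0 -a-> q1, q1 -a-> q1, q1 -λ-> q2, q2 -b-> q0; q1 is accepting
NFA = {
    ("q0", "a"): {"q1"},
    ("q1", "a"): {"q1"},
    ("q1", LAMBDA): {"q2"},
    ("q2", "b"): {"q0"},
}
NFA_START, NFA_FINAL, ALPHABET = "q0", {"q1"}, ["a", "b"]


def closure(states):
    """λ-closure: add every state reachable through λ-transitions only."""
    stack, seen = list(states), set(states)
    while stack:
        for nxt in NFA.get((stack.pop(), LAMBDA), set()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return frozenset(seen)


def subset_construction():
    """Return (start, table, finals) of the equivalent DFA."""
    start = closure({NFA_START})
    table, todo = {}, [start]
    while todo:
        current = todo.pop()
        if current in table:            # this DFA state is already processed
            continue
        table[current] = {}
        for symbol in ALPHABET:
            moved = {t for s in current for t in NFA.get((s, symbol), set())}
            target = closure(moved)     # the empty frozenset acts as the trap state
            table[current][symbol] = target
            todo.append(target)
    finals = {state for state in table if state & NFA_FINAL}
    return start, table, finals


start, table, finals = subset_construction()
for state, row in table.items():
    print(sorted(state), {symbol: sorted(target) for symbol, target in row.items()})
```

The resulting table has three DFA states: {q0}, {q1, q2} and the empty set, which plays the role of the trap state.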
Example 2: Conversion NFA to DFA
[Figure: NFA with start state q0 and states q1, q2, q3; edges labeled 0, 1 and “0, 1”]

for {q0} call it A
δ(A,0) = {q1, q3} call it B
δ(A,1) = {q2, q3} call it C
δ(B,0) = {q1} call it D
δ(B,1) = {q3} call it E
δ(C,0) = {q3} it is E
δ(C,1) = {q2} call it F
δ(D,0) = {q1} it is D
δ(D,1) = {q3} it is E
δ(E,0) = ∅
δ(E,1) = ∅
δ(F,0) = {q3} it is E
δ(F,1) = {q2} it is F

Transition table:
States   0    1
A        B    C
B        D    E
C        E    F
D        D    E
E        ∅    ∅
F        E    F

[Figure: resulting DFA with states A, B, C, D, E, F and the trap state ∅]
103
• Prior to NFA to DFA conversion:
   • Empty cycle removal
      – Combine nodes that comprise the cycle
   • Empty transition removal
[Figure: NFA with states 1 to 4 and edges labeled a, c and ε, shown before and after the ε-cycle is removed]
104
• Resulting DFA can be quite large
   – Contains redundant or equivalent states
   – Find groups of equivalent states and merge them
[Figure: a DFA with states 1 to 5 and an equivalent DFA with states 1, 2, 3; edges labeled a and b]
Both DFAs accept b*ab*a
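
Finding groups of equivalent states can be sketched with partition refinement (often attributed to Moore). The example DFA below is hypothetical: the 5-state machine in the figure is not recoverable from the text, so this one was made up to accept the same language b*ab*a, with an explicit dead state added so that the transition function is total.

```python
def minimize(states, alphabet, delta, finals):
    """Return the groups of equivalent states of a DFA with total delta."""
    # start from the split into accepting and non-accepting states
    partition = [b for b in (set(finals), set(states) - set(finals)) if b]
    while True:
        index = {s: i for i, block in enumerate(partition) for s in block}
        groups = {}
        for s in states:
            # two states stay together only if they sit in the same block
            # and every symbol leads them into the same block
            signature = (index[s],) + tuple(index[delta[s][c]] for c in alphabet)
            groups.setdefault(signature, set()).add(s)
        if len(groups) == len(partition):   # nothing was split: done
            return list(groups.values())
        partition = list(groups.values())


# Hypothetical redundant DFA for b*ab*a (states 1..5 plus a dead state)
STATES = [1, 2, 3, 4, 5, "dead"]
DELTA = {
    1: {"a": 3, "b": 2},
    2: {"a": 3, "b": 2},
    3: {"a": 5, "b": 4},
    4: {"a": 5, "b": 4},
    5: {"a": "dead", "b": "dead"},
    "dead": {"a": "dead", "b": "dead"},
}
blocks = minimize(STATES, ["a", "b"], DELTA, finals={5})
print(blocks)
```

States 1 and 2 are merged, as are 3 and 4, which leaves the three live states of the smaller DFA plus the dead state.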
105
• Two programs were developed at Bell Labs in the mid 70’s
   – Lex: transducer, transforms an input stream into the alphabet of the grammar processed by yacc
     Flex = fast lex, a later free reimplementation of lex (written by Vern Paxson; it is not an FSF/GNU project, although it is commonly shipped with GNU systems)
   – Yacc: Yet Another Compiler-Compiler

• Input to lexer generator
   – List of regular expressions in priority order
   – Associated action with each RE
• Output
   – Program that reads the input stream and breaks it up into tokens according to the REs
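
The input/output contract of a lexer generator can be imitated in a few lines of Python. The sketch below is not lex itself: the rule list, the token names and the `scan` function are invented for illustration. It follows the disambiguation used by lex, where the longest match wins and ties go to the rule listed first.

```python
import re

# Hypothetical rule list in priority order: (regular expression, action)
RULES = [
    (r"if", lambda lexeme: ("if", None)),
    (r"[A-Za-z][A-Za-z0-9]*", lambda lexeme: ("id", lexeme)),
    (r"[0-9]+", lambda lexeme: ("num", int(lexeme))),
    (r"[ \t\n]+", lambda lexeme: None),   # whitespace: action returns no token
]
COMPILED = [(re.compile(pattern), action) for pattern, action in RULES]


def scan(text):
    """Break text into tokens using longest match, then rule priority."""
    tokens, pos = [], 0
    while pos < len(text):
        best_len, best_action = 0, None
        for regex, action in COMPILED:
            match = regex.match(text, pos)
            # only a strictly longer match replaces an earlier rule
            if match and match.end() - pos > best_len:
                best_len, best_action = match.end() - pos, action
        if best_action is None:
            raise ValueError(f"no rule matches at offset {pos}")
        token = best_action(text[pos:pos + best_len])
        if token is not None:
            tokens.append(token)
        pos += best_len
    return tokens


print(scan("if iffy 42"))
```

The word `if` matches both the keyword rule and the identifier rule with the same length, so the earlier keyword rule wins, while `iffy` is longer as an identifier and therefore becomes an id.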

106
PART 1: Convert Regular Language to RE
1. Write a regular expression for all strings of 0 and 1 which contain the substring 0110
2. Write a regular expression for all strings of Y and Z where every Z is immediately followed by at least 5 Z
3. Write a regular expression for all strings of P and Q which contain an odd number of Q

PART 2: Convert Regular Language to FSM
1. M1 = The set of strings that have exactly 3 b (and any number of a).
2. M2 = The set of strings where the number of B is a multiple of 3 (and there can be any number of A).

116
PART 3: Convert RE to NFA
1. M3 = Construct an FA that can read the RE: a (b | c) d (e | f) g (h | i)

PART 4: Convert NFA to RE
1. M4 = [figure not recovered]

117
PART 5: Convert NFA to DFA
1. M5 = [figure not recovered]

118
Have a terrific day

119
