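[Figure: phases of compilation. Source Program (Character Stream) → Scanner → Tokens → Parser → Syntactic Structure → Semantic Routines → Intermediate Representation]
Scanner (Lexical Analyzer)
[Figure: interaction between the lexical analyzer and the parser. The source program feeds the lexical analyzer; the parser issues "get next token" and receives a token; both use the symbol table.]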
The Role of the Lexical Analyzer
The Secondary Tasks:
1. Eliminating the following from the source program:
a. comments, e.g. // global variables
b. whitespace, e.g. the blanks in a=1 + 4;
2. Correlating error messages from the compiler with the source program. It may keep
track of the number of newline characters seen, so that a line number can be
associated with an error message.
3. Making a copy of source program with errors marked (in some compilers)
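The first two secondary tasks can be sketched in a few lines of Python. This is only an illustration, not how any particular compiler does it: the function name significant_chunks is hypothetical, and only //-style comments are assumed.

```python
def significant_chunks(source):
    """Return (line_number, text) pairs with // comments and whitespace removed."""
    chunks = []
    line = 1
    i = 0
    while i < len(source):
        ch = source[i]
        if ch == "\n":
            line += 1              # count newlines so errors can cite a line number
            i += 1
        elif ch.isspace():
            i += 1                 # skip blanks and tabs
        elif source.startswith("//", i):
            while i < len(source) and source[i] != "\n":
                i += 1             # skip the rest of a line comment
        else:
            start = i
            while (i < len(source) and not source[i].isspace()
                   and not source.startswith("//", i)):
                i += 1
            chunks.append((line, source[start:i]))
    return chunks


program = "// global variables\na=1 + 4;\n"
print(significant_chunks(program))
```

Each surviving chunk carries the line it came from, which is the information needed to attach a line number to an error message.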
What is a Language? (What is going to be analysed)
An alphabet (Σ) is a finite set of symbols, e.g. {a, b, c}
A symbol is an element of an alphabet, e.g. a
A word is a finite sequence of symbols drawn from the alphabet Σ, e.g. abcaa
A language (over alphabet Σ) is a set of words, e.g. {abcaa, abc, b, caa}
Σ* denotes the set of all words over the alphabet Σ.
|s| denotes the length of string s, e.g. |UQU| = 3 (UQU is a string of length 3)
ε denotes the word of length 0, the empty word.
∅ denotes the empty set, {}. Note that ∅ is different from {ε}, the language containing only the empty word.
Note1: In language theory the terms sentence and word are often used as synonyms for the term string.
Note2: A language (over alphabet Σ) is a set of strings (over alphabet Σ).
For example: Σ = {a}; one possible language is L = {ε, a, aa, aaa}.
Operations on Strings: Terms for parts of a string
(examples use s = banana)
TERM               DEFINITION                                                   EXAMPLE
prefix of s        A string obtained by removing zero or more trailing          ε, b, ba, ban, ..., banana
                   symbols of string s
suffix of s        A string formed by deleting zero or more of the leading      ε, a, na, ana, ..., banana
                   symbols of s
substring of s     A string obtained by deleting a prefix and a suffix from s.  ε, b, a, n, ba, an, na, nan, ..., banana
                   Every prefix and every suffix of s is a substring of s, but
                   not every substring of s is a prefix or a suffix of s.
subsequence of s   Any string formed by deleting zero or more not necessarily   ε, b, a, n, ba, bn, an, aa, na, nn, ...
                   contiguous symbols from s
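The four terms can be checked mechanically. The sketch below is an illustration only: the function names are chosen here, and the empty Python string "" stands for ε.

```python
from itertools import combinations


def prefixes(s):
    """All strings obtained by removing zero or more trailing symbols."""
    return {s[:i] for i in range(len(s) + 1)}


def suffixes(s):
    """All strings obtained by removing zero or more leading symbols."""
    return {s[i:] for i in range(len(s) + 1)}


def substrings(s):
    """All strings obtained by deleting a prefix and a suffix."""
    return {s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)}


def subsequences(s):
    """All strings obtained by deleting zero or more (not necessarily contiguous) symbols."""
    return {"".join(s[k] for k in positions)
            for r in range(len(s) + 1)
            for positions in combinations(range(len(s)), r)}


s = "banana"
print(sorted(prefixes(s), key=len))
print("bn" in subsequences(s), "bn" in substrings(s))
```

For banana, bn is a subsequence but not a substring, which is exactly the distinction the table draws.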
Operations on Languages or Strings
Let us say we have two languages L and M. Then:
Union of L and M, L ∪ M:
L ∪ M = { s | s ∈ L or s ∈ M }
Concatenation of L and M, LM:
LM = { st | s ∈ L and t ∈ M }
Kleene closure of L, L*:
L* = ⋃_{i=0}^{∞} L^i
Positive closure of L, L+ (Kleene Plus):
L+ = ⋃_{i=1}^{∞} L^i
Operations on Languages : Example
L is the set {A, B, . . ., Z, a, b, . . . , z}
D the set {0, 1, . . . , 9}
Since a symbol can be regarded as a string of length one, the sets L and D are each finite
languages. The following are some examples of new languages created from L and D
1. L ∪ D is the set of letters and digits.
2. LD is the set of strings consisting of a letter followed by a digit.
3. L^4 is the set of all four-letter strings.
4. L* is the set of all strings of letters, including ε, the empty string.
5. L(L ∪ D)* is the set of all strings of letters and digits beginning with a letter.
6. D+ is the set of all strings of one or more digits.
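These operations can be tried out on finite sets of Python strings. The sketch below makes two assumptions so that it stays small: L and D are cut down to two symbols each, and the infinite closures are truncated at a maximum power (the helper names are hypothetical).

```python
def concat(L, M):
    """LM = { st | s in L and t in M }."""
    return {s + t for s in L for t in M}


def power(L, n):
    """L^n, with L^0 = {ε} (ε is the empty Python string)."""
    result = {""}
    for _ in range(n):
        result = concat(result, L)
    return result


def closure_up_to(L, max_power):
    """Finite approximation of L*: the union of L^0 .. L^max_power."""
    result = set()
    for i in range(max_power + 1):
        result |= power(L, i)
    return result


def positive_closure_up_to(L, max_power):
    """Finite approximation of L+: the union of L^1 .. L^max_power."""
    result = set()
    for i in range(1, max_power + 1):
        result |= power(L, i)
    return result


L = {"a", "b"}   # reduced stand-in for the letters
D = {"0", "1"}   # reduced stand-in for the digits

print(sorted(L | D))                          # L ∪ D
print(sorted(concat(L, D)))                   # LD
print(len(power(L, 4)))                       # number of strings in L^4
print(sorted(positive_closure_up_to(D, 2)))   # part of D+
```

With these pieces, L(L ∪ D)* is approximated by concat(L, closure_up_to(L | D, k)) for some bound k.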
➢ A token is a string of characters, categorized according to the rules as a
symbol (e.g., Identifier, Number, Comma, and so on).
➢ The process of forming tokens from an input stream of characters is called
tokenization, and the lexer categorizes them according to a symbol type.
➢ Tokenization is frequently defined by regular expressions, which are
understood by a lexical analyzer generator such as lex.
➢ For each lexeme, the lexical analyzer outputs a token of the form:
(Token-name, Attribute-value)
➢ Token-name: the symbol that is used during Syntax Analysis
➢ Attribute-value: points to an entry in the symbol table for this token.
Pattern: a rule associated with a token that describes the set of strings
Lexeme: a string matched by the pattern of a token
Token: a set of strings
TOKEN      SAMPLE LEXEMES           INFORMAL DESCRIPTION OF PATTERN
const      const                    const
if         if                       if
relation   <, <=, =, <>, >, >=      < or <= or = or <> or >= or >
id         pi, count, D2            letter followed by letters and digits
num        3.1416, 0, 6.02E23       any numeric constant
literal    "core dumped"            any characters between " and " except "
Attributes are used to distinguish different lexemes of the same token.
For the statement E = M * C ** 2 the tokens and attribute values are:
<id, pointer to symbol-table entry for E>      (lexeme E)
<assign_op, >                                  (lexeme =)
<id, pointer to symbol-table entry for M>      (lexeme M)
<mult_op, >                                    (lexeme *)
<id, pointer to symbol-table entry for C>      (lexeme C)
<exp_op, >                                     (lexeme **)
<num, integer value 2>                         (lexeme 2)
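A minimal sketch of a lexer that produces this token stream is given below. It assumes Python's re module as the pattern language, uses a plain list as a stand-in symbol table (the list index plays the role of the pointer), and covers only the token kinds of this one statement; none of these choices come from the slides.

```python
import re

# Order matters: "**" must be tried before "*".
TOKEN_SPEC = [
    ("num",       r"\d+"),
    ("id",        r"[A-Za-z][A-Za-z0-9]*"),
    ("exp_op",    r"\*\*"),
    ("mult_op",   r"\*"),
    ("assign_op", r"="),
    ("skip",      r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(source):
    """Return (tokens, symbol_table) for the given source text."""
    symbol_table = []   # identifier lexemes; the index acts as the "pointer"
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at position {pos}")
        kind, lexeme = match.lastgroup, match.group()
        pos = match.end()
        if kind == "skip":
            continue                      # whitespace produces no token
        if kind == "id":
            if lexeme not in symbol_table:
                symbol_table.append(lexeme)
            tokens.append((kind, symbol_table.index(lexeme)))
        elif kind == "num":
            tokens.append((kind, int(lexeme)))
        else:
            tokens.append((kind, None))   # operators carry no attribute
    return tokens, symbol_table


tokens, table = tokenize("E = M * C ** 2")
for token in tokens:
    print(token)
print(table)
```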
➢ The specification of a programming language often includes a set of rules which defines the lexer.
These rules usually consist of regular expressions, and they define the set of possible
character sequences that are used to form individual tokens or lexemes.
➢ The lexer therefore encodes information on the possible sequences of characters that can be
contained within any of the tokens it handles.
Describing Tokens using Regular Expression (1)
➢ We use regular expressions to describe programming language tokens.
➢ A regular expression is built up out of simpler regular expressions using a set of defining rules.
➢ A regular expression (RE) is defined inductively:
a      an ordinary character stands for itself
ε      the empty string
R|S    either R or S (alternation), where R and S are REs
RS     R followed by S (concatenation)
R*     concatenation of R zero or more times
➢ A regular expression R describes a set of strings of characters denoted L(R).
➢ L(R) = the language defined by R:
L(abc) = { abc }
L(hello|goodbye) = { hello, goodbye }
L(1(0|1)*) = all binary numbers that start with a 1
➢ Each token can be defined using a regular expression.
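The three example languages can be tested for membership with Python's re module, whose syntax happens to include the operators |, concatenation and * used above. This is a quick check under that assumption, not a definition of L(R); the helper name in_language is hypothetical.

```python
import re


def in_language(regex, word):
    """True if the whole word is matched by regex, i.e. word is in L(regex)."""
    return re.fullmatch(regex, word) is not None


print(in_language("abc", "abc"))
print(in_language("hello|goodbye", "goodbye"))
print(in_language("1(0|1)*", "1011"))
print(in_language("1(0|1)*", "011"))
```

re.fullmatch is used rather than re.match so that the entire string, not just a prefix, has to belong to the language.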
Describing Tokens using Regular Expression (2)
➢ A language denoted by a regular expression is said to be a regular set.
➢ Unnecessary parentheses can be avoided in regular expressions if we adopt the
conventions that:
1. The unary operator * has the highest precedence and is left associative,
2. Concatenation has the second highest precedence and is left associative,
3. | has the lowest precedence and is left associative.
Under these conventions, (a)|((b)*(c)) can be written as a|b*c.
Describing Tokens using Regular Expression (6)
Example lexemes for num: 25, 12.55, 1.4E10
digit → 0 | 1 | … | 9
digits → digit+
op_f → ( . digits)?
op_e → ( E ( + | - ) ? digits )?
num → digits op_f op_e
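The regular definition above translates almost line by line into Python's re syntax. The sketch below assumes that translation (the constant names mirror the definition; is_num is a hypothetical helper).

```python
import re

DIGIT = r"[0-9]"                      # digit  → 0 | 1 | … | 9
DIGITS = rf"{DIGIT}+"                 # digits → digit+
OP_F = rf"(\.{DIGITS})?"              # op_f   → ( . digits )?
OP_E = rf"(E[+-]?{DIGITS})?"          # op_e   → ( E ( + | - )? digits )?
NUM = re.compile(rf"{DIGITS}{OP_F}{OP_E}")   # num → digits op_f op_e


def is_num(lexeme):
    """True if the whole lexeme matches the num pattern."""
    return NUM.fullmatch(lexeme) is not None


for lexeme in ["25", "12.55", "1.4E10", "1.", ".5", "1E"]:
    print(lexeme, is_num(lexeme))
```

Note that the definition requires digits on both sides of the decimal point, so 1. and .5 are not matched.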
Given a regular expression R, a Scanner Generator produces a Generated Scanner Program P (a Finite State Machine).
Given a string S, P answers:
Yes: if S ∈ L(R)
No: if S ∉ L(R)
An FSM is a recognizer program for a language: it takes a string x as
input and answers:
YES: if x is a sentence in the language
NO: if x is not a sentence in the language.
Types of FSM:
1. Nondeterministic Finite Automata (NFA)
2. Deterministic Finite Automata (DFA)
• A finite automaton can be drawn as a TRANSITION DIAGRAM.
• Positions in a transition diagram are drawn as circles and are called states.
• The states are connected by arrows, called transitions.
• One state is labeled the start state; it is the initial state of the transition
diagram, where control resides when we begin to recognize a token.
• One or more states are labeled final states; control resides in one of them
when we stop recognizing a token.
[Figure: legend showing the symbols for a state, a transition, the start state and a final state]
• A finite automaton (FA) can be used to recognize the tokens specified by a regular expression.
• Example: this machine accepts (abc+)+
[Figure: transition diagram of a machine accepting (abc+)+]
RE: (a | b)*abb
[Figure: NFA transition diagram with states 0, 1, 2, 3; state 0 loops on a and b; 0 goes to 1 on a, 1 to 2 on b, 2 to 3 on b]
States: {0, 1, 2, 3}
Input symbols: {a, b}
Transition function: move(0,a) = {0,1}, move(0,b) = {0}, move(1,b) = {2}, move(2,b) = {3}
Start state: 0
Final states: {3}
Transition Table
State   a        b
0       {0, 1}   {0}
1       -        {2}
2       -        {3}
3       -        -
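The transition table can be run directly by tracking the set of states the NFA could be in after each input symbol. The following is a small sketch of that idea; the dictionary layout and the name accepts are choices made here, not part of the slides.

```python
# move(state, symbol) -> set of next states, copied from the transition table
MOVE = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
START = 0
FINAL = {3}


def accepts(word):
    """Simulate the NFA for (a | b)*abb on word."""
    current = {START}
    for symbol in word:
        # follow every possible transition from every current state
        current = set().union(*(MOVE.get((q, symbol), set()) for q in current))
    return bool(current & FINAL)


for word in ["abb", "aabb", "babb", "ab", "abba"]:
    print(word, accepts(word))
```

The word is accepted when at least one of the states reached at the end is a final state.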
Running a token on the FA
[Figure: the NFA for (a | b)*abb with states 0, 1, 2, 3 processing an input string]
An FA accepts an input string s if there is some path in the
transition diagram from the start state to some final state such
that the edge labels along this path spell out s
Example: Alphabet = {a}
[Figure: NFA with states q0, q1, q2, q3. On a, q0 has two choices: q1 or q3. On a, q1 goes to q2. q2 and q3 have no transition.]
Input aa, first choice: q0 → q1 → q2. All input is consumed: "accept".
Input aa, second choice: q0 → q3. The second a cannot be consumed, so the automaton halts: "reject".
aa is accepted by the NFA because the first computation accepts aa; the second (rejecting) computation is ignored.
Rejection example: input a
First choice: q0 → q1. "reject"
Second choice: q0 → q3. "reject"
Another Rejection example: input aaa
First choice: q0 → q1 → q2. The third a cannot be consumed, so the automaton halts: "reject".
Second choice: q0 → q3. The remaining input cannot be consumed, so the automaton halts: "reject".
Language accepted: L = {aa}
Lambda (λ) Transitions, or empty transitions
[Figure: NFA q0 → q1 on a, q1 → q2 on λ, q2 → q3 on a]
Input aa: q0 reads a and moves to q1; the λ-transition moves to q2 while the input tape head does not move; q2 reads a and moves to q3.
All input is consumed: "accept". String aa is accepted.
• Note: the symbol λ never appears on the input tape.
Rejection Example: input aaa
q0 reads a and moves to q1; the λ-transition moves to q2 (read head doesn't move); q2 reads a and moves to q3.
The third a cannot be consumed, so the automaton halts: "reject".
Another NFA Example
[Figure: NFA with states q0, q1, q2, q3; q0 goes to q1 on a and q1 goes to q2 on b; the remaining transitions are not recoverable from the extracted text]
Input ab: "accept"
Another String, input abab: "accept"
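[Figure: NFA with states q0, q1, q2 over the alphabet {0, 1}]
Language accepted
[Figure: machines M1 and M2, each with start state q0]
Transition function: δ(q, x) = {q1, q2, ..., qk}, the resulting states reached from q by following one transition with symbol x.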
Example of Transition Function
[Figure: NFA with states q0, q1, q2 over the alphabet {0, 1}]
δ(q0, 1) = {q1}
δ(q1, 0) = {q0, q2}
δ(q0, λ) = {q2}
δ(q2, 1) = ∅
Extended Transition Function δ*
Same as δ, but applied to strings.
[Figure: NFA with states q0 to q5]
δ*(q0, a) = {q1}
δ*(q0, aa) = {q4, q5}
δ*(q0, ab) = {q2, q3, q0}
RE → (Thompson's construction) → NFA → (Subset construction) → DFA
* We can construct an NFA from a regular expression
* Thompson’s construction algorithm
1. Build the NFA inductively
2. Define rules for each base RE
3. Combine for more complex RE’s
[Figure: general machine for a regular expression E, with start state s and final state f]
[Figure: empty string transition: start → i, then i → f on ε]
[Figure: alphabet symbol transition: start → i, then i → f on a]
– Suppose N(s) and N(t) are NFAs for REs s and t
• for s | t, construct
[Figure: Alternation (E1 | E2): start state S with ε-transitions into E1 and E2, and ε-transitions from E1 and E2 into final state F]
• New start state S with ε-transitions to the start states of E1 and E2
• ε-transitions from the final/accepting states of E1 and E2 to the new final state F
– Suppose N(s) and N(t) are NFAs for REs s and t
• for st, construct
[Figure: Concatenation (E1 E2): S → E1 → A → E2 → F, all connected by ε-transitions]
• New start state S with an ε-transition to the start state of E1
• ε-transition from the final/accepting state of E1 to A, and an ε-transition from A
to the start state of E2
• ε-transition from the final/accepting state of E2 to the new final state F
• for s*, construct
[Figure: Closure (E*): states S, A and F connected to the machine E by ε-transitions]
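The inductive rules can be sketched as small Python functions that each return an NFA with one start and one final state. This is a hedged illustration, not the exact machines drawn on the slides: alternation follows the S/F rule above, closure uses the standard Thompson wiring, and concatenation is simplified to a single ε edge instead of adding the extra states S, A and F. All names (NFA, symbol, alternation, concatenation, closure, accepts) are hypothetical.

```python
from itertools import count

EPS = None          # label used for ε-transitions
_ids = count()      # supplies fresh state numbers


class NFA:
    def __init__(self, start, final, edges):
        self.start, self.final, self.edges = start, final, edges  # edges: (src, label, dst)


def symbol(a):
    i, f = next(_ids), next(_ids)
    return NFA(i, f, [(i, a, f)])


def alternation(n1, n2):
    s, f = next(_ids), next(_ids)
    return NFA(s, f, n1.edges + n2.edges + [
        (s, EPS, n1.start), (s, EPS, n2.start),
        (n1.final, EPS, f), (n2.final, EPS, f)])


def concatenation(n1, n2):
    # simplified: one ε edge from the end of n1 to the start of n2
    return NFA(n1.start, n2.final, n1.edges + n2.edges + [(n1.final, EPS, n2.start)])


def closure(n):
    s, f = next(_ids), next(_ids)
    return NFA(s, f, n.edges + [
        (s, EPS, n.start), (n.final, EPS, f),
        (s, EPS, f),                 # zero repetitions
        (n.final, EPS, n.start)])    # one more repetition


def accepts(nfa, word):
    """Simulate the NFA by tracking ε-closed sets of states."""
    def eps_closure(states):
        closed, stack = set(states), list(states)
        while stack:
            q = stack.pop()
            for src, label, dst in nfa.edges:
                if src == q and label is EPS and dst not in closed:
                    closed.add(dst)
                    stack.append(dst)
        return closed

    current = eps_closure({nfa.start})
    for ch in word:
        current = eps_closure({dst for src, label, dst in nfa.edges
                               if src in current and label == ch})
    return nfa.final in current


x_or_y_star = closure(alternation(symbol("x"), symbol("y")))   # (x | y)*
print(accepts(x_or_y_star, "xyyx"), accepts(x_or_y_star, "xz"))
```

Because every constructor returns a machine with exactly one start and one final state, the pieces compose in any order, which is what makes the construction inductive.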
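Develop an NFA for the RE: (x | y)*
First create the NFA for (x | y):
[Figure: A → B on ε, B → C on x, C → F on ε; A → D on ε, D → E on y, E → F on ε]
Then add in the closure operator:
[Figure: the (x | y) machine wrapped with new states S, G and H and additional ε-transitions]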
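RE: aa* | bb*
[Figure: NFA transition diagram with start state 0; ε-transitions from 0 to 1 and 3; 1 → 2 on a with a loop on a at 2; 3 → 4 on b with a loop on b at 4]
States: {0, 1, 2, 3, 4}
Input symbols: {a, b}
Transition function:
(0, ε) = {1, 3}, (1, a) = {2}, (2, a) = {2}
(3, b) = {4}, (4, b) = {4}
Start state: 0
Final states: {2, 4}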
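This NFA can be simulated by taking the ε-closure of the current state set before and after each input symbol. The sketch below copies the transition function above into a dictionary; representing ε by the empty string and the function names are assumptions made here.

```python
EPS = ""   # stand-in label for ε

DELTA = {
    (0, EPS): {1, 3},
    (1, "a"): {2},
    (2, "a"): {2},
    (3, "b"): {4},
    (4, "b"): {4},
}
START = 0
FINAL = {2, 4}


def eps_closure(states):
    """All states reachable from the given states using only ε-transitions."""
    closure = set(states)
    stack = list(states)
    while stack:
        q = stack.pop()
        for r in DELTA.get((q, EPS), set()):
            if r not in closure:
                closure.add(r)
                stack.append(r)
    return closure


def accepts(word):
    current = eps_closure({START})
    for symbol in word:
        moved = set()
        for q in current:
            moved |= DELTA.get((q, symbol), set())
        current = eps_closure(moved)
    return bool(current & FINAL)


print(eps_closure({START}))
for word in ["a", "aaa", "bb", "", "ab"]:
    print(repr(word), accepts(word))
```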
[Figure: NFA for (a | b)*abb with states 0 to 10, as produced by Thompson's construction]
A DFA is a special case of an NFA in which
1. No state has an ε-transition
2. For each state q and input symbol a, there is at most one edge labeled
a leaving q
Formal Definition of DFAs:
M = (Q, Σ, δ, q0, F)
Q : Set of states, e.g. {q0, q1, q2}
Σ : Input alphabet, e.g. {a, b}
δ : Transition function
q0 : Initial state
F : Set of accepting states
RE: (a | b)*abb
[Figure: DFA transition diagram with states 0, 1, 2, 3; start state 0; final state 3]
States: {0, 1, 2, 3}
Input symbols: {a, b}
Transition function:
(0,a) = 1, (1,a) = 1, (2,a) = 1, (3,a) = 1
(0,b) = 0, (1,b) = 2, (2,b) = 3, (3,b) = 0
Start state: 0
Final states: {3}
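Because every (state, symbol) pair has a single successor, this DFA can be run with a plain table lookup. The sketch below stores the transition function above as a nested dictionary; the layout and the name accepts are illustrative choices.

```python
# DELTA[state][symbol] -> next state, copied from the transition function
DELTA = {
    0: {"a": 1, "b": 0},
    1: {"a": 1, "b": 2},
    2: {"a": 1, "b": 3},
    3: {"a": 1, "b": 0},
}
START = 0
FINAL = {3}


def accepts(word):
    """Table-driven run of the DFA for (a | b)*abb."""
    state = START
    for symbol in word:
        if symbol not in DELTA[state]:
            return False            # symbol outside the input alphabet
        state = DELTA[state][symbol]
    return state in FINAL


for word in ["abb", "aabb", "abababb", "ab", "abba"]:
    print(word, accepts(word))
```

Unlike the NFA simulation, only one current state has to be tracked.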
Finding NFA States
aa* | b | ab
[Figure: NFA with states 0 to 5]
State   ε           a     b
0       {1, 2, 3}   -     -
1       -           4     -
2       -           -     5
3       -           2     -
4       -           4     -
5       -           -     -
Is there an NFA States
aa* | b | ab
[Figure: automaton with states 0 to 3]
State   a     b
0       1     3
1       2     3
2       2     -
3       -     -
DFA
* Action on each input is fully determined
* Implement using table-driven approach
* More states generally required to implement RE
NFA
* May have a choice at each step
* Accepts string if there is any path to an accepting state
* Not obvious how to implement this
a set of NFA states ≡ a DFA state
• Find the initial state of the DFA
• Find all the states in the DFA
• Construct the transition table
• Find the final states of the DFA
We can do that by removing every non-deterministic case.
* Non-deterministic cases:
1. States with multiple outgoing edges on the same input
2. ε transitions
• Solving 1: Multiple transitions
– Solve by subset construction
– Build a new DFA based upon the power set of states of the NFA
– Move(S, a) is relabeled to target a new state whenever a
single input goes to multiple states
[Figure: NFA for a+b* with states 1 and 2]
• Solving 2: ε transitions
– Any state reachable by an ε transition is "part of the state"
– ε-closure: any state reachable from S by ε transitions is in the ε-closure of S; treat the
ε-closure as one big state, and always include the ε-closure as part of the state
[Figure: NFA with states 1, 2, 3 (left) and the resulting DFA with states 1, 2/3, 3 (right)]
ε-closure(2) = {2, 3}: create new state 2/3
(1, a) → 2/3      (1, b) → -
(2/3, a) → 2/3    (2/3, b) → 3
(3, a) → -        (3, b) → 3
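Both fixes (merging multiple targets and absorbing ε-closures) can be combined in one worklist algorithm. The sketch below is a minimal version of subset construction under stated assumptions: the NFA edges are inferred from the table above because the figure itself did not survive extraction, and state 3 is assumed to be final. Function and variable names are hypothetical.

```python
from collections import deque

EPS = ""   # stand-in label for ε


def eps_closure(nfa, states):
    """States reachable from the given states using only ε-transitions."""
    closure, stack = set(states), list(states)
    while stack:
        q = stack.pop()
        for r in nfa.get((q, EPS), set()):
            if r not in closure:
                closure.add(r)
                stack.append(r)
    return frozenset(closure)


def subset_construction(nfa, start, finals, alphabet):
    """Return (dfa_start, dfa_delta, dfa_finals); DFA states are frozensets of NFA states."""
    d_start = eps_closure(nfa, {start})
    d_delta, d_finals = {}, set()
    queue, seen = deque([d_start]), {d_start}
    while queue:
        S = queue.popleft()
        if S & finals:
            d_finals.add(S)                 # contains an NFA final state
        for a in alphabet:
            moved = set()
            for q in S:
                moved |= nfa.get((q, a), set())
            T = eps_closure(nfa, moved)
            if not T:
                continue                    # no transition; a trap state could be added here
            d_delta[(S, a)] = T
            if T not in seen:
                seen.add(T)
                queue.append(T)
    return d_start, d_delta, d_finals


# Edges inferred from the table: 1 -a-> 2, 2 -a-> 2, 2 -ε-> 3, 3 -b-> 3
NFA = {(1, "a"): {2}, (2, "a"): {2}, (2, EPS): {3}, (3, "b"): {3}}
start, delta, finals = subset_construction(NFA, 1, {3}, ["a", "b"])
for (S, a), T in delta.items():
    print(sorted(S), a, "->", sorted(T))
```

The DFA state printed as [2, 3] corresponds to the state labeled 2/3 in the table.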
Example: converting NFA M to a DFA M'
[Figure: NFA M with states q0, q1, q2 and transitions on a and b; q1 ∈ F]
The DFA starts in state {q0}.
δ*(q0, a) = {q1, q2}: add DFA state {q1, q2} and the transition {q0} → {q1, q2} on a.
δ*(q0, b) = ∅ (empty set): add a trap state and the transition {q0} → trap state on b.
δ*(q1, a) = {q1, q2} and δ*(q2, a) = ∅; their union is {q1, q2}, so {q1, q2} → {q1, q2} on a.
δ*(q1, b) = {q0}, so {q1, q2} → {q0} on b.
END OF CONSTRUCTION: since q1 ∈ F in the NFA, {q1, q2} ∈ F in the DFA.
Example 2: Conversion NFA to DFA
[Figure: NFA with states q0, q1, q2, q3 over the alphabet {0, 1}; start state q0]
{q0}: call it A
(A,0) = {q1, q3}: call it B
(A,1) = {q2, q3}: call it C
(B,0) = {q1}: call it D
(B,1) = {q3}: call it E
(C,0) = {q3}: it is E
(C,1) = {q2}: call it F
(D,0) = {q1}: it is D
(D,1) = {q3}: it is E
(E,0) = ∅
(E,1) = ∅
(F,0) = {q3}: it is E
(F,1) = {q2}: it is F
States   0   1
A        B   C
B        D   E
C        E   F
D        D   E
E        -   -
F        E   F
[Figure: resulting DFA with states A to F]
• Prior to NFA to DFA conversion:
• Empty cycle removal
– Combine nodes that comprise the cycle
• Empty transition removal
[Figure: NFA with states 1 to 4 containing ε-transitions (top) and the reduced NFA with states 1, 2, 4 (bottom)]
• Resulting DFA can be quite large
– Contains redundant or equivalent states
– Find groups of equivalent states and merge them
[Figure: a DFA with states 1 to 5 and an equivalent DFA with states 1, 2, 3]
Both DFAs accept b*ab*a
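Finding groups of equivalent states can be sketched with partition refinement: start from the split final / non-final and keep splitting groups whose members disagree on which group each input leads to. The 5-state DFA below is a hypothetical redundant machine for b*ab*a, since the transitions of the slide's figure are not recoverable; the function name and data layout are also assumptions. Missing transitions are treated as leading to an implicit dead state.

```python
def minimize(states, alphabet, delta, finals):
    """Group equivalent DFA states by repeated partition refinement."""
    block_of = {q: int(q in finals) for q in states}   # initial split: final vs non-final
    while True:
        ids = {}
        refined = {}
        for q in states:
            # signature: own block plus the block reached on each symbol
            signature = (block_of[q],) + tuple(
                block_of.get(delta.get((q, a))) for a in alphabet)
            refined[q] = ids.setdefault(signature, len(ids))
        if len(ids) == len(set(block_of.values())):
            break                                      # no group was split further
        block_of = refined
    groups = {}
    for q in states:
        groups.setdefault(block_of[q], []).append(q)
    return sorted(groups.values())


# Hypothetical redundant DFA for b*ab*a (not the one in the figure)
DELTA = {
    (1, "b"): 2, (2, "b"): 2, (1, "a"): 3, (2, "a"): 3,
    (3, "b"): 4, (4, "b"): 4, (3, "a"): 5, (4, "a"): 5,
}
print(minimize([1, 2, 3, 4, 5], ("a", "b"), DELTA, {5}))
```

Each resulting group becomes one state of the smaller DFA; here five states collapse into three.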
• Two programs were developed at Bell Labs in the mid-1970s
– Lex: a transducer that transforms an input stream into the alphabet
of the grammar processed by yacc
Flex = fast lex, a later free-software reimplementation of lex by Vern Paxson
– Yacc: yet another compiler-compiler
PART 1: Convert Regular Language to RE
1. Write a regular expression for all strings of 0 and 1 which contain the substring 0110
2. Write a regular expression for all strings of Y and Z where every Z is immediately followed by
at least 5 Z
3. Write a regular expression for all strings of P and Q which contain an odd number of Q
PART 3: Convert RE to NFA
1. M3 = Construct an FA that can read the RE: a (b | c) d (e | f) g (h | i)
PART 5: Convert NFA to DFA
1. M5 = [Figure: NFA M5, not recoverable from the extracted text]
Have a terrific day