Lecture 02
NFA
Thus, the only strings getting to the accepting state are those that
end in abb.
Lexical Analysis
The Role of the Lexical Analyzer
• The main task of the lexical analyzer is to read the input
characters of the source program, group them into lexemes,
and produce as output a sequence of tokens, one for each
lexeme in the source program.
• The stream of tokens is sent to the parser for syntax
analysis.
• It is common for the lexical analyzer to interact with the
symbol table as well.
• When the lexical analyzer discovers a lexeme constituting
an identifier, it needs to enter that lexeme into the symbol
table.
• In some cases, information regarding the kind of identifier
may be read from the symbol table by the lexical analyzer
to assist it in determining the proper token it must pass to
the parser.
Interaction between the lexical
analyzer & the parser
• Other Tasks:
Stripping out comments and whitespace
(blank, newline, tab, and perhaps other
characters that are used to separate tokens in
the input).
Correlating error messages generated by the
compiler with the source program.
Divided into two processes
1. Scanning consists of the simple processes that
do not require tokenization of the input, such
as deletion of comments and compaction of
consecutive whitespace characters into one.
2. Lexical analysis proper is the more complex
portion, where the scanner produces the
sequence of tokens as output.
Lexical Analysis Versus Parsing
• Why is the analysis portion of a compiler
normally separated into lexical analysis and
parsing (syntax analysis) phases?
1. Simplicity of design is the most important
consideration.
The separation of lexical and syntactic analysis
often allows us to simplify at least one of these
tasks.
If we are designing a new language, separating
lexical and syntactic concerns can lead to a
cleaner overall language design.
2. Compiler efficiency is improved.
A separate lexical analyzer allows us to
apply specialized techniques that serve only
the lexical task, not the job of parsing.
In addition, specialized buffering
techniques for reading input characters can
speed up the compiler significantly.
3. Compiler portability is enhanced.
Input-device-specific peculiarities can be
restricted to the lexical analyzer.
Tokens, Patterns & Lexemes
• Token
❖ a token name + an optional attribute value.
❖ The token name is an abstract symbol
representing a kind of lexical unit, e.g., a
particular keyword, or a sequence of input
characters denoting an identifier.
❖ The token names are the input symbols
that the parser processes.
• Pattern
o A pattern is a description of the form that the
lexemes of a token may take.
o In the case of a keyword as a token, the pattern is
just the sequence of characters that form the
keyword.
o For identifiers and some other tokens, the pattern
is a more complex structure that is matched by
many strings.
• Lexemes
❑ A lexeme is a sequence of characters in the source
program that matches the pattern for a token & is
identified by the lexical analyzer as an instance of
that token.
Example: Patterns & Lexemes
• C statement
printf("Total = %d\n", score);
• Both printf and score are lexemes matching the
pattern for token id, and "Total = %d\n" is a lexeme
matching the pattern for token literal.
Covering most or all of the tokens
1. One token for each keyword. The pattern for a
keyword is the same as the keyword itself.
2. Tokens for the operators, either individually or in
classes .
3. One token representing all identifiers.
4. One or more tokens representing constants, such
as numbers and literal strings .
5. Tokens for each punctuation symbol, such as left
and right parentheses, comma, and semicolon.
Attributes for Tokens
• When more than one lexeme can match a pattern, the
lexical analyzer must provide the subsequent compiler
phases additional information about the particular
lexeme that matched.
• For example, the pattern for token number matches both
0 and 1, but it is extremely important for the code
generator to know which lexeme was found in the source
program.
• Thus, in many cases the lexical analyzer returns to the
parser not only a token name, but an attribute value that
describes the lexeme represented by the token ;
• Token name influences parsing decisions, while the
attribute value influences translation of tokens after the
parse.
Attributes for Tokens
• Tokens have at most one associated attribute,
although this attribute may have a structure
that combines several pieces of information.
• Normally, information about an identifier-e.g.,
its lexeme, its type, and the location at which
it is first found is kept in the symbol table.
• Thus, the appropriate attribute value for an
identifier is a pointer to the symbol-table
entry for that identifier.
Example 3.2 : The token names & associated
attribute values for the Fortran statement
E = M * C ** 2
• Sequence of pairs:
<id, pointer to symbol-table entry for E>
<assign_op> [no attribute value needed]
<id, pointer to symbol-table entry for M>
<mult_op> [no attribute value needed]
<id, pointer to symbol-table entry for C>
<exp_op> [no attribute value needed]
<number, integer value 2>
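A minimal sketch of how a lexer might produce these pairs for E = M * C ** 2. The token specification, the list-based symbol table, and the use of a list index as the "pointer" are all assumptions made for illustration; they are not taken from the lecture.

```python
import re

# Hypothetical token specification. Order matters: ** must be tried before *.
TOKEN_SPEC = [
    ("number",    r"\d+"),
    ("id",        r"[A-Za-z_]\w*"),
    ("exp_op",    r"\*\*"),
    ("mult_op",   r"\*"),
    ("assign_op", r"="),
    ("ws",        r"[ \t\n]+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(source):
    """Return (tokens, symbol_table); each token is a (name, attribute) pair."""
    symbol_table = []                      # one entry per distinct identifier
    tokens = []
    pos = 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            raise ValueError(f"no token pattern matches at position {pos}")
        name, lexeme = m.lastgroup, m.group()
        pos = m.end()
        if name == "ws":
            continue                       # whitespace is stripped, not returned
        if name == "id":
            if lexeme not in symbol_table:
                symbol_table.append(lexeme)
            # the "pointer" is modeled as an index into the symbol table
            tokens.append((name, symbol_table.index(lexeme)))
        elif name == "number":
            tokens.append((name, int(lexeme)))
        else:
            tokens.append((name, None))    # operators carry no attribute value
    return tokens, symbol_table

tokens, table = tokenize("E = M * C ** 2")
for tok in tokens:
    print(tok)
```

The token name alone is enough for the parser; the attribute (symbol-table index or integer value) is what later phases would consult.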
Lexical Errors
• It is hard for a lexical analyzer to tell, without the
aid of other components, that there is a
source-code error.
• For instance, if the string fi is encountered for the
first time in a C program in the context :
fi ( a == f (x) ) . . .
• A lexical analyzer cannot tell whether fi is a
misspelling of the keyword if or an undeclared
function identifier.
• Since fi is a valid lexeme for the token id, the
lexical analyzer must return the token id to the
parser
Lexical Errors
• Let the parser handle an error such as a transposition of
letters.
• However, suppose a situation arises in which the lexical
analyzer is unable to proceed because none of the patterns
for tokens matches any prefix of the remaining input.
• Error Recovery
✔ The simplest recovery strategy is "panic mode" recovery.
✔ We delete successive characters from the remaining input,
until the lexical analyzer can find a well-formed token at the
beginning of what input is left.
✔ This recovery technique may confuse the parser, but in an
interactive computing environment it may be quite
adequate.
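Panic-mode recovery can be sketched as follows. The three token patterns and the choice to stop skipping at whitespace are assumptions for a toy language, not part of the lecture.

```python
import re

# Hypothetical patterns for a toy language: identifiers, integers, a few operators.
TOKEN = re.compile(r"(?P<id>[A-Za-z_]\w*)|(?P<number>\d+)|(?P<op>[=+*()])")

def scan(source):
    """Tokenize `source`, using panic mode when no pattern matches."""
    tokens, errors = [], []
    pos = 0
    while pos < len(source):
        if source[pos].isspace():
            pos += 1                       # separators are skipped
            continue
        m = TOKEN.match(source, pos)
        if m:
            tokens.append((m.lastgroup, m.group()))
            pos = m.end()
            continue
        # Panic mode: delete successive characters until a well-formed
        # token (or a separator) starts at the current position.
        start = pos
        while (pos < len(source)
               and not source[pos].isspace()
               and not TOKEN.match(source, pos)):
            pos += 1
        errors.append((start, source[start:pos]))
    return tokens, errors

tokens, errors = scan("x = 3 $# + y")
print(tokens)
print(errors)
```

Here the characters `$#` are dropped and reported, and scanning resumes with `+`.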
• Other Recovery
• How: first replacing uses of d1 in r2 (which cannot use any of the d's
except for d1), then replacing uses of d1 and d2 in r3 by r1 and (the
substituted) r2 , and so on.
• digit → 0 | 1 | ··· | 9
re       Meaning
+        single + character
!        single ! character
=        single = character
!=       2-character sequence
<=       2-character sequence
xyzzy    5-character sequence
Extensions of Regular Expressions
• 1. One or more instances. The unary postfix operator + represents the
positive closure of a regular expression and its language.
That is, if r is a regular expression, then (r)+ denotes the language (L(r))+.
The operator + has the same precedence and associativity as the operator *.
Two useful algebraic laws, r* = r+ | ε and r+ = rr* = r*r, relate the Kleene
closure & positive closure.
• 2. Zero or one instance. The unary postfix operator ? means "zero or one
occurrence." That is, r? is equivalent to r | ε, or put another way, L(r?) = L(r)
U {ε}.
The ? operator has the same precedence and associativity as * and +.
• 3. Character classes. A regular expression a1 | a2 | ··· | an, where the ai's are
each symbols of the alphabet, can be replaced by the shorthand [a1 a2 ···
an].
When a1, a2, ..., an form a logical sequence, e.g., consecutive uppercase
letters, lowercase letters, or digits, we can replace them by a1-an, that is, just
the first and last separated by a hyphen.
[abc] = a|b|c, [a-z] = a|b|···|z
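These laws can be spot-checked with Python's `re` module, whose `+`, `?`, and `[...]` operators follow the same conventions. This is only a check on a handful of sample strings chosen for illustration, not a proof of language equality.

```python
import re

def same_on_samples(p, q, samples):
    """True if patterns p and q accept exactly the same sample strings."""
    return all(
        bool(re.fullmatch(p, s)) == bool(re.fullmatch(q, s)) for s in samples
    )

# Arbitrary sample strings (an assumption of this sketch).
samples = ["", "a", "aa", "aaa", "b", "ab", "c", "z", "A"]

law_plus  = same_on_samples(r"a+", r"aa*", samples)       # r+ = rr*
law_star  = same_on_samples(r"a*", r"(?:a+|)", samples)   # r* = r+ | ε
law_opt   = same_on_samples(r"a?", r"(?:a|)", samples)    # r? = r | ε
law_class = same_on_samples(r"[abc]", r"a|b|c", samples)  # [abc] = a|b|c
law_range = same_on_samples(r"[a-c]", r"a|b|c", samples)  # [a-c] = a|b|c

print(law_plus, law_star, law_opt, law_class, law_range)
```

An empty alternative, as in `(?:a|)`, is how ε is written here.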
Recognition of Tokens
• Build a piece of code that examines the input
string & finds a prefix that is a lexeme
matching one of the patterns.
Patterns
Example:
Terminals: if, then, else, relop, id, number (the names of tokens)
ws → ( blank | tab | newline )+
a) The same symbol can label edges from one state to several
different states,
b) An edge may be labeled by ε, the empty string, instead of, or
in addition to, symbols from the input alphabet.
Transition Tables
• Rows : states
• Columns: input symbols and ε.
• The entry for a given state & input is the
value of the transition function
applied to those arguments.
• If the transition function has no
information about that state-input
pair, put ∅.
• Adv: Easily find the transitions on a
given state and input.
• Disadv: Takes a lot of space when
the input alphabet is large.
Acceptance of Input Strings by
Automata
• An NFA accepts input string x if & only if there
is some path in the transition graph from the
start state to one of the accepting states such
that the symbols along the path spell out x.
• ε labels along the path are effectively ignored,
since the empty string does not contribute to
the string constructed along the path.
Example : The string aabb is accepted by the NFA
• States = squares.
• Inputs = r (move to an adjacent red square)
and b (move to an adjacent black square).
• Start state, final state are in opposite
corners.
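Example: Chessboard – (2)
Board squares:
1 2 3
4 5 6
7 8 9
Transition table (→ start state, * accepting state):
        r          b
→ 1     2,4        5
  2     4,6        1,3,5
  3     2,6        5
  4     2,8        1,5,7
  5     2,4,6,8    1,3,7,9
  6     2,8        3,5,9
  7     4,8        5
  8     4,6        5,7,9
* 9     6,8        5
Input r b b:
{1} --r--> {2,4} --b--> {1,3,5,7} --b--> {1,3,5,7,9}
Accept, since the final state 9 is reached.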
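The chessboard NFA can be simulated by tracking the set of squares reachable after each input symbol. The dictionary encoding and function names are assumptions of this sketch, and it does not handle ε-moves (the chessboard NFA has none).

```python
# Transition table of the chessboard NFA: state -> symbol -> set of next states.
CHESS_NFA = {
    1: {"r": {2, 4},       "b": {5}},
    2: {"r": {4, 6},       "b": {1, 3, 5}},
    3: {"r": {2, 6},       "b": {5}},
    4: {"r": {2, 8},       "b": {1, 5, 7}},
    5: {"r": {2, 4, 6, 8}, "b": {1, 3, 7, 9}},
    6: {"r": {2, 8},       "b": {3, 5, 9}},
    7: {"r": {4, 8},       "b": {5}},
    8: {"r": {4, 6},       "b": {5, 7, 9}},
    9: {"r": {6, 8},       "b": {5}},
}

def nfa_trace(delta, start, word):
    """Return the list of state sets reached before and after each symbol."""
    current = {start}
    trace = [current]
    for symbol in word:
        # union of the moves of every state we might currently be in
        current = set().union(*(delta[s].get(symbol, set()) for s in current))
        trace.append(current)
    return trace

def nfa_accepts(delta, start, finals, word):
    """Accept if at least one path ends in an accepting state."""
    return bool(nfa_trace(delta, start, word)[-1] & finals)

print(nfa_trace(CHESS_NFA, 1, "rbb"))
print(nfa_accepts(CHESS_NFA, 1, {9}, "rbb"))
```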
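Example
• An NFA accepting all strings that end in 01
[Figure: transition diagram. q0 (start) loops on 0 and 1 and goes to q1 on 0; q1 goes to q2 on 1; q2 is accepting.]
Input: 00101
Sets of states reached after each symbol:
{q0} --0--> {q0,q1} --0--> {q0,q1} --1--> {q0,q2} --0--> {q0,q1} --1--> {q0,q2}
The branch in q1 is stuck on the second 0, and the branch in q2 is stuck on the
0 that follows it. The final set contains q2, so the input is accepted.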
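Example
• An NFA whose input alphabet {0} consists of a
single symbol. It accepts all strings of the form 0^k
where k is a multiple of 2 or 3 (accepts ε, 00, 0000,
000000, but not 0 or 00000).
[Figure: transition diagram with ε-moves from the start state into a cycle of two 0-transitions and a cycle of three 0-transitions]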
Example
[Figure: NFA transition diagram; only the state labels q2, q3 and the edge labels a and a, b survive]
Transition Table
NFA A = ({q0, q1, q2}, {0, 1}, δ, q0, {q2})
[Figure: transition diagram of the NFA for strings ending in 01]
        0          1
→q0     {q0,q1}    {q0}
 q1     Ø          {q2}
*q2     Ø          Ø
Transition Table
• Accepts all strings that contain either 101
or 11 as a substring (e.g., 010110)
[Figure: transition diagram with states q1 (start), q2, q3, q4 and edges labeled 0,1 (loops), 1, 0,ε, and 1]
Deterministic Finite Automata (DFA)
1. There are no moves on input ε
2. For each state s & input symbol a, there is
exactly one edge out of s labeled a
• If we are using a transition table to represent a
DFA, then each entry is a single state.
• Represent this state without the curly braces
that we use to form sets.
• Lexical Analyzer---DFA
Algorithm: Simulating a DFA.
• INPUT: An input string x terminated by an
end-of-file character eof. A DFA D with start state
s0 , accepting states F, and transition function
move.
• OUTPUT: Answer "yes" if D accepts x ; "no"
otherwise.
• METHOD: Apply the algorithm to the input string
x. The function move(s, c) gives the state to which
there is an edge from state s on input c. The
function nextChar returns the next character of
the input string x.
Example: for the DFA accepting (a|b)*abb and the input ababb, the
algorithm enters the sequence of states 0, 1, 2, 1, 2, 3
& returns "yes."
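The simulation algorithm can be sketched as below. The transition table is an assumption: it is the usual four-state DFA for (a|b)*abb, chosen because it reproduces the state sequence 0, 1, 2, 1, 2, 3 quoted above. A Python string iterator stands in for nextChar and eof.

```python
# Assumed transition table for the DFA of (a|b)*abb; state 3 is accepting.
MOVE = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 1, (2, "b"): 3,
    (3, "a"): 1, (3, "b"): 0,
}

def simulate_dfa(move, start, accepting, x):
    """Run DFA on x; return ("yes" | "no", list of states entered)."""
    s = start
    visited = [s]
    for c in x:                 # plays the role of nextChar() until eof
        s = move[(s, c)]        # exactly one edge per (state, symbol)
        visited.append(s)
    return ("yes" if s in accepting else "no"), visited

answer, states = simulate_dfa(MOVE, 0, {3}, "ababb")
print(answer, states)
```

Because each table entry is a single state, the loop does constant work per input character.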
Example
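● Draw the transition diagram for the DFA
accepting all strings with a substring 01.
[Figure: transition diagram. q0 (start) loops on 1 and goes to q2 on 0; q2 loops on 0 and goes to q1 on 1; q1 (accepting) loops on 0 and 1.]
A = ({q0, q1, q2}, {0, 1}, δ, q0, {q1})
Check with the strings 01, 11010, 100011,
0111, 110101, 11101101, 111000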
Transition Function & Table
● δ(q0, 0) = q2
● δ(q0, 1) = q0
● δ(q1, 0) = q1
● δ(q1, 1) = q1
● δ(q2, 0) = q2
● δ(q2, 1) = q1
        0     1
→q0     q2    q0
*q1     q1    q1
 q2     q2    q1
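Example
● Let us design a DFA to accept the language
L = {w | w has both an even number of 0's
and an even number of 1's}
States:
q0: even number of 0's, even number of 1's
q1: even number of 0's, odd number of 1's
q2: odd number of 0's, even number of 1's
q3: odd number of 0's, odd number of 1's
         0     1
→*q0     q2    q1
  q1     q3    q0
  q2     q0    q3
  q3     q1    q2
[Figure: transition diagram with q0 (start, accepting), q1, q2, q3; each 0 toggles the parity of 0's and each 1 toggles the parity of 1's]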
Example: Try Yourself
• A = {w | w contains at least one 1 and an even
number of 0s follow the last 1}
• Hints: A1 = (Q, ∑, δ, q1, F)
1. Q = {q1, q2, q3}
2. ∑ = {0, 1}
3. δ try yourself
4. Start state: q1
5. Final state: {q2}
Example
[Figure: transition diagram with states q1 and q2 and edge labels 0 and 1]
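DFA vs. NFA
[Figure: a deterministic computation (a single path ending in accept/reject) next to a nondeterministic computation (a parallel computation tree with accept and reject branches)]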
NFA to DFA
Subset Construction
• Given an NFA with states Q, inputs Σ,
transition function δN, start state q0, and
final states F, construct an equivalent DFA with:
– States 2^Q (set of subsets of Q).
– Inputs Σ.
– Start state {q0}.
– Final states = all those with a member of F.
Subset Construction
• Given, NFA: N = (QN, Σ, δN, q0, FN)
• Goal: DFA, D = (QD, Σ, δD, {q0}, FD)
• L(D) = L(N)
States
❑ QD is the set of subsets of QN
- QD is the power set of QN
- If QN has n states, QD will have 2^n states
❑ Inaccessible states can be thrown away, so
effectively, the number of states of D is often much smaller than 2^n
Subset construction
Final States
• FD is the set of subsets S of QN such that S ∩ FN
≠ ∅ . That is FD is all sets of N’s states that
include at least one accepting state of N.
Transition Function
• The transition function δD is defined by:
δD({q1,…,qk}, a) is the union over all i = 1,…,k of
δN(qi, a).
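A minimal sketch of this construction for an NFA without ε-moves. It builds only the subsets reachable from {q0}, so inaccessible states never appear. The dictionary encoding and function name are assumptions; the NFA used is the "strings ending in 01" automaton from the earlier slides.

```python
from collections import deque

def subset_construction(delta_n, start, finals, alphabet):
    """Build the reachable part of the DFA equivalent to an ε-free NFA.

    delta_n maps (state, symbol) -> set of states; missing keys mean ∅.
    """
    start_set = frozenset([start])
    delta_d = {}
    seen = {start_set}
    queue = deque([start_set])
    while queue:
        S = queue.popleft()
        for a in alphabet:
            # δD(S, a) is the union over q in S of δN(q, a)
            T = frozenset().union(*(delta_n.get((q, a), set()) for q in S))
            delta_d[(S, a)] = T
            if T not in seen:
                seen.add(T)
                queue.append(T)
    finals_d = {S for S in seen if S & finals}   # S ∩ FN ≠ ∅
    return seen, delta_d, start_set, finals_d

# NFA accepting all strings that end in 01.
NFA_01 = {
    ("q0", "0"): {"q0", "q1"},
    ("q0", "1"): {"q0"},
    ("q1", "1"): {"q2"},
}
states, delta_d, start_d, finals_d = subset_construction(NFA_01, "q0", {"q2"}, "01")
for S in sorted(states, key=sorted):
    print(sorted(S), sorted(delta_d[(S, "0")]), sorted(delta_d[(S, "1")]))
```

Only three of the 2^3 = 8 possible subsets are generated, which illustrates why the practical state count is usually far below the bound.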
Subset Construction: Example 1
• Example: We'll construct the DFA
equivalent of our "chessboard" NFA.
1 2 3
4 5 6
7 8 9
NFA transition table (→ start state, * accepting state):
        r          b
→ 1     2,4        5
  2     4,6        1,3,5
  3     2,6        5
  4     2,8        1,5,7
  5     2,4,6,8    1,3,7,9
  6     2,8        3,5,9
  7     4,8        5
  8     4,6        5,7,9
* 9     6,8        5
DFA transition table, built one reachable state set at a time starting from {1}:
                   r            b
→ {1}              {2,4}        {5}
  {2,4}            {2,4,6,8}    {1,3,5,7}
  {5}              {2,4,6,8}    {1,3,7,9}
  {2,4,6,8}        {2,4,6,8}    {1,3,5,7,9}
  {1,3,5,7}        {2,4,6,8}    {1,3,5,7,9}
* {1,3,7,9}        {2,4,6,8}    {5}
* {1,3,5,7,9}      {2,4,6,8}    {1,3,5,7,9}
Example 2
[Figure: transition diagram of the NFA for strings ending in 01, with states q0 (start), q1, q2 (accepting)]
                 0          1
 Ø               Ø          Ø
→{q0}            {q0,q1}    {q0}
 {q1}            Ø          {q2}
*{q2}            Ø          Ø
 {q0,q1}         {q0,q1}    {q0,q2}
*{q0,q2}         {q0,q1}    {q0}
*{q1,q2}         Ø          {q2}
*{q0,q1,q2}      {q0,q1}    {q0,q2}
Example 2
• NFA N accepts all strings that end in 01
• N's set of states: {q0, q1, q2}, i.e., 3 states
• Subset construction: the DFA needs 2^3 = 8 states
• Assign new names: A for ∅, B for {q0}, C for {q1}, D for {q2},
E for {q0, q1}, F for {q0, q2}, G for {q1, q2}, H for {q0, q1, q2}
       0    1
 A     A    A
→B     E    B
 C     A    D
*D     A    A
 E     E    F
*F     E    B
*G     A    D
*H     E    F
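Example 2
[Figure: DFA with states B (start), E, F (accepting). B loops on 1 and goes to E on 0; E loops on 0 and goes to F on 1; F goes to E on 0 and to B on 1.]
       0    1
 A     A    A
→B     E    B
 C     A    D
*D     A    A
 E     E    F
*F     E    B
*G     A    D
*H     E    F
• Of the 8 states, starting in start state B, we can only reach states B, E & F
▪ The other 5 states are inaccessible from B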
Example 3
• N = (Q, {a, b}, δ, 1, {1})
• Q = {1, 2, 3}, i.e., 3 states
• DFA states: 2^3 = 8
• {∅, {1}, {2}, {3}, {1, 2}, {1, 3},
{2, 3}, {1, 2, 3}}
[Figure: NFA with states 1, 2, 3 and edges labeled b, ε, a, and a, b]
              a            b         ε
∅             ∅            ∅         ∅
{1}           ∅            {2}       {3}
{2}           {2, 3}       {3}       ∅
{3}           {1, 3}       ∅         ∅
{1, 2}        {2, 3}       {2, 3}    ∅
{1, 3}        {1, 3}       {2}       ∅
{2, 3}        {1, 2, 3}    {3}       ∅
{1, 2, 3}     {1, 2, 3}    {2, 3}    ∅
[Figure: transition diagram of the DFA over the eight subsets, with edges as given in the table above]
Example 3
Simplified: no incoming arrows point at states {1} & {1, 2}, so they
may be removed without affecting the performance.
[Figure: simplified DFA with states ∅, {2}, {3}, {1, 3}, {2, 3}, {1, 2, 3}]
Closure of States
• CL(q) = set of states you can reach from state
q following only arcs labeled ε.
• Example: CL(A) = {A};
CL(E) = {B, C, D, E}.
[Figure: ε-NFA with states A, B, C, D, E, F and edges labeled 0, 1, and ε]
Set of states
The subset construction
Computing ε-closure(T)
Example: NFA accepting R = (a|b)*abb
Σ = {a, b}
Marked
• ε-closure(0) = {0, 1, 2,4, 7} = A
• Mark A, Compute Dtran [A, a] & Dtran [A, b]
• Dtran [A, a] = ε-closure (move(A, a))
= ε-closure (move({0, 1, 2, 4, 7}, a))
= ε-closure ({3, 8})
= {3, 6, 7, 1, 2, 4} U {8}
= {1, 2, 3, 4, 6, 7, 8} = B
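The computation of ε-closure and Dtran can be sketched as below. The NFA encoding is an assumption: it is the usual 11-state NFA for (a|b)*abb, chosen because it reproduces ε-closure(0) = {0, 1, 2, 4, 7} and move(A, a) = {3, 8} from the steps above.

```python
# Assumed NFA for (a|b)*abb, states 0..10, accepting state 10.
EPS = {0: {1, 7}, 1: {2, 4}, 3: {6}, 5: {6}, 6: {1, 7}}            # ε-moves
SYM = {(2, "a"): {3}, (4, "b"): {5}, (7, "a"): {8},
       (8, "b"): {9}, (9, "b"): {10}}                               # symbol moves

def eps_closure(T):
    """All states reachable from the set T using ε-transitions only."""
    closure = set(T)
    stack = list(T)
    while stack:
        t = stack.pop()
        for u in EPS.get(t, ()):
            if u not in closure:
                closure.add(u)
                stack.append(u)
    return frozenset(closure)

def move(T, a):
    """States reachable from some state in T by one transition on a."""
    return set().union(*(SYM.get((s, a), set()) for s in T))

def build_dtran(alphabet="ab"):
    """Subset construction with marked/unmarked DFA states."""
    start = eps_closure({0})
    dstates, unmarked, dtran = [start], [start], {}
    while unmarked:
        T = unmarked.pop()                 # "mark" T
        for a in alphabet:
            U = eps_closure(move(T, a))
            if U not in dstates:
                dstates.append(U)
                unmarked.append(U)
            dtran[(T, a)] = U
    return dstates, dtran

A = eps_closure({0})
B = eps_closure(move(A, "a"))
dstates, dtran = build_dtran()
print(sorted(A), sorted(B), len(dstates))
```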
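Σ = {a, b}
Parse tree
Step 1: For subexpression r1 = a
[Figure: NFA fragment with a b-transition from state 8 to state 9]
Step 9: For subexpression r10 = b
[Figure: NFA fragment with a b-transition from state 9' to state 10, followed by the combined fragment 8 --b--> 9 --b--> 10]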
Important States of NFA
• A state of an NFA is important if it has a non-ε out-transition.
• Notice that the subset construction uses only the important
states in a set T when it computes
ε- closure (move(T, a)),
-the set of states reachable from T on input a.
• The set of states move(s , a) is nonempty only if state s is
important.
• During the subset construction, two sets of NFA states can
be identified (treated as if they were the same set) if they:
• 1. Have the same important states, and
• 2. Either both have accepting states or neither does.
• The only important states are those introduced as
initial states in the basis part for a particular
symbol position in the regular expression.
• Each important state corresponds to a particular
operand in the regular expression.
• The constructed NFA has only one accepting state,
but this state, having no out-transitions, is not an
important state
❑By concatenating a unique right end marker # to a regular expression r, we
give the accepting state for r a transition on #, making it an important state of
the NFA for (r) #.
❑augmented regular expression (r)#,
❑when the construction is complete, any state with a transition on # must be
an accepting state.
Nodes
• The important states of the NFA correspond directly to
the positions in the regular expression that hold
symbols
• Represent the regular expression by its syntax tree
-leaves correspond to operands
-interior nodes correspond to operators
• Interior nodes:
• cat-node: concatenation operator (dot)
• or-node: union operator (|)
• star-node: star operator (*)
Syntax tree: (a|b)*abb#
• Leaves in a syntax tree are labeled by ε or by an
alphabet symbol.
❑ To each leaf not labeled ε, attach a unique integer.
❑ (the position of the leaf and also as a position of
its symbol)
❑ a symbol can have several positions (a: 1 & 3 )
• The positions in the syntax tree correspond to the
important states of the constructed NFA.
Example: NFA for r = (a|b)*abb# with the important states numbered and
other states represented by letters
[Figure: NFA diagram; the surviving labels show the b-transitions 8 --b--> 9 --b--> 10]
Functions Computed From the Syntax Tree
• To construct a DFA directly from a regular
expression, we construct its syntax tree and
then compute four functions:
❑ nullable
❑ firstpos
❑ lastpos
❑ followpos
Four Functions
1. nullable(n) is true for a syntax-tree node n if & only if the
subexpression represented by n has ε in its language.
▪ That is, the subexpression can be "made null" or the empty string, even
though there may be other strings it can represent as well.
2. firstpos(n) is the set of positions in the subtree rooted at n that
correspond to the first symbol of at least one string in the
language of the sub expression rooted at n.
3. lastpos(n) is the set of positions in the subtree rooted at n that
correspond to the last symbol of at least one string in the
language of the sub expression rooted at n
4. followpos(p), for a position p, is the set of positions q in the
entire syntax tree such that there is some string x = a1 a2 . . . an
in L ( (r ) #) such that for some i, there is a way to explain the
membership of x in L( (r) #) by matching ai to position p of the
syntax tree and ai+1 to position q
Example: Consider the cat-node n that corresponds to the
expression (a|b)*a
• nullable(n) is false, since this node generates all strings of
a's & b's ending in an a (e.g., aa, ba, aba); it does not generate ε
• the star-node below it is nullable; it generates ε along with all
other strings of a's & b's
• firstpos(n) = {1, 2, 3}
• lastpos(n) = {3}
• followpos(1) = {1, 2, 3}
Computing nullable, firstpos, & lastpos
• Compute nullable, firstpos, & lastpos by a
straightforward recursion on height of the tree
• Basis & inductive rules for nullable & firstpos
Example : only the
star-node is nullable.
• none of the leaves are
nullable, because they
each correspond to non-ε
operands.
• The or-node is not
nullable, because neither
of its children is.
• The star-node is nullable,
because every star-node
is nullable.
• Each of the cat-nodes,
having at least one
non-nullable child, is not
nullable.
▪firstpos(n) to the left of node n, and lastpos(n) to its right.
Each of the leaves has only itself for firstpos & lastpos, as required by
the rule for non-ε leaves
For the or-node, we take the union of firstpos
at the children and do the same for lastpos.
• consider the lowest cat-node, which we shall call n.
• To compute firstpos(n) , we first consider whether the
left operand is nullable, which it is in this case.
• Therefore, firstpos for n is the union of firstpos for
each of its children, that is {1, 2} U {3} = {1, 2, 3}.
• The rules for lastpos are the same as for firstpos, with
the children interchanged.
• To compute lastpos(n) we must ask whether its right
child (the leaf with position 3) is nullable, which it is
not.
• Therefore, lastpos(n) is the same as lastpos of the right
child, or {3}.
Computing Followpos
• two ways that a position of a regular
expression can be made to follow another:
1. If n is a cat-node with left child C1 & right child
C2 , then for every position i in lastpos(C1) , all
positions in firstpos(C2) are in followpos(i).
2. If n is a star-node, & i is a position in
lastpos(n) , then all positions in firstpos(n) are
in followpos(i).
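The four functions can be computed in one recursive pass over the syntax tree. The tuple encoding of nodes and the function name are assumptions of this sketch, and ε-leaves are not handled. The tree is the one for (a|b)*abb#, with positions 1 to 6.

```python
def analyze(node, followpos):
    """Return (nullable, firstpos, lastpos) for node; fill followpos in place."""
    kind = node[0]
    if kind == "leaf":                       # ("leaf", symbol, position)
        pos = node[2]
        followpos.setdefault(pos, set())
        return False, {pos}, {pos}
    if kind == "or":                         # ("or", left, right)
        n1, f1, l1 = analyze(node[1], followpos)
        n2, f2, l2 = analyze(node[2], followpos)
        return n1 or n2, f1 | f2, l1 | l2
    if kind == "cat":                        # ("cat", left, right)
        n1, f1, l1 = analyze(node[1], followpos)
        n2, f2, l2 = analyze(node[2], followpos)
        for i in l1:                         # rule 1: lastpos(C1) -> firstpos(C2)
            followpos[i] |= f2
        first = f1 | f2 if n1 else f1
        last = l1 | l2 if n2 else l2
        return n1 and n2, first, last
    if kind == "star":                       # ("star", child)
        _, f, l = analyze(node[1], followpos)
        for i in l:                          # rule 2: lastpos(n) -> firstpos(n)
            followpos[i] |= f
        return True, f, l
    raise ValueError(f"unknown node kind: {kind}")

# Syntax tree for (a|b)*abb#
STAR = ("star", ("or", ("leaf", "a", 1), ("leaf", "b", 2)))
ROOT = ("cat",
        ("cat",
         ("cat",
          ("cat", STAR, ("leaf", "a", 3)),
          ("leaf", "b", 4)),
         ("leaf", "b", 5)),
        ("leaf", "#", 6))

followpos = {}
root_nullable, root_first, root_last = analyze(ROOT, followpos)
print(root_nullable, root_first, root_last)
print(followpos)
```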
Example: Rule 1 for followpos requires that we look
at each cat-node, & put each position in firstpos of
its right child in followpos for each position in
lastpos of its left child.
[Figure: syntax tree annotated with firstpos (F) and lastpos (L) at the children C1 and C2 of each cat-node]
▪ For the lowest cat-node, that rule says position 3 is in
followpos(1) and followpos(2).
▪ The next cat-node says that 4 is in followpos(3).
▪ The remaining two cat-nodes give us 5 in followpos(4) & 6 in
followpos(5).
▪ Rule 2 applies to the star-node: positions 1 & 2 are in both followpos(1) &
followpos(2), since both firstpos & lastpos for this node are {1, 2}.
Directed graph for the function followpos
Converting a Regular Expression
Directly to a DFA
Algorithm: Construction of a DFA from a regular expression r.
INPUT : A regular expression r.
OUTPUT: A DFA D that recognizes L (r) .
METHOD:
1 . Construct a syntax tree T from the augmented regular
expression (r) #.
2. Compute nullable, firstpos, lastpos, & followpos for T
3. Construct Dstates, the set of states of DFA D , & Dtran, the
transition function for D. The states of D are sets of positions in T.
Initially, each state is "unmarked," & a state becomes "marked" just
before we consider its out-transitions.
▪The start state of D is firstpos(n0), where node n0 is the root of T.
▪The accepting states are those containing the position for
endmarker symbol #
Construction of a DFA directly from a
regular expression
Example: construct a DFA for the regular expression
r = (a|b)*abb.
▪The value of firstpos for the root of the tree: {1, 2, 3}
A = {1, 2, 3} ----Start state
• Compute Dtran[A, a] & Dtran[A, b].
• Among the positions of A, 1 & 3 correspond to a, while 2
corresponds to b.
• Dtran[A, a] = followpos(1) U followpos(3) = {1, 2, 3, 4} = B
• Dtran[A, b] = followpos(2) = {1, 2, 3} = A
• Compute Dtran[B, a] = Dtran[{1, 2, 3, 4}, a]
• Among the positions of B, 1 & 3 correspond to a.
• Dtran[B, a] = followpos(1) U followpos(3)
= {1, 2, 3, 4} = B
• Compute Dtran[B, b] = Dtran[{1, 2, 3, 4}, b]
• Among the positions of B, 2 & 4 correspond to b.
• Dtran[B, b] = followpos(2) U followpos(4)
= {1, 2, 3, 5} = C
• Compute Dtran[C, a] = Dtran[{1, 2, 3, 5}, a]
• Among the positions of C, 1 & 3 correspond to a.
• Dtran[C, a] = followpos(1) U followpos(3)
= {1, 2, 3, 4} = B
• Compute Dtran[C, b] = Dtran[{1, 2, 3, 5}, b]
• Among the positions of C, 2 & 5 correspond to b.
• Dtran[C, b] = followpos(2) U followpos(5)
= {1, 2, 3, 6} = D
• Compute Dtran[D, a] = Dtran[{1, 2, 3, 6}, a]
• Among the positions of D, 1 & 3 correspond to a.
• Dtran[D, a] = followpos(1) U followpos(3)
= {1, 2, 3, 4} = B
• Compute Dtran[D, b] = Dtran[{1, 2, 3, 6}, b]
• Among the positions of D, only 2 corresponds to b.
• Dtran[D, b] = followpos(2)
= {1, 2, 3} = A
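The Dtran computations above can be reproduced mechanically from the followpos table. In this sketch the followpos values and the symbol at each position are hard-coded from the example; the function name and the letter-naming of states are assumptions for illustration.

```python
# followpos and position symbols for (a|b)*abb#, as in the worked example.
FOLLOWPOS = {1: {1, 2, 3}, 2: {1, 2, 3}, 3: {4}, 4: {5}, 5: {6}, 6: set()}
SYMBOL_AT = {1: "a", 2: "b", 3: "a", 4: "b", 5: "b", 6: "#"}

def build_dfa(start, followpos, symbol_at, alphabet):
    """Construct Dstates and Dtran; each DFA state is a set of positions."""
    start = frozenset(start)
    dstates, unmarked, dtran = [start], [start], {}
    while unmarked:
        S = unmarked.pop()                       # mark S
        for a in alphabet:
            # union of followpos(p) for all positions p in S that hold symbol a
            U = frozenset().union(
                *(followpos[p] for p in S if symbol_at[p] == a)
            )
            if U not in dstates:
                dstates.append(U)
                unmarked.append(U)
            dtran[(S, a)] = U
    return dstates, dtran

dstates, dtran = build_dfa({1, 2, 3}, FOLLOWPOS, SYMBOL_AT, "ab")

# Name the states A, B, C, ... in order of discovery.
name = {S: chr(ord("A") + i) for i, S in enumerate(dstates)}
table = {(name[S], a): name[T] for (S, a), T in dtran.items()}
accepting = {name[S] for S in dstates if 6 in S}   # states containing #'s position

for S in dstates:
    print(name[S], sorted(S), table[(name[S], "a")], table[(name[S], "b")])
print("accepting:", accepting)
```

The start state is firstpos of the root, {1, 2, 3}, and the only accepting state is the one containing position 6.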
[Figure: the resulting DFA with states A (start), B, C, and D; D is accepting because it contains position 6, the position of #]
Conclusion
• Tokens
• Lexemes
• Patterns
• Regular Expressions
• Regular Definitions
• Transition Diagrams
• Finite Automata
• DFA & NFA
• Conversion (NFA to DFA, Regular Expression to
NFA/DFA)