Lecture 02

The document discusses the role of the lexical analyzer in a compiler, which involves reading source program characters, grouping them into lexemes, and producing a sequence of tokens for syntax analysis. It details the interaction between the lexical analyzer and parser, the definition of tokens, patterns, and lexemes, as well as the process of error recovery in lexical analysis. Additionally, it covers regular expressions, their definitions, and the recognition of tokens in programming languages.

Uploaded by

nihafahima9
• Example: R = (a|b)*abb

NFA

Double circle around state 3 indicates that this state is accepting.


The only way to get from the start state 0 to the accepting state is to
follow some path that stays in state 0 for a while, then goes to
states 1 , 2, and 3 by reading abb from the input.

Thus, the only strings getting to the accepting state are those that
end in abb.
Lexical Analysis
The Role of the Lexical Analyzer
• The main task of the lexical analyzer is to read the input
characters of the source program, group them into lexemes,
and produce as output a sequence of tokens, one for each
lexeme in the source program.
• The stream of tokens is sent to the parser for syntax
analysis.
• It is common for the lexical analyzer to interact with the
symbol table as well.
• When the lexical analyzer discovers a lexeme constituting
an identifier, it needs to enter that lexeme into the symbol
table.
• In some cases, information regarding the kind of identifier
may be read from the symbol table by the lexical analyzer
to assist it in determining the proper token it must pass to
the parser.
Interaction between the lexical
analyzer & the parser
• Other Tasks:
Stripping out comments and whitespace
(blank, newline, tab, and perhaps other
characters that are used to separate tokens in
the input).
Correlating error messages generated by the
compiler with the source program.
Divided into two processes
1. Scanning consists of the simple processes that
do not require tokenization of the input, such
as deletion of comments and compaction of
consecutive whitespace characters into one.
2. Lexical analysis proper is the more complex
portion, where the scanner produces the
sequence of tokens as output.
Lexical Analysis Versus Parsing
• Why is the analysis portion of a compiler
normally separated into lexical analysis and
parsing (syntax analysis) phases?
1. Simplicity of design is the most important
consideration.
The separation of lexical and syntactic analysis
often allows us to simplify at least one of these
tasks.
If we are designing a new language, separating
lexical and syntactic concerns can lead to a
cleaner overall language design.
2. Compiler efficiency is improved.
A separate lexical analyzer allows us to
apply specialized techniques that serve only
the lexical task, not the job of parsing.
In addition, specialized buffering
techniques for reading input characters can
speed up the compiler significantly.
3. Compiler portability is enhanced.
Input-device-specific peculiarities can be
restricted to the lexical analyzer.
Tokens, Patterns & Lexemes
• Token
❖ a token name + an optional attribute value.
❖ The token name is an abstract symbol
representing a kind of lexical unit, e.g., a
particular keyword, or a sequence of input
characters denoting an identifier.
❖ The token names are the input symbols
that the parser processes.
• Pattern
o A pattern is a description of the form that the
lexemes of a token may take.
o In the case of a keyword as a token, the pattern is
just the sequence of characters that form the
keyword.
o For identifiers and some other tokens, the pattern
is a more complex structure that is matched by
many strings.
• Lexemes
❑ A lexeme is a sequence of characters in the source
program that matches the pattern for a token & is
identified by the lexical analyzer as an instance of
that token.
Example: Patterns & Lexemes

• C statement
printf ("Total = %d\n" , score ) ;
• Both printf and score are lexemes matching the
pattern for token id, and "Total = %d\n" is a lexeme
matching literal.
Covering most or all of the tokens
1. One token for each keyword. The pattern for a
keyword is the same as the keyword itself.
2. Tokens for the operators, either individually or in
classes .
3. One token representing all identifiers.
4. One or more tokens representing constants, such
as numbers and literal strings .
5. Tokens for each punctuation symbol, such as left
and right parentheses, comma, and semicolon.
Attributes for Tokens
• When more than one lexeme can match a pattern, the
lexical analyzer must provide the subsequent compiler
phases additional information about the particular
lexeme that matched.
• For example, the pattern for token number matches both
0 and 1, but it is extremely important for the code
generator to know which lexeme was found in the source
program.
• Thus, in many cases the lexical analyzer returns to the
parser not only a token name, but an attribute value that
describes the lexeme represented by the token.
• Token name influences parsing decisions, while the
attribute value influences translation of tokens after the
parse.
Attributes for Tokens
• Tokens have at most one associated attribute,
although this attribute may have a structure
that combines several pieces of information.
• Normally, information about an identifier (e.g.,
its lexeme, its type, and the location at which
it is first found) is kept in the symbol table.
• Thus, the appropriate attribute value for an
identifier is a pointer to the symbol-table
entry for that identifier.
Example 3.2 : The token names & associated
attribute values for the Fortran statement
E = M * C ** 2
• Sequence of pairs:
<id, pointer to symbol-table entry for E>
<assign_op> (no attribute value needed)
<id, pointer to symbol-table entry for M>
<mult_op> (no attribute value needed)
<id, pointer to symbol-table entry for C>
<exp_op> (no attribute value needed)
<number, integer value 2>
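The sequence of pairs above can be reproduced with a small tokenizer. This is only a sketch: the regular expressions, the list-based symbol table, and the use of an index in place of a pointer are assumptions made for illustration, not part of the lecture.

```python
import re

# Token names follow the pairs listed above; the patterns are illustrative.
TOKEN_SPEC = [
    ("number",    r"\d+"),
    ("id",        r"[A-Za-z_][A-Za-z_0-9]*"),
    ("exp_op",    r"\*\*"),   # listed before mult_op so that ** is tried first
    ("mult_op",   r"\*"),
    ("assign_op", r"="),
    ("ws",        r"[ \t\n]+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(text):
    """Return (list of (token name, attribute) pairs, symbol table)."""
    symtab = []   # symbol table: identifier lexemes in order of first appearance
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"no pattern matches at position {pos}")
        kind, lexeme = m.lastgroup, m.group()
        pos = m.end()
        if kind == "ws":
            continue                      # whitespace is not passed to the parser
        if kind == "id":
            if lexeme not in symtab:
                symtab.append(lexeme)
            tokens.append((kind, symtab.index(lexeme)))  # index stands in for a pointer
        elif kind == "number":
            tokens.append((kind, int(lexeme)))
        else:
            tokens.append((kind, None))   # operators carry no attribute value
    return tokens, symtab

tokens, symtab = tokenize("E = M * C ** 2")
```

Here `tokens` holds seven pairs in the same order as the list above, and `symtab` holds the lexemes E, M, and C.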
Lexical Errors
• It is hard for a lexical analyzer to tell, without the
aid of other components, that there is a
source-code error.
• For instance, if the string fi is encountered for the
first time in a C program in the context :
fi ( a == f (x) ) . . .
• A lexical analyzer cannot tell whether fi is a
misspelling of the keyword if or an undeclared
function identifier.
• Since fi is a valid lexeme for the token id, the
lexical analyzer must return the token id to the
parser
Lexical Errors
• Let the parser handle an error due to transposition of the
letters, as in the fi example.
• However, suppose a situation arises in which the lexical
analyzer is unable to proceed because none of the patterns
for tokens matches any prefix of the remaining input.
• Error Recovery
✔ The simplest recovery strategy is "panic mode" recovery.
✔ We delete successive characters from the remaining input,
until the lexical analyzer can find a well-formed token at the
beginning of what input is left.
✔ This recovery technique may confuse the parser, but in an
interactive computing environment it may be quite
adequate.
• Other possible error-recovery actions:
• 1. Delete one character from the remaining input.
• 2. Insert a missing character into the remaining input.
• 3. Replace a character by another character.
• 4. Transpose two adjacent characters.
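Panic-mode recovery can be sketched as a scanning loop that drops one character whenever no token pattern matches at the current position. The token pattern and the function name below are hypothetical, chosen only to keep the example small.

```python
import re

# Hypothetical token pattern: identifiers, integers, and a few operators.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|[-+*/=()]")

def scan_with_panic_mode(text):
    """Return (lexemes found, characters deleted during recovery)."""
    tokens, skipped = [], []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        m = TOKEN.match(text, pos)
        if m:
            tokens.append(m.group())
            pos = m.end()
        else:
            # Panic mode: delete one character from the remaining input and retry.
            skipped.append(text[pos])
            pos += 1
    return tokens, skipped

tokens, skipped = scan_with_panic_mode("x = 3 $# + y")
```

A real lexical analyzer would also report each deleted character together with its source position.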
Terms for Parts of Strings
1 . A prefix of string s is any string obtained by removing zero or more
symbols from the end of s.
• ban, banana, and ε are prefixes of banana.
2. A suffix of string s is any string obtained by removing zero or more
symbols from the beginning of s.
nana, banana, and ε are suffixes of banana.
3. A substring of s is obtained by deleting any prefix and any suffix from s.
banana, nan, and ε are substrings of banana.
4. The proper prefixes, suffixes, and substrings of a string s are those
prefixes, suffixes, and substrings, respectively, of s that are neither ε nor
equal to s itself.
5. A subsequence of s is any string formed by deleting zero or more not
necessarily consecutive positions of s.
baan is a subsequence of banana.
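These definitions translate directly into a few set-building functions. The sketch below uses the Python empty string "" to stand for ε; the function names are illustrative.

```python
def prefixes(s):
    """All strings obtained by removing zero or more symbols from the end of s."""
    return {s[:i] for i in range(len(s) + 1)}

def suffixes(s):
    """All strings obtained by removing zero or more symbols from the beginning of s."""
    return {s[i:] for i in range(len(s) + 1)}

def substrings(s):
    """All strings obtained by deleting any prefix and any suffix from s."""
    return {s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)}

def proper(parts, s):
    """Drop ε and s itself from a set of prefixes, suffixes, or substrings."""
    return parts - {"", s}

def is_subsequence(t, s):
    """True if t can be formed by deleting zero or more positions of s."""
    it = iter(s)
    return all(ch in it for ch in t)   # each `in` consumes s up to the match
```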
Regular Expressions
• Regular expressions are a notation for describing lexeme
patterns. In this example, the underscore is included
among the letters.
• If letter_ is established to stand for any letter
or the underscore, and digit is established to
stand for any digit, then we could describe the
language of C identifiers by:
letter_ ( letter_ | digit )*
▪Vertical bar: union
▪Parentheses: group subexpressions
▪Star: zero or more occurrences of the preceding expression
▪Juxtaposition of letter_ with the remainder of the expression signifies concatenation
• Each regular expression r denotes a language
L(r) , which is also defined recursively from the
languages denoted by r's sub expressions.

• Rules that define the regular expressions over
some alphabet Σ and the languages that those
expressions denote
BASIS: There are two rules
• 1. ε is a regular expression, and L(ε) is {ε},
▪ the language whose sole member is the empty
string.
• 2. If a is a symbol in Σ, then a is a regular
expression, and L(a) = {a},
▪ the language with one string, of length one,
with a in its one position.
INDUCTION: There are four parts to the induction whereby larger
regular expressions are built from smaller ones. Suppose r and s are
regular expressions denoting languages L(r) and L(s).
1. (r)|(s) is a regular expression denoting the language L(r) ∪ L(s).
2. (r)(s) is a regular expression denoting the language L(r)L(s).
3. (r)* is a regular expression denoting (L(r))*.
4. (r) is a regular expression denoting L(r).
▪ This last rule says that we can add additional pairs of parentheses
around expressions without changing the language they denote.
• We may drop certain pairs of parentheses
• Conventions
• a) The unary operator * has highest precedence & is
left associative.
• b) Concatenation has second highest precedence
and is left associative.
• c) | has lowest precedence and is left associative.
• (a)|((b)*(c)) = a|b*c.
• Both expressions denote the set of strings that are
either a single a or zero or more b's followed by
one c.
Example 3.4: Let Σ = {a, b}.
1. The regular expression a|b denotes the language {a, b}.
2. (a|b)(a|b) denotes {aa, ab, ba, bb}, the language of all
strings of length two over the alphabet Σ. Another regular
expression for the same language: aa | ab | ba | bb
3. a* denotes the language consisting of all strings of zero or
more a's: {ε, a, aa, aaa, ... }.
4. (a|b)* denotes the set of all strings consisting of zero or
more instances of a or b,
▪ all strings of a's and b's: {ε, a, b, aa, ab, ba, bb, aaa, ... }.
▪ Another regular expression for the same language: (a*b*)*.
5. a|a*b denotes the language {a, b, ab, aab, aaab, ... },
▪ the string a & all strings consisting of zero or more a's &
ending in b.
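The languages in Example 3.4 can be checked on short strings with Python's `re` module, whose `|`, `*`, and parentheses follow the same precedence conventions. The helper `language` below is a hypothetical name; it enumerates every string over {a, b} up to a length bound, so it tests only a finite part of each language.

```python
import re
from itertools import product

def language(pattern, max_len=3, alphabet="ab"):
    """Strings over the alphabet, of length at most max_len, matched in full by pattern."""
    words = ("".join(p) for n in range(max_len + 1) for p in product(alphabet, repeat=n))
    return {w for w in words if re.fullmatch(pattern, w)}

# Equivalent expressions denote the same set of strings (here, up to length 3).
same_star = language("(a|b)*") == language("(a*b*)*")
same_precedence = language("(a)|((b)*(c))", alphabet="abc") == language("a|b*c", alphabet="abc")
```

The empty Python string plays the role of ε in the returned sets.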
• Regular set: A language that can be defined by
a regular expression
• If two regular expressions r and s denote the
same regular set, we say they are equivalent
and write r = s. For instance, (a|b) = (b|a).
Algebraic laws for regular expressions r, s, & t
Regular Definition
• If Σ is an alphabet of basic symbols, then a
regular definition is a sequence of definitions
of the form:
• d1 → r1
• d2 → r2
· · ·
• dn → rn
• 1. Each di is a new symbol, not in Σ and not the
same as any other of the d's,
• 2. Each ri is a regular expression over the
alphabet Σ ∪ {d1, d2, . . . , di-1}.
• By restricting ri to Σ & previously defined d's, we avoid recursive
definitions, and we can construct a regular expression over Σ
alone, for each ri.
• How: first replace uses of d1 in r2 by r1 (r2 cannot use any of the d's
except for d1), then replace uses of d1 and d2 in r3 by r1 and (the
substituted) r2, and so on.
• Finally, in rn replace each di, for i = 1, 2, . . . , n - 1, by the
substituted version of ri, each of which has only symbols of Σ.
• Example: C identifiers are strings of letters,
digits, and underscores. Here is a regular
definition for the language of C identifiers. We
shall conventionally use italics for the symbols
defined in regular definitions.
• letter_ → A | B | · · · | Z | a | b | · · · | z | _
• digit → 0 | 1 | · · · | 9
• id → letter_ ( letter_ | digit )*


• Example: Unsigned numbers (integer or floating point)
are strings such as 5280, 0.01234, 6.336E4, or 1.89E-4.
• digit → 0 | 1 | · · · | 9
• digits → digit digit*
• optionalFraction → . digits | ε
• optionalExponent → ( E ( + | - | ε ) digits ) | ε
• number → digits optionalFraction optionalExponent
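The substitution process for regular definitions can be sketched by expanding each definition in order, using only the definitions that precede it. The brace syntax `{name}` for referring to an earlier definition, and the helper name `expand`, are assumptions of this sketch; the bodies are written in Python regex syntax, with an empty alternative standing for ε.

```python
import re

# The regular definition for unsigned numbers, in order of definition.
DEFINITIONS = [
    ("digit",            r"[0-9]"),
    ("digits",           r"{digit}{digit}*"),
    ("optionalFraction", r"(\.{digits}|)"),
    ("optionalExponent", r"(E(\+|-|){digits}|)"),
    ("number",           r"{digits}{optionalFraction}{optionalExponent}"),
]

def expand(definitions):
    """Replace each use of an earlier name by its already-expanded body."""
    expanded = {}
    for name, body in definitions:
        for earlier, regex in expanded.items():
            body = body.replace("{" + earlier + "}", "(?:" + regex + ")")
        expanded[name] = body   # now a regular expression over Σ alone
    return expanded

NUMBER = re.compile(expand(DEFINITIONS)["number"])

def is_number(s):
    return NUMBER.fullmatch(s) is not None
```

Because each body may mention only earlier names, a single pass is enough and the expansion cannot loop.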


Abbreviations

• The basic operations generate all possible regular
expressions, but there are common abbreviations used for
convenience. Typical examples:
Abbr.     Meaning        Notes
r+        (rr*)          1 or more occurrences
r?        (r | ε)        0 or 1 occurrence
[a-z]     (a|b|…|z)      1 character in given range
[abxyz]   (a|b|x|y|z)    1 of the given characters
Examples

re Meaning
+ single + character
! single ! character
= single = character
!= 2 character sequence
<= 2 character sequence
xyzzy 5 character sequence
Extensions of Regular Expressions
• 1. One or more instances. The unary, postfix operator + represents the
positive closure of a regular expression and its language.
That is, if r is a regular expression, then (r)+ denotes the language (L(r))+.
The operator + has the same precedence and associativity as the operator *.
Two useful algebraic laws, r* = r+ | ε and r+ = rr* = r*r, relate the Kleene
closure & positive closure.
• 2. Zero or one instance. The unary postfix operator ? means "zero or one
occurrence." That is, r? is equivalent to r | ε, or put another way, L(r?) = L(r)
∪ {ε}.
The ? operator has the same precedence and associativity as * and +.
• 3. Character classes. A regular expression a1 | a2 | · · · | an, where the ai's are
each symbols of the alphabet, can be replaced by the shorthand [a1 a2 · · ·
an].
When a1, a2, · · ·, an form a logical sequence, e.g., consecutive uppercase
letters, lowercase letters, or digits, we can replace them by a1-an, that is, just
the first and last separated by a hyphen.
[abc] = a|b|c, [a-z] = a|b| · · · |z
Recognition of Tokens
• Build a piece of code that examines the input
string & finds a prefix that is a lexeme
matching one of the patterns.
Patterns
Example:
Terminals: if, then, else, relop, id, number (the names of tokens)
ws → ( blank | tab | newline )+

blank, tab, newline are abstract symbols.


Token ws is different from the other tokens in that ,
when we recognize it, we do not return it to the
parser, but rather restart the lexical analysis from
the character that follows the whitespace.
Tokens, their patterns, and attribute values

• For each lexeme or family of lexemes, we record which token name
is returned to the parser and what attribute value is returned.
• For the 6 relational operators, a symbolic constant is used as the
attribute value, in order to indicate which instance of the
token relop we have found.
Transition Diagrams
• Convert patterns into stylized flowcharts, called
"transition diagrams".
Transition diagrams have a collection of
nodes or circles, called states.
Each state represents a condition that could
occur during the process of scanning the
input looking for a lexeme that matches
one of several patterns.
Transition Diagrams
• Edges are directed from one state to another.
• Each edge is labeled by a symbol or set of symbols.
• If we are in some state s , & the next input symbol is a,
we look for an edge out of state s labeled by a (and
perhaps by other symbols, as well).
• If such an edge is found, advance the forward pointer &
enter the state of the transition diagram to which that
edge leads.
• Transition diagrams are deterministic:
there is never more than one edge out of a given state
with a given symbol among its labels.
Some important conventions
1. Certain states are said to be accepting, or final. These
states indicate that a lexeme has been found. We always
indicate an accepting state by a double circle, and if there is
an action to be taken - typically returning a token and an
attribute value to the parser - we shall attach that action to
the accepting state.
2. In addition, if it is necessary to retract the forward pointer
one position (i.e., the lexeme does not include the symbol
that got us to the accepting state), then we shall
additionally place a * near that accepting state.
3. One state is designated the start state, or initial state; it is
indicated by an edge, labeled "start," entering from
nowhere. The transition diagram always begins in the start
state before any input symbols have been read.
• Example: Transition diagram that recognizes
the lexemes matching the token relop.
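A transition diagram of this kind can be coded as nested tests on the next input character. The sketch below assumes the six operators <, <=, <>, =, >, >= and the attribute names LT, LE, NE, EQ, GT, GE; neither the operator set nor the names are given in the text above, so treat them as placeholders. Returning a position that excludes the lookahead character plays the role of retracting the forward pointer at a starred accepting state.

```python
def get_relop(text, pos=0):
    """Return ("relop", attribute, next_pos), or None if no relop starts at pos."""
    def peek(i):
        return text[i] if i < len(text) else ""   # "" stands for end of input

    c = peek(pos)                                  # start state
    if c == "<":
        c = peek(pos + 1)
        if c == "=":
            return ("relop", "LE", pos + 2)
        if c == ">":
            return ("relop", "NE", pos + 2)
        return ("relop", "LT", pos + 1)            # starred state: lookahead not consumed
    if c == "=":
        return ("relop", "EQ", pos + 1)
    if c == ">":
        if peek(pos + 1) == "=":
            return ("relop", "GE", pos + 2)
        return ("relop", "GT", pos + 1)            # starred state: lookahead not consumed
    return None                                    # no edge out of the start state
```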
Recognition of Reserved Words and
Identifiers
• Keywords (such as if or then) are reserved,
• so they are not identifiers even though they look like identifiers.
• Transition diagram for identifier lexemes, which must also
recognize the keywords if, then, & else
Two ways that we can handle reserved words
that look like identifiers
1. Install the reserved words in the symbol table
initially. A field of the symbol-table entry
indicates that these strings are never ordinary
identifiers, and tells which token they represent.
When we find an identifier, a call to installID
places it in the symbol table if it is not already
there and returns a pointer to the symbol-table
entry for the lexeme found.
2. Create separate transition diagrams for each
keyword.
Note that such a transition diagram consists of
states representing the situation after each
successive letter of the keyword is seen,
followed by a test for a "non letter-or-digit,"
i.e., any character that cannot be the
continuation of an identifier.
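The first approach can be sketched with a symbol table that is preloaded with the reserved words. The list-of-dictionaries layout, the index used in place of a pointer, and the helper `get_token` are assumptions of this sketch; `install_id` mirrors the install routine described above.

```python
# Symbol table preloaded with the reserved words; the "token" field records
# that these strings are never ordinary identifiers.
symtab = [
    {"lexeme": "if",   "token": "if"},
    {"lexeme": "then", "token": "then"},
    {"lexeme": "else", "token": "else"},
]

def install_id(lexeme):
    """Return the index of the entry for lexeme, adding an id entry if absent."""
    for i, entry in enumerate(symtab):
        if entry["lexeme"] == lexeme:
            return i
    symtab.append({"lexeme": lexeme, "token": "id"})
    return len(symtab) - 1

def get_token(lexeme):
    """Return the token name for lexeme: a keyword token or id."""
    return symtab[install_id(lexeme)]["token"]

kinds = [get_token(w) for w in ["if", "count", "then", "count", "else"]]
```

With this layout a single identifier diagram suffices, because the table lookup decides afterwards whether the lexeme is a keyword.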

Hypothetical transition diagram for the keyword then


Figure : A transition diagram for unsigned numbers

Figure : A transition diagram for whitespace


Finite Automata
• Finite automata are essentially graphs, like transition diagrams,
with a few differences:
1. Finite automata are recognizers; they simply say "yes" or "no"
about each possible input string.
2. Finite automata come in two flavors:
(a) Nondeterministic finite automata (NFA) have no
restrictions on the labels of their edges. A symbol can label
several edges out of the same state, and ε, the empty
string, is a possible label.
(b) Deterministic finite automata (DFA) have, for each
state, and for each symbol of its input alphabet exactly one
edge with that symbol leaving that state.
• Both NFA & DFA are capable of recognizing the same languages
(regular language)
Nondeterministic Finite Automata (NFA)
1 . A finite set of states S.
2. A set of input symbols Σ, the input alphabet. The
empty string (ε), is never a member of Σ
3. A transition function that gives, for each state,
and for each symbol in Σ ∪ {ε}, a set of next states.
4. A state s0 from S that is distinguished as the start
state (or initial state) .
5. A set of states F, a subset of S, that is
distinguished as the accepting states (or final
states) .
▪Any NFA/DFA can be represented by a transition graph, where the
nodes are states and the labeled edges represent the transition
function.
▪There is an edge labeled a from state s to state t
❖if and only if t is one of the next states for state s and input a.

This graph is very much like a transition diagram, except

a) The same symbol can label edges from one state to several
different states,
b) An edge may be labeled by ε, the empty string, instead of, or
in addition to, symbols from the input alphabet.
Transition Tables
• Rows : states
• Columns: input symbols and ε.
• The entry for a given state & input is the
value of the transition function
applied to those arguments.
• If the transition function has no
information about that state-input
pair, put ∅.
• Adv: Easily find the transitions on a
given state and input.
• Disadv: takes a lot of space when
the input alphabet is large.
Acceptance of Input Strings by
Automata
• An NFA accepts input string x if & only if there
is some path in the transition graph from the
start state to one of the accepting states, such
that the symbols along the path spell out x.
• ε labels along the path are effectively ignored,
since the empty string does not contribute to
the string constructed along the path.
Example: The string aabb is accepted by the NFA
• Another path (not accepting)
❑ An NFA accepts a string as long as some path
labeled by that string leads from the start
state to an accepting state.
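Acceptance by an NFA can be sketched by tracking the set of all states reachable on the input read so far, which covers every path at once. The table below encodes the NFA for (a|b)*abb with states 0 to 3 (state 0 loops on a and b, then reads abb through states 1, 2, and 3); the dictionary layout and function name are illustrative, and this NFA has no ε edges.

```python
# Transition function: (state, symbol) -> set of next states; missing pairs mean ∅.
NFA = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
START, ACCEPTING = 0, {3}

def nfa_accepts(x):
    """True if some path labeled x leads from the start state to an accepting state."""
    current = {START}
    for c in x:
        # Follow every edge labeled c out of every state we might be in.
        current = set().union(*(NFA.get((s, c), set()) for s in current))
    return bool(current & ACCEPTING)
```

For aabb the sets of states are {0}, {0, 1}, {0, 1}, {0, 2}, {0, 3}, so the string is accepted even though the path that stays in state 0 is not accepting.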
NFA
L(aa* | bb*)
• String aaa accepted
• ε labels "disappear" in a concatenation
Example: Moves on a Chessboard

• States = squares.
• Inputs = r (move to an adjacent red square)
and b (move to an adjacent black square).
• Start state, final state are in opposite
corners.

Example: Chessboard – (2)
Board (squares numbered row by row):
1 2 3
4 5 6
7 8 9

State   r          b
→ 1     2,4        5
  2     4,6        1,3,5
  3     2,6        5
  4     2,8        1,5,7
  5     2,4,6,8    1,3,7,9
  6     2,8        3,5,9
  7     4,8        5
  8     4,6        5,7,9
* 9     6,8        5

Input r b b: {1} → {2,4} → {1,3,5,7} → {1,3,5,7,9}
Accept, since final state 9 is reached.
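Example
• An NFA accepting all strings that end in 01
[Figure: NFA with start state q0 looping on 0 and 1, an edge from q0 to q1 on 0, and an edge from q1 to q2 on 1; q2 is accepting]

Input: 00101
Read 0: {q0} → {q0, q1}
Read 0: → {q0, q1} (the earlier q1 branch is stuck)
Read 1: → {q0, q2}
Read 0: → {q0, q1} (the q2 branch is stuck)
Read 1: → {q0, q2}
Accepted, since q2 is reached at the end of the input.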
Example
• NFA that has an input alphabet {0} consisting of a
single symbol. It accepts all strings of the form 0^k
where k is a multiple of 2 or 3 (accept: ε, 00, 0000,
000000 but not 0, 00000)
[Figure: NFA diagram with ε-labeled and 0-labeled edges]
Example
[Figure: NFA diagram with states q1, q2, q3 and edges labeled a, b, and a, b]
Accept: ε, a, baba, baa
Reject: b, bb, babba
Transition Table
NFA A = ({q0, q1, q2}, {0, 1}, δ, q0, {q2})

        0          1
→q0     {q0,q1}    {q0}
 q1     Ø          {q2}
*q2     Ø          Ø
Transition Table
• Accept all strings that contain either 101
or 11 as a substring (e.g., 010110)
1. Q = {q1, q2, q3, q4}
2. Σ = {0, 1}
3. δ:
        0       1           ε
→q1     {q1}    {q1, q2}    ∅
 q2     {q3}    ∅           {q3}
 q3     ∅       {q4}        ∅
*q4     {q4}    {q4}        ∅
4. Start state: q1
5. F = {q4}
Deterministic Finite Automata (DFA)
1. There are no moves on input ε
2. For each state s & input symbol a, there is
exactly one edge out of s labeled a
• If we are using a transition table to represent a
DFA, then each entry is a single state.
• Represent this state without the curly braces
that we use to form sets.
• Lexical Analyzer---DFA
Algorithm: Simulating a DFA.
• INPUT: An input string x terminated by an
end-of-file character eof. A DFA D with start state
s0 , accepting states F, and transition function
move.
• OUTPUT: Answer "yes" if D accepts x ; "no"
otherwise.
• METHOD: Apply the algorithm to the input string
x. The function move(s, c) gives the state to which
there is an edge from state s on input c. The
function nextChar returns the next character of
the input string x.
Example: for the DFA of (a|b)*abb and the input string ababb,
the sequence of states entered is 0, 1, 2, 1, 2, 3,
and the algorithm returns "yes."
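The simulation loop can be sketched as follows. The transition table for the DFA of (a|b)*abb is an assumption of this sketch: the diagram is not reproduced in the text, so the table was chosen to agree with the state sequence 0, 1, 2, 1, 2, 3 stated for ababb. Iterating over the characters of x stands in for repeated calls to nextChar until eof.

```python
# MOVE[s][c] gives the state to which there is an edge from state s on input c.
MOVE = {
    0: {"a": 1, "b": 0},
    1: {"a": 1, "b": 2},
    2: {"a": 1, "b": 3},
    3: {"a": 1, "b": 0},
}
S0, F = 0, {3}

def simulate_dfa(x):
    """Return ("yes" or "no", sequence of states entered)."""
    s = S0
    trace = [s]
    for c in x:            # plays the role of c = nextChar() until eof
        s = MOVE[s][c]     # s = move(s, c)
        trace.append(s)
    return ("yes" if s in F else "no", trace)
```

Each input character costs one table lookup, which is why a lexical analyzer is usually driven by a DFA.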
Example
● Draw the transition diagram for the DFA
accepting all strings with a substring 01.
[Figure: transition diagram with states q0 (start), q2, and q1 (accepting)]
A = ({q0, q1, q2}, {0, 1}, δ, q0, {q1})
Check with the strings 01, 11010, 100011,
0111, 110101, 11101101, 111000
Transition Function & Table
● δ(q0,0) = q2
● δ(q0,1) = q0
● δ(q1,0) = q1
● δ(q1,1) = q1
● δ(q2,0) = q2
● δ(q2,1) = q1

        0     1
→q0     q2    q0
*q1     q1    q1
 q2     q2    q1
Example
● Let us design a DFA to accept the language
L = {w | w has both an even number of 0's
and an even number of 1's}
State meanings:
q0 → number of 0's even, number of 1's even
q1 → number of 0's even, number of 1's odd
q2 → number of 0's odd, number of 1's even
q3 → number of 0's odd, number of 1's odd

         0     1
→*q0     q2    q1
  q1     q3    q0
  q2     q0    q3
  q3     q1    q2
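The parity DFA can be checked by brute force against a direct count of 0's and 1's. The transition table in this sketch is derived from the state meanings (reading a 0 flips the parity of 0's, reading a 1 flips the parity of 1's); the function names are illustrative.

```python
from itertools import product

# q0: 0's even, 1's even   q1: 0's even, 1's odd
# q2: 0's odd,  1's even   q3: 0's odd,  1's odd
DELTA = {
    "q0": {"0": "q2", "1": "q1"},
    "q1": {"0": "q3", "1": "q0"},
    "q2": {"0": "q0", "1": "q3"},
    "q3": {"0": "q1", "1": "q2"},
}

def accepts(w):
    """Run the DFA on w; q0 is both the start state and the only accepting state."""
    state = "q0"
    for c in w:
        state = DELTA[state][c]
    return state == "q0"

def agrees_with_counting(max_len=6):
    """Compare the DFA with a direct parity count on all strings up to max_len."""
    for n in range(max_len + 1):
        for t in product("01", repeat=n):
            w = "".join(t)
            expected = w.count("0") % 2 == 0 and w.count("1") % 2 == 0
            if accepts(w) != expected:
                return False
    return True
```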
Example: Try Yourself
• A = {w | w contains at least one 1 and an even
number of 0s follows the last 1}
• Hints: A1 = (Q, ∑, δ, q1, F)
1. Q = {q1, q2, q3}
2. ∑ = {0, 1}
3. δ try yourself
4. Start state: q1
5. Final state: {q2}

Example
[Figure: transition diagram with states q1 and q2]
• A2 = ({q1, q2}, {0, 1}, δ, q1, {q2})
• Transition function, δ:
        0     1
→q1     q1    q2
*q2     q1    q2
Try: 1101, 11010, 0011010
L(A2) = {w | w ends in a 1}
Example
[Figure: transition diagram with states q1 and q2]
• A3 = ({q1, q2}, {0, 1}, δ, q1, {q1})
• Transition function, δ:
         0     1
→*q1     q1    q2
  q2     q1    q2
Try: 1101, 11010, 0011010
L(A3) = {w | w is ε or ends in a 0}
DFA vs. NFA
DFA:
◊ δ returns a single state
◊ Every state of a DFA always has exactly one exiting transition arrow
for each symbol in the alphabet
◊ Labels on the transition arrows are symbols from the alphabet
NFA:
❑ δ returns a set of states
❑ An NFA may have arrows labeled ε
❑ An NFA may have arrows labeled with members of the alphabet or ε
❑ Zero, one, or many arrows may exit from each state with label ε
DFA vs. NFA
[Figure: parallel computation tree with reject and accept branches, compared with a single computation ending in accept/reject]
NFA to DFA
Subset Construction
• Given an NFA with states Q, inputs Σ,
transition function δN, start state q0, and
final states F, construct an equivalent DFA with:
– States 2^Q (set of subsets of Q).
– Inputs Σ.
– Start state {q0}.
– Final states = all those with a member of F.
Subset Construction
• Given, NFA: N = (QN, Σ, δN, q0, FN)
• Goal: DFA, D = (QD, Σ, δD, {q0}, FD)
• L(D) = L(N)
States
❑ QD is the set of subsets of QN
- QD is the power set of QN
- If QN has n states, QD will have 2^n states
❑ Inaccessible states can be thrown away, so
effectively, the number of states of D is often much smaller than 2^n
Subset construction
Final States
• FD is the set of subsets S of QN such that S ∩ FN
≠ ∅ . That is FD is all sets of N’s states that
include at least one accepting state of N.
Transition Function
• The transition function δD is defined by:
δD({q1,…,qk}, a) is the union over all i = 1,…,k of
δN(qi, a).
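The definitions of QD, FD, and δD can be sketched as a worklist algorithm that builds only the accessible subsets. The example uses the chessboard NFA (squares 1 to 9, inputs r and b, start state 1, final state 9); the function name and the dictionary layout are illustrative, and the sketch assumes an NFA without ε moves.

```python
# Chessboard NFA: NFA[q][a] is the set of next states for state q on input a.
NFA = {
    1: {"r": {2, 4},       "b": {5}},
    2: {"r": {4, 6},       "b": {1, 3, 5}},
    3: {"r": {2, 6},       "b": {5}},
    4: {"r": {2, 8},       "b": {1, 5, 7}},
    5: {"r": {2, 4, 6, 8}, "b": {1, 3, 7, 9}},
    6: {"r": {2, 8},       "b": {3, 5, 9}},
    7: {"r": {4, 8},       "b": {5}},
    8: {"r": {4, 6},       "b": {5, 7, 9}},
    9: {"r": {6, 8},       "b": {5}},
}
START, FINAL = 1, {9}

def subset_construction(nfa, start, final, alphabet):
    """Return (DFA transition table over accessible subsets, accepting subsets)."""
    dtran, worklist = {}, [frozenset({start})]
    while worklist:
        S = worklist.pop()
        if S in dtran:
            continue                       # this subset was already expanded
        dtran[S] = {}
        for a in alphabet:
            # δD(S, a) is the union of δN(q, a) over all q in S.
            T = frozenset().union(*(nfa[q][a] for q in S))
            dtran[S][a] = T
            worklist.append(T)
    accepting = {S for S in dtran if S & final}   # subsets containing a final state
    return dtran, accepting

dtran, accepting = subset_construction(NFA, START, FINAL, "rb")
```

Only 7 of the 2^9 = 512 possible subsets are accessible from {1}, which illustrates why inaccessible states are discarded.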

Subset Construction: Example 1
• Example: We’ll construct the DFA
equivalent of our “chessboard” NFA.

1 2 3

4 5 6

7 8 9

Example: Subset Construction
NFA transition table:
State   r          b
→ 1     2,4        5
  2     4,6        1,3,5
  3     2,6        5
  4     2,8        1,5,7
  5     2,4,6,8    1,3,7,9
  6     2,8        3,5,9
  7     4,8        5
  8     4,6        5,7,9
* 9     6,8        5

DFA transition table, filled in one row at a time as each new set of states is reached:
State            r            b
→ {1}            {2,4}        {5}
  {2,4}          {2,4,6,8}    {1,3,5,7}
  {5}            {2,4,6,8}    {1,3,7,9}
  {2,4,6,8}      {2,4,6,8}    {1,3,5,7,9}
  {1,3,5,7}      {2,4,6,8}    {1,3,5,7,9}
* {1,3,7,9}      {2,4,6,8}    {5}
* {1,3,5,7,9}    {2,4,6,8}    {1,3,5,7,9}
Example 2
[Figure: NFA with start state q0 looping on 0 and 1, an edge from q0 to q1 on 0, and an edge from q1 to q2 on 1; q2 is accepting]

δD({q0, q2}, 0) = δN(q0, 0) ∪ δN(q2, 0) = {q0, q1} ∪ ∅ = {q0, q1}
δD({q0, q2}, 1) = δN(q0, 1) ∪ δN(q2, 1) = {q0} ∪ ∅ = {q0}

                 0          1
 Ø               Ø          Ø
→{q0}            {q0,q1}    {q0}
 {q1}            Ø          {q2}
*{q2}            Ø          Ø
 {q0,q1}         {q0,q1}    {q0,q2}
*{q0,q2}         {q0,q1}    {q0}
*{q1,q2}         Ø          {q2}
*{q0,q1,q2}      {q0,q1}    {q0,q2}
Example 2
• NFA N accepts all strings that end in 01
• N's set of states: {q0, q1, q2} = 3 states
• Subset construction: the DFA needs 2^3 = 8 states
• Assign new names: A for ∅, B for {q0}, C for {q1}, D for {q2},
E for {q0,q1}, F for {q0,q2}, G for {q1,q2}, H for {q0,q1,q2}

       0    1
 A     A    A
→B     E    B
 C     A    D
*D     A    A
 E     E    F
*F     E    B
*G     A    D
*H     E    F
Example 2
•From the 8 states, starting in start state B, we can only reach states B, E & F
▪the other 5 states (A, C, D, G, H) are inaccessible from B
Accessible part of the DFA:
       0    1
→B     E    B
 E     E    F
*F     E    B
Example 3
• N = (Q, {a, b}, δ, 1, {1})
• Q = {1, 2, 3} = 3 states
• DFA states = 8
• {∅, {1}, {2}, {3}, {1, 2}, {1, 3}, {2, 3}, {1, 2, 3}}
[Figure: NFA diagram with states 1, 2, 3 and edges labeled a, b, and a, b]
State        a            b         ε
∅            ∅            ∅         ∅
{1}          ∅            {2}       {3}
{2}          {2, 3}       {3}       ∅
{3}          {1, 3}       ∅         ∅
{1, 2}       {2, 3}       {2, 3}    ∅
{1, 3}       {1, 3}       {2}       ∅
{2, 3}       {1, 2, 3}    {3}       ∅
{1, 2, 3}    {1, 2, 3}    {2, 3}    ∅

[Figure: transition diagram of the resulting DFA over the states ∅, {1}, {2}, {3}, {1, 2}, {1, 3}, {2, 3}, {1, 2, 3}]
Example 3
Simplified: no incoming arrows point at states {1} & {1, 2}, so they
may be removed without affecting the behavior of the DFA.
[Figure: simplified DFA over the states ∅, {2}, {3}, {1, 3}, {2, 3}, {1, 2, 3}]
Closure of States
• CL(q) = set of states you can reach from state
q following only arcs labeled ε.
• Example: CL(A) = {A}; CL(E) = {B, C, D, E}.
[Figure: ε-NFA with states A to F and edges labeled 0, 1, and ε]
• Closure of a set of states = union of the
closure of each state.
Algorithm : The subset construction of
a DFA from an NFA.
INPUT: An NFA N.
OUTPUT: A DFA D accepting the same language as N.
METHOD: The algorithm constructs a transition table Dtran for D.
Each state of D is a set of NFA states, and we construct
Dtran so that D will simulate "in parallel" all possible
moves N can make on a given input string.
In the operations below, s is a single state of N, while T is a set of states of N.
Operations on NFA states

Set of states
The subset construction

Computing ε-closure(T)
Example: NFA accepting R = (alb) *abb
Σ = (a, b)

Marked
• ε-closure(0) = {0, 1, 2,4, 7} = A
• Mark A, Compute Dtran [A, a] & Dtran [A, b]
• Dtran [A, a] = ε-closure (move(A, a))
= ε-closure (move({0, 1, 2, 4, 7}, a))
= ε-closure ({3, 8})
= {3, 6, 7, 1, 2, 4} U {8}
= {1, 2, 3, 4, 6, 7, 8} = B
Σ = (a, b)

• Dtran [A, b] = ε-closure (move(A, b))


= ε-closure (move({0, 1, 2, 4, 7}, b))
= ε-closure ({5})
= {5, 6, 7, 1, 2, 4}
Dtran [A, b] = {1, 2, 4, 5, 6, 7} = C
Σ = (a, b)

• Mark B, Compute Dtran [B, a] & Dtran [B, b]


• Dtran [B, a] = ε-closure (move(B, a))
= ε-closure (move({1, 2, 3, 4, 6, 7, 8}, a))
= ε-closure ({3, 8})
= {3, 6, 7, 1, 2, 4} U {8}
Dtran [B, a] = {1, 2, 3, 4, 6, 7, 8} = B
Σ = (a, b)

• Compute Dtran [B, b]


• Dtran [B, b] = ε-closure (move(B, b))
= ε-closure (move({1, 2, 3, 4, 6, 7, 8}, b))
= ε-closure ({5, 9})
= {5, 6, 7, 1, 2, 4} U {9}
Dtran [B, b] = {1, 2, 4, 5, 6, 7, 9} = D
Σ = (a, b)

• Mark C, Compute Dtran [C, a] & Dtran [C, b]


• Dtran [C, a] = ε-closure (move(C, a))
= ε-closure (move({1, 2, 4, 5, 6, 7}, a))
= ε-closure ({3, 8})
= {3, 6, 7, 1, 2, 4} U {8}
Dtran [C, a] = {1, 2, 3, 4, 6, 7, 8} = B
Σ = (a, b)

• Compute Dtran [C, b]


• Dtran [C, b] = ε-closure (move(C, b))
= ε-closure (move({1, 2, 4, 5, 6, 7}, b))
= ε-closure ({5})
= {5, 6, 7, 1, 2, 4}
Dtran [C, b] = {1, 2, 4, 5, 6, 7} = C
Σ = (a, b)

• Mark D, Compute Dtran [D, a] and Dtran [D, b]


• Dtran [D, a] = ε-closure (move(D, a))
= ε-closure (move({1, 2, 4, 5, 6, 7, 9}, a))
= ε-closure ({3, 8})
= {3, 6, 7, 1, 2, 4} U {8}
Dtran [D, a] = {1, 2, 3, 4, 6, 7, 8} = B
Σ = (a, b)

• Compute Dtran [D, b]


• Dtran [D, b] = ε-closure (move(D, b))
= ε-closure (move({1, 2, 4, 5, 6, 7, 9}, b))
= ε-closure ({5, 10})
= {5, 6, 7, 1, 2, 4} U {10}
Dtran [D, b] = {1, 2, 4, 5, 6, 7, 10} = E
Σ = (a, b)

• Mark E, Compute Dtran [E, a] and Dtran [E, b]


• Dtran [E, a] = ε-closure (move(E, a))
= ε-closure (move({1, 2, 4, 5, 6, 7, 10}, a))
= ε-closure ({3, 8})
= {3, 6, 7, 1, 2, 4} U {8}
Dtran [E, a] = {1, 2, 3, 4, 6, 7, 8} = B

• Compute Dtran [E, b]


• Dtran [E, b] = ε-closure (move(E, b))
= ε-closure (move({1, 2, 4, 5, 6, 7, 10}, b))
= ε-closure ({5})
= {5, 6, 7, 1, 2, 4}
Dtran [E, b] = {1, 2, 4, 5, 6, 7} = C
Summary
Dtran [A, a] = {1, 2, 3, 4, 6, 7, 8} = B
Dtran [A, b] = {1, 2, 4, 5, 6, 7} = C
Dtran [B, a] = {1, 2, 3, 4, 6, 7, 8} = B
Dtran [B, b] = {1, 2, 4, 5, 6, 7, 9} = D
Dtran [C, a] = {1, 2, 3, 4, 6, 7, 8} = B
Dtran [C, b] = {1, 2, 4, 5, 6, 7} = C
Dtran [D, a] = {1, 2, 3, 4, 6, 7, 8} = B
Dtran [D, b] = {1, 2, 4, 5, 6, 7, 10} = E
Dtran [E, a] = {1, 2, 3, 4, 6, 7, 8} = B
Dtran [E, b] = {1, 2, 4, 5, 6, 7} = C

NFA State                 DFA State   a   b
{0, 1, 2, 4, 7}           A           B   C
{1, 2, 3, 4, 6, 7, 8}     B           B   D
{1, 2, 4, 5, 6, 7}        C           B   C
{1, 2, 4, 5, 6, 7, 9}     D           B   E
{1, 2, 4, 5, 6, 7, 10}    E           B   C
NFA State                  DFA State   a   b
→{0, 1, 2, 4, 7}           →A          B   C
{1, 2, 3, 4, 6, 7, 8}      B           B   D
{1, 2, 4, 5, 6, 7}         C           B   C
{1, 2, 4, 5, 6, 7, 9}      D           B   E
*{1, 2, 4, 5, 6, 7, 10}    *E          B   C
(→ marks the start state; * marks the accepting state, which contains NFA state 10.)
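The table above can be reproduced mechanically. The following is a minimal Python sketch of the subset construction, not a definitive implementation. The ε-edges and symbol edges in `EPS` and `MOVE` are an assumption: they are reconstructed from the move and ε-closure results in the worked example, since the NFA diagram itself is not reproduced here. The names `eps_closure`, `move`, and `subset_construction` are illustrative.

```python
from collections import deque

# NFA for (a|b)*abb, states 0..10 (edges reconstructed from the worked example).
EPS = {0: {1, 7}, 1: {2, 4}, 3: {6}, 5: {6}, 6: {1, 7}}
MOVE = {(2, "a"): {3}, (4, "b"): {5}, (7, "a"): {8}, (8, "b"): {9}, (9, "b"): {10}}
START, ACCEPT, ALPHABET = 0, 10, "ab"

def eps_closure(states):
    """All NFA states reachable from `states` using ε-transitions only."""
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in EPS.get(s, ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)

def move(states, symbol):
    """All NFA states reachable from `states` by one transition on `symbol`."""
    out = set()
    for s in states:
        out |= MOVE.get((s, symbol), set())
    return out

def subset_construction():
    start = eps_closure({START})
    names = {start: "A"}          # DFA state (set of NFA states) -> letter
    dtran = {}
    unmarked = deque([start])
    while unmarked:
        T = unmarked.popleft()    # "mark" T
        for sym in ALPHABET:
            U = eps_closure(move(T, sym))
            if U not in names:    # a new DFA state was discovered
                names[U] = chr(ord("A") + len(names))
                unmarked.append(U)
            dtran[(names[T], sym)] = names[U]
    accepting = {name for S, name in names.items() if ACCEPT in S}
    return names, dtran, accepting

names, dtran, accepting = subset_construction()
for (state, sym), target in sorted(dtran.items()):
    print(f"Dtran[{state}, {sym}] = {target}")
print("accepting:", sorted(accepting))
```

Because unmarked states are processed in first-in, first-out order, the letters are assigned in the same order as in the hand computation (A, B, C, D, E).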
Construction of an NFA from a Regular
Expression
McNaughton-Yamada-Thompson Algorithm
• Algorithm: The McNaughton-Yamada-Thompson
algorithm to convert a regular expression to an NFA.
• INPUT: A regular expression r over alphabet Σ.
• OUTPUT: An NFA N accepting L(r).
• METHOD:
Begin by parsing r into its constituent subexpressions.
❑ The rules for constructing an NFA consist of
basis rules for handling subexpressions with no
operators, and
inductive rules for constructing larger NFAs from the
NFAs for the immediate subexpressions of a given
expression.
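Basis
1. For expression ε (r = ε), construct the NFA with a new start state i, a new
accepting state f, and a single ε-transition from i to f.

2. For any sub expression a (r = a), construct the NFA with a new start state i,
a new accepting state f, and a single transition from i to f on a.

INDUCTION
• Suppose N(s) and N(t) are NFA's for regular
expressions s and t, respectively.
1. r = s|t (union): add a new start state i with ε-transitions to the start
states of N(s) and N(t), and a new accepting state f with ε-transitions from
the accepting states of N(s) and N(t).
2. r = st (concatenation): merge the accepting state of N(s) with the start
state of N(t); the start state of N(s) and the accepting state of N(t)
become the start and accepting states of N(r).
3. r = s* (closure/star): add a new start state i and a new accepting state f,
with ε-transitions from i to the start state of N(s) and to f, and from the
accepting state of N(s) back to its start state and to f.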
Observations
• 1. N(r) has at most twice as many states as there are
operators and operands in r.
-This bound follows from the fact that each step of the
algorithm creates at most two new states.
• 2. N(r) has one start state and one accepting state.
-The accepting state has no outgoing transitions,
-start state has no incoming transitions.
• 3. Each state of N(r) other than the accepting state has
either
-one outgoing transition on a symbol in Σ
-or two outgoing transitions, both on ε
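The basis and inductive rules can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the textbook's code: the class `NFA` and its methods (`symbol`, `union`, `concat`, `star`, `accepts`) are hypothetical names, each fragment is a `(start, accept)` pair, and the label `None` stands for ε.

```python
import itertools

class NFA:
    """Transitions are stored as {state: [(label, target), ...]}; label None means ε."""

    def __init__(self):
        self.trans = {}
        self._ids = itertools.count()

    def new_state(self):
        s = next(self._ids)
        self.trans[s] = []
        return s

    def edge(self, src, label, dst):
        self.trans[src].append((label, dst))

    def symbol(self, label):
        # Basis: two new states joined by one edge (pass None for ε).
        i, f = self.new_state(), self.new_state()
        self.edge(i, label, f)
        return i, f

    def union(self, s, t):
        # r = s|t: new start and accepting states, four ε-edges.
        i, f = self.new_state(), self.new_state()
        self.edge(i, None, s[0]); self.edge(i, None, t[0])
        self.edge(s[1], None, f); self.edge(t[1], None, f)
        return i, f

    def concat(self, s, t):
        # r = st: merge s's accepting state with t's start state. This is safe
        # because an accepting state has no out-edges and a start state has no in-edges.
        self.trans[s[1]] = self.trans.pop(t[0])
        return s[0], t[1]

    def star(self, s):
        # r = s*: new start and accepting states, with a bypass and a loop-back.
        i, f = self.new_state(), self.new_state()
        self.edge(i, None, s[0]); self.edge(i, None, f)
        self.edge(s[1], None, s[0]); self.edge(s[1], None, f)
        return i, f

    def _closure(self, states):
        stack, seen = list(states), set(states)
        while stack:
            s = stack.pop()
            for label, dst in self.trans[s]:
                if label is None and dst not in seen:
                    seen.add(dst)
                    stack.append(dst)
        return seen

    def accepts(self, frag, text):
        # Simulate the NFA by tracking the set of currently reachable states.
        current = self._closure({frag[0]})
        for ch in text:
            step = {d for s in current for (label, d) in self.trans[s] if label == ch}
            current = self._closure(step)
        return frag[1] in current

# Build N(r) for r = (a|b)*abb.
nfa = NFA()
r = nfa.star(nfa.union(nfa.symbol("a"), nfa.symbol("b")))
for ch in "abb":
    r = nfa.concat(r, nfa.symbol(ch))
print(len(nfa.trans), nfa.accepts(r, "aabb"), nfa.accepts(r, "abab"))
```

For (a|b)*abb there are five operands and five operators, so Observation 1 allows at most 20 states; merging states at each concatenation leaves 11.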
Example: Construct an NFA for r = (a|b)*abb

Parse tree
Step 1: For sub expression r1 = a

Step 2: For sub expression r2 = b


Step 3: For sub expression r3 = r1|r2

Step 4: For sub expression r4 = (r3)


Same As r3
Step 5: For sub expression r5 = (r3)*
Step 6: For sub expression r6 = a

Step 7: For sub expression r7 = r5r6


Step 8: For sub expression r8 = b
8' --b--> 9

Step 9: For sub expression r9 = r7r8
8 --b--> 9

Step 10: For sub expression r10 = b
9' --b--> 10

Step 11: For sub expression r11 = r9r10
8 --b--> 9 --b--> 10
Important States of NFA
• A state of an NFA is important if it has a non-ε out-transition.
• Notice that the subset construction uses only the important
states in a set T when it computes
ε-closure(move(T, a)),
the set of states reachable from T on input a.
• The set of states move(s, a) is nonempty only if state s is
important.
• During the subset construction, two sets of NFA states can
be identified (treated as if they were the same set) if they:
• 1. Have the same important states, and
• 2. Either both have accepting states or neither does.
• The only important states are those introduced as
initial states in the basis part for a particular
symbol position in the regular expression.
• Each important state corresponds to a particular
operand in the regular expression.
• The constructed NFA has only one accepting state,
but this state, having no out-transitions, is not an
important state
❑ By concatenating a unique right endmarker # to a regular expression r, we
give the accepting state for r a transition on #, making it an important state of
the NFA for (r)#.
❑ We call (r)# the augmented regular expression.
❑ When the construction is complete, any state with a transition on # must be
an accepting state.
Nodes
• The important states of the NFA correspond directly to
the positions in the regular expression that hold
symbols
• Represent the regular expression by its syntax tree:
-leaves correspond to operands
-interior nodes correspond to operators
• Interior nodes:
• cat-node: concatenation operator (dot)
• or-node: union operator (|)
• star-node: star operator (*)
Syntax tree: (a|b)*abb#
• Leaves in a syntax tree are labeled by ε or by an
alphabet symbol.
❑ To each leaf not labeled ε, attach a unique integer.
❑ We refer to this integer as the position of the leaf and also as a position of
its symbol.
❑ A symbol can have several positions (a: 1 & 3).
• The positions in the syntax tree correspond to the
important states of the constructed NFA.
Example: NFA for r = (a|b)*abb#, with the important states numbered and the
other states represented by letters
Functions Computed From the Syntax Tree
• To construct a DFA directly from a regular
expression, we construct its syntax tree and
then compute four functions:
❑ nullable
❑ firstpos
❑ lastpos
❑ followpos
The Four Functions
1. nullable(n) is true for a syntax-tree node n if & only if the sub
expression represented by n has ε in its language.
▪ That is, the sub expression can be "made null" or the empty string, even
though there may be other strings it can represent as well.
2. firstpos(n) is the set of positions in the subtree rooted at n that
correspond to the first symbol of at least one string in the
language of the sub expression rooted at n.
3. lastpos(n) is the set of positions in the subtree rooted at n that
correspond to the last symbol of at least one string in the
language of the sub expression rooted at n.
4. followpos(p), for a position p, is the set of positions q in the
entire syntax tree such that there is some string x = a1 a2 ... an
in L((r)#) such that for some i, there is a way to explain the
membership of x in L((r)#) by matching ai to position p of the
syntax tree and ai+1 to position q.
Example: Consider the cat-node n that corresponds to the
expression (a|b)*a (it generates strings such as aa, ba, aba).
• nullable(n) is false, since this node generates all strings of
a's & b's ending in an a; it does not generate ε.
• The star-node below it is nullable; it generates ε along with all other
strings of a's & b's.
• firstpos(n) = {1, 2, 3}
• lastpos(n) = {3}
• followpos(1) = {1, 2, 3}
Computing nullable, firstpos, & lastpos
• Compute nullable, firstpos, & lastpos by a
straightforward recursion on the height of the tree.
• Basis & inductive rules for nullable & firstpos:
Example: only the star-node is nullable.
• None of the leaves is nullable, because they
each correspond to non-ε operands.
• The or-node is not nullable, because neither
of its children is.
• The star-node is nullable, because every star-node
is nullable.
• Each of the cat-nodes, having at least one
non-nullable child, is not nullable.
▪ We show firstpos(n) to the left of node n, and lastpos(n) to its right.
▪ Each of the leaves has only itself for firstpos & lastpos, as required by
the rule for non-ε leaves.
▪ For the or-node, we take the union of firstpos
at the children and do the same for lastpos.
• Consider the lowest cat-node, which we shall call n.
• To compute firstpos(n), we first consider whether the
left operand is nullable, which it is in this case.
• Therefore, firstpos for n is the union of firstpos for
each of its children, that is {1, 2} U {3} = {1, 2, 3}.
• The rules for lastpos are the same as for firstpos, with
the children interchanged.
• To compute lastpos(n) we must ask whether its right
child (the leaf with position 3) is nullable, which it is
not.
• Therefore, lastpos(n) is the same as lastpos of the right
child, or {3}.
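The recursion described above can be sketched in Python. This is an illustrative sketch only: the tuple encoding of syntax-tree nodes (`("leaf", symbol, position)`, `("or", left, right)`, `("cat", left, right)`, `("star", child)`, `("eps",)`) is an assumption made for the example, not notation from the lecture.

```python
def nullable(n):
    kind = n[0]
    if kind == "eps":
        return True
    if kind == "leaf":
        return False
    if kind == "or":
        return nullable(n[1]) or nullable(n[2])
    if kind == "cat":
        return nullable(n[1]) and nullable(n[2])
    if kind == "star":
        return True
    raise ValueError(f"unknown node kind: {kind}")

def firstpos(n):
    kind = n[0]
    if kind == "eps":
        return set()
    if kind == "leaf":
        return {n[2]}
    if kind == "or":
        return firstpos(n[1]) | firstpos(n[2])
    if kind == "cat":
        # The right child contributes only if the left child can vanish.
        if nullable(n[1]):
            return firstpos(n[1]) | firstpos(n[2])
        return firstpos(n[1])
    if kind == "star":
        return firstpos(n[1])
    raise ValueError(f"unknown node kind: {kind}")

def lastpos(n):
    kind = n[0]
    if kind == "eps":
        return set()
    if kind == "leaf":
        return {n[2]}
    if kind == "or":
        return lastpos(n[1]) | lastpos(n[2])
    if kind == "cat":
        # Same rule as firstpos with the children interchanged.
        if nullable(n[2]):
            return lastpos(n[1]) | lastpos(n[2])
        return lastpos(n[2])
    if kind == "star":
        return lastpos(n[1])
    raise ValueError(f"unknown node kind: {kind}")

# The lowest cat-node n of the example: (a|b)*a with positions 1, 2, 3.
star_node = ("star", ("or", ("leaf", "a", 1), ("leaf", "b", 2)))
n = ("cat", star_node, ("leaf", "a", 3))
print(nullable(n), sorted(firstpos(n)), sorted(lastpos(n)))
```

The values computed for n match the ones derived by hand: n is not nullable, firstpos(n) = {1, 2, 3}, and lastpos(n) = {3}.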
Computing Followpos
• There are only two ways that a position of a regular
expression can be made to follow another:
1. If n is a cat-node with left child C1 & right child
C2 , then for every position i in lastpos(C1) , all
positions in firstpos(C2) are in followpos(i).
2. If n is a star-node, & i is a position in
lastpos(n) , then all positions in firstpos(n) are
in followpos(i).
Example: Rule 1 for followpos requires that we look
at each cat-node, & put each position in firstpos of
its right child in followpos for each position in
lastpos of its left child.

▪ For the lowest cat-node, that rule says position 3 is in
followpos(1) and followpos(2)
▪ The next cat-node says that 4 is in followpos(3).
▪ The remaining two cat-nodes give us 5 in followpos(4) & 6 in
followpos(5).

▪ Applying rule 2 to the star-node, positions 1 & 2 are in both followpos(1) &
followpos(2), since both firstpos & lastpos for this node are {1, 2}.

Directed graph for the function followpos
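The edges of this graph can be computed with one bottom-up pass over the syntax tree. The sketch below is a minimal Python illustration, not the lecture's own code: the tuple encoding of nodes and the function name `analyze` are assumptions. It returns (nullable, firstpos, lastpos) for each node and fills in followpos using rule 1 at cat-nodes and rule 2 at star-nodes.

```python
def analyze(node, followpos):
    """Return (nullable, firstpos, lastpos) of node; update followpos in place."""
    kind = node[0]
    if kind == "eps":
        return True, set(), set()
    if kind == "leaf":
        position = node[2]
        followpos.setdefault(position, set())
        return False, {position}, {position}
    if kind == "star":
        _, first, last = analyze(node[1], followpos)
        for i in last:
            followpos[i] |= first        # rule 2
        return True, first, last
    n1, f1, l1 = analyze(node[1], followpos)
    n2, f2, l2 = analyze(node[2], followpos)
    if kind == "or":
        return n1 or n2, f1 | f2, l1 | l2
    if kind == "cat":
        for i in l1:
            followpos[i] |= f2           # rule 1
        first = f1 | f2 if n1 else f1
        last = l1 | l2 if n2 else l2
        return n1 and n2, first, last
    raise ValueError(f"unknown node kind: {kind}")

# Syntax tree for the augmented expression (a|b)*abb#, positions 1..6.
tree = ("star", ("or", ("leaf", "a", 1), ("leaf", "b", 2)))
for position, symbol in enumerate("abb#", start=3):
    tree = ("cat", tree, ("leaf", symbol, position))

followpos = {}
root_nullable, root_first, root_last = analyze(tree, followpos)
for p in sorted(followpos):
    print(f"followpos({p}) = {sorted(followpos[p])}")
```

An edge from p to q in the directed graph corresponds to q being a member of followpos(p). Position 6, the endmarker #, has an empty followpos.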
Converting a Regular Expression
Directly to a DFA
Algorithm: Construction of a DFA from a regular expression r.
INPUT: A regular expression r.
OUTPUT: A DFA D that recognizes L(r).
METHOD:
1. Construct a syntax tree T from the augmented regular
expression (r)#.
2. Compute nullable, firstpos, lastpos, & followpos for T.
3. Construct Dstates, the set of states of DFA D, & Dtran, the
transition function for D. The states of D are sets of positions in T.
Initially, each state is "unmarked," & a state becomes "marked" just
before we consider its out-transitions.
▪ The start state of D is firstpos(n0), where node n0 is the root of T.
▪ The accepting states are those containing the position for the
endmarker symbol #.
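Step 3 can be sketched in Python as a worklist loop over unmarked states. This is a hedged illustration, not a definitive implementation: `SYMBOL_AT`, `FOLLOWPOS`, `ROOT_FIRSTPOS`, and `build_dfa` are hypothetical names, and the data is hard-coded for the running expression (a|b)*abb# (positions 1 to 6, with # at position 6) rather than computed from a syntax tree.

```python
from collections import deque

# Data for the running example (a|b)*abb#.
SYMBOL_AT = {1: "a", 2: "b", 3: "a", 4: "b", 5: "b", 6: "#"}
FOLLOWPOS = {1: {1, 2, 3}, 2: {1, 2, 3}, 3: {4}, 4: {5}, 5: {6}, 6: set()}
ROOT_FIRSTPOS = frozenset({1, 2, 3})
END_POSITION = 6

def build_dfa(alphabet="ab"):
    names = {ROOT_FIRSTPOS: "A"}      # set of positions -> DFA state letter
    dtran = {}
    unmarked = deque([ROOT_FIRSTPOS])
    while unmarked:
        S = unmarked.popleft()        # "mark" S
        for a in alphabet:
            # U = union of followpos(p) for all positions p in S labeled a.
            U = frozenset().union(*(FOLLOWPOS[p] for p in S if SYMBOL_AT[p] == a))
            if U not in names:
                names[U] = chr(ord("A") + len(names))
                unmarked.append(U)
            dtran[(names[S], a)] = names[U]
    accepting = {name for S, name in names.items() if END_POSITION in S}
    return names, dtran, accepting

names, dtran, accepting = build_dfa()
for (state, sym), target in sorted(dtran.items()):
    print(f"Dtran[{state}, {sym}] = {target}")
print("accepting:", sorted(accepting))
```

In this sketch an empty set U would simply become another (dead) state; that case does not arise for this expression.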
Construction of a DFA directly from a
regular expression
Example: Construct a DFA for the regular expression
r = (a|b)*abb.
▪ The value of firstpos for the root of the tree is {1, 2, 3}.
A = {1, 2, 3} ---- Start state
• Compute Dtran[A, a] & Dtran[A, b].
• Among the positions of A, 1 & 3 correspond to a, while 2
corresponds to b.
• Dtran[A, a] = followpos(1) U followpos(3) = {1, 2, 3, 4} = B
• Compute Dtran[A, b].
• Among the positions of A, only 2 corresponds to b.
• Dtran[A, b] = followpos(2) = {1, 2, 3} = A
• Compute Dtran[B, a] = Dtran[{1, 2, 3, 4}, a]
• Among the positions of B, 1 & 3 correspond to a.
• Dtran[B, a] = followpos(1) U followpos(3)
= {1, 2, 3, 4} = B
• Compute Dtran[B, b] = Dtran[{1, 2, 3, 4}, b]
• Among the positions of B, 2 & 4 correspond to b.
• Dtran[B, b] = followpos(2) U followpos(4)
= {1, 2, 3, 5} = C
• Compute Dtran[C, a] = Dtran[{1, 2, 3, 5}, a]
• Among the positions of C, 1 & 3 correspond to a.
• Dtran[C, a] = followpos(1) U followpos(3)
= {1, 2, 3, 4} = B
• Compute Dtran[C, b] = Dtran[{1, 2, 3, 5}, b]
• Among the positions of C, 2 & 5 correspond to b.
• Dtran[C, b] = followpos(2) U followpos(5)
= {1, 2, 3, 6} = D
• Compute Dtran[D, a] = Dtran[{1, 2, 3, 6}, a]
• Among the positions of D, 1 & 3 correspond to a.
• Dtran[D, a] = followpos(1) U followpos(3)
= {1, 2, 3, 4} = B
• Compute Dtran[D, b] = Dtran[{1, 2, 3, 6}, b]
• Among the positions of D, only 2 corresponds to b (position 6 is the endmarker #).
• Dtran[D, b] = followpos(2)
= {1, 2, 3} = A
A = {1, 2, 3}
Dtran[A, a] = followpos(1) U followpos(3) = {1, 2, 3, 4} = B
Dtran[A, b] = followpos(2) = {1, 2, 3} = A
Dtran[B, a] = followpos(1) U followpos(3) = {1, 2, 3, 4} = B
Dtran[B, b] = followpos(2) U followpos(4) = {1, 2, 3, 5} = C
Dtran[C, a] = followpos(1) U followpos(3) = {1, 2, 3, 4} = B
Dtran[C, b] = followpos(2) U followpos(5) = {1, 2, 3, 6} = D
Dtran[D, a] = followpos(1) U followpos(3) = {1, 2, 3, 4} = B
Dtran[D, b] = followpos(2) = {1, 2, 3} = A

Positions        State   a   b
→{1, 2, 3}       →A      B   A
{1, 2, 3, 4}     B       B   C
{1, 2, 3, 5}     C       B   D
*{1, 2, 3, 6}    *D      B   A

DFA Construction: A is the start state, and D is the only accepting state
because it contains position 6, the position of the endmarker #.
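As a quick check of the resulting DFA, the transition table can be run on a few input strings. The following Python sketch assumes start state A and accepting state D (the state containing position 6); the function name `accepts` is illustrative.

```python
# Transition table of the DFA built directly from (a|b)*abb#.
DTRAN = {
    ("A", "a"): "B", ("A", "b"): "A",
    ("B", "a"): "B", ("B", "b"): "C",
    ("C", "a"): "B", ("C", "b"): "D",
    ("D", "a"): "B", ("D", "b"): "A",
}
START, ACCEPTING = "A", {"D"}

def accepts(text):
    """Run the DFA on text; reject if a symbol outside {a, b} is seen."""
    state = START
    for ch in text:
        state = DTRAN.get((state, ch))
        if state is None:
            return False
    return state in ACCEPTING

for word in ["abb", "aabb", "babb", "abab", ""]:
    print(repr(word), accepts(word))
```

Only the strings ending in abb reach state D, as expected for (a|b)*abb.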
Conclusion
• Tokens
• Lexemes
• Patterns
• Regular Expressions
• Regular Definitions
• Transition Diagrams
• Finite Automata
• DFA & NFA
• Conversion (NFA to DFA, Regular Expression to
NFA/DFA)
