CD.mod2
SYNTAX ANALYSIS
Syntax analysis is the second phase of the compiler. It takes the stream of tokens produced by lexical analysis as input and generates a syntax tree or parse tree.
The parser or syntactic analyzer obtains a string of tokens from the lexical analyzer and verifies that
the string can be generated by the grammar for the source language. It reports any syntax errors in the program.
It also recovers from commonly occurring errors so that it can continue processing its input.
Example: The grammar with the following productions defines simple arithmetic expressions:
expression → expression + term | expression - term | term
term → term * factor | term / factor | factor
factor → ( expression ) | id
In this grammar, the terminal symbols are: id + - * / ( )
The nonterminal symbols are: expression, term, factor
Start symbol: expression
Notational Conventions
To avoid always having to state that "these are the terminals," "these are the nonterminals," and so on,
the following notational conventions for grammars will be used.
1. These symbols are terminals:
(a) Lowercase letters early in the alphabet, such as a, b, c.
(b) Operator symbols such as +, *, and so on.
(c) Punctuation symbols such as parentheses, comma, and so on.
(d) The digits 0, 1, . . . , 9.
(e) Boldface strings such as id or if, each of which represents a single terminal symbol.
2. These symbols are nonterminals:
(a) Uppercase letters early in the alphabet, such as A, B, C.
(b) The letter S which, when it appears, is usually the start symbol.
(c) Lowercase, italic names such as expr or stmt.
3. Uppercase letters late in the alphabet, such as X, Y, Z, represent grammar symbols; that is, either nonterminals or terminals.
4. Lowercase letters late in the alphabet, chiefly u, v, ... ,z, represent (possibly empty) strings of
terminals.
5. Lowercase Greek letters α, β, γ for example, represent (possibly empty) strings of grammar symbols.
6. A set of productions A → α1, A → α2, ..., A → αk with a common head A (call them A-productions) may be written A → α1 | α2 | ... | αk. We call α1, α2, ..., αk the alternatives for A.
7. Unless stated otherwise, the head of the first production is the start symbol.
Using these conventions, the grammar for arithmetic expressions can be rewritten as:
E → E + T | E - T | T
T → T * F | T / F | F
F → ( E ) | id
Derivations
The construction of a parse tree can be made precise by taking a derivational view, in which productions
are treated as rewriting rules. Beginning with the start symbol, each rewriting step replaces a non
terminal by the body of one of its productions.
For example, consider the following grammar, with a single nonterminal E:
E → E + E | E * E | - E | ( E ) | id
The production E → - E signifies that if E denotes an expression, then -E must also denote an expression. The replacement of a single E by -E can be described by writing
E => -E
which is read, "E derives -E." The production E → ( E ) can be applied to replace any instance of E in any string of grammar symbols by ( E ),
e.g., E * E => (E) * E or E * E => E * (E)
We can take a single E and repeatedly apply productions in any order to get a sequence of replacements.
For example,
E => - E => - (E) => - (id)
We call such a sequence of replacements a derivation of -(id) from E. This derivation provides a proof that -(id) is one particular instance of an expression.
Example
Let the set of production rules in a CFG be
X → X+X | X*X | X | a
over an alphabet {a}.
The leftmost derivation for the string "a+a*a" may be:
X => X+X
=> a+X
=> a+X*X
=> a+a*X
=> a+a*a
The rightmost derivation for the above string "a+a*a" may be:
X => X*X
=> X*a
=> X+X*a
=> X+a*a
=> a+a*a
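To make the notion of a leftmost derivation concrete, here is a minimal Python sketch (the function name and string encoding are illustrative assumptions, not from the text) that replaces the leftmost X at each step and prints the same sentential forms as above:

```python
# Illustrative sketch: print the leftmost derivation of "a+a*a"
# for the grammar X -> X+X | X*X | X | a.

def leftmost_derivation(start, choices):
    """Apply each chosen production body to the leftmost 'X' in turn."""
    sentential_form = start
    print(sentential_form)
    for body in choices:
        pos = sentential_form.index("X")          # leftmost nonterminal
        sentential_form = sentential_form[:pos] + body + sentential_form[pos + 1:]
        print("=> " + sentential_form)

# X => X+X => a+X => a+X*X => a+a*X => a+a*a
leftmost_derivation("X", ["X+X", "a", "X*X", "a", "a"])
```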
Parse tree and Derivation tree
Parse Tree
A parse tree is a hierarchical structure that represents the derivation of an input string from the grammar.
• Root node of parse tree has the start symbol of the given grammar from where the
derivation proceeds.
• Leaves of parse tree represent terminals.
• Each interior node represents a nonterminal and corresponds to the application of a production.
• If A → xyz is a production, then the parse tree will have A as an interior node whose children are x, y and z, from left to right.
Construct the parse tree for E → E + E | E * E | id
Even if some sentences have unique parse trees, a grammar is called ambiguous if it produces more than one parse tree for some sentence.
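As a rough illustration of ambiguity, the two parse trees of id + id * id under the grammar E → E + E | E * E | id can be encoded as nested tuples; the tuple representation and helper below are assumptions, not part of the text:

```python
# Two distinct parse trees for "id + id * id" under the ambiguous grammar
# E -> E + E | E * E | id, written as nested tuples (an illustrative encoding).

# Tree 1: id + (id * id)
tree1 = ("E", ("E", "id"), "+", ("E", ("E", "id"), "*", ("E", "id")))

# Tree 2: (id + id) * id
tree2 = ("E", ("E", ("E", "id"), "+", ("E", "id")), "*", ("E", "id"))

def leaves(node):
    # A node is either a terminal string or a tuple (label, child1, child2, ...).
    if isinstance(node, str):
        return [node]
    return [tok for child in node[1:] for tok in leaves(child)]

# Both trees yield the same string of leaves, which is what makes the grammar ambiguous.
assert leaves(tree1) == leaves(tree2) == ["id", "+", "id", "*", "id"]
```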
Left recursion
A grammar is said to be left-recursive if it has a nonterminal A such that there is a derivation A =>+ Aα (in one or more steps) for some string α.
For example, consider the pair of productions
A → Aα
A → β
which generates the language βα* (a β followed by zero or more α's). The problem is that if we use the first production in a top-down derivation, we fall into an infinite derivation chain; this is called left recursion. Top-down parsing methods cannot handle left-recursive grammars, so a transformation that eliminates left recursion is needed. The left-recursive pair of productions can be replaced by the two non-left-recursive productions
A → βA'
A' → αA' | ε
Eliminating the immediate left recursion from the productions for E and then for T, we obtain:
E → T E'
E' → + T E' | - T E' | ε
T → F T'
T' → * F T' | / F T' | ε
F → ( E ) | id
No matter how many A-productions there are, we can eliminate immediate left recursion from them by the following technique. First, group the A-productions as
A → Aα1 | Aα2 | ... | Aαm | β1 | β2 | ... | βn
where no βi begins with an A. Then replace the A-productions by
A → β1A' | β2A' | ... | βnA'
A' → α1A' | α2A' | ... | αmA' | ε
The nonterminal A generates the same strings as before, but is no longer left-recursive.
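A minimal Python sketch of this transformation, assuming the A-productions are encoded as lists of symbols (the encoding and function name are illustrative):

```python
# Illustrative sketch: eliminate immediate left recursion from the A-productions
# A -> A a1 | ... | A am | b1 | ... | bn.

def eliminate_immediate_left_recursion(head, bodies):
    """Return new productions for `head` and a fresh nonterminal head + "'".
    Each body is a list of grammar symbols; [] stands for epsilon."""
    alphas = [b[1:] for b in bodies if b and b[0] == head]      # A -> A alpha
    betas  = [b     for b in bodies if not b or b[0] != head]   # A -> beta
    if not alphas:
        return {head: bodies}                                   # nothing to do
    new = head + "'"
    return {
        head: [beta + [new] for beta in betas],                 # A  -> beta A'
        new:  [alpha + [new] for alpha in alphas] + [[]],       # A' -> alpha A' | epsilon
    }

# E -> E + T | E - T | T   becomes   E -> T E',  E' -> + T E' | - T E' | epsilon
print(eliminate_immediate_left_recursion(
    "E", [["E", "+", "T"], ["E", "-", "T"], ["T"]]))
```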
Left factoring
Left factoring is a grammar transformation that is useful for producing a grammar suitable for predictive parsing. The basic idea is that when it is not clear which of two alternative productions to use to expand a nonterminal A, we may be able to rewrite the A-productions to defer the decision until we have seen enough of the input to make the right choice. If A → αβ1 | αβ2 are two A-productions and the input begins with a non-empty string derived from α, we do not know whether to expand A to αβ1 or to αβ2. However, we may defer the decision by expanding A to αB. Then, after seeing the input derived from α, we expand B to β1 or to β2. The left-factored grammar becomes:
A→ αB
B→β1 | β2
For the "dangling else" grammar:
stmt → if cond then stmt else stmt | if cond then stmt
the corresponding left-factored grammar is:
stmt → if cond then stmt stmt'
stmt' → else stmt | ε
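A minimal Python sketch of left factoring, assuming all alternatives share the common prefix (the encoding and function names are illustrative assumptions):

```python
# Illustrative sketch: left-factor a group of A-productions that all share a
# common prefix. Bodies are lists of symbols; [] stands for epsilon.

def common_prefix(bodies):
    """Longest common prefix of a list of symbol lists."""
    prefix = []
    for symbols in zip(*bodies):
        if len(set(symbols)) != 1:
            break
        prefix.append(symbols[0])
    return prefix

def left_factor(head, bodies):
    """Replace A -> alpha beta1 | alpha beta2 | ...
    with    A -> alpha A',  A' -> beta1 | beta2 | ..."""
    alpha = common_prefix(bodies)
    if not alpha:
        return {head: bodies}                           # nothing to factor
    new = head + "'"
    return {
        head: [alpha + [new]],
        new:  [body[len(alpha):] for body in bodies],   # [] stands for epsilon
    }

# stmt -> if cond then stmt else stmt | if cond then stmt
# becomes stmt -> if cond then stmt stmt',  stmt' -> else stmt | epsilon
print(left_factor("stmt", [
    ["if", "cond", "then", "stmt", "else", "stmt"],
    ["if", "cond", "then", "stmt"],
]))
```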
Parsing is the process of determining whether a string of tokens can be generated by a grammar. There are mainly two parsing approaches:
1. Top Down Parsing
2. Bottom Up Parsing
In top-down parsing, the parse tree is constructed from the top (root) to the bottom (leaves). In bottom-up parsing, the parse tree is constructed from the bottom (leaves) to the top (root).
Top Down Parsing
It can be viewed as an attempt to construct a parse tree for the input starting from the root and creating
the nodes of parse tree in preorder.
Preorder traversal means: 1. Visit the root 2. Traverse left subtree 3. Traverse right subtree
Top-down parsing can be viewed as an attempt to find a leftmost derivation for an input string (that is, expanding the leftmost nonterminal at every step).
Recursive-descent parsing is the most general form of top-down parsing. It may involve backtracking, that is, making repeated scans of the input, to obtain the correct expansion of the leftmost nonterminal. Unless the grammar is ambiguous or left-recursive, it finds a suitable parse tree if one exists.
Example:
S → cAd
A→ ab | a
and the input string w = cad.
To construct a parse tree for this string top down, we initially create a tree consisting of a single node
labelled S. An input pointer points to c, the first symbol of w. S has only one production, so we use it
to expand S and obtain the tree as:
The leftmost leaf, labeled c, matches the first symbol of input w, so we advance the input pointer
to a, the second symbol of w, and consider the next leaf, labeled A. Now, we expand A using the first
alternative A → ab to obtain the tree as:
We have a match for the second input symbol, a, so we advance the input pointer to d, the third input
symbol, and compare d against the next leaf, labeled b. Since b does not match d, we report failure and
go back to A to see whether there is another alternative for A that has not been tried, but that might
produce a match.
In going back to A, we must reset the input pointer to position 2 , the position it had when we first came
to A, which means that the procedure for A must store the input pointer in a local variable. The second
alternative for A produces the tree as:
The leaf a matches the second symbol of w and the leaf d matches the third symbol. Since we have
produced a parse tree for w, we halt and announce successful completion of parsing. (that is the string
parsed completely and the parser stops).
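The backtracking behaviour just traced can be sketched in a few lines of Python; the grammar encoding and function below are illustrative assumptions, not the parser used in the text:

```python
# Illustrative sketch of top-down parsing with backtracking for the grammar
# S -> c A d,  A -> a b | a, on the input w = "cad".

GRAMMAR = {
    "S": [["c", "A", "d"]],
    "A": [["a", "b"], ["a"]],
}

def parse(symbol, s, pos):
    """Try to match `symbol` against s starting at pos.
    Return the new position on success, or None on failure."""
    if symbol not in GRAMMAR:                      # terminal: must match literally
        return pos + 1 if pos < len(s) and s[pos] == symbol else None
    for body in GRAMMAR[symbol]:                   # try each alternative in order
        cur = pos                                  # remember pos so we can backtrack
        for sym in body:
            cur = parse(sym, s, cur)
            if cur is None:
                break
        if cur is not None:
            return cur                             # this alternative worked
    return None                                    # all alternatives failed: backtrack

w = "cad"
print(parse("S", w, 0) == len(w))    # True: the whole input is derived from S
```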
A left-recursive grammar can cause a recursive-descent parser to go into an infinite loop: when we try to expand A, we may find ourselves again trying to expand A without having consumed any input.
Backtracking parsers are not very common, as programming language constructs can usually be parsed without resorting to backtracking.
Predictive Parser
Predictive parsing is a special form of recursive-descent parsing, in which the current input token
unambiguously determines the production to be applied at each step. The goal of predictive parsing is
to construct a top-down parser that never backtracks. To do so, we must transform a grammar in two
ways:
1. Eliminate left recursion, and
2. Perform left factoring.
These rules eliminate the most common causes of backtracking, although they do not guarantee completely backtrack-free parsing (called LL(1), as we will see later).
It is possible to build a nonrecursive predictive parser by maintaining a stack explicitly, rather than implicitly via recursive calls. The key problem during predictive parsing is that of determining the production to be applied for a nonterminal. The nonrecursive parser looks up the production to be applied in a parsing table.
Requirements are:
– 1. Stack
– 2. Parsing Table
– 3. Input Buffer
– 4. Parsing program
• Input buffer – contains the string to be parsed, followed by $ (used to indicate the end of the input string).
• Stack – initialized with $, to indicate the bottom of the stack; the start symbol is pushed on top of it.
• Parsing table – a two-dimensional array M[A, a], where A is a nonterminal and a is a terminal or the symbol $.
• Parsing program – repeatedly compares the symbol on top of the stack with the current input symbol and, when the top is a nonterminal A, consults M[A, a] to decide which production to apply.
Predictive Parsing Algorithm
Example:
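A minimal sketch of the table-driven parsing program, assuming the parsing table M for the left-recursion-free expression grammar is stored as a Python dictionary keyed by (nonterminal, terminal); the encoding is an assumption, not from the text:

```python
# Illustrative sketch of the nonrecursive (table-driven) predictive parser:
# a stack, an input buffer ending in $, and a parsing table M[A, a].
# The table below is for the expression grammar after removing left recursion.

TABLE = {
    ("E",  "id"): ["T", "E'"],       ("E",  "("): ["T", "E'"],
    ("E'", "+"):  ["+", "T", "E'"],  ("E'", "-"): ["-", "T", "E'"],
    ("E'", ")"):  [],                ("E'", "$"): [],
    ("T",  "id"): ["F", "T'"],       ("T",  "("): ["F", "T'"],
    ("T'", "+"):  [],                ("T'", "-"): [],
    ("T'", "*"):  ["*", "F", "T'"],  ("T'", "/"): ["/", "F", "T'"],
    ("T'", ")"):  [],                ("T'", "$"): [],
    ("F",  "id"): ["id"],            ("F",  "("): ["(", "E", ")"],
}
NONTERMINALS = {"E", "E'", "T", "T'", "F"}

def predictive_parse(tokens):
    """Return True if the token list (without $) is accepted."""
    input_buffer = tokens + ["$"]
    stack = ["$", "E"]                      # $ marks the bottom; start symbol on top
    i = 0
    while stack[-1] != "$":
        top, a = stack[-1], input_buffer[i]
        if top == a:                        # terminal on top: match and advance
            stack.pop()
            i += 1
        elif top in NONTERMINALS and (top, a) in TABLE:
            stack.pop()                     # expand: pop A, push body in reverse
            stack.extend(reversed(TABLE[top, a]))
        else:
            return False                    # empty table entry: syntax error
    return input_buffer[i] == "$"

print(predictive_parse(["id", "+", "id", "*", "id"]))   # True
print(predictive_parse(["id", "+", "*"]))               # False
```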
• Uses 2 functions:
– FIRST()
– FOLLOW()
– These functions allow us to fill in the entries of the predictive parsing table.
FIRST
If α is any string of grammar symbols, let FIRST(α) be the set of terminals that begin the strings derived from α. If α =>* ε, then ε is also added to FIRST(α). FIRST is defined for both terminals and nonterminals.
Rules to compute FIRST(X):
1) If X is a terminal, then FIRST(X) = { X }.
2) If X → ε is a production, then add ε to FIRST(X).
3) If X is a nonterminal and X → Y1Y2...Yn is a production, then put a in FIRST(X) if, for some i, a is in FIRST(Yi) and ε is in all of FIRST(Y1), ..., FIRST(Yi-1); if ε is in FIRST(Yj) for all j, then add ε to FIRST(X).
For example, for the expression grammar obtained after eliminating left recursion:
FIRST(E) = FIRST(T) = FIRST(F) = { (, id }
FIRST(E') = { +, -, ε }
FIRST(T') = { *, /, ε }
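A minimal sketch of how these rules can be iterated to a fixed point, assuming the grammar is encoded as a dictionary of productions and "eps" stands for ε (all names are illustrative):

```python
# Illustrative sketch: compute FIRST sets by iterating the rules until
# nothing changes. Bodies are lists of symbols; [] is an epsilon production.

GRAMMAR = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], ["-", "T", "E'"], []],
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], ["/", "F", "T'"], []],
    "F":  [["(", "E", ")"], ["id"]],
}
EPS = "eps"

def first_sets(grammar):
    first = {A: set() for A in grammar}
    changed = True
    while changed:
        changed = False
        for A, bodies in grammar.items():
            for body in bodies:
                before = len(first[A])
                nullable_prefix = True
                for X in body:
                    fx = first[X] if X in grammar else {X}   # rule 1 for terminals
                    first[A] |= fx - {EPS}                   # rule 3
                    if EPS not in fx:
                        nullable_prefix = False
                        break
                if nullable_prefix:                          # every Yi can derive eps
                    first[A].add(EPS)                        # rule 2 (and X -> eps)
                changed |= len(first[A]) != before
    return first

for A, f in first_sets(GRAMMAR).items():
    print(A, f)
# e.g. FIRST(E) = FIRST(T) = FIRST(F) = { (, id },  FIRST(E') = { +, -, eps }
```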
FOLLOW
For a nonterminal A, define FOLLOW(A) to be the set of terminals a that can appear immediately to the right of A in some sentential form. If A can be the rightmost symbol in some sentential form, then $ is also in FOLLOW(A).
Rules To Compute Follow Set
1) Place $ in FOLLOW(S), where S is the start symbol and $ is the input end marker.
2) If there is a production A → αBβ, then everything in FIRST(β) except ε is in FOLLOW(B).
3) If there is a production A → αB, or a production A → αBβ where FIRST(β) contains ε, then everything in FOLLOW(A) is in FOLLOW(B).
Example
For the same expression grammar:
FOLLOW(E) = FOLLOW(E') = { ), $ }
FOLLOW(T) = FOLLOW(T') = { +, -, ), $ }
FOLLOW(F) = { +, -, *, /, ), $ }
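A companion sketch for FOLLOW, assuming the FIRST sets above have already been computed and are written out as a dictionary (again, the encoding is an assumption):

```python
# Illustrative sketch: compute FOLLOW sets from already-computed FIRST sets.
# "eps" stands for epsilon; bodies are lists of symbols.

GRAMMAR = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], ["-", "T", "E'"], []],
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], ["/", "F", "T'"], []],
    "F":  [["(", "E", ")"], ["id"]],
}
EPS = "eps"
FIRST = {"E": {"(", "id"}, "E'": {"+", "-", EPS}, "T": {"(", "id"},
         "T'": {"*", "/", EPS}, "F": {"(", "id"}}

def first_of_string(symbols):
    """FIRST of a string of grammar symbols."""
    result = set()
    for X in symbols:
        fx = FIRST.get(X, {X})
        result |= fx - {EPS}
        if EPS not in fx:
            return result
    return result | {EPS}

def follow_sets(grammar, start):
    follow = {A: set() for A in grammar}
    follow[start].add("$")                                   # rule 1
    changed = True
    while changed:
        changed = False
        for A, bodies in grammar.items():
            for body in bodies:
                for i, B in enumerate(body):
                    if B not in grammar:
                        continue                             # only nonterminals get FOLLOW
                    before = len(follow[B])
                    rest = first_of_string(body[i + 1:])
                    follow[B] |= rest - {EPS}                # rule 2
                    if EPS in rest:                          # rule 3: the suffix can vanish
                        follow[B] |= follow[A]
                    changed |= len(follow[B]) != before
    return follow

print(follow_sets(GRAMMAR, "E"))
# FOLLOW(E) = FOLLOW(E') = { ), $ },  FOLLOW(T) = FOLLOW(T') = { +, -, ), $ },
# FOLLOW(F) = { +, -, *, /, ), $ }
```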
LL(1) grammars are the class of grammars from which predictive parsers can be constructed automatically.
A context-free grammar G = (VT, VN, P, S) whose parsing table has no multiple entries is said to
be LL(1). In the name LL(1),
• the first L stands for scanning the input from left to right,
• the second L stands for producing a leftmost derivation,
• and the 1 stands for using one input symbol of lookahead at each step to make parsing
action decision.
A language is said to be LL(1) if it can be generated by an LL(1) grammar. It can be shown that LL(1) grammars are neither ambiguous nor left-recursive.
Example: Consider the following grammar (an abstraction of the dangling-else construct):
S → iEtSS' | a
S' → eS | ε
E → b
FIRST(S) = { i, a }, FIRST(S') = { e, ε }, FIRST(E) = { b }
FOLLOW(S) = FOLLOW(S') = { e, $ }
FOLLOW(E) = { t }
Parsing table:
Nonterminal    a       b       e                  i              t       $
S              S→a                                S→iEtSS'
S'                             S'→eS , S'→ε                              S'→ε
E                      E→b
Since the entry M[S', e] contains more than one production, the table has a multiply-defined entry and the grammar is not LL(1).
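A small sketch that builds the table for this grammar from its FIRST and FOLLOW sets and flags multiply-defined entries (the encoding is an assumption; FIRST and FOLLOW are written out by hand):

```python
# Illustrative sketch: build the predictive parsing table for
# S -> i E t S S' | a,  S' -> e S | eps,  E -> b, and report conflicts.

EPS = "eps"
GRAMMAR = {
    "S":  [["i", "E", "t", "S", "S'"], ["a"]],
    "S'": [["e", "S"], []],
    "E":  [["b"]],
}
FIRST = {"S": {"i", "a"}, "S'": {"e", EPS}, "E": {"b"}}
FOLLOW = {"S": {"e", "$"}, "S'": {"e", "$"}, "E": {"t"}}

def first_of_body(body):
    result = set()
    for X in body:
        fx = FIRST.get(X, {X})
        result |= fx - {EPS}
        if EPS not in fx:
            return result
    return result | {EPS}

table = {}
for A, bodies in GRAMMAR.items():
    for body in bodies:
        f = first_of_body(body)
        # For each terminal a in FIRST(body), add A -> body to M[A, a];
        # if body can derive eps, add it for every terminal in FOLLOW(A) as well.
        terminals = (f - {EPS}) | (FOLLOW[A] if EPS in f else set())
        for a in terminals:
            table.setdefault((A, a), []).append(body)

for (A, a), entries in sorted(table.items()):
    flag = "  <-- multiply defined, so the grammar is not LL(1)" if len(entries) > 1 else ""
    print(f"M[{A}, {a}] = {entries}{flag}")
```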