
MODULE- 2

SYNTAX ANALYSIS

Syntax analysis is the second phase of the compiler. It gets the input from the tokens and generates a
syntax tree or parse tree.

Advantages of grammar for syntactic specification:

1. A grammar gives a precise and easy-to-understand syntactic specification of a programming language.


2. An efficient parser can be constructed automatically from a properly designed grammar.
3. A grammar imparts a structure to a source program that is useful for its translation into object code and for
the detection of errors.
4. New constructs can be added to a language more easily when there is a grammatical description of the
language.

THE ROLE OF PARSER

The parser or syntactic analyzer obtains a string of tokens from the lexical analyzer and verifies that
the string can be generated by the grammar for the source language. It reports any syntax errors in the program.
It also recovers from commonly occurring errors so that it can continue processing its input.

Functions of the parser:


1. It verifies the structure generated by the tokens based on the grammar.
2. It constructs the parse tree.
3. It reports the errors.
4. It performs error recovery.

Context Free Grammar (CFG)


A context-free grammar (grammar for short) consists of terminals, non terminals, a start symbol, and
productions.
1. Terminals are the basic symbols from which strings are formed. The term "token name" is a
synonym for "terminal" and frequently we will use the word "token" for terminal when it is clear that
we are talking about just the token name.
2. Non terminals are syntactic variables that denote sets of strings.
3. In a grammar, one non terminal is distinguished as the start symbol, and the set of strings it denotes
is the language generated by the grammar. Conventionally, the productions for the start symbol are listed
first.
4. The productions of a grammar specify the manner in which the terminals and non terminals can be
combined to form strings. Each production consists of:
(a) A non terminal called the head or left side of the production; this production defines some
of the strings denoted by the head.
(b) The symbol →. Sometimes ::= has been used in place of the arrow.
(c) A body or right side consisting of zero or more terminals and non terminals.

Example: The grammar with the following productions defines simple arithmetic expressions:

expression → expression + term | expression - term | term
term → term * factor | term / factor | factor
factor → ( expression ) | id

In this grammar, the terminal symbols are: id + - * / ( )
The nonterminal symbols are: expression, term, factor
Start symbol: expression

Notational Conventions
To avoid always having to state that "these are the terminals," "these are the nonterminals," and so on,
the following notational conventions for grammars will be used.
1. These symbols are terminals:
(a) Lowercase letters early in the alphabet, such as a, b, c.
(b) Operator symbols such as +, *, and so on.
(c) Punctuation symbols such as parentheses, comma, and so on.
(d) The digits 0, 1, . . . , 9.
(e) Boldface strings such as id or if, each of which represents a single terminal symbol.

2. These symbols are nonterminals:


(a) Uppercase letters early in the alphabet, such as A, B, C.
(b) The letter S, which, when it appears, is usually the start symbol.
(c) Lowercase, italic names such as expr or stmt.
(d) When discussing programming constructs, uppercase letters may be used to represent nonterminals
for the constructs. For example, nonterminals for expressions, terms, and factors are often
represented by E, T, and F, respectively.

3. Uppercase letters late in the alphabet, such as X, Y, Z, represent grammar symbols; that is, either
nonterminals or terminals.

4. Lowercase letters late in the alphabet, chiefly u, v, ... ,z, represent (possibly empty) strings of
terminals.
5. Lowercase Greek letters α, β, γ for example, represent (possibly empty) strings of grammar symbols.
6. A set of productions A → α1, A → α2, ..., A → αk with a common head A (call them A-productions)
may be written A → α1 | α2 | ... | αk. We call α1, α2, ..., αk the alternatives for A.
7. Unless stated otherwise, the head of the first production is the start symbol.

Using these conventions, the grammar for arithmetic expressions can be rewritten as:

E → E + T | E - T | T
T → T * F | T / F | F
F → ( E ) | id

Derivations
The construction of a parse tree can be made precise by taking a derivational view, in which productions
are treated as rewriting rules. Beginning with the start symbol, each rewriting step replaces a non
terminal by the body of one of its productions.
For example, consider the following grammar, with a single non terminal E:
E → E + E | E * E | - E | ( E ) | id
The production E → - E signifies that if E denotes an expression, then -E must also denote an
expression. The replacement of a single E by -E is described by writing
E => -E
which is read, "E derives -E." The production E → ( E ) can be applied to replace any instance of E in
any string of grammar symbols by (E),
e.g., E * E => (E) * E or E * E => E * (E)
We can take a single E and repeatedly apply productions in any order to get a sequence of replacements.
For example,
E => - E => - (E) => - (id)
We call such a sequence of replacements a derivation of - (id) from E. This derivation provides a proof

that the string - (id) is one particular instance of an expression.


Leftmost and Rightmost Derivation of a String
• Leftmost derivation − A leftmost derivation is obtained by applying a production to the leftmost
non-terminal in each step.

• Rightmost derivation − A rightmost derivation is obtained by applying a production to the
rightmost non-terminal in each step.

Example
Let the production rules of a CFG be
X → X+X | X*X | X | a
over the alphabet {a}.
The leftmost derivation for the string "a+a*a" may be –
X → X+X
→ a+X
→ a + X*X
→ a+a*X
→ a+a*a
The rightmost derivation for the above string "a+a*a" may be –

X → X*X
→ X*a
→ X+X*a
→ X+a*a
→ a+a*a
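The rewriting view of a derivation can be mimicked literally with string replacement. The sketch below (the helper name apply_leftmost is mine, not from the text) reproduces the leftmost derivation of "a+a*a" step by step:

```python
# Replace the leftmost occurrence of a nonterminal with a production body.
def apply_leftmost(sentential, nonterminal, body):
    """Rewrite the first occurrence of `nonterminal` in `sentential` to `body`."""
    i = sentential.index(nonterminal)
    return sentential[:i] + body + sentential[i + len(nonterminal):]

# The leftmost derivation of "a+a*a" from X, one production body per step:
s = "X"
for body in ["X+X", "a", "X*X", "a", "a"]:
    s = apply_leftmost(s, "X", body)
    print(s)
# X+X, a+X, a+X*X, a+a*X, a+a*a
```

Applying the same bodies to the rightmost X instead would give the rightmost derivation shown above.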
Parse tree and Derivation tree
Parse Tree
A parse tree is a hierarchical structure that represents the derivation of an input string from the
grammar.
• The root node of the parse tree is labelled with the start symbol of the given grammar, from
which the derivation proceeds.
• The leaves of the parse tree represent terminals.
• Each interior node represents a production of the grammar.
• If A → xyz is a production, then the parse tree has A as an interior node whose children are
x, y and z, from left to right.
Construct the parse tree for E → E + E | E * E | id

Yield of Parse Tree


The leaf nodes of a parse tree, concatenated from left to right, form the input string derived from the
grammar; this string is called the yield of the parse tree. For example, the string id + id * id is the
yield of the parse tree whose leaves, read from left to right, are id, +, id, *, id.
Ambiguity
There are sentences derived from E above that have more than one parse tree, and correspondingly more
than one leftmost and rightmost derivation. For example, the very simple sentence id + id * id has two
distinct leftmost derivations:

1st Leftmost Der.                2nd Leftmost Der.
E ===> E + E                     E ===> E * E
  ===> id + E                      ===> E + E * E
  ===> id + E * E                  ===> id + E * E
  ===> id + id * E                 ===> id + id * E
  ===> id + id * id                ===> id + id * id

A grammar is called ambiguous if some sentence has more than one parse tree, even if other sentences
have unique parse trees.

Left recursion

A grammar is said to be left-recursive if it has a non-terminal A such that there is a derivation
A => Aα for some string α.

For example, consider the grammar

A → Aα
A → β

It generates the language denoted by the regular expression βα*. The problem is that if we use the
first production in a top-down derivation, we fall into an infinite derivation chain. This is called
left recursion. Top-down parsing methods cannot handle left-recursive grammars, so a transformation
that eliminates left recursion is needed. The left-recursive pair of productions can be replaced by
the two non-recursive productions

A → βA'
A' → αA' | ε

Consider the following grammar, which generates arithmetic expressions:

E → E + T | T
T → T * F | F
F → ( E ) | id

Eliminating the immediate left recursion from the productions for E and then for T, we obtain:

E → T E'
E' → + T E' | ε
T → F T'
T' → * F T' | ε
F → ( E ) | id

No matter how many A-productions there are, we can eliminate immediate left recursion from them by
the following technique. First, group the A-productions as

A → Aα1 | Aα2 | ... | Aαm | β1 | β2 | ... | βn

where no βi begins with an A. Then replace the A-productions by

A → β1A' | β2A' | ... | βnA'
A' → α1A' | α2A' | ... | αmA' | ε
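The grouping-and-replacement scheme for immediate left recursion can be sketched in code. The following Python helper is my own illustration (the representation of productions as lists of symbol strings is an assumption, not from the text):

```python
def eliminate_immediate_left_recursion(head, bodies):
    """Split the A-productions A -> A a1 | ... | A am | b1 | ... | bn and
    return the equivalent non-left-recursive productions
    A -> b1 A' | ... | bn A'  and  A' -> a1 A' | ... | am A' | epsilon.
    Each body is a list of symbols; [] stands for an epsilon body."""
    new_head = head + "'"
    recursive = [b[1:] for b in bodies if b and b[0] == head]  # the alpha_i parts
    others = [b for b in bodies if not b or b[0] != head]      # the beta_i parts
    if not recursive:
        return {head: bodies}                                  # nothing to do
    a_prods = [b + [new_head] for b in others]                 # A  -> beta_i A'
    a1_prods = [a + [new_head] for a in recursive] + [[]]      # A' -> alpha_i A' | epsilon
    return {head: a_prods, new_head: a1_prods}

# E -> E + T | T   becomes   E -> T E',  E' -> + T E' | epsilon
print(eliminate_immediate_left_recursion("E", [["E", "+", "T"], ["T"]]))
```

Running the helper on the T-productions of the expression grammar gives T → F T', T' → * F T' | ε in the same way.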

Left factoring
Left factoring is a grammar transformation that is useful for producing a grammar suitable for predictive
parsing. The basic idea is that when it is not clear which of two alternative productions to use to expand
a non-terminal A, we may be able to rewrite the A-productions to defer the decision until we have seen
enough of the input to make the right choice. If A → αβ1 | αβ2 are two A-productions and the input begins
with a non-empty string derived from α, we do not know whether to expand A to αβ1 or to αβ2. However,
we may defer the decision by expanding A to αB. Then, after seeing the input derived from α, we expand
B to β1 or to β2. The left-factored grammar becomes:
A → αB
B → β1 | β2
For the "dangling else" grammar:

stmt → if cond then stmt else stmt | if cond then stmt

the corresponding left-factored grammar is:

stmt → if cond then stmt else_clause
else_clause → else stmt | ε
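The left-factoring transformation can be carried out mechanically: find the longest common prefix α of the alternatives and push the differing tails into a fresh nonterminal. A hypothetical helper (function names and the list-of-symbols representation are mine):

```python
def common_prefix(bodies):
    """Longest common prefix of the bodies, as a list of symbols."""
    prefix = []
    for symbols in zip(*bodies):        # walk the bodies position by position
        if len(set(symbols)) == 1:
            prefix.append(symbols[0])
        else:
            break
    return prefix

def left_factor(head, bodies):
    """A -> alpha b1 | alpha b2  becomes  A -> alpha A',  A' -> b1 | b2.
    An empty tail [] stands for an epsilon alternative."""
    alpha = common_prefix(bodies)
    if not alpha:
        return {head: bodies}           # nothing to factor
    new_head = head + "'"
    tails = [b[len(alpha):] for b in bodies]
    return {head: [alpha + [new_head]], new_head: tails}

# The dangling-else grammar from above:
g = left_factor("stmt", [
    ["if", "cond", "then", "stmt", "else", "stmt"],
    ["if", "cond", "then", "stmt"],
])
print(g)   # stmt -> if cond then stmt stmt',  stmt' -> else stmt | epsilon
```

Here the fresh nonterminal is written stmt' rather than else_clause, but the result matches the left-factored grammar above.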
Basic Parsing Approaches

Parsing is the process of determining whether a string of tokens can be generated by a grammar. There
are two main parsing approaches:
1. Top Down Parsing
2. Bottom Up Parsing

In top down parsing, the parse tree is constructed from the top (root) to the bottom (leaves). In
bottom up parsing, the parse tree is constructed from the bottom (leaves) to the top (root).
Top Down Parsing

It can be viewed as an attempt to construct a parse tree for the input starting from the root and creating
the nodes of parse tree in preorder.

Preorder traversal means: 1. Visit the root 2. Traverse the left subtree 3. Traverse the right subtree

Top down parsing can be viewed as an attempt to find a leftmost derivation for an input string (that
is, expanding the leftmost non-terminal at every step).

Recursive Descent Parsing

It is the most general form of top-down parsing. It may involve backtracking, that is, making repeated
scans of the input to obtain the correct expansion of the leftmost non-terminal. Unless the grammar is
ambiguous or left-recursive, it finds a suitable parse tree.
Example:

Consider the grammar:

S → cAd
A→ ab | a
and the input string w = cad.

To construct a parse tree for this string top down, we initially create a tree consisting of a single node
labelled S. An input pointer points to c, the first symbol of w. S has only one production, so we use it
to expand S and obtain the tree as:
The leftmost leaf, labeled c, matches the first symbol of input w, so we advance the input pointer
to a, the second symbol of w, and consider the next leaf, labeled A. Now, we expand A using the first
alternative A → ab to obtain the tree as:

We have a match for the second input symbol, a, so we advance the input pointer to d, the third input
symbol, and compare d against the next leaf, labeled b. Since b does not match d, we report failure and
go back to A to see whether there is another alternative for A that has not been tried, but that might
produce a match.
In going back to A, we must reset the input pointer to position 2 , the position it had when we first came
to A, which means that the procedure for A must store the input pointer in a local variable. The second
alternative for A produces the tree as:

The leaf a matches the second symbol of w and the leaf d matches the third symbol. Since we have
produced a parse tree for w, we halt and announce successful completion of parsing. (that is the string
parsed completely and the parser stops).
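The walk above maps directly onto code. Below is a minimal backtracking sketch for S → cAd, A → ab | a (the function names are mine, not from the text): the alternatives for A are tried in order, and the input position is passed in so a failed alternative is retried from the same point:

```python
def parse_A(w, i):
    """Try each alternative for A in order, starting at position i.
    Return the new input position on success, or None on failure.
    Because i is unchanged between attempts, a failed alternative
    is simply retried from the same point (backtracking)."""
    if w[i:i + 2] == "ab":      # first alternative:  A -> a b
        return i + 2
    if w[i:i + 1] == "a":       # second alternative: A -> a
        return i + 1
    return None

def parse_S(w):
    """S -> c A d; returns True iff the whole input matches.
    (For this tiny grammar, trying A's alternatives locally is enough;
    a fully general backtracking parser would also re-try A's other
    alternatives if the following 'd' failed to match.)"""
    if not w.startswith("c"):
        return False
    j = parse_A(w, 1)
    if j is None:
        return False
    return w[j:] == "d"

print(parse_S("cad"))    # True: A -> ab fails on the input, so A -> a is used
print(parse_S("cabd"))   # True, using A -> ab
print(parse_S("cbd"))    # False: no alternative for A matches
```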

A left-recursive grammar can cause a recursive-descent parser to go into an infinite loop: when we try
to expand A, we may find ourselves again trying to expand A without having consumed any input.

Backtracking parsers are not very common, because programming language constructs can usually be
parsed without resorting to backtracking.

Predictive Parser

A predictive parser is a special form of recursive-descent parser in which the current input token
unambiguously determines the production to be applied at each step. The goal of predictive parsing is
to construct a top-down parser that never backtracks. To do so, we must transform the grammar in two
ways:
1. Eliminate left recursion, and
2. Perform left factoring.

These transformations eliminate the most common causes of backtracking, although they do not
guarantee completely backtrack-free parsing (called LL(1) parsing, as we will see later).

Non-Recursive Predictive Parser

It is possible to build a non-recursive predictive parser by maintaining a stack explicitly, rather
than implicitly via recursive calls. The key problem during predictive parsing is determining the
production to be applied for a nonterminal. The non-recursive parser looks up the production to be
applied in a parsing table.

Requirements are:
1. Stack
2. Parsing table
3. Input buffer
4. Parsing program

• Input buffer - contains the string to be parsed, followed by $ (used to indicate the end of the
input string).
• Stack - initialized with $, to indicate the bottom of the stack.
• Parsing table - a 2-D array M[A, a], where A is a nonterminal and a is a terminal or the
symbol $.
Predictive Parsing Algorithm
Example:

Moves made by the predictive parser for the input id + id * id:

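As an illustration of the stack-and-table loop, the following sketch parses id + id * id with the left-recursion-free expression grammar E → TE', E' → +TE' | ε, T → FT', T' → *FT' | ε, F → (E) | id. The table entries here are assumptions computed by hand from FIRST/FOLLOW, not taken from the text:

```python
# LL(1) parsing table: (nonterminal, lookahead) -> production body.
# An absent entry is an error; [] is the body of an epsilon-production.
TABLE = {
    ("E", "id"): ["T", "E'"],      ("E", "("): ["T", "E'"],
    ("E'", "+"): ["+", "T", "E'"], ("E'", ")"): [], ("E'", "$"): [],
    ("T", "id"): ["F", "T'"],      ("T", "("): ["F", "T'"],
    ("T'", "+"): [], ("T'", "*"): ["*", "F", "T'"],
    ("T'", ")"): [], ("T'", "$"): [],
    ("F", "id"): ["id"],           ("F", "("): ["(", "E", ")"],
}
NONTERMINALS = {"E", "E'", "T", "T'", "F"}

def predictive_parse(tokens, start="E"):
    """Return True iff tokens (ending with '$') are accepted."""
    stack = ["$", start]                    # '$' marks the bottom of the stack
    i = 0
    while stack:
        top = stack.pop()
        a = tokens[i]                       # current lookahead
        if top in NONTERMINALS:
            body = TABLE.get((top, a))
            if body is None:
                return False                # error entry
            stack.extend(reversed(body))    # push the body, leftmost symbol on top
        elif top == a:
            i += 1                          # match a terminal (or the final '$')
        else:
            return False                    # terminal on stack does not match input
    return i == len(tokens)

print(predictive_parse(["id", "+", "id", "*", "id", "$"]))  # True
```

Tracing the run by hand reproduces the usual table of moves: the stack starts as $E, E is replaced by TE', T by FT', F by id, and so on until both the stack and the input are exhausted.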

Construction of the predictive parsing table

• Uses two functions:
– FIRST()
– FOLLOW()
• These functions allow us to fill in the entries of the predictive parsing table.

FIRST
If α is any string of grammar symbols, then FIRST(α) is the set of terminals that begin the strings
derived from α. If α =>* ε, then ε is also added to FIRST(α). FIRST is defined for both terminals and
non-terminals.

Rules to Compute the FIRST Set

1) If X is a terminal, then FIRST(X) is {X}.

2) If X → ε is a production, then add ε to FIRST(X).

3) If X is a non-terminal and X → Y1Y2...Yn is a production, then put a in FIRST(X) if, for some i,
a is in FIRST(Yi) and ε is in all of FIRST(Y1), ..., FIRST(Yi-1). If ε is in FIRST(Yj) for all
j = 1, ..., n, then add ε to FIRST(X).

FOLLOW

• FOLLOW is defined only for the non-terminals of the grammar G.

• FOLLOW(A) can be defined as the set of terminals of grammar G that can immediately follow
the non-terminal A in some sentential form derived from the start symbol.

• In other words, if A is a nonterminal, then FOLLOW(A) is the set of terminals a that can
appear immediately to the right of A in some sentential form.
Rules to Compute the FOLLOW Set

1. If S is the start symbol, then add $ to FOLLOW(S).

2. If there is a production rule A → αBβ, then everything in FIRST(β) except ε is placed in
FOLLOW(B).

3. If there is a production A → αB, or a production A → αBβ where FIRST(β) contains ε, then
everything in FOLLOW(A) is in FOLLOW(B).
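The FIRST and FOLLOW rules above are naturally computed by iterating them to a fixed point. Below is a sketch for the expression grammar E → TE', E' → +TE' | ε, T → FT', T' → *FT' | ε, F → (E) | id (the grammar encoding and helper names are my own):

```python
EPS = "eps"                          # stands for the epsilon symbol
GRAMMAR = {                          # nonterminal -> list of bodies; [] is epsilon
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], []],
    "F":  [["(", "E", ")"], ["id"]],
}
START = "E"

def first_of(seq, FIRST):
    """FIRST of a string of grammar symbols (rule 3 applied left to right)."""
    out = set()
    for X in seq:
        f = FIRST.get(X, {X})        # a terminal's FIRST set is itself
        out |= f - {EPS}
        if EPS not in f:
            return out
    out.add(EPS)                     # every symbol in seq can derive epsilon
    return out

FIRST = {A: set() for A in GRAMMAR}
changed = True
while changed:                       # iterate the FIRST rules to a fixed point
    changed = False
    for A, bodies in GRAMMAR.items():
        for body in bodies:
            f = first_of(body, FIRST)
            if not f <= FIRST[A]:
                FIRST[A] |= f
                changed = True

FOLLOW = {A: set() for A in GRAMMAR}
FOLLOW[START].add("$")               # rule 1: $ goes into FOLLOW(start symbol)
changed = True
while changed:                       # iterate rules 2 and 3 to a fixed point
    changed = False
    for A, bodies in GRAMMAR.items():
        for body in bodies:
            for i, B in enumerate(body):
                if B not in GRAMMAR:
                    continue         # FOLLOW is defined only for nonterminals
                f = first_of(body[i + 1:], FIRST)
                add = (f - {EPS}) | (FOLLOW[A] if EPS in f else set())
                if not add <= FOLLOW[B]:
                    FOLLOW[B] |= add
                    changed = True

print(FIRST["E"])    # {'(', 'id'}
print(FOLLOW["E'"])  # {')', '$'}
```

The fixed-point loop terminates because the sets only grow and the symbol universe is finite.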

Algorithm to construct the predictive parsing table:

For each production A → α of the grammar:
1. For each terminal a in FIRST(α), add A → α to M[A, a].
2. If ε is in FIRST(α), add A → α to M[A, b] for each terminal b in FOLLOW(A). If ε is in FIRST(α)
and $ is in FOLLOW(A), add A → α to M[A, $] as well.
Make each undefined entry of M an error.


LL(1) Grammars

LL(1) grammars are the class of grammars from which predictive parsers can be constructed
automatically.

A context-free grammar G = (VT, VN, P, S) whose parsing table has no multiple entries is said to
be LL(1). In the name LL(1):

• the first L stands for scanning the input from left to right,
• the second L stands for producing a leftmost derivation,
• and the 1 stands for using one input symbol of lookahead at each step to make a parsing
action decision.

A language is said to be LL(1) if it can be generated by an LL(1) grammar. It can be shown that
LL(1) grammars are

• not ambiguous and


• not left-recursive

Consider the following grammar:

S → iEtSS' | a
S' → eS | ε
E → b

FIRST(S) = { i, a }      FOLLOW(S) = { $, e }
FIRST(S') = { e, ε }     FOLLOW(S') = { $, e }
FIRST(E) = { b }         FOLLOW(E) = { t }

Parsing table:

Non-terminal |  a   |  b   |       e        |    i      |  t  |   $
S            | S→a  |      |                | S→iEtSS'  |     |
S'           |      |      | S'→eS , S'→ε   |           |     | S'→ε
E            |      | E→b  |                |           |     |

Since the entry M[S', e] contains more than one production, the grammar is not LL(1).
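The multiple entry can be reproduced by applying the table-construction rules mechanically. Below is a sketch (the FIRST and FOLLOW sets for this grammar are precomputed by hand here, and all names are my own):

```python
EPS = "eps"                               # stands for the epsilon symbol
GRAMMAR = {                               # nonterminal -> list of bodies; [] is epsilon
    "S":  [["i", "E", "t", "S", "S'"], ["a"]],
    "S'": [["e", "S"], []],
    "E":  [["b"]],
}
FIRST = {"S": {"i", "a"}, "S'": {"e", EPS}, "E": {"b"}}
FOLLOW = {"S": {"$", "e"}, "S'": {"$", "e"}, "E": {"t"}}

def first_of(body):
    """FIRST of a string of grammar symbols, using the FIRST sets above."""
    out = set()
    for X in body:
        f = FIRST.get(X, {X})             # a terminal's FIRST set is itself
        out |= f - {EPS}
        if EPS not in f:
            return out
    return out | {EPS}

def build_table():
    """M[A, a] collects every production usable with A on the stack and
    lookahead a; any entry with more than one production means not LL(1)."""
    M = {}
    for A, bodies in GRAMMAR.items():
        for body in bodies:
            f = first_of(body)
            # rule 1: add A -> body under each terminal in FIRST(body);
            # rule 2: if epsilon in FIRST(body), also under FOLLOW(A).
            targets = (f - {EPS}) | (FOLLOW[A] if EPS in f else set())
            for a in targets:
                M.setdefault((A, a), []).append((A, body))
    return M

M = build_table()
# M[("S'", "e")] holds both S' -> eS and S' -> epsilon: the grammar is not LL(1).
print(len(M[("S'", "e")]))   # 2
```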

Actions performed in predictive parsing:


1. Shift
2. Reduce
3. Accept
4. Error

Implementation of predictive parser:

1. Eliminate left recursion, perform left factoring, and remove ambiguity from the grammar.
2. Construct FIRST() and FOLLOW() for all non-terminals.
3. Construct the predictive parsing table.
4. Parse the given input string using the stack and the parsing table.
