Syntax Analysis
CS2210
Lecture 4
CS2210 Compiler Design 2004/05
Parser
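[Figure: source program -> lexical analyzer -> token -> parser -> parse tree -> rest of front end -> IR. The parser requests each token with "get next token"; the lexical analyzer and the parser both access the symbol table.]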
Parsing = determining whether a string of tokens can be generated by a grammar
Grammars
- Precise, easy-to-understand description of syntax
- Context-free grammars -> efficient parsers (automatically!)
- Help in translation and error detection
  - E.g. attribute grammars
- Easier language evolution
  - Can add new constructs systematically
Syntax Errors
- Many errors are syntactic or exposed by parsing
  - E.g. unbalanced ()
- Error handling goals:
  - Report errors quickly & accurately
  - Recover quickly (continue parsing after error)
  - Little overhead on parse time
Error Recovery
- Panic mode
  - Discard tokens until synchronization token found (often ;)
- Phrase level
  - Local correction: replace a token by another and continue
- Error productions
  - Encode commonly expected errors in grammar
- Global correction
  - Find closest input string that is in L(G)
  - Too costly in practice
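The panic-mode strategy can be sketched in a few lines. This is a minimal illustration, assuming tokens are plain strings and `;` is the only synchronization token; the name `skip_to_sync` is hypothetical and not part of the lecture.

```python
def skip_to_sync(tokens, pos, sync=frozenset({";"})):
    """Panic-mode recovery: discard tokens starting at index pos until a
    synchronization token is found; return the index just past it."""
    while pos < len(tokens) and tokens[pos] not in sync:
        pos += 1  # discard the offending token
    # Step over the synchronization token itself, if one was found.
    return pos + 1 if pos < len(tokens) else pos


# Assumed scenario: the parser detects an error at index 2 ("(" is never closed).
tokens = ["x", "=", "(", "1", "+", ";", "y", "=", "2", ";"]
resume_at = skip_to_sync(tokens, 2)
print(tokens[resume_at:])  # parsing would continue with the next statement
```

The parser loses everything between the error and the synchronization token, which is the price for the simplicity of this scheme.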
Context-free Grammars
- Precise and easy way to specify the syntactical structure of a programming language
- Efficient recognition methods exist
- Natural specification of many recursive constructs:
  - expr -> expr + expr | term
Context-free Grammar Definition
- Terminals T
  - Symbols which form strings of L(G), G a CFG (= tokens in the scanner), e.g. if, else, id
- Nonterminals N
  - Syntactic variables denoting sets of strings of L(G)
  - Impose hierarchical structure (e.g., precedence rules)
- Start symbol S (∈ N)
  - Denotes the set of strings of L(G)
- Productions P
  - Rules that determine how strings are formed
  - N -> (N|T)*
Example: Expression Grammar
expr -> expr op expr
expr -> ( expr )
expr -> - expr
expr -> id
op -> +
op -> -
op -> *
op -> /
op -> ^
- Terminals: {id, +, -, *, /, ^, (, )}
- Nonterminals: {expr, op}
- Start symbol: expr
Notational Conventions
- Terminals
  - a, b, c, ..
  - +, -, ..
  - , . ; etc.
  - 0..9
  - expr or <expr>
- Nonterminals
  - A, B, C, ..
  - S start symbol (if present), or first nonterminal in production list
- Terminal strings
  - u, v, ..
- Grammar symbol strings
  - α, β
- Productions
  - A -> α
Shorthands & Derivations
E -> E + E | E * E | (E) | - E | <id>
- E => -E : E derives -E
- => : derives in 1 step
- =>* : derives in n (0..) steps
More Definitions
- L(G): language generated by G = set of terminal strings derived from S
- S =>+ w : w is a sentence of G (w a string of terminals)
- S =>* α : α is a sentential form of G (string can contain nonterminals)
- G and G' are equivalent :⇔ L(G) = L(G')
- A language generated by a grammar (of the form shown) is called a context-free language
Example
G = ({+, -, *, (, ), <id>}, {E}, E, {E -> E + E, E -> E * E, E -> (E), E -> - E, E -> <id>})
- Sentence: -(<id> + <id>)
- Derivation: E => -E => -(E) => -(E + E) => -(<id> + E) => -(<id> + <id>)
- Leftmost derivation: always replace the leftmost nonterminal
- Rightmost derivation: analogously
- Left/right sentential form: a sentential form occurring in a leftmost/rightmost derivation
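A leftmost derivation can be replayed mechanically. The sketch below assumes sentential forms are lists of symbols and that each step names which alternative of E to apply; `PRODUCTIONS`, `derive_leftmost` and `derive` are illustrative names, not part of the lecture.

```python
# E -> E + E | E * E | (E) | - E | <id>, each alternative as a list of symbols.
PRODUCTIONS = {
    "E": [["E", "+", "E"], ["E", "*", "E"], ["(", "E", ")"], ["-", "E"], ["<id>"]],
}


def derive_leftmost(form, alternative):
    """One derivation step (=>): replace the leftmost nonterminal in form
    by the chosen alternative (an index into its production list)."""
    for i, symbol in enumerate(form):
        if symbol in PRODUCTIONS:
            return form[:i] + PRODUCTIONS[symbol][alternative] + form[i + 1:]
    raise ValueError("no nonterminal left: the form is already a sentence")


def derive(start, choices):
    """=>* : apply zero or more leftmost steps and return every sentential form."""
    forms = [[start]]
    for alternative in choices:
        forms.append(derive_leftmost(forms[-1], alternative))
    return forms


# Replays E => -E => -(E) => -(E+E) => -(<id>+E) => -(<id>+<id>)
steps = derive("E", [3, 2, 0, 4, 4])
print(" => ".join("".join(form) for form in steps))
```

Every intermediate list is a left sentential form; only the last one, which contains no nonterminal, is a sentence.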
Parse Trees
E => -E => -(E) => -(E + E) => -(<id> + E) => -(<id> + <id>)
Parse tree = graphical representation of a derivation ignoring replacement order

      E
     / \
    -   E
      / | \
     (  E  )
      / | \
     E  +  E
     |     |
    <id>  <id>
Ambiguous Grammars
- >= 2 different parse trees for some sentence ⇔ >= 2 leftmost/rightmost derivations
- Usually want to have unambiguous grammars
  - E.g. want just one evaluation order: <id> + <id> * <id> should be parsed as <id> + (<id> * <id>), not (<id> + <id>) * <id>
- To keep grammars simple, accept ambiguity and resolve it separately (outside of the grammar)
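The ambiguity can be made concrete by counting parse trees. The sketch below assumes the grammar is restricted to E -> E + E | E * E | <id> to keep it short; `count_parse_trees` is a hypothetical helper.

```python
from functools import lru_cache


def count_parse_trees(tokens):
    """Count the parse trees of tokens under E -> E + E | E * E | <id>."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(lo, hi):
        # Number of ways tokens[lo:hi] can be derived from E.
        if hi - lo == 1:
            return 1 if tokens[lo] == "<id>" else 0
        total = 0
        # Try every operator position as the root of the tree.
        for k in range(lo + 1, hi - 1):
            if tokens[k] in ("+", "*"):
                total += count(lo, k) * count(k + 1, hi)
        return total

    return count(0, len(tokens))


print(count_parse_trees(["<id>", "+", "<id>", "*", "<id>"]))
```

For <id> + <id> * <id> the two trees correspond to the two groupings named on the slide; any count above 1 shows that the grammar is ambiguous.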
Expressive Power
- CFGs are more powerful than REs
  - Can express matching () with CFGs
  - Can express most properties desired for programming languages
- CFGs cannot express:
  - Identifiers declared before used: L = {wcw | w is in (a|b)*}
  - Parameter checking (#formals = #actuals): L = {a^n b^m c^n d^m | n ≥ 1, m ≥ 1}
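Matching parentheses, which no regular expression can describe, need only one recursive rule. The sketch assumes the grammar S -> ( S ) S | ε, which is not given on the slide, and a hypothetical function name `balanced`.

```python
def balanced(text):
    """Recognize the language of the (assumed) grammar S -> ( S ) S | ε."""
    pos = 0

    def parse_s():
        nonlocal pos
        # S -> ( S ) S is chosen when the lookahead is "("; otherwise S -> ε.
        while pos < len(text) and text[pos] == "(":
            pos += 1        # match "("
            parse_s()       # nested S
            if pos >= len(text) or text[pos] != ")":
                raise SyntaxError(f"expected ')' at position {pos}")
            pos += 1        # match ")"; the trailing S is handled by the loop

    try:
        parse_s()
    except SyntaxError:
        return False
    return pos == len(text)  # all input must be consumed


print(balanced("(()())"), balanced("(()"))
```

The recursion depth tracks the nesting depth, which is exactly the unbounded counting that a finite automaton cannot do.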
Eliminating Ambiguity (1)
Grammar
  stmt -> if expr then stmt
        | if expr then stmt else stmt
        | other
is ambiguous.
Sentence: if E1 then if E2 then S1 else S2

Derivation 1 (else belongs to the inner if):
  stmt => if expr then stmt
       => if E1 then stmt
       => if E1 then if expr then stmt else stmt
       => if E1 then if E2 then stmt else stmt
       => if E1 then if E2 then S1 else stmt
       => if E1 then if E2 then S1 else S2

Derivation 2 (else belongs to the outer if):
  stmt => if expr then stmt else stmt
       => if E1 then stmt else stmt
       => if E1 then if expr then stmt else stmt
       => if E1 then if E2 then stmt else stmt
       => if E1 then if E2 then S1 else stmt
       => if E1 then if E2 then S1 else S2

Which one do we prefer?
Eliminating Ambiguity (2)
Grammar
  stmt -> if expr then stmt
        | if expr then stmt else stmt
        | other
is ambiguous.
Sentence: if E1 then if E2 then S1 else S2

Unambiguous version (each else is matched with the closest unmatched then):
  stmt -> matched_stmt | unmatched_stmt
  matched_stmt -> if expr then matched_stmt else matched_stmt
                | other
  unmatched_stmt -> if expr then stmt
                  | if expr then matched_stmt else unmatched_stmt
Left Recursion
- If for grammar G there is a derivation A =>+ Aα, for some string α, then G is left recursive
- Example:
    S -> Aa | b
    A -> Ac | Sd | ε
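Whether some A =>+ Aα exists can be checked mechanically. The sketch below is one possible approach, assuming a grammar is a dict from nonterminals to lists of right-hand sides (lists of symbols, with [] for ε); `left_recursive_nonterminals` is a hypothetical name.

```python
def left_recursive_nonterminals(grammar):
    """Return the nonterminals A for which A =>+ A α holds."""
    # Nonterminals that can derive the empty string.
    nullable = set()
    changed = True
    while changed:
        changed = False
        for head, bodies in grammar.items():
            if head not in nullable and any(
                all(s in nullable for s in body) for body in bodies
            ):
                nullable.add(head)
                changed = True

    # starts[A]: nonterminals that can become leftmost after one step from A.
    starts = {head: set() for head in grammar}
    for head, bodies in grammar.items():
        for body in bodies:
            for symbol in body:
                if symbol in grammar:
                    starts[head].add(symbol)
                if symbol not in nullable:
                    break  # symbols further right cannot become leftmost

    # Transitive closure: follow leftmost nonterminals over several steps.
    changed = True
    while changed:
        changed = False
        for head in grammar:
            for other in list(starts[head]):
                new = starts[other] - starts[head]
                if new:
                    starts[head] |= new
                    changed = True
    return {head for head in grammar if head in starts[head]}


EXAMPLE = {"S": [["A", "a"], ["b"]], "A": [["A", "c"], ["S", "d"], []]}
print(left_recursive_nonterminals(EXAMPLE))
```

In the slide's example A is immediately left recursive (A -> Ac), while S is left recursive only indirectly, through S => Aa => Sda.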
Parsing
- = determining whether a string of tokens can be generated by a grammar
- Two classes based on the order in which the parse tree is constructed:
  - Top-down parsing: start construction at the root of the parse tree
  - Bottom-up parsing: start at the leaves and proceed to the root
Recursive Descent Parsing
- A top-down method based on recursive procedures (typically one for each nonterminal)
  - May have to backtrack when the wrong production was picked
- Predictive parsing = a recursive descent parsing approach that avoids backtracking
  - More efficient
  - Uses (limited) lookahead to decide what productions to use
Predictive Parser
- Program with a (parsing) procedure for each nonterminal which
  - Decides what production to use (based on lookahead in the input)
  - Uses a production by mimicking the right side
Predictive Parser Example
type -> simple | ^ id | array [ simple ] of type
simple -> integer | char | num dotdot num

procedure match(t: token);
begin
  if lookahead = t then
    lookahead := nexttoken
  else
    error
end;

procedure type;
begin
  if lookahead is in {integer, char, num} then
    simple
  else if lookahead = ^ then begin
    match(^); match(id)
  end
  else if lookahead = array then begin
    match(array); match([); simple; match(]); match(of); type
  end
  else
    error
end;
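The same parser can be written as a runnable sketch. It assumes tokens arrive as plain strings and that "$" marks the end of input; the class name, the `simple` procedure (not shown on the slide) and the helper `parses_as_type` are illustrative additions.

```python
class PredictiveParser:
    """Recursive descent without backtracking for
         type   -> simple | ^ id | array [ simple ] of type
         simple -> integer | char | num dotdot num
    """

    def __init__(self, tokens):
        self.tokens = list(tokens) + ["$"]  # "$" = end marker (assumption)
        self.pos = 0

    @property
    def lookahead(self):
        return self.tokens[self.pos]

    def match(self, expected):
        if self.lookahead != expected:
            raise SyntaxError(f"expected {expected!r}, found {self.lookahead!r}")
        self.pos += 1

    def type_(self):
        # One token of lookahead selects the production.
        if self.lookahead in ("integer", "char", "num"):
            self.simple()
        elif self.lookahead == "^":
            self.match("^")
            self.match("id")
        elif self.lookahead == "array":
            self.match("array")
            self.match("[")
            self.simple()
            self.match("]")
            self.match("of")
            self.type_()
        else:
            raise SyntaxError(f"unexpected token {self.lookahead!r}")

    def simple(self):
        if self.lookahead in ("integer", "char"):
            self.match(self.lookahead)
        elif self.lookahead == "num":
            self.match("num")
            self.match("dotdot")
            self.match("num")
        else:
            raise SyntaxError(f"unexpected token {self.lookahead!r}")


def parses_as_type(tokens):
    """True if the whole token list is derivable from type."""
    parser = PredictiveParser(tokens)
    try:
        parser.type_()
        parser.match("$")
    except SyntaxError:
        return False
    return True


print(parses_as_type("array [ num dotdot num ] of integer".split()))
```

Each method body mirrors the right side of the production it applies: terminals become calls to `match`, nonterminals become calls to their own procedure.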
Predictive Parsing Obstacles
- expr -> expr + term
  - expr; match(+); term;
  - Infinite recursion (left recursion)
- stmt -> if expr then stmt else stmt | if expr then stmt
  - Common prefix
  - Can't predict production
- Solution
  - Eliminate left recursion
  - Left factoring
Eliminating Left Recursion (1)
- Simple case: immediate left recursion
  - Replace A -> Aα | β with
      A -> βA'
      A' -> αA' | ε
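The rewrite rule for immediate left recursion is short enough to state as code. The sketch assumes productions are lists of symbols with [] standing for ε, that no alternative is just A itself, and that the name A' is not already in use; the function name is hypothetical.

```python
def eliminate_immediate_left_recursion(head, bodies):
    """Rewrite A -> A α1 | ... | β1 | ... as A -> β A', A' -> α A' | ε.
    Returns a dict mapping each nonterminal to its list of bodies."""
    new_head = head + "'"
    alphas = [body[1:] for body in bodies if body[:1] == [head]]
    betas = [body for body in bodies if body[:1] != [head]]
    if not alphas:
        return {head: bodies}  # no immediate left recursion: nothing to do
    return {
        head: [beta + [new_head] for beta in betas],
        new_head: [alpha + [new_head] for alpha in alphas] + [[]],
    }


# expr -> expr + term | term
print(eliminate_immediate_left_recursion("expr", [["expr", "+", "term"], ["term"]]))
```

The new grammar generates the same strings, but the repetition of α is now produced by right recursion in A', which a recursive descent parser can follow without looping.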
Eliminating Left Recursion (2)
Order the nonterminals A1 .. An
for i := 1 to n do begin
  for j := 1 to i-1 do begin
    replace each production of the form Ai -> Aj γ
    by the productions Ai -> δ1 γ | δ2 γ | ... | δk γ
    where Aj -> δ1 | δ2 | ... | δk are all current Aj productions
  end
  eliminate immediate left recursion among the Ai productions
end
Example Eliminating Left Recursion
S -> Aa | b
A -> Ac | Sd | ε
Order: S, A

i = 1: no immediate left recursion among the S productions
i = 2, j = 1: eliminate A -> Sγ
  Replace A -> Sd with A -> Aad | bd, giving A -> Ac | Aad | bd | ε
Eliminate immediate left recursion among the A productions:
  S -> Aa | b
  A -> bdA' | A'
  A' -> cA' | adA' | ε
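The worked example can be reproduced with a direct transcription of the algorithm. This is a sketch under assumptions: grammars are dicts from nonterminals to lists of bodies ([] for ε), new nonterminals are named by appending a prime, and the input has no cycles. The textbook statement of this algorithm usually also excludes ε-productions; the example contains A -> ε and still comes out free of left recursion.

```python
def eliminate_left_recursion(grammar, order):
    """Apply the substitution algorithm for the nonterminals in the given order."""
    g = {head: [list(body) for body in bodies] for head, bodies in grammar.items()}
    for i, ai in enumerate(order):
        for aj in order[:i]:
            expanded = []
            for body in g[ai]:
                if body[:1] == [aj]:
                    # Ai -> Aj γ becomes Ai -> δ1 γ | ... | δk γ
                    expanded.extend(delta + body[1:] for delta in g[aj])
                else:
                    expanded.append(body)
            g[ai] = expanded
        # Eliminate immediate left recursion among the Ai productions.
        alphas = [body[1:] for body in g[ai] if body[:1] == [ai]]
        betas = [body for body in g[ai] if body[:1] != [ai]]
        if alphas:
            new = ai + "'"
            g[ai] = [beta + [new] for beta in betas]
            g[new] = [alpha + [new] for alpha in alphas] + [[]]
    return g


EXAMPLE = {"S": [["A", "a"], ["b"]], "A": [["A", "c"], ["S", "d"], []]}
result = eliminate_left_recursion(EXAMPLE, ["S", "A"])
for head, bodies in result.items():
    print(head, "->", " | ".join(" ".join(body) or "ε" for body in bodies))
```

The result depends on the chosen order of nonterminals; a different order gives a different but equivalent grammar.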
Left Factoring
- Find the longest common prefix of the alternatives, factor it out, and move the differing remainders into a new nonterminal
  - stmt -> if expr then stmt stmt'
  - stmt' -> else stmt | ε
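One round of left factoring can be sketched as follows. As a simplification, the sketch groups alternatives by their first symbol and factors out the longest prefix common to the whole group; the alternative `other` is taken from the dangling-else grammar earlier in the lecture, and the function names are hypothetical.

```python
def common_prefix(bodies):
    """Longest prefix shared by all bodies (lists of symbols)."""
    prefix = []
    for symbols in zip(*bodies):
        if len(set(symbols)) != 1:
            break
        prefix.append(symbols[0])
    return prefix


def left_factor(head, bodies):
    """One round of left factoring for the productions head -> bodies."""
    groups = {}
    for body in bodies:
        groups.setdefault(tuple(body[:1]), []).append(body)

    result = {head: []}
    new_head = head
    for first, group in groups.items():
        if len(group) < 2 or not first:
            result[head].extend(group)  # nothing to factor
            continue
        prefix = common_prefix(group)
        new_head += "'"                 # fresh name: head', head'', ...
        result[head].append(prefix + [new_head])
        result[new_head] = [body[len(prefix):] for body in group]
    return result


STMT = [
    ["if", "expr", "then", "stmt", "else", "stmt"],
    ["if", "expr", "then", "stmt"],
    ["other"],
]
print(left_factor("stmt", STMT))
```

After factoring, the parser commits to `if expr then stmt` first and decides between `else stmt` and ε only when it reaches that point in the input.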
Transition Diagrams
- Create initial and final state
- For each production A -> X1 X2 ... Xn create a path from the initial to the final state, with edges labeled X1, X2, ..., Xn
[Figure: transition diagram for E; surviving labels: states 0, 3, 6 and edges T, +, ε]
Non-recursive Predictive Parsers
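- Avoid recursion for efficiency reasons
- Typically built automatically by tools
[Figure: model of a non-recursive predictive parser. The predictive parsing program reads the input (e.g. a + b $), keeps a stack (e.g. X Y Z $), consults the parsing table M, and produces the output.]
- M[A, a] gives the production to apply, where A is the symbol on top of the stack and a the current input symbol (or $)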
Parsing Algorithm
- X symbol on top of stack, a current input symbol
- Stack contents and remaining input are called the parser configuration (initially $S on the stack and the complete input string)
1. If X = a = $: halt and announce success
2. If X = a ≠ $: pop X off the stack and advance the input to the next symbol
3. If X is a nonterminal: use M[X, a], which contains a production X -> rhs or error
   - Replace X on the stack with rhs or call the error routine, respectively, e.g. for X -> UVW replace X with WVU (U on top)
   - Output the production (or augment the parse tree)
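The three cases can be turned into a small driver loop. The sketch hard-codes a parsing table for the expression grammar E -> TE', E' -> +TE' | ε, T -> FT', T' -> *FT' | ε, F -> (E) | id used in the Example slide of this lecture; the table entries were derived by hand from that grammar, and `ll1_parse` is a hypothetical name.

```python
# M[A, a] -> right-hand side; a missing key is an error entry.
TABLE = {
    ("E", "id"): ["T", "E'"], ("E", "("): ["T", "E'"],
    ("E'", "+"): ["+", "T", "E'"], ("E'", ")"): [], ("E'", "$"): [],
    ("T", "id"): ["F", "T'"], ("T", "("): ["F", "T'"],
    ("T'", "+"): [], ("T'", "*"): ["*", "F", "T'"],
    ("T'", ")"): [], ("T'", "$"): [],
    ("F", "id"): ["id"], ("F", "("): ["(", "E", ")"],
}
NONTERMINALS = {"E", "E'", "T", "T'", "F"}


def ll1_parse(tokens, start="E"):
    """Return the productions used (a leftmost derivation) or raise SyntaxError."""
    buffer = list(tokens) + ["$"]
    stack = ["$", start]  # the top of the stack is the end of the list
    pos, output = 0, []
    while True:
        x, a = stack[-1], buffer[pos]
        if x == a == "$":
            return output                    # case 1: success
        if x not in NONTERMINALS:
            if x != a:
                raise SyntaxError(f"expected {x!r}, found {a!r}")
            stack.pop()                      # case 2: pop X, advance input
            pos += 1
        else:
            rhs = TABLE.get((x, a))
            if rhs is None:
                raise SyntaxError(f"error entry M[{x}, {a}]")
            stack.pop()                      # case 3: replace X by rhs,
            stack.extend(reversed(rhs))      # leftmost symbol on top
            output.append((x, rhs))


for head, rhs in ll1_parse("id + id * id".split()):
    print(head, "->", " ".join(rhs) or "ε")
```

The explicit stack plays the role that the call stack plays in a recursive descent parser, so the driver itself is independent of the grammar; only the table changes.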
Construction of Parsing Table Helpers (1)
- First(α) := set of terminals that begin strings derived from α
  - First(X) = {X} for terminal X
  - If X -> ε is a production, add ε to First(X)
  - For X -> Y1 ... Yk place a in First(X) if a is in First(Yi) and ε ∈ First(Yj) for j = 1..i-1; if ε ∈ First(Yj) for j = 1..k, add ε to First(X)
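These rules are applied repeatedly until no set changes. The sketch below is one way to do that, assuming a grammar is a dict from nonterminals to lists of bodies ([] for ε) and every other symbol is a terminal; it is run on the expression grammar from the Example slide of this lecture. `compute_first` is a hypothetical name.

```python
EPS = "ε"

GRAMMAR = {
    "E": [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],
    "T": [["F", "T'"]],
    "T'": [["*", "F", "T'"], []],
    "F": [["(", "E", ")"], ["id"]],
}


def compute_first(grammar):
    """Return FIRST(X) for every nonterminal X of the grammar."""
    first = {head: set() for head in grammar}

    def first_of(symbol):
        # FIRST(a) = {a} for a terminal a.
        return first[symbol] if symbol in grammar else {symbol}

    changed = True
    while changed:  # iterate until no set grows any more
        changed = False
        for head, bodies in grammar.items():
            for body in bodies:
                before = len(first[head])
                for symbol in body:
                    first[head] |= first_of(symbol) - {EPS}
                    if EPS not in first_of(symbol):
                        break  # Yi cannot vanish: later symbols are irrelevant
                else:
                    first[head].add(EPS)  # all Yi can vanish, or the body is ε
                changed |= len(first[head]) != before
    return first


for head, symbols in compute_first(GRAMMAR).items():
    print(f"FIRST({head}) =", sorted(symbols))
```

The fixed-point loop is needed because FIRST(E) depends on FIRST(T), which depends on FIRST(F); a single pass in an unlucky order would miss symbols.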
Construction of Parsing Table Helpers (2)
- Follow(A) := set of terminals a that can appear immediately to the right of A in some sentential form, i.e., S =>* αAaβ for some α, β (a can include $)
  - Place $ in Follow(S), S start symbol, $ right end marker
  - If there is a production A -> αBβ, put everything in First(β) except ε in Follow(B)
  - If there is a production A -> αB, or A -> αBβ where ε is in First(β), then everything in Follow(A) is in Follow(B)
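The FOLLOW rules can also be iterated to a fixed point. To stay short, the sketch takes the FIRST sets of the expression grammar as given (they are the ones listed on the Example slide of this lecture) instead of computing them; the names `first_of_string` and `compute_follow` are hypothetical.

```python
EPS, END = "ε", "$"

GRAMMAR = {
    "E": [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],
    "T": [["F", "T'"]],
    "T'": [["*", "F", "T'"], []],
    "F": [["(", "E", ")"], ["id"]],
}
FIRST = {
    "E": {"(", "id"}, "T": {"(", "id"}, "F": {"(", "id"},
    "E'": {"+", EPS}, "T'": {"*", EPS},
}


def first_of_string(symbols):
    """FIRST of a string β of grammar symbols."""
    result = set()
    for symbol in symbols:
        f = FIRST.get(symbol, {symbol})  # terminal a: FIRST(a) = {a}
        result |= f - {EPS}
        if EPS not in f:
            return result
    result.add(EPS)  # β is empty or every symbol of β can vanish
    return result


def compute_follow(grammar, start):
    follow = {head: set() for head in grammar}
    follow[start].add(END)  # rule 1: $ follows the start symbol
    changed = True
    while changed:
        changed = False
        for head, bodies in grammar.items():
            for body in bodies:
                for i, b in enumerate(body):
                    if b not in grammar:
                        continue  # FOLLOW is defined for nonterminals only
                    beta = first_of_string(body[i + 1:])
                    new = beta - {EPS}       # rule 2
                    if EPS in beta:          # rule 3
                        new |= follow[head]
                    if not new <= follow[b]:
                        follow[b] |= new
                        changed = True
    return follow


for head, symbols in compute_follow(GRAMMAR, "E").items():
    print(f"FOLLOW({head}) =", sorted(symbols))
```

Unlike FIRST, a FOLLOW set never contains ε; the end marker $ takes the place of "nothing follows".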
Construction Algorithm
Input: Grammar G
Output: Parsing table M
For each production A -> α do
  For each terminal a in FIRST(α), add A -> α to M[A, a]
  If ε is in FIRST(α), add A -> α to M[A, b] for each terminal b in FOLLOW(A)
    ($ counts as a terminal in this step)
Make each undefined entry of M an error entry
Example
E -> TE'
E' -> +TE' | ε
T -> FT'
T' -> *FT' | ε
F -> (E) | id

FIRST(E) = FIRST(T) = FIRST(F) = {(, id}
FIRST(E') = {+, ε}
FIRST(T') = {*, ε}
FOLLOW(E) = FOLLOW(E') = {), $}
FOLLOW(T) = FOLLOW(T') = {+, ), $}
FOLLOW(F) = {+, *, ), $}

[Table: parsing table M with rows E, E', T, T', F and one column per input symbol id, +, *, (, ), $; the entries did not survive extraction]
LL(1)
- A grammar whose parsing table has no multiply defined entries is said to be LL(1)
  - First L = left-to-right input scanning
  - Second L = leftmost derivation
  - (1) = 1 token lookahead
- Not all grammars can be brought to LL(1) form, i.e., there are languages that do not fall into the LL(1) class
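The table construction and the LL(1) test fit in a few lines once FIRST and FOLLOW are known. The sketch takes both from the Example slide for the expression grammar rather than computing them; `build_table` and `is_ll1` are hypothetical names, and a table entry is kept as a list so that multiply defined entries become visible.

```python
EPS = "ε"

GRAMMAR = {
    "E": [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],
    "T": [["F", "T'"]],
    "T'": [["*", "F", "T'"], []],
    "F": [["(", "E", ")"], ["id"]],
}
FIRST = {
    "E": {"(", "id"}, "T": {"(", "id"}, "F": {"(", "id"},
    "E'": {"+", EPS}, "T'": {"*", EPS},
}
FOLLOW = {
    "E": {")", "$"}, "E'": {")", "$"},
    "T": {"+", ")", "$"}, "T'": {"+", ")", "$"},
    "F": {"+", "*", ")", "$"},
}


def first_of_string(symbols):
    """FIRST of a string α of grammar symbols."""
    result = set()
    for symbol in symbols:
        f = FIRST.get(symbol, {symbol})  # terminal a: FIRST(a) = {a}
        result |= f - {EPS}
        if EPS not in f:
            return result
    result.add(EPS)
    return result


def build_table(grammar):
    """M[A, a] -> list of right-hand sides; a missing key is an error entry."""
    table = {}
    for head, bodies in grammar.items():
        for body in bodies:
            f = first_of_string(body)
            targets = f - {EPS}
            if EPS in f:
                targets |= FOLLOW[head]  # includes $ where applicable
            for a in targets:
                table.setdefault((head, a), []).append(body)
    return table


def is_ll1(table):
    """LL(1) means no entry holds more than one production."""
    return all(len(entries) == 1 for entries in table.values())


table = build_table(GRAMMAR)
for (head, a), entries in sorted(table.items()):
    print(f"M[{head}, {a}] =", " | ".join(" ".join(b) or "ε" for b in entries))
print("LL(1):", is_ll1(table))
```

A grammar that is ambiguous or left recursive would produce at least one entry with two productions, which is how table generators report that a grammar is not LL(1).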