Syntax and Semantics
• The syntax analysis part of a compiler is a recognizer for the language the compiler translates.
• John Backus and Noam Chomsky separately invented a notation that has become the most widely used method for formally describing programming language syntax.
• Backus’s notation was later modified slightly by Peter Naur for the description
of ALGOL 60. This revised notation became known as Backus-Naur form, or
simply BNF.
• BNF uses abstractions for syntactic structures. A simple Java assignment, for
example, might be represented by the abstraction <assign>. The definition of
<assign> is given by a rule or production:
<assign> → <var> = <expression>
• Each rule has a left-hand side (LHS) and a right-hand side (RHS). The LHS is the abstraction being defined. The RHS consists of tokens, lexemes, and references to other abstractions.
• Nonterminal symbols can have more than one definition. Multiple definitions
can be written as a single rule, with the definitions separated by the symbol |.
• A rule is recursive if its LHS appears in its RHS. Recursion is often used when
a nonterminal represents a variable-length list of items.
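A recursive list rule can be recognized directly by a recursive function. The Python sketch below uses the hypothetical rule <ident_list> → identifier | identifier , <ident_list>:

```python
# Recognizer for the hypothetical recursive rule:
#   <ident_list> -> identifier | identifier , <ident_list>
def is_ident_list(tokens):
    """Return True if tokens form a comma-separated identifier list."""
    if not tokens or not tokens[0].isidentifier():
        return False
    if len(tokens) == 1:
        return True                       # <ident_list> -> identifier
    if tokens[1] == ",":
        return is_ident_list(tokens[2:])  # identifier , <ident_list>
    return False
```

Each recursive call consumes one "identifier ," prefix, mirroring how the rule generates a list of any length.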
• Every internal node of a parse tree is labeled with a nonterminal symbol; every
leaf is labeled with a terminal symbol.
• A grammar that generates a sentence for which there are two or more distinct
parse trees is said to be ambiguous.
• The connection between parse trees and derivations is very close; either can easily be constructed from the other.
• The parse tree shows the left addition operator lower than the right addition
operator. This is the correct order if addition is meant to be left associative.
• When a BNF rule has its LHS also appearing at the beginning of its RHS, the
rule is said to be left recursive. Left recursion corresponds to left associativity.
• The rule for if constructs in most languages is that an else clause is matched
with the nearest previous unmatched if.
• To make the grammar unambiguous, two new nonterminals are added, representing matched statements and unmatched statements:
<stmt> → <matched> | <unmatched>
<matched> → if ( <logic_expr> ) <matched> else <matched>
| any non-if statement
<unmatched> → if ( <logic_expr> ) <stmt>
| if ( <logic_expr> ) <matched> else <unmatched>
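One way to see why the "nearest unmatched if" rule is natural: a recursive-descent parser that greedily consumes an optional else after each if automatically attaches the else to the innermost if. A minimal Python sketch (the token layout and function name are illustrative, with conditions and statements reduced to single tokens):

```python
# Parse a statement; return (tree, next_index).
# Tokens: 'if', '(', <cond>, ')', 'else', or a bare statement name.
def parse_stmt(tokens, i=0):
    if tokens[i] == "if":
        cond = tokens[i + 2]                       # skip 'if' and '('
        then_part, i = parse_stmt(tokens, i + 4)   # skip ')' as well
        else_part = None
        if i < len(tokens) and tokens[i] == "else":
            else_part, i = parse_stmt(tokens, i + 1)
        return ("if", cond, then_part, else_part), i
    return tokens[i], i + 1
```

For if (e1) if (e2) s1 else s2, the else ends up inside the inner if node, matching the rule above.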
• Most extended versions are called Extended BNF, or simply EBNF, even though
they are not all exactly the same.
• The brackets, braces, and parentheses are metasymbols. In cases where these metasymbols are also terminal symbols in the language being described, the instances that are terminal symbols can be underlined or quoted.
• Although EBNF is more concise than BNF, it does not convey as much information. For example, the BNF rule
<expr> → <expr> + <term>
forces the + operator to be left associative, whereas the EBNF rule
<expr> → <term> {+ <term>}
does not.
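In practice, a parser written from the EBNF rule restores left associativity by how it builds the tree inside the repetition loop. A minimal Python sketch with the operands already parsed (names illustrative):

```python
# Fold a list of operands left-associatively, as a parser for
#   <expr> -> <term> {+ <term>}
# would do inside its repetition loop.
def parse_expr(terms):
    tree = terms[0]
    for t in terms[1:]:         # one iteration per '+ <term>'
        tree = ("+", tree, t)   # the old tree becomes the LEFT child
    return tree
```

parse_expr(["a", "b", "c"]) yields ("+", ("+", "a", "b"), "c"), the left-associative shape.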
• Some versions use a plus (+) superscript to indicate one or more repetitions:
<compound> → begin {<stmt>}+ end
• A different version of EBNF is often used for describing the syntax of C-like
languages.
• A colon is used instead of an arrow, and the RHS of a rule is placed on the next
line.
• Alternative RHSs are placed on separate lines (vertical bars are not used).
• Alternative RHSs can also be indicated by using the words “one of”:
AssignmentOperator: one of
= *= /= %= += -= <<= >>= >>>= &= ^= |=
• A number of software systems have been developed that construct parsers automatically from grammar descriptions. yacc (yet another compiler-compiler) was one of the first.
• Attribute grammars are useful because some language rules (such as type compatibility) are difficult to specify with BNF.
• Other language rules cannot be specified in BNF at all, such as the rule that all
variables must be declared before they are referenced.
• Rules such as these are considered to be part of the static semantics of a language, not part of the language’s syntax. The term “static” indicates that these rules can be checked at compile time.
• Attribute grammars, designed by Donald Knuth, can describe both syntax and
static semantics.
• Attribute grammars are context-free grammars to which the following have been
added:
Attributes are properties that can have values assigned to them.
Attribute computation functions (semantic functions) specify how attribute
values are computed.
Predicate functions state the static semantic rules of the language.
• For each grammar symbol X, the set of attributes A(X) consists of two disjoint sets S(X) and I(X), called synthesized and inherited attributes, respectively.
Synthesized attributes are used to pass semantic information up a parse tree.
Inherited attributes pass semantic information down and across a tree.
• Inherited attributes of symbols Xj, 1 ≤ j ≤ n (in the rule X0 → X1…Xn), are computed with a semantic function of the form I(Xj) = f(A(X0),…,A(Xn)).
To avoid circularity, inherited attributes are often restricted to functions of the form I(Xj) = f(A(X0),…,A(Xj–1)).
• A parse tree of an attribute grammar is the parse tree based on its underlying
BNF grammar, with a possibly empty set of attribute values attached to each
node.
• If all the attribute values in a parse tree have been computed, the tree is said to be
fully attributed.
• Intrinsic attributes are synthesized attributes of leaf nodes whose values are
determined outside the parse tree (perhaps coming from the compiler’s symbol
table).
• Initially, the only attributes with values are the intrinsic attributes of the leaf
nodes. The semantic functions can then be used to compute the remaining
attribute values.
• The following fragment of an attribute grammar describes the rule that the name
on the end of an Ada procedure must match the procedure’s name:
Syntax rule: <proc_def> → procedure <proc_name>[1]
<proc_body> end <proc_name>[2] ;
Predicate: <proc_name>[1].string == <proc_name>[2].string
Nonterminals that appear more than once in a rule are subscripted to distinguish
them.
• An attribute grammar can be used to check the type rules of a simple assignment
statement with the following syntax and semantics:
The only variable names are A, B, and C.
The right side of an assignment can be a variable or the sum of two variables.
Variables are either int or real.
Variables that are added need not have the same type. If the types are different,
the result has type real.
The variable on the left side of an assignment must have the same type as the
expression on the right side.
• The process of decorating the parse tree with attributes could proceed in a completely top-down order if all attributes were inherited.
• Because our grammar has both synthesized and inherited attributes, the evaluation process cannot be in any single direction. One possible order for attribute evaluation:
1. <var>.actual_type ← look-up(A) (Rule 4)
2. <expr>.expected_type ← <var>.actual_type (Rule 1)
3. <var>[2].actual_type ← look-up(A) (Rule 4)
<var>[3].actual_type ← look-up(B) (Rule 4)
4. <expr>.actual_type ← either int or real (Rule 2)
5. <expr>.expected_type == <expr>.actual_type is either TRUE or FALSE
(Rule 2)
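The evaluation order above can be sketched as ordinary code. The symbol table contents and function names below are illustrative assumptions; the Rule numbers refer to the attribute grammar discussed in the text:

```python
# Hypothetical symbol table for the variables A, B, and C.
symbol_table = {"A": "real", "B": "int", "C": "real"}

def look_up(name):
    return symbol_table[name]

def check_assignment(lhs, op1, op2):
    """Type-check 'lhs = op1 + op2'; return (actual_type, predicate_holds)."""
    var_actual = look_up(lhs)                             # Rule 4
    expr_expected = var_actual                            # Rule 1
    t1, t2 = look_up(op1), look_up(op2)                   # Rule 4
    expr_actual = "int" if t1 == t2 == "int" else "real"  # Rule 2
    return expr_actual, expr_expected == expr_actual      # Rule 2 predicate
```

With the table above, check_assignment("A", "A", "B") yields ("real", True), while assigning the same sum to the int variable B would make the predicate fail.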
• The following figure shows the flow of attribute values. Solid lines are used for
the parse tree; dashed lines show attribute flow.
• The following tree shows the final attribute values on the nodes.
• Attribute grammars have been used in a variety of applications, not all of which
involve describing the syntax and static semantics of programming languages.
• Although not every compiler writer uses attribute grammars, the underlying concepts are vital for constructing compilers.
• Semantics are typically described in English. Such descriptions are often imprecise and incomplete.
• Each construct in the intermediate language must have an obvious and unambiguous meaning.
• The following statements would be adequate for describing the semantics of the
simple control statements of a typical programming language:
ident = var
ident = ident + 1
ident = ident - 1
goto label
if var relop var goto label
relop is a relational operator, ident is an identifier, and var is either an identifier
or a constant.
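A tiny interpreter for these statements makes the operational approach concrete. The tuple encoding of instructions below is an assumption for illustration only:

```python
# Interpreter for the intermediate statements:
#   ident = var / ident = ident + 1 / ident = ident - 1 /
#   goto label / if var relop var goto label
# A program is a list of instruction tuples; labels map names to indices.
def run(program, labels, env):
    def value(v):                  # var is an identifier or a constant
        return env.get(v, v)
    pc = 0
    while pc < len(program):
        instr = program[pc]
        op = instr[0]
        if op == "set":            # ident = var
            env[instr[1]] = value(instr[2])
        elif op == "inc":          # ident = ident + 1
            env[instr[1]] += 1
        elif op == "dec":          # ident = ident - 1
            env[instr[1]] -= 1
        elif op == "goto":         # goto label
            pc = labels[instr[1]]
            continue
        elif op == "if":           # if var relop var goto label
            _, a, relop, b, label = instr
            va, vb = value(a), value(b)
            if {"<": va < vb, "=": va == vb, ">": va > vb}[relop]:
                pc = labels[label]
                continue
        pc += 1
    return env
```

For example, the program [("set", "y", 0), ("if", "y", "=", "x", "out"), ("inc", "y"), ("goto", "loop")] with labels {"loop": 1, "out": 4} counts y up from 0 to x.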
• Adding a few more instructions would allow the semantics of arrays, records,
pointers, and subprograms to be described.
• The first and most significant use of formal operational semantics was to
describe the semantics of PL/I. The abstract machine and the translation rules for
PL/I were together named the Vienna Definition Language (VDL).
• Operational semantics can be effective as long as the descriptions are simple and
informal. The VDL description of PL/I, unfortunately, is so complex that it
serves no practical purpose.
• The term denotational comes from the fact that mathematical objects “denote”
the meaning of syntactic entities.
• The meanings, or denoted objects (integers in this case), can be attached to the nodes of the parse tree.
• A program state s is a set of pairs {<i1, v1>, <i2, v2>, …, <in, vn>}, where each ij is a variable name and vj is its current value. Any of the v’s can have the special value undef, which indicates that its associated variable is currently undefined.
• Let VARMAP be a function of two parameters, a variable name and the program state. The value of VARMAP(ij, s) is vj.
• Most semantics mapping functions for language constructs map states to states.
These state changes are used to define the meanings of the constructs.
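This state-to-state view can be sketched in Python, modeling a state as a dictionary and the meaning of an expression as a function from states to values (all names here are illustrative):

```python
UNDEF = "undef"   # value of a variable that has not been assigned

def VARMAP(name, state):
    """Look up the value bound to name in the given state."""
    return state.get(name, UNDEF)

def M_assign(name, expr):
    """Meaning of 'name = expr': a function mapping a state to a new state."""
    def meaning(state):
        new_state = dict(state)        # the old state is left unchanged
        new_state[name] = expr(state)  # expr: state -> value
        return new_state
    return meaning
```

For instance, M_assign("y", lambda s: VARMAP("x", s) + 1) applied to the state {"x": 3} yields a state in which y is 4.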
• Denotational descriptions are of little use to language users. On the other hand,
they provide an excellent way to describe the semantics of a language concisely.
• Example of a postcondition:
sum = 2 * x + 1 {sum > 1}
In this and later examples, all variables are assumed to have integer type.
• The weakest precondition is the least restrictive precondition that will guarantee the validity of the associated postcondition.
{x > 10}, {x > 50}, and {x > 1000} are all valid preconditions. The weakest precondition is {x > 0}.
• Example:
a = b / 2 - 1 {a < 10}
The weakest precondition is {b < 22}.
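The assignment axiom obtains such preconditions by substituting the right-hand side for the assigned variable in the postcondition. A sketch, modeling assertions as functions of the state (the names are illustrative):

```python
# wp(name = expr, post) is post with expr substituted for name;
# modeling assertions as state predicates turns substitution into
# function composition.
def wp_assign(name, expr, post):
    def pre(state):
        after = dict(state)
        after[name] = expr(state)  # effect of the assignment
        return post(after)         # Q[name := expr]
    return pre

# a = b / 2 - 1 {a < 10} gives the weakest precondition {b < 22}
pre = wp_assign("a", lambda s: s["b"] / 2 - 1, lambda s: s["a"] < 10)
```

Here pre({"b": 21}) holds and pre({"b": 22}) does not, matching {b < 22}.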
• The appearance of the left side of an assignment statement in its right side does
not affect the process of computing the weakest precondition.
• Example:
x = x + y - 3 {x > 10}
The assignment axiom produces the precondition {x + y > 13}.
• Similarly, for x = x - 3 {x > 0} the axiom produces {x > 3}. A proof, however, may need to use the stronger precondition {x > 5}, which is not the same as the assertion produced by the assignment axiom. Using such a precondition in a proof requires an inference rule named the rule of consequence.
• Rule of consequence:
{P} S {Q}, P' => P, Q => Q'
---------------------------
{P'} S {Q'}
The => symbol means “implies.” S can be any program statement.
• The rule of consequence says that a postcondition can always be weakened and a
precondition can always be strengthened.
• Example:
if x > 0 then
y = y - 1
else
y = y + 1
Assume that the postcondition is {y > 0}. Applying the axiom for assignment to
the then clause
y = y - 1 {y > 0}
produces the precondition {y - 1 > 0}. Applying the same axiom to the else
clause
y = y + 1 {y > 0}
produces {y + 1 > 0}. Because {y - 1 > 0} => {y + 1 > 0}, the rule of consequence allows {y - 1 > 0} to be used as the precondition of the selection statement.
• Computing the weakest precondition for a logical pretest (while) loop is inherently difficult because of the need to find a loop invariant.
• Another complicating factor for while loops is the question of loop termination.
Proving total correctness involves showing that the loop satisfies the specified postcondition and always terminates.
Proving partial correctness involves showing that the loop satisfies the specified postcondition, without proving that it always terminates.
• This loop invariant can also be used as the precondition for the while statement. The goal is now to show that the following holds:
{y <= x} while y <> x do y = y + 1 end {y = x}
• It is easy to show that the first three criteria (the precondition implies the invariant, the loop body preserves the invariant, and the invariant together with the loop exit condition implies the postcondition) are satisfied. Loop termination is also clear, since y increases with each iteration until it eventually equals x.
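These claims can be spot-checked by executing the loop and asserting the invariant after each iteration, over small sample states that satisfy the precondition:

```python
# Check that y <= x is preserved by the body of
#   while y <> x do y = y + 1 end
# and that the loop establishes the postcondition {y = x}.
def run_loop(x, y):
    assert y <= x              # precondition (also the invariant)
    while y != x:
        y = y + 1
        assert y <= x          # invariant holds after each iteration
    return y                   # here y = x

for x0 in range(5):
    for y0 in range(x0 + 1):   # all small states with y0 <= x0
        assert run_loop(x0, y0) == x0
```

This is a test, not a proof; the axiomatic argument covers all states at once.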
• Informal proof for the sequence t = x; x = y; y = t, with precondition {x = A AND y = B} and postcondition {x = B AND y = A}:
Applying the assignment axiom to the last statement yields the precondition
{x = B AND t = A}
Using this as the postcondition for the middle statement yields the precondition
{y = B AND t = A}
Finally, this can be used as the postcondition for the first statement, which yields
the precondition
{y = B AND x = A}
• Formal proof:
1. {y = B AND x = A} t = x; {y = B AND t = A} Assignment axiom
2. (x = A AND y = B) => (y = B AND x = A)
3. (y = B AND t = A) => (y = B AND t = A)
4. {x = A AND y = B} t = x; {y = B AND t = A} Rule of consequence (1, 2, 3)
5. {y = B AND t = A} x = y; {x = B AND t = A} Assignment axiom
6. {x = A AND y = B} t = x; x = y; {x = B AND t = A} Sequence rule (4, 5)
7. {x = B AND t = A} y = t; {x = B AND y = A} Assignment axiom
8. {x = A AND y = B} t = x; x = y; y = t; {x = B AND y = A}
Sequence rule (6, 7)
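The proved triple can also be sanity-checked by running the three assignments over sample values of A and B:

```python
# {x = A AND y = B} t = x; x = y; y = t {x = B AND y = A}
def swap(x, y):
    t = x      # {y = B AND t = A}
    x = y      # {x = B AND t = A}
    y = t      # {x = B AND y = A}
    return x, y

for A in range(3):
    for B in range(3):
        assert swap(A, B) == (B, A)   # postcondition holds
```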
• The loop computes the factorial function with the last multiplication performed first, so part of the invariant can be
fact = (count + 1) * (count + 2) * … * (n - 1) * n
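Assuming the loop has the common form fact = 1; count = n; while count <> 0 do fact = fact * count; count = count - 1 end (the loop body is not shown here, so this form is an assumption), the invariant is equivalent to fact * count! = n! and can be spot-checked:

```python
import math

def factorial(n):
    fact, count = 1, n
    while count != 0:
        fact = fact * count
        count = count - 1
        # invariant: fact = (count + 1) * (count + 2) * ... * n,
        # i.e. fact * count! = n!
        assert fact * math.factorial(count) == math.factorial(n)
    return fact
```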