
PEG GrammarExplorer

This document introduces Parsing Expression Grammars (PEGs) and a C# library for implementing PEG parsers. PEGs are a technique for parsing text and binary sources using executable grammars. The library includes classes for parsing text and binary formats using PEGs. It supports features like error handling, evaluation during parsing, and generating parse trees or abstract syntax trees. Sample PEG parsers included demonstrate parsing JSON, building parse trees, direct evaluation, and generating C# source code for a PEG parser.

Uploaded by vinhxuann
© Attribution Non-Commercial (BY-NC)

Introduction

This is the first part of a series of articles which cover the parsing technique Parsing Expression Grammars. This part introduces a support library and a parser generator for C# 3.0. The support library consists of the classes PegCharParser and PegByteParser, which are for parsing text and binary sources and which support user defined error handling, direct evaluation during parsing, parse tree generation and abstract syntax tree generation. Using these base classes results in fast parsers which are easy to understand, easy to extend and well integrated into the hosting C# program.

The underlying parsing methodology, called Parsing Expression Grammar [1][2][3], is relatively new (first described in 2004), but already has many implementations. Parsing Expression Grammars (PEG) can be implemented easily in any programming language, but fit especially well into languages with a rich expression syntax like functional languages and functionally enhanced imperative languages (like C# 3.0), because PEG concepts have a close relationship to mutually recursive function calls, short-circuit boolean expressions and in-place defined functions (lambdas). A new trend in parsing is the integration of parsers into a host language, so that the semantic gap between grammar notation and implementation in the host language is as small as possible (Perl 6 and boost::spirit are forerunners of this trend). Parsing Expression Grammars are especially well suited when striving for this goal. Earlier grammar formalisms were not so easy to implement, so that one grammar rule could result in dozens of code lines; in some parsing strategies the relationship between grammar rule and implementation code was even lost. This is the reason that, until recently, generators were used to build parsers. This article shows how the C# 3.0 lambda facility can be used to implement a support library for Parsing Expression Grammars which makes parsing with the PEG technique easy.
When using this library, a PEG grammar is mapped to a C# grammar class which inherits basic functionality from a PEG base class, and each PEG grammar rule is mapped to a method in the C# grammar class. Parsers implemented with this library should be fast (provided the C# compiler inlines methods whenever possible), easy to understand and easy to extend. Error diagnosis, generation of a parse tree and the addition of semantic actions are also supported by this library. The most striking property of PEG, and especially of this library, is the small footprint and the lack of any administrative overhead. The main emphasis of this article is on explaining the PEG framework and on studying concrete application samples. One of the sample applications is a PEG parser generator which generates C# source code. The PEG parser generator is the only sample parser which has been written manually; all other sample parsers were generated by the parser generator.

Contents

Introduction

Parsing Expression Grammar Tutorial
  - Parsing Expression Grammars Basics
  - Parsing Expression Grammars Particularities and Idioms
  - Integrating Semantic Actions into a PEG Framework
  - Parsing Expression Grammars Exposed
Parsing Expression Grammar Implementations
  - General Implementation Strategy for PEG
  - Parsing Expression Grammars Mapped to C# 1.0
  - Parsing Expression Grammars Mapped to C# 3.0
Parsing Expression Grammar Examples
  - Json Checker (Recognize only)
  - Json Tree (Build Tree)
  - Basic Encode Rules (Direct Evaluation + Build Tree)
  - Scientific Calculator (Build Tree + Evaluate Tree)
PEG Parser Generator
  - A PEG Generator Implemented with PEG
  - PEG Parser Generator Grammar
  - The PEG Parser Generator's Handling of Semantic Blocks
Parsing Expression Grammars in Perspective
  - Parsing Expression Grammars Tuning
  - Comparison of PEG Parsing with Other Parsing Techniques
  - Translating LR Grammars to PEG Grammars
Future Developments

Parsing Expression Grammar Tutorial


Parsing Expression Grammars are a kind of executable grammar. Execution of a PEG grammar means that grammar patterns matching the input string advance the current input position accordingly. Mismatches are handled by going back to a previous input position, where parsing eventually continues with an alternative. The following subchapters explain PEGs in detail and introduce the basic PEG constructs, which have been extended by the author to support error diagnosis, direct evaluation and tree generation.

Parsing Expression Grammars Basics


Consider the following PEG grammar rule:
EnclosedDigits: [0-9]+ / '(' EnclosedDigits ')' ;

introduces a so called Nonterminal EnclosedDigits and a right hand side consisting of two alternatives. The first alternative ([0-9]+) describes a sequence of digits, the second ( '(' EnclosedDigits ')') something enclosed in parentheses. Executing EnclosedDigits with the string ((123))+5 as input would result in a match and move the input position to just before +5.

This sample also shows the potential for recursive definitions, since EnclosedDigits uses itself as soon as it recognizes an opening parenthesis. The following table shows the outcome of applying the above grammar to some other input strings. The | character is an artificial character which visualizes the input position before and after the match.

    Input        Match Position   Match Result
    |((123))+5   ((123))|+5       true
    |123         123|             true
    |5+123       |5+123           false
    |((1)]       |((1)]           false
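The executable nature of such a rule can be sketched directly in C#. The following is a minimal, self-contained illustration of the idea; the helper names (Char_, Digit, Seq) are assumptions for this sketch and not the library's actual API:

```csharp
using System;

// A minimal recursive-descent sketch of the EnclosedDigits rule.
// Helper names (Char_, Digit, Seq) are illustrative, not the library API.
class EnclosedDigitsParser
{
    readonly string src;
    public int Pos;   // current input position

    public EnclosedDigitsParser(string s) { src = s; }

    bool Char_(char c)
    { if (Pos < src.Length && src[Pos] == c) { ++Pos; return true; } return false; }

    bool Digit()
    { if (Pos < src.Length && src[Pos] >= '0' && src[Pos] <= '9') { ++Pos; return true; } return false; }

    // A sequence backtracks as a whole when one of its elements fails.
    bool Seq(params Func<bool>[] es)
    {
        int i0 = Pos;
        foreach (var e in es) if (!e()) { Pos = i0; return false; }
        return true;
    }

    // EnclosedDigits: [0-9]+ / '(' EnclosedDigits ')' ;
    public bool EnclosedDigits()
    {
        if (Digit()) { while (Digit()) { } return true; }            // [0-9]+
        return Seq(() => Char_('('), EnclosedDigits, () => Char_(')'));
    }

    static void Main()
    {
        var p = new EnclosedDigitsParser("((123))+5");
        Console.WriteLine(p.EnclosedDigits() + " " + p.Pos);   // True 7
    }
}
```

Running the sketch on ((123))+5 succeeds and leaves the input position at 7, i.e. just before +5, matching the table above.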

For people familiar with regular expressions, it may help to think of a parsing expression grammar as a generalized regular expression which always matches the beginning of an input string (a regexp prefixed with ^). Whereas a regular expression consists of a single expression, a PEG consists of a set of rules; each rule can use other rules to help in parsing. The starting rule matches the whole input and uses the other rules to match subparts of the input. During parsing one always has a current input position, and the input string starting at this position must match against the rest of the PEG grammar. Like regular expressions, PEG supports the postfix operators * + ?, the dot . and character sets enclosed in []. Unique to PEG are the prefix operators & (peek) and ! (not), which are used to look ahead without consuming input. Alternatives in a PEG are not separated by | but by / to indicate that alternatives are strictly tried in sequential order. What makes PEG grammars powerful and at the same time a potential memory hog is unlimited backtracking, meaning that the input position can be set back to any of the previously visited input positions in case an alternative fails. A good and detailed explanation of PEG can be found in Wikipedia [2].

The following terminology is used:

- Nonterminal: the name of a grammar rule. In a PEG there must be exactly one grammar rule having a given nonterminal on the left hand side; the right hand side of that rule provides the definition. A nonterminal on the right hand side of a grammar rule must reference an existing grammar rule definition.
- Input string: the string which is parsed.
- Input position: indicates the next input character to be read.
- Match: a grammar element can match a stretch of the input; the match starts at the current input position.
- Success/Failure: the possible outcomes of matching a PEG element against the input.
- e, e1, e2: each stand for an arbitrary PEG expression.

The PEG constructs (and some homegrown extensions) supported by the library class described in this article are listed below. In the examples, | indicates the input position.

CodePoint
  Notation: #32 (decimal), #x3A0 (hex), #b111 (binary)
  Meaning:  Match input against the specified Unicode character.
  Example:  #x25   Success: %|1   Failure: |1%

Literal
  Notation: 'literal'
  Meaning:  Match input against the quoted string. Escapes take the same form as in the "C" syntax family.
  Example:  'for'   Success: for|tran   Failure: |afordable

CaseInsensitive Literal
  Notation: 'literal'\i
  Meaning:  Same as Literal but compares case insensitively; \i must follow a Literal.
  Example:  'FOR'\i   Success: FoR|TraN   Failure: |affordable

CharacterSet
  Notation: [chars]
  Meaning:  Same meaning as in regular expressions. Supported are ranges as in [A-Za-z0-9], single characters and escape sequences.

Any
  Notation: .
  Meaning:  Increment the input position; succeeds on any remaining character and fails only at the end of the input.

BITS
  Notation: BITS<bitNo>, BITS<low-high>
  Meaning:  Interprets the bit bitNo or the bit sequence [low-high] of the current input byte as an integer which must match the given value.
  Example:  BITS<7-8,#b11>   Success: |11010101   Failure: |01010101

Sequence
  Notation: e1 e2
  Meaning:  Match input against e1 and then, in case of success, against e2.
  Example:  '#'[0-9]   Success: #5|   Failure: |#A

Sequentially executed alternatives
  Notation: e1 / e2
  Meaning:  Match input against e1 and then, in case of failure, against e2.
  Example:  '<='/'<'   Success: <|5   Failure: |>5

Greedy Option
  Notation: e?
  Meaning:  Try to match input against e; succeeds in any case.
  Example:  '-'?   Success: -|42   Success: |+42

Greedy repeat zero or more occurrences
  Notation: e*
  Meaning:  Match input repeatedly against e until the match fails; succeeds in any case.
  Example:  [0-9]*   Success: 42|b   Success: |-42

Greedy repeat one or more occurrences
  Notation: e+
  Meaning:  Shorthand for e e*.
  Example:  [0-9]+   Success: 42|b   Failure: |-42

Greedy repeat between minimum and maximum occurrences
  Notation: e{min}, e{min,max}, e{,max}, e{min,}
  Meaning:  Match input at least min times but not more than max times against e.
  Example:  ('.'[0-9]*){2,3}   Success: .12.36.42|.18b   Failure: |.42b

Peek
  Notation: &e
  Meaning:  Match e without changing the input position.
  Example:  &'42'   Success: |42   Failure: |-42

Not
  Notation: !e
  Meaning:  Like Peek, but Success and Failure are exchanged.
  Example:  !'42'   Success: |-42   Failure: |42

FATAL
  Notation: FATAL<"message">
  Meaning:  Prints the message and error location to the error stream and quits the parsing process (throws an exception).

WARNING
  Notation: WARNING<"message">
  Meaning:  Prints the message and location to the error stream; always succeeds.

Mandatory
  Notation: @e
  Meaning:  Same as (e / FATAL<"e expected">).

Tree Node
  Notation: ^^e
  Meaning:  If e is matched, a tree node is added to the parse tree. The tree node holds the starting and ending match positions for e.

Ast Node
  Notation: ^e
  Meaning:  Like ^^e, but the node is replaced by its child node if there is only one child.

Rule
  Notation: N: e;
  Meaning:  N is the nonterminal, e the right hand side, which is terminated by a semicolon.

Rule with id
  Notation: [id]N: e;
  Meaning:  id must be a positive integer, e.g. [1]Int:[0-9]+; the id is assigned to the tree/ast node.

Tree building rule
  Notation: [id]^^N: e;
  Meaning:  N is allocated as a tree node having the id <id>.

Ast building rule
  Notation: [id]^N: e;
  Meaning:  N is allocated as a tree node and is eventually replaced by a child if the node for N has only one child which has no siblings.

Parametrized Rule
  Notation: N<peg1, peg2, ...>: e;
  Meaning:  N takes the PEG expressions peg1, peg2, ... as parameters. These parameters can then be used in e.

Into variable
  Notation: e:variableName
  Meaning:  Set the host language variable (a string, byte[], int, double or PegBegEnd) to the matched input stretch. The variable must be declared either in the semantic block of the corresponding rule or in the semantic block of the grammar (see below).

Bits Into variable
  Notation: BITS<bitNo, :variable>, BITS<low-high, :variable>
  Meaning:  Interpret the bit bitNo or the bit sequence [low-high] as an integer and store it in the host variable.

Semantic Function
  Notation: f_
  Meaning:  Call the host language function f_ defined in a semantic block (see below). A semantic function has the signature bool f_();. A return value of true is handled as success, a return value of false as failure.

Semantic Block (Grammar level)
  Notation: BlockName{ /* host language statements */ }
  Meaning:  The BlockName can be missing, in which case a local class named _Top will be created. Functions and data of a grammar-level semantic block can be accessed from any rule-level semantic block. Functions in the grammar-level semantic block can be used as semantic functions at any place in the grammar.

CREATE Semantic Block (Grammar level)
  Notation: CREATE{ /* host language statements */ }
  Meaning:  This kind of block is used in conjunction with customized tree nodes as described at the very end of this table.

Semantic Block (Rule level)
  Notation: RuleName{ /* host language statements */ }: e;
  Meaning:  Functions and data of a rule-level semantic block are only available from within the associated rule. Functions in the rule-associated semantic block can be used as semantic functions on the right hand side of the rule.

Using semantic block (defined elsewhere)
  Notation: RuleName using NameOfSemanticBlock: e;
  Meaning:  The using directive supports reusing the same semantic block when several rules need the same local semantic block.

Custom Node Creation
  Notation: ^^CREATE<CreaFuncName> N: e;  or  ^CREATE<CreaFuncName> N: e;
  Meaning:  Allows creating a user defined node (which must be derived from the library node PegNode). The CreaFuncName function must be defined in a CREATE semantic block (see above) and must have the following overall structure:

PegNode CreaFuncName(ECreatorPhase phase, PegNode parentOrCreated, int id)
{
    if (phase == ECreatorPhase.eCreate || phase == ECreatorPhase.eCreateAndComplete) {
        // create and return the custom node; if phase==ECreatorPhase.eCreateAndComplete
        // this will be the only call
    } else {
        // finish the custom node and return parentOrCreated; one only gets here
        // after successful parsing of the subrules
    }
}

Parsing Expression Grammars Particularities and Idioms


PEGs behave in some respects similarly to regular expressions: the application of a PEG to an input string can be explained by a pattern matching process which assigns matching parts of the input string to rules of the grammar (much like with groups in regexes) and which backtracks in case of a mismatch. The most important differences between a PEG and regexes are that PEGs support recursion and that PEG patterns are greedy. Compared to most other traditional language parsing techniques, PEG is surprisingly different. The most striking differences are:

Parsing Expression Grammars are deterministic and never ambiguous, thereby removing a problem of most other parsing techniques. Ambiguity means that the same input string can be parsed with different sets of rules of a given grammar and that there is no policy saying which of the competing rules should be used. This is in most cases a serious problem, since if it goes undetected it results in different parse trees for the same input. The lack of ambiguity is a big plus for PEG. But the fact that the order of alternatives in a PEG rule matters takes getting used to. The PEG rule rel_operator: '<' / '<=' / '>' / '>='; will, e.g., never succeed in recognizing <=, because the first alternative will always be chosen. The correct rule is: rel_operator: '<=' / '<' / '>=' / '>';

Parsing Expression Grammars are scannerless, whereas most other parsers divide the parsing task into a low level lexical phase called scanning and a high level phase, the proper parsing. The lexical phase just parses items like numbers, identifiers and strings and presents the information as so called tokens to the proper parser. This subdivision has its merits and its weak points. It avoids backtracking in some cases and makes it easy, e.g., to distinguish between a keyword and an identifier. A weak point of most scanners is the lack of context information inside the scanner, so that a given input string always results in the same token. This is not always desirable and causes problems, e.g., in C++ for the input string >>, which can be a right shift operator or the closing of two template brackets.
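The ordering pitfall described above can be demonstrated with a few lines of C# that mimic ordered choice over literals (a self-contained sketch, not the library):

```csharp
using System;

// Demonstrates why alternative order matters in a PEG:
// ordered choice commits to the first alternative whose prefix matches.
class OrderedChoiceDemo
{
    static string src; static int pos;

    static bool Lit(string t)
    {
        if (pos + t.Length <= src.Length && src.Substring(pos, t.Length) == t)
        { pos += t.Length; return true; }
        return false;
    }

    // Returns the input position after the first matching alternative, or -1.
    public static int Match(string input, params string[] alternatives)
    {
        src = input; pos = 0;
        foreach (var a in alternatives) if (Lit(a)) return pos;   // first match wins
        return -1;
    }

    static void Main()
    {
        // '<' / '<=' consumes only "<" from "<=" ...
        Console.WriteLine(Match("<=", "<", "<="));   // 1
        // ... while '<=' / '<' consumes the full operator.
        Console.WriteLine(Match("<=", "<=", "<"));   // 2
    }
}
```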

Parsing Expression Grammars can backtrack to an arbitrary location at the beginning of the input string. PEG does not require that a file which has to be parsed be read completely into memory, but it prohibits freeing any part of the file which has already been parsed. This means that a file which foreseeably will be parsed to the end should be read into memory completely before parsing starts. Fortunately, memory is no longer a scarce resource. In a direct evaluation scenario (semantic actions are executed as soon as the corresponding syntax element is recognized) backtracking can also cause problems, since already executed semantic actions are in most cases not easily undone. Semantic actions should therefore be placed at points where backtracking can no longer occur or where backtracking would indicate a fatal error. Fatal errors in PEG parsing are best handled by throwing an exception. For many common problems, idiomatic solutions exist within the PEG framework, as shown in the following list.

Goal: Avoid that white space scanning clutters up the grammar.
Idiomatic solution: White space scanning should be done immediately after reading a terminal, but not in any other place.
Sample:
    //to avoid
    [3]prod: val S ([*/] S val S)*;
    [4]val : [0-9]+ / '(' S expr ')' S;
    //to prefer
    [3]prod: val ([*/] S val)*;
    [4]val: ([0-9]+ / '(' S expr ')') S;

Goal: Reuse a nonterminal when only a subset is applicable.
Idiomatic solution: !oneOfExceptions reusedNonterminal
Sample: the Java spec rule "SingleCharacter: InputCharacter but not ' or \" becomes the PEG rule
    SingleCharacter: !['\\] InputCharacter

Goal: Test for end of input.
Idiomatic solution:
    !.
    (!./FATAL<"end of input expected">)

Goal: Generic rule for quoting situations.
Idiomatic solution: a parametrized rule
    GenericQuote<BegQuote,QuoteContent,EndQuote>: BegQuote QuoteContent EndQuote;
used e.g. as
    GenericQuote<'"',(!'"' .)*,'"'>

Goal: Order alternatives having the same start.
Idiomatic solution: longer_choice / shorter_choice
Sample:
    '<=' / '<'

Goal: Integrate error handling into the grammar.
Idiomatic solution: Use an error handling alternative.
Sample:
    //expect a ')' symbol
    '(' expr @')'
    //same as
    '(' expr (')'/FATAL<"')' expected">);

Goal: Provide detailed, expressive error messages.
Idiomatic solution: Generate better error messages by peeking at the next symbol.
Sample:
    //poor error handling
    [4]object: '{' S members? @'}' S;
    [5]members: (str/num)(',' S @(str/num))*;
    //better error handling
    [4]object: '{' S (&'}'/members) @'}' S;
    [5]members: @(str/num)(',' S @(str/num))*;

PEG Idioms Applied to Real World Grammar Problems

Most modern programming languages are based on grammars which can almost be parsed by the predominant parsing technique (LALR(1) parsing). The emphasis here is on almost, meaning that there are often grammar rules which require special handling outside of the grammar framework. The PEG framework can handle these exceptional cases far better, as will be shown for the C++ and C# grammars. The C# Language Specification V3.0, for example, has the following wording for its cast-expression/parenthesized-expression disambiguation rule:
A sequence of one or more tokens (2.3.3) enclosed in parentheses is considered the start of a cast-expression only if at least one of the following are true: 1) The sequence of tokens is correct grammar for a type, and the token immediately following the closing parentheses is the token ~, the token !, the token (, an identifier (2.4.1), a literal (2.4.4), or any keyword (2.4.3) except as and is. 2) The sequence of tokens is correct grammar for a type, but not for an expression.

This can be expressed in PEG as:


cast_expression:
    ( /*1)*/   '(' S type ')' S &([~!(] / identifier / literal / !('as' B / 'is' B) keyword B)
      /*2)*/ / !parenthesized_expression '(' S type ')'
    ) S unary_expression;
B: ![a-zA-Z_0-9];
S: (comment / whitespace / new_line / pp_directive)*;

The C++ standard has the following wording for its expression-statement/declaration disambiguation rule:

An expression-statement ... can be indistinguishable from a declaration ... In those cases the statement is a declaration.

This can be expressed in PEG as:


statement: declaration / expression_statement;

Integrating Semantic Actions into a PEG Framework


A PEG grammar can only recognize an input string, which gives you just two results: a boolean value indicating match success or match failure, and an input position pointing to the end of the matched string part. But in most cases the grammar is only a means to give the input string a structure. This structure is then used to associate the input string with a meaning (a semantics) and to execute statements based on this meaning. These statements, executed during parsing, are called semantic actions.

The executable nature of PEG grammars makes the integration of semantic actions easy. Assuming a sequence of grammar symbols e1 e2 and a semantic action es_ which should be performed after recognition of e1, we just get the sequence e1 es_ e2, where es_ is a function of the host language. From the grammar's point of view, es_ has to conform to the same interface as e1 and e2 or any other PEG component, which means that es_ is a function returning a bool value, where true means success and false means failure. The semantic function es_ can be defined either local to the rule which uses (calls) es_ or in the global environment of the grammar. A bundling of semantic functions, into-variables, helper data values and helper functions forms a semantic block.

Semantic actions face one big problem in PEG grammars, namely backtracking. In most cases, backtracking should not occur anymore after a semantic function (e.g. the computation of the result of an arithmetical subexpression) has been performed. The simplest way to guard against backtracking in such a case is to handle any attempt to backtrack as a fatal error. The FATAL<msg> construct presented here aborts parsing (by raising an exception).

Embedding semantic actions into the grammar enables direct evaluation of the parsed construct. A typical application is the stepwise computation of an arithmetical expression during the parse phase.
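The e1 es_ e2 interleaving can be sketched in a few lines of C# 3.0, using lambdas for both grammar elements and the semantic action (names here are illustrative, not the library API):

```csharp
using System;
using System.Collections.Generic;

// Sketch of a semantic action embedded in a PEG sequence: es_ exposes the
// same bool-returning interface as the grammar elements around it.
class SemanticActionDemo
{
    static string src = "ab"; static int pos;
    static List<string> log = new List<string>();

    static bool Char_(char c)
    { if (pos < src.Length && src[pos] == c) { ++pos; return true; } return false; }

    public static string Run()
    {
        pos = 0; log.Clear();
        // e1 es_ e2 : the action runs after e1 matched, before e2 is tried.
        Func<bool> e1  = () => Char_('a');
        Func<bool> es_ = () => { log.Add("saw 'a' at " + pos); return true; };
        Func<bool> e2  = () => Char_('b');
        bool ok = e1() && es_() && e2();
        return ok + " " + log[0];
    }

    static void Main() { Console.WriteLine(Run()); }   // True saw 'a' at 1
}
```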
Direct evaluation is fast but very limiting, since it can only use information present at the current parse point. In many cases embedded semantic actions are therefore used to collect information during parsing for processing after parsing has completed. The collected data can have many forms, but the most important one is a tree. Optimizing parsers and compilers delay semantic actions until the end of the parsing phase and just create a physical parse tree during parsing (our PEG framework supports tree generation with the prefixes ^ and ^^). A tree walking process then checks and optimizes the tree. Finally, the tree is interpreted at runtime or it is just used to generate virtual or real machine code. The most important evaluation options are shown below:
Parsing -> Direct Evaluation
        -> Collecting Information during Parsing -> User defined data structure -> User defined evaluation
                                                 -> Tree Structure -> Interpretation of generated tree
                                                                   -> Generation of VM or machine code

In a PEG implementation, tree generation must cope with backtracking by deleting tree parts which were built after the backtrack restore point. Furthermore, no tree nodes should be created while a Peek or Not production is active. In this implementation this is handled by tree-generation-aware code in the implementations of the And, Peek, Not and ForRepeat productions.

Parsing Expression Grammars Exposed


The following sample grammar is also taken from the Wikipedia article on PEG [2] (but with a slightly different notation):
<<Grammar Name="WikiSample">>
Expr:    S Sum;
Sum:     Product ([+-] S Product)*;
Product: Value ([*/] S Value)*;
Value:   [0-9]+ ('.' [0-9]+)? S / '(' S Sum ')' S;
S:       [ \n\r\t\v]*;
<</Grammar>>

During the application of a grammar to an input string, each grammar rule is called from some parent grammar rule and matches a subpart of the input string matched by the parent rule. This results in a parse tree. The grammar rule Expr would associate the arithmetical expression 2.5 * (3 + 5/7) with the following parse tree:
Expr<
  S<' '>
  Sum<
    Product<
      Value<'2.5' S<' '>>
      '*'                //[*] see text
      S<' '>
      Value<
        '(' S<''>
        Sum<
          Product< Value<'3' S<' '>> >
          '+' S<' '>
          Product< Value<'5' S<''>> '/' S<''> Value<'7' S<''>> >
        >
        ')' S<''>
      >
    >
  >
>

The above parse tree is not a physical tree but an implicit tree which only exists during the parse process. The natural implementation for a PEG parser associates each grammar rule with a method (function). The right hand side of the grammar rule corresponds to the function body, and each nonterminal on the right hand side of the rule is mapped to a function call. When a rule function is called, it tries to match the input string at the current input position against the right hand side of the rule. If it succeeds, it advances the input position accordingly and returns true; otherwise the input position is unchanged and the result is false. The above parse tree can therefore be regarded as a stack trace. The location marked with [*] in the above parse tree corresponds to the function stack Value<=Product<=Sum<=Expr, with the function Value at the top of the stack and the function Expr at the bottom.

The parsing process as described above just matches an input string or it fails to match. But it is not difficult to add semantic actions during this parse process by inserting helper functions at appropriate places. The PEG parser for arithmetical expressions could, e.g., compute the result of the expression during parsing. Such direct evaluation does not significantly slow down the parsing process. Using into-variables and semantic blocks as listed above, one gets the following enhanced PEG grammar for arithmetical expressions, which directly evaluates the result of the expression and prints it to the console:
<<Grammar Name="calc0_direct">>
Top{  // semantic top level block using C# as host language
    double result;
    bool print_(){Console.WriteLine("{0}",result);return true;}
}
Expr: S Sum (!. print_ / FATAL<"following code not recognized">);
Sum {  // semantic rule related block using C# as host language
    double v;
    bool save_()  {v= result; result=0; return true;}
    bool add_()   {v+= result;result=0; return true;}
    bool sub_()   {v-= result;result=0; return true;}
    bool store_() {result= v; return true;}
}
:   Product save_ ('+' S Product add_ / '-' S Product sub_)* store_ ;
Product {  // semantic rule related block using C# as host language
    double v;
    bool save_()  {v= result; result=0; return true;}
    bool mul_()   {v*= result;result=0; return true;}
    bool div_()   {v/= result;result=0; return true;}
    bool store_() {result= v; return true;}
}
:   Value save_ ('*' S Value mul_ / '/' S Value div_)* store_ ;
Value: Number S / '(' S Sum ')' S ;
Number {  // semantic rule related block using C# as host language
    string sNumber;
    bool store_(){double.TryParse(sNumber,out result);return true;}
}
:   ([0-9]+ ('.' [0-9]+)?):sNumber store_ ;
S: [ \n\r\t\v]* ;
<</Grammar>>
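For comparison, the same direct-evaluation strategy can be written out by hand: each rule becomes a method, and the semantic actions of the grammar above become ordinary statements interleaved with the matching code. This is an illustrative sketch, not output of the parser generator:

```csharp
using System;
using System.Globalization;

// A hand-written counterpart of the calc0_direct grammar: each rule is a
// method and the result is computed during parsing (direct evaluation).
class Calc
{
    readonly string s; int i;
    Calc(string src) { s = src; }

    bool S() { while (i < s.Length && " \n\r\t\v".IndexOf(s[i]) >= 0) ++i; return true; }
    bool Char_(char c) { if (i < s.Length && s[i] == c) { ++i; return true; } return false; }

    bool Number(ref double result)   // [0-9]+ ('.' [0-9]+)?
    {
        int i0 = i;
        while (i < s.Length && char.IsDigit(s[i])) ++i;
        if (i == i0) return false;
        if (i < s.Length && s[i] == '.') { ++i; while (i < s.Length && char.IsDigit(s[i])) ++i; }
        result = double.Parse(s.Substring(i0, i - i0), CultureInfo.InvariantCulture);
        return true;
    }

    bool Value(ref double result)    // Number S / '(' S Sum ')' S
    {
        int i0 = i;
        if (Number(ref result)) return S();
        i = i0;
        if (Char_('(') && S() && Sum(ref result) && Char_(')') && S()) return true;
        i = i0; return false;
    }

    bool Product(ref double result)  // Value ('*' S Value / '/' S Value)*
    {
        double v = 0;
        if (!Value(ref v)) return false;
        while (true)
        {
            int i0 = i; double r = 0;
            if (Char_('*') && S() && Value(ref r)) { v *= r; continue; }
            i = i0;
            if (Char_('/') && S() && Value(ref r)) { v /= r; continue; }
            i = i0; break;
        }
        result = v; return true;
    }

    bool Sum(ref double result)      // Product ('+' S Product / '-' S Product)*
    {
        double v = 0;
        if (!Product(ref v)) return false;
        while (true)
        {
            int i0 = i; double r = 0;
            if (Char_('+') && S() && Product(ref r)) { v += r; continue; }
            i = i0;
            if (Char_('-') && S() && Product(ref r)) { v -= r; continue; }
            i = i0; break;
        }
        result = v; return true;
    }

    public static double Eval(string src)   // Expr: S Sum (!. / FATAL<...>)
    {
        var p = new Calc(src); double r = 0;
        p.S();
        if (!p.Sum(ref r) || p.i != p.s.Length)
            throw new Exception("following code not recognized");
        return r;
    }

    static void Main() { Console.WriteLine(Eval("2.5 * (3 + 5/7)")); }   // ≈ 9.2857
}
```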

In many cases on-the-fly evaluation during parsing is not sufficient and one needs a physical parse tree or an abstract syntax tree (abbreviated AST). An AST is a parse tree shrunk to the essential nodes, thereby saving space and providing a view better suited for evaluation. Such physical trees typically need at least 10 times the memory space of the input string and reduce the parsing speed by a factor of 3 to 10. The following PEG grammar uses the symbol ^ to indicate an abstract syntax node and the symbol ^^ to indicate a parse tree node. The grammar presented below is furthermore enhanced with the error handling item FATAL<errMsg>. FATAL leaves the parsing process immediately with the result fail, but with the input position set to the place where the fatal error occurred.
<<Grammar Name="WikiSampleTree">>
[1] ^^Expr:   S Sum (!./FATAL<"end of input expected">) ;
[2] ^Sum:     Product (^[+-] S Product)* ;
[3] ^Product: Value (^[*/] S Value)* ;
[4] Value:    Number S / '(' S Sum ')' S / FATAL<"number or ( <Sum> ) expected">;
[5] ^^Number: [0-9]+ ('.' [0-9]+)? ;
[6] S:        [ \n\r\t\v]* ;
<</Grammar>>

With this grammar, the arithmetical expression 2.5 * (3 + 5/7) results in the following physical tree:
Expr<
  Product<
    Number<'2.5'>
    <'*'>
    Sum<
      Number<'3'>
      <'+'>
      Product< Number<'5'> <'/'> Number<'7'> >
    >
  >
>

With a physical parse tree, much more options for evaluation are possible, e.g. one can generate code for a virtual machine after first optimizing the tree.

Parsing Expression Grammar Implementation

In this chapter I first show how to implement all the PEG constructs one by one. This will be expressed in pseudo code. Then I will try to find the best interface for these basic PEG functions in C# 1.0 and C# 3.0.

General Implementation Strategy for PEG


The natural representation of a PEG is a top-down recursive parser with backtracking. PEG rules are implemented as functions/methods which call each other when needed and return true in case of a match and false in case of a mismatch. Backtracking is implemented by saving the input position before calling a parsing function and restoring the input position to the saved one in case the parsing function returns false. Backtracking can be limited to the PEG sequence construct and the e{min,max} repetitions if the input position is only moved forward after successful matching in all other cases. In the following pseudo code we use strings and integer variables, short-circuit conditional expressions (using && for AND and || for OR) and exceptions. s stands for the input string, i refers to the current input position, and bTreeBuild is an instance variable which inhibits tree build operations when set to false.

CodePoint  #<dec> / #x<hex> / #b<bin>
sample: #32 (decimal), #x3A0 (hex), #b111 (binary)
pseudo code (for #x3A0):
    if i<length(s) && s[i]=='\u03A0' {i+=1; return true;} else {return false;}

Literal  'literal'
sample: 'ab'
pseudo code:
    if i+2<=length(s) && s[i]=='a' && s[i+1]=='b' {i+=2; return true;} else {return false;}

CaseInsensitive Literal  'literal'\i
sample: 'ab'\i
pseudo code:
    if i+2<=length(s) && toupper(s[i])=='A' && toupper(s[i+1])=='B' {i+=2; return true;} else {return false;}

Charset  [charset]
sample: [ab]
pseudo code:
    if i<length(s) && (s[i]=='a' || s[i]=='b') {i+=1; return true;} else {return false;}

Charset (range)  [c0-c1]
sample: [a-z]
pseudo code:
    if i<length(s) && s[i]>='a' && s[i]<='z' {i+=1; return true;} else {return false;}

Any  .
pseudo code:
    if i<length(s) {i+=1; return true;} else {return false;}

BITS
sample: BITS<7-8,#b11>
pseudo code:
    if i<length(s) && ExtractBitsAsInt(s[i],7,8)==3 {i+=1; return true;} else {return false;}

Sequence  e1 e2
pseudo code:
    int i0=i; TreeState t=SaveTreeState();
    if e1() && e2() {return true;}
    else {i=i0; RestoreTreeState(t); return false;}

Alternative  e1 / e2
pseudo code:
    return e1() || e2();

Greedy Option  e?
pseudo code:
    return e() || true;

Greedy repeat 0+  e*
pseudo code:
    while e() {} return true;

Greedy repeat 1+  e+
pseudo code:
    if !e() {return false;} while e() {} return true;

Greedy repeat  e{low,high}
pseudo code:
    int c, i0=i; TreeState t=SaveTreeState();
    for(c=0; c<high; ++c){ if !e() {break;} }
    if c<low {i=i0; RestoreTreeState(t); return false;}
    else {return true;}

Peek  &e
pseudo code:
    int i0=i; bool bOld=bTreeBuild; bTreeBuild=false;
    bool b=e(); i=i0; bTreeBuild=bOld; return b;

Not  !e
pseudo code:
    int i0=i; bool bOld=bTreeBuild; bTreeBuild=false;
    bool b=e(); i=i0; bTreeBuild=bOld; return !b;

FATAL  FATAL<message>
pseudo code:
    PrintMsg(message); throw PegException();

WARNING  WARNING<message>
pseudo code:
    PrintMsg(message); return true;

Into  e:variableName
pseudo code:
    int i0=i; bool b=e(); variableName=s.substring(i0,i-i0); return b;

Bits Into variable  BITS<3-5,:v>
pseudo code:
    if i<length(s) {v=ExtractBitsAsInt(s[i],3,5); ++i; return true;} else {return false;}

Build Tree Node  ^^e
pseudo code:
    TreeState t=SaveTreeState(); AddTreeNode(...);
    bool b=e(); if !b {RestoreTreeState(t);} return b;

Build Ast Node  ^e
pseudo code:
    TreeState t=SaveTreeState(); AddTreeNode(...);
    bool b=e();
    if !b {RestoreTreeState(t);} else {TryReplaceByChildWithoutSibling();}
    return b;
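The pseudo code above can be transcribed almost literally into C#. The following self-contained sketch is mine, not part of the article's library (the class name MiniPeg and every detail are assumptions); it implements only the literal, ordered-choice and backtracking-sequence constructs and applies them to the toy rule S: 'a' S 'b' / 'ab';

```csharp
using System;

// Hypothetical minimal transcription of the pseudo-code table; not the article's PegCharParser.
class MiniPeg
{
    readonly string s;   // the input string "s" of the pseudo code
    int i;               // the current input position "i" of the pseudo code

    MiniPeg(string src) { s = src; i = 0; }

    // Literal 'literal': advance only on a full match, as in the table above.
    bool Literal(string lit)
    {
        if (i + lit.Length > s.Length) return false;
        for (int k = 0; k < lit.Length; ++k)
            if (s[i + k] != lit[k]) return false;
        i += lit.Length;
        return true;
    }

    // Sequence e1 e2 with backtracking: save the position, restore it on failure.
    bool Seq(Func<bool> e1, Func<bool> e2)
    {
        int i0 = i;
        if (e1() && e2()) return true;
        i = i0;
        return false;
    }

    // PEG rule  S: 'a' S 'b' / 'ab';  -- ordered choice maps to ||.
    bool S()
    {
        return Seq(() => Literal("a"), () => Seq(S, () => Literal("b")))
            || Literal("ab");
    }

    // Match S against the whole input.
    public static bool Matches(string src)
    {
        var p = new MiniPeg(src);
        return p.S() && p.i == p.s.Length;
    }

    static void Main()
    {
        Console.WriteLine(MiniPeg.Matches("aabb"));   // True
        Console.WriteLine(MiniPeg.Matches("aab"));    // False
    }
}
```

Note how the sequence helper is the only place that moves the input position backwards, exactly as the table demands.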

Parsing Expression Grammars Mapped to C#1.0


In C#1.0 we can map the PEG operators CodePoint, Literal, Charset, Any, FATAL and WARNING to helper functions in a base class. But the other PEG constructs, like Sequence, Repeat, Peek, Not, Into and tree building, cannot be easily outsourced to a library module. The grammar for integer sums

<<Grammar Name="IntSum">>
Sum: S [0-9]+ (S [+-] S [0-9]+)* S ;
S:   [ \n\r\t\v]* ;
<</Grammar>>

results in the following C#1.0 implementation (PegCharParser is a base class, not shown, with the field pos_ and the methods In and OneOfChars):

class InSum_C1 : PegCharParser
{
    public InSum_C1(string s) : base(s) { }
    public bool Sum() //Sum: S [0-9]+ (S [+-] S [0-9]+)* S ;
    {
        S();
        //[0-9]+
        if (!In('0', '9')) { return false; }
        while (In('0', '9')) ;
        for (;;) { //(S [+-] S [0-9]+)*
            int pos = pos_;
            if (S() && OneOfChars('+', '-') && S()) {
                //[0-9]+
                if (!In('0', '9')) { pos_ = pos; break; }
                while (In('0', '9')) ;
            } else {
                pos_ = pos;
                break;
            }
        }
        S();
        return true;
    }
    bool S() //S: [ \n\r\t\v]* ;
    {
        while (OneOfChars(' ', '\n', '\r', '\t', '\v')) ;
        return true;
    }
}

To execute the grammar we just call the method Sum on an object of the above class. But we cannot be happy and satisfied with this solution. Compared with the original grammar rule, the method Sum in the above class InSum_C1 is large, and its use of loops and helper variables makes it quite confusing. It is, however, perhaps the best that is possible in C#1.0; many traditional parser generators produce even worse code.

Parsing Expression Grammars Mapped to C#3.0


PEG operators like Sequence, Repeat, Into, Tree Build, Peek and Not can be regarded as operators or functions which take a function as parameter. This maps in C# to a method with a delegate parameter. The PEG Sequence operator e.g. can be implemented as a function with the following interface: public bool And(Matcher pegSequence); where Matcher is the following delegate: public delegate bool Matcher();. In older C# versions, passing a function as a parameter required some code lines, but with C#3.0 this changed. C#3.0 supports lambdas, which are anonymous functions with a very low syntactical overhead. Lambdas enable a functional implementation of PEG in C#. The PEG sequence e1 e2 can now be mapped to the C# term And(()=>e1() && e2()). ()=>e1() && e2() looks like a normal expression, but is in effect a full-fledged function with zero parameters (hence ()=>) and the function body {return e1() && e2();}. With this facility, the grammar for integer sums

<<Grammar Name="IntSum">>
Sum: S [0-9]+ (S [+-] S [0-9]+)* S ;
S:   [ \n\r\t\v]* ;
<</Grammar>>

results in the following C#3.0 implementation (PegCharParser is a base class, not shown, with the methods And, PlusRepeat, OptRepeat, In and OneOfChars):

class IntSum_C3 : PegCharParser
{
    public IntSum_C3(string s) : base(s) { }
    public bool Sum() //Sum: S [0-9]+ (S [+-] S [0-9]+)* S ;
    {
        return And(() =>
               S()
            && PlusRepeat(() => In('0', '9'))
            && OptRepeat(() => S() && OneOfChars('+', '-') && S()
                               && PlusRepeat(() => In('0', '9')))
            && S());
    }
    public bool S() //S: [ \n\r\t\v]* ;
    {
        return OptRepeat(() => OneOfChars(' ', '\n', '\r', '\t', '\v'));
    }
}

Compared to the C#1.0 implementation this parser class is a huge improvement. We have eliminated all loops and helper variables. The correctness (accordance with the grammar rules) is also much easier to check. The methods And, PlusRepeat, OptRepeat, In and OneOfChars are all implemented in both the PegCharParser and PegByteParser base classes. The following table shows most of the PEG methods available in the base library delivered with this article.

CodePoint
    C# methods:   Char(char c), Char(char c0, char c1, ...)
    sample usage: Char('\u0023')

Literal
    C# methods:   Char(string s)
    sample usage: Char("ab")

CaseInsensitive Literal
    C# methods:   IChar(char c0, char c1, ...), IChar(string s)
    sample usage: IChar("ab")

Char Set [c0c1...]
    C# methods:   OneOf(char c0, char c1, ...), OneOf(string s)
    sample usage: OneOf("ab")

Char Set [c0-c1...]
    C# methods:   In(char c0, char c1, ...), In(string s)
    sample usage: In('A','Z', 'a','z', '0','9')

Any .
    C# methods:   Any()
    sample usage: Any()

BITS
    C# methods:   Bits(char cLow, char cHigh, byte toMatch)
    sample usage: Bits(1, 5, 31)

Sequence e1 e2 ...
    C# methods:   And(MatcherDelegate m)
    sample usage: And(() => S() && top_element())

Alternative e1 / e2 / ...
    C# methods:   e1 || e2 || ...
    sample usage: @object() || array()

Greedy Option e?
    C# methods:   Option(MatcherDelegate m)
    sample usage: Option(() => Char('-'))

Greedy repeat 0+ e*
    C# methods:   OptRepeat(MatcherDelegate m)
    sample usage: OptRepeat(() => OneOf(' ', '\t', '\r', '\n'))

Greedy repeat 1+ e+
    C# methods:   PlusRepeat(MatcherDelegate m)
    sample usage: PlusRepeat(() => In('0', '9'))

Greedy repeat e{low,high}
    C# methods:   ForRepeat(int low, int high, MatcherDelegate m)
    sample usage: ForRepeat(4, 4, () => In('0', '9', 'A', 'F', 'a', 'f'))

Peek &e
    C# methods:   Peek(MatcherDelegate m)
    sample usage: Peek(() => Char('}'))

Not !e
    C# methods:   Not(MatcherDelegate m)
    sample usage: Not(() => OneOf('"','\\'))

FATAL<message>
    C# methods:   Fatal("<message>")
    sample usage: Fatal("<<'}'>> expected")

WARNING<message>
    C# methods:   Warning("<message>")
    sample usage: Warning("non-json stuff before end of file")

Into e :variableName
    C# methods:   Into(out string varName, MatcherDelegate m),
                  Into(out int varName, MatcherDelegate m),
                  Into(out PegBegEnd varName, MatcherDelegate m)
    sample usage: Into(out top.n, () => Any())

Bits Into BITS<3-5,:v>
    C# methods:   BitsInto(int lowBitNo, int highBitNo, out int varName)
    sample usage: BitsInto(1, 5, out top.tag)

Build Tree Node ^^RuleName
    C# methods:   TreeNT(int nRuleId, PegBaseParser.Matcher toMatch)
    sample usage: TreeNT((int)Ejson_tree.json_text, ()=>...)

Build Ast Node ^RuleName
    C# methods:   TreeAST(int nRuleId, PegBaseParser.Matcher toMatch)
    sample usage: TreeAST((int)EC_KernighanRitchie2.external_declaration, ()=>...)

Parametrized Rule RuleName<a,b,...>
    C# methods:   RuleName(MatcherDelegate a, MatcherDelegate b, ...)
    sample usage: binary(()=> relational_expression(), ()=>TreeChars(()=> Char('=','=') || Char('!','=')))

Expression Grammar Examples


The following examples show uses of the PegGrammar class for all supported use cases:

1. Recognition only: the result is just match or no match, in which case an error message is issued.
2. Build of a physical parse tree: the result is a physical tree.
3. Direct evaluation: semantic actions are executed during parsing.
4. Build tree, interpret tree: the generated tree is traversed and evaluated.

JSON Checker (Recognize only)


JSON (JavaScript Object Notation) [5][6] is an exchange format suited for serializing/deserializing program data. Compared to XML it is featherweight and therefore a good testing candidate for parsing techniques. The JSON checker presented here gives an error message and error location in case the file does not conform to the JSON grammar. The following PEG grammar is the basis of json_check.

<<Grammar Name="json_check" encoding_class="unicode" encoding_detection="FirstCharIsAscii" reference="www.ietf.org/rfc/rfc4627.txt">>
[1]json_text:        S top_element expect_file_end ;
[2]expect_file_end:  !./ WARNING<"non-json stuff before end of file">;
[3]top_element:      object / array / FATAL<"json file must start with '{' or '['"> ;
[4]object:           '{' S (&'}'/members) @'}' S ;
[5]members:          pair S (',' S pair S)* ;
[6]pair:             @string S @':' S value ;
[7]array:            '[' S (&']'/elements) @']' S ;
[8]elements:         value S (',' S value S)* ;
[9]value:            @(string / number / object / array / 'true' / 'false' / 'null') ;
[10]string:          '"' char* @'"' ;
[11]char:            escape / !(["\\]/control_chars)unicode_char ;
[12]escape:          '\\' ( ["\\/bfnrt] / 'u' ([0-9A-Fa-f]{4}/FATAL<"4 hex digits expected">) / FATAL<"illegal escape">);
[13]number:          '-'? int frac? exp? ;
[14]int:             '0'/ [1-9][0-9]* ;
[15]frac:            '.' [0-9]+ ;
[16]exp:             [eE] [-+] [0-9]+ ;
[17]control_chars:   [#x0-#x1F] ;
[18]unicode_char:    [#x0-#xFFFF] ;
[19]S:               [ \t\r\n]* ;
<</Grammar>>

The translation of the above grammar to C#3.0 is straightforward and results in the following code (only the translation of the first four rules is reproduced).

public bool json_text()
{
    return And(() => S() && top_element() && expect_file_end());
}
public bool expect_file_end()
{
    return Not(() => Any()) || Warning("non-json stuff before end of file");
}
public bool top_element()
{
    return @object() || array() || Fatal("json file must start with '{' or '['");
}
public bool @object()
{
    return And(() =>
           Char('{')
        && S()
        && (Peek(() => Char('}')) || members())
        && (Char('}') || Fatal("<<'}'>> expected"))
        && S());
}

JSON Tree (Build Tree)


With a few changes to the JSON checker grammar we get a grammar which generates a physical tree for a JSON file. In order to have unique nodes for the JSON values true, false and null, we add corresponding rules. Furthermore, we add a rule which matches the content of a string (the string without the enclosing double quotes). This gives us the following grammar:

<<Grammar Name="json_tree" encoding_class="unicode" encoding_detection="FirstCharIsAscii" reference="www.ietf.org/rfc/rfc4627.txt">>
[1]^^json_text:      (object / array) ;
[2]^^object:         S '{' S (&'}'/members) S @'}' S ;
[3]members:          pair S (',' S @pair S)* ;
[4]^^pair:           @string S ':' S value ;
[5]^^array:          S '[' S (&']'/elements) S @']' S ;
[6]elements:         value S (',' S @value S)* ;
[7]value:            @(string / number / object / array / true / false / null) ;
[8]string:           '"' string_content '"' ;
[9]^^string_content: ( '\\' ( 'u'([0-9A-Fa-f]{4}/FATAL<"4 hex digits expected">)
                            / ["\\/bfnrt]/FATAL<"illegal escape">
                            )
                     / [#x20-#x21#x23-#xFFFF]
                     )* ;
[10]^^number:        '-'? ('0'/ [1-9][0-9]*) ('.' [0-9]+)? ([eE] [-+] [0-9]+)? ;
[11]S:               [ \t\r\n]* ;
[12]^^true:          'true' ;
[13]^^false:         'false' ;
[14]^^null:          'null' ;
<</Grammar>>

The following shows a JSON input file and the tree generated for it by the TreePrint helper class of our parser library.

JSON sample file:

{
   "ImageDescription": {
      "Width": 800,
      "Height": 600,
      "Title": "View from 15th Floor",
      "IDs": [116, 943, 234, 38793]
   }
}

TreePrint output:

json_text<
  object<
    pair< 'ImageDescription'
      object<
        pair<'Width' '800'>
        pair<'Height' '600'>
        pair<'Title' 'View from 15th Floor'>
        pair<'IDs' array<'116' '943' '234' '38793'>>
      >
    >
  >
>

Basic Encode Rules (Direct Evaluation + Build Tree)


BER (Basic Encoding Rules) is the most commonly used format for encoding ASN.1 data. Like XML, ASN.1 serves the purpose of representing hierarchical data, but unlike XML, ASN.1 is traditionally encoded in compact binary formats, and BER is one of these formats (albeit the least compact one). The Internet standards SNMP and LDAP are examples of ASN.1 protocols using BER as encoding. The following PEG grammar for reading a BER file into a tree representation uses semantic blocks to store information necessary for further parsing. This kind of dynamic parsing, which uses data read during the parsing process to decode data further downstream, is typical for the parsing of binary formats. The grammar rules for BER [4] as shown below express the following facts:

1. BER nodes consist of the triple Tag Length Value (abbreviated as TLV), where Value is either a primitive value or a list of TLV nodes.
2. The Tag identifies the element (like the start tag in XML).
3. The Tag contains a flag indicating whether the element is primitive or constructed. Constructed means that there are children.
4. The Length is either the length of the Value in bytes or the special pattern 0x80 (only allowed for elements with children), in which case the sequence of children ends with two zero bytes (0x0000).
5. The Value is either a primitive value or, if the constructed flag is set, a sequence of Tag Length Value triples. The sequence of TLV triples ends when the length given in the Length part of the TLV triple is used up or, in the case where the length is given as 0x80, when the end marker 0x0000 has been reached.

<<Grammar Name="BER" encoding_class="binary" reference="http://en.wikipedia.org/wiki/Basic_encoding_rules" comment="Tree generating BER decoder (minimal version)">>
{
   int tag,length,n,@byte;
   bool init_()      {tag=0;length=0; return true;}
   bool add_Tag_()   {tag*=128;tag+=n; return true;}
   bool addLength_() {length*=256;length+=@byte;return true;}
}
[1] ProtocolDataUnit: TLV;
[2] ^^TLV: init_ ( &BITS<6,#1> Tag ( #x80 CompositeDelimValue #0#0 / Length CompositeValue ) / Tag Length PrimitiveValue );
[3] Tag: OneOctetTag / MultiOctetTag / FATAL<"illegal TAG">;
[4] ^^OneOctetTag: !BITS<1-5,#b11111> BITS<1-5,.,:tag>;
[5] ^^MultiOctetTag: . (&BITS<8,#1> BITS<1-7,.,:n> add_Tag_)* BITS<1-7,.,:n> add_Tag_;
[6] Length: OneOctetLength / MultiOctetLength / FATAL<"illegal LENGTH">;
[7] ^^OneOctetLength: &BITS<8,#0> BITS<1-7,.,:length>;
[8] ^^MultiOctetLength: &BITS<8,#1> BITS<1-7,.,:n> ( .:byte addLength_){:n};
[9] ^^PrimitiveValue: .{:length} / FATAL<"BER input ends before VALUE ends">;
[10]^^CompositeDelimValue: (!(#0#0) TLV)*;
[11]^^CompositeValue
{
   int len; PegBegEnd begEnd;
   bool save_()   {len= length;return true;}
   bool at_end_() {return len<=0;}
   bool decr_()   {len-= begEnd.posEnd_-begEnd.posBeg_;return len>=0;}
}
: save_ (!at_end_ TLV:begEnd (decr_/FATAL<"illegal length">))*;
<</Grammar>>
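Independently of the PEG machinery, the five TLV rules above can be checked with a tiny hand-written decoder. The following sketch is my own illustrative helper, not part of the article's code; it reads definite-length TLV triples and counts the nodes of a small BER-encoded SEQUENCE, while indefinite lengths (0x80) and multi-octet tags are left out for brevity:

```csharp
using System;

// Minimal definite-length BER TLV walker (illustrative only).
static class TlvSketch
{
    // Counts all TLV nodes in data[pos..pos+len); returns -1 on malformed input.
    public static int CountNodes(byte[] data, int pos, int len)
    {
        int end = pos + len, count = 0;
        while (pos < end)
        {
            if (pos + 2 > end) return -1;           // need at least a tag and a length octet
            byte tag = data[pos++];                 // single-octet tag only
            int length = data[pos++];
            if (length >= 0x80) return -1;          // indefinite/multi-octet length: out of scope
            if (pos + length > end) return -1;      // the Length must fit into the enclosing value
            bool constructed = (tag & 0x20) != 0;   // bit 6 of the tag: the constructed flag
            if (constructed)
            {
                int inner = CountNodes(data, pos, length);  // walk the child TLV sequence
                if (inner < 0) return -1;
                count += inner;
            }
            pos += length;                          // skip the value (primitive or walked above)
            ++count;
        }
        return count;
    }

    static void Main()
    {
        // SEQUENCE { INTEGER 5 }  ==  30 03 02 01 05
        byte[] ber = { 0x30, 0x03, 0x02, 0x01, 0x05 };
        Console.WriteLine(TlvSketch.CountNodes(ber, 0, ber.Length)); // 2
    }
}
```

The recursion mirrors rule [2] of the grammar: a constructed value is itself a sequence of TLV triples whose total size is bounded by the enclosing Length.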

Scientific Calculator (Build Tree + Evaluate Tree)

This calculator supports the basic arithmetic operations + - * /, built-in functions taking one argument like 'sin', 'cos', ..., and assignments to variables. The calculator expects line-separated expressions and assignments. It works as a two-step interpreter which first builds a tree and then evaluates the tree. The PEG grammar for this calculator can be translated to a PEG parser by the parser generator coming with the PEG Grammar Explorer. The evaluator must be written by hand; it works by walking the tree and evaluating the results as it visits the nodes. The grammar for the calculator is:

<<Grammar Name="calc0_tree">>
[1]^^Calc:  ((^'print' / Assign / Sum) ([\r\n]/!./FATAL<"end of line expected">) [ \r\n\t\v]* )+ (!./FATAL<"not recognized">);
[2]^Assign: S ident S '=' S Sum;
[3]^Sum:    Prod (^[+-] S @Prod)*;
[4]^Prod:   Value (^[*/] S @Value)*;
[5] Value:  (Number/'('S Sum @')'S/Call/ident) S;
[6]^Call:   ident S '(' @Sum @')' S;
[7]^Number: [0-9]+ ('.' [0-9]+)? ([eE][+-][0-9]+)?;
[8]^ident:  [A-Za-z_][A-Za-z_0-9]*;
[9] S:      [ \t\v]*;
<</Grammar>>
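The hand-written tree-walking evaluation step is not shown in the article. As a hedged illustration of the idea only, the sketch below evaluates a tiny hand-built tree for 1 + 2 * 3; the Node type and all names are my own, not the parse-tree classes produced by the generator:

```csharp
using System;

// Illustrative expression-tree node; not the article's parse-tree classes.
class Node
{
    public string Op;      // "+", "-", "*", "/" for inner nodes, null for leaves
    public double Value;   // used by leaf (Number) nodes
    public Node Left, Right;

    public static Node Num(double v) => new Node { Value = v };
    public static Node Bin(string op, Node l, Node r) => new Node { Op = op, Left = l, Right = r };

    // Post-order walk: evaluate the children first, then apply the operator.
    public double Eval()
    {
        if (Op == null) return Value;
        double l = Left.Eval(), r = Right.Eval();
        switch (Op)
        {
            case "+": return l + r;
            case "-": return l - r;
            case "*": return l * r;
            case "/": return l / r;
            default: throw new InvalidOperationException("unknown operator " + Op);
        }
    }

    static void Main()
    {
        // Tree for 1 + 2 * 3 -- Prod binds tighter than Sum, so '*' sits below '+'.
        var tree = Bin("+", Num(1), Bin("*", Num(2), Num(3)));
        Console.WriteLine(tree.Eval()); // 7
    }
}
```

The grammar's Sum/Prod nesting guarantees the shape of such trees: the multiplication node always ends up below the addition node, so the plain post-order walk gets operator precedence for free.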

PEG Parser Generator


A PEG Generator Implemented with PEG
The library classes PegCharParser and PegByteParser are designed for the manual construction of PEG parsers. But it is in any case highly recommended to first write the grammar on paper before implementing it. I wrote a little parser generator (using PegCharParser) which translates a 'paper' PEG grammar to a C# program. The current version of the PEG parser generator just generates a C# parser. It uses optimizations for huge character sets and for big sets of literal alternatives. Future versions will generate source code for C/C++ and other languages and will furthermore support debugging, tracing and direct execution of the grammar without the need to translate it to a host language. But even the current version of the PEG parser generator is quite helpful: all the samples presented in the chapter Expression Grammar Examples were generated with it. The PEG parser generator is itself an example of a PEG parser which generates a syntax tree. It takes a PEG grammar as input, validates the generated syntax tree and then writes a set of C# code files which implement the parser described by the PEG grammar.

PEG Parser Generator Grammar

The PEG Parser Generator coming with this article expects a set of grammar rules written as described in the chapter Parsing Expression Grammars Basics. These rules must be preceded by a header and terminated by a trailer as described in the following PEG grammar:

<<Grammar Name="PegGrammarParser">>
peg_module:               peg_head peg_specification peg_tail;
peg_head:                 S '<<Grammar'\i B S attribute+ '>>';
attribute:                attribute_key S '=' S attribute_value S;
attribute_key:            ident;
attribute_value:          "attribute value in single or double quotes";
peg_specification:        toplevel_semantic_blocks peg_rules;
toplevel_semantic_blocks: semantic_block*;
semantic_block:           named_semantic_block / anonymous_semantic_block;
named_semantic_block:     sem_block_name S anonymous_semantic_block;
anonymous_semantic_block: '{' semantic_block_content '}' S;
peg_rules:                S peg_rule+;
peg_rule:                 lhs_of_rule ':' S rhs_of_rule ';' S;
lhs_of_rule:              rule_id? tree_or_ast? create_spec? rule_name_and_params (semantic_block/using_sem_block)?;
rule_id:                  (![A-Za-z_0-9^] .)* [0-9]+ (![A-Za-z_0-9^] .)*;
tree_or_ast:              '^^'/'^';
create_spec:              'CREATE' S '<' create_method S '>' S;
create_method:            ident;
ident:                    [A-Za-z_][A-Za-z_0-9]*;
rhs_of_rule:              "right hand side of rule as described in the chapter Parsing Expression Grammars Basics";
semantic_block_content:   "semantic block content as described in the chapter Parsing Expression Grammars Basics";
peg_tail:                 '<</Grammar'\i S '>>';
<</Grammar>>

The header of the grammar contains HTML/XML-style attributes which are used to determine the name of the generated C# file and the input file properties. The following attributes are used by the C# code generator:

Attribute Key        Optionality   Attribute Value
Name                 Mandatory     Name for the generated C# grammar file and namespace.
encoding_class       Optional      Encoding of the input file. Must be one of binary, unicode, utf8 or ascii. Default is ascii.
encoding_detection   Optional      Must only be present if the encoding_class is set to unicode. In this case one of the values FirstCharIsAscii or BOM is expected.

All further attributes are treated as comments. The attribute reference in the following sample header

<<Grammar Name="json_check" encoding_class="unicode" encoding_detection="FirstCharIsAscii" reference="www.ietf.org/rfc/rfc4627.txt">>

is treated as comment.

The PEG Parser Generator's Handling of Semantic Blocks


Semantic blocks are translated to local classes. The code inside semantic blocks must be C# source text as expected in a class body, except that access keywords can be left out; the parser generator prepends an internal access keyword when necessary. Top level semantic blocks are handled differently than local semantic blocks. A top level semantic block is created in the grammar's constructor, whereas a local semantic block is created each time the associated rule method is called. There is no need to define a constructor in a local semantic block, since the parser generator creates a constructor with one parameter, a reference to the grammar class. The following sample shows a grammar excerpt with a top level and a local semantic block and its translation to C# code.

<<Grammar Name="calc0_direct">>
Top{ // semantic top level block
   double result;
   bool print_(){Console.WriteLine("{0}",result);return true;}
}
...
Number
{ //local semantic block
   string sNumber;
   bool store_(){double.TryParse(sNumber,out result);return true;}
}
: ([0-9]+ ('.' [0-9]+)?):sNumber store_ ;

These semantic blocks will be translated to the following C# source code:

class calc0_direct : PegCharParser
{
    class Top{ // semantic top level block
        internal double result;
        internal bool print_(){Console.WriteLine("{0}",result);return true;}
    }
    Top top;
    #region Constructors
    public calc0_direct() : base() { top = new Top(); }
    public calc0_direct(string src, TextWriter FerrOut) : base(src, FerrOut) { top = new Top(); }
    #endregion Constructors
    ...
    class _Number{ //local semantic block
        internal string sNumber;
        internal bool store_(){double.TryParse(sNumber,out parent_.top.result);return true;}
        internal _Number(calc0_direct grammarClass){parent_ = grammarClass;}
        calc0_direct parent_;
    }
    public bool Number()
    {
        var _sem = new _Number(this);
        ...
    }
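The pattern the generator emits here (a nested class holding rule-local state plus a back reference to the parser object) can be sketched independently of the generator. All names in the following toy are mine, not generated code:

```csharp
using System;

// Hand-written illustration of the "local semantic block as nested class" pattern.
class TinyCalc
{
    public double result;          // plays the role of the top-level semantic block state

    class NumberSem                // plays the role of a local semantic block
    {
        internal string sNumber;
        readonly TinyCalc parent_;
        internal NumberSem(TinyCalc parent) { parent_ = parent; }
        // Writes through the back reference into the enclosing parser's state.
        // Note: double.TryParse uses the current culture's number format.
        internal bool Store() { return double.TryParse(sNumber, out parent_.result); }
    }

    // A rule method creates its semantic-block instance on every call.
    public bool Number(string token)
    {
        var _sem = new NumberSem(this);
        _sem.sNumber = token;      // in the real parser this would come from an Into capture
        return _sem.Store();
    }
}
```

After new TinyCalc().Number("42"), the shared result field holds 42, just as store_ writes into top.result in the generated code.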

Quite often, several grammar rules must use the same local semantic block. To avoid code duplication, the parser generator supports the using SemanticBlockName clause. The semantic block named SemanticBlockName should be defined before the first grammar rule, at the same place where the top level semantic blocks are defined. But because such a block is referenced in the using clause of a rule, it is treated as a local semantic block. Local semantic blocks also support destructors. A destructor is translated to an implementation of the IDisposable interface, and the destructor code is placed into the corresponding Dispose() function. The grammar rule function which is generated by the parser generator will be enclosed in a using block. This allows cleanup code to be executed at the end of the rule even in the presence of exceptions. The following sample is taken from the Python 2.5.2 sample parser.
Line_join_sem_{
   bool prev_;
   Line_join_sem_ (){set_implicit_line_joining_(true,out prev_);}
   ~Line_join_sem_(){set_implicit_line_joining_(prev_);}
}
...
[8] parenth_form: '(' parenth_form_content @')' S;
[9] parenth_form_content using Line_join_sem_: S expression_list?;
...
[17]^^generator_expression: '(' generator_expression_content @')' S;
[18]generator_expression_content using Line_join_sem_: S expression genexpr_for;

The Line_join_sem_ semantic block turns Python's implicit line joining on and off (Python is line oriented, except that line breaks are allowed inside constructs which are parenthesized as in (...) {...} [...]). The Line_join_sem_ semantic block and rule [9] of the above grammar excerpt are translated to
class Line_join_sem_ : IDisposable{
    bool prev_;
    internal Line_join_sem_(python_2_5_2_i parent)
    {
        parent_= parent;
        parent_._top.set_implicit_line_joining_(true,out prev_);
    }
    python_2_5_2_i parent_;
    public void Dispose(){parent_._top.set_implicit_line_joining_(prev_);}
}
public bool parenth_form_content()
/*[9] parenth_form_content using Line_join_sem_: S expression_list?;*/
{
    using(var _sem= new Line_join_sem_(this)){
        return And(()=> S() && Option(()=> expression_list() ) );
    }
}

Parsing Expression Grammars in Perspective


Parsing Expression Grammars narrow the semantic gap between a formal grammar and the implementation of the grammar in a functional or imperative programming language. PEGs are therefore particularly well suited for manually written parsers as well as for attempts to integrate a grammar very closely into a programming language. As stated in [1], the elements which form the PEG framework are not new, but are well known and commonly used techniques when implementing parsers manually. What makes the PEG framework unique is the selection and combination of the basic elements, namely:

Scannerless parsing
   Advantage: Only one level of abstraction; no scanner means no scanner worries.
   Disadvantage: The grammar is slightly cluttered up. Recognition of overlapping tokens might be inefficient (e.g. the identifier token overlaps with keywords -> ident: !keyword [A-Z]+;).

Lack of ambiguity
   Advantage: There is only one interpretation for a grammar. The effect is that PEG grammars are "executable".
   Disadvantage: ---

Error handling by using a FATAL alternative
   Advantage: Total user control over error diagnostics.
   Disadvantage: Bloats the grammar.

Backtracking
   Advantage: Backtracking adds to the powerfulness of PEG. If the input string is in memory, backtracking just means resetting the input position.
   Disadvantage: Excessive backtracking is a potential memory hog and interferes with semantic actions. Solution: issue a fatal error in case backtracking cannot succeed anymore.

Greedy repetition
   Advantage: Greedy repetition conforms to the "maximum munch rule" used in scanning and therefore allows scannerless parsing.
   Disadvantage: Some patterns are more difficult to recognize.

Ordered Choice
   Advantage: The author of the grammar determines the selection strategy.
   Disadvantage: A potential error source for the inexperienced: R: '<' / '<=' ; will never find <=.

Lookahead operators & and !
   Advantage: Supports arbitrary lookaheads at a good cost/gain ratio if backtracking is supported anyway. Lookahead e.g. allows better reuse of grammar rules and supports more expressive error diagnostics.
   Disadvantage: Excessive use of & and ! makes the parser slow.

A PEG grammar can incur a serious performance penalty when backtracking occurs frequently. This is the reason that some PEG tools (so-called packrat parsers) memoize already read input and the associated rules. It can be proven that appropriate memoization guarantees linear parse time even when backtracking and unlimited lookahead occur. Memoization (saving information about already taken paths), on the other hand, has its own overhead and impairs performance in the average case. The far better approach to limiting backtracking is rewriting the grammar in a way which reduces backtracking. How to do this will be shown in the next chapter. The ideas underlying PEG grammars are not entirely new, and many of them are regularly used to manually construct parsers. Only in its support and encouragement of backtracking and unlimited lookahead does PEG deviate from most earlier parsing techniques. The simplest implementation of unlimited lookahead and backtracking requires that the input file be read into internal memory before parsing starts. This is not a problem nowadays, but it was not acceptable earlier when memory was a scarce resource.
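The memoization idea mentioned above can be sketched in a few lines: cache each (rule, position) outcome together with the resulting end position, so that a rule re-tried at the same spot after backtracking returns instantly. The following is an illustrative toy of my own, not the article's library; the calls counter shows that the shared prefix of the first two alternatives does not cause the recursive rule to be re-parsed:

```csharp
using System;
using System.Collections.Generic;

// Toy packrat-style memoization around a recursive PEG rule.
class PackratSketch
{
    readonly string s;
    int i;
    // (rule id, start position) -> (matched?, end position)
    readonly Dictionary<(int, int), (bool, int)> memo = new Dictionary<(int, int), (bool, int)>();
    public int calls;   // how often the rule body actually ran

    PackratSketch(string src) { s = src; }

    bool Memo(int ruleId, Func<bool> body)
    {
        var key = (ruleId, i);
        if (memo.TryGetValue(key, out var hit)) { i = hit.Item2; return hit.Item1; }
        int i0 = i;
        bool ok = body();
        if (!ok) i = i0;               // backtrack on failure
        memo[key] = (ok, i);
        return ok;
    }

    bool Lit(char c) { if (i < s.Length && s[i] == c) { ++i; return true; } return false; }

    // S: 'a' S 'b' / 'a' S 'c' / 'a';  -- the first two alternatives share a long prefix,
    // so without memoization S would be re-parsed at the same position after backtracking.
    bool S() => Memo(1, () => {
        ++calls;
        int i0 = i;
        if (Lit('a') && S() && Lit('b')) return true; i = i0;
        if (Lit('a') && S() && Lit('c')) return true; i = i0;
        return Lit('a');
    });

    public static (bool ok, int calls) Run(string src)
    {
        var p = new PackratSketch(src);
        bool ok = p.S() && p.i == p.s.Length;
        return (ok, p.calls);
    }

    static void Main()
    {
        Console.WriteLine(Run("aac"));   // (True, 3): one body execution per position
    }
}
```

For "aac" the rule body runs only once per input position (3 times in total); the second alternative finds the inner S result already cached.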

Parsing Expression Grammars Tuning


A set of grammar rules can recognize a given language. But the same language can be described by many different grammars, even within the same formalism (e.g. PEG grammars). Grammar modifications can be used to meet the following goals:

[1] More informative tree nodes

Before modification:

[1]^^string: '"' ('\\' ["'bfnrt\\] / !'"' .)* '"';

After modification:

[1]^^string: '"' (escs/chars)* '"';
[3]^^escs:   ('\\' ["'bfnrt\\])*;
[4]^^chars:  (!["\\] .)*;

[2] Better placed error indication

Before modification:

'/*' (!'*/' . )* ('*/' / FATAL<"comment not closed before end"> );

After modification:

'/*' ( (!'*/' .)* '*/' / FATAL<"comment not closed before end"> );

[3] Faster grammar (reduce calling depth)

Before modification:

[10]string:  '"' char* '"' ;
[11]char:    escape / !(["\\]/control)unicode / !'"' FATAL<"illegal character">;
[12]escape:  '\\' ["\\/bfnrt];
[17]control: [#x0-#x1F];
[18]unicode: [#x0-#xFFFF];

After modification:

[10]string: '"' ( '\\'["\\/bfnrt]
                / [#x20-#x21#x23-#x5B#x5D-#xFFFF]
                / !'"' FATAL<"illegal character">
                )* '"';

[4] Faster grammar, less backtracking (left factoring)

Before modification:

[1] if_stat: 'if' S '(' expr ')' stat 'else' S stat
           / 'if' S '(' expr ')' stat;

After modification:

[1] if_stat: 'if' S '(' expr ')' stat ('else' S stat)?;

Remarks:

[1] More informative tree nodes can be obtained by syntactical grouping of grammar elements so that postprocessing is easier. In the above example, access to the content of the string is improved by grouping consecutive non-escape characters into one syntactical unit.

[2] The source reference place which is given by an error message is important. In the example of a C comment which is not closed until the end of the input, the error message should be given where the comment opens.

[3] Reducing calling depth means inlining of function calls, since each rule corresponds to one function call in our PEG implementation. Such a transformation should only be carried out for hot spots, otherwise the expressiveness of the grammar gets lost. Furthermore, some aggressively inlining compilers may do this inlining for you. Reducing calling depth may be questionable, but left factorization certainly is not: it not only improves performance but also eliminates potentially disruptive backtracking. When embedding semantic actions into a PEG parser, backtracking should in many cases not occur at all, because undoing semantic actions may be tedious.

Comparison of PEG Parsing with other Parsing Techniques


Most parsing strategies currently in use are based on the notion of a context free grammar. (The following explanations closely follow the material of the Wikipedia article on context free grammars [3].) A context free grammar consists of a set of rules similar to the set of rules of a PEG parser. But context free grammars are interpreted quite differently than PEG grammars. The main difference is the fact that context free grammars are nondeterministic, meaning that

1. alternatives in context free grammars can be chosen arbitrarily, and
2. nonterminals can be substituted in an arbitrary order (substitution means replacing a nonterminal on the right hand side of a rule by the definition of the nonterminal).

By starting with the start rule and choosing alternatives and substituting nonterminals in all possible orders, we can generate all the strings which are described by the grammar (also called the language described by the grammar).

With the context free grammar


S : 'a' S 'b' | 'ab';

e.g. we can generate the following language strings

ab, aabb, aaabbb, aaaabbbb, ...

With PEG we cannot generate a language; we can only recognize an input string. The same grammar interpreted as a PEG grammar
S: 'a' S 'b' / 'ab';

would recognize any of the following input strings


ab aabb aaabbb aaaabbbb

It turns out that the nondeterministic nature of context free grammars, while being indispensable for generating a language, can be a problem when recognizing an input string. If an input string can be parsed in two different ways, we have the problem of ambiguity, which must be avoided by parsers. A further consequence of nondeterminism is that a context free input string recognizer (a parser) must choose a strategy for substituting nonterminals on the right hand side of a rule. To recognize the input string
1+1+a

with the context free rule


S: S '+' S | '1' | 'a';

we can, for example, use the following substitutions:

S '+' S -> (S '+' S) '+' S
        -> (('1') '+' S) '+' S
        -> (('1') '+' ('1')) '+' S
        -> (('1') '+' ('1')) '+' ('a')

This is called a leftmost derivation. Or we can use the substitutions:

S '+' S -> S '+' (S '+' S)
        -> S '+' (S '+' ('a'))
        -> S '+' (('1') '+' ('a'))
        -> ('1') '+' (('1') '+' ('a'))

This is called a rightmost derivation. A leftmost derivation parsing strategy is called LL, whereas a rightmost derivation parsing strategy is called LR (the first L in LL and LR stands for "parse the input string from Left", but who would try it from the right?). Most parsers in use are either LL or LR parsers. Furthermore, grammars used for LL parsers and LR parsers must obey different rules. A grammar for an LL parser must never use left recursive rules, whereas a grammar for an LR parser prefers immediate left recursive rules over right recursive ones. The C# grammar e.g. is written for an LR parser. The rule for a list of local variables is therefore:

local-variable-declarators:
      local-variable-declarator
    | local-variable-declarators ',' local-variable-declarator;

If we want to use this grammar with an LL parser, then we must rewrite this rule to:
local-variable-declarators: local-variable-declarator (',' local-variable-declarator)*;
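The rewritten rule maps directly onto one call followed by a loop in a hand-written parser. The following is a minimal stand-alone sketch (hypothetical names, not the article's PegCharParser API); for brevity a declarator is reduced to a single letter:

```csharp
using System;

class ListRuleSketch
{
    // Illustrative sketch of: declarators: declarator (',' declarator)* ;
    static string src;
    static int pos;

    static bool Char(char c)
    {
        if (pos < src.Length && src[pos] == c) { pos++; return true; }
        return false;
    }

    static bool Declarator()           // stand-in: a single letter
    {
        if (pos < src.Length && char.IsLetter(src[pos])) { pos++; return true; }
        return false;
    }

    static bool Declarators()          // declarator (',' declarator)*
    {
        if (!Declarator()) return false;
        while (true)                   // the * repetition becomes a loop
        {
            int save = pos;            // backtrack point for a failed sequence
            if (!(Char(',') && Declarator())) { pos = save; return true; }
        }
    }

    static bool Recognize(string input)
    {
        src = input; pos = 0;
        return Declarators() && pos == src.Length;   // require a full match
    }

    static void Main()
    {
        Console.WriteLine(Recognize("a,b,c"));   // True
        Console.WriteLine(Recognize("a,,c"));    // False
    }
}
```

No left recursion is needed: the repetition consumes the comma-separated tail iteratively.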

Coming back to the original context free rule


S: S '+' S | '1' | 'a';

and interpreting it as a PEG rule


S: S '+' S / '1' / 'a';

we do not have to perform substitutions or choose a parsing strategy to recognize the input string
1+1+a

We only have to follow the execution rules for a PEG, which translates to the following steps:

1. Set the input position to the start of the input string.
2. Choose the first alternative of the start rule (here: S '+' S).
3. Match the input against the first component of the sequence S '+' S.
4. Since the first component is the nonterminal S, call this nonterminal.

This obviously results in infinite recursion. The rule S: S '+' S / '1' / 'a'; is therefore not a valid PEG rule. But almost any context free rule can be transformed into a PEG rule. The context free rule S: S '+' S | '1' | 'a'; translates to the valid PEG rule
S: ('1'/'a')('+' S)*;
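A PEG rule like this corresponds almost one-to-one to a recursive C# method: the prioritized choice / becomes a short-circuit ||, a sequence becomes &&, and the repetition becomes a loop with a backtrack point. The following is a minimal stand-alone sketch of a recognizer for this rule (illustrative only, not the PegCharParser API from the article):

```csharp
using System;

class PegSketch
{
    // Sketch of the PEG rule: S: ('1'/'a') ('+' S)* ;
    static string src;
    static int pos;

    static bool Char(char c)           // match one literal character
    {
        if (pos < src.Length && src[pos] == c) { pos++; return true; }
        return false;
    }

    static bool S()
    {
        if (!(Char('1') || Char('a'))) return false;   // prioritized choice via ||
        while (true)                                   // zero-or-more repetition
        {
            int save = pos;                            // backtrack point
            if (!(Char('+') && S())) { pos = save; return true; }
        }
    }

    static bool Recognize(string input)
    {
        src = input; pos = 0;
        return S() && pos == src.Length;               // require a full match
    }

    static void Main()
    {
        Console.WriteLine(Recognize("1+1+a"));   // True
        Console.WriteLine(Recognize("1++a"));    // False
    }
}
```

Note how the mutual recursion (S calls S), the short-circuit boolean operators and the explicit backtrack point mirror the PEG semantics described in the introduction.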

One of the following chapters shows how to translate a context free rule into a PEG rule. The following table compares the prevailing parser types.

Parser Type    Sub Type         Scanner   Lookahead   Generality   Implementation         Examples
Context Free   LR-Parser        yes       1           high         table driven           hand-computed table
Context Free   SLR-Parser       yes       1           low          table driven
Context Free   LALR(1)-Parser   yes       1           medium       table driven           YACC, Bison
Context Free   LL-Parser        yes       k           high         code or table driven
Context Free   LL(1)-Parser     yes       1           medium       code or table driven   predictive parsing
Context Free   LL(k)-Parser     yes       k           high         code or table driven   ANTLR, Coco-R
Context Free   LL(*)-Parser     yes       unlimited   high+        code or table driven   boost::spirit
PEG            PEG-Parser       no        unlimited   very high    code preferred         Rats, Packrat, Pappy

The reason that the above table qualifies the generality and power of PEG as very high is the PEG operators & (peek) and ! (not). It is not difficult to implement these operations, but heavy use of them can impair parser performance, and earlier generations of parser writers carefully avoided such features because of the implied costs. When it comes to runtime performance, the differences between the above parsing strategies are not so clear. LALR(1) parsers can be very fast, and the same is true for LL(1) parsers (predictive parsers). When using LL(*) and PEG parsers, runtime performance depends on the amount of lookahead actually used by the grammar. Special versions of PEG parsers (packrat parsers) can guarantee linear runtime behaviour (meaning that doubling the length of the input string just doubles the parsing time). An important difference between LR parsers and LL or PEG parsers is the fact that LR parsers are always table driven. A manually written parser is therefore in most cases either an LL parser or a PEG parser. Table driven parsing puts parsing into a black box which allows only limited user interaction. This is not a problem for a one-time, clearly defined parsing task, but it is not ideal if one frequently corrects, improves and extends the grammar, because for a table driven parser changing the grammar means a complete table and code regeneration.
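The packrat idea mentioned above can be sketched in a few lines: memoize each rule's outcome (success flag and end position) per input position, so that no (rule, position) pair is ever evaluated twice. The following stand-alone C# fragment is illustrative only (it is not part of the article's library); it adds such a memo table to a recognizer for S: ('1'/'a')('+' S)*:

```csharp
using System;
using System.Collections.Generic;

class PackratSketch
{
    static string src;
    static int pos;
    // memo[startPosition] = (success, endPosition) for rule S at that position
    static Dictionary<int, Tuple<bool, int>> memoS = new Dictionary<int, Tuple<bool, int>>();

    static bool Char(char c)
    {
        if (pos < src.Length && src[pos] == c) { pos++; return true; }
        return false;
    }

    static bool S()   // memoizing wrapper around the rule body
    {
        int start = pos;
        Tuple<bool, int> hit;
        if (memoS.TryGetValue(start, out hit))        // reuse an earlier result
        { pos = hit.Item2; return hit.Item1; }
        bool ok = Body();
        memoS[start] = Tuple.Create(ok, pos);         // store result and end position
        return ok;
    }

    static bool Body()   // S: ('1'/'a') ('+' S)* ;
    {
        if (!(Char('1') || Char('a'))) return false;
        while (true)
        {
            int save = pos;
            if (!(Char('+') && S())) { pos = save; return true; }
        }
    }

    static bool Recognize(string input)
    {
        src = input; pos = 0; memoS.Clear();
        return S() && pos == src.Length;
    }

    static void Main()
    {
        Console.WriteLine(Recognize("1+1+a"));   // True
    }
}
```

Since every (rule, position) pair is computed at most once, total work is proportional to the input length times the number of rules, which is the linear-time guarantee of packrat parsing; the price is the memory needed for the memo tables.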

Translating LR grammars to PEG Grammars


Most specifications for popular programming languages come with a grammar suited for an LR parser. LL and PEG parsers cannot directly use such grammars because of left recursive rules. Left recursive rules are forbidden in LL and PEG parsers because they result in infinite recursion. Another problem with LR grammars is that they often use alternatives with the same beginning. This is legal in PEG but results in unwanted backtracking. The following table shows the necessary grammar transformations when going from an LR grammar to a PEG grammar.

Transformation category: Immediate Left Recursion => Factor out non recursive alternatives

LR rule:

// s1, s2 are terms which are not left recursive and not empty
A: A t1 | A t2 | s1 | s2;

~PEG rule (result of transformation):

A: (s1 | s2) (t1 | t2)*

Transformation category: Indirect Left Recursion => Transform to Immediate Left Recursion => Factor out non recursive alternatives

LR rule:

A: B t1 | s1;
B: A t2 | s3;

~PEG rule (result of transformation):

// we substitute B by its right hand side
A: (A t2 | s3) t1 | s1;
// eliminate the immediate left recursion
A: (s3 t1 | s1) (t2 t1)*;

Transformation category: Alternatives with same beginning => Merge alternatives using Left Factorization

LR rule:

A: s1 t1 | s1 t2;

~PEG rule (result of transformation):

A: s1 (t1 | t2);

The following sample shows the transformation of part of the "C" grammar, from the LR grammar as presented in Kernighan and Ritchie's book on "C" to a PEG grammar (the symbol S is used to denote scanning of white space).

LR grammar ("C" declarator snippet):

declarator: pointer? direct_declarator;

PEG grammar:

declarator: pointer? direct_declarator;

LR grammar:

direct_declarator: identifier
    | '(' declarator ')'
    | direct_declarator '[' constant_expression? ']'
    | direct_declarator '(' parameter_type_list ')'
    | direct_declarator '(' identifier_list? ')';

PEG grammar:

direct_declarator: (identifier / '(' S declarator ')' S)
    ( '[' S constant_expression? ']' S
    / '(' S parameter_type_list ')' S
    / '(' S identifier_list? ')' S )*;

LR grammar:

pointer: '*' type_qualifier_list? | '*' type_qualifier_list? pointer;

PEG grammar:

pointer: '*' S type_qualifier_list? pointer?;

LR grammar:

parameter_type_list: parameter_list | parameter_list ',' '...';

PEG grammar:

parameter_type_list: parameter_list (',' S '...')?;

LR grammar:

type_qualifier_list: type_qualifier | type_qualifier_list type_qualifier;

PEG grammar:

type_qualifier_list: type_qualifier+;

LR grammar:

parameter_declaration: declaration_specifiers declarator | declaration_specifiers abstract_declarator?;

PEG grammar:

parameter_declaration: declaration_specifiers (declarator / abstract_declarator?);

LR grammar:

identifier_list: identifier | identifier_list ',' identifier;

PEG grammar:

identifier_list: identifier S (',' S identifier)*;
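Transformed rules like these map directly onto recursive functions. As an illustration (a hand-written sketch, not the article's generated parser), the transformed pointer and type_qualifier_list rules can be coded as follows, with the qualifier keywords reduced to 'const' and 'volatile':

```csharp
using System;

class PointerRuleSketch
{
    // Sketch of: pointer: '*' S type_qualifier_list? pointer? ;
    //            type_qualifier_list: type_qualifier+ ;
    static string src;
    static int pos;

    static bool Lit(string s)          // match a literal string
    {
        if (pos + s.Length <= src.Length && src.Substring(pos, s.Length) == s)
        { pos += s.Length; return true; }
        return false;
    }

    static bool S()                    // skip white space, always succeeds
    {
        while (pos < src.Length && char.IsWhiteSpace(src[pos])) pos++;
        return true;
    }

    static bool TypeQualifier()
    {
        return (Lit("const") || Lit("volatile")) && S();
    }

    static bool TypeQualifierList()    // type_qualifier+
    {
        if (!TypeQualifier()) return false;
        while (TypeQualifier()) { }
        return true;
    }

    static bool Pointer()              // '*' S type_qualifier_list? pointer?
    {
        if (!Lit("*")) return false;
        S();
        TypeQualifierList();           // optional: result deliberately ignored
        Pointer();                     // optional tail recursion replaces left recursion
        return true;
    }

    static void Main()
    {
        src = "* const * volatile *"; pos = 0;
        Console.WriteLine(Pointer() && pos == src.Length);   // True
    }
}
```

The optional trailing recursion in Pointer() consumes the same strings as the original left recursive LR rule, but without ever calling itself before consuming input.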

Future Developments
The planned series of articles consists of the following parts:

Subject                       Planned Release Date   Description
Parser and Parser Generator   October 2008           C# classes to support work with Parsing Expression Grammars
PEG Debugger/Interpreter      January 2009           Direct interpretation and debugging of PEG grammars without the need to generate a C# module
Sample Applications           June 2009              More grammar samples with postprocessing

History

2008 October: initial version
2008 October: minor update
  - improved semantic block support (using clause, IDisposable interface)
  - added new sample parser Python 2.5.2

References
[1] Bryan Ford: Parsing Expression Grammars, MIT, January 2004
[2] Parsing expression grammar, Wikipedia
[3] Context-free grammar, Wikipedia
[4] ITU-T X.690: Specification of BER, CER and DER, International Telecommunication Union
[5] RFC 4627: The application/json Media Type for JavaScript Object Notation (JSON)
[6] Introducing JSON

License
This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Martin.Holzherr

Switzerland

Comments and Discussions


Great work! (Pedro J. Molina, 17:15 8 Feb '11)
Congratulations Martin! I enjoyed a lot reading your educational article. Amazing work! Are you continuing with the roadmap on PEGs and creating the debugger and the interpreter? I am looking forward to seeing more...
Pedro J. Molina, PhD
http://pjmolina.com/metalevel

Boost.Spirit (OvermindDL1, 7:53 18 Aug '09)
Actually Boost.Spirit (now known as Classic Spirit) is a PEG parser (regardless of what the docs say) and it does have the & and ! operators in exactly the format you specified. They did not call it a PEG parser because the creator of the library did not know that PEG parsers existed. Boost.Spirit 2.1 is worlds faster than Classic Spirit, is even more capable and powerful, and the docs actually call it what it is now, a PEG parser (which it has been all along). I have to say that your code looks a *lot* like Boost.Spirit 2.1, just a lot longer due to the lack of operator overloading, although Boost.Spirit 2.1 handles semantic actions a lot better and more powerfully. You should probably look at Boost.Spirit 2.1 and you could take some ideas from it (if C# is capable of such power). Since C# templates do not have the full power of C++ templates it may be more verbose, but you should still be able to emulate much of the power, which would be useful for C# programmers since they lack the power of the Boost versions otherwise.

Great Stuff ! (LarsAC, 7:14 27 Jul '09)
Dear Martin, this is really a great article and even comes along with excellent code. The code gave me a good jump start into PEGs and has permitted me to successfully tackle the context-dependent grammar of the legacy software I'm working on.

Many thanks, Lars

BITS and CodePoint Terminal + byte order (Leblanc Meneses, 21:42 4 Jul '09)
I will be adding a BITS terminal into my own grammar and wanted to see what you think of my proposed syntax (since you have probably already used BITS in a real project). This is my planned implementation syntax.

Will consume 1 byte (msb to lsb):

BYTE(1){
  {?<bit8> #b0 },      // bit 8 captured if 0
  { #b1 },             // not captured, but bit 7 requires it to be 1 for this BYTE terminal to return true (parsed)
  {?<bit6_4> #b0X1 },  // bit 6 = 0, bit 5 = don't care, bit 4 = 1
  {?<bit3> #bX }       // don't care but capture
  // bits 2 and 1 will not be captured and are don't care by default (since not compared)
}

Data not in byte boundaries will consume 2 bytes (msb to lsb); big endian byte order, so input[i] (MSB), input[i+1] (LSB):

BYTE(2)[bigendian]{
  {?<high> #b1111 },             // bits 16,15,14,13 require that they be set to 1
  {?<BoundaryTest> #b1XX11XX1 }, // bits 12,9,8,5 require that they be set to 1; bits 11,10,7,6 are don't care; all 8 bits are captured into the BoundaryTest ast node
  {?<low> #bXXXX }               // bits 4,3,2,1 are don't care and captured into an ast node named low
}

After implementing the bits syntax I have been considering adding support for don't cares to the CodePoint terminal. Example: #xFX, #b1111XXXX for hex and binary versions. What do you think about that idea?

Lastly, how do your algorithms consider byte order? I've currently added a ByteOrder enumeration to the input iterator interface, which I plan to use in BITS/CodePoint where multiple bytes need to be grouped for comparison. How are you taking byte order into account? One problem I have run into is that C# uses 16 bit chars (didn't know this, I always thought char was 8 bits). I'm thinking of making my concrete InputIterator only return 8 bit chars; this will make handling UTF-8 much easier IMHO. lm

Re: BITS and CodePoint Terminal + byte order (Martin.Holzherr, 5:12 6 Jul '09)
Your proposed syntax is very detailed. I think your best invention is the don't care specification. Personally I would try to use fewer constructs and try to optimize the parser so that it can handle combinations of simple syntactic BITS constructs in an efficient way. Your remark concerning ByteOrder and the use of char as the input element type can be answered the following way: when using constructs like BITS (in your case BYTE) one should treat the input stream as a stream of bytes, not as a stream of characters. My sample grammar for BER decoding has the following header: <<Grammar Name="BER" encoding_class="binary">>, where binary means read it into an array of bytes. Binary formats like e.g. the Basic Encoding Rules normally also define the byte order. The correct input element datatype for C# binary streams is byte, not single character chars. I will now try to compare your constructs for reading byte-level data against my constructs as described in the article (remark: the . in my BITS examples means an unchecked byte):

Your proposal                 Description                                     My construct
{?<bit8> #b0}                 Peek at next byte and expect bit 8 to be 0,     &BITS<8,#b0,:bit8>
                              save this bit in variable bit8
{ #b1}                        Bit 7 must be 1                                 BITS<7,#b1>
{?<bit6_4> #b0X1}             Peek and test: bit 6 must be 0                  &BITS<4,#b1>&BITS<6,#b0>
                              and bit 4 must be 1
{?<high> #b1111}              Bits 16,15,14,13 must be 1 (not a good          &BITS<5-8,#b1111>
                              idea, look only at one byte)
{?<BoundaryTest> #b1XX11XX1}  Save byte in variable and test that             &BITS<4,1>&BITS<1,1>BITS<1-8,.,:firstByte>
                              bits 12,9,8,5 are set to 1                      &BITS<8,1>&BITS<5,1>BITS<1-8,.,:secondByte>

Re: BITS and CodePoint Terminal + byte order (Leblanc Meneses, 0:05 8 Jul '09)
I like the use of the & predicate, it is consistent with the rules.
> binary streams is byte not single character chars
Agree, I will need to modify the input iterator.
> ByteOrder
If a rule requires multiple bytes in a comparison, as I show in BoundaryTest, then depending on the architecture the bytes might be in a different order. Take Int32 => 4 bytes and you had access to raw (char* ptr). By writing the grammar file so that it explicitly specifies byte order, [bigendian] or [littleendian] (scoped at the top of the file), the parser can internally compare the architecture of the machine with the grammar endianness and decide how to compare the data correctly. Example rule: #F1F2F3F4, and do the comparison taking byte order into account. The grammar-specified byte order can be given at the top of the grammar file. What are your thoughts on this subject?

> :bit8, :firstByte, :secondByte
I assume that BITS<1-8,.,:firstByte> is done because individual ast terminals cannot be created per bit? In other words, for the same byte you can't create an ast node per bit. Example:

ForByte(1) - when this terminal completes, move the iterator one byte
BITS<1,:IsItem1Enabled>
BITS<2,:IsItem2Enabled>
BITS<3,:IsItem3Enabled>
BITS<4,:IsItem4Enabled>
reserved

-- iterator is moved to next byte

ForByte(1) - when this terminal completes, move the iterator one byte
BITS<1,:IsItem35Enabled>
BITS<2,:IsItem36Enabled>
BITS<3,:IsItem37Enabled>

So you place the responsibility on the ast node containing the byte to reparse individual bits for data. (Taking this approach would make the algorithm much simpler than I had originally proposed.)

Re: BITS and CodePoint Terminal + byte order (Martin.Holzherr, 4:09 8 Jul '09)
Personally I do not see the necessity for a grammar with byte order pragmas. I often process binary files with custom formats for telecommunication applications with strange encodings coming from byte order problems, e.g. there is a data type "technical BCD" which is BCD where the high nibble and low nibble are swapped. But a byte order pragma would not help me much with such constructs, because once a thing like "technical BCD" is defined it never changes, and I do not get files which are encoded differently. Your case, that you apply the same grammar to input files which come from different machine architectures, is in most cases covered by a preprocessing step where you transform the whole file to the expected byte order before parsing it. The grammar element BITS<...> indeed always reads a whole byte. To read different things within this byte you have to read it more than once. This can be done with the lookahead operator &. &BITS<1-3,#b010>&BITS<5,#b1> indeed reads the next byte two times (conceptually). In reality the parser can optimize it and read it only once and extract both pieces of information, but this is only an optimization. Furthermore, I do not generate ast nodes or tree nodes if not explicitly specified. The construct BITS<4,:IsItem4Enabled> does not generate an ast node but just writes the value of bit 4 into the host variable (C# variable) IsItem4Enabled. Ast-node and tree-node generation should be kept to a minimum, since the needed space is huge (especially if you allocate a node for a single bit).

string with error notification (Mizan Rahman, 4:46 13 May '09)
Hi Martin, as I mentioned before I am working on an EBNF based parser. Here I am in a dilemma with double-quoted string recognition. I would also need to report an error if the terminating double-quote is missing. In this article you use the FATAL or WARNING construct to notify the user about some kind of problem. In my solution I am simply allowing

the user to report an integer value through a NOTIFY construct. It is up to the host application to take action based on the notification code. So back to my problem: if the terminating double-quote is missing, I would like to report an error. The position should be the position of the starting double-quote. So far I have not found a form that can report at the correct position. Here are several forms of the rule. Please give me your thoughts on this issue. Form 1:
'"' ( . - '"' )* '"'

[dot means any character]

Fact 1: It can span multiple lines.
Fact 2: If the source does not contain the terminating double-quote, it will continue to consume input until it reaches end-of-file and eventually give up (no match).
Fact 3: No error reporting.

Form 2:
'"' ( . - ('"' | end-of-line))* '"'

Fact 1: It can NOT span multiple lines.
Fact 2: If the source does not contain the terminating double-quote, it will continue to consume input until it reaches end-of-line and eventually give up (no match).
Fact 3: No error reporting.

Form 3:
'"' ( . - ('"' | end-of-line))* ('"' | NOTIFY(1) )

Fact 1: It can NOT span multiple lines.
Fact 2: If the source does not contain the terminating double-quote, it will continue to consume input until it reaches end-of-line and eventually accept the notification (match).
Fact 3: Reports error.
Fact 4: Notification position will be right before the end-of-line.

Form 4:
'"' ( ( ( . - ('"' | end-of-line))* '"' ) | NOTIFY(1) )

Fact 1: It can NOT span multiple lines.
Fact 2: If the source does not contain the terminating double-quote, it will continue to consume input until it reaches end-of-line and eventually accept the notification (match).
Fact 3: Reports error.
Fact 4: Notification position will be right after the starting double-quote.

Regards, Mizan

Re: string with error notification (Martin.Holzherr, 7:11 13 May '09)
The following PEG grammar gives 6 different grammar rules for strings (multiline and singleline, with and without diagnostics). In case of an error and existing diagnostics, a fatal error is issued. The last rule ([6]) even gives an error message at the position before the string starts. It uses the lookahead operator &:
<<Grammar Name="VSPR" encoding_class="ascii" comment="various string parsing rules">>
[1] multiline_string_silent: '"' (!'"' .)* '"';
[2] singleline_string_silent: '"' (!["\n] .)* '"';
[3] singleline_string_diagnostic: '"' ( (!["\n] .)* '"'
      / FATAL<"missing string termination"> ); // error message at the start of the string
[4] singleline_string_diagnostic1: '"' ( ( '\n' FATAL<"end of line in string"> // error message at the place of the line break
      / !'"' . )* '"'
      / FATAL<"missing string termination"> ); // error message just after opening "
[5] multiline_string_diagnostic: '"' ((!'"' .)* '"'
      / FATAL<"missing string termination"> ); // error message at the start of the string
[6] multiline_string_diagnostic1: (&'"' multiline_string_silent
      / FATAL<"missing string termination"> ); // error message before start of the string
<</Grammar>>

Hope this gives you some inspiration about the possibilities.

Re: string with error notification (Martin.Holzherr, 7:17 13 May '09)
The last rule in my previous answer contained wrong parenthesizing. It is corrected here:
<<Grammar Name="VSPR" encoding_class="ascii" comment="various string parsing rules">> [6] multiline_string_diagnostic1: &'"' (multiline_string_silent / FATAL<"missing string termination"> ); // error message before start of the string <</Grammar>>

This rule shows how to issue an error message before the first character of a recognized language element. The only possible way to issue an error message before the first character is to use a lookahead operator & as done above.

Re: string with error notification (Mizan Rahman, 10:14 13 May '09)
Thanks for the reply. I think I will use the look-ahead operator.

AnyArrangement nonterminal [modified] (Leblanc Meneses, 1:24 28 Apr '09)
Today I wanted to allow a user to send options in any order:
- i - ignore case
- r - recursive
- n - number
Since \i \r \n are optional and can be placed in any order, the total possible combinations are:
\i \i\r \i\r\n \i\n \i\n\r \r \r\i \r\i\n

\r\n \r\n\i \n \n\i \n\i\r \n\r \n\r\i

which could become tedious if solved using the prioritized choice '/':

'\i'?'\r'?'\n'? / '\i'?'\n'?'\r'? / '\r'?'\i'?'\n'? / '\r'?'\n'?'\i'? / '\n'?'\i'?'\r'? / '\n'?'\r'?'\i'?

Even with 3 variables... what if I had more vars to support?

I'm thinking of adding a TryAnyArrangement function into the grammar: TryAnyArrangement( '\i'? / '\r'? / '\n'? ). TryAnyArrangement would not match \i\i\i\i\r\n; it would only match a single instance of each expression if possible.

Before I litter the grammar with this optional nonterminal I wanted to see if you have another solution to this problem.

Re: AnyArrangement nonterminal (Martin.Holzherr, 3:11 28 Apr '09)
In my opinion, the solution to a grammatical problem is only seldom to introduce a new construct. On the contrary, one should keep grammatical constructs to a minimum, as has been done e.g. by Bryan Ford who invented the Parsing Expression Grammar.

If a new problem occurs, the best approach is to search for similar problems solved by other people. The problem of the arbitrary order of elements in a sequence e.g. is well known from the grammar of the programming language "C" (and as consequence of the programming languages "Java" and "C#" which borrowed the grammar from "C"). In "C" you can write
static const int* f(); const static int* f(); int const static* f();

Each of these constructs is equally valid. The solution taken by Kernighan and Ritchie is not an exhaustive enumeration of grammatical variants. Instead of enumerating all legal possibilities, they just allow static, const and int at the beginning of a declaration in any order and in any multiplicity. This means that their grammar would also allow the following illegal constructs:
static const int int* f(); static const static int* f();

After successful parsing, the compiler must check for such illegal cases. In my opinion this is a solution that makes sense, since it is an illusion that one could express all meaningful situations in a context free grammar. A simplified grammar for "C" declarations, where only the keywords static, const and int would be supported, could therefore have the following rules:
declaration: declaration_specifiers declaration_rest; declaration_specifiers: ('const' / 'int' / 'static')*;

(I deliberately left out the productions for parsing spaces in the above grammar.) Hope you got the point.

Re: AnyArrangement nonterminal (Leblanc Meneses, 10:35 28 Apr '09)
Are these hints you're dropping?.. I'm wondering if you're planning on showing a C parser in your upcoming article. Since this AnyUnordered event happens so often, wouldn't it make sense to put it into the grammar? This would find compiler errors now vs some analysis stage. I might implement a strategy like Tasks in MSBuild to let users define functions like these. This way it could be included based on the tasks referenced. -lm

Re: AnyArrangement nonterminal (Martin.Holzherr, 11:15 28 Apr '09)
A "C" parser is already part of the current article project. It can be accessed via the Grammar selection "C Kernighan&Ritchie 2" from the Parsing Expression Grammar Explorer. It expects a preprocessed input file.

In my opinion, grammar support by a given parser framework is always a compromise between the strive for minimality and comfort functions. The set of constructs should be minimal, since the user who works with these constructs must master them and must understand the underlying logic. On the other hand, common operations with enough generality are certainly meaningful. One extreme is represented by the EBNF grammar rules used to describe Pascal (a programming language learned in schools in the 70's and 80's), where only the constructs sequence, alternative, open number of repetitions and option were supported. The PEG parser framework presented in this article supports far more constructs: sequence, alternative, option, lookahead (peek) and not operator, zero or more repetitions, one or more repetitions, repetitions with a lower and upper repetition bound, generic rules (taking expressions as parameters), character sets like those used in regular expressions, the dot (.) for matching any character, tree and ast construction, and support for error handling in the form of fatal errors and expected elements. In my opinion these are enough constructs. If you arbitrarily add more and more constructs you end up with something reminiscent of Perl regular expressions, which are so rich that only few programmers know all the possibilities. In my opinion more constructs not only make it more difficult for the user but also introduce more occasions for errors.

grammer help (Mizan Rahman, 18:18 8 Apr '09)
Hi Martin, I need an E/BNF grammar for prefix, infix and postfix operators. Something like this:
infix := unary (( '+' | '*' | '/' | '-' | ) unary) * prefix := ( '+' | '-' | '~' | '^' ) prefix | expr postfix := postfix ( '+' | '-' | '~' | '^' ) | expr unary := prefex | postfix expr := binary | integer

I am not sure if the above is correct. Could you please help me with this? The grammar should NOT consider operator precedence at all - just a generic one. Thank you in advance. Mizan

Re: grammer help (Martin.Holzherr, 3:42 9 Apr '09)
Hi Mizan,

there are many possible grammars for generic expressions. Different parsers require different grammars. Your grammar is left recursive (assuming that infix in your grammar is the same as binary) and is therefore perhaps suited for an LR or LALR parser (e.g. YACC). But anyway, in my opinion your grammar lacks the hierarchical structure which most grammars for programming languages expose. The following grammar is a PEG grammar (white space is part of the grammar) and therefore avoids any left recursive rule. I derived it from the grammar for the "C" programming language by leaving out productions which consider operator precedence. Here is the grammar:
<<Grammar Name="generic_expression">>
[1] expr: S unary (bin_op S unary)*;
[2] unary: unary_op S unary / postfix;
[3] postfix: primary postfix_operator*;
[4] postfix_operator: '++' S / '--' S / indexing / call_args;
[5] indexing: '[' S expr ']' S;
[6] call_args: '(' S expr ')' S;
[7] primary: (integer / '(' S expr ')' / ident) S;
[8] S: [ \t\r\n]*;
[9] integer: [0-9]+;
[10] ident: [a-zA-Z_][a-zA-Z_0-9]*;
[11] unary_op: '-'/'+'/'~'/'!';
[12] bin_op: '-'/'+'/'*'/'/';
<</Grammar>>

A sample input for this grammar would be:


(a[4]+13)*fibonacci(5)/gcd(1084,384)

The grammar should also be suited for an LR or LALR parser. You would just have to remove the production S, replace / by | and perhaps replace : by ::= . Hope this helps. Best regards, Martin

Very interessting article! (JohnDoeJohnDoe, 8:35 19 Jan '09)
Hello Martin, I came across your article while searching the web for useful PEG implementations. Your concept seems to be the best I've seen so far. You got my five! I am considering using it as part of an experimental implementation, hence I would like to know if there will be an update to your PEG Lib soon, so that I must not exchange the PEG Lib afterwards. Thanks a lot! J. Wall

Re: Very interessting article! (Martin.Holzherr, 9:17 19 Jan '09)
No, there will be no update to the PEG Lib in the next weeks, but there will be follow-up projects which use the PEG Lib as a base.

Download corrupt? (Wolfgang Meszar, 16:00 14 Jan '09)
Hi, the download is corrupt, please post it again. Thanks, W. Meszar

Re: Download corrupt? (Martin.Holzherr, 9:28 19 Jan '09)
I just downloaded it without any problem. Please try again.

XML Grammar ? (Leblanc Meneses, 10:03 12 Nov '08)
Hello Holzherr, since I don't have anyone else knowledgeable to ask this question, I'll ask it in this forum. How would someone go about parsing XML using PEG?
http://books.google.com/books?id=V4o9Tcf2YRoC&pg=PA366&lpg=PA366&dq=ebnf+grammar+xml+parser&source=web&ots=aG0Y5r79fB&sig=bvYYsTUCr9yE0ncXtS9vMHFPNs&hl=en&sa=X&oi=book_result&resnum=8&ct=result#PPA368,M1

It seems like there needs to be a back referencing capability like regex \1, \2, \3. Maybe I'm just missing the clever arrangement of predicates to parse XML properly without relying on a semantic analyzer.

Re: XML Grammar ? (Mizan Rahman, 15:19 2 Feb '09)
I have a question regarding a notation in the EBNF for XML located at http://www.w3.org/TR/REC-xml/. If you go to that location and search for "A - B" (without the double quotes), you will see:
A - B matches any string that matches A but does not match B.

My question is whether this notation is the same as the "Not Predicate" in PEG. It would be helpful if I could get some examples of what exactly this matches, because I have been studying EBNF for some time now and haven't seen this notation anywhere else (unless of course I really missed it). Regards, Mizan

Re: XML Grammar ? (Leblanc Meneses, 22:08 2 Feb '09)
'-' in a grammar means a class of characters; probably a typo. What I assume you mean is: match any string that does not contain 'B'. If this is what you mean, then (!('B'\i / S) .)+ : the ! looks at the current character but doesn't consume; '.' consumes any character, and this continues until a 'B', 'b' or space is found. !e . is a sequence, so both must be true before a character is consumed. Just so you know, '.' will always return true as long as you're not at the end of the character iterator. Download npeg and create a new test just to prove this to yourself. By the way, my npeg project supports parsing using dynamic back references, although I don't support it in the grammar yet (not sure where I left off). So today you can parse XML using npeg; look at the unit tests for an example of this. Also, I'm almost done with a C parser library which npeg will render using a visitor (so I can parse a protocol in an embedded system). I'm currently not sure how to process the character classes '[...]', '[^...]' which your C# project supports. Do you mind describing how you solved this so I don't reinvent the wheel? Thanks, Leblanc Meneses

Re: XML Grammar ? (Martin.Holzherr, 3:50 3 Feb '09)
The notation A - B, where A and B are nonterminals or expressions, is used in several places on w3c.org. A - B matches the same strings as are matched by A alone, but if a string is matched by both A and B it is not matched by A - B. In PEG this can be expressed by the notation !B A. The notation A - B is e.g. also supported by the Spirit parser framework (C++ Boost library). An advantage of the PEG notation compared with other EBNF extensions is the minimality of the underlying constructs. Why invent a notation like A - B if the notation !B A has the same effect? Furthermore, !A has many other uses. You will find a lot of examples of the construct A - B in the w3c.org grammars, but also in other grammars like those of Java and C#.

A sample in the xml grammar is:


[17] PITarget ::= Name - (('X' | 'x') ('M' | 'm') ('L' | 'l'))

This is equivalent to the PEG notation:


[17PEG] PITarget: !([Xx][Mm][Ll]) Name;

In the C#-Grammar you will find the rule:


single-regular-string-literal-character: Any character except " (U+0022), \ (U+005C), and new-line-character

This is equivalent to the PEG rule:


single_regular_string_literal_character: !["\\\r\n] . ;
