This is yet another parser, but with some unique features.
- The core algorithm is about 200 lines of code (excluding extensions and type signatures). Which makes it suitable for learning about parsing and easy to extend (which I did). It is directly taken from the Parsing with Zippers (Functional Pearl). I recommend to read it - it is quite short and explains algorithm pretty well. Or watch this video.
- It can handle left recursion and ambiguous grammars. Which makes it easier to use for newbies. One doesn't need to bend their grammar in order to fit restrictions of an algorithm.
- Extensions to the core algorithm add ability to express shape of the desired tree without need to massage it after parsing. Which makes it trully portable grammar - where is other parsers may need additional augmentation (A in ABNF) in hosting language.
- Can work with lexer (array of tokens as input) and without (array of chars, or string), so called scannerless parser.
As of now this is rather proof of concept of the idea. Parser works, but it speed is probably not the best.
I was looking for approachable parser. I read couple of papers about parsing and it is quite hard to understand what is happening there.
Then I found Parsing with Derivatives (A Functional Pearl). This algorithm was easy to follow. I managed to write my first parser. I learned a lot about parsers and formal grammars. Unfortunately algorithm is quite slow and doesn't produce tree in the shape that I want. I left it there
Later I found another paper Parsing with Zippers (Functional Pearl), which is the twist to the first paper. It promised to be more performant and I figured I can manipulate tree shape.
So I started my implementation. I wanted to build it from the ground up - to make sure I understand every bit. I implemented zippers, I implemented vizualization. Then I started experimenting with different extensions, like Kleene star, negative lookahed, ordered choice. I got carried away a bit.
After all I created small digital garden about parsing. See:
I had an idea about tree shape manipulation and adding PEG operators (negative lookahead, ordered choice) when I was implementing first version of PwZ (back in 2023). But recently I discovered Instaparse - which has all the same ideas. So I tried to compile it to JS. It works. But due to clojure runtime library weight is more than 200kb.
Fast forward to the present day. I decided to write cleaner version from scratch, omit the bits I'm not sure about and publish as a library.
The parser itslef is in the category of interpreters (oposite to parser generators). It's API is based on parser combinators (traditional for functional language parsers). Though DSL (parser combinators) has some caveats - that is why they are not exposed and instead one needs to write grammar to use parser.
Grammar itself is variation of regular extensions to BNF. Or you can call it Chomsky CFG with regular extensions.
Classical BNF (Backus–Naur form):
<S> ::= <S> "a" | ""
Chomsky CFG (context free grammar):
S -> S "a" | ""
Grammar used by this library:
S = S "a" | ""
(space) - concatenation|- unordered choice"a"- token aka string, terminal, character, lexemeS- symbol aka variable, non-terminal
And regular extensions (extensions comming from regular expressions)
a*- Kleene star. Any number ofaa+- Kleene "plus". One or moreaa?- Kleene "question". Zero or oneaa{...}- quantifiers#"[a-z]"- regular expression
And tree manipulation extension
<a>- the same as in instaparse[a]- collapse all underlying nodes in one node with all strings concatenated
Simplest example. Grammar: S = "a". Input: a
It produces exactly one node with tag S and value a.
Now let's try sequences (fixed length). Input aaa
S = "a" "a" "a" |
S = A A A; A = "a" |
S = A A A; <A> = "a" |
|---|---|---|
Pay attention there are no tags on as |
Sequences unlimited length. Input aaa
S = S "a" | "" |
S = "a" S | "" |
S = "a"* |
|---|---|---|
For example, list of arguments in function call foo(a,a,a). Input: a,a,a
S = A ("," A)*; A = "a" |
S = A (<","> A)*; A = "a" |
S = (A (<","> A)*)?; A = "a" |
|---|---|---|
| Allows zero items in the list |
Pay attention how * and <> allows precisely recreate tree structure without need to explicitly remove some nodes or flatten tree.
Quite primitive implementation - doesn't support escape characters, like \", \\, etc.
string = <"\""> (#"[^\"]")* <"\""> |
string = [<"\""> (#"[^\"]")* <"\"">] |
|---|---|
Full JSON grammar except string (it is not according spec)
<json> = ws value ws
<value> = null | true | false | array | object | number | string
null = <"null">
true = <"true">
false = <"false">
array = <"["> ws (value (ws <","> ws value)*)? ws <"]">
object = <"{"> ws (member (ws <","> ws member)*)? ws <"}">
member = string ws <":"> ws value
number = [integer fraction? exponent?]
onenine = #"[1-9]"
digit = "0" | onenine
integer = "-"? ("0" | onenine digit*)
fraction = "." digit+
exponent = ("E" | "e") ("+" | "-")? digit+
string = <"\""> [#"[^\"]"*] <"\"">
<ws> = <#"\\s"*>
Parse tree for {"i": 1, "s": "a", "a": [], "f": -0.1E+1, "o": {}}
- There is no reporting for syntax errors. It either can parse or can't, it doesn't give a clue why it can't parse. Which is the biggest pain point right now
- I want to investigate idea of macros (or functions), like in Rosie
- I wonder if it is possible to implement other operators, like negative lookahead, ordered choice, backreferences, limited negation.
- What about disambiguation filters?
- powderizer - main package
- demo - demo