Parsing molly server pages

A molly server page is divided into several sections. A code section can contain a hash section which can contain an expression section and so forth.

This suggests a recursive-descent procedure.

If our molly grammar was working with tokens, it may look like:

A  -> code_open_tok  foo  code_close_tok
   -> exp_open_tok   foo  exp_close_tok
where, for example, the code_open_tok is '[[' and exp_open_tok is '[='

The above grammar is already LL(1). Since we are working with code_open or exp_open tokens already, we don't have to transform the grammar - here the fact that exp_open and code_open share a common prefix of '[' is irrelevant because the regular-expression tokenizer will disambiguate between them.

Generally, we can move the analysis of tokens into the grammar itself. For example, to recognize the following sentences:

[[ foo  ]]
[  bar  ]
Our grammar looks like:
A -> [[ foo ]]
B -> [  bar  ]
To make this LL(1), our grammar now looks like:
A -> [ B

B -> [ C ]]
  -> D ]
Since there is no nesting heirachy to these 'tokens' the above can be parsed with a regular grammar. A combined tokenizer/parser approach is simpler for small grammars like molly pages. The tokenizer looks at as many characters as it needs to disambiguate tokens (for example, it reads 2 characters to distinguish between [[ and [!).

Based on the type of identified token, we use recursive-descent and call a method to further parse the construct that starts with that token.

The molly grammar is:

S                    ->  (HTML_Text | Molly)*
Molly                ->  Expression | Code | Declaration | Comment | DirectiveOrHdoc
Expression           ->  ExpOpen         EText             ExpClose
Declaration          ->  DeclarationOpen EText             DeclarationClose
Comment              ->  CommentOpen     EText             CommentClose
Code                 ->  CodeOpen        EText|Hash        CodeClose
Hash                 ->  HashOpen        EText|Expression  HashClose

HTML_Text            ->  Any chars. Any tags that open molly sections must be escaped.
EText                ->  Any chars. The closing tag (if any) must be escaped
                         within EText.

DeclarationOpen       ->  [!
DeclarationClose      ->  !]
CodeOpen              ->  [[
CodeClose             ->  ]]
ExpOpen               ->  [=
ExpClose              ->  ]
HashOpen              ->  #  
HashClose             ->  #
DeclarationOpen       ->  [include  
                          [include-file 
                          [include-decl
                          [forward
                          [import
DeclarationClose      ->  ]
CommentOpen           ->  [/*
CommentClose          ->  */]
DirectiveOrHdoc       ->  [page Directive|Heredoc
Directive             ->  DirectiveOpen DirectiveClose
Heredoc               ->  HeredocOpen ] HeredocClose
DirectiveOpen         ->  .. a=b, a = "c d" .. etc..
DirectiveClose        ->  ]
HeredocOpen           ->  <<<EText ]
HeredocClose          ->  EText (same as starting EText)

Since there are no ε-productions, we don't have to examine follow sets. A simple visual inspection shows that all first sets are clearly disjoint. This is a very simple grammar !

The entire parser is found in the class PageParser.java in the molly package web.page. The syntax and start-end tokens can be easily changed and customized to one's individual liking.