Good grammar for date data type for recursive descent parser LL(1)

2022-12-23 00:32 问答作者：

I'm building a custom expression parser and evaluator for production environment to provide a limited DSL to the users. The parser itself as the DSL, need to be simple. The parser is going to be built in an exotic language that doesn't support dynamic expression parsing nor has any parser generator tools available.

My current decision is to go for recursive descent approach with LL(1) grammar, so that even programmers with no previous experience in evaluating expression could quickly learn how the code works.

It has to handle mixed expressions made up of several data types: decimals, percentages, strings and dates. And dates in the format of dd/mm/yyyy are easy to confuse with a string of division ops.

Is where a good solution to this problem?

My own solution that is aimed at keeping the parser simple and involves prefixing dates with a special symbol, let's say apostrophe:

<date>   ::= <apostr><digit><digit>/&l开发者_C百科t;digit><digit>/<digit><digit><digit><digit>

<apostr> ::= '

<digit>  ::= '0'..'9'

First off, I'm a fan of LL parsers, so I approve of your approach heartily. Note that one of the newer popular parser generators (ANTLR) is LL. If you allow more look-ahead, rather that restricting yourself to LL(1), you can do pretty much anything you'd ever want to do with an LR(1) parser, but the code will be far clearer, more reliable, and easier to debug.

I don't know enough about your overall grammar to be able to tell. It is possible you might be able to design things so that the LL parser can always tell from context if it is an integer expression or a date constant. However, assuming you can't, yeah you'd need some kind of way to tell the difference. The only other thing I can think of would be to use backslash as a separator instead of slash, but that's kinda ugly.

An LL-like lexerless parser with an infinite lookahead is what you need. And, namely, it is PEG.

http://en.wikipedia.org/wiki/Parsing_expression_grammar

With an ordered choice it is quite easy to avoid this date vs. constant literals division confusion.

When a language is intended for human input, defining it is as much a matter of

adding syntax constraints to ensure unambiguous and easy parsing
removing/bending syntax to ensure that the language feels intuitive, "natural", to the intended human audience.

Satisfying the second requirement is much harder than the first one, and requires insight into the

intended use cases of the language
Which type of keyboard/input device is available? Are there some characters among the allowed characters which are hard to produce or to see on the display?
Which tokens / expression will be frequently used, which will be required only occasionally? Are users frequently inputting short, ad-hoc code snippets, or are the programs meant to be reused and modified over long periods
... etc.
background/culture of the intended audience
Which common practices/idioms from other regular (and plain natural) languages can or should be reused if possible?
Should one favor a terse but cryptic style, or a more explicit, but more verbose style?
... etc.

Basically, it is hard to make suggestions about a language syntax, without a good grasp of the intended usage and users.
Nevertheless, I'd like to suggest the following for the date format question:

Use an alternative format for date values altogether; one that would be "natural" enough to users but distinctive enough that a regular grammar can describe.
For example, one that uses a 3 letters abbreviation for the month (downside DSL becomes tied to English or other tongue, but also advantage, the ambiguity for humans about which is day and which is month is removed). Tentatively:

  dd-mmm-yyyy    (may seem unnatural in cultures where the prevailing date order 
                  starts with the month maybe yyyy-mmm-dd then ?)
  mmm-dd-yyyy    (better for the above mentioned cultures)
  ddmmmyyyy      (avoid the dashes, but impose leading zeros)

  MnnDnnYyyyy    (using "M", "D" and "Y" (or others) as explicit prefixes; now, 
                  this is completely culture neutral, but maybe a bit awkward...)

Anyway, just ideas... Applicability will vary with the human/cultural factors mentioned, and with the rest of the syntax. For example the above may imply that variables be explicitly marked (that's one of the reasons many languages use the $ prefix for example), to avoid possible conflict with [odd, but possible] variable identifiers.

In a nutshell the idea is to substitute the need for a special character prefix (which may then collide with the use these characters for mathematical and other expressions), by making the 12 months tag an good enough discriminator for the parser.

继续阅读：dsl expressionengine grammar parsing recursive-descent

Good grammar for date data type for recursive descent parser LL(1)

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？