Good grammar for date data type for recursive descent parser LL(1)
I'm building a custom expression parser and evaluator for production environment to provide a limited DSL to the users. The parser itself as the DSL, need to be simple. The parser is going to be built in an exotic language that doesn't support dynamic expression parsing nor has any parser generator tools available.
My current decision is to go for recursive descent approach with LL(1) grammar, so that even programmers with no previous experience in evaluating expression could quickly learn how the code works.
It has to handle mixed expressions made up of several data types: decimals, percentages, strings and dates. And dates in the format of dd/mm/yyyy are easy to confuse with a string of division ops.
Is where a good solution to this problem?
My own solution that is aimed at keeping the parser simple and involves prefixing dates with a special symbol, let's say apostrophe:
<date> ::= <apostr><digit><digit>/&l开发者_C百科t;digit><digit>/<digit><digit><digit><digit>
<apostr> ::= '
<digit> ::= '0'..'9'
First off, I'm a fan of LL parsers, so I approve of your approach heartily. Note that one of the newer popular parser generators (ANTLR) is LL. If you allow more look-ahead, rather that restricting yourself to LL(1), you can do pretty much anything you'd ever want to do with an LR(1) parser, but the code will be far clearer, more reliable, and easier to debug.
I don't know enough about your overall grammar to be able to tell. It is possible you might be able to design things so that the LL parser can always tell from context if it is an integer expression or a date constant. However, assuming you can't, yeah you'd need some kind of way to tell the difference. The only other thing I can think of would be to use backslash as a separator instead of slash, but that's kinda ugly.
An LL-like lexerless parser with an infinite lookahead is what you need. And, namely, it is PEG.
http://en.wikipedia.org/wiki/Parsing_expression_grammar
With an ordered choice it is quite easy to avoid this date vs. constant literals division confusion.
When a language is intended for human input, defining it is as much a matter of
- adding syntax constraints to ensure unambiguous and easy parsing
- removing/bending syntax to ensure that the language feels intuitive, "natural", to the intended human audience.
Satisfying the second requirement is much harder than the first one, and requires insight into the
- intended use cases of the language
Which type of keyboard/input device is available? Are there some characters among the allowed characters which are hard to produce or to see on the display?
Which tokens / expression will be frequently used, which will be required only occasionally? Are users frequently inputting short, ad-hoc code snippets, or are the programs meant to be reused and modified over long periods
... etc. - background/culture of the intended audience
Which common practices/idioms from other regular (and plain natural) languages can or should be reused if possible?
Should one favor a terse but cryptic style, or a more explicit, but more verbose style?
... etc.
Basically, it is hard to make suggestions about a language syntax, without a good grasp of the intended usage and users.
Nevertheless, I'd like to suggest the following for the date format question:
Use an alternative format for date values altogether; one that would be "natural" enough to users but distinctive enough that a regular grammar can describe.
For example, one that uses a 3 letters abbreviation for the month (downside DSL becomes tied to English or other tongue, but also advantage, the ambiguity for humans about which is day and which is month is removed). Tentatively:
dd-mmm-yyyy (may seem unnatural in cultures where the prevailing date order
starts with the month maybe yyyy-mmm-dd then ?)
mmm-dd-yyyy (better for the above mentioned cultures)
ddmmmyyyy (avoid the dashes, but impose leading zeros)
MnnDnnYyyyy (using "M", "D" and "Y" (or others) as explicit prefixes; now,
this is completely culture neutral, but maybe a bit awkward...)
Anyway, just ideas... Applicability will vary with the human/cultural factors mentioned, and with the rest of the syntax. For example the above may imply that variables be explicitly marked (that's one of the reasons many languages use the $ prefix for example), to avoid possible conflict with [odd, but possible] variable identifiers.
In a nutshell the idea is to substitute the need for a special character prefix (which may then collide with the use these characters for mathematical and other expressions), by making the 12 months tag an good enough discriminator for the parser.
精彩评论