Where should I draw the line between lexer and parser?
I'm writing a lexer for the IMAP protocol for educational purposes and I'm stumped as to where I should draw the line between lexer and parser. Take this example of an IMAP server response:
* FLAGS (\Answered \Deleted)
This response is defined in the formal syntax like this:
mailbox-data = "FLAGS" SP flag-list
flag-list = "(" [flag *(SP flag)] ")"
flag = "\Answered" / "\Deleted"
Since they are specified as string literals (aka "terminal" tokens) would it be more correct for the lexer to emit a unique token for each, like:
(TknAnsweredFlag)
(TknSpace)
(TknDeletedFlag)
Or would it be just as correct to emit something like this:
(TknBackSlash)
(TknString "Answered")
(TknSpace)
(TknBackSlash)
(TknString "Deleted")
My confusion is that the former method could overcomplicate the lexer: if \Answered had two meanings in two different contexts, the lexer wouldn't know which token to emit. As a contrived example (this situation won't occur because e-mail addresses are enclosed in quotes), how would the lexer deal with an e-mail address like \Answered@googlemail.com? Or is the formal syntax designed so that such an ambiguity can never arise?
As a general rule, you don't want lexical syntax to propagate into the grammar, because it's just detail. For instance, a lexer for a programming language like C would certainly recognize numbers, but it is generally inappropriate to produce HEXNUMBER and DECIMALNUMBER tokens, because that distinction isn't important to the grammar.
I think what you want are the most abstract tokens that still allow your grammar to distinguish the cases of interest for your purpose. You get to weigh the choices you make in one part of the grammar against the confusion they might cause in other parts.
If your goal is simply to read past the flag values, then in fact you don't need to distinguish among them, and a TknFlag with no associated content would be good enough.
If your goal is to process the flag values individually, you need to know whether you got an ANSWERED and/or a DELETED indication. How they are lexically spelled is irrelevant, so I'd go with your TknAnsweredFlag solution. I would drop the TknSpace, because in any sequence of flags there must be intervening spaces (your spec says so), so I'd eliminate them using whatever whitespace-suppression machinery your lexer offers.
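For concreteness, here is a minimal sketch of that approach in Python (the answer is language-agnostic, so treat the token names and the regex-table style as illustrative, not as the one true implementation). Whitespace is recognized but never emitted, so the grammar only ever sees the flag tokens:

import re

# One named group per token; whitespace gets a rule but no token name.
TOKEN_SPEC = [
    ("TknAnsweredFlag", r"\\Answered"),
    ("TknDeletedFlag",  r"\\Deleted"),
    ("TknLParen",       r"\("),
    ("TknRParen",       r"\)"),
    ("TknFlagsKeyword", r"FLAGS"),
    ("TknStar",         r"\*"),
    (None,              r"[ \t]+"),   # whitespace: matched, then suppressed
]
MASTER = re.compile("|".join(f"(?P<g{i}>{pat})" for i, (_, pat) in enumerate(TOKEN_SPEC)))

def tokenize(text):
    # Note: a sketch only; characters that match no rule are silently skipped.
    for m in MASTER.finditer(text):
        name = TOKEN_SPEC[int(m.lastgroup[1:])][0]
        if name is not None:           # the parser never sees whitespace
            yield name

print(list(tokenize(r"* FLAGS (\Answered \Deleted)")))
# -> ['TknStar', 'TknFlagsKeyword', 'TknLParen', 'TknAnsweredFlag', 'TknDeletedFlag', 'TknRParen']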
On occasion, I run into situations where there are dozens of such flag-like things. Then your grammar starts to become cluttered if you have a token for each. If the grammar doesn't need to know specific flags, then you should have a TknFlag with an associated string value. If the grammar needs to distinguish a small subset of the flags but most of them don't matter, then compromise: have individual tokens for the flags that matter to the grammar, and a catch-all TknFlag with an associated string for the rest.
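A sketch of that compromise, again in illustrative Python (the GRAMMAR_FLAGS table and token names are invented for the example; the lexeme is whatever a generic flag rule such as \\[A-Za-z]+ just matched):

# Dedicated tokens only for flags the grammar cares about;
# everything else becomes a catch-all TknFlag carrying its spelling.
GRAMMAR_FLAGS = {r"\Answered": "TknAnsweredFlag", r"\Deleted": "TknDeletedFlag"}

def lex_flag(lexeme):
    name = GRAMMAR_FLAGS.get(lexeme)
    return (name, None) if name else ("TknFlag", lexeme)

print(lex_flag(r"\Answered"))   # ('TknAnsweredFlag', None)
print(lex_flag(r"\Seen"))       # ('TknFlag', '\\Seen')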
Regarding the difficulty of having two different interpretations: this is one of those tradeoffs. If you run into that issue, then your tokens need fine enough detail in both places where they appear in the grammar so that you can discriminate. If "\" is relevant as a token somewhere else in the grammar, you certainly could produce both TknBackSlash and TknAnswered.

However, if the way something is treated in one part of the grammar is different than in another, you can often get around this with a mode-driven lexer. Think of modes as states of a finite state machine, each with an associated (sub)lexer. Transitions between modes are triggered by tokens that act as cues (you must have a FLAGS token; it is precisely such a cue that you are about to pick up flag values). In a mode, you can produce tokens that other modes would not produce; thus in one mode you might produce "\" tokens, but in your flag mode you wouldn't need to. Mode support is pretty common in lexers because this problem is more common than you might expect; see the Flex documentation on start conditions for an example.
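Here is a rough Python sketch of the mode idea, loosely analogous to Flex start conditions (the DEFAULT/FLAGS mode split, rule names, and rule set are invented for the example, and longest-match and keyword-boundary details are glossed over):

import re

# In the default mode a bare "\" is its own token; in flag mode the
# whole "\Answered" / "\Deleted" spelling is a single token.
DEFAULT_RULES = [
    ("TknFlagsKeyword", re.compile(r"FLAGS")),
    ("TknBackSlash",    re.compile(r"\\")),
    ("TknAtom",         re.compile(r"[^\s()\\]+")),
]
FLAG_RULES = [
    ("TknAnsweredFlag", re.compile(r"\\Answered")),
    ("TknDeletedFlag",  re.compile(r"\\Deleted")),
]

def tokenize(text):
    mode = "DEFAULT"
    pos = 0
    while pos < len(text):
        ch = text[pos]
        if ch in " \t":                  # whitespace suppressed in every mode
            pos += 1
            continue
        if ch == "(":
            yield "TknLParen"
            pos += 1
            continue
        if ch == ")":
            yield "TknRParen"
            mode = "DEFAULT"             # leaving the flag list ends flag mode
            pos += 1
            continue
        rules = FLAG_RULES if mode == "FLAGS" else DEFAULT_RULES
        for name, rx in rules:
            m = rx.match(text, pos)
            if m:
                yield name
                if name == "TknFlagsKeyword":
                    mode = "FLAGS"       # the cue: the next things are flags
                pos = m.end()
                break
        else:
            raise SyntaxError(f"unexpected input at {pos}: {text[pos]!r}")

print(list(tokenize(r"* FLAGS (\Answered \Deleted)")))
# -> ['TknAtom', 'TknFlagsKeyword', 'TknLParen', 'TknAnsweredFlag', 'TknDeletedFlag', 'TknRParen']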
The fact that you are asking the question shows you are on the right track to making a good choice. You need to balance the maintainability goal of minimizing tokens (technically you could parse using a token for every ASCII character!) against the fundamental requirement to discriminate well enough for your needs. After you've built a dozen grammars this tradeoff will seem easy, but I think the rules of thumb I've provided are pretty good.
I'd come up with the CFG first; whatever terminals it needs to do its job are what the lexer should recognize. Otherwise you are just guessing at the proper way to tokenize the string.
I'd recommend avoiding a separate lexer and parser: modern parsing approaches (such as PEGs) let you mix lexing and parsing, so you won't need tokens at all.
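To illustrate the scannerless idea, here is a toy hand-rolled PEG-style recursive-descent parser in Python that works directly on the input string for the flag-list rule, so there is no token stream at all (a real PEG library would look similar in spirit; error handling is minimal and only the two flags from the question are supported):

class Parser:
    def __init__(self, text):
        self.text, self.pos = text, 0

    def literal(self, s):
        # Try to consume the exact string s at the current position.
        if self.text.startswith(s, self.pos):
            self.pos += len(s)
            return True
        return False

    def flag(self):
        # flag = "\Answered" / "\Deleted"
        for name in (r"\Answered", r"\Deleted"):
            if self.literal(name):
                return name
        return None

    def flag_list(self):
        # flag-list = "(" [flag *(SP flag)] ")"
        if not self.literal("("):
            raise SyntaxError("expected '('")
        flags = []
        f = self.flag()
        while f:
            flags.append(f)
            if not self.literal(" "):
                break
            f = self.flag()
        if not self.literal(")"):
            raise SyntaxError("expected ')'")
        return flags

    def mailbox_data(self):
        # mailbox-data = "FLAGS" SP flag-list
        if not (self.literal("FLAGS") and self.literal(" ")):
            raise SyntaxError("expected 'FLAGS '")
        return self.flag_list()

print(Parser(r"FLAGS (\Answered \Deleted)").mailbox_data())
# -> ['\\Answered', '\\Deleted']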