开发者

How do you parse context-sensitive C-code?

One issue I ran into was that C must be context-sensitive and cannot be parsed with one token of lookahead. For example

int main1;
int main() {}

That's the simplest example I can think of in which both a function definition and a variable declaration start with the same token type. You'd have to look all the way ahead to the left paren or semicolon to determine what to parse.

My question is, how is this accomplished? Does the lexical analyzer have some tricks up its sleeve t开发者_Go百科o do the lookahead and emit an invisible token distinguishing the two? Do modern parses have many tokens of lookahead?


You should read up on LR or shift-reduce parsers. They assemble the parse tree bottom-up. In the case of the main function it goes like:

  • shift int into the stack as a TYPE terminal token
  • shift main into the stack as an IDENTIFIER terminal token
  • shift ( into the stack
  • shift ) into the stack
  • remove the ( and ) and replace with an ARGLIST non-terminal token
  • shift { into the stack
  • shift } into the stack
  • remove those and replace with a STMT_BLOCK non-terminal token
  • remove the TYPE, IDENTIFIER, ARGLIST, and STMT_BLOCK tokens, and replace with a FUNCTION_DEF token.

Of course, every time it does a replacement, it builds a new fragment of parse tree and attaches it to the new token. (I made up those token names.)

It works under control of a finite state machine that recognizes the pattern of tokens on the stack and, together with the next (single) token of input, decides whether to shift the next token in, or apply one of the grammar rules to reduce a group of tokens on the stack into a single one. The FSM is built by the parser generator from a list of the grammar rules.

It is called LR because it reads input tokens from the left, but applies grammar rules from the right. It is distinct from LL, or recursive-descent, which applies grammar rules from the left. Pascal is an LL(1) language. C is not LL(1), so requires an LR(1) parser. It allows C, for example, to embed almost anything in nested parentheses without confusing the parser.

I hope that helps you see what is going on.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜