开发者

Write recursive-descent parsing to parse epsilon(ε) in Java

For example,

EBNF

A ::= B c;

B ::= T1 | T2 | ε

T1 ::= a

T2 ::= b

parseA()
{
switch(currentToken.kind){
case Token.a :
    parseT1();
case Token.b :
    parseT2();
break;

case <epsilon> :
    
break;
default:
    // report error
break;
}
}

How to write the pars开发者_JS百科er parse the epsilon ( set of empty string ) in Java?


Whether it's Java or any language, the point is you have a stream of tokens. You don't "get" a token and decide what to do. Rather you "look at" the next token, and if you can use it, you "accept" it.

See the difference?

Do not "get" and then "decide".
Do "look and decide" and then "accept".
This is the heart of the concept of "lookahead".

So, ParseA calls ParseB.

Parse B looks at the next token in the stream.
If it is "a" it accepts it and returns.
If it is "b" it accepts it and returns.
Otherwise it simply returns to ParseA (without accepting anything).

Then ParseA looks at the next token.
If it is "c" it accepts it and returns.
Otherwise it fails.

Make sense?

To do this, you should add a sentinel token to the end of the stream, that is never accepted. At the end, that should be the only token left in the stream. If not, you've got a syntax error consisting of extra garbage at the end.


epsilon is just a marker for "empty string allowed here", so you don't need to parse anything; the descent is finished. It seems unlikely that you would actually have a token for that though; you need to detect if there's no token available or if the next token can be used in another production.

For example, you might get the input c. You need to realize that this matches the production A ::= B c;, because B can be reduced to epsilon -- there is no epsilon token, you just need to realize that the B rule there is optional and in that case needs to be skipped over to reduce c to A


As it stands, this is a bit on the simplistic side to make much sense as a parser. The whole thing can be written as a simple regular expression: "[ab]?c". If you really insist on writing it as a parser, the code could be something like:

Boolean parseA() { 
    // an A is a B followed by a `c`:
    return parseB() && (get_token() == Token.c);
}

Boolean parseB  {
    // Get a token. 
    //     If it's an `a` or a `b`, consume it. 
    //     Otherwise, throw it back (match null string by consuming nothing).
    // Either way, return true (successful match).
    //
    token current_token = get_token();
    if (token != Token.a && token != Token.b)
        push_back(current_token);
    return true;
}

Edit (in response to comment on another answer): No, when you're matching B, you should not look for a Token.c. As far as B cares, there are three possibilities: match 'a', match 'b', or match nothing at all. It's then up to the part that parses A to check that it has a B followed by a Token.c.

For example, if you were to change the grammar to something like:

A ::= B C

B ::= a | b | ε

C ::= c | d

Since 'B' still has the same definition, you should not have to change it just because some other definition changed. Likewise, you might add to the grammar to allow (for example) an arbitrary string of B's followed by a C.


You 'parse' an epsilon when you see anything in the FOLLOW set of the current non-terminal. So since FOLLOW(B) = { 'c' }, you use case Token.c: in the switch in parseB for it

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜