ANTLR parsing MismatchedTokenException

2023-03-13 09:43 问答作者：

I'm trying to write a simple parser for an even simpler language that I'm writing. It's composed of postfix开发者_Python百科 expressions. As of now, I'm having issues with the parser. When I run it on the input 2 2 * test >> I get a MismatchedTokenException.

Also, how would I go about implementing a recursive postfix parser?

Here's my code:

grammar star;

options {
    language=Python;
    output=AST;
    ASTLabelType=CommonTree;
}

tokens {DECL;}
//start
//  :   decl ;
//decl
//  :   type ID -> ^(DECL type ID)
//  ;

program
    :   (body)+
    ;

body    :   (nested WS)*
    |   (var WS)*
    |   (get WS)*
    ;

var
    :   nested ID '>>'
    ;

get
    :   ID '<<'
    ;

//expressions

term
    :   INT
    ;

expr
    :   term (term operator)*
    ;

nested
    :   expr (expr operator)*
    ;

operator 
    :   ('*' | '+' | '/' | '%' | '-')
    ;

ID
    :   ('a'..'z' | 'A'..'Z') ('a..z' | '0'..'9' | 'A'..'Z')*
    ;

INT
    :   '0'..'9'+
    ;

WS
    :   (' ' | '\n' | '\t' | '\r') {$channel=HIDDEN;}
    ;

A couple of things are not correct:

1

You've put the WS token on the HIDDEN channel, which makes them unavailable to parser rules. So all WS tokens inside your body rule are incorrect.

2

_(~~your latest edit removed the left-recursion issue, but I'll still make a point of it~~ sorry, your other question has a left recursive rule (expr), so I'll leave this info in here)_

ANTLR is an LL parser-generator, so you can'r create left-recursive grammars. The following is left recursive:

expr
  :  term term operator
  ;

term
  :  INT
  |  ID
  |  expr
  ;

because the first term inside the expr rule could possible match an expr rule itself. Like any LL parser, ANTLR generated parser cannot cope with left recursion.

3

If you fix the WS issue, your body rule will produce the following error message:

(1/7) Decision can match input such as "INT" using multiple alternatives

This means that the parser cannot "see" to which rule the INT token belongs. This is due to the fact that all your body alternative can be repeated zero or more times and expr and nested are also repeated. And all of them can match an INT, which is what ANTLR is complaining about. If you remove the *'s like this:

body
    :   nested
    |   var
    |   get
    ;

// ...

expr
    :   term (term operator)
    ;

nested
    :   expr (expr operator)
    ;

the errors would disappear (although that would still not cause your input to be parsed properly!).

I realize that this might still sound vague, but it's not trivial to explain (or comprehend if you're new to all this).

4

To properly account for recursive expr inside expr, you'll need to stay clear of left recursion as I explained in #2. You can do that like this:

expr
  :  term (expr operator | term operator)*
  ;

which is still ambiguous, but that is in case of describing a postfix expression using an LL grammar, unavoidable AFAIK. To resolve this, you could enable global backtracking inside the options { ... } section of the grammar:

options {
  language=Python;
  output=AST;
  backtrack=true;
}

Demo

A little demo of how to parse recursive expressions could look like:

grammar star;

options {
  language=Python;
  output=AST;
  backtrack=true;
}

parse
  :  expr EOF -> expr
  ;

expr
  :  (term -> term) ( expr2 operator -> ^(operator $expr expr2) 
                    | term operator  -> ^(operator term term)
                    )*
  ;

expr2 
  :  expr
  ;

term
  :  INT
  |  ID
  ;

operator 
  :  ('*' | '+' | '/' | '%' | '-')
  ;

ID
  :  ('a'..'z' | 'A'..'Z') ('a..z' | '0'..'9' | 'A'..'Z')*
  ;

INT
  :  '0'..'9'+
  ;

WS
  :  (' ' | '\n' | '\t' | '\r') {$channel=HIDDEN;}
  ;

The test script:

#!/usr/bin/env python
import antlr3
from antlr3 import *
from antlr3.tree import *
from starLexer import *
from starParser import *

def print_level_order(tree, indent):
  print '{0}{1}'.format('   '*indent, tree.text)
  for child in tree.getChildren():
    print_level_order(child, indent+1)

input = "5 1 2 + 4 * + 3 -"
char_stream = antlr3.ANTLRStringStream(input)
lexer = starLexer(char_stream)
tokens = antlr3.CommonTokenStream(lexer)
parser = starParser(tokens)
tree = parser.parse().tree 
print_level_order(tree, 0)

produces the following output:

which corresponds to the following AST:

ANTLR parsing MismatchedTokenException

The problem is that your body rule never terminates, because it's allowed to match nothing. I didn't fire up ANTLR, I really don't like to mess with it, instead I rewrote your grammar in C++ (using AXE parser generator), added print statements to trace the matches and got the following result from parsing "2 2 * test >>":

parsed term: 2
parsed expr: 2
parsed nested: 2
parsed term: 2
parsed expr: 2
parsed nested: 2
parsed body: 2 2
parsed body:
parsed body: ... here goes your infinite loop

If you are interested in debugging this test case, the AXE grammar is shown below, set breakpoints at prints to step through the parser:

using namespace axe;
typedef std::string::iterator It;

auto space = r_any(" \t\n\r");
auto int_rule = r_numstr();
auto id = r_ident();
auto op = r_any("*+/%-");
auto term = int_rule 
    >> e_ref([](It i1, It i2)
{ 
    std::cout << "\nparsed term: " << std::string(i1, i2); 
});
auto expr = (term & *(term & op)) 
    >> e_ref([](It i1, It i2)
{ 
    std::cout << "\nparsed expr: " << std::string(i1, i2); 
});
auto nested = (expr & *(expr & op))
    >> e_ref([](It i1, It i2)
{ 
    std::cout << "\nparsed nested: " << std::string(i1, i2); 
});
auto get = (id & "<<")  
    >> e_ref([](It i1, It i2)
{ 
    std::cout << "\nparsed get: " << std::string(i1, i2); 
});
auto var = (nested & id & ">>")  
    >> e_ref([](It i1, It i2)
{ 
    std::cout << "\nparsed var: " << std::string(i1, i2); 
});
auto body = (*(nested & space) | *(var & space) | *(get & space))
     >> e_ref([](It i1, It i2)
{ 
    std::cout << "\nparsed body: " << std::string(i1, i2); 
});
auto program = +(body)
    | r_fail([](It i1, It i2) 
{
    std::cout << "\nparsing failed, parsed portion: " 
        << std::string(i1, i2);
});
// test parser
std::ostringstream text;
text << "2 2 * test >>";
std::string str = text.str();
program(str.begin(), str.end());

继续阅读：antlr antlrworks parser-generator python

ANTLR parsing MismatchedTokenException

1

2

3

4

Demo

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

1

2

3

4

Demo

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集 河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？