开发者

Interactive Antlr

I'm trying to write a simple interactive (using System.in as source) language using antlr, and I have a few problems with it. The examples I've found on the web are all using a per line cycle, e.g.:

while(readline)
  result = parse(line)
  doStuff(result)

But what if I'm writing something like pascal/smtp/etc, with a "first line" looks like X requirment? I know 开发者_运维百科it can be checked in doStuff, but I think logically it is part of the syntax.

Or what if a command is split into multiple lines? I can try

while(readline)
  lines.add(line)
  try
    result = parse(lines)
    lines = []
    doStuff(result)
  catch
    nop

But with this I'm also hiding real errors.

Or I could reparse all lines everytime, but:

  1. it will be slow
  2. there are instructions I don't want to run twice

Can this be done with ANTLR, or if not, with something else?


Dutow wrote:

Or I could reparse all lines everytime, but:

it will be slow there are instructions I don't want to run twice Can this be done with ANTLR, or if not, with something else?

Yes, ANTLR can do this. Perhaps not out of the box, but with a bit of custom code, it sure is possible. You also don't need to re-parse the entire token stream for it.

Let's say you want to parse a very simple language line by line that where each line is either a program declaration, or a uses declaration, or a statement.

It should always start with a program declaration, followed by zero or more uses declarations followed by zero or more statements. uses declarations cannot come after statements and there can't be more than one program declaration.

For simplicity, a statement is just a simple assignment: a = 4 or b = a.

An ANTLR grammar for such a language could look like this:

grammar REPL;

parse
  :  programDeclaration EOF
  |  usesDeclaration EOF
  |  statement EOF
  ;

programDeclaration
  :  PROGRAM ID
  ;

usesDeclaration
  :  USES idList
  ;

statement
  :  ID '=' (INT | ID)
  ;

idList
  :  ID (',' ID)*
  ;

PROGRAM : 'program';
USES    : 'uses';
ID      : ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A'..'Z' | '_' | '0'..'9')*;
INT     : '0'..'9'+;
SPACE   : (' ' | '\t' | '\r' | '\n') {skip();};

But, we'll need to add a couple of checks of course. Also, by default, a parser takes a token stream in its constructor, but since we're planning to trickle tokens in the parser line-by-line, we'll need to create a new constructor in our parser. You can add custom members in your lexer or parser classes by putting them in a @parser::members { ... } or @lexer::members { ... } section respectively. We'll also add a couple of boolean flags to keep track whether the program declaration has happened already and if uses declarations are allowed. Finally, we'll add a process(String source) method which, for each new line, creates a lexer which gets fed to the parser.

All of that would look like:

@parser::members {

  boolean programDeclDone;
  boolean usesDeclAllowed;

  public REPLParser() {
    super(null);
    programDeclDone = false;
    usesDeclAllowed = true;
  }

  public void process(String source) throws Exception {
    ANTLRStringStream in = new ANTLRStringStream(source);
    REPLLexer lexer = new REPLLexer(in);
    CommonTokenStream tokens = new CommonTokenStream(lexer);
    super.setTokenStream(tokens);
    this.parse(); // the entry point of our parser
  } 
}

Now inside our grammar, we're going to check through a couple of gated semantic predicates if we're parsing declarations in the correct order. And after parsing a certain declaration, or statement, we'll want to flip certain boolean flags to allow- or disallow declaration from then on. The flipping of these boolean flags is done through each rule's @after { ... } section that gets executed (not surprisingly) after the tokens from that parser rule are matched.

Your final grammar file now looks like this (including some System.out.println's for debugging purposes):

grammar REPL;

@parser::members {

  boolean programDeclDone;
  boolean usesDeclAllowed;

  public REPLParser() {
    super(null);
    programDeclDone = false;
    usesDeclAllowed = true;
  }

  public void process(String source) throws Exception {
    ANTLRStringStream in = new ANTLRStringStream(source);
    REPLLexer lexer = new REPLLexer(in);
    CommonTokenStream tokens = new CommonTokenStream(lexer);
    super.setTokenStream(tokens);
    this.parse();
  } 
}

parse
  :  programDeclaration EOF
  |  {programDeclDone}? (usesDeclaration | statement) EOF
  ;

programDeclaration
@after{
  programDeclDone = true;
}
  :  {!programDeclDone}? PROGRAM ID {System.out.println("\t\t\t program <- " + $ID.text);}
  ;

usesDeclaration
  :  {usesDeclAllowed}? USES idList {System.out.println("\t\t\t uses <- " + $idList.text);}
  ;

statement
@after{
  usesDeclAllowed = false; 
}
  :  left=ID '=' right=(INT | ID) {System.out.println("\t\t\t " + $left.text + " <- " + $right.text);}
  ;

idList
  :  ID (',' ID)*
  ;

PROGRAM : 'program';
USES    : 'uses';
ID      : ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A'..'Z' | '_' | '0'..'9')*;
INT     : '0'..'9'+;
SPACE   : (' ' | '\t' | '\r' | '\n') {skip();};

which can be tested wit the following class:

import org.antlr.runtime.*;
import java.util.Scanner;

public class Main {
    public static void main(String[] args) throws Exception {
        Scanner keyboard = new Scanner(System.in);
        REPLParser parser = new REPLParser();
        while(true) {
            System.out.print("\n> ");
            String input = keyboard.nextLine();
            if(input.equals("quit")) {
                break;
            }
            parser.process(input);
        }
        System.out.println("\nBye!");
    }
}

To run this test class, do the following:

# generate a lexer and parser:
java -cp antlr-3.2.jar org.antlr.Tool REPL.g

# compile all .java source files:
javac -cp antlr-3.2.jar *.java

# run the main class on Windows:
java -cp .;antlr-3.2.jar Main 
# or on Linux/Mac:
java -cp .:antlr-3.2.jar Main

As you can see, you can only declare a program once:

> program A
                         program <- A

> program B
line 1:0 rule programDeclaration failed predicate: {!programDeclDone}?

uses cannot come after statements:

> program X
                         program <- X

> uses a,b,c
                         uses <- a,b,c

> a = 666
                         a <- 666

> uses d,e
line 1:0 rule usesDeclaration failed predicate: {usesDeclAllowed}?

and you must start with a program declaration:

> uses foo
line 1:0 rule parse failed predicate: {programDeclDone}?


Here's an example of how to parse input from System.in without first manually parsing it one line at a time and without making major compromises in the grammar. I'm using ANTLR 3.4. ANTLR 4 may have addressed this problem already. I'm still using ANTLR 3, though, and maybe someone else with this problem still is too.

Before getting into the solution, here are the hurdles I ran into that keeps this seemingly trivial problem from being easy to solve:

  • The built-in ANTLR classes that derive from CharStream consume the entire stream of data up-front. Obviously an interactive mode (or any other indeterminate-length stream source) can't provide all the data.
  • The built-in BufferedTokenStream and derived class(es) will not end on a skipped or off-channel token. In an interactive setting, this means that the current statement can't end (and therefore can't execute) until the first token of the next statement or EOF has been consumed when using one of these classes.
  • The end of the statement itself may be indeterminate until the next statement begins.

Consider a simple example:

statement: 'verb' 'noun' ('and' 'noun')*
         ;
WS: //etc...

Interactively parsing a single statement (and only a single statement) isn't possible. Either the next statement has to be started (that is, hitting "verb" in the input), or the grammar has to be modified to mark the end of the statement, e.g. with a ';'.

  • I haven't found a way to manage a multi-channel lexer with my solution. It doesn't hurt me since I can replace my $channel = HIDDEN with skip(), but it's still a limitation worth mentioning.
  • A grammar may need a new rule to simplify interactive parsing.

For example, my grammar's normal entry point is this rule:

script    
    : statement* EOF -> ^(STMTS statement*) 
    ;

My interactive session can't start at the script rule because it won't end until EOF. But it can't start at statement either because STMTS might be used by my tree parser.

So I introduced the following rule specifically for an interactive session:

interactive
    : statement -> ^(STMTS statement)
    ;

In my case, there are no "first line" rules, so I can't say how easy or hard it would be to do something similar for them. It may be a matter of making a rule like so and execute it at the beginning of the interactive session:

interactive_start
    : first_line
    ;
  • The code behind a grammar (e.g., code that tracks symbols) may have been written under the assumption that the lifespan of the input and the lifespan of the parser object would effectively be the same. For my solution, that assumption doesn't hold. The parser gets replaced after each statement, so the new parser must be able to pick up the symbol tracking (or whatever) where the last one left off. This is a typical separation-of-concerns problem so I don't think there's much else to say about it.

The first problem mentioned, the limitations of the built-in CharStream classes, was my only major hang-up. ANTLRStringStream has all the workings that I need, so I derived my own CharStream class off of it. The base class's data member is assumed to have all the past characters read, so I needed to override all the methods that access it. Then I changed the direct read to a call to (new method) dataAt to manage reading from the stream. That's basically all there is to this. Please note that the code here may have unnoticed problems and does no real error handling.

public class MyInputStream extends ANTLRStringStream {
    private InputStream in;

    public MyInputStream(InputStream in) {
        super(new char[0], 0);
        this.in = in;
    }

    @Override
    // copied almost verbatim from ANTLRStringStream
    public void consume() {
        if (p < n) {
            charPositionInLine++;
            if (dataAt(p) == '\n') {
                line++;
                charPositionInLine = 0;
            }
            p++;
        }
    }

    @Override
    // copied almost verbatim from ANTLRStringStream
    public int LA(int i) {
        if (i == 0) {
            return 0; // undefined
        }
        if (i < 0) {
            i++; // e.g., translate LA(-1) to use offset i=0; then data[p+0-1]
            if ((p + i - 1) < 0) {
                return CharStream.EOF; // invalid; no char before first char
            }
        }

        // Read ahead
        return dataAt(p + i - 1);
    }

    @Override
    public String substring(int start, int stop) {
        if (stop >= n) {
            //Read ahead.
            dataAt(stop);
        }
        return new String(data, start, stop - start + 1);
    }

    private int dataAt(int i) {
        ensureRead(i);

        if (i < n) {
            return data[i];
        } else {
            // Nothing to read at that point.
            return CharStream.EOF;
        }
    }

    private void ensureRead(int i) {
        if (i < n) {
            // The data has been read.
            return;
        }

        int distance = i - n + 1;

        ensureCapacity(n + distance);

        // Crude way to copy from the byte stream into the char array.
        for (int pos = 0; pos < distance; ++pos) {
            int read;
            try {
                read = in.read();
            } catch (IOException e) {
                // TODO handle this better.
                throw new RuntimeException(e);
            }

            if (read < 0) {
                break;
            } else {
                data[n++] = (char) read;
            }
        }
    }

    private void ensureCapacity(int capacity) {
        if (capacity > n) {
            char[] newData = new char[capacity];
            System.arraycopy(data, 0, newData, 0, n);
            data = newData;
        }
    }
}

Launching an interactive session is similar to the boilerplate parsing code, except that UnbufferedTokenStream is used and the parsing takes place in a loop:

    MyLexer lex = new MyLexer(new MyInputStream(System.in));
    TokenStream tokens = new UnbufferedTokenStream(lex);

    //Handle "first line" parser rule(s) here.

    while (true) {
        MyParser parser = new MyParser(tokens);
        //Set up the parser here.

        MyParser.interactive_return r = parser.interactive();

        //Do something with the return value.
        //Break on some meaningful condition.
    }

Still with me? Okay, well that's it. :)


If you are using System.in as source, which is an input stream, why not just have ANTLR tokenize the input stream as it is read and then parse the tokens?


You have to put it in doStuff....

For instance, if you're declaring a function, the parse would return a function right? without body, so, that's fine, because the body will come later. You'd do what most REPL do.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜