开发者

Unintentional concatenation in Bison/Yacc grammar

I am experimenting with lex and yacc and have run into a strange issue, but I think it would be best to show you my code before detailing the issue. This is my lexer:

%{
#include <stdlib.h>
#include <string.h>
#include "y.tab.h"
void yyerror(char *);
%}

%%

[a-zA-Z]+ {
  yylval.strV = yytext;
  return ID;
}

[0-9]+      {
  yylval.intV = atoi(yytext);
  return INTEGER;
}

[\n] { return *yytext; }

[ \t]        ;

. yyerror("invalid character");

%%

int yywrap(void) {
  return 1;
}

This is my parser:

%{
#include <stdio.h>

int yydebug=1;
void prompt();
void yyerror(char *);
int yylex(void);
%}

%union {
  int intV;
  char *strV;
}

%token INTEGER ID

%%

program: program statement EOF { prompt(); }
       | program EOF { prompt(); }
       | { prompt(); }
       ;

args: /* empty */
    | args ID { printf(":%s ", $<strV>2); }
    ;

statement: ID args { printf("%s", $<strV>1); }
         | INTEGER { printf("%d", $<intV>1); }
;

EOF: '\n'

%%

void yyerror(char *s) {
  fprintf(stderr, "%s\n", s);
}

void prompt() {
  printf("> ");
}

int main(void) {
  yyparse();
  return 0;
}

A very simple language, consisting of no more than strings and integer and a basic REPL. Now, you'll note in the parser that args are output with a leading colon, the intention being that, when combined with the first pattern of the rule of the statement the interaction with the REPL would look something like this:

> aaa aa a
:aa :a aaa>

However, the interaction is this:

> aaa aa a
:aa :a aaa aa aa
>

Why does the token ID in the following rule

statement: ID args { printf("%s", $<strV>1); }
         | INTEGER { pri开发者_如何学Gontf("%d", $<intV>1); }
;

have the semantic value of the total input string, newline included? How can my grammar be reworked so that the interaction I intended?


You have to preserve token strings as they are read if you want them to remain valid. I modified the statement rule to read:

statement: ID { printf("<%s> ", $<strV>1); } args { printf("%s", $<strV>1); }
         | INTEGER { printf("%d", $<intV>1); }
;

Then, with your input, I get the output:

> aaa aa a
<aaa> :aa :a aaa aa a
>

Note that at the time the initial ID is read, the token is exactly what you expected. But, because you did not preserve the token, the string has been modified by the time you get back to printing it after the args have been parsed.


I think there is an associativity conflict between the args and statement productions. This is borne out by the (partial) output from the bison -v parser.output file:

Nonterminals, with rules where they appear

$accept (6)
    on left: 0
program (7)
    on left: 1 2 3, on right: 0 1 2
statement (8)
    on left: 4 5, on right: 1
args (9)
    on left: 6 7, on right: 4 7
EOF (10)
    on left: 8, on right: 1 2

Indeed, I'm having a hard time trying to figure out what your grammar is trying to accept. As a side note, I'd probably move your EOF production into the lexer as an EOL token; this will make resynchronizing on parse errors easier.

Better explanation of your intent would be helpful.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜