How to use yylval with strings in yacc

2022-12-13 03:23 Q&A author:

I want to pass the actual string of a token. If I have a token called ID, then I want my yacc file to know the actual text that ID matched. I think I have to pass a string from the flex file to the yacc file using yylval. How do I do that?

The key to returning a string or any other complex type via yylval is the YYSTYPE union that yacc writes to the y.tab.h file. YYSTYPE is a union with one member for each type of value a token can carry. For example, to return the string associated with a SYMBOL token, declare the union with %union in the yacc source file:

/*** Yacc's YYSTYPE Union ***/

/* The yacc parser maintains a stack (array) of token values while
   it is parsing.  This union defines all the possible values tokens
   may have.  Yacc creates a typedef of YYSTYPE for this union.  The
   value type of each token or nonterminal (see the declarations
   below) is named by a field of this union.  The global variable
   yylval, which lex uses to return token values, is declared as a
   YYSTYPE union.
 */

    %union {
        long int4;              /* Constant integer value */
        float fp;               /* Constant floating point value */
        char *str;              /* Ptr to constant string (strings are malloc'd) */
        exprT expr;             /* Expression -  constant or address */
        operatorT *operatorP;   /* Pointer to run-time expression operator */
    };

    %token <str> SYMBOL     /* SYMBOL tokens carry their text in the str member */

Then in the lex source file there is a pattern that matches the SYMBOL token. The code attached to that rule is responsible for returning the actual string that the SYMBOL represents. You can't just pass a pointer to the yytext buffer, because the scanner reuses that buffer for every token it matches. To return the matched text, copy yytext to the heap with strdup() (spelled _strdup() on Microsoft compilers, as in the code below) and pass a pointer to the copy via yylval.str. The yacc rule that matches the SYMBOL token is then responsible for freeing the heap-allocated string when it is done with it.

[A-Za-z_][A-Za-z0-9_]*  {{
    int i;

    /*
    * condition letter followed by zero or more letters
    * digits or underscores
    *      Convert matched text to uppercase
    *      Search keyword table
    *      if found
    *          return <keyword>
    *      endif
    * 
    *      set lexical value string to matched text
    *      return <SYMBOL>
    */

    /*** KEYWORDS and SYMBOLS ***/
    /* Here we match a keyword or SYMBOL as a letter or
    * underscore followed by zero or more letters, digits
    * or underscores.
    */

    /* Convert the matched input text to uppercase */
    _strupr(yytext);         /* Convert to uppercase */

    /* First we search the keyword table */
    for (i = 0; i<NITEMS(keytable); i++) {
        if (strcmp(keytable[i].name, yytext)==0)
            return (keytable[i].token);
    }

    /* Return a SYMBOL since we did not match a keyword */
    yylval.str=_strdup(yytext);
    return (SYMBOL);
}}
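
To see why the copy matters, here is a minimal C sketch that runs without lex. The names match_buf, fake_match and save_match are hypothetical stand-ins invented for this illustration: match_buf plays the role of yytext, and save_match does what strdup(yytext) does in the rule above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the scanner's match buffer (the role yytext plays). */
char match_buf[64];

/* Hypothetical helper: pretend the scanner just matched `text`,
   overwriting whatever the buffer held before. */
void fake_match(const char *text)
{
    assert(text != NULL && strlen(text) < sizeof match_buf);
    strcpy(match_buf, text);
}

/* Copy the current match to the heap, as strdup(yytext) would.
   The caller owns the result and must free() it. Returns NULL on failure. */
char *save_match(void)
{
    size_t n = strlen(match_buf) + 1;
    char *copy = malloc(n);
    if (copy != NULL)
        memcpy(copy, match_buf, n);
    return copy;
}
```

If a rule kept a pointer to match_buf itself, that pointer would show the text of whatever token was matched last. The pointer returned by save_match keeps the text it was given until the caller frees it.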

See the Flex manual section on Interfacing with YACC.

15 Interfacing with Yacc

One of the main uses of flex is as a companion to the yacc parser-generator. yacc parsers expect to call a routine named yylex() to find the next input token. The routine is supposed to return the type of the next token as well as putting any associated value in the global yylval. To use flex with yacc, one specifies the `-d' option to yacc to instruct it to generate the file y.tab.h containing definitions of all the %tokens appearing in the yacc input. This file is then included in the flex scanner. For example, if one of the tokens is TOK_NUMBER, part of the scanner might look like:
     %{
     #include "y.tab.h"
     %}

     %%

     [0-9]+        yylval = atoi( yytext ); return TOK_NUMBER;

Setting up the context

Syntax analysis (checking whether an input text follows a specified grammar) consists of two phases:

1. tokenizing, which is done by tools like lex or flex through the interface yylex(), and
2. parsing the stream of tokens generated in step 1 (as per a user-specified grammar), which is done by tools like bison or yacc through the interface yyparse().

During phase 1, given an input stream, each call to yylex() identifies one token (a character string), and yytext points to the first character of that string. For example, with the input stream "int x = 10;" and lex rules that tokenize the C language, the first 5 calls to yylex() identify the 5 tokens "int", "x", "=", "10" and ";", and each time yytext points to the first character of the token just matched.

In phase 2, the parser (which you referred to as yacc) is a program that calls this yylex() function each time it needs a token and checks whether the resulting sequence of tokens matches the rules of a grammar. Each call to yylex() returns its token as an integer code. In the previous example, the first 5 calls to yylex() may return the following codes to the parser: TYPE, ID, EQ_OPERATOR, INTEGER and a code for the semicolon (the actual integer values are defined in a header file).

Now all the parser can see is those integer codes, which are not always enough. In the running example you may want to associate TYPE with int, ID with a symbol table pointer, and INTEGER with decimal 10. To make that possible, each token returned by yylex() is associated with another VALUE, whose default type is int but for which you may define a custom type. In the lex environment this VALUE is accessed as yylval.

Continuing the running example, the lex source may have the following rule to identify 10:

[0-9]+   {  yylval.intval = atoi(yytext); return INTEGER; }

and the following to identify x:

[a-zA-Z][a-zA-Z0-9]*   {yylval.sym_tab_ptr = SYM_TABLE(yytext); return ID;}

Note that here I have defined the type of VALUE (that is, of yylval) as a union containing an int (intval) and an int* pointer (sym_tab_ptr).
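
The running example can be tried without lex or yacc. The following is only a sketch under stated assumptions: the union and yylex() are written by hand, the token code values are illustrative, scan_string() and the one-entry symbol table are hypothetical, and returning a single character such as ';' as its own character code is a common convention rather than something the rules above require.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Token codes. yacc -d would normally generate these in y.tab.h;
   the numeric values here are illustrative only. */
enum { TYPE = 258, ID, EQ_OPERATOR, INTEGER };

/* Hand-written stand-in for the union yacc builds from %union. */
typedef union {
    int  intval;        /* value of an INTEGER token */
    int *sym_tab_ptr;   /* symbol table slot of an ID token */
} YYSTYPE;

YYSTYPE yylval;                  /* value of the most recently returned token */

static const char *cursor = "";  /* current read position (hypothetical) */
static int one_slot;             /* hypothetical one-entry symbol table */

/* Hypothetical helper: choose the text to tokenize. */
void scan_string(const char *input)
{
    assert(input != NULL);
    cursor = input;
}

/* Returns the next token code, or 0 at end of input. */
int yylex(void)
{
    while (isspace((unsigned char)*cursor))
        cursor++;
    if (*cursor == '\0')
        return 0;

    if (isdigit((unsigned char)*cursor)) {
        char *end;
        yylval.intval = (int)strtol(cursor, &end, 10);
        cursor = end;
        return INTEGER;
    }
    if (isalpha((unsigned char)*cursor)) {
        const char *start = cursor;
        while (isalnum((unsigned char)*cursor))
            cursor++;
        if (cursor - start == 3 && strncmp(start, "int", 3) == 0)
            return TYPE;
        yylval.sym_tab_ptr = &one_slot;  /* a real scanner would look the name up */
        return ID;
    }
    if (*cursor == '=') {
        cursor++;
        return EQ_OPERATOR;
    }
    return (unsigned char)*cursor++;     /* any other character stands for itself */
}
```

After scan_string("int x = 10;"), successive calls to yylex() return TYPE, ID, EQ_OPERATOR, INTEGER, ';' and then 0. yylval.intval holds 10 right after the INTEGER call, and yylval.sym_tab_ptr is set right after the ID call.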

In the yacc world, however, this VALUE is accessed as $n, where n is the position of the symbol on the right-hand side of the rule. For example, consider the following yacc rule, which recognizes a specific assignment statement:

assignment : TYPE ID '=' INTEGER
        { /* In this action, $2 is the symbol table pointer associated with ID and $4 is the decimal value 10. */ }
    ;
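
One way to picture $n is as an index into the parser's value stack. The sketch below is a hand-written simplification, not yacc's generated code: shift_value, dollar and reduce_assignment are hypothetical names, and the step where a real parser pushes the rule's own result ($$) is left out. It handles a four-symbol rule such as the one above.

```c
#include <assert.h>

/* Hand-written stand-in for the %union from the running example. */
typedef union {
    int  intval;
    int *sym_tab_ptr;
} YYSTYPE;

#define STACK_MAX 16

static YYSTYPE value_stack[STACK_MAX]; /* hypothetical parser value stack */
static int     depth;                  /* number of values currently stacked */

/* Called when the parser shifts a token: stack the yylval that came with it. */
void shift_value(YYSTYPE v)
{
    assert(depth < STACK_MAX);
    value_stack[depth++] = v;
}

/* $n for a rule whose right-hand side has rhs_len symbols (n is 1-based). */
YYSTYPE *dollar(int n, int rhs_len)
{
    assert(n >= 1 && n <= rhs_len && rhs_len <= depth);
    return &value_stack[depth - rhs_len + (n - 1)];
}

/* Action for a four-symbol assignment rule: store $4 through the
   pointer in $2, then pop the four values off the stack. */
void reduce_assignment(void)
{
    *dollar(2, 4)->sym_tab_ptr = dollar(4, 4)->intval;
    depth -= 4;
}
```

Shifting the four values for "int x = 10" and then calling reduce_assignment() writes 10 into the symbol table slot that arrived with ID, which is the same thing the action does when it reads $2 and $4.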

Answering your question

If you want to access the yytext value of a certain token (which belongs to the lex world) in the yacc world, use that old friend VALUE as follows:

1. Augment the union type of VALUE with another field, say char* lex_token_str.
2. In the lex rule, do yylval.lex_token_str = strdup(yytext).
3. Then in the yacc world, access it using the appropriate $n.

In case you want to access more than a single value of a token (for example, for the token ID the parser may want both the name and the symbol table pointer), augment the union type of VALUE with a structure member containing a char* (for the name) and an int* (for the symbol table pointer).
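
A minimal C sketch of that last variant follows. The names struct id_value, set_id_value and release_id_value are hypothetical and were chosen only for this illustration. The union is written by hand, where a real grammar would declare the same members inside %union.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical value for an ID token: its spelling plus its symbol table slot. */
struct id_value {
    char *name;         /* heap copy of the matched text; the parser frees it */
    int  *sym_tab_ptr;  /* where the identifier's value lives */
};

/* Hand-written stand-in for the augmented %union. */
typedef union {
    int             intval;
    char           *lex_token_str;
    struct id_value id;
} YYSTYPE;

/* What the lex action for ID would do; `text` plays the role of yytext.
   Returns 0 on success, -1 if the copy could not be allocated. */
int set_id_value(YYSTYPE *val, const char *text, int *slot)
{
    assert(val != NULL && text != NULL);
    size_t n = strlen(text) + 1;
    char *copy = malloc(n);
    if (copy == NULL)
        return -1;
    memcpy(copy, text, n);
    val->id.name = copy;
    val->id.sym_tab_ptr = slot;
    return 0;
}

/* What the yacc action would do once it no longer needs the name. */
void release_id_value(YYSTYPE *val)
{
    free(val->id.name);
    val->id.name = NULL;
}
```

In a grammar, the action would then read $n.id.name and $n.id.sym_tab_ptr for the ID in question and free the name when it is finished with it.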
