lex (flex) generated program not parsing whole input

2022-12-19 22:09 问答作者：

I have a relatively simple lex/flex file and have been running it with flex's debug flag to make sure it's tokenizing properly. Unfortunately, I'm always running into one of two problems - either the program the flex generates stops just gives up silently after a couple of tokens, or the rule I'm using to recognize characters and strings isn't called and the default rule is called instead.

Can someone point me in the right direction? I've attached my flex file and sample input / output.

Edit: I've found that the generated lexer stops after a specific rule: "cdr". This is more detailed, but also much more confusing. I've posted a shorted modified lex file.

/* lex file*/
%option noyywrap
%option nodefault

%{
       enum tokens{
                CDR,
                CHARACTER,
                SET
        };
%}

%%

"cdr"                                               { return CDR; }
"set"                                               { return SET; }

[ \t\r\n]                                           /*Nothing*/
[a-zA-Z0-9\\!@#$%^&*()\-_+=~`:;"'?<>,\.]      { return CHARACTER; }

%%

Sample input:

set c cdra + cdr b + () ;

Complete output from running the input through the generated parser:

--(end of buffer or a NUL)
--accepting rule at line 16 ("set")
--accepting rule at line 18 (" ")
--accepting r开发者_高级运维ule at line 19 ("c")
--accepting rule at line 18 (" ")
--accepting rule at line 15 ("cdr")

Any thoughts? The generated program is giving up after half of the input! (for reference, I'm doing input by redirecting the contents of a file to the generated program).

When generating a lexer that's standalone (that is, not one with tokens that are defined in bison/yacc, you typically write an enum at the top of the file defining your tokens. However, the main loop of a lex program, including the main loop generated by default, looks something like this:

while( token = yylex() ){
    ...

This is fine, until your lexer matches the rule that appears first in the enum - in this specific case CDR. Since enums by default start at zero, this causes the while loop to end. Renumbering your enum - will solve the issue.

enum tokens{
            CDR = 1,
            CHARACTER,
            SET
    };

Short version: when defining tokens by hand for a lexer, start with 1 not 0.

This rule

[-+]?([0-9*\.?[0-9]+|[0-9]+\.)([Ee][-+]?[0-9]+)? 
          |

seems to be missing a closing bracket just after the first 0-9, I added a | below where I think it should be. I couldn't begin to guess how flex would respond to that.

The rule I usually use for symbol names is [a-zA-Z$_], this is like your unquoted strings except that I usually allow numbers inside symbols as long as the symbol doesn't start with a number.

[a-zA-Z$_]([a-zA-Z$_]|[0-9])*

A characters is just a short symbol. I don't think it needs to have its own rule, but if it does, then you need to insure that the string rule requires at least 2 characters.

[a-zA-Z$_]([a-zA-Z$_]|[0-9])+

继续阅读：flex-lexer lex parsing regex

lex (flex) generated program not parsing whole input

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？