Big problem with regular expression in Lex (lexical analyzer)
I have some content like this:
author = "Marjan Mernik and Viljem Zumer",
title = "Implementation of multiple attribute grammar inheritance in the tool LISA",
year = 1999
author = "Manfred Broy and Martin Wirsing",
title = "Generalized
Heterogeneous Algebras and
Partial Interpretations",
year = 1983
author = "Ikuo Nakata and Masataka Sassa",
title = "L-Attributed LL(1)-Grammars are
LR-Attributed",
journal = "Information Processing Letters"
I need to capture everything between the double quotes of each title. My first try was this:
^(" "|\t)+"title"" "*=" "*"\"".+"\","
This catches the first example, but not the other two. Their titles span multiple lines, and that's the problem. I thought about adding \n somewhere to allow multiple lines, like this:
^(" "|\t)+"title"" "*=" "*"\""(.|\n)+"\","
But this doesn't help; instead, it catches everything.
Then I thought: what I want is between double quotes, so what if I match everything until I find another " followed by , ? That way I would know whether I was at the end of the title, no matter the number of lines, like this:
^(" "|\t)+"title"" "*=" "*"\""[^"\""]+","
But this has another problem. The examples above don't show it, but a double quote (") can appear inside the title. For instance:
title = "aaaaaaa \"X bbbbbb",
And yes, it will always be preceded by a backslash (\).
Any suggestions to fix this regexp?
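The classic regex for matching a double-quoted string with backslash escapes is:
\"([^\"\\]|\\.)*\"
The character class excludes the backslash, so a backslash can only be consumed together with the character it escapes. Negated character classes in lex also match newlines, so multi-line titles are covered.
In your case, you'll want something like this:
"title"\ *=\ *\"([^\"\\]|\\.)*\"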
PS: IMHO, you're putting too many quotes in your regexes, it's hard to read.
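To make the logic of that pattern easier to follow, here is a minimal hand-written C sketch of what \"([^"\\]|\\.)*\" accepts. The function name quoted_len is hypothetical and is not part of lex or flex; it only mirrors the regex, including the detail that . in lex never matches a newline, so a backslash directly followed by a newline is rejected.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not a lex/flex API.
 * Returns the length of the quoted string that starts at s, counting both
 * quotes, or 0 if s does not begin with a complete quoted string.
 * Mirrors the pattern  \"([^"\\]|\\.)*\"  */
size_t quoted_len(const char *s)
{
    assert(s != NULL);
    if (s[0] != '"')
        return 0;                    /* must start with an opening quote */

    size_t i = 1;
    while (s[i] != '\0') {
        if (s[i] == '\\') {
            /* \\.  : a backslash plus any one character except newline */
            if (s[i + 1] == '\0' || s[i + 1] == '\n')
                return 0;
            i += 2;
        } else if (s[i] == '"') {
            return i + 1;            /* unescaped quote closes the string */
        } else {
            i++;                     /* [^"\\] : ordinary char, newline included */
        }
    }
    return 0;                        /* input ended before the closing quote */
}
```

Called on text beginning at the opening quote of "aaaaaaa \"X bbbbbb", it returns 20, the length up to and including the closing quote; the escaped quote in the middle does not end the string.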
You could use start conditions to simplify each separate pattern, for example:
%x title
%%
"title"\ *=\ *\" { /* mark title start */
    BEGIN(title);
    fputs("found title = <|", yyout);
}
<title>[^"\\]+ { /* process title part (+ rather than * so the rule never matches an empty string); use ([^"\\]|\\.)* to grab all at once */
    ECHO;
}
<title>\\. { /* process escapes inside title */
    char c = yytext[1];
    fputc(c, yyout); /* double escaped characters */
    fputc(c, yyout);
}
<title>\" { /* mark end of title */
    fputs("|>", yyout);
    BEGIN(INITIAL); /* continue as usual */
}
To make an executable:
$ flex parse_ini.y
$ gcc -o parse_ini lex.yy.c -lfl
Run it:
$ ./parse_ini < input.txt
Where input.txt is:
author = "Marjan\" Mernik and Viljem Zumer",
title = "Imp\"lementation of multiple...",
year = 1999
Output:
author = "Marjan\" Mernik and Viljem Zumer",
found title = <|Imp""lementation of multiple...|>,
year = 1999
It replaced the '"' around the title with '<|' and '|>'. Also, '\"' is replaced by '""' inside the title.