Regex for strings in Bibtex
I'm trying to parse BibTeX files using lex/yacc. Strings in the BibTeX database can be surrounded by quotes ("...") or by braces ({...}).
But every entry is also enclosed in braces. How do I differentiate between an entry and a string surrounded by braces?
@Book{sweig42,
Author = { Stefan Sweig },
title = { The impossible book },
publisher = { Dead Poet Society},
year = 1942,
month = mar
}
you have various options:
lexer start conditions (from a Lex tutorial)
building on the ideas from Greg Ward, enhance your lex rules with start conditions ('modes', as they are called in the referenced source).
specifically, you would have the start conditions BASIC, ENTRY and STRING
and the following rules (example taken and slightly enhanced from here):
%START BASIC ENTRY STRING
%%
/* Lexical grammar, mode 1: top-level */
<BASIC>AT @ { BEGIN ENTRY; }
<BASIC>NEWLINE \n
<BASIC>COMMENT \%[^\n]*\n
<BASIC>WHITESPACE [\ \r\t]+
<BASIC>JUNK [^@\n\ \r\t]+
/* Lexical grammar, mode 2: in-entry */
<ENTRY>NEWLINE \n
<ENTRY>COMMENT \%[^\n]*\n
<ENTRY>WHITESPACE [\ \r\t]+
<ENTRY>NUMBER [0-9]+
<ENTRY>NAME [a-z0-9\!\$\&\*\+\-\.\/\:\;\<\>\?\[\]\^\_\`\|]+ { if (stricmp(yytext, "comment")==0) { BEGIN STRING; } }
<ENTRY>LBRACE \{ { if (delim == '\0') { delim='}'; } else { blevel=1; BEGIN STRING; } }
<ENTRY>RBRACE \} { delim='\0'; BEGIN BASIC; }
<ENTRY>LPAREN \( { BEGIN STRING; delim=')'; plevel=1; }
<ENTRY>RPAREN \)
<ENTRY>EQUALS =
<ENTRY>HASH \#
<ENTRY>COMMA ,
<ENTRY>QUOTE \" { BEGIN STRING; blevel=0; plevel=0; }
/* Lexical grammar, mode 3: strings */
<STRING>LBRACE \{ { if (blevel>0) {blevel++;} }
<STRING>RBRACE \} { if (blevel>0) { blevel--; if (blevel == 0) { BEGIN ENTRY; } } }
<STRING>LPAREN \( { if (plevel>0) { plevel++;} }
<STRING>RPAREN \) { if (plevel>0) { plevel--; if (plevel == 0) { BEGIN ENTRY; } } }
<STRING>QUOTE \" { BEGIN ENTRY; }
please note that the rule set is by no means complete but should get you started. more details can be found here.
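To see the mode idea outside of lex, here is a minimal C sketch of the same state machine. The names `scan_bibtex` and `struct counts` are made up for illustration and are not part of lex or btparse. It assumes the first '{' after an '@' opens the entry and every later '{' seen in entry mode opens a string; parenthesised entries, @comment and '%' comments are deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

/* The three modes mirror the start conditions BASIC, ENTRY and STRING. */
enum mode { BASIC, ENTRY, STRING };

/* Hypothetical result type: how many entries and delimited strings were seen. */
struct counts {
    int entries;
    int strings;
};

/* Hypothetical helper: walk a BibTeX buffer and classify each delimiter. */
struct counts scan_bibtex(const char *text)
{
    assert(text != NULL);

    struct counts c = {0, 0};
    enum mode mode = BASIC;
    int entry_open = 0; /* has the entry's own '{' been seen yet? */
    int quoted = 0;     /* is the current string delimited by quotes? */
    int depth = 0;      /* brace nesting inside the current string */

    for (const char *p = text; *p != '\0'; p++) {
        switch (mode) {
        case BASIC:
            /* Everything outside an entry is junk until an '@' shows up. */
            if (*p == '@') {
                mode = ENTRY;
                entry_open = 0;
            }
            break;
        case ENTRY:
            if (*p == '{' && !entry_open) {
                /* First brace after '@': this one delimits the entry. */
                entry_open = 1;
                c.entries++;
            } else if (*p == '{') {
                /* Any later brace in entry mode starts a string. */
                quoted = 0;
                depth = 1;
                c.strings++;
                mode = STRING;
            } else if (*p == '"') {
                quoted = 1;
                depth = 0;
                c.strings++;
                mode = STRING;
            } else if (*p == '}') {
                /* A closing brace in entry mode ends the entry. */
                mode = BASIC;
            }
            break;
        case STRING:
            if (*p == '{') {
                depth++;
            } else if (*p == '}') {
                if (depth > 0)
                    depth--;
                /* A brace string ends when its braces balance. */
                if (!quoted && depth == 0)
                    mode = ENTRY;
            } else if (*p == '"' && quoted && depth == 0) {
                /* A quoted string ends at a quote outside nested braces. */
                mode = ENTRY;
            }
            break;
        }
    }
    return c;
}
```

Run on the entry from the question, this should report one entry and three delimited strings (Author, title, publisher); year and month are not delimited, so they are not counted.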
btparse
These docs explain in a fairly detailed fashion the intricacies of parsing the BibTeX format and come with a Python parser.
biblex
you might also be interested in employing the unix toolchain of biblex and bibparse. these tools generate and parse a bibtex token stream, respectively.
more info can be found here.
best regards, carsten