开发者

Regex for strings in Bibtex

I'm trying to parse Bibtex files using lex/yacc. Strings in the bibtex database can be surrounded by quotes "..." or with braces - {...}

But every entry is also enclosed in braces. How do differentiate between开发者_运维百科 an entry and a string surrounded by braces?

@Book{sweig42,
  Author =   { Stefan Sweig },
  title =    { The impossible book },
  publisher =    { Dead Poet Society},
  year =     1942,
  month =        mar
}


you have various options:

  • lexer start conditions (from a Lex tutorial)

    building on the ideas from greg ward, enhance your lex rules with start conditions ('modes' as they are called in the referenced source).

specifically, you would have the start conditions BASIC ENTRY STRING and the following rules (example taken and slightly enhanced from here):

%START BASIC ENTRY STRING
%%

/* Lexical grammar, mode 1: top-level */
<BASIC>AT           @ { BEGIN ENTRY; }
<BASIC>NEWLINE      \n
<BASIC>COMMENT      \%[^\n]*\n
<BASIC>WHITESPACE.  [\ \r\t]+
<BASIC>JUNK         [^@\n\ \r\t]+

/* Lexical grammar, mode 2: in-entry */
<ENTRY>NEWLINE      \n
<ENTRY>COMMENT      \%[^\n]*\n
<ENTRY>WHITESPACE   [\ \r\t]+
<ENTRY>NUMBER       [0-9]+
<ENTRY>NAME         [a-z0-9\!\$\&\*\+\-\.\/\:\;\<\>\?\[\]\^\_\`\|]+ { if (stricmp(yytext, "comment")==0) { BEGIN STRING; } }
<ENTRY>LBRACE       \{ { if (delim == '\0') { delim='}'; } else { blevel=1; BEGIN STRING; } }
<ENTRY>RBRACE       \} { BEGIN BASIC; }
<ENTRY>LPAREN       \( { BEGIN STRING; delim=')'; plevel=1; }
<ENTRY>RPAREN       \)
<ENTRY>EQUALS       =
<ENTRY>HASH         \#
<ENTRY>COMMA        ,
<ENTRY>QUOTE        \" { BEGIN STRING; bleveL=0; plevel=0; }

/* Lexical grammar, mode 3: strings */
<STRING>LBRACE       \{ { if (blevel>0) {blevel++;} }
<STRING>RBRACE       \} { if (blevel>0) { blevel--; if (blevel == 0) { BEGIN ENTRY; } } }
<STRING>LPAREN       \( { if (plevel>0) { plevel++;} }
<STRING>RPAREN       \} { if (plevel>0) { plevel--; if (plevel == 0) { BEGIN ENTRY; } } }
<STRING>QUOTE        \" { BEGIN ENTRY; }

please note that the rule set is by no means complete but should get you started. more details to be found here.

  • btparse

    These docs explain in a fairly detailed fashion thenintricacies of parsing the bibtex formats and comes with a 'python parser.

  • biblex

    you might also be interested in employing the unix toolchain of biblex and bibparse. these tools generate and parse a bibtex token stream, respectively.

    more info can be found here.

best regards, carsten

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜