
JavaScript Regex newlines ruin bibtex parsing

I am trying to read a bibtex file into my JavaScript script. The Regex used to parse the file is:

/(.*)\s*=\s*[{"'](.*|.*\s+.*|.*\s+.*\s+.*|.*\s+.*\s+.*\s+.*|.*\s+.*\s+.*\s+.*\s+.*)[}"'],?/g

This works as I want it to:

@Article{journals/aim/Sloman99,
  title =   "Review of Affective Computing",
  author =  "Aaron Sloman",
  journal = "AI Magazine",
  year =    "1999",
  number =  "1",
  volume =  "20",
  url = "http://dblp.uni-trier.de/db/journals/aim/aim20.html#Sloman99",
  pages =   "127--133",
}

It gives me nice key/value pairs like: "author : Aaron Sloman".

This doesn't:

@Article{journals/aim/Sloman99,
  title =   "Review of Affective Computing",
  author =  "Aaron
  S
  l
  o
  m
  a
  n",
  journal = "AI Magazine",
  year =    "1999",
  number =  "1",
  volume =  "20",
  url = "http://dblp.uni-trier.de/db/journals/aim/aim20.html#Sloman99",
  pages =   "127--133",
}

It just omits the author.

So how can I make a regex that matches an entry with as many newlines as it contains (not just as many as there are repetitions of ".*\s+") until it encounters a " or a }?


I know people love to use regular expressions to parse markup; it seems to be a fad, like Lady Gaga or Fun Dip. But if you want to parse markup efficiently, you should use a parser or write one.

Why? Regular expressions are meant to parse regular languages. Most markup languages cannot be expressed as an NFA or DFA (BibTeX values, for instance, may contain arbitrarily nested braces), so parsing them with a regex is impossible in the hardest cases and merely slow in the easiest.

There are a couple of great JavaScript BibTeX parsers out there. Here are two:

  • http://sourceforge.net/projects/jsbibtex/
  • http://code.google.com/p/bibtex-js/

I recommend you look at those. I know you have already put work into your regular expression, but I promise your job will become much easier once you take the step to a real parser.
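If you would rather write one yourself, the idea can be sketched in a few dozen lines. This is only a minimal illustration, not the API of either project listed above: the function name parseBibtexEntry and the shape of its return value are made up for this example, it handles a single entry, and it ignores @string macros, # concatenation and comments. It walks the text character by character and tracks brace depth, so a value may span any number of lines.

```javascript
// Hypothetical helper for illustration; not taken from jsbibtex or bibtex-js.
// Parses ONE entry of the form @Type{key, name = "value", name = {value}, }.
function parseBibtexEntry(text) {
  let pos = 0;

  const skipWhitespace = () => {
    while (pos < text.length && /\s/.test(text[pos])) pos++;
  };

  // Consume the expected character (after optional whitespace) or fail loudly.
  const expect = (ch) => {
    skipWhitespace();
    if (text[pos] !== ch) throw new Error(`Expected "${ch}" at offset ${pos}`);
    pos++;
  };

  // Read the longest run of characters that match the given single-char pattern.
  const readWhile = (pattern) => {
    const start = pos;
    while (pos < text.length && pattern.test(text[pos])) pos++;
    return text.slice(start, pos);
  };

  // Read a value delimited by "..." or {...}; inner braces may nest.
  const readValue = () => {
    skipWhitespace();
    const open = text[pos];
    if (open !== '"' && open !== "{") {
      return readWhile(/[^\s,}]/); // bare value such as 1999
    }
    pos++; // consume the opening delimiter
    const start = pos;
    let depth = 0;
    while (pos < text.length) {
      const ch = text[pos];
      if (ch === "{") {
        depth++;
      } else if (ch === "}") {
        if (depth === 0 && open === "{") break; // matching close brace
        if (depth > 0) depth--;
      } else if (ch === '"' && open === '"' && depth === 0) {
        break; // matching close quote
      }
      pos++;
    }
    if (pos >= text.length) throw new Error("Unterminated value");
    const raw = text.slice(start, pos);
    pos++; // consume the closing delimiter
    return raw.replace(/\s+/g, " ").trim(); // collapse newlines and indentation
  };

  expect("@");
  const type = readWhile(/[A-Za-z]/);
  expect("{");
  skipWhitespace();
  const key = readWhile(/[^,\s}]/);
  const fields = {};
  skipWhitespace();
  while (text[pos] === ",") {
    pos++;
    skipWhitespace();
    if (text[pos] === "}") break; // trailing comma before the closing brace
    const name = readWhile(/[A-Za-z0-9_-]/).toLowerCase();
    if (!name) throw new Error(`Expected a field name at offset ${pos}`);
    expect("=");
    fields[name] = readValue();
    skipWhitespace();
  }
  expect("}");
  return { type, key, fields };
}

// Usage with a value broken across several lines, as in the question.
const sample = '@Article{journals/aim/Sloman99,\n  author =  "Aaron\n  S\n  l\n  o\n  m\n  a\n  n",\n  year =    "1999",\n}';
const parsed = parseBibtexEntry(sample);
console.log(parsed.fields.author); // Aaron S l o m a n
```

Unlike the regex, this throws on malformed input instead of silently skipping a field, which makes problems in the .bib file much easier to spot.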

Here is a small example of why your regex is bad and a parser is better.

Strings that your pattern matches include:

;;;)(>$#@ = 'dfsa3 342 '}
((())))+++>$#@ = 'dfsa3@@//''''''''''''
>$#@ = 'dfsa3@@//'''}}}}"""

These are not good!
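You can check this yourself in Node or a browser console. The snippet below is just a quick sketch; the variable names fieldPattern and junkLines are mine, and the pattern is the one from the question, unchanged.

```javascript
// The pattern from the question, copied verbatim.
const fieldPattern = /(.*)\s*=\s*[{"'](.*|.*\s+.*|.*\s+.*\s+.*|.*\s+.*\s+.*\s+.*|.*\s+.*\s+.*\s+.*\s+.*)[}"'],?/g;

// The three junk strings listed above.
const junkLines = [
  ";;;)(>$#@ = 'dfsa3 342 '}",
  "((())))+++>$#@ = 'dfsa3@@//''''''''''''",
  ">$#@ = 'dfsa3@@//'''}}}}\"\"\"",
];

for (const line of junkLines) {
  // With a /g pattern, String.prototype.match returns null when nothing matches.
  const matched = line.match(fieldPattern) !== null;
  console.log(matched, JSON.stringify(line));
}
```

Each line is reported as a match, even though none of them is anything like a BibTeX field.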
