Which module should I use to parse mediawiki text into a Perl data structure?
I just need to parse the wikitext into Perl arrays of hashes. I found several modules. Text::MediawikiFormat seems close to what I need, but it returns HTML, and I want a Perl data structure. I also looked at:
Parse::MediaWikiDump
Text::WikiText
Convert::Wiki
I wrote some code to do this a few years back, but it never got released, because parsing MediaWiki wikitext semantically is basically impossible. The problem is that MediaWiki allows you to freely intermingle wikitext constructs with HTML constructs, and the official parser works by progressively transforming the wikitext into HTML (mostly via a horrifically complex set of regular-expression substitutions).
Basically, it's my opinion that MediaWiki wikitext is unsuitable for any purpose other than being translated into HTML. If you want to extract anything from it, you're probably best off using code that translates it to HTML and then parsing that HTML.
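For what it's worth, here is a minimal sketch of that two-step approach. It assumes Text::MediawikiFormat and HTML::TreeBuilder are installed from CPAN; the sample wikitext and the structure extracted are illustrative, and the exact HTML produced depends on module versions:

```perl
#!/usr/bin/perl
use strict;
use warnings;

use Text::MediawikiFormat;   # wikitext -> HTML (assumed installed from CPAN)
use HTML::TreeBuilder;       # HTML -> tree we can walk

my $wikitext = "== Heading ==\n* one\n* two\n";

# Step 1: let an existing formatter do the wikitext-to-HTML translation.
my $html = Text::MediawikiFormat::format($wikitext);

# Step 2: parse the HTML and pull out a Perl array of hashes.
my $tree  = HTML::TreeBuilder->new_from_content($html);
my @items = map { { text => $_->as_text } } $tree->find_by_tag_name('li');
$tree->delete;               # HTML::TreeBuilder trees must be freed explicitly

print $_->{text}, "\n" for @items;
```

The point is that the hard, ambiguous part (interpreting wikitext) is delegated to code that already mimics MediaWiki's own transformation, and you only ever parse the comparatively well-defined HTML.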
Postscript: Parse::MediaWikiDump is an excellent module by a good friend of mine, but it doesn't actually parse wikitext at all; it reads Wikimedia dump files and extracts things like page text and titles, revision information, and the categories and links databases. It can give you the wikitext for a page, but it doesn't turn that wikitext into anything else.
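To make the distinction concrete, a typical Parse::MediaWikiDump loop looks roughly like this. The dump filename is a placeholder, and the method names (`next`, `title`, `text`) follow my reading of the module's documented interface:

```perl
#!/usr/bin/perl
use strict;
use warnings;

use Parse::MediaWikiDump;    # assumed installed from CPAN

# 'pages-articles.xml' is a hypothetical local copy of a Wikimedia dump.
my $pages = Parse::MediaWikiDump::Pages->new('pages-articles.xml');

while (defined(my $page = $pages->next)) {
    print $page->title, "\n";
    my $text = $page->text;  # reference to the page's raw wikitext
    # $$text is still unparsed wikitext -- the module hands it to you as-is.
}
```

Notice that what comes out is the raw wikitext string, which brings you right back to the original parsing problem.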