Is there a solution to parse a Wikipedia XML dump file in Java?
I am trying to parse this huge 25GB-plus Wikipedia XML file. Any solution that will help would be appreciated, preferably one in Java.
A Java API to parse Wikipedia XML dumps: WikiXMLJ (last updated in November 2010).
There is also a live, Maven-compatible mirror with some bug fixes.
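From memory of the project's examples, usage looks roughly like the sketch below; treat the class and method names (WikiXMLParserFactory, PageCallbackHandler, WikiPage) as assumptions to verify against the mirror's javadoc, and the filename as a placeholder:

    import edu.jhu.nlp.wikipedia.PageCallbackHandler;
    import edu.jhu.nlp.wikipedia.WikiPage;
    import edu.jhu.nlp.wikipedia.WikiXMLParser;
    import edu.jhu.nlp.wikipedia.WikiXMLParserFactory;

    public class DumpWalker {
        public static void main(String[] args) throws Exception {
            // Assumed API (check the javadoc): the factory returns a streaming, SAX-backed parser,
            // so the 25GB file is never loaded into memory at once.
            WikiXMLParser parser =
                    WikiXMLParserFactory.getSAXParser("enwiki-latest-pages-articles.xml");
            parser.setPageCallback(new PageCallbackHandler() {
                public void process(WikiPage page) {
                    // Called once per <page> element; put your own processing here.
                    System.out.println(page.getTitle());
                }
            });
            parser.parse();
        }
    }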
Of course it's possible to parse huge XML files with Java, but you should use the right kind of XML parser - for example a SAX parser, which processes the data element by element, rather than a DOM parser, which tries to load the whole document into memory.
It's impossible to give you a complete solution because your question is very general - what exactly do you want to do with the data?
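As an illustration only, here is a minimal SAX sketch using nothing but the JDK that does no more than print page titles (element names follow the MediaWiki export schema; the filename is a placeholder and the handler is where your own processing would go):

    import java.io.File;
    import javax.xml.parsers.SAXParser;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.Attributes;
    import org.xml.sax.helpers.DefaultHandler;

    public class TitleHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private boolean inTitle = false;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attrs) {
            if ("title".equals(qName)) {
                inTitle = true;
                text.setLength(0);      // start collecting a new title
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (inTitle) text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("title".equals(qName)) {
                inTitle = false;
                System.out.println(text);   // one element at a time, constant memory
            }
        }

        public static void main(String[] args) throws Exception {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new File("enwiki-latest-pages-articles.xml"), new TitleHandler());
        }
    }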
Here is an active Java project that may be used to parse Wikipedia XML dump files:
http://code.google.com/p/gwtwiki/. There are many examples of Java programs that transform Wikipedia XML content into HTML, PDF, text, ...: http://code.google.com/p/gwtwiki/wiki/MediaWikiDumpSupport
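For the markup-to-HTML step, the gwtwiki (bliki) engine exposes a WikiModel class; a minimal sketch, assuming the info.bliki artifact is on the classpath (the URL templates and the sample wiki text are placeholders):

    import info.bliki.wiki.model.WikiModel;

    public class RenderExample {
        public static void main(String[] args) {
            // The two templates tell the model how to build image and page links.
            WikiModel model = new WikiModel("https://en.wikipedia.org/wiki/${image}",
                                            "https://en.wikipedia.org/wiki/${title}");
            String html = model.render("This is '''bold''' text with a [[Main Page|link]].");
            System.out.println(html);
        }
    }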
Massi
Yep, right - do not use DOM. If you only want to read a small amount of data and store it in your own POJO, you can also use an XSLT transformation: transform the data into an XML format that is then converted to a POJO using Castor/JAXB (XML-to-object libraries).
Please share how you solve the problem so others can take a better approach.
Thanks.
--- Edit ---
Check the links below for a better comparison between the different parsers. StAX seems to be the better fit here because it gives you control over the parser: you pull data from the parser only when you need it.
http://java.sun.com/webservices/docs/1.6/tutorial/doc/SJSXP2.html
http://tutorials.jenkov.com/java-xml/sax-vs-stax.html
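To connect the two ideas above (mapping to a POJO and pulling with StAX), one pattern is to advance a StAX reader to each <page> element and hand only that subtree to JAXB. A rough sketch, where Page is a hypothetical JAXB-annotated POJO (it would need @XmlRootElement(name = "page") and possibly the dump's namespace declared) and the filename is a placeholder:

    import java.io.FileInputStream;
    import javax.xml.bind.JAXBContext;
    import javax.xml.bind.Unmarshaller;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamReader;

    public class StaxJaxbExample {
        public static void main(String[] args) throws Exception {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            XMLStreamReader reader = factory.createXMLStreamReader(
                    new FileInputStream("enwiki-latest-pages-articles.xml"));
            // Page is a hypothetical POJO mapping the dump's <page> element.
            Unmarshaller unmarshaller = JAXBContext.newInstance(Page.class).createUnmarshaller();

            while (reader.hasNext()) {
                // Pull events until the next <page> start tag, then unmarshal only that subtree.
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "page".equals(reader.getLocalName())) {
                    Page page = unmarshaller.unmarshal(reader, Page.class).getValue();
                    System.out.println(page.getTitle());   // the rest of the dump stays on disk
                }
            }
            reader.close();
        }
    }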
If you don't intend to write or change anything in that XML, consider using SAX. It keeps one node in memory at a time (unlike DOM, which tries to build the whole tree in memory).
I would go with StAX, as it provides more flexibility than SAX (which is also a good option).
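For comparison, a bare StAX cursor loop using only the JDK; this sketch just streams past the dump and prints each <title> (the filename is a placeholder):

    import java.io.FileInputStream;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamReader;

    public class StaxTitles {
        public static void main(String[] args) throws Exception {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new FileInputStream("enwiki-latest-pages-articles.xml"));
            while (reader.hasNext()) {
                // You pull events only when you ask for them, so skipping uninteresting parts is cheap.
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "title".equals(reader.getLocalName())) {
                    System.out.println(reader.getElementText());
                }
            }
            reader.close();
        }
    }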
I had this problem some days ago, and I found out that the wiki parser provided by https://github.com/Stratio/wikipedia-parser does the job. It streams the XML file and reads it in chunks, which you can then capture in callbacks.
This is a snippet of how I used it in Scala:
import java.io.{BufferedInputStream, FileInputStream}
import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream
// XMLDumpParser, RevisionCallback and Revision come from the wikipedia-parser library

val parser = new XMLDumpParser(new BZip2CompressorInputStream(new BufferedInputStream(new FileInputStream(pathToWikipediaDump)), true))
parser.getContentHandler.setRevisionCallback(new RevisionCallback {
  override def callback(revision: Revision): Unit = {
    val page = revision.getPage
    val title = page.getTitle
    val articleText = revision.getText()
    println(articleText)
  }
})
It streams the Wikipedia dump, parses it, and every time it finds a revision (article) it gets the revision's title and text and prints the article's text. :)
--- Edit ---
Currently I am working on https://github.com/idio/wiki2vec, which I think does part of the pipeline you might need. Feel free to take a look at the code.