High performance XML parsing in C++
Well, a lot of questions have been asked about parsing XML in C++, but instead of a generic problem, mine is very specific.
I am looking for a very efficient XML parser for C++. In particular, I have a very large XML file to parse. My application must open this file and retrieve data from it. It must also insert new nodes and save the final result back to the file.
To do this I initially used rapidxml, but it requires me to open the file, parse all of it (this library has no functions to access the file directly without loading the entire tree first), edit the tree, and then store the final tree by overwriting the file. This consumes too many resources.
Is there an XML parser that does not require me to load the entire file, but that I can use to quickly insert new nodes and retrieve data? Can you suggest solutions to this problem?
You want a streaming XML parser rather than what is called a DOM parser.
There are two types of streaming parsers: pull and push. A pull parser is good for quickly writing XML parsers that load data into program memory. A push parser is good for writing a program to translate one document to another (which is what you are trying to accomplish). I think, therefore, that a push parser would be best for your problem.
In order to use a push parser, you need to write what is essentially an event handler for parsing events. By "parsing event", I mean events like "start tag reached", "end tag reached", "text found", "attribute parsed", etc.
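To make the idea of a stateful event handler concrete, here is a minimal sketch. The `XmlEventHandler` interface, the `pushParse` tokenizer, and the `ElementCounter` handler are all hypothetical names invented for this example; they are not the API of Expat, Xerces-C++, or libxml2, each of which defines its own callback signatures. The toy tokenizer only understands plain start tags, end tags, and text.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <utility>

// Hypothetical handler interface for this sketch only. Real push parsers
// define their own callback signatures.
struct XmlEventHandler {
    virtual ~XmlEventHandler() = default;
    virtual void startElement(std::string_view name) = 0;
    virtual void endElement(std::string_view name) = 0;
    virtual void text(std::string_view data) = 0;
};

// Toy push tokenizer for a tiny XML subset: plain <tag>, </tag> and text.
// It ignores attributes, self-closing tags, comments, CDATA and entities.
void pushParse(std::string_view xml, XmlEventHandler& handler) {
    std::size_t pos = 0;
    while (pos < xml.size()) {
        if (xml[pos] == '<') {
            const std::size_t close = xml.find('>', pos);
            if (close == std::string_view::npos) return;  // malformed: stop
            const std::string_view tag = xml.substr(pos + 1, close - pos - 1);
            if (!tag.empty() && tag.front() == '/') {
                handler.endElement(tag.substr(1));
            } else {
                handler.startElement(tag);
            }
            pos = close + 1;
        } else {
            std::size_t next = xml.find('<', pos);
            if (next == std::string_view::npos) next = xml.size();
            handler.text(xml.substr(pos, next - pos));
            pos = next;
        }
    }
}

// Stateful handler: tracks nesting depth and counts one element name,
// without ever holding a tree in memory.
struct ElementCounter : XmlEventHandler {
    explicit ElementCounter(std::string name) : target(std::move(name)) {}

    void startElement(std::string_view name) override {
        ++depth;
        if (depth > maxDepth) maxDepth = depth;
        if (name == target) ++count;
    }
    void endElement(std::string_view) override {
        assert(depth > 0 && "sketch assumes balanced tags");
        --depth;
    }
    void text(std::string_view) override {}

    std::string target;
    int count = 0;
    int depth = 0;
    int maxDepth = 0;
};
```

Constructing `ElementCounter counter("item")` and calling `pushParse("<root><item>a</item><item>b</item></root>", counter)` leaves `counter.count` at 2. With a real library you would register equivalent callbacks and feed it the file in chunks instead of one string.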
I suggest that as you read in the document, you write out the transformed document to a separate, temporary file. Thus, your XML parsing event handlers will need to be written so that they are stateful and write out the XML of the translated document incrementally.
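The temporary-file approach could look roughly like the sketch below. It is deliberately simplified: instead of driving the copy from parser events, it copies the input line by line and assumes the closing tag of the parent element sits alone on its own line. The function name `insertBeforeClosingTag` and the `.tmp` suffix are assumptions made for this example. Memory use stays bounded by the longest line, not by the size of the document.

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <string>

namespace fs = std::filesystem;

// Streams `path` into a temporary file, writing `newNode` just before the
// first line that equals `closingTag`, then replaces the original file.
// Assumption for this sketch: the closing tag is alone on its line.
bool insertBeforeClosingTag(const fs::path& path,
                            const std::string& closingTag,
                            const std::string& newNode) {
    assert(!closingTag.empty());
    fs::path tmp = path;
    tmp += ".tmp";  // hypothetical naming scheme for the temporary file

    bool inserted = false;
    bool writeOk = false;
    {
        std::ifstream in(path);
        if (!in) return false;
        std::ofstream out(tmp, std::ios::trunc);
        if (!out) return false;

        std::string line;
        while (std::getline(in, line)) {
            if (!inserted && line == closingTag) {
                out << newNode << '\n';
                inserted = true;
            }
            out << line << '\n';
        }
        out.flush();
        writeOk = static_cast<bool>(out);
    }  // both streams are closed here, before the rename

    if (!inserted || !writeOk) {
        fs::remove(tmp);  // leave the original file untouched
        return false;
    }
    fs::rename(tmp, path);  // on POSIX this replaces the target atomically
    return true;
}
```

Because the original is only replaced after the temporary file has been written successfully, a failure halfway through leaves the original document intact.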
Three excellent push parser libraries that can be used from C++ are Expat, Xerces-C++, and libxml2.
Search for "SAX parser". They are mostly tokenizers, i.e. they emit tag by tag without building a tree.
SAX parsers are faster than DOM parsers because a DOM parser builds an in-memory representation of the entire XML document, whereas a SAX parser behaves like an event listener: it reports each element to your code as it reads the file and never builds a tree.
As you mentioned, Xerces is a good C++ SAX parser.
I would recommend looking into ways of breaking the XML document into smaller XML documents as that seems to be part of your problem.
Okay, here is one off the beaten track. I looked at it but haven't really used it myself: it's called asmxml. Its authors claim performance bar none; the downside is that it requires x86 assembler.
If you are really looking for a high-performance XML stream parser, then libhpxml is likely the right thing for you.
I’m convinced that no XML library exists that allows you to modify a file without loading it first. This simply isn’t possible because files don’t work that way: you cannot insert (or remove) data in the middle of a file. You can only overwrite a block with one of identical size, or append at the end. But your request would require inserting or removing data in the middle of the file.
Reading only parts of an XML file may be possible. But writing … no way.
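The point about files can be checked with a few lines of standard C++. The helper names `overwriteAt` and `readAll` are made up for this sketch. Seeking into an existing file and writing there replaces the bytes at that position; nothing after it is shifted, so the file length stays the same.

```cpp
#include <cassert>
#include <fstream>
#include <sstream>
#include <string>

// Writes `bytes` at `offset` inside an existing file. This overwrites the
// bytes already there; it does not push the rest of the file back.
bool overwriteAt(const std::string& path, std::streamoff offset,
                 const std::string& bytes) {
    assert(offset >= 0);
    std::fstream f(path, std::ios::in | std::ios::out | std::ios::binary);
    if (!f) return false;
    f.seekp(offset);
    f.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
    f.flush();
    return static_cast<bool>(f);
}

// Reads the whole file into a string (fine for a small demonstration file).
std::string readAll(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    std::ostringstream buffer;
    buffer << in.rdbuf();
    return buffer.str();
}
```

If a file contains `<a>1</a><b>2</b>` and you call `overwriteAt(path, 8, "<c>")`, the content becomes `<a>1</a><c>2</b>`: the start tag `<b>` has been replaced rather than a new node inserted. A real insertion means rewriting everything after the insertion point.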
Go for template libraries as much as possible, such as Boost::property_tree, Boost::XMLParser, or POCO::XML; Folly also has an XML parser in it.
Avoid old C libraries; they are all based on old code designs.
Someone said the QtXML module performs well on huge XML files.