How can I read a large XML file in Ruby with libxml-ruby?

2023-02-17 14:42

We've been using libxml-ruby for a couple of years. It is fantastic on files of 30 MB or less, but on anything larger it is PLAGUED by seg faults. Nobody at the project really seems to care to fix them, only to blame them on 3rd party software. That's their prerogative of course, it's free.

Yet I still am unable to read these large files. I suppose I could write some miserable hack to split them into smaller files, but I would like to avoid that. Does anyone else have any experience with reading very large XML files in Ruby?

When loading big files, whether they are XML or not, you should consider processing them a piece at a time (in this case called streaming), rather than loading the entire file into memory.

I would highly suggest reading this article about pull parsers. Using this technique will allow you to read this file with greater ease, rather than loading all of the file at once into memory.
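To make the pull-parsing idea concrete, here is a minimal sketch. It deliberately swaps in REXML's pull parser (bundled with Ruby) instead of libxml-ruby, purely to keep the example dependency-free. The method name `each_text_of` and the element names are made up for illustration.

```ruby
require "rexml/parsers/pullparser"

# Yields the text of each element named `element_name`, one at a time.
# `source` may be a String or an open IO such as a File.
# (Hypothetical helper, not part of REXML.)
def each_text_of(source, element_name)
  parser = REXML::Parsers::PullParser.new(source)
  buffer = nil                       # nil means "not inside a target element"

  while parser.has_next?
    event = parser.pull              # we ask for the next event ourselves

    if event.start_element? && event[0] == element_name
      buffer = +""
    elsif event.text? && buffer
      buffer << event[0]             # raw text as it appears in the file
    elsif event.end_element? && event[0] == element_name
      yield buffer.strip if buffer
      buffer = nil
    end
  end
end
```

With a file on disk (the path here is only a placeholder) this would be called as `File.open("big.xml") { |f| each_text_of(f, "title") { |t| puts t } }`. Only the current element's text is held in memory at any point, which is the whole reason to stream.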

Thanks everyone for your excellent input. I was able to solve my problem by looking at "Processing large XML file with libxml-ruby chunk by chunk".

The answer was to avoid the use of reader.expand, and to instead use reader.read or reader.next in conjunction with reader.node.

As long as you aren't trying to store the node as is, it works great. You want to operate on that node immediately, because reader.next will blow it away.

To respond to an earlier answer, from what I can understand libxml-ruby IS a streaming parser. The seg faults arose from garbage collection issues, which were causing memory leaks galore. Once I learned not to use reader.expand, everything came up roses.

UPDATE:

I was NOT able to solve my problem after all. There appears to be NO WAY to get to the subtree without using reader.expand.

And so I guess there is no way to read and parse a large XML file with libxml-ruby? The reader.expand memory leak bug has been open without even a response since 2009? FAIL FAIL FAIL.

I'd recommend looking into a SAX XML parser. They're designed to handle huge files. I haven't needed one in a while, but they're pretty easy to use: as the parser reads the XML file, it passes your code various events, which you catch and handle.

The Nokogiri site has a link to SAX Machine which is based on Nokogiri, so that would be another option. Either way, Nokogiri is very well supported, and used by a lot of people, including me for all HTML and XML I parse. It supports both DOM and SAX parsing, allows use of CSS and XPath accessors, and uses libxml2 for its parsing, so it's fast and based on a standard parsing library.

libxml-ruby indeed has plenty of bugs, not just crashing bugs, but version incompatibilities, memory leaks, etc...

I highly recommend Nokogiri. The Ruby community has rallied around Nokogiri as the new hotness for fast XML parsing. It has a reader pull parser, a SAX parser, and your standard in-memory DOM-ish parser.

For really large XML files, I'd recommend Reader, because it's as fast as SAX, but is easier to program for, because you don't have to keep track of so much state manually.
