
Parsing Very Large XML file with Ruby on Rails (1.4GB) -- Is there a better way than SAXParser?

Currently, I'm using LIBXML::SAXParser::Callbacks to parse a large XML file containing data for 140,000 products. I'm using a rake task to import the data for these products into my Rails app.

My last import took just under 10 hours to complete:

rake asi:import_products --trace  26815.23s user 1393.03s system 80% cpu 9:47:34.09 total

The problem with the current implementation is that the complex dependency structure in the XML means I need to keep track of the entire product node to know how to parse it properly.

Ideally, I'd like a way to process each product node by itself and still have the ability to use XPath on it; the file size prevents us from using any method that requires loading the entire XML file into memory. I cannot control the format or size of the original XML. I have at most 3GB of memory I can use for the process.
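
Something like the sketch below is the kind of per-node handling I have in mind (Nokogiri's XML::Reader is used purely for illustration, and <product>, <name> and <price> are placeholder element names, not the real schema):

    require 'nokogiri'

    # Stream the file so only the current node is held in memory.
    reader = Nokogiri::XML::Reader(File.open('products.xml'))

    reader.each do |node|
      next unless node.name == 'product' &&
                  node.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT

      # Re-parse just this product's XML so full XPath is available on it.
      product = Nokogiri::XML(node.outer_xml).at_xpath('/product')

      name  = product.at_xpath('./name/text()').to_s
      price = product.at_xpath('./price/text()').to_s
      # ...per-product import logic, e.g. Product.create!(name: name, price: price)
    end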

Is there a better way than this?

Current Rake Task code:

Snippet of the XML file:


Can you fetch the whole file first? If so, I'd suggest splitting the XML file into smaller chunks (say, 512 MB or so) so you can parse several chunks at the same time (one per core), since I assume you have a modern multi-core CPU. As for the chunks being invalid or malformed XML, just append or prepend the missing tags with simple string manipulation.
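
Roughly like this sketch (the filename, the chunk count, a bare <products> root, <product> records, and line breaks between elements are all assumptions about your data):

    require 'nokogiri'

    INPUT      = 'products.xml'              # assumed filename
    CHUNKS     = 4                           # roughly one per core
    chunk_size = File.size(INPUT) / CHUNKS

    # 1. Split: stream line by line, closing a chunk only after a </product>
    #    tag so no record is cut in half, and wrap each chunk in a root
    #    element so it stays well-formed.
    chunk_paths = []
    out         = nil
    written     = 0

    File.foreach(INPUT) do |line|
      next if line =~ /<\?xml|<\/?products>/  # drop declaration and original root
      if out.nil?
        chunk_paths << "chunk_#{chunk_paths.size}.xml"
        out     = File.open(chunk_paths.last, 'w')
        written = 0
        out.puts '<products>'                 # prepend the missing open tag
      end
      out.puts line
      written += line.bytesize
      if written >= chunk_size && line.include?('</product>')
        out.puts '</products>'                # append the missing close tag
        out.close
        out = nil
      end
    end
    if out
      out.puts '</products>'
      out.close
    end

    # 2. Parse the chunks simultaneously, one forked process per chunk.
    pids = chunk_paths.map do |path|
      fork do
        Nokogiri::XML(File.read(path)).xpath('//product').each do |product|
          # import logic for one product goes here
        end
      end
    end
    pids.each { |pid| Process.wait(pid) }

Note that each forked worker builds a full DOM for its chunk, so the chunk size has to be small enough that a parsed chunk fits inside your 3GB budget, and fork requires a POSIX platform (MRI on Linux/macOS).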

You can also try profiling your callback method. It's a big chunk of code, and I'm pretty sure there's at least one bottleneck in there whose removal could save you a few minutes.
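
For example, wrapping the import in ruby-prof (assuming the gem is installed; ProductImporter is just a stand-in for your actual task code) will show where the time goes:

    require 'ruby-prof'

    result = RubyProf.profile do
      # run the existing import here, e.g. the body of the rake task
      # ProductImporter.new('products.xml').run
    end

    # Print methods consuming more than 1% of total runtime.
    RubyProf::FlatPrinter.new(result).print(STDOUT, min_percent: 1)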
