
Trying to parse multiple, possibly incomplete XML fragments from a buffer with Nokogiri

I am receiving XML-formatted messages over a TCP socket and trying to parse them with Nokogiri. If I could rely on the buffer holding a single, complete root element, everything would be straightforward.

Trivial example:

<doc><a>some long text ....</a><b>more text</b></doc>

=> #<Nokogiri::XML::Document:0x1326a30 name="document" children=[#<Nokogiri::XML::Element:0x1325fcc name="doc" children=[#<Nokogiri::XML::Element:0x1325aa4 name="a" children=[#<Nokogiri::XML::Text:0x13255f4 "some long text ....">]>, #<Nokogiri::XML::Element:0x1324f3c name="b" children=[#<Nokogiri::XML::Text:0x1324b68 "more text">]>]>]>

Everything works as expected.

Long messages may be split across packets, leaving the buffer holding an incomplete tag:

<doc><a>exceptionally long text ....

=> #<Nokogiri::XML::Document:0x12c45ec name="document" children=[#<Nokogiri::XML::Element:0x12c2968 name="doc" children=[#<Nokogiri::XML::Element:0x12c210c name="a" children=[#<Nokogiri::XML::Text:0x12c1cc0 "exceptionally long text">]>]>]>

Still as expected: Nokogiri reports Nokogiri::XML::SyntaxError: Premature end of data in tag doc line 1, so we can wait for more data to arrive in the buffer.

However, short messages may be clustered within a single packet and arrive at once:

<doc><a>text</a></doc><doc><a>other text</a></doc>

=> #<Nokogiri::XML::Document:0x1312cd8 name="document" children=[#<Nokogiri::XML::Element:0x1312814 name="doc" children=[#<Nokogiri::XML::Element:0x1312594 name="a" children=[#<Nokogiri::XML::Text:0x1312288 "text">]>]>]>

The second message is not parsed, and Nokogiri reports Nokogiri::XML::SyntaxError: Extra content at the end of the document.

I can't see any way to get Nokogiri to hand back the extra content so that I can continue parsing it. This may be a limitation of the underlying libxml2 or of Nokogiri's interface to the library. String#scan doesn't give me string indexes (which I would need to split the messages and preserve the extra text), and Regexp#match won't match globally. Any ideas on how best to extract all of the complete messages from my buffer and leave the trailing incomplete one in place?


Nokogiri expects an IO stream or a string. From the docs for Nokogiri::HTML::Document.parse and Nokogiri::XML::Document.parse:

parse(string_or_io, url = nil, encoding = nil, options = XML::ParseOptions::DEFAULT_HTML)

Parse HTML. thing may be a String, or any object that responds to read and close such as an IO, or StringIO.

"thing" should actually be "string_or_io", to match their example, but you get the idea.

If you can add more information about how you're retrieving the content and parsing it we might be able to give more help.
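
Since parse takes one complete document per string, one option is to cut the buffer into complete messages first and hand each one to Nokogiri separately. Below is a minimal sketch using StringScanner from Ruby's standard library. The helper name split_messages is hypothetical, and the sketch assumes every message uses the same root tag (doc, as in the question) and that the closing tag never appears nested or inside CDATA or comments.

```ruby
require "strscan"

# Hypothetical helper: splits a receive buffer into complete messages plus
# whatever incomplete text trails them.
# Assumption: each message ends at the first occurrence of close_tag.
def split_messages(buffer, close_tag = "</doc>")
  scanner = StringScanner.new(buffer)
  closing = /#{Regexp.escape(close_tag)}/
  messages = []

  # scan_until returns the text from the current position through the end of
  # the next match and advances the scanner; it returns nil (without moving)
  # when the closing tag is not found.
  while (message = scanner.scan_until(closing))
    messages << message
  end

  # scanner.rest is the unconsumed tail: the incomplete message, if any.
  [messages, scanner.rest]
end

buffer = "<doc><a>text</a></doc><doc><a>other text</a></doc><doc><a>exceptionally long"
complete, remainder = split_messages(buffer)
complete.each { |xml| puts xml }
puts "kept for later: #{remainder}"
```

Each string in complete can then be passed to Nokogiri::XML::Document.parse, and remainder becomes the start of the buffer for the next read.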


You may want to try Nokogiri::XML::SAX::PushParser to accomplish this.

See http://www.rubydoc.info/github/sparklemotion/nokogiri/Nokogiri/XML/SAX/PushParser
