SaxParser replacing text while downloading?

I have a Java SAX parser that downloads and parses a document using parse(new InputSource(conn.getInputStream())). Unfortunately, it sometimes fails while downloading one site's XML with the error "XML or text declaration not at start of entity". Apparently the XML is malformed: the XML declaration has to come first, but this site sends the DOCTYPE before it:

<!DOCTYPE ... stuff here ...>
<?xml  ... stuff here ...?>

Unfortunately, there doesn't seem to be any way to ignore this error. I suppose I could download the entire XML, fix it with a regex or something, and then parse it, but that would seem to lose the benefit of parsing as it's downloading. Is there a way to replace the text while it's being parsed?


Easy solution: read the first line from the stream yourself, which consumes those bytes, and then pass the same stream to the parser. This assumes the misplaced DOCTYPE occupies exactly the first line.
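A minimal sketch of the easy solution, assuming the misplaced DOCTYPE sits alone on the first line and the document does not rely on entities declared in it. The class and method names (SkipFirstLineDemo, skipFirstLine, parseSkippingFirstLine) are made up for illustration, and the in-memory sample stands in for conn.getInputStream(). The line is skipped byte by byte rather than through a BufferedReader, because a reader would buffer ahead and swallow bytes the parser still needs.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of any library.
class SkipFirstLineDemo {

    /** Consumes bytes up to and including the first line feed, or until end of stream. */
    static void skipFirstLine(InputStream in) throws IOException {
        int b;
        do {
            b = in.read();
        } while (b != -1 && b != '\n');
    }

    /** Drops the first line, then hands the same (still streaming) input to SAX. */
    static void parseSkippingFirstLine(InputStream in, DefaultHandler handler) throws Exception {
        skipFirstLine(in);
        SAXParserFactory.newInstance().newSAXParser().parse(new InputSource(in), handler);
    }

    public static void main(String[] args) throws Exception {
        // Sample input with the DOCTYPE wrongly placed before the XML declaration.
        String xml = "<!DOCTYPE root>\n<?xml version=\"1.0\"?>\n<root>hi</root>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        parseSkippingFirstLine(in, new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                System.out.println("element: " + qName);
            }
        });
    }
}
```

In the question's setup the call would presumably become parseSkippingFirstLine(conn.getInputStream(), handler). Note that this throws the DOCTYPE away entirely, so it only suits documents that parse fine without it.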

Proper Java solution: create an intermediate stream interface that wraps any kind of stream and offers a SAX parser compatible stream in return. Then create a class implementing that interface specifically for your case.

That way, you can detect the problematic header before it ever reaches the SAX parser.
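One way the wrapper idea might look, as a rough sketch rather than a definitive implementation. Instead of a custom interface it uses the standard SequenceInputStream: it reads a small head of the stream, moves a misplaced XML declaration to the front, and chains the repaired head to the rest of the stream, so the body is still parsed while it downloads. PrologFixer, wrap and maxHead are hypothetical names. The sketch assumes an ASCII-compatible encoding, that the declaration falls within the first maxHead bytes, and that the text "<?xml " does not appear earlier for some other reason.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of any library.
class PrologFixer {

    /**
     * Returns a stream with the same content as 'in', except that an XML
     * declaration found after other content within the first maxHead bytes
     * is moved to the very start.
     */
    static InputStream wrap(InputStream in, int maxHead) throws IOException {
        // Read up to maxHead bytes; a single read() call may return fewer.
        byte[] buf = new byte[maxHead];
        int n = 0;
        while (n < maxHead) {
            int r = in.read(buf, n, maxHead - n);
            if (r == -1) break;
            n += r;
        }

        // ISO-8859-1 maps every byte to one char and back, so the bytes survive intact.
        String head = new String(buf, 0, n, StandardCharsets.ISO_8859_1);
        int declStart = head.indexOf("<?xml ");
        int declEnd = declStart < 0 ? -1 : head.indexOf("?>", declStart);

        // Only rewrite when a complete declaration exists and is not already first.
        if (declStart > 0 && declEnd > 0) {
            String decl = head.substring(declStart, declEnd + 2);
            head = decl + head.substring(0, declStart) + head.substring(declEnd + 2);
        }

        byte[] fixed = head.getBytes(StandardCharsets.ISO_8859_1);
        // Repaired head first, then whatever is left of the original stream.
        return new SequenceInputStream(new ByteArrayInputStream(fixed), in);
    }
}
```

With this in place the call from the question would presumably become parse(new InputSource(PrologFixer.wrap(conn.getInputStream(), 1024))), where 1024 is an arbitrary guess at how far into the document the declaration can appear. Unlike skipping the first line, this keeps the DOCTYPE and leaves well-formed documents untouched.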

Edit: I would just use the Apache commons XML parser, or a DOM parser instead of SAX. Also, unless your XML is really long, there's not much difference in parsing it during or after the download.


Have a look at jsoup. It can deal with malformed XML.
