How to parse multiple, consecutive xml files in one document?

I have a big text file that is a sequence of XML-valid documents that looks something like this:

<DOC>
   <TEXT> ... </TEXT>
    ...
</DOC>
<DOC>
    <TEXT> ... </TEXT>
    ...
</DOC>

etc. There is no <?xml version="1.0"?> declaration; each <DOC></DOC> pair delimits a separate XML document. What's the best way to parse this in Java and get the values under <TEXT> in each <DOC>?

If I pass the whole thing to a DocumentBuilder, I get an error saying the document is not well-formed. Is there a better solution than simply traversing the file and building a string for each <DOC>?


A well-formed XML document must have exactly one root element, which contains all other elements. Have a look at the XML Specification (see section 2).

So, to work around the issue, you can read the whole content of your text file into a String (or StringBuffer/StringBuilder) and put that string between <root> and </root> tags, e.g.:

String origXML = readContentFromTextFile(fileName);
String validXML = "<root>" + origXML + "</root>";
//parse validXML
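
The wrap-and-parse idea above could be fleshed out roughly as follows with the JDK's DOM API. This is only a sketch: the class name WrappedDocParser and the method extractTexts are made up for illustration, and it assumes the whole file fits in memory and contains no XML declaration or DOCTYPE (either would be illegal once wrapped inside <root>).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
final class WrappedDocParser {

    /** Wraps the <DOC> sequence in a synthetic <root> and returns the trimmed text of every <TEXT>. */
    static List<String> extractTexts(String origXML) throws Exception {
        // A single root element makes the concatenated documents well-formed.
        String validXML = "<root>" + origXML + "</root>";

        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(validXML)));

        // getElementsByTagName returns all matching descendants in document order.
        NodeList nodes = doc.getElementsByTagName("TEXT");
        List<String> texts = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            texts.add(nodes.item(i).getTextContent().trim());
        }
        return texts;
    }

    public static void main(String[] args) throws Exception {
        String sample = "<DOC><TEXT> first </TEXT></DOC>\n<DOC><TEXT> second </TEXT></DOC>";
        System.out.println(extractTexts(sample)); // prints [first, second]
    }
}
```

For a real file, the string could be read with new String(Files.readAllBytes(Paths.get(fileName)), StandardCharsets.UTF_8), assuming the file is UTF-8 encoded.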


The document is not well-formed because it doesn't have a single 'root' node. With one added, it would look like this:

<ROOT>
    <DOC>
       <TEXT> ... </TEXT>
        ...
    </DOC>
    <DOC>
        <TEXT> ... </TEXT>
        ...
    </DOC>
</ROOT>


You'll have a hard time parsing this as-is with a "standard" XML parser such as Xerces. The missing XML declaration <?xml version="1.0"?> is not the real problem, since the declaration is optional in XML 1.0; the text is not well-formed because it has more than one document root (i.e. the <DOC> elements).

I suggest you give TagSoup a try. It is intended to parse (quote) "poor, nasty and brutish" markup, though it is designed for HTML as found in the wild rather than for arbitrary XML. No guarantee, but that's probably your best shot.


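You can try using XSLT to extract the values. Note that an XSLT processor also expects well-formed input, so the <DOC> sequence still needs to be wrapped in a single root element first.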
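
As a rough illustration of the XSLT route, the sketch below uses the JDK's javax.xml.transform API with a small XSLT 1.0 stylesheet that writes each <TEXT> value on its own line. The class name XsltTextExtractor, the method extractTexts and the stylesheet itself are assumptions made for this example, and the input is wrapped in a <root> element because the transformer needs well-formed XML.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper, not part of any library.
final class XsltTextExtractor {

    // Illustrative stylesheet: emits the whitespace-normalized content of every <TEXT>, one per line.
    private static final String STYLESHEET =
          "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output method=\"text\"/>"
        + "<xsl:template match=\"/\">"
        + "<xsl:for-each select=\"//TEXT\">"
        + "<xsl:value-of select=\"normalize-space(.)\"/><xsl:text>&#10;</xsl:text>"
        + "</xsl:for-each>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    static String extractTexts(String origXML) throws Exception {
        // The transformer requires a single root element, so wrap the <DOC> sequence first.
        String wrapped = "<root>" + origXML + "</root>";

        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));

        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(wrapped)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String sample = "<DOC><TEXT> first </TEXT></DOC><DOC><TEXT> second </TEXT></DOC>";
        // Prints the two TEXT values, one per line.
        System.out.print(extractTexts(sample));
    }
}
```

Compared with walking the DOM by hand, this keeps the selection logic in the stylesheet, at the cost of an extra language to maintain.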


You could create a subclass of InputStream that adds a prefix and a suffix to the input stream, and pass an instance of that class to any XML parser:

import java.io.IOException;
import java.io.InputStream;

public class EnclosedInputStream extends InputStream {
    private enum State {
        PREFIX, STREAM, SUFFIX, EOF
    };

    private final byte[] prefix;
    private final InputStream stream;
    private final byte[] suffix;
    private State state = State.PREFIX;
    private int index;

    EnclosedInputStream(byte [] prefix, InputStream stream, byte[] suffix) {
        this.prefix = prefix;
        this.stream = stream;
        this.suffix = suffix;
    }

    @Override
    public int read() throws IOException {
        if (state == State.PREFIX) {
            if (index < prefix.length) {
                return prefix[index++] & 0xFF;
            }
            state = State.STREAM;
        }
        if (state == State.STREAM) {
            int r = stream.read();
            if (r >= 0) {
                return r;
            }
            state = State.SUFFIX;
            index = 0;
        }
        if (state == State.SUFFIX) {
            if (index < suffix.length) {
                return suffix[index++] & 0xFF;
            }
            state = State.EOF;
        }
        return -1;
    }
}
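
For comparison, the same prefix/stream/suffix concatenation can probably be had without a custom subclass by using java.io.SequenceInputStream from the JDK. The sketch below is an alternative to the class above, not a replacement the original answer proposed; the names SequenceWrap and enclose are hypothetical, and the prefix and suffix are assumed to be plain ASCII so that they are compatible with the file's encoding.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper, not part of any library.
final class SequenceWrap {

    /** Returns a stream that yields prefix, then body, then suffix. */
    static InputStream enclose(String prefix, InputStream body, String suffix) {
        List<InputStream> parts = Arrays.asList(
                new ByteArrayInputStream(prefix.getBytes(StandardCharsets.UTF_8)),
                body,
                new ByteArrayInputStream(suffix.getBytes(StandardCharsets.UTF_8)));
        // SequenceInputStream reads each stream in turn until all are exhausted.
        return new SequenceInputStream(Collections.enumeration(parts));
    }

    public static void main(String[] args) throws Exception {
        byte[] data = "<DOC><TEXT>a</TEXT></DOC><DOC><TEXT>b</TEXT></DOC>"
                .getBytes(StandardCharsets.UTF_8);
        try (InputStream in = enclose("<root>", new ByteArrayInputStream(data), "</root>")) {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
            System.out.println(doc.getElementsByTagName("DOC").getLength()); // prints 2
        }
    }
}
```

Either way the file is streamed rather than copied into a String first, which matters mostly for large inputs when combined with a streaming parser such as SAX.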