Which XML parser to use here?

2023-03-27 20:19 Asked by:

I am receiving an XML file as input; its size can vary from a few KB to much more. I am getting this file over a network. I need to extract only a small number of nodes, so most of the document is useless to me. I have no memory constraints; I just need speed.

Considering all this, I concluded:

Not DOM (the document may be huge, I have no CRUD requirement, and the source is a network stream).
Not SAX, as I only need a small subset of the data.
StAX could be the way to go, but I am not sure it is the fastest.
JAXB came up as another option, but what sort of parser does it use? I read that it uses Xerces by default (which type is that: push or pull?), although I can configure it to use StAX or Woodstox as per this link.

I have been reading a lot and am still confused by so many options. Any help would be appreciated.

Thanks!

Edit: I want to add one more question here: what is wrong with using JAXB here?

The fastest option is most likely a StAX parser, especially since you only need a specific subset of the XML file. With StAX you pull events and can simply skip whatever you don't need, whereas a SAX parser pushes every event to your handler whether you want it or not.

It is, however, a little more complicated to use than SAX or DOM. I recently had to write a StAX parser for the following XML:

<?xml version="1.0"?>
<table>
    <row>
        <column>1</column>
        <column>Nome</column>
        <column>Sobrenome</column>
        <column>email@gmail.com</column>
        <column></column>
        <column>2011-06-22 03:02:14.915</column>
        <column>2011-06-22 03:02:25.953</column>
        <column></column>
        <column></column>
    </row>
</table>

Here's what the final parser code looks like:

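// Imports were not shown in the original; the Commons packages below are an
// assumption (Commons IO and Commons Lang 3). Inscrito is the author's own bean.
import java.io.ByteArrayInputStream;
import java.io.File;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;
import org.apache.commons.io.FileUtils;
import org.apache.commons.lang3.StringEscapeUtils;
import org.apache.commons.lang3.text.WordUtils;

public class Parser {

private String[] files ;

public Parser(String ... files) {
    this.files = files;
}

private List<Inscrito> process() {

    List<Inscrito> inscritos = new ArrayList<Inscrito>();

    for ( String file : files ) {

        XMLInputFactory factory = XMLInputFactory.newFactory();
        // Deliver each text node as a single CHARACTERS event instead of several chunks
        factory.setProperty( XMLInputFactory.IS_COALESCING, Boolean.TRUE );

        try {

            // Use an explicit charset rather than the platform default (UTF-8 assumed here)
            String content = StringEscapeUtils.unescapeXml( FileUtils.readFileToString( new File(file), StandardCharsets.UTF_8 ) );

            XMLStreamReader parser = factory.createXMLStreamReader( new ByteArrayInputStream( content.getBytes( StandardCharsets.UTF_8 ) ) );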

            String currentTag = null;
            int columnCount = 0;
            Inscrito inscrito = null;           

            while ( parser.hasNext() ) {

                int currentEvent = parser.next();

                switch ( currentEvent ) {
                case XMLStreamReader.START_ELEMENT: 

                    currentTag = parser.getLocalName();

                    if ( "row".equals( currentTag ) ) {
                        columnCount = 0;
                        inscrito = new Inscrito();                      
                    }

                    break;
                case XMLStreamReader.END_ELEMENT:

                    currentTag = parser.getLocalName();

                    if ( "row".equals( currentTag ) ) {
                        inscritos.add( inscrito );
                    }

                    if ( "column".equals( currentTag ) ) {
                        columnCount++;
                    }                   

                    break;
                case XMLStreamReader.CHARACTERS:

                    if ( "column".equals( currentTag ) ) {

                        String text = parser.getText().trim().replaceAll( "\n" , " "); 

                        switch( columnCount ) {
                        case 0:
                            inscrito.setId( Integer.valueOf( text ) );
                            break;
                        case 1:                         
                            inscrito.setFirstName( WordUtils.capitalizeFully( text ) );
                            break;
                        case 2:
                            inscrito.setLastName( WordUtils.capitalizeFully( text ) );
                            break;
                        case 3:
                            inscrito.setEmail( text );
                            break;
                        }

                    }

                    break;
                }

            }

            parser.close();

        } catch (Exception e) {
            throw new IllegalStateException(e);
        }           

    }

    Collections.sort(inscritos);

    return inscritos;

}

public Map<String,List<Inscrito>> parse() {

    List<Inscrito> inscritos = this.process();

    Map<String,List<Inscrito>> resultado = new LinkedHashMap<String, List<Inscrito>>();

    for ( Inscrito i : inscritos ) {

        List<Inscrito> lista = resultado.get( i.getInicial() );

        if ( lista == null ) {
            lista = new ArrayList<Inscrito>();
            resultado.put( i.getInicial(), lista );
        }

        lista.add( i );

    }

    return resultado;
}

}

The code itself is in Portuguese, but it should be straightforward to follow. Here's the repo on GitHub.
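For the case in the question (a few nodes pulled from a network stream), a smaller sketch of the same pull-parsing idea might look like the following. It uses only the JDK's StAX API; the class name, the element names and the `limit` parameter are made up for illustration, and it assumes the wanted elements contain text only.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StaxSubsetExample {

    /** Returns "name=text" for each wanted element, stopping after limit hits. */
    static List<String> extract(InputStream in, Set<String> wanted, int limit)
            throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newFactory();
        // Merge adjacent text chunks so each text node arrives in one piece.
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        // The input comes from the network, so DTD processing is switched off.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);

        XMLStreamReader reader = factory.createXMLStreamReader(in);
        List<String> found = new ArrayList<>();
        try {
            // Stop pulling events as soon as everything we need has been seen;
            // the rest of the stream is never parsed.
            while (reader.hasNext() && found.size() < limit) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && wanted.contains(reader.getLocalName())) {
                    String name = reader.getLocalName();
                    // getElementText() reads up to the matching end tag and
                    // throws if the element has child elements.
                    found.add(name + "=" + reader.getElementText().trim());
                }
            }
        } finally {
            reader.close();
        }
        return found;
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<order><id>42</id><items><item>a</item></items>"
                + "<total>9.50</total><notes>never reached</notes></order>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        Set<String> wanted = new HashSet<>(Arrays.asList("id", "total"));
        System.out.println(extract(in, wanted, 2)); // prints [id=42, total=9.50]
    }
}
```

In real use the `InputStream` would be the socket or HTTP response stream rather than a byte array, so parsing can start before the whole document has arrived.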

If you're only extracting a small amount of data, consider XPath: selecting a few nodes by path is somewhat simpler than walking the whole document yourself.
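As a rough illustration, here is a minimal sketch using the JDK's `javax.xml.xpath` API against the table/row/column sample shown earlier in this thread. The class and method names are made up. Note that the standard XPath implementation reads the whole input into an in-memory tree before evaluating, so this trades speed and memory for simplicity.

```java
import java.io.StringReader;

import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;

import org.xml.sax.InputSource;

class XPathExtractExample {

    /** Returns the text of the fourth column of the first row. */
    static String firstEmail(String xml) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // evaluate(String, InputSource) parses the source and returns the
        // string value of the first node matched by the expression.
        return xpath.evaluate("/table/row[1]/column[4]",
                new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws XPathExpressionException {
        String xml = "<table><row>"
                + "<column>1</column><column>Nome</column>"
                + "<column>Sobrenome</column><column>email@gmail.com</column>"
                + "</row></table>";
        System.out.println(firstEmail(xml)); // prints email@gmail.com
    }
}
```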

Note: I'm the EclipseLink JAXB (MOXy) lead, and a member of the JAXB 2 (JSR-222) expert group.

StAX (JSR-173) is generally the fastest way to parse XML, and Woodstox is known for being a fast StAX parser. In addition to parsing, you need to collect the XML data. This is where a combination of StAX and JAXB comes in handy.

To ensure that your JAXB implementation uses the Woodstox StAX implementation, configure your environment to use Woodstox (this is as simple as adding Woodstox to your classpath). Then create an instance of XMLStreamReader and pass it as the source that JAXB should unmarshal.
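One quick way to confirm which StAX implementation the factory lookup actually resolves to is to print the factory's class name. This is only a small check, not part of the answer's own code, and the class name `StaxImplementationCheck` is made up.

```java
import javax.xml.stream.XMLInputFactory;

class StaxImplementationCheck {

    /** Returns the class name of the XMLInputFactory found on the classpath. */
    static String implementationName() {
        return XMLInputFactory.newFactory().getClass().getName();
    }

    public static void main(String[] args) {
        // With Woodstox on the classpath this is expected to be a class from
        // its com.ctc.wstx packages; otherwise it is the JDK's built-in parser.
        System.out.println(implementationName());
    }
}
```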

Either SAX or StAX could handle this, at the cost of some fiddly bookkeeping to work out when you have reached something you want. For extracting a small set of things by explicit path, though, you might be best off with XPath.

Another potential tactic is to first filter to only the parts you want using XSLT and then parse with anything you like, as the result of the filter will be a much smaller document.
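A minimal sketch of that filtering step, using the JDK's `javax.xml.transform` API, could look like this. The stylesheet, the `email` element name and the class name are hypothetical. Keep in mind that a typical XSLT 1.0 processor also builds the full input tree, so this shrinks what the second parser sees rather than the cost of the first pass.

```java
import java.io.StringReader;
import java.io.StringWriter;

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class XsltFilterExample {

    // Hypothetical stylesheet: keep only <email> elements, wrapped in <emails>.
    static final String FILTER =
          "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
        + "<xsl:template match='/'>"
        + "<emails><xsl:copy-of select='//email'/></emails>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    /** Runs the filter stylesheet over the input and returns the smaller document. */
    static String filter(String xml) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(FILTER)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)),
                new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws TransformerException {
        String xml = "<people>"
                + "<person><name>Ann</name><email>ann@example.com</email></person>"
                + "<person><name>Bob</name><email>bob@example.com</email></person>"
                + "</people>";
        // The result contains only the two <email> elements.
        System.out.println(filter(xml));
    }
}
```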

I think you should use SAX or a parser built on SAX; I'd recommend Apache Digester. SAX is event driven and does not store state, which is what you need here because you only have to extract a small part of the document (I guess one tag).
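As a sketch of the plain SAX route (without Digester), the handler below collects the text of the first element with a given name and then aborts the parse by throwing an exception, which is a common way to stop a SAX parser early. The class, method and element names are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;

import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class SaxSingleTagExample {

    /** Collects the text of the first element named target, then stops the parse. */
    static final class FirstTextHandler extends DefaultHandler {
        private final String target;
        private StringBuilder text; // non-null only while inside the target element
        String result;              // set once the target element has been closed

        FirstTextHandler(String target) {
            this.target = target;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (target.equals(qName)) {
                text = new StringBuilder();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // SAX may deliver one text node in several calls, so append.
            if (text != null) {
                text.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            if (text != null && target.equals(qName)) {
                result = text.toString().trim();
                throw new SAXException("stop: value found");
            }
        }
    }

    /** Returns the text of the first matching element, or null if there is none. */
    static String firstText(String xml, String tag)
            throws ParserConfigurationException, SAXException, IOException {
        FirstTextHandler handler = new FirstTextHandler(tag);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (SAXException e) {
            // Rethrow real parse errors; swallow only our own early stop.
            if (handler.result == null) {
                throw e;
            }
        }
        return handler.result;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<order><id>42</id><total>9.50</total><notes>skipped</notes></order>";
        System.out.println(firstText(xml, "total")); // prints 9.50
    }
}
```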
