Java XPath API extracting selective text

I'm using the Java XPath API to extract content from an XHTML file. I'm parsing the HTML and trying to extract the content of a specific div. The div contains text and a few other elements. When I use XPath, it strangely ignores all the HTML tags and extracts only the textual content. Here's an HTML snippet.

<html>
<body>
<div class="content">
    <div class="content_wrapper">
        <table border="0" cellspacing="0" cellpadding="0" class="test_class">
            <tr>
                <td>
                    <p>
                        Reading and looking at images or movies is one thing. Experiencing it in 3D the other. If you like to figure out more about what Showcase is, I would really encourage you to
                        download Showcase Viewer and have a look at the demo files also available on this site. Interact with the models and see how real it looks.
                    </p>
                    <p style="text-align: center;">
                        <img src="/testsource/fckdata/208123/image/showcarswatch.jpg" alt="" />
                        <img src="/testsource/fckdata/208123/image/engineswatch.jpg" alt="" />
                        <img src="/th.gen/?:760x0:/userdata/fckdata/208123/image/toasterswatch.jpg" alt="" />
                        <img src="/testsource/fckdata/208123/image/smartphoneswatch.jpg" alt="" />
                    </p>
                    <p>
                        <br />
                        Showcase Viewer is actually a full Showcase install, except data processing and creation tools. This means that you can look at any data created with a regular Showcase you
                        just can´t add any information. But you may join a collaboration session hosed by a Showcase Professional user. Here is where you can get it:<br />
                    </p>
                    <p>
                        <strong>Operating System</strong><br />
                        • Microsoft® Windows® XP Professional (SP 2 or higher)<br />
                        • Windows XP Professional x64 Edition (Autodesk® Showcase® software runs as a 32-bit application on 64-bit operating system)<br />
                        • Microsoft Windows Vista® 32-bit or 64-bit, including Business, Enterprise or Ultimate (SP 1)
                    </p>
                </td>
            </tr>
        </table>
    </div>
</div>
</body>
</html>

Now, here's the code I'm using. I need to do some cleanup before using the XPath.

CleanerProperties props = new CleanerProperties();
props.setOmitDoctypeDeclaration(true);
props.setAllowHtmlInsideAttributes(true);
props.setOmitUnknownTags(true);

TagNode tagNode = new HtmlCleaner(props).clean(urlXML, "UTF-8");        
Document doc = new DomSerializer(props, true).createDOM(tagNode);

String content = XPathAPI.eval(doc, "/html/body//div[@class='content']/div[@class='content_wrapper']").toString();

And here's the output.


Reading and looking at images or movies is one thing. Experiencing it in 3D the other. If you like to figure out more about what Showcase is, I would really encourage you to
download Showcase Viewer and have a look at the demo files also available on this site. Interact with the models and see how real it looks.

Showcase Viewer is actually a full Showcase install, except data processing and creation tools. This means that you can look at any data created with a regular Showcase you
just can´t add any information. But you may join a collaboration session hosed by a Showcase Professional user. Here is where you can get it

Operating System
• Microsoft® Windows® XP Professional (SP 2 or higher)<br /> 
• Windows XP Professional x64 Edition (Autodesk® Showcase® software runs as a 32-bit application on 64-bit operating system)<br /> 
• Microsoft Windows Vista® 32-bit or 64-bit, including Business, Enterprise or Ultimate (SP 1)

All I need is the complete content within the content_wrapper div.

Any pointers will be highly appreciated.

  • Thanks

EDIT

Sample code in response to yamburg's solution.

XPathFactory factory = XPathFactory.newInstance();
XPath xpathCompiled = factory.newXPath();
XPathExpression expr = xpathCompiled.compile(contentPath);
NodeList nodes = (NodeList) expr.evaluate(doc, XPathConstants.NODESET);


for (int i = 0; i < nodes.getLength(); i++) {
    Node n = (Node)nodes.item(i);
    traverseNodes(n);
}

public static void traverseNodes( Node n ) {
    NodeList children = n.getChildNodes();
    if( children != null ) {
        for(int i = 0; i < children.getLength(); i++ ) {
            Node childNode = children.item( i );
            System.out.println( "node name = " + childNode.getNodeName() );
            System.out.println( "node value = " + childNode.getNodeValue() );
            System.out.println( "node type = " + childNode.getNodeType() );
            traverseNodes( childNode );
        }
    }
}


XPath matches a node set. In your case that is an element node with child element nodes. toString() returns the textual representation of the node(s), which is just that: the text, without element names or attributes.
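To illustrate the difference, here is a minimal sketch using only the JDK's JAXP classes. The class name StringValueDemo, its helper methods, and the tiny sample markup are made up for this example; the idea is that evaluating the same path as a string yields only the concatenated text, while evaluating it as a node gives back the element with its children still attached.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of the original question or answer.
class StringValueDemo {

    // Parse a well-formed XML/XHTML string into a DOM document.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Evaluate the path as a string: this is the XPath string-value,
    // i.e. the text of all descendant text nodes, with no markup.
    static String stringValue(String xml, String path) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return (String) xpath.evaluate(path, parse(xml), XPathConstants.STRING);
    }

    // Evaluate the path as a node: the element itself, children included.
    static Node firstMatch(String xml, String path) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return (Node) xpath.evaluate(path, parse(xml), XPathConstants.NODE);
    }

    public static void main(String[] args) throws Exception {
        String xml = "<div class=\"content_wrapper\"><p>Hello <strong>world</strong></p></div>";
        String path = "/div[@class='content_wrapper']";

        System.out.println(stringValue(xml, path));                      // Hello world
        Node div = firstMatch(xml, path);
        System.out.println(div.getNodeName());                           // div
        System.out.println(div.getFirstChild().getNodeName());           // p
    }
}
```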

You should get the node:

NodeSequence nodes = (NodeSequence)XPathAPI.eval();

and then walk through the nodes, dumping whatever you want from them, or convert the result into a new DOM document, for instance.

P.S. Xalan is good, but modern Java has JAXP. For the sake of portability of code and knowledge I'd suggest to use that (unless Xalan extensions are required/useful):

XPathFactory factory = XPathFactory.newInstance();
XPath xpathCompiled = factory.newXPath();
XPathExpression expr = xpathCompiled.compile(xpath);

NodeList nodes = (NodeList) expr.evaluate(doc, XPathConstants.NODESET);

Then, to convert it into String (apparently that's what you want):

StringWriter sw = new StringWriter();
Transformer serializer = TransformerFactory.newInstance().newTransformer();
serializer.transform(new DOMSource(nodes.item(0)), new StreamResult(sw));
String result = sw.toString(); 

Note that it only takes the very first element from the NodeList, because XML must have a root element. In your case it is OK, if I understand right, otherwise you'd need to add a top-level element over the node set.
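If more than one node matches and a markup fragment (rather than a standalone XML document) is enough, one option is to serialize every matched node in turn. The following is only a sketch under that assumption: the class NodeMarkupDemo, the method serializeMatches, and the sample markup are hypothetical, and it sets OutputKeys.OMIT_XML_DECLARATION so the output does not start with an XML prolog.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of the original answer.
class NodeMarkupDemo {

    // Serialize every element matched by 'path', markup included, into one string.
    static String serializeMatches(String xml, String path) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(path, doc, XPathConstants.NODESET);

        // Identity transformer; suppress the <?xml ...?> declaration.
        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        StringWriter sw = new StringWriter();
        for (int i = 0; i < nodes.getLength(); i++) {
            // Each matched element is written with its tags and attributes.
            serializer.transform(new DOMSource(nodes.item(i)), new StreamResult(sw));
        }
        return sw.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<div class=\"content_wrapper\">"
                + "<p>Hello <strong>world</strong></p><p>Second</p></div>";
        // Select the child elements of the wrapper div.
        System.out.println(serializeMatches(xml, "/div[@class='content_wrapper']/*"));
    }
}
```

The result is a concatenation of elements, so it is not necessarily a well-formed XML document on its own; wrap it in a top-level element if it has to be parsed again.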
