Java XPath API extracting selective text

Posted 2023-03-01 06:46

I'm using the Java XPath API to extract content from an XHTML file. I'm parsing the HTML and trying to extract the content of a specific div. The div contains text and a few nested elements. When I use XPath, strangely, it ignores all the HTML tags and extracts only the textual content. Here's an HTML snippet.

<html>
<body>
<div class="content">
    <div class="content_wrapper">
        <table border="0" cellspacing="0" cellpadding="0" class="test_class">
            <tr>
                <td>
                    <p>
                        Reading and looking at images or movies is one thing. Experiencing it in 3D the other. If you like to figure out more about what Showcase is, I would really encourage you to
                        download Showcase Viewer and have a look at the demo files also available on this site. Interact with the models and see how real it looks.
                    </p>
                    <p style="text-align: center;">
                        <img src="/testsource/fckdata/208123/image/showcarswatch.jpg" alt="" />
                        <img src="/testsource/fckdata/208123/image/engineswatch.jpg" alt="" />
                        <img src="/th.gen/?:760x0:/userdata/fckdata/208123/image/toasterswatch.jpg" alt="" />
                        <img src="/testsource/fckdata/208123/image/smartphoneswatch.jpg" alt="" />
                    </p>
                    <p>
                        <br />
                        Showcase Viewer is actually a full Showcase install, except data processing and creation tools. This means that you can look at any data created with a regular Showcase you
                        just can´t add any information. But you may join a collaboration session hosed by a Showcase Professional user. Here is where you can get it:<br />
                    </p>
                    <p>
                        <strong>Operating System</strong><br />
                        • Microsoft® Windows® XP Professional (SP 2 or higher)<br />
                        • Windows XP Professional x64 Edition (Autodesk® Showcase® software runs as a 32-bit application on 64-bit operating system)<br />
                        • Microsoft Windows Vista® 32-bit or 64-bit, including Business, Enterprise or Ultimate (SP 1)
                    </p>
                </td>
            </tr>
        </table>
    </div>
</div>
</body>
</html>

Now, here's the code I'm using. I need to do some cleanup before applying the XPath.

CleanerProperties props = new CleanerProperties();
props.setOmitDoctypeDeclaration(true);
props.setAllowHtmlInsideAttributes(true);
props.setOmitUnknownTags(true);

TagNode tagNode = new HtmlCleaner(props).clean(urlXML, "UTF-8");        
Document doc = new DomSerializer(props, true).createDOM(tagNode);

String content = XPathAPI.eval(doc, "/html/body//div[@class='content']/div[@class='content_wrapper']").toString();

And here's the output.


Reading and looking at images or movies is one thing. Experiencing it in 3D the other. If you like to figure out more about what Showcase is, I would really encourage you to
download Showcase Viewer and have a look at the demo files also available on this site. Interact with the models and see how real it looks.

Showcase Viewer is actually a full Showcase install, except data processing and creation tools. This means that you can look at any data created with a regular Showcase you
just can´t add any information. But you may join a collaboration session hosed by a Showcase Professional user. Here is where you can get it

Operating System
• Microsoft® Windows® XP Professional (SP 2 or higher)<br /> 
• Windows XP Professional x64 Edition (Autodesk® Showcase® software runs as a 32-bit application on 64-bit operating system)<br /> 
• Microsoft Windows Vista® 32-bit or 64-bit, including Business, Enterprise or Ultimate (SP 1)

All I need is the complete content within the content_wrapper div.

Any pointers will be highly appreciated.

Thanks

EDIT

Sample code in response to yamburg's solution.

XPathFactory factory = XPathFactory.newInstance();
XPath xpathCompiled = factory.newXPath();
XPathExpression expr = xpathCompiled.compile(contentPath);
NodeList nodes = (NodeList) expr.evaluate(doc, XPathConstants.NODESET);


for (int i = 0; i < nodes.getLength(); i++) {
    Node n = (Node)nodes.item(i);
    traverseNodes(n);
}

public static void traverseNodes( Node n ) {
    NodeList children = n.getChildNodes();
    if( children != null ) {
        for(int i = 0; i < children.getLength(); i++ ) {
            Node childNode = children.item( i );
            System.out.println( "node name = " + childNode.getNodeName() );
            System.out.println( "node value = " + childNode.getNodeValue() );
            System.out.println( "node type = " + childNode.getNodeType() );
            traverseNodes( childNode );
        }
    }
}

An XPath expression matches a node set; in your case, an element node (the div) with child nodes. toString() returns the string value of those nodes, which is just the text, without element names or attributes.
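
To make the difference concrete, here is a minimal sketch using only JAXP from the JDK. The class name, the tiny inline document and the path are made up for illustration and are not the asker's real page; the point is only that asking for a string drops the markup, while asking for a node set keeps the element and its children.

```java
import java.io.StringReader;
import javax.xml.namespace.QName;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class StringValueDemo {
    // Hypothetical, cut-down stand-in for the cleaned page.
    static final String XML =
        "<div class=\"content_wrapper\"><p>Hello <strong>world</strong><br/></p></div>";
    static final String PATH = "/div[@class='content_wrapper']";

    // Parses the markup and evaluates the path as the requested result type.
    static Object eval(String xml, String path, QName type) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return XPathFactory.newInstance().newXPath().evaluate(path, doc, type);
        } catch (Exception e) {
            // Checked exceptions are wrapped only to keep the sketch short.
            throw new IllegalStateException(e);
        }
    }

    // STRING result: the string-value of the first match, i.e. its text only.
    static String stringValue(String xml, String path) {
        return (String) eval(xml, path, XPathConstants.STRING);
    }

    // NODESET result: the matched nodes themselves, children still attached.
    static NodeList nodes(String xml, String path) {
        return (NodeList) eval(xml, path, XPathConstants.NODESET);
    }

    public static void main(String[] args) {
        System.out.println(stringValue(XML, PATH));                      // Hello world
        NodeList found = nodes(XML, PATH);
        System.out.println(found.item(0).getNodeName());                 // div
        System.out.println(found.item(0).getFirstChild().getNodeName()); // p
    }
}
```

The strong and br elements are still there in the node set result; they only disappear once the node is flattened to its string value.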

You should get the nodes themselves:

NodeSequence nodes = (NodeSequence)XPathAPI.eval();

and then walk through the nodes, dumping whatever you want from them, or convert them into a new DOM document, for instance.
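
As a rough illustration of the "new DOM document" option, the sketch below copies every matched node into a fresh Document under a wrapper root. It uses plain JAXP rather than Xalan's XPathAPI, and the class name, sample markup, path and the "wrapper" element name are all assumptions chosen for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class NodeSetToDocument {
    // Hypothetical sample input, not the page from the question.
    static final String XML = "<body><p>one</p><p>two <b>bold</b></p></body>";

    // Copies every node matched by path into a new Document whose root is rootName.
    static Document copyMatches(String xml, String path, String rootName) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document source = builder.parse(new InputSource(new StringReader(xml)));
            NodeList matches = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(path, source, XPathConstants.NODESET);

            Document target = builder.newDocument();
            Element root = target.createElement(rootName);
            target.appendChild(root);
            for (int i = 0; i < matches.getLength(); i++) {
                // importNode(node, true) makes a deep copy owned by the target document.
                root.appendChild(target.importNode(matches.item(i), true));
            }
            return target;
        } catch (Exception e) {
            // Checked exceptions are wrapped only to keep the sketch short.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Element root = copyMatches(XML, "/body/p", "wrapper").getDocumentElement();
        // prints "wrapper has 2 children"
        System.out.println(root.getNodeName() + " has " + root.getChildNodes().getLength() + " children");
    }
}
```

This assumes the path selects elements or text nodes; attribute nodes cannot be appended as children, so a path ending in @something would need different handling.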

P.S. Xalan is good, but modern Java has JAXP. For the sake of portability of code and knowledge, I'd suggest using that (unless Xalan extensions are required or useful):

XPathFactory factory = XPathFactory.newInstance();
XPath xpathCompiled = factory.newXPath();
XPathExpression expr = xpathCompiled.compile(xpath);

NodeList nodes = (NodeList) expr.evaluate(doc, XPathConstants.NODESET);

Then, to convert it into String (apparently that's what you want):

StringWriter sw = new StringWriter();
Transformer serializer = TransformerFactory.newInstance().newTransformer();
serializer.transform(new DOMSource(nodes.item(0)), new StreamResult(sw));
String result = sw.toString();

Note that this serializes only the first node in the NodeList, because an XML document must have a single root element. In your case that is OK, if I understand correctly; otherwise you'd need to add a top-level element over the node set.
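
If a single string of markup is all that is needed, and not a well-formed document, one alternative is to serialize each matched node in turn and leave out the XML declaration. The following is only a sketch under that assumption; the class name, sample markup and path are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class SerializeMatches {
    // Hypothetical sample input, not the page from the question.
    static final String XML = "<div><p>Hello <strong>world</strong></p><p>again</p></div>";

    // Serializes every node matched by path, one after another, into one string.
    static String serialize(String xml, String path) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(path, doc, XPathConstants.NODESET);

            // Identity transformer; without this property each fragment
            // would start with its own <?xml ...?> declaration.
            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            StringWriter sw = new StringWriter();
            for (int i = 0; i < nodes.getLength(); i++) {
                serializer.transform(new DOMSource(nodes.item(i)), new StreamResult(sw));
            }
            return sw.toString();
        } catch (Exception e) {
            // Checked exceptions are wrapped only to keep the sketch short.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Prints the two p elements with their tags intact.
        System.out.println(serialize(XML, "/div/p"));
    }
}
```

For the page in the question, a path ending in /* on the content_wrapper div would presumably select its child elements, so the same loop would give the markup inside the div without the div itself.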
