How do I replace HTML escapes in an input stream before parsing it to XML?

2023-01-16 18:04 问答作者：

I have an input stream which is being converted to XML, and read. When I get down to some text elements in the XML, they are truncated. I believe the parser is dropping everything after escaped HTML such as & Here is the code getting the input stream and then getting the text element.

String hvurl = "https://www.mysite.com/api/a/" + answerId;
in = OpenHttpConnection(hvurl); 

Document doc = null;
DocumentBuilderFactory dbf = 
    DocumentBuilderFactory.newInstance();
DocumentBuilder db;

try {
    db = dbf.newDocumentBuilder();
    doc = db.p开发者_StackOverflow社区arse(in);

} catch (ParserConfigurationException e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
} catch (SAXException e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
}     

...
//Now when I get the text element, it's truncated
//---get the <varietalTitle> elements under the <varietal> 
// element---
NodeList varietalTitleNodes = 
    (varietalElement).getElementsByTagName("varietaltitle");

//---convert a Node into an Element---
Element varietalTitleElement = (Element) varietalTitleNodes.item(0);

//---get all the child nodes under the <varietaltitle> element---
NodeList varietalTitleTextNodes = 
    ((Node) varietalTitleElement).getChildNodes();

//---retrieve the text of the <varietalid> element---
strVarietalTitle = ((Node) varietalTitleTextNodes.item(0)).getNodeValue();

Cant get where the problem occurs. My guess use the normalize() method as below.

Try this:

 strVarietalTitle = ((Node) varietalTitleTextNodes.item(0)).getNodeValue().normalize();

From documentation Normalize():

Puts Puts all Text nodes in the full depth of the sub-tree underneath this Node, including attribute nodes, into a "normal" form where only structure (e.g., elements, comments, processing instructions, CDATA sections, and entity references) separates Text nodes, i.e., there are neither adjacent Text nodes nor empty Text nodes. This can be used to ensure that the DOM view of a document is the same as if it were saved and re-loaded, and is useful when operations (such as XPointer [XPointer] lookups) that depend on a particular document tree structure are to be used. If the parameter "normalize-characters" of the DOMConfiguration object attached to the Node.ownerDocument is true, this method will also fully normalize the characters of the Text nodes. Note: In cases where the document contains CDATASections, the normalize operation alone may not be sufficient, since XPointers do not differentiate between Text nodes and CDATASection nodes.

An XML parser should cope with character entities such as "&" ... assuming that's what you are talking about.

One possibility is that your input contains particular named entities that the XML parser doesn't know about.

继续阅读：android

How do I replace HTML escapes in an input stream before parsing it to XML?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？