How can I find the elements that are not inside an <a> tag using HtmlCleaner?
I use HtmlCleaner for mining data. Here is how it works:
HtmlCleaner cleaner = new HtmlCleaner();
final String siteUrl = "http://www.apple.com/";
TagNode node = cleaner.clean(new URL(siteUrl));
TagNode[] aTagNode = node.getAllElements(true);
for(int i = 0; i< aTagNode.length; i++){
if(!aTagNode[i].hasAttribute("a")){
System.out.println(aTagNode[i].getText());
}
}
But I ran into a problem. For example, take this markup:
<a href="/choose-your-country/">
<img src="http://images.apple.com/home/elements/worldwide_us.png" alt="United States of America" height="22" width="22" />
<span class="more">Choose your country or region</span>
</a>
The text "Choose your country or region" is inside the span tag, but its parent node is an <a> tag, so I don't want it either. I only want the text outside the links in markup like this:
<p class="left">Shop the <a href="/store/">Apple Online Store</a> (1-800-MY-APPLE), visit an <a href="/retail/">Apple Retail Store</a>, or find a <a href="/buy/">reseller</a>.</p>
I want the result to be "Shop the", "(1-800-MY-APPLE), visit an", ", or find a" and ".". The words "Apple Online Store", "Apple Retail Store" and "reseller" are text inside <a> tags, so I want to ignore them. Thank you.
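import java.util.ArrayList;
import java.util.HashSet;
import java.util.Set;

// Collect every element, then drop the ones nested inside an <a> tag.
TagNode[] aTagNode = node.getAllElements(true);
ArrayList<TagNode> tagNodes = new ArrayList<TagNode>();
Set<TagNode> toBeRemoved = new HashSet<TagNode>();
for (int i = 0; i < aTagNode.length; i++) {
    // "a" is a tag name, not an attribute, so compare getName() rather than calling hasAttribute("a").
    if (!"a".equals(aTagNode[i].getName())) {
        tagNodes.add(aTagNode[i]);
    } else {
        // Mark all descendants of the <a> tag, not only its direct children.
        TagNode[] children = aTagNode[i].getAllElements(true);
        for (TagNode child : children) {
            toBeRemoved.add(child);
        }
    }
}
// The loop variable is named tagNode because "node" is already the root variable.
for (TagNode tagNode : tagNodes) {
    if (!toBeRemoved.contains(tagNode)) {
        System.out.println(tagNode.getText());
    }
}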
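If you need only the text outside the links (as in the <p class="left"> example), filtering whole elements may not be enough, because getText() on an element appears to include the text of its descendants. One possible alternative is to walk the tree yourself and skip every <a> subtree. The sketch below is only a stand-in under stated assumptions: it uses the JDK's built-in DOM parser instead of HtmlCleaner so that it runs without extra dependencies, which means it only accepts well-formed markup. The names TextOutsideLinks, collect and textOutsideLinks are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TextOutsideLinks {

    // Appends the text of `node` and its descendants to `out`,
    // skipping every <a> element together with everything below it.
    static void collect(Node node, StringBuilder out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            out.append(node.getNodeValue());
            return;
        }
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "a".equalsIgnoreCase(node.getNodeName())) {
            return; // nested tags such as <span> inside <a> are ignored too
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), out);
        }
    }

    // Parses a well-formed (X)HTML fragment with a single root element
    // and returns its text with all link text left out.
    static String textOutsideLinks(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            StringBuilder out = new StringBuilder();
            collect(doc.getDocumentElement(), out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse markup", e);
        }
    }

    public static void main(String[] args) {
        String p = "<p class=\"left\">Shop the <a href=\"/store/\">Apple Online Store</a>"
                + " (1-800-MY-APPLE), visit an <a href=\"/retail/\">Apple Retail Store</a>,"
                + " or find a <a href=\"/buy/\">reseller</a>.</p>";
        System.out.println(textOutsideLinks(p));
    }
}
```

With HtmlCleaner the same recursion would presumably walk the children of each TagNode instead of DOM nodes. I have not verified that variant here, so treat it as a direction rather than a drop-in fix.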