How to iterate over plain text segments with the Jericho HTML parser
For a Jericho Element, I am trying to find out how to loop over all of its child nodes, whether they are elements or plain text.
There is Element.getNodeIterator(), but it iterates over ALL descendants of the Element, not just its direct children.
I need the equivalent of an Element.getChildSegments() method. Any ideas?
Thanks
All plain text segments not within any child elements, correct?
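import java.util.*;
import net.htmlparser.jericho.*;  // Jericho 3.x package name

// Returns the plain text segments that are direct children of elem,
// i.e. text that does not lie inside any child element.
public static Iterator<Segment> directPlainTextChildren(Element elem) {
    final Iterator<Segment> it = elem.getContent().getNodeIterator();
    final List<Segment> results = new LinkedList<Segment>();
    final List<Element> children = elem.getChildElements();
    while (it.hasNext()) {
        Segment cur = it.next();
        if (!(cur instanceof Tag) && !(cur instanceof CharacterReference)) {
            // Skip text that is enclosed by one of the child elements.
            boolean enclosed = false;
            for (Element child : children)
                if (child.encloses(cur)) { enclosed = true; break; }
            if (!enclosed) results.add(cur);
        }
    }
    return results.iterator();
}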
An element should have few direct children, and the containment check against each child is just a simple bounds comparison, so the performance should be adequate.
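To make the "simple bounds check" concrete, here is a minimal standalone sketch of the idea. It does not use Jericho at all: Span and DirectChildFilter are hypothetical names invented for illustration, and a Span is assumed to be nothing more than a begin/end character range. The cost is one comparison per (node, child) pair, which stays small when an element has only a handful of direct children.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a parsed segment: a [begin, end) character
// range in the source text. Not part of Jericho.
final class Span {
    final int begin;
    final int end;

    Span(int begin, int end) {
        this.begin = begin;
        this.end = end;
    }

    // Bounds check: true when other lies entirely within this span.
    boolean encloses(Span other) {
        return begin <= other.begin && other.end <= end;
    }
}

final class DirectChildFilter {
    // Keeps only the nodes that are not inside any of the child spans.
    static List<Span> outsideAll(List<Span> nodes, List<Span> children) {
        List<Span> results = new ArrayList<>();
        for (Span node : nodes) {
            boolean enclosed = false;
            for (Span child : children) {
                if (child.encloses(node)) {
                    enclosed = true;
                    break; // one enclosing child is enough
                }
            }
            if (!enclosed) results.add(node);
        }
        return results;
    }
}
```

For a source such as <p>ab<b>cd</b>ef</p>, the text "cd" falls inside the span of the <b> element and is dropped, while "ab" and "ef" are kept.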
edit: If you wanted to add the ability to iterate all direct child segments it would look like this:
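// Returns the direct children of elem: child elements, plus any plain text
// or character references that do not lie inside a child element.
public static Iterator<Segment> getChildSegments(Element elem) {
    final Iterator<Segment> it = elem.getContent().getNodeIterator();
    final List<Segment> results = new LinkedList<Segment>();
    final List<Element> children = elem.getChildElements();
    while (it.hasNext()) {
        Segment cur = it.next();
        if (cur instanceof StartTag) {
            // A start tag stands for its whole element; keep the element
            // only if it is a direct child of elem, not a deeper descendant.
            Element e = ((StartTag) cur).getElement();
            if (children.contains(e)) results.add(e);
        } else if (!(cur instanceof Tag)) {
            // Plain text or character reference: keep it unless it is
            // enclosed by one of the child elements.
            boolean enclosed = false;
            for (Element child : children)
                if (child.encloses(cur)) { enclosed = true; break; }
            if (!enclosed) results.add(cur);
        }
    }
    return results.iterator();
}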
Using the methodology from Gunslinger47 above, the following returns the immediate (direct child) segments of the Element elem: non-whitespace plain text, plus the start tag of each direct child element:
public static List<Segment> getChildSegments(Element elem) {
    final Iterator<Segment> it = elem.getContent().getNodeIterator();
    final List<Segment> results = new LinkedList<Segment>();
    final List<Element> children = elem.getChildElements();
    while (it.hasNext()) {
        Segment cur = it.next();
        if (!(cur instanceof Tag) && !(cur instanceof CharacterReference) && !cur.isWhiteSpace()) {
            // Non-whitespace text: keep it unless a child element encloses it.
            boolean enclosed = false;
            for (Element child : children) {
                if (child.encloses(cur)) {
                    enclosed = true;
                    break;
                }
            }
            if (!enclosed) results.add(cur);
        } else {
            // Otherwise keep cur only if it is the start tag of a direct child.
            for (Element child : children) {
                if (child.getStartTag().equals(cur)) {
                    results.add(cur);
                    break;
                }
            }
        }
    }
    return results;
}