Parsing HTML content with sibling tags in Java (or) Finding content between two <open> tags
Background: I'm writing a Java program that goes through HTML files and replaces the content of every tag other than <script> or <style> with Lorem Ipsum. I originally did this with a regex that just removed everything between a > and a <, which actually worked quite well (blasphemous, I know), but I'm trying to turn this into a tool others may find useful, so I wouldn't dare threaten the sanctity of the universe any further by using regex on HTML.
I'm trying to use HtmlCleaner, a Java library that attracted me because it has no other dependencies. However, while implementing it I've been unable to deal with HTML like this:
<div>
This text is in the div <span>but this is also in a span.</span>
</div>
The problem is simple. When the TagNodeVisitor reaches the div, if I replace its contents with the right amount of lipsum, it will eliminate the span tag. But if I drill down to only TagNodes with no other children, I would miss the first bit of text.
HtmlCleaner has a ContentNode object, but that object has no replace method. Anything I can think of to deal with this seems like it must be far too complicated. Is anyone familiar with a way to deal with this, with HtmlCleaner or some other parsing library you're more familiar with?
You can do pretty much anything you want with JSoup's setters. Would that suit you?
Element div = doc.select("div").first(); // <div></div>
div.html("<p>lorem ipsum</p>"); // <div><p>lorem ipsum</p></div>
HtmlCleaner's ContentNode has a method getContent() that returns a java.lang.StringBuilder. This is mutable and can be changed to whatever value you want.
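To make the "edit the text node in place" idea concrete, here is a minimal sketch that uses only the JDK's built-in DOM rather than HtmlCleaner or JSoup, so it runs without extra dependencies. It assumes the input is well-formed (XHTML-style) markup; real-world HTML would still need a tolerant parser in front of it. The class name `LipsumReplacer` and the helpers `lipsum`, `replaceText` and `process` are hypothetical names made up for this illustration. The point is that text is a sibling node of the <span>, so it can be rewritten without touching the tags around it.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class LipsumReplacer {
    private static final String LIPSUM =
        "Lorem ipsum dolor sit amet, consectetur adipiscing elit. ";

    // Returns filler text of exactly `length` characters by repeating LIPSUM.
    public static String lipsum(int length) {
        StringBuilder sb = new StringBuilder();
        while (sb.length() < length) {
            sb.append(LIPSUM);
        }
        sb.setLength(length);
        return sb.toString();
    }

    // Walks the tree depth-first. Text nodes are rewritten in place,
    // element nodes are kept, and <script>/<style> subtrees are skipped.
    public static void replaceText(Node node) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue();
            // Leave whitespace-only text (indentation, newlines) alone.
            if (!text.trim().isEmpty()) {
                node.setNodeValue(lipsum(text.length()));
            }
            return;
        }
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            String name = node.getNodeName();
            if (name.equalsIgnoreCase("script") || name.equalsIgnoreCase("style")) {
                return;
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            replaceText(child);
        }
    }

    // Parses well-formed markup, replaces the text, and serializes it back.
    public static String process(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xhtml)));
        replaceText(doc.getDocumentElement());
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String html =
            "<div>This text is in the div <span>but this is also in a span.</span></div>";
        System.out.println(process(html));
    }
}
```

Run on the <div> from the question, both the loose text and the text inside the <span> are replaced with filler of the same length, and the <span> element itself survives. The same walk should carry over to HtmlCleaner or JSoup by visiting their text/content nodes instead of the element nodes.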