Parsing HTML content with sibling tags in Java (or) Finding content between two <open> tags

2023-03-31 04:58 问答作者：

Background: I'm writing a Java program to go through HTML files and replace all the content in tags that are not <script> or <style> with Lorem Ipsum. I originally did this with a regex just removing everything between a > and a <, which actually worked quite well (blasphemous I know), but I'm trying to turn this into a tool others may find useful so I wouldn't dare threaten the sanctity of the universe any more by trying to use regex on html.

I'm trying to use HtmlCleaner, a Java library that attracted me because it has no other dependencies. However, trying to implement it I've been un开发者_如何学编程able to deal with html like this:

<div>
    This text is in the div <span>but this is also in a span.</span>
</div>

The problem is simple. When the TagNodeVisitor reaches the div, if I replace its contents with the right amount of lipsum, it will eliminate the span tag. But if I drill down to only TagNodes with no other children, I would miss the first bit of text.

HtmlCleaner has a ContentNode object, but that object has no replace method. Anything I can think of to deal with this seems like it must be far too complicated. Is anyone familiar with a way to deal with this, with HtmlCleaner or some other parsing library you're more familiar with?

You can pretty much do anything you want with JSoup setters

Would that suit you ?

 Element div = doc.select("div").first(); // <div></div>
 div.html("<p>lorem ipsum</p>"); // <div><p>lorem ipsum</p></div>

HtmlCleaner's ContentNode has a method getContent() that returns a java.lang.StringBuilder. This is mutable and can be changed to whatever value you want.

继续阅读：htmlcleaner nested parsing

Parsing HTML content with sibling tags in Java (or) Finding content between two <open> tags

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？