Regex to remove html does not get rid of img tag

2023-03-13 10:21 Q&A author:

I am using a regex to remove HTML tags. I do something like this: result = result.replaceAll("<.*?>", "");

However, it does not get rid of the img tags in the HTML. Any idea what would be a good way to do that?

If you cannot use an HTML parser or cleaner, then I would at least suggest using the Pattern.DOTALL flag, so that tags spanning several lines are matched as well. Consider code like this:

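import java.util.regex.Matcher;
import java.util.regex.Pattern;

String str = "123 <img \nsrc='ping.png'>abd foo";
// DOTALL makes '.' match line terminators, so a tag broken across lines still matches
Pattern pt = Pattern.compile("<.*?>", Pattern.DOTALL);
Matcher matcher = pt.matcher(str);
StringBuffer sb = new StringBuffer();
while (matcher.find()) {
    // replace each matched tag with an empty string
    matcher.appendReplacement(sb, "");
}
// copy the text that follows the last match
matcher.appendTail(sb);
System.out.println("Output: " + sb);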

OUTPUT

Output: 123 abd foo
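
For comparison, here is a minimal sketch of why the plain replaceAll call from the question leaves such a tag behind. By default '.' does not match line terminators, so a tag that wraps onto a second line never matches. The class name DotAllDemo and the helper stripTags are made up for illustration. The inline flag (?s) is the in-pattern equivalent of Pattern.DOTALL.

```java
import java.util.regex.Pattern;

class DotAllDemo {
    // Compiled once; DOTALL lets '.' match '\n' inside a tag.
    private static final Pattern TAG = Pattern.compile("<.*?>", Pattern.DOTALL);

    // Hypothetical helper: removes anything that looks like a tag.
    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "123 <img \nsrc='ping.png'>abd foo";

        // Without DOTALL the multi-line img tag does not match, so nothing changes.
        System.out.println(html.replaceAll("<.*?>", "").equals(html)); // prints "true"

        // Same pattern with the inline (?s) flag: the tag is removed.
        System.out.println(html.replaceAll("(?s)<.*?>", ""));          // prints "123 abd foo"

        // Precompiled variant with Pattern.DOTALL.
        System.out.println(stripTags(html));                           // prints "123 abd foo"
    }
}
```

This is still only a regex and will be confused by things like a '>' inside an attribute value, so it is a stopgap rather than a replacement for a parser.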

To give a more concrete recommendation, use JSoup (or NekoHTML) to parse the HTML into a Java object.

Once you've got a Document object it can easily be traversed to remove the tags. This cookbook recipe shows how to get attributes and text from the DOM.

Another suggestion is HtmlCleaner.

I'm just reiterating what others have said already, but this point cannot be overstated: DO NOT USE REGEXES TO PARSE HTML. There are a thousand similar questions about this on SO. Use a proper HTML parser; it will make your life much easier, and it is far more robust and reliable. Take a look at Dom4j, Jericho, JSoup. Please.

So, a piece of code for you. I use http://htmlparser.sourceforge.net/ to parse HTML. It is not overcomplicated and quite straightforward to use.

Basically it looks like this:

import org.htmlparser.Parser;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

    ...

    String html; /* read your HTML into variable 'html' */
    String result=null;
    ....
    try {
        Parser p = new Parser(html);
        NodeList nodes = p.parse(null);
        result = nodes.asString();
    } catch (ParserException e) {
        e.printStackTrace();
    }

That will give you plain text stripped of tags (but character entities such as &amp; will not be decoded). And of course you can do plenty more with this library, such as applying filters, using visitors, and iterating over the nodes.

Use an HTML parser instead. Iterate over the resulting object and print it however you like; that gives the best result.

I was able to do this with the code snippet below.

String htmlContent = values.get(position).getContentSnippet();
String plainTextContent = htmlContent.replaceAll("<img .*?/>", "");

I used the above regex to clean the img tags in my RSS content.
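
Note that the pattern "<img .*?/>" only matches lower-case, self-closing img tags that fit on one line. A slightly more tolerant sketch, assuming no attribute value contains a '>' character, is shown below. The class name ImgTagDemo and the helper removeImgTags are hypothetical names used only for this example.

```java
import java.util.regex.Pattern;

class ImgTagDemo {
    // "<img", a word boundary so that e.g. "<imgx>" is not matched,
    // then everything up to the next '>'. [^>] also matches newlines,
    // so DOTALL is not needed here.
    private static final Pattern IMG =
            Pattern.compile("<img\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: removes img tags and leaves all other markup alone.
    static String removeImgTags(String html) {
        return IMG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String rss = "<p>Hello <IMG src='a.png'> and <img\n src=\"b.png\" /> world</p>";
        // Both img tags are removed; the surrounding <p> element is kept.
        System.out.println(removeImgTags(rss));
    }
}
```

The same caveat as for any regex-based approach applies: it handles common cases, but a real HTML parser is more reliable on arbitrary input.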
