
Regex to remove HTML does not get rid of img tag

I am using a regex to remove HTML tags. I do something like this: result.replaceAll("\\<.*?\\>", "");

However, it does not get rid of the img tags in the HTML. Any idea what a good way to do that would be?


If you cannot use an HTML parser or cleaner, I would at least suggest using the Pattern.DOTALL flag so that tags spanning multiple lines are matched. Consider code like this:

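import java.util.regex.Matcher;
import java.util.regex.Pattern;

// The img tag spans two lines, so '.' must also match '\n' (Pattern.DOTALL).
String str = "123 <img \nsrc='ping.png'>abd foo";
Pattern pt = Pattern.compile("<.*?>", Pattern.DOTALL);
Matcher matcher = pt.matcher(str);
StringBuffer sb = new StringBuffer();
while (matcher.find()) {
    matcher.appendReplacement(sb, ""); // drop each matched tag
}
matcher.appendTail(sb); // copy the text after the last match
System.out.println("Output: " + sb);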

OUTPUT

Output: 123 abd foo
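
If a shorter form is preferred, the same flag can be embedded in the pattern as (?s), which lets String.replaceAll do the work in one call. The sketch below contrasts the two behaviours on the same input; the class and method names are made up for illustration only.

```java
// Hypothetical helper class, not part of any library.
class DotallDemo {
    // Without DOTALL, '.' stops at line terminators, so a tag that spans lines is missed.
    static String stripSingleLine(String html) {
        return html.replaceAll("<.*?>", "");
    }

    // (?s) is the embedded form of Pattern.DOTALL: '.' also matches '\n'.
    static String stripDotall(String html) {
        return html.replaceAll("(?s)<.*?>", "");
    }

    public static void main(String[] args) {
        String str = "123 <img \nsrc='ping.png'>abd foo";
        System.out.println(stripSingleLine(str)); // the img tag is still there
        System.out.println(stripDotall(str));     // the img tag is removed
    }
}
```

This is most likely why the original pattern appeared to skip img tags: in real-world HTML they often wrap across lines, while short tags such as <b> or <p> rarely do.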


To give a more concrete recommendation, use JSoup (or NekoHTML) to parse the HTML into a Java object.

Once you have a Document object, it can easily be traversed to remove the tags. The JSoup cookbook has a recipe that shows how to get attributes and text from the DOM.


Another suggestion is HtmlCleaner


I'm just reiterating what others have said already, but this point cannot be overstated: DO NOT USE REGEXES TO PARSE HTML. There are thousands of similar questions about this on SO. Use a proper HTML parser; it will make your life so much easier, and it is far more robust and reliable. Take a look at Dom4j, Jericho, JSoup. Please.


So, a piece of code for you. I use http://htmlparser.sourceforge.net/ to parse HTML. It is not overcomplicated and quite straightforward to use.

Basically it looks like this:

import org.htmlparser.Parser;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

    ...

    String html; /* read your HTML into variable 'html' */
    String result=null;
    ....
    try {
        Parser p = new Parser(html);
        NodeList nodes = p.parse(null);
        result = nodes.asString();
    } catch (ParserException e) {
        e.printStackTrace();
    }

That will give you plain text stripped of tags (although entities such as &amp; will not be decoded). And of course you can do plenty more with this library, like applying filters, using visitors, and iterating over nodes.
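
If the leftover entities matter, a small post-processing step can handle the most common ones. This is only a sketch using plain string replacement; EntityDecoder and decodeBasicEntities are hypothetical names, and only the five entities listed are covered. A complete solution would need a full entity table from a library.

```java
// Hypothetical helper, not part of htmlparser or any other library.
class EntityDecoder {
    // Decodes only the five entities listed here; anything else is left untouched.
    static String decodeBasicEntities(String text) {
        return text.replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&quot;", "\"")
                   .replace("&#39;", "'")
                   // &amp; goes last so "&amp;lt;" becomes "&lt;" rather than "<".
                   .replace("&amp;", "&");
    }

    public static void main(String[] args) {
        System.out.println(decodeBasicEntities("Tom &amp; Jerry &lt;3"));
    }
}
```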


Use an HTML parser instead. Iterate over the parsed object and print it however you like; that will give you the best result.


I was able to do this with the code snippet below.

String htmlContent = values.get(position).getContentSnippet();
String plainTextContent = htmlContent.replaceAll("<img .*?/>", "");

I used the above regex to clean the img tags in my RSS content.
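
Note that this pattern only matches self-closing tags written as <img .../> on a single line. A variation that also covers <img ...> without the trailing slash, upper-case tag names, and tags that wrap across lines might look like the sketch below. ImgStripper and stripImages are hypothetical names, and the approach still breaks if an attribute value contains a '>' character.

```java
// Hypothetical helper class for illustration only.
class ImgStripper {
    // (?i) ignores case; \b keeps tag names that merely start with "img" from matching;
    // [^>]* consumes everything, including newlines, up to the closing '>'.
    static String stripImages(String html) {
        return html.replaceAll("(?i)<img\\b[^>]*>", "");
    }

    public static void main(String[] args) {
        String html = "a<img src='x.png'>b<IMG\nsrc='y.png' />c<p>d</p>";
        System.out.println(stripImages(html)); // other tags such as <p> are kept
    }
}
```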
