How to find if String contains html data?
How do I find out whether a string contains HTML data or not? The user provides input via a web interface, and it's quite possible they entered either plain text or HTML formatting.
I know this is an old question but I ran into it and was looking for something more comprehensive that could detect things like HTML entities and would ignore other uses of < and > symbols. I came up with the following class that works well.
You can play with it live at http://ideone.com/HakdHo
I also uploaded this to GitHub with a bunch of JUnit tests.
package org.github;
import java.util.regex.Pattern;
/**
 * Detect HTML markup in a string
 * This will detect tags or entities
 *
 * @author dbennett455@gmail.com - David H. Bennett
 *
 */
public class DetectHtml
{
    // adapted from post by Phil Haack and modified to match better
    public final static String tagStart=
        "\\<\\w+((\\s+\\w+(\\s*\\=\\s*(?:\".*?\"|'.*?'|[^'\"\\>\\s]+))?)+\\s*|\\s*)\\>";
    public final static String tagEnd=
        "\\</\\w+\\>";
    public final static String tagSelfClosing=
        "\\<\\w+((\\s+\\w+(\\s*\\=\\s*(?:\".*?\"|'.*?'|[^'\"\\>\\s]+))?)+\\s*|\\s*)/\\>";
    public final static String htmlEntity=
        "&[a-zA-Z][a-zA-Z0-9]+;";
    public final static Pattern htmlPattern=Pattern.compile(
        "("+tagStart+".*"+tagEnd+")|("+tagSelfClosing+")|("+htmlEntity+")",
        Pattern.DOTALL
    );
    /**
     * Will return true if s contains HTML markup tags or entities.
     *
     * @param s String to test
     * @return true if string contains HTML
     */
    public static boolean isHtml(String s) {
        boolean ret=false;
        if (s != null) {
            ret=htmlPattern.matcher(s).find();
        }
        return ret;
    }
}
You can use regular expressions to search for HTML tags.
I'm using regex:
[\S\s]*\<html[\S\s]*\>[\S\s]*\<\/html[\S\s]*\>[\S\s]*
So in Java it looks like:
text.matches("[\\S\\s]*\\<html[\\S\\s]*\\>[\\S\\s]*\\<\\/html[\\S\\s]*\\>[\\S\\s]*");
It should match any well-formed (as well as some malformed) XML document that contains an "html" element somewhere, so there may be false positives.
Edit:
Since posting this, I have removed the last part, which matches the closing html tag, because I found that some websites omit it. (?!) So if you prefer false positives to false negatives, I encourage you to do the same!
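For illustration, the relaxed check described in the edit (opening html tag only, no closing tag required) could be sketched as follows. This is only a sketch under a few assumptions of mine: the class and method names are hypothetical, it uses find() instead of padding the pattern with [\S\s]*, it ignores case, and it is slightly stricter than the original regex in that it requires whitespace or ">" right after "html".

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
final class HtmlRootTagCheck {
    // Opening <html> tag, optionally with attributes; no closing tag required.
    private static final Pattern OPENING_HTML_TAG =
            Pattern.compile("<html(\\s[^>]*)?>", Pattern.CASE_INSENSITIVE);

    // Returns true if the text contains an opening html tag anywhere.
    static boolean hasOpeningHtmlTag(String text) {
        return text != null && OPENING_HTML_TAG.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(hasOpeningHtmlTag("<html><body>no closing tag"));
        System.out.println(hasOpeningHtmlTag("just some text"));
    }
}
```

Because find() scans for the pattern anywhere in the input, the leading and trailing wildcard groups are not needed.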
In your backing bean, you can try to find HTML tags such as <b> or <i>, etc.
You can use regular expressions (slow) or just look for the "<" and ">" characters. It depends on how sure you want to be that the user used HTML or not.
Keep in mind that the user could write <asdf>. If you want to be 100% sure that the HTML used is valid, you will need to use a full HTML parser from some library (TidyHTML maybe?)
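The cheap check for "<" and ">" mentioned above might look like the sketch below. The class and method names are hypothetical, and the check is deliberately crude: it only asks whether a "<" is followed somewhere by a ">", so it also fires on input such as <asdf> or "1 < 2 and 3 > 2".

```java
// Hypothetical helper illustrating the cheap "<" / ">" check.
final class AngleBracketCheck {
    // Returns true if some '>' appears after the first '<' in the string.
    static boolean mayContainTag(String s) {
        if (s == null) {
            return false;
        }
        int open = s.indexOf('<');
        return open >= 0 && s.indexOf('>', open + 1) > open;
    }

    public static void main(String[] args) {
        System.out.println(mayContainTag("<b>bold</b>"));
        System.out.println(mayContainTag("a < b"));
    }
}
```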
If you don't want the user to have HTML in their input, you can replace all '<' characters with their HTML entity equivalent, '&lt;', and all '>' characters with '&gt;'.
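A minimal sketch of that replacement in plain Java follows. The class and method names are hypothetical. Note one addition of mine that the answer does not mention: the sketch also escapes '&' as '&amp;', so that entities already present in the input are shown literally instead of being interpreted.

```java
// Hypothetical helper; escapes the characters that can start or end markup.
final class HtmlEscaper {
    static String escapeHtml(String input) {
        if (input == null) {
            return null;
        }
        StringBuilder out = new StringBuilder(input.length() + 16);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break; // assumption: not in the original answer
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeHtml("<b>bold</b> & more"));
    }
}
```

Walking the string once avoids the ordering problem of chained replace() calls, where escaping '&' after '<' would corrupt the '&lt;' that was just inserted.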
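The pattern below matches an opening tag together with its matching closing tag. You can also extract the tag name, attributes and value from the capture groups.
import java.util.regex.Matcher;
import java.util.regex.Pattern;
// Group 1: tag name, group 2: attributes, groups 3 and 4: content between the tags
Pattern pattern = Pattern.compile("<(\\w+)( +.+)*>((.*))</\\1>");
Matcher matcher = pattern.matcher("<as testAttr='5'> TEST</as>");
if (matcher.find()) {
    // groupCount() excludes group 0, so use <= to include the last group
    for (int i = 0; i <= matcher.groupCount(); i++) {
        System.out.println(i + ":" + matcher.group(i));
    }
}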
Regular expressions alone can help here: use them to find potential HTML tags, then check whether the tag name is an actual HTML keyword. If one is found, show an alert telling the user not to use HTML, or simply delete the tag if you prefer.
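One way to read this suggestion is sketched below: a regular expression pulls out candidate tag names, and each name is compared against a list of HTML keywords. The class name, method name, regex, and the deliberately short keyword list are all my own assumptions for illustration; a real list would need many more element names.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: flags input only when a tag name is a known HTML keyword.
final class KnownTagCheck {
    // Assumption: a small sample of element names, not a complete list.
    private static final Set<String> KNOWN_TAGS = new HashSet<>(Arrays.asList(
            "a", "b", "i", "p", "br", "div", "span", "img", "ul", "ol", "li",
            "table", "tr", "td", "html", "body", "script"));

    // Captures the name of an opening, closing, or self-closing tag.
    private static final Pattern TAG_NAME =
            Pattern.compile("</?([A-Za-z][A-Za-z0-9]*)\\b[^>]*>");

    static boolean containsKnownHtmlTag(String s) {
        if (s == null) {
            return false;
        }
        Matcher m = TAG_NAME.matcher(s);
        while (m.find()) {
            if (KNOWN_TAGS.contains(m.group(1).toLowerCase(Locale.ROOT))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(containsKnownHtmlTag("<b>bold</b>"));
        System.out.println(containsKnownHtmlTag("<asdf>"));
    }
}
```

With this approach, input such as <asdf> is not reported, at the cost of maintaining the keyword list.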