Why does matching this regular expression indicate safety?
I have a JSP fix for XSS attacks that checks whether the content matches a regular expression to decide whether it is safe. Here is the code:
String contents = bodyContent.getString();
String regExp = "^\\w{5,25}$";
// Do a regex to find the good stuff
if (contents.matches(regExp)) {
//write the original content
}else{
//change content to make it safe and write it
}
My question is about the regular expression "^\w{5,25}$", which you can see visualized here. Why does matching this regular expression indicate safety?
If the regular expression was:
^\w{5,25}$
then this would limit the string to letters, numbers and underscores, i.e. no spaces or other punctuation. This means it is very unlikely to be a nefarious script, as that would almost certainly include spaces, semicolons or angle brackets.
That railroad diagram is incorrect: "\w" is a special regex escape that matches so-called word characters. These are A-Z, a-z, 0-9 and the underscore.
Input matching this is usually considered safe since it cannot include any of the commonly used special or escape characters, but it is by no means a guarantee.
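To make the point above concrete, here is a minimal Java sketch of what the pattern from the question accepts and rejects. The class name `WordCharCheck`, the method `looksSafe` and the sample strings are made up for illustration; only the pattern itself comes from the question.

```java
import java.util.regex.Pattern;

public class WordCharCheck {
    // Same pattern as in the question. The class and method names are hypothetical.
    private static final Pattern SAFE = Pattern.compile("^\\w{5,25}$");

    // Returns true only if the whole string is 5 to 25 word characters.
    public static boolean looksSafe(String contents) {
        return SAFE.matcher(contents).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "hello_world1",               // word characters only, length 12
            "abcd",                       // too short (4 characters)
            "hello world",                // contains a space
            "<script>alert(1)</script>",  // contains <, >, (, ), /
            "abcdefghijklmnopqrstuvwxyz"  // too long (26 characters)
        };
        for (String s : samples) {
            System.out.println(s + " -> " + looksSafe(s));
        }
    }
}
```

Only the first sample passes. Note that this shows what the check filters out, not that everything it lets through is harmless in every context.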
Apart from the concrete question, which has already been answered by others, that's a plainly wrong way to protect your JSPs from XSS attacks. You should just be using the JSTL <c:out> tag or the fn:escapeXml() function to redisplay user-controlled data.
E.g.
<c:out value="${header['user-agent']}" />
or
<input type="text" name="foo" value="${fn:escapeXml(param.foo)}" />
This way HTML/XML special characters like <, > and so on won't be interpreted as markup (which would cause a potential XSS hole), but will be escaped so that they just get displayed as-is.
Behind the scenes this is just done by a literal char-by-char match and replace. All < are replaced by &lt;, all > are replaced by &gt;, all " are replaced by &quot; and so on. This does not, and should not, involve regex.
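As a rough illustration of that char-by-char replacement, here is a minimal Java sketch. It is not the JSTL implementation: the class `HtmlEscapeSketch`, the method `escapeHtml` and the exact entity spellings are assumptions for the example, and in a real JSP you would use <c:out> or fn:escapeXml() as described above.

```java
public class HtmlEscapeSketch {
    // Hypothetical helper: walks the input once and swaps each special
    // character for an entity. No regex is involved.
    public static String escapeHtml(String input) {
        StringBuilder out = new StringBuilder(input.length() + 16);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '&':  out.append("&amp;");  break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeHtml("<script>alert(\"x\")</script>"));
    }
}
```

With escaping like this the browser shows the text of the script tag instead of executing it, so the input does not have to be rejected or restricted to word characters.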
You're matching a number of "word" characters, anchored to start and end of string. So we know there's no punctuation other than _ in that set.
Anything matching this set is deemed safe; I guess the authors assume that nothing evil can be done with such a string.
I can't understand why fewer than 5 characters is deemed unsafe.
Nor do I see why, if a string of 25 such characters is safe, 26 is not.
Your regex validates that the string contains only characters from the "word" character class, [a-zA-Z0-9_]. So it is just validating that there is no punctuation or other special characters in the string. It also validates the length, from 5 to 25 characters.
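If you want to check the character class yourself, a small probe like the following lists every ASCII character that \w accepts under Java's default regex flags. The class `WordClassProbe` and its method are hypothetical names for this sketch.

```java
public class WordClassProbe {
    // Hypothetical helper: collects every ASCII character that \w matches
    // with Java's default (non-Unicode) regex flags.
    public static String asciiWordChars() {
        StringBuilder accepted = new StringBuilder();
        for (char c = 0; c < 128; c++) {
            if (String.valueOf(c).matches("\\w")) {
                accepted.append(c);
            }
        }
        return accepted.toString();
    }

    public static void main(String[] args) {
        // Prints the digits, the uppercase letters, the underscore and the lowercase letters.
        System.out.println(asciiWordChars());
    }
}
```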
An XSS attack commonly relies on a <script>...</script> block getting inserted into the database, which obviously contains a few special characters [<>/].
The only reason I can think of why less than five characters would be "unsafe" is that if it was being used for a search query, 1 to 4 characters might return an excessive number of results. Many database-driven search functions require a minimum of 3-5 characters to avoid huge numbers of hits. Will this string be used for any sort of string matching?