
Understand this RegEx statement

I'm trying to understand this RegEx statement in detail. It's supposed to validate the filename from an ASP.NET FileUpload control so that only JPEG and GIF files are allowed. It was designed by somebody else and I do not completely understand it. It works fine in Internet Explorer 7.0 but not in Firefox 3.6.

<asp:RegularExpressionValidator id="FileUpLoadValidator" runat="server" 
     ErrorMessage="Upload Jpegs and Gifs only." 
     ValidationExpression="^(([a-zA-Z]:)|(\\{2}\w+)\$?)(\\(\w[\w].*))(.jpg|.JPG|.gif|.GIF)$"
     ControlToValidate="LogoFileUpload">
</asp:RegularExpressionValidator>


Here's a short explanation:

^               # match the beginning of the input
(               # start capture group 1
  (             #   start capture group 2
    [a-zA-Z]    #     match any character from the set {'A'..'Z', 'a'..'z'}
    :           #     match the character ':'
  )             #   end capture group 2
  |             #   OR
  (             #   start capture group 3
    \\{2}       #     match the character '\' and repeat it exactly 2 times
    \w+         #     match a word character: [a-zA-Z_0-9] and repeat it one or more times
  )             #   end capture group 3
  \$?           #   match the character '$' and match it once or none at all
)               # end capture group 1
(               # start capture group 4
  \\            #   match the character '\'
  (             #   start capture group 5
    \w          #     match a word character: [a-zA-Z_0-9] 
    [\w]        #     match any character from the set {'0'..'9', 'A'..'Z', '_', 'a'..'z'}
    .*          #     match any character except line breaks and repeat it zero or more times
  )             #   end capture group 5
)               # end capture group 4
(               # start capture group 6
  .             #   match any character except line breaks
  jpg           #   match the characters 'jpg'
  |             #   OR
  .             #   match any character except line breaks
  JPG           #   match the characters 'JPG'
  |             #   OR
  .             #   match any character except line breaks
  gif           #   match the characters 'gif'
  |             #   OR
  .             #   match any character except line breaks
  GIF           #   match the characters 'GIF'
)               # end capture group 6
$               # match the end of the input

EDIT

As some of the comments requested: the above was generated by a little tool I wrote. You can download it here: http://www.big-o.nl/apps/pcreparser/pcre/PCREParser.html (WARNING: heavily under development!)

EDIT 2

It will match strings like these:

x:\abc\def\ghi.JPG
c:\foo\bar.gif
\\foo$\baz.jpg

Here's what the groups 1, 4 and 6 match individually:

group 1 | group 4      | group 6
--------+--------------+--------
        |              |
 x:     | \abc\def\ghi | .JPG
        |              |
 c:     | \foo\bar     | .gif
        |              |
 \\foo$ | \baz         | .jpg
        |              |

Note that it also matches a string like c:\foo\bar@gif since the DOT matches any character (except line breaks). And it will reject a string like c:\foo\bar.Gif (capital G in gif).
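If you want to reproduce the table above yourself, here is a minimal sketch in Python. The helper name split_groups is made up for illustration, and Python's re module is assumed to treat this particular pattern the same way the .NET and JavaScript engines do (it uses no engine-specific syntax as far as I can tell).

```python
import re

# The pattern from the question, copied as-is (dots deliberately left unescaped).
PATTERN = re.compile(
    r"^(([a-zA-Z]:)|(\\{2}\w+)\$?)(\\(\w[\w].*))(.jpg|.JPG|.gif|.GIF)$"
)

def split_groups(path):
    """Return groups 1, 4 and 6 for a matching path, or None if it does not match."""
    m = PATTERN.match(path)
    return (m.group(1), m.group(4), m.group(6)) if m else None

samples = [
    r"x:\abc\def\ghi.JPG",
    r"c:\foo\bar.gif",
    r"\\foo$\baz.jpg",
    r"c:\foo\bar@gif",   # accepted: the unescaped dot matches '@'
    r"c:\foo\bar.Gif",   # rejected: mixed case is not listed
]
for sample in samples:
    print(sample, "->", split_groups(sample))
```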


This is a bad regex.

^(([a-zA-Z]:)|(\\{2}\w+)\$?)(\\(\w[\w].*))(.jpg|.JPG|.gif|.GIF)$

Let's do it part by part.

([a-zA-Z]:)

This requires the file path to start with a drive letter like C:, d:, etc.

(\\{2}\w+)\$?

\\{2} means the backslash repeated twice (note that \ must be escaped), followed by one or more word characters (\w+), and then optionally a dollar sign (\$?). This is the host part of a UNC path.

(([a-zA-Z]:)|(\\{2}\w+)\$?)

The | means "or". So the path starts with either a drive letter or a UNC host. Congratulations for kicking out non-Windows users.

(\\(\w[\w].*))

This should be the directory part of the path, but it actually matches 2 word characters followed by anything except newlines (.*), like \ab!@#*(#$*).

The proper regex for this part should be (?:\\\w+)+
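To see the difference between the two directory patterns in isolation, here is a small sketch in Python. The names LOOSE_DIR, STRICT_DIR and whole are made up for this example, and fullmatch is used only so that each fragment is tested on its own.

```python
import re

LOOSE_DIR = re.compile(r"(\\(\w[\w].*))")   # directory part from the original regex
STRICT_DIR = re.compile(r"(?:\\\w+)+")      # replacement suggested above

def whole(pattern, text):
    """True if the pattern matches the entire text."""
    return pattern.fullmatch(text) is not None

for fragment in [r"\foo\bar", r"\a", r"\ab!@#*(#$*)"]:
    print(fragment, whole(LOOSE_DIR, fragment), whole(STRICT_DIR, fragment))
```

Keep in mind that \w+ is strict in the other direction too: directory names containing spaces, dots or hyphens will not match it.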

(.jpg|.JPG|.gif|.GIF)$

This means the last 3 characters of the path must be jpg, JPG, gif or GIF. Note that . is not a dot, but matches anything except \n, so a filename like haha.abcgif or malicious.exe\0gif will pass.

The proper regex for this part should be \.(?:jpg|JPG|gif|GIF)$
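A quick way to check the effect of escaping the dot, sketched in Python (LOOSE_EXT, STRICT_EXT and ends_like_image are names invented for this example):

```python
import re

LOOSE_EXT = re.compile(r"(.jpg|.JPG|.gif|.GIF)$")    # original: '.' is a wildcard
STRICT_EXT = re.compile(r"\.(?:jpg|JPG|gif|GIF)$")   # suggested: '\.' is a literal dot

def ends_like_image(name, pattern):
    """True if the pattern finds an image-like suffix at the end of name."""
    return pattern.search(name) is not None

for name in ["photo.gif", "haha.abcgif", "malicious.exe\x00gif"]:
    print(repr(name), ends_like_image(name, LOOSE_EXT), ends_like_image(name, STRICT_EXT))
```

Only photo.gif should pass the strict version; the loose version lets all three through.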

Together,

^(([a-zA-Z]:)|(\\{2}\w+)\$?)(\\(\w[\w].*))(.jpg|.JPG|.gif|.GIF)$

will match

D:\foo.jpg
\\remote$\dummy\..\C:\Windows\System32\Logo.gif
C:\Windows\System32\cmd.exe;--gif

and will fail

/home/user/pictures/myself.jpg
C:\a.jpg
C:\d\e.jpg

The proper regex is /\.(?:jpg|gif)$/i, and you should also check on the server side whether the uploaded file really is an image.
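As a rough illustration of both halves of that advice, here is a sketch in Python rather than ASP.NET. The function names are hypothetical, and the content check only looks at the leading signature bytes (FF D8 FF for JPEG, GIF87a or GIF89a for GIF), so treat it as a first filter and not as full validation of the image.

```python
import re

# Case-insensitive extension check, equivalent to /\.(?:jpg|gif)$/i.
EXTENSION = re.compile(r"\.(?:jpg|gif)$", re.IGNORECASE)

def has_image_extension(filename):
    """True if the name ends in .jpg or .gif, in any letter case."""
    return EXTENSION.search(filename) is not None

def sniff_image_type(data):
    """Return 'jpeg', 'gif' or None based on the first bytes of the upload."""
    if data.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    if data.startswith((b"GIF87a", b"GIF89a")):
        return "gif"
    return None

print(has_image_extension("bar.Gif"))          # extension check ignores case
print(sniff_image_type(b"GIF89a\x01\x00"))     # looks like a GIF header
print(sniff_image_type(b"MZ\x90\x00"))         # an .exe renamed to .gif
```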


It splits a filename into four parts: drive letter, path, filename and extension.

Most probably IE uses backslashes while Firefox uses slashes. Try replacing the \\ parts with [\\/] so the expression accepts both slashes and backslashes.
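Here is what that substitution could look like, sketched in Python so it can be tried outside the browser. The name BOTH_SLASHES is made up, and the edited pattern is my reading of the suggestion, not something tested in either browser.

```python
import re

# Original pattern with every literal backslash replaced by the class [\\/].
BOTH_SLASHES = re.compile(
    r"^(([a-zA-Z]:)|([\\/]{2}\w+)\$?)([\\/](\w[\w].*))(.jpg|.JPG|.gif|.GIF)$"
)

def accepts(path):
    """True if the edited pattern matches the whole path."""
    return BOTH_SLASHES.match(path) is not None

for path in [r"c:\foo\bar.gif", "c:/foo/bar.gif",
             "/home/user/pictures/myself.jpg", "bar.gif"]:
    print(path, accepts(path))
```

Note that the edited pattern still demands a drive letter or a double-slash host prefix, so a Unix-style path or a bare filename with no path at all is still rejected. If the browser hands the validator only the filename (an assumption, I have not verified what Firefox 3.6 sends), changing the slashes alone will not help.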


This is what Expresso says about it:

///  A description of the regular expression:
///  
///  Beginning of line or string
///  [1]: A numbered capture group. [([a-zA-Z]:)|(\\{2}\w+)\$?]
///      Select from 2 alternatives
///          [2]: A numbered capture group. [[a-zA-Z]:]
///              [a-zA-Z]:
///                  Any character in this class: [a-zA-Z]
///                  :
///          (\\{2}\w+)\$?
///              [3]: A numbered capture group. [\\{2}\w+]
///                  \\{2}\w+
///                      Literal \, exactly 2 repetitions
///                      Alphanumeric, one or more repetitions
///              Literal $, zero or one repetitions
///  [4]: A numbered capture group. [\\(\w[\w].*)]
///      \\(\w[\w].*)
///          Literal \
///          [5]: A numbered capture group. [\w[\w].*]
///              \w[\w].*
///                  Alphanumeric
///                  Any character in this class: [\w]
///                  Any character, any number of repetitions
///  [6]: A numbered capture group. [.jpg|.JPG|.gif|.GIF]
///      Select from 4 alternatives
///          .jpg
///              Any character
///              jpg
///          .JPG
///              Any character
///              JPG
///          .gif
///              Any character
///              gif
///          .GIF
///              Any character
///              GIF
///  End of line or string
///  

Hope this helps, Best regards, Tom.


You may need to implement server-side validation. Check out this article.

Solving the Challenges of ASP.NET Validation

Also, there are some good online tools for creating or interpreting regular expressions, but I suspect that the issue isn't with the expression.
