
MFC: How do I construct a good regular expression that validates URLs?

Here's the regular expression I use; I parse it with CAtlRegExp:

(((h|H?)(t|T?)(t|T?)(p|P?)(s|S?))://)?([a-zA-Z0-9]+[\.]+[a-zA-Z0-9]+[\.]+[a-zA-Z0-9])

It works fine except for one flaw: when the URL is preceded by other characters, the string is still accepted as a URL.

Example inputs:

  • this is a link www.google.com (here I can just split on spaces and validate each word)

  • is...www.google.com (this string still matches the regex above :( )

Please help. Thanks.


  1. Use the IgnoreCase flag instead of catering for each case.
  2. Stick a ^ at the beginning if you want the start of the string to be the start of the URL
  3. You're missing a lot of characters from possible, valid URLs.
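To illustrate the first two points, here is a minimal sketch. It uses the portable std::regex rather than CAtlRegExp (whose syntax differs), and looksLikeUrl is a made-up name. The pattern is deliberately simplified and still ignores ports, paths and query strings, so treat it as an illustration of the case-insensitive flag and the anchors, not as a complete validator.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if the whole of `text` looks like
// [http(s)://]label.label[.label...]. Not a full URL validator.
bool looksLikeUrl(const std::string& text) {
    // icase replaces the (h|H?)(t|T?)... alternations from the question;
    // ^ and $ force the match to cover the entire string.
    static const std::regex pattern(
        R"(^(https?://)?[a-z0-9-]+(\.[a-z0-9-]+)+$)",
        std::regex::ECMAScript | std::regex::icase);
    return std::regex_search(text, pattern);
}
```

With this sketch, www.google.com and HTTP://www.google.com are accepted, while is...www.google.com is rejected because each dot must be followed directly by a label.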


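You need to tell the regex to match only at the start and end of the string. I'm not sure how you do that in VC++, but in most regex flavors you enclose the pattern in ^ and $. The ^ means "the start of the string" and the $ means "the end of the string."

^(((h|H?)(t|T?)(t|T?)(p|P?)(s|S?))\://)?([a-zA-Z0-9]+[\.]+[a-zA-Z0-9]+[\.]+[a-zA-Z0-9]+)$

Note the + added after the last character class: without it, the anchored pattern allows only a single character after the final dot and would reject www.google.com.

The second input matches because, without anchors, the regex only has to find a URL somewhere inside the string. Also note that [\.]+ accepts a run of dots, so a string such as is...www.google would still pass even when anchored; use \. if exactly one dot is intended.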
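The difference between "contains a URL" and "is a URL" can be sketched with std::regex, where regex_search looks for a match anywhere and regex_match behaves as if the pattern were wrapped in ^ and $. This is portable C++ rather than CAtlRegExp, the function names are invented for the example, and the pattern is the one from the question with a + on the last character class and the redundant escapes dropped.

```cpp
#include <regex>
#include <string>

// The question's pattern, unanchored (assumption: final class given a '+').
const std::regex& urlPattern() {
    static const std::regex pattern(
        R"((((h|H?)(t|T?)(t|T?)(p|P?)(s|S?))://)?([a-zA-Z0-9]+[.]+[a-zA-Z0-9]+[.]+[a-zA-Z0-9]+))");
    return pattern;
}

// True if a URL-like run occurs anywhere inside the text.
bool containsUrl(const std::string& text) {
    return std::regex_search(text, urlPattern());
}

// True only if the pattern consumes the entire text (implicit ^...$).
bool isEntirelyUrl(const std::string& text) {
    return std::regex_match(text, urlPattern());
}
```

For is...www.google.com, containsUrl returns true and isEntirelyUrl returns false, which is the behaviour the anchors are meant to give.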


How about using CUrl (that is, "C-Url" in ATL, not curl as in libcurl), which can parse URLs with CUrl::CrackUrl? If that function returns FALSE, you can assume the input is not a valid URL.

That said, decomposing a URL is complex enough to warrant a proper parser rather than a regex-based decomposition. See RFC 2396 (since superseded by RFC 3986) for an overview of the complexities.
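To give a feel for the parser approach, here is a rough, portable sketch that splits a URL into components by hand and returns false on malformed input. It is not ATL's CUrl and does not follow the RFC in full: UrlParts and splitUrl are hypothetical names, the scheme is restricted to letters, and userinfo, IPv6 literals and percent-encoding are not handled.

```cpp
#include <cctype>
#include <string>

// Hypothetical result type; the field names are illustrative only.
struct UrlParts {
    std::string scheme;
    std::string host;
    std::string port;  // empty when the URL has no explicit port
    std::string rest;  // path, query and fragment, left unparsed
};

// Splits "scheme://host[:port][rest]". Returns false (and leaves `out`
// untouched) if a required piece is missing or contains unexpected characters.
bool splitUrl(const std::string& url, UrlParts& out) {
    const auto npos = std::string::npos;

    const auto schemeEnd = url.find("://");
    if (schemeEnd == npos || schemeEnd == 0) return false;

    UrlParts parts;
    parts.scheme = url.substr(0, schemeEnd);
    for (unsigned char c : parts.scheme)
        if (!std::isalpha(c)) return false;

    // The authority (host and optional port) runs up to the first / ? or #.
    const auto hostStart = schemeEnd + 3;
    const auto hostEnd = url.find_first_of("/?#", hostStart);
    std::string authority =
        url.substr(hostStart, hostEnd == npos ? npos : hostEnd - hostStart);
    if (hostEnd != npos) parts.rest = url.substr(hostEnd);

    // An explicit port, if present, must be all digits.
    const auto colon = authority.rfind(':');
    if (colon != npos) {
        parts.port = authority.substr(colon + 1);
        authority.erase(colon);
        if (parts.port.empty()) return false;
        for (unsigned char c : parts.port)
            if (!std::isdigit(c)) return false;
    }

    if (authority.empty()) return false;
    for (unsigned char c : authority)
        if (!std::isalnum(c) && c != '-' && c != '.') return false;

    parts.host = authority;
    out = parts;
    return true;
}
```

Returning a bool with an out-parameter mirrors the pass/fail style described above for CrackUrl; each component can then be validated separately instead of in one large pattern.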


Start the regex with ^ and end it with $ so that it matches only if the entire string matches (if that's what you want). The last character class also needs a + so that more than one character can follow the final dot:

^(((h|H?)(t|T?)(t|T?)(p|P?)(s|S?))\://)?([a-zA-Z0-9]+[\.]+[a-zA-Z0-9]+[\.]+[a-zA-Z0-9]+)$


What about this one?

(((f|ht)tp://)[-a-zA-Z0-9@:%_\+.~#?&//=]+)

Note that it accepts ftp:// and http:// but not https://.


This regular expression has been tested against URLs of the following form:

(http|https)://host[:port]/[?][parameter=value]*

public static final String URL_PATTERN = "(https?|ftp)://(www\\.)?(((([a-zA-Z0-9.-]+\\.){1,}[a-zA-Z]{2,4}|localhost))|((\\d{1,3}\\.){3}(\\d{1,3})))(:(\\d+))?(/([a-zA-Z0-9-._~!$&'()*+,;=:@/]|%[0-9A-F]{2})*)?(\\?([a-zA-Z0-9-._~!$&'()*+,;=:/?@]|%[0-9A-F]{2})*)?(#([a-zA-Z0-9._-]|%[0-9A-F]{2})*)?";

P.S. It also accepts localhost URLs.

(Thoroughly written by me :-))
