CPP + Regular Expression to Validate URL
I want to build a regular expression in C++ (MFC) that validates a URL.
The regular expression must satisfy the following conditions.
Valid URLs:
http://cu-241.dell-tech.co.in/MyWebSite/ISAPIWEBSITE/Denypage.aspx/
http://www.google.com
http://www.google.co.in
Invalid URLs:
http://cu-241.dell-tech.co.in/\MyWebSite/\ISAPIWEBSITE/\Denypage.aspx/ = The regex must reject this URL because of the '\' characters in "/\MyWebSite/\ISAPIWEBSITE/\Denypage.aspx/".
http://cu-241.dell-tech.co.in//////MyWebSite/ISAPIWEBSITE/Denypage.aspx/ = The regex must reject this URL because of the multiple consecutive "/" characters.
http://news.google.co.in/%5Cnwshp?hl=en&tab=wn = The regex must reject this URL because of the inserted %5C and %2F characters.
How can we develop a generic regular expression that satisfies the conditions above? Please help us by providing a regular expression that handles these scenarios in C++ (MFC).
Have you tried using the RFC 3986 suggestion (Appendix B)? If you can use GCC 4.9 or later, you can go directly with <regex>.
It states that with ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
you can get the following submatches:
scheme = $2
authority = $4
path = $5
query = $7
fragment = $9
For example:
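#include <cstdlib>
#include <iostream>
#include <regex>
#include <string>

int main(int argc, char *argv[])
{
    if (argc < 2) {
        std::cerr << "Usage: " << argv[0] << " <url>" << std::endl;
        return EXIT_FAILURE;
    }
    std::string url (argv[1]);
    unsigned counter = 0;
    // RFC 3986 expression, compiled with the default ECMAScript grammar.
    std::regex url_regex (
        R"(^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?)"
    );
    std::smatch url_match_result;
    std::cout << "Checking: " << url << std::endl;
    if (std::regex_match(url, url_match_result, url_regex)) {
        // Index 0 is the whole match; indices 1 to 9 are the capture groups.
        for (const auto& res : url_match_result) {
            std::cout << counter++ << ": " << res << std::endl;
        }
    } else {
        std::cerr << "Malformed url." << std::endl;
        return EXIT_FAILURE;
    }
    return EXIT_SUCCESS;
}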
Then:
./url-matcher http://localhost.com/path\?hue\=br\#cool
Checking: http://localhost.com/path?hue=br#cool
0: http://localhost.com/path?hue=br#cool
1: http:
2: http
3: //localhost.com
4: localhost.com
5: /path
6: ?hue=br
7: hue=br
8: #cool
9: cool
Look at http://gskinner.com/RegExr/. There is a community tab on the right where you can find contributed regexes. There is a URI category; I'm not sure you'll find exactly what you need, but it is a good start.
With the following regex you can filter out most of the incorrect URLs:
#include <regex>
#include <string>
#include <Poco/Exception.h>

int main(int argc, char* argv[]) {
    if (argc < 2) {
        return 1;  // no URL given on the command line
    }
    const std::string url(argv[1]);
    const std::regex urlRegex(R"(^https?://[0-9a-z\.-]+(:[1-9][0-9]*)?(/[^\s]*)*$)");
    if (!std::regex_match(url, urlRegex)) {
        throw Poco::InvalidArgumentException(
            "Malformed URL: \"" + url + "\". "
            "The URL must start with http:// or https://, "
            "the domain name should only contain lowercase alphanumeric characters, '.' and '-', "
            "the port should not start with 0, "
            "and the URL should not contain any whitespace.");
    }
    return 0;
}
It checks that the URL starts with http:// or https://, that the domain name contains only lowercase alphanumeric characters, '.' and '-', and that the port does not start with 0 (e.g. 0123). It allows any port number and any path/query string that does not contain whitespace.
But to be absolutely sure that the URL is valid, you're probably better off parsing the URL. I would not recommend trying to cover all scenarios with regex (including the correctness of paths, queries, fragments), because it would be pretty difficult.