Recognize "Invalid" URLs without Trying to Resolve Them
I'm building a Facebook app which grabs the URLs from various sources in a user's Facebook account, e.g., a user's likes.
A problem I've encountered is that many Facebook entries have strings that are not URLs in their "website" and "link" fields. Facebook does no checking on user input, so these fields can essentially contain any string.
I want to be able to process the strings in these fields such that URLs like "http://google.com", "https://www.bankofamerica.com", "http://www.nytimes.com/2011/06/13/us/13fbi.html?_r=1&hp", "bit.ly", and "www.pbs.org" are all accepted.
And all the strings like "here is a random string of text the user entered" and "here'\s ano!!! #%#$^ther weird random string" are all rejected.
It seems to me the only way to be "sure" of a URL is to attempt to resolve it, but I believe that will be prohibitively resource intensive.
Can anyone think of a clever way to regex or otherwise analyze these strings such that "a lot" of the URLs are properly captured: 80%? 95%? 99.995% of URLs?
Thanks!
EDIT: FYI, I'm developing in Python. But a language agnostic solution is great as well.
There are numerous tools for validating URLs, depending on your development language. Assuming you are developing in JavaScript, a quick Google search unearths many approaches, depending on the level of robustness you need.
See http://www.w3.org/Addressing/URL/url-spec.txt for the authoritative specification.
I'd first match for "^(?:https?://)?([A-Za-z0-9-\.]+)/" and then do a (cached) DNS lookup for that hostname, if you want to make sure that the hostname isn't misspelled. The 95% technique uses a whitelist of top-level domains (or some regular expression for them), which you'd have to maintain when new ones (.info, .eu, .biz, .aero) become available.
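As a rough illustration of this approach in Python, here is a minimal sketch, not a definitive validator. It makes a few assumptions that are not in the answer above: the trailing slash after the hostname is made optional (as written, the pattern requires a "/" after the hostname, which inputs like "bit.ly" do not have), the names looks_like_url and hostname_resolves are hypothetical, and KNOWN_TLDS is a deliberately tiny example whitelist that you would need to fill in and maintain yourself.

```python
import re
import socket
from functools import lru_cache

# Optional scheme, then a captured hostname, then either the end of the
# string or the start of a path, query, or fragment.
# Assumption: unlike the pattern quoted above, no trailing "/" is required.
HOST_RE = re.compile(r"^(?:https?://)?([A-Za-z0-9.-]+)(?:[/?#]|$)", re.IGNORECASE)

# Hypothetical and incomplete whitelist of top-level domains.
KNOWN_TLDS = {"com", "org", "net", "edu", "gov", "ly", "info", "eu", "biz", "aero"}


def looks_like_url(text):
    """Cheap syntactic check: plausible hostname ending in a known TLD."""
    match = HOST_RE.match(text.strip())
    if match is None:
        return False
    labels = match.group(1).lower().split(".")
    # Need at least "name.tld", and no empty labels such as in "a..com".
    if len(labels) < 2 or not all(labels):
        return False
    return labels[-1] in KNOWN_TLDS


@lru_cache(maxsize=4096)
def hostname_resolves(hostname):
    """Optional second stage: DNS lookup, cached per hostname."""
    try:
        socket.gethostbyname(hostname)
        return True
    except OSError:
        return False
```

With this sketch, all five example URLs from the question pass looks_like_url and both random strings fail, without any network access. The DNS stage only needs to run on strings that survive the syntactic check, which keeps the number of lookups small. Hostnames with ports or user info (e.g. "example.com:8080") are not handled here.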
There are also certain characters that are not allowed (unescaped) in URLs - however, some people do enter URLs like "http://example.com/I don't wanna go!!!" and their browser then escapes it to the valid "...I%20don%27t%20wanna%20go%21%21%21".
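If you want to accept such input rather than reject it, one option is to escape the path yourself before validating or storing it. The following is a minimal sketch using Python's standard urllib.parse module; the function name escape_url_path is hypothetical, and only the path component is escaped (the query and fragment are passed through unchanged, which is an assumption you may want to revisit).

```python
from urllib.parse import quote, urlsplit, urlunsplit


def escape_url_path(url):
    """Percent-encode unsafe characters in the path of a URL."""
    parts = urlsplit(url)
    # Keep "/" as the path separator and "%" so that sequences that are
    # already escaped are not encoded a second time.
    escaped_path = quote(parts.path, safe="/%")
    return urlunsplit(
        (parts.scheme, parts.netloc, escaped_path, parts.query, parts.fragment)
    )


print(escape_url_path("http://example.com/I don't wanna go!!!"))
# prints http://example.com/I%20don%27t%20wanna%20go%21%21%21
```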