Why does this regex check return true for this string?
I need a regex that will determine if a string is a tweet URL. I've got this
Regexp.new(/http:|https:\/\/(twitter\.com\/.*\/status\/.*|twitter\.com\/.*\/statuses\/.*|www\.twitter\.com\/.*\/status\/.*|www\.twitter\.com\/.*\/statuses\/.*|mobile\.twitter\.com\/.*\/status\/.*|mobile\.twitter\.com\/.*\/statuses\/.*)/i)
Why does it return true for the following?
"http://i.stack.imgur.com/QdOS0.jpg".match(Regexp.new(/http:|https:\/\/(twitter\.com\/.*\/status\/.*|twitter\.com\/.*\/statuses\/.*|www\.twitter\.com\/.*\/status\/.*|www\.twitter\.com\/.*\/statuses\/.*|mobile\.twitter\.com\/.*\/status\/.*|mobile\.twitter\.com\/.*\/statuses\/.*)/i))? true : false
=> true
`http:` on its own is a complete alternative in your pattern, so it will always match a URL starting with `http:`.
Try the following:
/https?:\/\/(twitter\.com\/.*\/status\/.*|twitter\.com\/.*\/statuses\/.*|www\.twitter\.com\/.*\/status\/.*|www\.twitter\.com\/.*\/statuses\/.*|mobile\.twitter\.com\/.*\/status\/.*|mobile\.twitter\.com\/.*\/statuses\/.*)/i
The question mark makes the `s` optional, thus matching `http` or `https`.
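As a quick check, the corrected pattern can be run against the image URL from the question and a couple of tweet-style URLs. This is only a sketch: the constant name `TWEET_URL` and the two tweet URLs are made up for illustration.

```ruby
# The corrected pattern from above, unchanged apart from being given a name.
TWEET_URL = /https?:\/\/(twitter\.com\/.*\/status\/.*|twitter\.com\/.*\/statuses\/.*|www\.twitter\.com\/.*\/status\/.*|www\.twitter\.com\/.*\/statuses\/.*|mobile\.twitter\.com\/.*\/status\/.*|mobile\.twitter\.com\/.*\/statuses\/.*)/i

# Sample inputs; the first is from the question, the tweet URLs are hypothetical.
samples = [
  "http://i.stack.imgur.com/QdOS0.jpg",
  "http://twitter.com/someuser/status/12345",
  "https://mobile.twitter.com/someuser/statuses/12345",
]

# match? returns a plain boolean, so no "? true : false" is needed.
results = samples.map { |url| TWEET_URL.match?(url) }
p results  # [false, true, true]
```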
Your regex could be abbreviated like this:
#^https?://(?:www\.|mobile\.)?twitter\.com/.*?/status(?:es)?/.*$#i
Explanation:
#                  regex delimiter
^                  start of line
https?             http or https
://                ://
(?:                start of non-capturing group
www\.|mobile\.     www. or mobile.
)?                 end of group, which is optional
twitter\.com/      twitter.com/
.*?                any number of any char, not greedy
/status            /status
(?:es)?            non-capturing group that optionally matches `es`
/.*                / followed by any number of any char
$                  end of line
#i                 delimiter and case-insensitive flag
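The `#...#i` delimiter form is not Ruby literal syntax, so in Ruby the same pattern would need a `%r{...}` or `/.../` literal. Below is a rough Ruby sketch of it in extended mode (`x`), which allows a comment on each piece. The constant name `TWEET_URL_SHORT` is made up, and `\A`/`\z` are used in place of `^`/`$` on the assumption that the whole string should be a single URL.

```ruby
# Ruby rendering of the abbreviated pattern. In extended mode (x) whitespace
# is ignored and "#" starts a comment that runs to the end of the line.
TWEET_URL_SHORT = %r{
  \A https?://              # http:// or https:// at the start of the string
  (?:www\.|mobile\.)?       # optional www. or mobile. prefix
  twitter\.com/             # the host, followed by a slash
  .*?                       # any characters, non-greedy (the user name)
  /status(?:es)?/           # /status/ or /statuses/
  .*                        # whatever follows (the tweet id)
  \z                        # end of the string
}xi

p TWEET_URL_SHORT.match?("https://mobile.twitter.com/someuser/statuses/12345")  # true
p TWEET_URL_SHORT.match?("http://i.stack.imgur.com/QdOS0.jpg")                  # false
```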
No need for regular expressions here (as usual).
require 'uri'
uri = URI.parse("http://www.twitter.com/status/12345")
p uri.host.split('.')[-2] == 'twitter' # prints true
More docs at: http://ruby-doc.org/stdlib/
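The check above only looks at the host. A slightly fuller sketch along the same lines might also check the scheme and the path. The helper name `tweet_url?` and the `TWITTER_HOSTS` list are assumptions made for this example (the hosts are the ones from the question's pattern), not part of any library.

```ruby
require 'uri'

# Hosts treated as Twitter; taken from the pattern in the question.
TWITTER_HOSTS = %w[twitter.com www.twitter.com mobile.twitter.com].freeze

# Hypothetical helper: true when the string is an http(s) URL on a Twitter
# host whose path has a "status" or "statuses" segment followed by more.
def tweet_url?(string)
  uri = URI.parse(string)
  return false unless uri.is_a?(URI::HTTP) # URI::HTTPS is a subclass of URI::HTTP
  return false unless uri.host && TWITTER_HOSTS.include?(uri.host.downcase)

  segments = uri.path.split('/').reject(&:empty?)
  index = segments.index { |s| %w[status statuses].include?(s) }
  !index.nil? && index < segments.length - 1
rescue URI::InvalidURIError
  false # strings that do not parse as a URI are not tweet URLs
end

p tweet_url?("http://www.twitter.com/status/12345")  # true
p tweet_url?("http://i.stack.imgur.com/QdOS0.jpg")   # false
```

Unlike the regex in the question, this accepts a `status` segment directly after the host, as in the URL used in the answer above.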
You should group your OR-Clauses, like this:
(http:|https:)
Additionally, it wouldn't hurt to anchor the beginning and end of the pattern:
^(http:|https:).*$
The start of your regex specifies an option of just 'http:', which naturally matches the URL you are testing. Depending on how strict you need your check to be, you could just remove the http/https parts from the start of the regex.
While many other answers show you a better regex, the reason is that `/foo|bar/` will match either `foo` or `bar`, and what you wrote was `/http:|.../`, hence every URL containing `http:` will be matched.
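To see which alternative actually matched, you can look at the matched substring. The sketch below uses a shortened pattern with a single host, trimmed from the question's regex for brevity; the variable names are made up.

```ruby
url = "http://i.stack.imgur.com/QdOS0.jpg"

# Ungrouped: the two alternatives are "http:" and "https://twitter.com/".
ungrouped = /http:|https:\/\/twitter\.com\//i
# Grouped: the alternation covers only the scheme; the host is always required.
grouped = /(?:http:|https:)\/\/twitter\.com\//i

matched_text = url[ungrouped]  # String#[] returns the matched substring
grouped_match = url[grouped]   # or nil when the pattern does not match

p matched_text   # "http:"
p grouped_match  # nil
```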
See @giraff's answer for how you could have written the alternation to do what you expect, or @M42's or @Koraktor's answers for a better regexp.
And as posted in the comments, note that you can write a regex literal as `%r{...}` instead of `/.../`, which is nice when you want to use `/` characters in your regex without escaping them.