How to retrieve the `scheme://domain` part of a URL without including subdomains?
I am using Ruby on Rails 3.0.10 and I would like to retrieve the scheme://domain
part of a URL without including the subdomain part. That is, if I have the following URL
http://www.sub_domain.domain.com
I would like to retrieve
http://www.domain.com
How can I do that (should I use a regex?)?
UPDATE
@mu is too short rightly pointed out in a comment (which made me think):
You misunderstand. www.ac.uk is meaningless, the base domain for Oxford is ox.ac.uk; the ac.uk part means "academic UK" and is, semantically, one component. A few other countries have similar naming schemes.
So, the updated question is:
How can I iterate over a URL (for example http://www.maths.ox.ac.uk/), as in the following steps, so as to progressively delete subdomain parts until only the last one remains?
http://www.maths.ox.ac.uk/ # Step 0 (start)
http://www.ox.ac.uk/ # Step 1
http://www.ac.uk/ # Step 2 (end)
This is a total hack, and I have no idea how it could be useful in the generic sense, but here you go.
ruby-1.8.7-p352 > uri = URI.parse("http://www.foo.domain.com/")
=> #<URI::HTTP:0x105011840 URL:http://www.foo.domain.com/>
ruby-1.8.7-p352 > uri.scheme + "://" + uri.host.split(/\./)[-2..-1].join(".")
=> "http://domain.com"
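The same split-based idea could be stretched to cover the updated question, dropping one label at a time. This is only a rough sketch: `strip_subdomains` is a hypothetical helper name, and it assumes that a leading `www` should be kept and that stripping stops once two labels remain after it.

```ruby
require 'uri'

# Hypothetical helper (not part of URI): returns the successive URLs obtained
# by dropping the left-most label after an optional leading "www" until only
# two labels are left.
def strip_subdomains(url)
  uri = URI.parse(url)
  labels = uri.host.split('.')
  # Keep a leading "www" aside so it can be put back on every step.
  prefix = labels.first == 'www' ? [labels.shift] : []
  steps = []
  loop do
    steps << "#{uri.scheme}://#{(prefix + labels).join('.')}/"
    break if labels.length <= 2
    labels.shift
  end
  steps
end

puts strip_subdomains('http://www.maths.ox.ac.uk/')
# http://www.maths.ox.ac.uk/
# http://www.ox.ac.uk/
# http://www.ac.uk/
```

Like the one-liner above, this knows nothing about which labels form a real registered domain, so the last step can easily be a meaningless name.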
If you know that the URL ends in .com and follows the format you specified, you could try a regular expression like this:
\.[\w\-]+\.com
to parse out the domain and the following .com. Prefix that with http://www and you should be all set.
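In Ruby, that could look roughly like the sketch below. `base_com_url` is a hypothetical helper name. The pattern is applied to the host only and anchored with `\z`, which is an adjustment to the regex above: without the anchor, a host such as www.foo.company.com would match `.foo.com`.

```ruby
require 'uri'

# Hypothetical helper: returns "http://www.<domain>.com" for hosts ending in
# .com, or nil for anything else.
def base_com_url(url)
  host = URI.parse(url).host
  # One label followed by ".com" at the very end of the host.
  match = /\.[\w\-]+\.com\z/.match(host)
  match && "http://www#{match[0]}"
end

puts base_com_url('http://www.sub-domain.domain.com/')
# http://www.domain.com
```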
There is no "general case" solution for this. Some URLs use a suffix with one dot (.com or .edu), while some use multiple dots (.co.jp, etc). You won't be able to solve this with something as simple as a regex.
What you may be able to do is to make a list of possible URL suffixes and construct a regex for each. If it matches your input string, use a variation of the above:
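# Assumes list_of_suffixes holds literal suffixes such as '.co.jp' or '.com',
# longest first so that '.co.jp' is tried before '.jp'.
def base_url(url, list_of_suffixes)
  base_regex = '\.[\w\-]+'
  list_of_suffixes.each { |s|
    # Escape the suffix and require the host to end right after it
    thisregex = Regexp.new(base_regex + Regexp.escape(s) + '(?=[/:?#]|\z)')
    match = thisregex.match(url)
    next if match.nil?
    # match[0] already starts with a dot
    return 'http://www' + match[0]
  }
  nil
end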
Note: code is off the top of my head and for illustration purposes only (it probably won't run exactly as-is, but you get the point)
The right way to deal with this is to use URI:
require 'uri'
# Parse and remove all the stuff you don't want.
u = URI.parse('http://www.sub-domain.domain.com/pancakes')
u.userinfo = nil
u.path = ''
u.fragment = nil
# You might want to check u.scheme as well
host = u.host
And now you have to figure out what you want to do with host. You could start at the last component and work your way backwards until you get a domain name that resolves to something using Net::DNS:
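require 'net/dns/resolver' # from the net-dns gem
# Candidates are built from the right: last two labels, then three, and so on.
components = host.split('.')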
basename = (1 .. components.length).
map { |i| components.last(i + 1).join('.') }.
find { |n| Resolver(n).answer.length > 0 }
# basename is now nil or something with a DNS A record
if(basename.nil?)
# complain and bail out
end
u.host = basename
# Your trimmed URL is in u.to_s
You have to check that the domain names resolve to something useful or you won't know if you have a valid one. You could try to track down all the various naming rules all over the world instead but there's no point.
This still won't guarantee that you have a useful URL; to be sure, you'd have to check whether the name you end up with responds to an HTTP request.
To answer your original question:
should I use a regex?
Absolutely not. URLs are a lot more complicated than most people think so you should use a real URL parser such as URI. Furthermore, domain names are also more complicated than most people think so you have to resort to DNS lookups to get anything sensible.