How to retrieve the `scheme://domain` part of a URL without including subdomains?
I am using Ruby on Rails 3.0.10 and I would like to retrieve the scheme://domain part of a URL without including the subdomain part. That is, if I have the following URL
http://www.sub_domain.domain.com
I would like to retrieve
http://www.domain.com
How can I do that (should I use a regex?)?
UPDATE
@mu is too short rightly said in his/her comment (which made me think...):
You misunderstand. www.ac.uk is meaningless, the base domain for Oxford is ox.ac.uk; the ac.uk part means "academic UK" and is, semantically, one component. A few other countries have similar naming schemes.
So, the updated question is:
How can I iterate over a URL (for example http://www.maths.ox.ac.uk/), as in the following steps, so as to progressively delete subdomain parts until the last one?
http://www.maths.ox.ac.uk/ # Step 0 (start)
http://www.ox.ac.uk/       # Step 1
http://www.ac.uk/          # Step 2 (end)
This is a total hack, and I have no idea how it could be useful in the generic sense, but here you go.
ruby-1.8.7-p352 >   uri = URI.parse("http://www.foo.domain.com/")
 => #<URI::HTTP:0x105011840 URL:http://www.foo.domain.com/> 
ruby-1.8.7-p352 > uri.scheme + "://" + uri.host.split(/\./)[-2..-1].join(".")
 => "http://domain.com" 
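For the stepwise removal asked about in the update, the same split-on-dots idea could be extended roughly as follows. This is only a sketch: strip_subdomains_stepwise and its min_labels parameter are names made up for illustration. It assumes the first label (www here) should be kept and that three labels is the place to stop, which will not hold for every domain.

```ruby
require 'uri'

# Hypothetical helper: collects every intermediate URL obtained by removing,
# one at a time, the label that follows the first one (e.g. "www").
# min_labels is an assumed stopping point, not a general rule.
def strip_subdomains_stepwise(url, min_labels = 3)
  uri    = URI.parse(url)
  labels = uri.host.split('.')
  steps  = [uri.to_s]
  while labels.length > min_labels
    labels.delete_at(1)          # drop the label right after the first one
    uri.host = labels.join('.')
    steps << uri.to_s
  end
  steps
end
```

With http://www.maths.ox.ac.uk/ as input this returns the three URLs listed in the question, in the same order.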
If you know that the URL ends in .com and follows the format you specified, you could try a regular expression like this:
\.[\w\-]+\.com
to parse out the domain and the following .com.  Prefix that with http://www and you should be all set.
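A minimal sketch of how that might look in Ruby, assuming the URL really does end in .com. The constant DOT_COM and the method dot_com_base are hypothetical names, and the trailing negative lookahead is an addition of mine so that ".com" is not matched in the middle of a longer label such as "common" or ahead of a further suffix. It only looks at the string, so a ".com" inside a path would fool it.

```ruby
# The regex from above, plus a lookahead (my addition) requiring that ".com"
# is not followed by another word character, hyphen, or dot.
DOT_COM = /\.[\w\-]+\.com(?![\w\-.])/

# Hypothetical helper: returns "http://www.<domain>.com" or nil if no match.
def dot_com_base(url)
  match = url[DOT_COM]           # String#[] with a regex returns the match or nil
  match && "http://www#{match}"  # the match already starts with a dot
end
```

Called with http://www.sub_domain.domain.com this gives http://www.domain.com; for a URL such as http://www.ox.ac.uk/ it gives nil.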
There is no "general case" solution for this.  Some URLs use a suffix with one dot (.com or .edu), while some use multiple dots (.co.jp, etc).  You won't be able to solve this with something as simple as a regex.
What you may be able to do is to make a list of possible URL suffixes and construct a regex for each. If it matches your input string, use a variation of the above:
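# list_of_suffixes is an array of strings such as ['.com', '.co.jp']
def base_url(url, list_of_suffixes)
  list_of_suffixes.each do |s|
    # Escape the suffix so that its dots match literally
    thisregex = Regexp.new('\.[\w\-]+' + Regexp.escape(s))
    match = thisregex.match(url)
    next if match.nil?
    # match[0] already starts with a dot
    return 'http://www' + match[0]
  end
  nil # no suffix matched
end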
Note: code is off the top of my head and for illustration purposes only (it probably won't run exactly as-is, but you get the point)
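The suffix-list idea can also be sketched without a regex, by working on the host that URI extracts. Everything named here is an assumption for illustration: SUFFIXES is a deliberately tiny, hypothetical list (a real one would be far longer) and registrable_domain is a made-up method name. Longer suffixes are tried first so that a multi-dot suffix is preferred over a shorter one that also matches.

```ruby
require 'uri'

# Hypothetical, incomplete suffix list, for illustration only.
SUFFIXES = %w[com edu co.jp ac.uk co.uk]

# Hypothetical helper: returns "<label>.<suffix>" for the first (longest)
# suffix the host ends with, or nil if none applies.
def registrable_domain(url, suffixes = SUFFIXES)
  host = URI.parse(url).host
  return nil if host.nil?
  suffixes.sort_by { |s| -s.length }.each do |suffix|
    next unless host.end_with?(".#{suffix}")
    # Strip ".suffix" from the end and keep the label just before it
    label = host[0...-(suffix.length + 1)].split('.').last
    return "#{label}.#{suffix}" if label
  end
  nil
end
```

For http://www.maths.ox.ac.uk/ this yields ox.ac.uk, and for http://www.sub-domain.domain.com/pancakes it yields domain.com; anything whose suffix is missing from the list comes back as nil.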
The right way to deal with this is to use URI:
# Parse and remove all the stuff you don't want.
u = URI.parse('http://www.sub-domain.domain.com/pancakes')
u.userinfo = nil
u.path     = ''
u.fragment = nil
# You might want to check u.scheme as well
host = u.host
And now you have to figure out what you want to do with host. You could start at the last component and work your way backwards until you get a domain name that resolves to something using Net::DNS:
require 'net/dns/resolver'
components = host.split('.')
# Candidates run from the last two components up to the full host name
basename   = (2 .. components.length).
             map  { |i| components.last(i).join('.')  }.
             find { |n| Resolver(n).answer.length > 0 }
# basename is now nil or something with a DNS A record
if basename.nil?
  # complain and bail out
end
u.host = basename
# Your trimmed URL is in u.to_s
You have to check that the domain names resolve to something useful or you won't know if you have a valid one. You could try to track down all the various naming rules all over the world instead but there's no point.
This still won't guarantee that you have a useful URL; to be sure, you'd have to check whether the name you end up with responds to an HTTP request.
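If installing Net::DNS is not an option, roughly the same search could be sketched with Resolv from Ruby's standard library. This is a swapped-in alternative, not the code above: HAS_A_RECORD and shortest_resolving_name are hypothetical names, and the lookup is passed in as a callable so that a different check (an HTTP request, say, or a stub in tests) can be substituted.

```ruby
require 'resolv'

# Assumed default check: does the name have at least one DNS A record?
HAS_A_RECORD = lambda do |name|
  Resolv::DNS.open do |dns|
    !dns.getresources(name, Resolv::DNS::Resource::IN::A).empty?
  end
end

# Hypothetical helper: returns the shortest trailing part of host (two
# components or more) for which resolves.call(name) is true, or nil.
def shortest_resolving_name(host, resolves = HAS_A_RECORD)
  components = host.split('.')
  (2..components.length).
    map  { |i| components.last(i).join('.') }.
    find { |name| resolves.call(name) }
end
```

Whatever it returns depends on the DNS data at the time of the call, so the result still needs the same sanity checks as before.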
To answer your original question:
should I use a regex?
Absolutely not. URLs are a lot more complicated than most people think, so you should use a real URL parser such as URI. Furthermore, domain names are also more complicated than most people think, so you have to resort to DNS lookups to get anything sensible.
 