How to retrieve the `scheme://domain` part of a URL without including subdomains?

2023-04-03 10:33

I am using Ruby on Rails 3.0.10 and I would like to retrieve the scheme://domain part of a URL without including the subdomain part. That is, if I have the following URL

http://www.sub_domain.domain.com

I would like to retrieve

http://www.domain.com

How can I do that (should I use a regex?)?

UPDATE

@mu is too short rightly said in his/her comment (which made me think...):

You misunderstand. www.ac.uk is meaningless, the base domain for Oxford is ox.ac.uk; the ac.uk part means "academic UK" and is, semantically, one component. A few other countries have similar naming schemes.

So, the updated question is:

How can I iterate over a URL (for example http://www.maths.ox.ac.uk/), as in the following steps, so as to progressively delete subdomain parts until the last one?

http://www.maths.ox.ac.uk/ # Step 0 (start)
http://www.ox.ac.uk/       # Step 1
http://www.ac.uk/          # Step 2 (end)

This is a total hack, and I have no idea how it could be useful in the generic sense, but here you go.

ruby-1.8.7-p352 >   uri = URI.parse("http://www.foo.domain.com/")
 => #<URI::HTTP:0x105011840 URL:http://www.foo.domain.com/> 
ruby-1.8.7-p352 > uri.scheme + "://" + uri.host.split(/\./)[-2..-1].join(".")
 => "http://domain.com"
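Wrapped in a method, the same hack might look like the sketch below. The name `naive_base_url` is made up for illustration. It simply keeps the last two labels of the host, so it is only right when the base domain is exactly two labels long, which is the limitation raised in the question's update.

```ruby
require 'uri'

# Hypothetical helper: keeps only the last two dot-separated labels of the host.
def naive_base_url(url)
  uri = URI.parse(url)
  labels = uri.host.split('.')
  "#{uri.scheme}://#{labels.last(2).join('.')}"
end

puts naive_base_url('http://www.foo.domain.com/')  # http://domain.com
puts naive_base_url('http://www.maths.ox.ac.uk/')  # http://ac.uk (not a useful base domain)
```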

If you know that the URL ends in .com and follows the format you specified, you could try a regular expression like this:

\.[\w\-]+\.com

to parse out the domain and the following .com. Prefix that with http://www and you should be all set.

There is no "general case" solution for this. Some URLs use a suffix with one dot (.com or .edu), while some use multiple dots (.co.jp, etc). You won't be able to solve this with something as simple as a regex.

What you may be able to do is to make a list of possible URL suffixes and construct a regex for each. If it matches your input string, use a variation of the above:

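BASE_REGEX = '\.[\w\-]+'

# list_of_suffixes holds strings such as '.co.jp' or '.com', longest first.
def base_url(url, list_of_suffixes)
  list_of_suffixes.each do |s|
    # Escape the suffix so that its dots match literally.
    this_regex = Regexp.new(BASE_REGEX + Regexp.escape(s))
    match = this_regex.match(url)
    next if match.nil?
    # match[0] already starts with a dot, so prefix 'http://www' without one.
    return 'http://www' + match[0]
  end
  nil
end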

Note: code is off the top of my head and for illustration purposes only (it probably won't run exactly as-is, but you get the point)
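As a variation on the same idea, the suffix list can be matched against the parsed host rather than against the raw string. The following is only a sketch: `SUFFIXES` is a tiny, deliberately incomplete list chosen for illustration, and `base_url_from_suffixes` is a hypothetical name, not something from a library.

```ruby
require 'uri'

# Illustrative and incomplete suffix list (an assumption for this example).
SUFFIXES = %w[com edu co.jp ac.uk].freeze

# Hypothetical helper: returns "scheme://www.<label>.<suffix>", or nil if no
# suffix in the list matches the host.
def base_url_from_suffixes(url, suffixes = SUFFIXES)
  uri = URI.parse(url)
  host = uri.host
  return nil if host.nil?

  # Try longer suffixes first so that a multi-label suffix wins over a shorter one.
  suffix = suffixes.sort_by { |s| -s.length }.find { |s| host.end_with?(".#{s}") }
  return nil if suffix.nil?

  # The label immediately to the left of the suffix is the name we keep.
  label = host[0...-(suffix.length + 1)].split('.').last
  return nil if label.nil?

  "#{uri.scheme}://www.#{label}.#{suffix}"
end

puts base_url_from_suffixes('http://www.foo.domain.com/')  # http://www.domain.com
puts base_url_from_suffixes('http://www.maths.ox.ac.uk/')  # http://www.ox.ac.uk
```

The result is only as good as the list: a host whose suffix is missing from `SUFFIXES` yields `nil` rather than a guess.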

The right way to deal with this is to use URI:

require 'uri'

# Parse and remove all the stuff you don't want.
u = URI.parse('http://www.sub-domain.domain.com/pancakes')
u.userinfo = nil
u.path     = ''
u.query    = nil
u.fragment = nil
# You might want to check u.scheme as well

host = u.host

And now you have to figure out what you want to do with host. You could start at the last component and work your way backwards until you get a domain name that resolves to something using Net::DNS:

require 'net/dns/resolver'
components = host.split('.')
# Candidates run from the last two components up to the full host name.
basename   = (2 .. components.length).
             map  { |i| components.last(i).join('.')   }.
             find { |n| Resolver(n).answer.length > 0  }

# basename is now nil or something with a DNS A record
if basename.nil?
    # complain and bail out
end
u.host = basename
# Your trimmed URL is in u.to_s

You have to check that the domain names resolve to something useful or you won't know if you have a valid one. You could try to track down all the various naming rules all over the world instead but there's no point.

This still won't guarantee that you have a useful URL; to be sure, you'd have to check whether the name you end up with responds to an HTTP request.
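To make the search order concrete, here is a small sketch that separates building the candidate names from checking them. Both `candidate_names` and `trimmed_url` are hypothetical helpers, and the check is passed in as a callable so the example runs without network access. The `fake_resolver` below is a stand-in, not a real DNS lookup.

```ruby
require 'uri'

# Hypothetical helper: candidate names from shortest to longest, built from the right.
def candidate_names(host)
  labels = host.split('.')
  (2..labels.length).map { |n| labels.last(n).join('.') }
end

# Hypothetical helper: `resolves` is any callable that returns true or false for a name.
def trimmed_url(url, resolves)
  uri = URI.parse(url)
  base = candidate_names(uri.host).find { |name| resolves.call(name) }
  return nil if base.nil?

  uri.host = base
  uri.path = ''
  uri.to_s
end

# Stand-in check for the example: pretend only these names resolve.
known = ['ox.ac.uk', 'maths.ox.ac.uk', 'www.maths.ox.ac.uk']
fake_resolver = ->(name) { known.include?(name) }

puts trimmed_url('http://www.maths.ox.ac.uk/', fake_resolver)  # http://ox.ac.uk
```

In real use the callable would wrap an actual lookup. One option, assuming network access, is a lambda around `Resolv.getaddresses(name).any?` from Ruby's standard `resolv` library.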

To answer your original question:

should I use a regex?

Absolutely not. URLs are a lot more complicated than most people think so you should use a real URL parser such as URI. Furthermore, domain names are also more complicated than most people think so you have to resort to DNS lookups to get anything sensible.
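As a small illustration of why a parser is safer, the sketch below runs the `.com` regex from earlier in this thread against a made-up URL that has user info, a port, and a path containing dots, and compares the result with what URI reports. The URL is an invented example.

```ruby
require 'uri'

# Made-up URL: the host is www.example.org, but the path happens to contain ".backup.com".
url = 'http://user:secret@www.example.org:8080/files/v1.backup.com?ref=x#top'

# The regex scans the raw string and latches onto the path.
regex_guess = url[/\.[\w\-]+\.com/]

# URI splits the string into its components first.
uri = URI.parse(url)

puts regex_guess  # .backup.com
puts uri.host     # www.example.org
puts uri.port     # 8080
```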
