Getting the domain of a URL with Regular Expressions

I'm trying to get the domain of a given URL. For example, http://www.facebook.com/someuser/ will return facebook.com. The given URL can be in these formats:

  1. https://www.facebook.com/someuser (www. is optional, but should be ignored)
  2. www.facebook.com/someuser (http:// is not required)
  3. facebook.com/someuser
  4. http://someuser.tumblr.com -> this has to return tumblr.com only

I wrote this regex:

/(?:\.|\/{2})(?:www\.)?([^\/]*)/i

But it does not work as I expect.

I can do this in parts:

  1. Remove http:// and https://, if present on string, with string.delete "/https?:\/\//i".
  2. Remove www. with string.delete "/www\./i".
  3. Get the domain with match and /(\w+\.\w+)+/i

But this won't work with subdomains. Strings for testing:

https://www.facebook.com/username
http://last.fm/user/username
www.google.com
facebook.com/username
http://sub.tumblr.com/
sub.tumblr.com

I need this to work with as little memory and processing cost as possible.

Any ideas?


Why don't you just use the URI class to do this?

URI.parse( your_uri ).host

And you're done.

Just one thing: if there's no "http://" or "https://" at the beginning of the URL, you'll have to add one, or the parse method will not give you a host (it will be nil).
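
A minimal sketch of that workaround, assuming a hypothetical helper named host_of (not part of URI) that prepends "http://" when the string has no scheme. Note that the host still includes "www." and any subdomain, so it would need further trimming to get facebook.com or tumblr.com alone:

```ruby
require 'uri'

# Hypothetical helper: returns the host part of a URL, adding a
# scheme first when the string does not start with one.
def host_of(url)
  url = url.strip
  # Prepend a scheme so URI.parse treats the first segment as a host.
  url = "http://#{url}" unless url =~ %r{\A[a-z][a-z0-9+.-]*://}i
  URI.parse(url).host
rescue URI::InvalidURIError
  # Strings that are not valid URIs yield nil instead of raising.
  nil
end

puts host_of('https://www.facebook.com/someuser') # host keeps the "www." prefix
puts host_of('facebook.com/someuser')
puts host_of('http://someuser.tumblr.com')        # host keeps the subdomain
```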


This works for me:

/^h?t?t?p?s?:?\/?\/?w?w?w?\.?(.*\.[A-Z]{2,})+[A-Z\/]/i

It will always give you the domain part only. Take a look at it at: http://rubular.com/r/0hudnJSgVT

To use it, create a method like this. I put it in my helpers so I have access to it in the views.

def website_url(website_url)
  if website_url[/^h?t?t?p?s?:?\/?\/?w?w?w?\.?(.*\.[A-Z\/]{2,})$/i]
    website_id = $1
  end

  %Q{http://#{ website_id }}
end


Does it have to be a regex? You could do this also.

require 'uri'
yourURL = URI.parse('https://www.facebook.com/username')
print yourURL.host


You could use this regex:

/(\w+\.\w{2,6})(?:\/|$)/
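
As a rough usage sketch, the capture group can be read with String#[]. The constant DOMAIN_RE and the method domain_of are names made up for this example:

```ruby
# The regex from above; group 1 is "name.tld" directly before a slash
# or the end of the line.
DOMAIN_RE = /(\w+\.\w{2,6})(?:\/|$)/

# Hypothetical helper: returns the first captured domain, or nil.
def domain_of(url)
  url[DOMAIN_RE, 1]
end

[
  'https://www.facebook.com/username',
  'http://last.fm/user/username',
  'www.google.com',
  'facebook.com/username',
  'http://sub.tumblr.com/',
  'sub.tumblr.com'
].each { |url| puts "#{url} -> #{domain_of(url)}" }
```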


If you really wanted to use a regex, you could try something along the lines of:

test_string.scan(/\w+\.\w+(?=\/|\s|$)/) { |match| do_stuff_with(match) }

This wouldn't account for domain names such as something.co.uk but it would match everything in your test string.
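
To make the limitation concrete, here is a small sketch using the same pattern. The names PATTERN and domains_in are only for illustration. For a two-part suffix the pattern keeps just the last two labels:

```ruby
# Same pattern as above: "label.label" followed by a slash,
# whitespace, or the end of a line.
PATTERN = /\w+\.\w+(?=\/|\s|$)/

# Hypothetical helper: collects every match in the given text.
def domains_in(text)
  text.scan(PATTERN)
end

p domains_in("https://www.facebook.com/username\nsub.tumblr.com")
# => ["facebook.com", "tumblr.com"]

p domains_in('http://www.something.co.uk/page')
# => ["co.uk"], the registered name "something" is lost
```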


I created a method on the String class, using the open classes technique, for my purpose.

require 'uri'

# Relies on ActiveSupport (Rails) for blank? and present?.
class String
  def to_dn
    return '' if blank?
    # Email address: everything after the last "@" is the domain.
    return split('@').last if include?('@')
    link = strip
    link = "http://#{link}" unless link.match(/^https?:\/\//)
    # URI.encode was removed in Ruby 3.0; this line needs an older Ruby.
    host = URI.parse(URI.encode(link)).host
    link = host if host.present?
    domain_name = link.sub(/.*?www\./, '')
    # Keep only the last two labels when they look like "name.tld".
    match = domain_name.match(/[A-Z]+\.[A-Z]{2,4}$/i)
    match ? match.to_s : domain_name
  end
end

Example:

 1. "https://www.facebook.com/someuser".to_dn # => "facebook.com"
 2. "www.facebook.com/someuser".to_dn # => "facebook.com"
 3. "facebook.com/someuser".to_dn # => "facebook.com"
 4. "http://someuser.tumblr.com".to_dn # => "tumblr.com"
 5. "dc.ads.linkedin.com".to_dn # => "linkedin.com"
 6. 'your_name@domain.com'.to_dn # => "domain.com"

It also works for email addresses (which I require for my purpose). I hope it will be useful to others. Correct me if you find anything incorrect :)

Note: It will not work for 'www.domainname.co.in'. I am working on it :)
