开发者

Extract domain using regular expression

Suppose I got these urls.

http://abdd.eesfea.domainname.com/b/33tA$/0021/file
http://mail.domainname.org/abc/abc/aaa
http://domainname.edu 

I just want to extract "domainame.com" or "domainname.org" or "domainname.edu" out. How can I do this?

I think, I need to find the last "dot" just before "com|org|edu..." and print out content from this "dot"'s p开发者_StackOverflow社区revious dot to this dot's next dot(if it has).

Need help about the regular-expres. Thanks a lot!!! I am using Python.


why use regex?

http://docs.python.org/library/urlparse.html


If you would like to go the regex route...

RFC-3986 is the authority regarding URIs. Appendix B provides this regex to break one down into its components:

re_3986 = r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
# Where:
# scheme    = $2
# authority = $4
# path      = $5
# query     = $7
# fragment  = $9

Here is an enhanced, Python friendly version which utilizes named capture groups. It is presented in a function within a working script:

import re

def get_domain(url):
    """Return top two domain levels from URI"""
    re_3986_enhanced = re.compile(r"""
        # Parse and capture RFC-3986 Generic URI components.
        ^                                    # anchor to beginning of string
        (?:  (?P<scheme>    [^:/?#\s]+): )?  # capture optional scheme
        (?://(?P<authority>  [^/?#\s]*)  )?  # capture optional authority
             (?P<path>        [^?#\s]*)      # capture required path
        (?:\?(?P<query>        [^#\s]*)  )?  # capture optional query
        (?:\#(?P<fragment>      [^\s]*)  )?  # capture optional fragment
        $                                    # anchor to end of string
        """, re.MULTILINE | re.VERBOSE)
    re_domain =  re.compile(r"""
        # Pick out top two levels of DNS domain from authority.
        (?P<domain>[^.]+\.[A-Za-z]{2,6})  # $domain: top two domain levels.
        (?::[0-9]*)?                      # Optional port number.
        $                                 # Anchor to end of string.
        """, 
        re.MULTILINE | re.VERBOSE)
    result = ""
    m_uri = re_3986_enhanced.match(url)
    if m_uri and m_uri.group("authority"):
        auth = m_uri.group("authority")
        m_domain = re_domain.search(auth)
        if m_domain and m_domain.group("domain"):
            result = m_domain.group("domain");
    return result

data_list = [
    r"http://abdd.eesfea.domainname.com/b/33tA$/0021/file",
    r"http://mail.domainname.org/abc/abc/aaa",
    r"http://domainname.edu",
    r"http://domainname.com:80",
    r"http://domainname.com?query=one",
    r"http://domainname.com#fragment",
    ]
cnt = 0
for data in data_list:
    cnt += 1
    print("Data[%d] domain = \"%s\"" %
        (cnt, get_domain(data)))

For more information regarding the picking apart and validation of a URI according to RFC-3986, you may want to take a look at an article I've been working on: Regular Expression URI Validation


In addition to Jase' answer. If you don't wan't to use urlparse, just split the URL's.

Strip of the protocol (http:// or https://) The you just split the string by first occurrence of '/'. This will leave you with something like: 'mail.domainname.org' on the second URL. This can then be split by '.' and the you just select the last two from the list by [-2]

This will always yield the domainname.org or whatever. Provided you get the protocol stripped out right, and that the URL are valid.

I would just use urlparse, but it can be done. Dunno about the regex, but this is how I would do it.


Should you need more flexibility than urlparse provides, here's an example to get you started:

import re
def getDomain(url):
    #requires 'http://' or 'https://'
    #pat = r'(https?):\/\/(\w+\.)*(?P<domain>\w+)\.(\w+)(\/.*)?'
    #'http://' or 'https://' is optional
    pat = r'((https?):\/\/)?(\w+\.)*(?P<domain>\w+)\.(\w+)(\/.*)?'
    m = re.match(pat, url)
    if m:
        domain = m.group('domain')
        return domain
    else:
        return False

I used the named group (?P<domain>\w+) to grab the match, which is then indexed by its name, m.group('domain'). The great thing about learning regular expressions is that once you are comfortable with them, solving even the most complicated parsing problems is relatively simple. This pattern could be improved to be more or less forgiving if necessary -- this one for example will return '678' if you pass it 'http://123.45.678.90', but should work great on just about any other URL you can come up with. Regexr is a great resource for learning and testing regexes.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜