Extract domain using regular expression

2023-02-21 02:16 Asked by:

Suppose I have these URLs:

http://abdd.eesfea.domainname.com/b/33tA$/0021/file
http://mail.domainname.org/abc/abc/aaa
http://domainname.edu

I just want to extract "domainname.com", "domainname.org" or "domainname.edu". How can I do this?

I think I need to find the last dot just before "com|org|edu..." and print the content from the dot before it up to the dot after it (if there is one).

I need help with the regular expression. Thanks a lot! I am using Python.

Why use regex? The standard library already parses URLs with urlparse (urllib.parse in Python 3):

http://docs.python.org/library/urlparse.html
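
As a rough sketch of that approach in Python 3, where urlparse lives in urllib.parse: the helper name last_two_labels is made up for illustration, and keeping only the last two labels is a simplifying assumption that does not handle suffixes such as co.uk.

```python
from urllib.parse import urlparse

def last_two_labels(url):
    """Return the last two dot-separated labels of the URL's host (hypothetical helper)."""
    host = urlparse(url).hostname  # host without the port; None if the URL has no authority
    if not host:
        return ""
    return ".".join(host.split(".")[-2:])

for url in [
    "http://abdd.eesfea.domainname.com/b/33tA$/0021/file",
    "http://mail.domainname.org/abc/abc/aaa",
    "http://domainname.edu",
]:
    print(last_two_labels(url))
```

Note that a URL without a scheme, such as "domainname.com", has no authority as far as urlparse is concerned, so this sketch returns an empty string for it.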

If you would like to go the regex route...

RFC-3986 is the authority regarding URIs. Appendix B provides this regex to break one down into its components:

re_3986 = r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
# Where:
# scheme    = $2
# authority = $4
# path      = $5
# query     = $7
# fragment  = $9
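
A minimal sketch of how that regex might be applied in Python, using the numbered groups listed above. The helper name split_uri and the query and fragment in the sample URL are made up for illustration.

```python
import re

# Appendix B regex from RFC 3986, unchanged.
re_3986 = r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"

def split_uri(uri):
    """Split a URI into its five generic components (hypothetical helper)."""
    # Every component is optional and the path may be empty,
    # so the pattern matches any input string.
    m = re.match(re_3986, uri)
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }

parts = split_uri("http://mail.domainname.org/abc/abc/aaa?x=1#top")
print(parts["authority"])
```

Components that are absent come back as None, except the path, which is an empty string when missing.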

Here is an enhanced, Python-friendly version that uses named capture groups. It is presented in a function within a working script:

import re

def get_domain(url):
    """Return top two domain levels from URI"""
    re_3986_enhanced = re.compile(r"""
        # Parse and capture RFC-3986 Generic URI components.
        ^                                    # anchor to beginning of string
        (?:  (?P<scheme>    [^:/?#\s]+): )?  # capture optional scheme
        (?://(?P<authority>  [^/?#\s]*)  )?  # capture optional authority
             (?P<path>        [^?#\s]*)      # capture required path
        (?:\?(?P<query>        [^#\s]*)  )?  # capture optional query
        (?:\#(?P<fragment>      [^\s]*)  )?  # capture optional fragment
        $                                    # anchor to end of string
        """, re.MULTILINE | re.VERBOSE)
    re_domain =  re.compile(r"""
        # Pick out top two levels of DNS domain from authority.
        (?P<domain>[^.]+\.[A-Za-z]{2,6})  # $domain: top two domain levels.
        (?::[0-9]*)?                      # Optional port number.
        $                                 # Anchor to end of string.
        """, 
        re.MULTILINE | re.VERBOSE)
    result = ""
    m_uri = re_3986_enhanced.match(url)
    if m_uri and m_uri.group("authority"):
        auth = m_uri.group("authority")
        m_domain = re_domain.search(auth)
        if m_domain and m_domain.group("domain"):
            result = m_domain.group("domain")
    return result

data_list = [
    r"http://abdd.eesfea.domainname.com/b/33tA$/0021/file",
    r"http://mail.domainname.org/abc/abc/aaa",
    r"http://domainname.edu",
    r"http://domainname.com:80",
    r"http://domainname.com?query=one",
    r"http://domainname.com#fragment",
    ]
for cnt, data in enumerate(data_list, start=1):
    print('Data[%d] domain = "%s"' % (cnt, get_domain(data)))

For more information regarding the picking apart and validation of a URI according to RFC-3986, you may want to take a look at an article I've been working on: Regular Expression URI Validation

In addition to Jase's answer: if you don't want to use urlparse, just split the URLs.

Strip off the protocol (http:// or https://). Then split the string at the first occurrence of '/'. For the second URL this leaves 'mail.domainname.org'. Split that by '.' and take the last two items of the list with [-2:], then join them with '.'.

This yields domainname.org or whatever the last two labels are, provided the protocol is stripped correctly and the URLs are valid.

I would just use urlparse, but it can be done. Dunno about the regex, but this is how I would do it.
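
The splitting steps described in this answer could be sketched roughly as follows. The function name domain_by_split is hypothetical, and the sketch assumes URLs of the plain form shown in the question.

```python
def domain_by_split(url):
    """Return the last two host labels using only string splitting (hypothetical helper)."""
    # Strip the protocol, if present.
    for prefix in ("http://", "https://"):
        if url.startswith(prefix):
            url = url[len(prefix):]
            break
    # Keep everything before the first '/'.
    host = url.split("/", 1)[0]
    # Take the last two dot-separated labels and join them again.
    return ".".join(host.split(".")[-2:])

print(domain_by_split("http://mail.domainname.org/abc/abc/aaa"))
```

This does not strip a port or a query string that follows the host directly, so an input like "http://domainname.com:80" keeps the ":80" suffix.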

Should you need more flexibility than urlparse provides, here's an example to get you started:

import re
def getDomain(url):
    #requires 'http://' or 'https://'
    #pat = r'(https?):\/\/(\w+\.)*(?P<domain>\w+)\.(\w+)(\/.*)?'
    #'http://' or 'https://' is optional
    pat = r'((https?):\/\/)?(\w+\.)*(?P<domain>\w+)\.(\w+)(\/.*)?'
    m = re.match(pat, url)
    if m:
        domain = m.group('domain')
        return domain
    else:
        return False

I used the named group (?P<domain>\w+) to grab the match, which is then retrieved by its name with m.group('domain'). Note that this returns only the label before the final one ('domainname', not 'domainname.com'). The great thing about learning regular expressions is that once you are comfortable with them, even complicated parsing problems become manageable. This pattern could be made more or less forgiving if necessary. For example, it returns '678' if you pass it 'http://123.45.678.90', and because \w does not match hyphens, a host such as 'my-site.com' does not match at all, but it should work on most simple URLs. Regexr is a great resource for learning and testing regexes.
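
One possible way to tighten the pattern so that it returns the "domainname.com" form asked for in the question is to name the last group as well. The group name tld, the function name get_domain_with_tld, the choice to allow hyphens, and the requirement that the last label be alphabetic are all assumptions of this sketch, not part of the answer above.

```python
import re

# Same overall shape as the pattern above, with hyphens allowed in labels,
# the final label captured as 'tld' (letters only), and the match anchored
# to the end of the string.
PAT = re.compile(
    r'(?:https?://)?(?:[\w-]+\.)*(?P<domain>[\w-]+)\.(?P<tld>[A-Za-z]+)(?:[/:?#].*)?$'
)

def get_domain_with_tld(url):
    """Return 'domain.tld', or None if the URL does not match (hypothetical helper)."""
    m = PAT.match(url)
    if m:
        return m.group('domain') + '.' + m.group('tld')
    return None

print(get_domain_with_tld('http://abdd.eesfea.domainname.com/b/33tA$/0021/file'))
```

Because the last label must consist of letters, a dotted IP address such as 'http://123.45.678.90' no longer matches and the function returns None for it.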
