Extract main domain name from a given url

Posted: 2023-04-01 09:21

I used the following regex to extract the domain from a URL (the strings added to the list are my test cases):

String regex = "^(ww[a-zA-Z0-9-]{0,}\\.)";
ArrayList<String> cases = new ArrayList<String>();
cases.add("www.google.com");
cases.add("ww.socialrating.it");
cases.add("www-01.hopperspot.com");
cases.add("wwwsupernatural-brasil.blogspot.com");
cases.add("xtop10.net");
cases.add("zoyanailpolish.blogspot.com");

for (String t : cases) {
    String res = t.replaceAll(regex, "");
    System.out.println(res);
}

I can get the following results:

google.com
hopperspot.com
socialrating.it
blogspot.com
xtop10.net
zoyanailpolish.blogspot.com

All results except the last are what I want. For the last case I want blogspot.com, but I get zoyanailpolish.blogspot.com. What am I doing wrong?

With the Guava library, getting the domain name is easy:

InternetDomainName.from(tld).topPrivateDomain()

Refer to the API documentation for more details:

https://google.github.io/guava/releases/14.0/api/docs/

http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/net/InternetDomainName.html

Extracting the domain with a regex is complicated, if not impossible, because TLDs do not follow simple rules: they are assigned by ICANN and change over time.

Instead, use the functionality provided by the Java standard library, like this:

URL myUrl = new URL(urlString);
myUrl.getHost();
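
Note that getHost() returns the whole host name, subdomains included, so on its own it does not yield the main domain. A minimal sketch of that behaviour follows. It uses java.net.URI rather than the URL constructor shown above, purely as a convenience, and the class name HostDemo and helper hostOf are made up for illustration:

```java
import java.net.URI;

class HostDemo {
    // Returns the host component of an absolute URL string.
    // For input without a scheme (e.g. "www.google.com") URI treats the
    // text as a path, so getHost() returns null.
    static String hostOf(String url) {
        return URI.create(url).getHost();
    }

    public static void main(String[] args) {
        // prints "www.zoyanailpolish.blogspot.com": the full host, not "blogspot.com"
        System.out.println(hostOf("http://www.zoyanailpolish.blogspot.com/some/long/link"));
    }
}
```

The host still has to be reduced to its main domain by one of the other approaches in this thread.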

As of 2013, the solution I found is straightforward:

System.out.println(InternetDomainName.fromLenient(uriHost).topPrivateDomain().name());

It is much simpler:

try {
    String domainName = new URL("http://www.zoyanailpolish.blogspot.com/some/long/link").getHost();

    // Keep only the last two dot-separated labels of the host
    String[] levels = domainName.split("\\.");
    if (levels.length > 1) {
        domainName = levels[levels.length - 2] + "." + levels[levels.length - 1];
    }

    // now the value of domainName is blogspot.com
} catch (Exception e) {
    // new URL(...) throws MalformedURLException if the string cannot be parsed
}
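
One caveat with keeping the last two labels: it assumes the suffix is always a single label. A small sketch of the same split logic, wrapped in a hypothetical helper named lastTwoLabels, shows what happens with a two-label suffix such as co.uk (www.example.co.uk is just an illustrative host):

```java
class LastTwoLabelsDemo {
    // Same logic as the snippet above: keep the last two dot-separated labels.
    static String lastTwoLabels(String host) {
        String[] levels = host.split("\\.");
        if (levels.length > 1) {
            return levels[levels.length - 2] + "." + levels[levels.length - 1];
        }
        return host; // a single label is returned unchanged
    }

    public static void main(String[] args) {
        System.out.println(lastTwoLabels("www.zoyanailpolish.blogspot.com")); // blogspot.com
        System.out.println(lastTwoLabels("www.example.co.uk"));               // co.uk
    }
}
```

For hosts under such suffixes the result is the suffix itself rather than the main domain, which is why the list-based answers in this thread exist.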

As suggested by BalusC and others, the most practical solution is to get a list of TLDs (see this list), save it to a file, load it, and then determine which TLD a given URL string uses. From there you can construct the main domain name as follows:

    String url = "zoyanailpolish.blogspot.com";

    String tld = findTLD(url); // To be implemented, e.g. in a helper class

    // Strip only the trailing "." + tld; replace() would also remove earlier occurrences
    if (url.endsWith("." + tld)) {
        url = url.substring(0, url.length() - tld.length() - 1);
    }

    int pos = url.lastIndexOf('.');

    String mainDomain = "";

    if (pos > 0 && pos < url.length() - 1) {
        mainDomain = url.substring(pos + 1) + "." + tld;
    }
    // else: main domain name comes out empty

The implementation details are left up to you.
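
One possible way to fill in findTLD, as a rough sketch only: the SUFFIXES set below is a tiny hard-coded stand-in for the list you would load from a file, and the class name, the mainDomain helper, and the "longest matching suffix wins" rule are all assumptions made for illustration.

```java
import java.util.Set;

class MainDomainSketch {
    // Hypothetical stand-in for a suffix list loaded from a file.
    static final Set<String> SUFFIXES = Set.of("com", "net", "it", "co.uk");

    // Returns the longest known suffix the host ends with, or null if none matches.
    static String findTLD(String host) {
        String best = null;
        for (String suffix : SUFFIXES) {
            if (host.endsWith("." + suffix) && (best == null || suffix.length() > best.length())) {
                best = suffix;
            }
        }
        return best;
    }

    // Returns the label directly before the suffix plus the suffix,
    // or "" when no known suffix matches or no label precedes it.
    static String mainDomain(String host) {
        String tld = findTLD(host);
        if (tld == null) {
            return "";
        }
        String rest = host.substring(0, host.length() - tld.length() - 1);
        String label = rest.substring(rest.lastIndexOf('.') + 1); // whole string if there is no dot
        return label.isEmpty() ? "" : label + "." + tld;
    }

    public static void main(String[] args) {
        System.out.println(mainDomain("zoyanailpolish.blogspot.com")); // blogspot.com
        System.out.println(mainDomain("www.example.co.uk"));           // example.co.uk
    }
}
```

Unlike the fragment above, this sketch also returns a result when the input is already a bare main domain such as xtop10.net; whether that is desirable depends on your use case.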

You are seeing zoyanailpolish.blogspot.com because your regex matches only strings that start with 'ww'. What you are asking for is that, in addition to stripping a leading label that starts with 'ww', it should also strip one that starts with 'zoyanailpolish' (?). In that case, use the regex String regex = "^((ww|z|a)[a-zA-Z0-9-]{0,}\\.)"; this removes a leading label that starts with 'ww', 'z', or 'a'. Customize it to suit what you need.
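
To see the anchoring in action, here is a small sketch that runs the regex from the question over a few of its test cases. The class name WwPrefixDemo and the helper stripWwPrefix are made up for illustration:

```java
import java.util.List;

class WwPrefixDemo {
    // The regex from the question: anchored at the start of the string,
    // it matches only a first label that begins with "ww".
    static final String REGEX = "^(ww[a-zA-Z0-9-]{0,}\\.)";

    static String stripWwPrefix(String host) {
        return host.replaceAll(REGEX, "");
    }

    public static void main(String[] args) {
        List<String> hosts = List.of(
                "www.google.com",               // becomes google.com
                "www-01.hopperspot.com",        // becomes hopperspot.com
                "zoyanailpolish.blogspot.com"); // unchanged: does not start with "ww"
        for (String host : hosts) {
            System.out.println(host + " -> " + stripWwPrefix(host));
        }
    }
}
```

Because the character class excludes the dot, at most the first label is ever removed.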

InternetDomainName.from("test.blogspot.com").topPrivateDomain() -> test.blogspot.com

This works better in my case:

InternetDomainName.from("test.blogspot.com").topDomainUnderRegistrySuffix() -> blogspot.com

Details: https://github.com/google/guava/wiki/InternetDomainNameExplained
