Extract main domain name from a given url
I used the following to extract the domain from a URL (the list holds my test cases):
String regex = "^(ww[a-zA-Z0-9-]{0,}\\.)";
ArrayList<String> cases = new ArrayList<String>();
cases.add("www.google.com");
cases.add("ww.socialrating.it");
cases.add("www-01.hopperspot.com");
cases.add("wwwsupernatural-brasil.blogspot.com");
cases.add("xtop10.net");
cases.add("zoyanailpolish.blogspot.com");
for (String t : cases) {
    String res = t.replaceAll(regex, "");
    System.out.println(res);
}
I can get the following results:
google.com
hopperspot.com
socialrating.it
blogspot.com
xtop10.net
zoyanailpolish.blogspot.com
The first four cases are good. The last one is not: I want blogspot.com for it, but I get zoyanailpolish.blogspot.com. What am I doing wrong?
Using the Guava library, we can easily get the domain name:
InternetDomainName.from(tld).topPrivateDomain()
Refer to the API documentation for more details:
https://google.github.io/guava/releases/14.0/api/docs/
http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/net/InternetDomainName.html
Obtaining the host with a regex is complicated, if not impossible, because TLDs do not follow simple rules: they are assigned by ICANN and change over time.
Use the functionality provided by the Java library instead, like this:
URL myUrl = new URL(urlString);
myUrl.getHost();
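A minimal sketch of the host lookup above, using java.net.URI instead of the URL constructor (which newer JDKs mark as deprecated). The class name HostExtractor and the helper hostOf are made up for illustration, and prepending "http://" to scheme-less input is an assumption about how bare hosts such as www.google.com should be treated:

```java
import java.net.URI;
import java.net.URISyntaxException;

class HostExtractor {
    /** Returns the host part of a URL string, or null if it cannot be parsed. */
    static String hostOf(String urlString) {
        // A bare host like "www.google.com" has no scheme, so URI would parse it
        // as a path and report a null host. Assumption: treat such input as http.
        String candidate = urlString.contains("://") ? urlString : "http://" + urlString;
        try {
            return new URI(candidate).getHost();
        } catch (URISyntaxException e) {
            return null; // malformed input
        }
    }

    public static void main(String[] args) {
        System.out.println(hostOf("http://www.zoyanailpolish.blogspot.com/some/long/link"));
        System.out.println(hostOf("www.google.com"));
    }
}
```

Note that this only yields the full host name; reducing it to the main domain still needs knowledge of the suffixes, as the other answers discuss.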
This is 2013, and the solution I found is straightforward:
System.out.println(InternetDomainName.fromLenient(uriHost).topPrivateDomain().name());
It is much simpler:
try {
    String domainName = new URL("http://www.zoyanailpolish.blogspot.com/some/long/link").getHost();
    String[] levels = domainName.split("\\.");
    if (levels.length > 1) {
        // keep only the last two labels of the host
        domainName = levels[levels.length - 2] + "." + levels[levels.length - 1];
    }
    // now value of domainName variable is blogspot.com
} catch (Exception e) {}
As suggested by BalusC and others, the most practical solution is to get a list of TLDs (see this list), save it to a file, load it, and then determine which TLD a given URL string uses. From there you can construct the main domain name as follows:
String url = "zoyanailpolish.blogspot.com";
String tld = findTLD( url ); // To be implemented. Add to helper class ?
url = url.replace( "." + tld,"");
int pos = url.lastIndexOf('.');
String mainDomain = "";
if (pos > 0 && pos < url.length() - 1) {
    mainDomain = url.substring(pos + 1) + "." + tld;
}
// else: Main domain name comes out empty
The implementation details are left up to you.
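One possible way to fill in findTLD, as a sketch only. The SUFFIXES set below is a tiny hand-picked sample standing in for the full list loaded from a file, and the class name MainDomainFinder and the helper mainDomain are hypothetical. The lookup tries the longest candidate suffix first, so multi-label suffixes such as co.uk win over uk:

```java
import java.util.Set;

class MainDomainFinder {
    // Assumption: a small sample; real code would load the complete suffix list.
    static final Set<String> SUFFIXES = Set.of("com", "net", "it", "co.uk");

    /** Returns the longest known suffix of host, or null if none matches. */
    static String findTLD(String host) {
        int dot = host.indexOf('.');
        while (dot >= 0) {
            // Everything after this dot; candidates get shorter on each pass.
            String candidate = host.substring(dot + 1);
            if (SUFFIXES.contains(candidate)) {
                return candidate;
            }
            dot = host.indexOf('.', dot + 1);
        }
        return null;
    }

    /** Returns the label directly left of the suffix plus the suffix, or "" if unknown. */
    static String mainDomain(String host) {
        String tld = findTLD(host);
        if (tld == null) {
            return "";
        }
        // Cut off "." + tld from the end only, leaving the remaining labels.
        String rest = host.substring(0, host.length() - tld.length() - 1);
        if (rest.isEmpty()) {
            return "";
        }
        // lastIndexOf returns -1 when rest is a single label, so substring(0) keeps it whole.
        return rest.substring(rest.lastIndexOf('.') + 1) + "." + tld;
    }

    public static void main(String[] args) {
        System.out.println(mainDomain("zoyanailpolish.blogspot.com"));
        System.out.println(mainDomain("www.bbc.co.uk"));
    }
}
```

Unlike the snippet above, this version also returns a result for hosts that have no subdomain (for example xtop10.net), which seemed the more useful behaviour for the test cases in the question.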
The reason you are seeing zoyanailpolish.blogspot.com is that your regex only matches strings that start with 'ww'. What you are asking for is that, in addition to stripping a leading label that starts with 'ww', it should also strip one that starts with 'zoyanailpolish' (?). In that case, use the regex String regex = "^((ww|z|a)[a-zA-Z0-9-]{0,}\\.)";
This will remove a leading label that starts with 'ww', 'z' or 'a'. Customize it based on what exactly you need.
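To make the anchoring behaviour concrete, here is a small sketch that runs the original regex from the question against a few of its test cases. The class name WwPrefixDemo and the helper strip are invented for this illustration:

```java
import java.util.List;

class WwPrefixDemo {
    // The regex from the question: one leading label that begins with "ww".
    static final String REGEX = "^(ww[a-zA-Z0-9-]{0,}\\.)";

    static String strip(String host) {
        // '^' anchors the match at the start, so hosts that do not begin
        // with "ww" are returned unchanged.
        return host.replaceAll(REGEX, "");
    }

    public static void main(String[] args) {
        List<String> cases = List.of(
                "www.google.com",
                "wwwsupernatural-brasil.blogspot.com",
                "zoyanailpolish.blogspot.com");
        for (String host : cases) {
            System.out.println(host + " -> " + strip(host));
        }
    }
}
```

The second case only comes out as blogspot.com because its first label happens to start with "ww"; the pattern knows nothing about where the registered domain begins.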
InternetDomainName.from("test.blogspot.com").topPrivateDomain() -> test.blogspot.com
blogspot.com is itself listed as a public suffix, so topPrivateDomain() keeps the label below it. This works better in my case:
InternetDomainName.from("test.blogspot.com").topDomainUnderRegistrySuffix() -> blogspot.com
Details: https://github.com/google/guava/wiki/InternetDomainNameExplained