Top level domain from URL in C#
I am using C# and ASP.NET for this.
We receive a lot of "strange" requests on our IIS 6.0 servers and I want to log and catalog these by domain.
E.g. we get some strange requests like these:
- http://www.poker.winner4ever.example.com/
- http://www.hotgirls.example.com/
- http://santaclaus.example.com/
- http://m.example.com/
- http://wap.example.com/
- http://iphone.example.com/
The last three are fairly obvious, but I would like to sort them all into one group, as "example.com" IS hosted on our servers. The rest isn't, sorry :-)
So I am looking for some good ideas for how to retrieve example.com from the above. Secondly, I would like to match m., wap., iphone. etc. into a group, but that's probably just a quick lookup in a list of mobile shortcuts. I could handcode this list for a start.
But is regexp the answer here, or is pure string manipulation the easiest way? I was thinking of "splitting" the URL string by "." and then looking at item[0] and item[1]...
Any ideas?
You can use the Nager.PublicSuffix NuGet package. It uses the same data source that browser vendors use.
NuGet
PM> Install-Package Nager.PublicSuffix
Example
var domainParser = new DomainParser(new WebTldRuleProvider());
var domainInfo = domainParser.Parse("sub.test.co.uk");
//domainInfo.Domain = "test";
//domainInfo.Hostname = "sub.test.co.uk";
//domainInfo.RegistrableDomain = "test.co.uk";
//domainInfo.SubDomain = "sub";
//domainInfo.TLD = "co.uk";
The following code uses the Uri class to obtain the host name, and then obtains the second-level host (examplecompany.com) from Uri.Host by splitting the host name on periods.
var uri = new Uri("http://www.poker.winner4ever.examplecompany.com/");
var splitHostName = uri.Host.Split('.');
if (splitHostName.Length >= 2)
{
    var secondLevelHostName = splitHostName[splitHostName.Length - 2] + "." +
                              splitHostName[splitHostName.Length - 1];
}
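The same split can be wrapped in a small helper that also guards against hosts where taking the last two labels makes no sense, such as IP addresses or single-label names. This is only a sketch; the names `HostHelper` and `GetLastTwoLabels` are made up for illustration.

```csharp
using System;

public static class HostHelper
{
    // Returns the last two dot-separated labels of the URL's host.
    // The host is returned unchanged when it is an IP literal or has
    // fewer than two labels (for example "localhost").
    public static string GetLastTwoLabels(string url)
    {
        var uri = new Uri(url);

        // Uri.HostNameType distinguishes DNS names from IPv4/IPv6 literals.
        if (uri.HostNameType != UriHostNameType.Dns)
            return uri.Host;

        string[] labels = uri.Host.Split('.');
        if (labels.Length < 2)
            return uri.Host;

        return labels[labels.Length - 2] + "." + labels[labels.Length - 1];
    }
}
```

For the URL above this gives "examplecompany.com". Note that a host such as www.amazon.co.uk yields "co.uk", which is the limitation the other answers address.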
There may be some cases where this returns something other than what is desired. Country-code TLDs are the only ones that are 2 characters long, and they may or may not be used with a short second level (typically 2 or 3 characters, as in co.uk). The following will therefore give you what you want in most cases:
string GetRootDomain(string host)
{
    string[] domains = host.Split('.');
    if (domains.Length >= 3)
    {
        int c = domains.Length;
        // handle international country code TLDs
        // www.amazon.co.uk => amazon.co.uk
        if (domains[c - 1].Length < 3 && domains[c - 2].Length <= 3)
            return string.Join(".", domains, c - 3, 3);
        else
            return string.Join(".", domains, c - 2, 2);
    }
    else
        return host;
}
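To see where "most cases" ends, it helps to run the heuristic on a few hosts. The sketch below restates the same logic inside a class so that it compiles on its own; the class name `RootDomainDemo` and the host shop.abc.de are hypothetical.

```csharp
using System;

public static class RootDomainDemo
{
    // Same length-based heuristic as above: keep three labels when the host
    // ends in a 2-letter TLD preceded by a label of at most 3 characters,
    // otherwise keep two labels.
    public static string GetRootDomain(string host)
    {
        string[] domains = host.Split('.');
        if (domains.Length < 3)
            return host;

        int c = domains.Length;
        bool looksLikeCountrySuffix =
            domains[c - 1].Length < 3 && domains[c - 2].Length <= 3;

        return looksLikeCountrySuffix
            ? string.Join(".", domains, c - 3, 3)
            : string.Join(".", domains, c - 2, 2);
    }
}
```

With this, www.amazon.co.uk gives "amazon.co.uk" and m.example.com gives "example.com". A host like shop.abc.de comes back unchanged as "shop.abc.de", because the short registered name "abc" is indistinguishable from a second-level suffix when only lengths are compared.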
This is not possible without an up-to-date database of the different domain levels.
Consider:
s1.moh.gov.cn
moh.gov.cn
s1.google.com
google.com
At which level do you want to get the domain? It depends entirely on the TLD, SLD, ccTLD and so on. Because ccTLDs are under the control of individual countries, they may define very specific SLDs that are unknown to you.
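A minimal sketch of what such a database lookup looks like: find the longest known suffix that the host ends with, then keep one more label to its left. The five suffixes below are a hand-picked sample for the hosts listed above, not real data, and the names `SuffixLookup` and `GetRegistrableDomain` are hypothetical. A complete list such as the one at publicsuffix.org also contains wildcard and exception rules, which this sketch ignores.

```csharp
using System;
using System.Collections.Generic;

public static class SuffixLookup
{
    // Tiny illustrative sample; a real suffix list has thousands of
    // entries and changes over time.
    private static readonly HashSet<string> KnownSuffixes =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "com", "cn", "gov.cn", "uk", "co.uk"
        };

    // Returns the longest matching suffix plus one label to its left,
    // or null when no suffix matches or the host itself is a suffix.
    public static string GetRegistrableDomain(string host)
    {
        string[] labels = host.Split('.');

        // Candidates run from the whole host (i = 0) down to the last
        // label, so the first hit is the longest matching suffix.
        for (int i = 0; i < labels.Length; i++)
        {
            string candidate = string.Join(".", labels, i, labels.Length - i);
            if (KnownSuffixes.Contains(candidate))
            {
                if (i == 0)
                    return null; // the host is itself a suffix, e.g. "gov.cn"
                return string.Join(".", labels, i - 1, labels.Length - i + 1);
            }
        }
        return null;
    }
}
```

With this sample, s1.moh.gov.cn and moh.gov.cn both map to "moh.gov.cn", while s1.google.com and google.com both map to "google.com".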
I've written a library for use in .NET 2+ to help pick out the domain components of a URL.
More details are on GitHub, but one benefit over previous options is that it can download the latest data from http://publicsuffix.org automatically (once per month), so the output from the library should be more or less on a par with the output used by web browsers to establish domain security boundaries (i.e. pretty good).
It's not perfect yet but suits my needs and shouldn't take much work to adapt to other use cases so please fork and send a pull request if you want.
Use a regular expression:
^https?://(?:([\w.-]+)\.)?([\w-]+)\.(com|co\.uk|com\.au)/?$
This will match any URL that ends with one of the TLDs you are interested in. Extend the list with as many as you want. The capturing groups will contain the subdomain, hostname and TLD respectively.
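A minimal sketch of applying such a pattern with `Regex.Match`. The pattern used here escapes the dots and keeps the TLD alternatives in a single group, so that group 1 is the subdomain, group 2 the hostname and group 3 the TLD. The names `UrlRegex` and `GetDomain` are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UrlRegex
{
    // Group 1: optional subdomain, group 2: hostname, group 3: TLD.
    private static readonly Regex Pattern = new Regex(
        @"^https?://(?:([\w.-]+)\.)?([\w-]+)\.(com|co\.uk|com\.au)/?$",
        RegexOptions.IgnoreCase);

    // Returns "hostname.tld", or null when the URL does not end in one
    // of the listed TLDs.
    public static string GetDomain(string url)
    {
        Match m = Pattern.Match(url);
        if (!m.Success)
            return null;

        return m.Groups[2].Value + "." + m.Groups[3].Value;
    }
}
```

Because the pattern is anchored with `$` directly after the TLD, a URL with a path (such as http://example.com/page) does not match; matching against `Uri.Host` with the scheme part of the pattern removed would be one way around that.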
uri.Host.ToLower().Replace("www.","").Substring(uri.Host.ToLower().Replace("www.","").IndexOf('.'))
returns ".com" for
Uri uri = new Uri("http://stackoverflow.com/questions/4643227/top-level-domain-from-url-in-c");
returns ".co.jp" for
Uri uri = new Uri("http://stackoverflow.co.jp");
returns ".s1.moh.gov.cn" for
Uri uri = new Uri("http://stackoverflow.s1.moh.gov.cn");
etc.