Javascript/Regex for finding just the root domain name without sub domains

2023-01-11 05:41 Q&A author:

I had a search and found lots of similar regex examples, but not quite what I need.

I want to be able to pass in the following urls and return the results:

www.google.com returns google.com
sub.domains.are.cool.google.com returns google.com
doesntmatterhowlongasubdomainis.idont.wantit.google.com returns google.com
sub.domain.google.com/no/thanks returns google.com

Hope that makes sense :) Thanks in advance! -James

You can't do this with a regular expression because you don't know how many blocks are in the suffix.

For example google.com has a suffix of com. To get from subdomain.google.com to google.com you'd have to take the last two blocks - one for the suffix and one for google.

If you apply this logic to subdomain.google.co.uk though you would end up with co.uk.

You will actually need to look up the suffix from a list like http://publicsuffix.org/
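As a rough sketch of that lookup in JavaScript: the snippet below hard-codes a tiny, hand-picked set of suffixes and returns the longest matching suffix plus the one block in front of it. KNOWN_SUFFIXES and rootDomain are names made up for illustration, not part of any library, and the real list also has wildcard and exception rules that this ignores.

```javascript
// Hypothetical, hand-picked subset of public suffixes, for illustration only.
// A real implementation would load the full list published at publicsuffix.org.
const KNOWN_SUFFIXES = new Set(['com', 'org', 'net', 'uk', 'co.uk', 'org.uk']);

function rootDomain(input) {
  // Drop any path, then split the host name into its dot-separated blocks.
  const host = input.split('/')[0].toLowerCase();
  const blocks = host.split('.');

  // Try candidate suffixes from longest to shortest, so co.uk wins over uk.
  for (let i = 0; i < blocks.length; i++) {
    const candidate = blocks.slice(i).join('.');
    if (KNOWN_SUFFIXES.has(candidate)) {
      // If the whole input is a suffix, there is no root domain to return.
      if (i === 0) return null;
      // Otherwise: the suffix plus the single block in front of it.
      return blocks.slice(i - 1).join('.');
    }
  }
  return null; // no known suffix found
}

console.log(rootDomain('subdomain.google.co.uk'));          // google.co.uk
console.log(rootDomain('sub.domain.google.com/no/thanks')); // google.com
```

Trying the longest candidate first is what stops subdomain.google.co.uk from collapsing to co.uk.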

Don't use regex, use the .split() method and work from there.

var s = domain.split('.');

If your use case is fairly narrow you could then check the TLDs as needed, and then return the last 2 or 3 segments as appropriate:

return s.slice(-2).join('.');

It'll make your eyes bleed less than any regex solution.
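Put together, the narrow version might look something like the sketch below. TWO_BLOCK_SUFFIXES and lastSegments are hypothetical names, and the suffix list is only an example of "the TLDs you care about", so adjust it to your own data.

```javascript
// Hypothetical list of two-block suffixes that this narrow use case cares about.
const TWO_BLOCK_SUFFIXES = ['co.uk', 'com.au'];

function lastSegments(domain) {
  // Ignore anything after the first slash, then split on dots.
  const s = domain.split('/')[0].split('.');

  // If the last two segments form a known two-block suffix, keep three segments.
  const lastTwo = s.slice(-2).join('.');
  const count = TWO_BLOCK_SUFFIXES.includes(lastTwo) ? 3 : 2;

  return s.slice(-count).join('.');
}

console.log(lastSegments('www.google.com'));   // google.com
console.log(lastSegments('www.google.co.uk')); // google.co.uk
```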

I've not done a lot of testing on this, but if I understand what you're asking for, this should be a decent starting point...

([A-Za-z0-9-]+\.([A-Za-z]{3,}|[A-Za-z]{2}\.[A-Za-z]{2}|[A-Za-z]{2}))\b

EDIT:

To clarify, it's looking for:

one or more alpha-numeric characters or dashes, followed by a literal dot

and then one of three things...

three or more alpha characters (e.g. com/net/mil/coop, etc.)
two alpha characters, followed by a literal dot, followed by two more alphas (e.g. co.uk)
two alpha characters (e.g. us/uk/to, etc.)

and at the end of that, a word boundary (\b), meaning the match must be followed by the end of the string, a space, or another non-word character (in regex, word characters are typically alphanumerics and the underscore).

As I say, I didn't do much testing, but it seemed a reasonable jumping off point. You'd likely need to try it and tune it some, and even then, it's unlikely that you'll get 100% for all test cases. There are considerations like Unicode domain names and all sorts of technically-valid-but-you'll-likely-not-encounter-in-the-wild things that'll trip up a simple regex like this, but this'll probably get you 90%+ of the way there.
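If it helps to see the pieces separately, here is a small sketch that assembles the same pattern from the parts described above and tries it on a few bare domains. The names label, suffix and firstMatch are made up for illustration, and the character classes are written as A-Za-z throughout.

```javascript
// One or more alphanumerics or dashes (the block in front of the suffix).
const label = '[A-Za-z0-9-]+';

// The three suffix alternatives, tried in this order.
const suffix = '(?:[A-Za-z]{3,}'             // com, net, mil, coop, ...
             + '|[A-Za-z]{2}\\.[A-Za-z]{2}'  // co.uk
             + '|[A-Za-z]{2})';              // us, uk, to, ...

// label, a literal dot, a suffix, then a word boundary.
const pattern = new RegExp('(' + label + '\\.' + suffix + ')\\b');

function firstMatch(text) {
  const m = pattern.exec(text);
  return m ? m[1] : null;
}

console.log(firstMatch('google.com'));   // google.com
console.log(firstMatch('google.co.uk')); // google.co.uk
console.log(firstMatch('google.us'));    // google.us
```

Note that exec returns the first match it finds scanning left to right, so inputs with leading subdomains are worth checking separately against your own data.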

If you have a limited subset of data, I suggest keeping the regex simple, e.g.

(([a-z\-]+)(?:\.com|\.fr|\.co\.uk))

This will match:

www.google.com --> google.com
www.google.co.uk --> google.co.uk
www.foo-bar.com --> foo-bar.com

In my case, I know that all relevant URLs will be matched using this regex.

Collect a sample dataset and test it against your regex. While prototyping, you can do that using a tool such as https://regex101.com/r/aG9uT0/1. In development, automate it using a test script.
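A test script for this can be very small. The sketch below is one way to do it, assuming a plain array of input/expected pairs; the names samples and runSamples are made up, and the dot in co.uk is escaped so that it only matches a literal dot.

```javascript
// Narrow pattern for a known set of suffixes (dots escaped so they match literally).
const pattern = /([a-z\-]+)(?:\.com|\.fr|\.co\.uk)/;

// Sample dataset: each entry is [input, expected root domain].
const samples = [
  ['www.google.com', 'google.com'],
  ['www.google.co.uk', 'google.co.uk'],
  ['www.foo-bar.com', 'foo-bar.com'],
];

// Returns a list of human-readable failures; an empty list means all samples passed.
function runSamples() {
  const failures = [];
  for (const [input, expected] of samples) {
    const m = pattern.exec(input);
    const actual = m ? m[0] : null;
    if (actual !== expected) {
      failures.push(input + ': expected ' + expected + ', got ' + actual);
    }
  }
  return failures;
}

const failures = runSamples();
console.log(failures.length === 0 ? 'all samples passed' : failures.join('\n'));
```

Add every URL shape you actually see in your data to samples, including the ones you expect not to match.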

([A-Za-z0-9-]+\.([A-Za-z]{3,}|[A-Za-z]{2}\.[A-Za-z]{2}|[A-Za-z]{2}))(?!\.([A-Za-z]{3,}|[A-Za-z]{2}\.[A-Za-z]{2}|[A-Za-z]{2}))\b

This is an improvement upon theracoonbear's answer. I did a quick bit of testing and noticed that if you give it a domain where the subdomain has a subdomain, it will fail. I also wanted to point out that the "90%" was definitely not generous. It will be a lot closer to 100% than you think. It works on all subdomains of the top 50 most visited websites which accounts for a huge chunk of worldwide internet activity. The only time it would fail is potentially with unicode domains, etc.

My solution starts off working the same way that theracoonbear's does. Instead of checking for a word boundary, it uses a negative lookahead to check if there is not something that could be a TLD at the end (just copied the TLD checking part over into a negative lookahead).
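To see what the lookahead changes, here is a quick side-by-side sketch. Both patterns are built from the same pieces; the variable and function names are only for illustration.

```javascript
// Shared pieces: the TLD alternatives and "label, dot, TLD".
const tld = '(?:[A-Za-z]{3,}|[A-Za-z]{2}\\.[A-Za-z]{2}|[A-Za-z]{2})';
const core = '[A-Za-z0-9-]+\\.' + tld;

// Word boundary only.
const withBoundary = new RegExp('(' + core + ')\\b');

// Same, but the match must not be followed by a dot plus something TLD-shaped.
const withLookahead = new RegExp('(' + core + ')(?!\\.' + tld + ')\\b');

function firstMatch(re, text) {
  const m = re.exec(text);
  return m ? m[1] : null;
}

const input = 'sub.domain.google.com/no/thanks';
console.log(firstMatch(withBoundary, input));  // sub.domain
console.log(firstMatch(withLookahead, input)); // google.com
```

The boundary-only pattern stops at the first label.label pair it finds, while the lookahead version keeps rejecting candidates until nothing TLD-shaped follows.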

Without testing the validity of the top-level domain, I'm using an adaptation of stormsweeper's solution:

var domain = 'sub.domains.are.cool.google.com';

var s = domain.split('.');

// Keep the last two segments: 'google.com'
var tld = s.slice(-2).join('.');

EDIT: Be careful of issues with three part TLDs like domain.co.uk.
