Regular Expression to Remove Subdomain from Root Domain in List - Notepad++ or Gvim
I have a list of URLs stored in a .txt file (I'm using Windows 7).
The format of the URLs is this:
somesite1.com
somesite2.com
somesite3.com
sub1.somesite3.com
sub2.somesite3.com
sub3.somesite3.com
sub1.somesite3.net
sub1.somesite1.org
In Notepad++, there is an option to use "find-replace with regular expressions", and I'm fairly sure that Gvim allows the use of regular expressions (although I'm not entirely sure how to use them in Gvim).
Anyway, I don't know what to put in the find & replace boxes so it can go through the contents of the file and leave me with only the root domains. If done properly, it would turn the above example list into this:
somesite1.com
somesite2.com
somesite3.com
somesite3.com
somesite3.com
somesite3.com
somesite3.net
somesite1.org
Can somebody help me out?
A couple of ways of doing it for Vim (the trailing slashes are optional, too):
:%s/^.\+\.\ze[^.]\+\.[^.]\+$//
:%s/^.\+\.\([^.]\+\.[^.]\+\)$/\1/
See also :help /\ze etc. \ze and \zs are Vim-specific and very useful. There are also look-ahead and look-behind assertions which can be useful, in Vim and PCRE.
I believe Notepad++ uses PCRE; finding ^.+\.([^.]+\.[^.]+)$ and replacing it with \1 should work (but I don't use Notepad++). If [^.] turns out to match across line breaks there, [^.\r\n] can be used in its place.
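For anyone who wants to check the pattern outside an editor, here is a minimal Python sketch of the same capture-group substitution. Python's re module is not PCRE, but this pattern only uses syntax the two share. The function name strip_subdomains is made up for illustration, and the input is processed one line at a time so that [^.] can never run across a line break.

```python
import re

# Same pattern as above: the last two dot-separated labels are captured
# as group 1, and everything in front of them is discarded.
PATTERN = re.compile(r"^.+\.([^.]+\.[^.]+)$")

def strip_subdomains(lines):
    """Hypothetical helper: reduce each line to its last two labels."""
    return [PATTERN.sub(r"\1", line) for line in lines]

# The list from the question.
domains = [
    "somesite1.com",
    "somesite2.com",
    "somesite3.com",
    "sub1.somesite3.com",
    "sub2.somesite3.com",
    "sub3.somesite3.com",
    "sub1.somesite3.net",
    "sub1.somesite1.org",
]

for name in strip_subdomains(domains):
    print(name)
```

Lines that already consist of only two labels do not match the pattern and are left as they are.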
Be aware this won't work well with country code top level domains which use third-level registration - example.com.au would be turned into com.au. And then there are some countries which use second- or third-level registration under certain rules... if you care about those cases, you'll need more rules and a full parser would be neater than a regular expression (though as always it would be possible with regular expressions).
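To give a rough idea of what "more rules" could look like, here is a small Python sketch that checks a hand-written set of two-label suffixes before falling back to the last two labels. The set MULTI_LABEL_SUFFIXES and the function root_domain are assumptions invented for this example; the set is deliberately tiny, and a real one would have to be much longer and come from an authoritative source.

```python
# Hypothetical, deliberately incomplete set of suffixes under which
# registrations happen at the third level.
MULTI_LABEL_SUFFIXES = {"com.au", "co.uk"}

def root_domain(host):
    """Return the registrable part of host under the toy rules above."""
    labels = host.strip().lower().split(".")
    # If the last two labels form a known suffix, keep three labels.
    if len(labels) >= 3 and ".".join(labels[-2:]) in MULTI_LABEL_SUFFIXES:
        return ".".join(labels[-3:])
    # Otherwise keep the last two labels, as the regular expression does.
    return ".".join(labels[-2:])

for name in ["www.example.com.au", "sub1.somesite3.com", "somesite1.com"]:
    print(name, "->", root_domain(name))
```

Splitting on dots instead of using one regular expression keeps the suffix check readable, which is the point made above about a parser being neater.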
Replace ^[^.]*\.(?=\w+\.\w+$) with <blank>
Deciphered, this means:
^ = start of line
[^.]* = any number of chars that are not a dot
\. = a dot
(?=\w+\.\w+$) = there must be exactly one word, one dot then one word from here to the end
EDITED - Added look ahead for another dot
EDITED AGAIN - Changed look ahead for exactly one dot between words
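A quick way to see what the look-ahead version does and does not touch is to run it in Python, whose re module supports the same (?=...) syntax. This is only a sketch; the function name strip_one_label is made up for the example.

```python
import re

# Look-ahead version: remove one leading label, but only when exactly
# "word.word" is left between that point and the end of the line.
LOOKAHEAD = re.compile(r"^[^.]*\.(?=\w+\.\w+$)")

def strip_one_label(line):
    """Hypothetical helper applying the look-ahead replacement to one line."""
    return LOOKAHEAD.sub("", line)

for name in ["somesite1.com", "sub1.somesite3.com", "a.b.somesite3.com"]:
    print(name, "->", strip_one_label(name))
```

Because the look-ahead demands exactly one dot in the remainder, a name with two or more subdomain levels such as a.b.somesite3.com does not match and is left unchanged. A hyphenated name such as sub1.some-site.com is also left unchanged, since \w does not match a hyphen.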
Replace the whole line with its last two dot-separated words:
%s/^.*\.\(\w\+\.\w\+\)$/\1/g
Note that Vim requires a backslash before (, ) and +, as in \(, \) and \+.
UPDATE:
%s/^.*\.\([0-9a-z\-]\+\.[0-9a-z\-]\+\)$/\1/g
may be better, since \w does not match the hyphens that domain names can contain.
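The difference between the two character classes shows up on a hyphenated name. Below is a small Python sketch with both patterns translated from Vim syntax to Python syntax (no backslashes before the parentheses and +). The constant names and the sample hostname are made up for the example, and Python's \w is broader than Vim's for non-ASCII text, which does not matter for this input.

```python
import re

# First version, using \w for each of the last two labels.
WORD_CLASS = re.compile(r"^.*\.(\w+\.\w+)$")
# Updated version, with an explicit class that also allows hyphens.
HYPHEN_CLASS = re.compile(r"^.*\.([0-9a-z\-]+\.[0-9a-z\-]+)$")

name = "sub1.some-site3.com"  # hypothetical hostname containing a hyphen

word_result = WORD_CLASS.sub(r"\1", name)
hyphen_result = HYPHEN_CLASS.sub(r"\1", name)

print(word_result)    # the \w version does not match, so the line is unchanged
print(hyphen_result)  # the hyphen-aware version strips the subdomain
```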