Remove duplicate domains from list with regular expressions
I'd like to use PCRE to take a list of URIs and distill it, keeping only the first URL seen for each domain.
Start:
http://abcd.tld/products/widget1
http://abcd.tld/products/widget2
http://abcd.tld/products/review
http://1234.tld/
Finish:
http://abcd.tld/products/widget1
http://1234.tld/
Any ideas, dear members of StackOverflow?
You can use simple tools like uniq.
See kobi's example in the comments:
grep -o "^[^/]*//[^/]*/" urls.txt | sort | uniq
While it's INSANELY inefficient, it can be done...
(?<!^http://\2/.*?$.*)^(http://(.*?)/.*?$)
Please don't use this
Parse out the domain with a URI library and use it as the key in a hash, with the link as the value. Each URL overwrites any earlier one with the same host, so you end up with one unique link per domain.
Here's a Ruby example:
require 'uri'

unique_links = {}
links.each do |l|
  u = URI.parse(l)
  unique_links[u.host] = l
end

unique_links.values # returns an Array of the unique links
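For what it's worth, here is a self-contained sketch of the same idea run against the sample list, using ||= so the first link seen for each host wins (which matches the asker's expected output):
require 'uri'

links = [
  "http://abcd.tld/products/widget1",
  "http://abcd.tld/products/widget2",
  "http://abcd.tld/products/review",
  "http://1234.tld/"
]

unique_links = {}
links.each { |l| unique_links[URI.parse(l).host] ||= l } # ||= keeps the first link per host

unique_links.values
# => ["http://abcd.tld/products/widget1", "http://1234.tld/"]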
If you can work with the whole file as a single string, rather than line by line, then something like this should work (I'm not sure about the character ranges):
s!(\w+://[a-zA-Z0-9.]+/\S+/)([^ /]+)\n(\1[^ /]+\n)+!\1\2\n!
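Here is a minimal, untested sketch of how that substitution could be run from Ruby, assuming the list lives in a file called urls.txt (a placeholder name). The whole file is read as one string so the \1 back-reference can match across lines, and a trailing newline is kept in the replacement so the collapsed line isn't glued to the next one:
text = File.read("urls.txt") # slurp the whole file as a single string
puts text.gsub(%r{(\w+://[a-zA-Z0-9.]+/\S+/)([^ /]+)\n(\1[^ /]+\n)+}, "\\1\\2\n")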
If you have (g)awk on your system:
awk -F"/" '{
  # rebuild everything before the last "/" as the key
  s=$1
  for(i=2;i<NF;i++){ s=s"/"$i }
  # remember only the first trailing segment seen for that key
  if( !(s in a) ){ a[s]=$NF }
}
END{
  for(i in a) print i"/"a[i]
} ' file
output
$ ./shell.sh
http://abcd.tld/products/widget1
http://1234.tld/