Remove duplicate domains from list with regular expressions
I'd like to use PCRE to take a list of URIs and distill it, keeping only the first URL seen for each domain.
Start:
http://abcd.tld/products/widget1
http://abcd.tld/products/widget2
http://abcd.tld/products/review
http://1234.tld/
Finish:
http://abcd.tld/products/widget1
http://1234.tld/
Any ideas, dear members of StackOverflow?
You can use simple tools like uniq.
See kobi's example in the comments:
grep -o "^[^/]*//[^/]*/" urls.txt | sort | uniq
While it's INSANELY inefficient, it can be done...
(?<!^http://\2/.*?$.*)^(http://(.*?)/.*?$)
Please don't use this
Parse out the domain with a URI library and use it as the key in a hash, with the link as the value. Each URL overwrites any earlier one with the same host, so you end up with one unique link per domain.
Here's a Ruby example:
require 'uri'

unique_links = {}
links.each do |l|
  u = URI.parse(l)
  unique_links[u.host] = l
end

unique_links.values # returns an Array of the unique links
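For what it's worth, here is a self-contained sketch of the same idea run against the sample list, using ||= so the first link seen for each host wins (which matches the asker's expected output):
require 'uri'

links = [
  "http://abcd.tld/products/widget1",
  "http://abcd.tld/products/widget2",
  "http://abcd.tld/products/review",
  "http://1234.tld/"
]

unique_links = {}
links.each { |l| unique_links[URI.parse(l).host] ||= l } # ||= keeps the first link per host

unique_links.values
# => ["http://abcd.tld/products/widget1", "http://1234.tld/"]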
If you can work with the whole file as a single string, rather than line by line, then something like this should work (I'm not sure about the character ranges):
s!(\w+://[a-zA-Z0-9.]+/\S+/)([^ /]+)\n(\1[^ /]+\n)+!\1\2\n!
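Here is a minimal, untested sketch of how that substitution could be run from Ruby, assuming the list lives in a file called urls.txt (a placeholder name). The whole file is read as one string so the \1 back-reference can match across lines, and a trailing newline is kept in the replacement so the collapsed line isn't glued to the next one:
text = File.read("urls.txt") # slurp the whole file as a single string
puts text.gsub(%r{(\w+://[a-zA-Z0-9.]+/\S+/)([^ /]+)\n(\1[^ /]+\n)+}, "\\1\\2\n")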
If you have (g)awk on your system:
awk -F"/" '{
  # rebuild everything before the last "/" as the key
  s=$1
  for(i=2;i<NF;i++){ s=s"/"$i }
  # remember only the first trailing segment seen for that key
  if( !(s in a) ){ a[s]=$NF }
}
END{
  for(i in a) print i"/"a[i]
} ' file
output
$ ./shell.sh
http://abcd.tld/products/widget1
http://1234.tld/