Match domain name from url (www.google.com=google)

So I want to match just the domain name from any of these:

http://www.google.com/test/
http://google.com/test/
http://google.net/test/

The output for all three should be: google

I got this working for just .com:

echo "http://www.google.com/test/" | sed -n "s/.*www\.\(.*\)\.com.*$/\1/p"
Output: 'google'

Then I thought it would be as simple as using, say, (com|net), but that doesn't seem to work:

echo "http://www.google.com/test/" | sed -n "s/.*www\.\(.*\)\.(com|net).*$/\1/p"
Output: '' (nothing)

I was going to use a similar method to get rid of the "www", but it seems I'm doing something wrong. (Does it not work with regex outside the \( \)?)


This will output "google" in all cases:

sed -n "s|http://\(.*\.\)*\(.*\)\..*|\2|p"

Edit:

This version will handle URLs like "http://google.com.cn/test" and "http://www.google.co.uk/" as well as the ones in the original question:

sed -nr "s|http://(www\.)?([^.]*)\.(.*\.?)*|\2|p"

This version will handle cases that don't include "http://" (plus the others):

sed -nr "s|(http://)?(www\.)?([^.]*)\.(.*\.?)*|\3|p"
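For anyone who wants to sanity-check the idea behind that last pattern without sed, here is a rough Python sketch. It is a simplified translation, not the sed expression itself: the helper name first_label and the pattern are my own, and it assumes the only scheme to strip is "http://".

```python
import re

# Simplified translation of the idea (optional "http://", optional "www.",
# then everything up to the first dot). Hypothetical helper, not the sed regex.
DOMAIN_RE = re.compile(r"^(?:http://)?(?:www\.)?([^./]+)\.")

def first_label(url):
    """Return the label before the first dot, or None if there is no dot."""
    match = DOMAIN_RE.match(url)
    return match.group(1) if match else None

samples = [
    "http://www.google.com/test/",
    "http://google.net/test/",
    "http://www.google.co.uk/",
    "www.google.com.cn/test",
    "google.com/test",
]
for url in samples:
    print(url, "->", first_label(url))
```

Like the sed versions, this only takes the first label after an optional "www."; it knows nothing about real public suffixes, so a host such as "mail.google.com" would give "mail".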


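If you have Python, you can use the urlparse module (urllib.parse in Python 3):

from urllib.parse import urlparse  # Python 2: from urlparse import urlparse

with open("file") as urls:
    for line in urls:
        # strip() drops the trailing newline before parsing
        labels = urlparse(line.strip()).netloc.split(".")
        # skip a leading "www" label if there is one
        if labels[0] == "www":
            print(labels[1])
        else:
            print(labels[0])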

Output:

$ cat file
http://www.google.com/test/
http://google.com/test/
http://google.net/test/

$ ./python.py
google
google
google
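One thing to watch with the urlparse approach: it only fills in the host when the URL has a scheme (or starts with "//"), so a bare "www.google.com.cn/test" ends up entirely in the path. A minimal sketch of one way around that, assuming inputs without "://" are host-first strings; domain_label is a name I made up for illustration:

```python
from urllib.parse import urlparse

def domain_label(url):
    """Return the first host label, skipping a leading 'www' (hypothetical helper)."""
    url = url.strip()
    # Assumption: anything without "://" is a bare "host/path" string.
    # Prefixing "//" makes urlparse treat the first part as the host.
    if "://" not in url:
        url = "//" + url
    host = urlparse(url).hostname or ""
    labels = host.split(".")
    if labels[0] == "www" and len(labels) > 1:
        return labels[1]
    return labels[0]

for url in ["http://www.google.com/test/", "www.google.com.cn/test", "google.com/test"]:
    print(domain_label(url))
```

All three sample lines print "google". An input with no recognisable host gives an empty string rather than an exception.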

Or you can use awk:

awk '{
    gsub(/http:\/\/|\/.*$/, "")   # strip the scheme and everything after the host
    split($0, d, ".")
    if (d[1] == "www") {
        print d[2]
    } else {
        print d[1]
    }
}' file

$ cat file
http://www.google.com/test/
http://google.com/test/
http://google.net/test/
www.google.com.cn/test
google.com/test

$ ./shell.sh
google
google
google
google
google


s|http://(www\.)?([^.]*).*|$2|

It's Perl with alternate delimiters (because that makes it more legible); the trailing .* is there so the rest of the URL is dropped as well. I'm sure you can port it to sed or whatever you need.


#!/bin/bash

urls=(
  http://www.google.com/test/
  http://google.com/test/
  http://google.net/test/
)

for url in "${urls[@]}"; do
  echo "$url" | sed -re 's,^http://(.*\.)*(.+)\.[a-z]+/.+$,\2,'
done


Have you tried using the "-r" switch on your sed command? This enables the extended regular expression mode (egrep-compatible regexes).

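Edit: try this, it seems to work. With -r, the parentheses and the | no longer need backslashes. Note that sed's extended regular expressions have no "?:" non-capturing syntax (that is a Perl extension), so (com|net) simply becomes a second capture group, which is harmless because only \1 is used.

echo "http://www.google.com/test/" | sed -nr "s/.*www\.(.*)\.(com|net).*$/\1/p"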
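For comparison, Perl-style engines do have a non-capturing "(?:...)" group. Here is a small sketch using Python's re module, just to show what that syntax does; the names TLD_RE and name_before_tld are mine, and this is not a claim about how sed behaves:

```python
import re

# (?:com|net) groups the alternation without creating a capture group,
# so the name between "www." and the TLD is the only captured group.
TLD_RE = re.compile(r"www\.(.*)\.(?:com|net)")

def name_before_tld(url):
    """Return the text between 'www.' and '.com' or '.net', or None if absent."""
    match = TLD_RE.search(url)
    return match.group(1) if match else None

print(TLD_RE.groups)  # number of capture groups in the pattern
print(name_before_tld("http://www.google.com/test/"))
print(name_before_tld("http://www.google.net/test/"))
```

This still requires the "www." prefix and only knows about .com and .net, just like the pattern in the question.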