Regex to find external links in an HTML file using grep

For the past few days I have been trying to write a regex that extracts all the external links from the web pages given to it, using grep.

Here is my grep command:

grep -h -o -e "\(\(mailto:\|\(\(ht\|f\)tp\(s\?\)\)\)\://\)\{1\}\(.*\?\)" "/mnt/websites_folder/folder_to_search" -r 

Now grep seems to return everything that follows the first external link on a given line.

Example:

If an HTML file contains something like this on a single line:

<a href="http://www.google.com">Google</a><p><a href='https://yahoo.com'>Yahoo</a></p>

then the grep command above returns the following result:

http://www.google.com">Google</a><p><a href='https://yahoo.com'>Yahoo</a></p>

The idea is that if an HTML file contains more than one link (whether in a, img, etc.) on the same line, the regex should fetch only the links and not the rest of that line.

I managed to build this on rubular.com; the regex is as follows:

("|')(\b((ht|f)tps?:\/\/)(.*?)\b)("|')

It works with the above input, but I am not able to replicate it in grep. Can anyone help? I can't modify the HTML file, so please don't ask me to do that. Nor can I look for each specific tag and check its attributes to get the external links, as that adds processing time and my application doesn't demand it.

Thank You


Try this:

grep -oE "(mailto:|(https?|ftp)://)[^'\"]+" /path/to/file

This outputs one link per line. It assumes every link is enclosed in single or double quotes, because the match stops at the first quote character. To exclude links to a certain domain, pipe the output through grep -v:

grep -oE "(mailto:|(https?|ftp)://)[^'\"]+" /path/to/file | grep -vF "yahoo.com"
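
As a quick check, here is a minimal sketch that runs this kind of pattern against the sample line from the question. The temporary file, the variable names, and the exact alternation used here (mailto: on its own, and http, https or ftp followed by ://) are assumptions made for illustration, so adjust them to your own files and schemes.

```shell
#!/bin/sh
# Hypothetical sample file holding the single line from the question.
sample=$(mktemp)
printf '%s\n' "<a href=\"http://www.google.com\">Google</a><p><a href='https://yahoo.com'>Yahoo</a></p>" > "$sample"

# -o prints only the matched text; [^'\"]+ stops at the closing quote,
# so each link ends where its attribute value ends.
links=$(grep -oE "(mailto:|(https?|ftp)://)[^'\"]+" "$sample")
printf '%s\n' "$links"

# Drop links to one domain; -F treats the dot as a literal character.
filtered=$(printf '%s\n' "$links" | grep -vF "yahoo.com")
printf '%s\n' "$filtered"

rm -f "$sample"
```

On the sample line this should print both links, one per line, and then only the Google link after filtering.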


By default grep prints the entire line a match was found on. The -o switch selects only the matched parts of a line. See the man page.
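
To make the difference concrete, the sketch below compares three runs on the sample line from the question: no -o, -o with a greedy .* tail, and -o with a negated character class. The variable names and the simplified https?:// prefix are assumptions for illustration only.

```shell
#!/bin/sh
# Sample line with two links on the same line (taken from the question).
line="<a href=\"http://www.google.com\">Google</a><p><a href='https://yahoo.com'>Yahoo</a></p>"

# Without -o: grep prints the whole matching line once.
whole=$(printf '%s\n' "$line" | grep -E "https?://[^'\"]+")

# With -o but a greedy .* tail: the match runs to the end of the line.
greedy=$(printf '%s\n' "$line" | grep -oE "https?://.*")

# With -o and a negated class: each match stops at the next quote.
only=$(printf '%s\n' "$line" | grep -oE "https?://[^'\"]+")

printf 'without -o:\n%s\n' "$whole"
printf 'with -o, greedy:\n%s\n' "$greedy"
printf 'with -o, negated class:\n%s\n' "$only"
```

The greedy run reproduces the unwanted output shown in the question, which suggests the problem there is the .* tail rather than a missing -o.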
