Regex to find external links in an HTML file using grep
For the past few days I have been trying to write a regex that extracts all the external links from the web pages given to it, using grep.
Here is my grep command:
grep -h -o -e "\(\(mailto:\|\(\(ht\|f\)tp\(s\?\)\)\)\://\)\{1\}\(.*\?\)" "/mnt/websites_folder/folder_to_search" -r
Now grep seems to return everything that follows the first external link on a given line.
Example:
If an HTML file contains something like this on a single line:
<a href="http://www.google.com">Google</a><p><a href='https://yahoo.com'>Yahoo</a></p>
then the grep command returns the following result:
http://www.google.com">Google</a><p><a href='https://yahoo.com'>Yahoo</a></p>
The idea is that if an HTML file contains more than one link on the same line (regardless of whether it is in an a, img, or other tag), the regex should fetch only the links and not the rest of that line.
I managed to build a working version on rubular.com; the regex is as follows:
("|')(\b((ht|f)tps?:\/\/)(.*?)\b)("|')
It works with the input above, but I am not able to replicate it in grep. Can anyone help? I can't modify the HTML files, so please don't ask me to do that. I also can't go through each specific tag and check its attributes to get the external links, because that adds processing time and my application doesn't need it.
Thank You
Try this:
cat /path/to/file | egrep -o "(mailto:|(ftp|https?)://)[^'\"]+"
or, without the unnecessary cat:
egrep -o "(mailto:|(ftp|https?)://)[^'\"]+" /path/to/file
This outputs one link per line. It assumes every link is enclosed in single or double quotes. To exclude links to a particular domain, filter the result with -v:
egrep -o "(mailto:|(ftp|https?)://)[^'\"]+" /path/to/file | egrep -v "yahoo\.com"
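A quick way to check this kind of pattern is to run it against a line modeled on the question's example. The following is only a sketch: `extract_links` is a hypothetical helper name, the sample line is an assumption based on the output shown in the question, and the pattern spells out `mailto:` and `://` explicitly so that a bare word such as "ftp" in the page text is not picked up.

```shell
#!/bin/sh
# Hypothetical helper: print every quoted link found on stdin, one per line.
extract_links() {
    grep -oE "(mailto:|(ftp|https?)://)[^'\"]+"
}

# Sample line modeled on the question's example (two links on one line).
sample="<a href=\"http://www.google.com\">Google</a><p><a href='https://yahoo.com'>Yahoo</a></p>"

# Extract all links, then drop the ones pointing at one domain.
all_links=$(printf '%s\n' "$sample" | extract_links)
# -F treats the domain as a fixed string, so the dot is not a wildcard.
filtered=$(printf '%s\n' "$all_links" | grep -vF "yahoo.com")

printf 'all links:\n%s\n' "$all_links"
printf 'without yahoo.com:\n%s\n' "$filtered"
```

Using `grep -vF` for the exclusion step is a design choice, not a requirement; an escaped regex such as `yahoo\.com` with plain `-v` should behave the same for this input.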
By default, grep prints the entire line on which a match was found. The -o switch prints only the matched parts of the line. See the man page.
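The difference can be seen side by side in a small sketch. It assumes a sample line modeled on the question's example; the variable names are illustrative only. It also shows why the original command overshoots: grep's default POSIX syntax has no lazy `.*?`, so `.*` runs to the end of the line, while a negated character class stops at the closing quote.

```shell
#!/bin/sh
# Sample line modeled on the question's example (an assumption, not real data).
line="<a href=\"http://www.google.com\">Google</a><p><a href='https://yahoo.com'>Yahoo</a></p>"

# Without -o: grep prints the whole matching line once.
whole=$(printf '%s\n' "$line" | grep -E "https?://")

# With -o but a greedy ".*": the match runs to the end of the line.
greedy=$(printf '%s\n' "$line" | grep -oE "https?://.*")

# With -o and a negated class: each match stops at the closing quote.
bounded=$(printf '%s\n' "$line" | grep -oE "https?://[^'\"]+")

printf 'whole line:\n%s\n\n' "$whole"
printf 'greedy:\n%s\n\n' "$greedy"
printf 'bounded:\n%s\n' "$bounded"
```

The greedy variant reproduces the unwanted result described in the question, while the bounded variant yields one link per output line.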