
Finding unique file names in an HTML file

$ cat downloaded_file.html

1373 <A HREF="http://site.com/STDMON11202010_company.txt">STDMON11202010_company.txt</A><br> Monday, November 22, 2010  1:31 AM  

How do I search an HTML file from my shell script and select the unique file names that start with STDMON and end with _company.txt?


If you have only digits between STDMON and _company.txt you can do:

grep -o 'STDMON[0-9]*_company\.txt' downloaded_file.html | sort -u


If anything can appear between STDMON and _company.txt, use a non-greedy match (the -P option requires GNU grep):

grep -oP 'STDMON.*?_company\.txt' downloaded_file.html | sort -u
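Where `-P` is not available, a bracket expression can stand in for the non-greedy match. The following is only a sketch: the helper name `unique_names` and the sample lines are made up for illustration, and it assumes the file names never contain a double quote or an angle bracket. It also relies on `grep -o`, which GNU and BSD grep provide but POSIX does not require.

```shell
#!/bin/sh
# Hypothetical helper: print each unique STDMON*_company.txt name found in file $1.
unique_names() {
    # [^"<>]* cannot cross a quote or an angle bracket, so a match stays
    # inside a single attribute value or a single link text.
    grep -oE 'STDMON[^"<>]*_company\.txt' "$1" | sort -u
}

# Small sample file, made up for the demonstration.
sample=$(mktemp)
cat > "$sample" <<'EOF'
1373 <A HREF="http://site.com/STDMON11202010_company.txt">STDMON11202010_company.txt</A><br>
1373 <A HREF="http://site.com/STDMONabc_company.txt">STDMONabc_company.txt</A><br>
EOF

unique_names "$sample"
rm -f "$sample"
```

Because the match cannot run past the closing quote of the HREF, the name in the URL and the name in the link text are reported as separate matches, and `sort -u` then collapses the duplicates.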


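With awk, split each line on angle brackets so that the link text becomes the third field, then print each matching name only the first time it appears:

awk -F'>|<' '$3 ~ /STDMON[0-9]+_company\.txt/ && !a[$0=$3]++' downloaded_file.html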

Input

$ cat downloaded_file.html
1373 <A HREF="http://site.com/STDMON11202010_company.txt">STDMON11202010_company.txt</A><br> Monday, November 22, 2010  1:31 AM
1373 <A HREF="http://site.com/STDMON11202010_company.txt">STDMON11202010_company.txt</A><br> Monday, November 22, 2010  1:31 AM
1373 <A HREF="http://site.com/STDMON11202010_company.txt">STDMON14959440_company.txt</A><br> Monday, November 22, 2010  1:31 AM
1373 <A HREF="http://site.com/STDMON11202010_company.txt">STDMON11202010_company.txt</A><br> Monday, November 22, 2010  1:31 AM
1373 <A HREF="http://site.com/STDMON11202010_company.txt">STDMON14959440_company.txt</A><br> Monday, November 22, 2010  1:31 AM
1373 <A HREF="http://site.com/STDMON11202010_company.txt">STDMON11202010_company.txt</A><br> Monday, November 22, 2010  1:31 AM
1373 <A HREF="http://site.com/STDMON11202010_company.txt">STDMON12342440_company.txt</A><br> Monday, November 22, 2010  1:31 AM

Output

$ awk -F'>|<' '$3 ~ /STDMON[0-9]+_company\.txt/ && !a[$0=$3]++' downloaded_file.html
STDMON11202010_company.txt
STDMON14959440_company.txt
STDMON12342440_company.txt
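The one-liner is compact, so here is a spelled-out sketch of the same idea. The helper name `link_text_names`, the array name `seen`, and the shortened sample lines are assumptions made for illustration, and the pattern is anchored so that the whole link text must match.

```shell
#!/bin/sh
# Hypothetical helper: print unique link texts from file $1, in first-seen order.
link_text_names() {
    awk -F'[<>]' '
        # With < and > as separators, $2 is the opening tag and $3 is the link text.
        $3 ~ /^STDMON[0-9]+_company\.txt$/ {
            # seen[] counts occurrences; print only on the first one.
            if (!seen[$3]++) print $3
        }
    ' "$1"
}

# Small sample file, shortened from the input shown above.
sample=$(mktemp)
cat > "$sample" <<'EOF'
1373 <A HREF="http://site.com/STDMON11202010_company.txt">STDMON11202010_company.txt</A><br> Monday
1373 <A HREF="http://site.com/STDMON11202010_company.txt">STDMON14959440_company.txt</A><br> Monday
1373 <A HREF="http://site.com/STDMON11202010_company.txt">STDMON11202010_company.txt</A><br> Monday
EOF

link_text_names "$sample"
rm -f "$sample"
```

Two differences from the grep approach are worth keeping in mind. This version reads only the link text, so a name that occurs solely inside an HREF is not reported; in the sample input every HREF points at STDMON11202010 while the link texts vary. It also keeps names in the order they first appear rather than sorting them.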