
How do I get the URLs out of an HTML file?

I need to get a long list of valid URLs for testing my DNS server. I found a web page that has a ton of links in it that would probably yield quite a lot of good links (http://www.cse.psu.edu/~groenvel/urls.html), and I figured that the easiest way to do this would be to download the HTML file and simply grep for the URLs. However, I can't get it to output only the links.

I know there are lots of ways to do this. I'm not picky how it's done.

Given the URL above, I want a list of all of the URLs (one per line) like this:

http://www.cse.psu.edu/~groenvel/
http://www.acard.com/
http://www.acer.com/
...


Method 1

Step 1:

wget "http://www.cse.psu.edu/~groenvel/urls.html"

Step 2:

perl -0ne 'print "$1\n" while (/a href=\"(.*?)\">.*?<\/a>/igs)' /PATH_TO_YOUR/urls.html | grep 'http://' > /PATH_TO_YOUR/urls.txt

Just replace "/PATH_TO_YOUR/" with your file path. This yields a text file containing only the URLs.
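
As a sanity check, the first few lines of the resulting file should match the sample output in the question (assuming the page's markup hasn't changed):

head -3 /PATH_TO_YOUR/urls.txt
http://www.cse.psu.edu/~groenvel/
http://www.acard.com/
http://www.acer.com/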

Method 2

If you have lynx installed, you can do this in a single step:

Step 1:

lynx --dump http://www.cse.psu.edu/~groenvel/urls.html | awk '/(http|https):\/\// {print $2}' > /PATH_TO_YOUR/urls.txt
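
The awk filter works because lynx --dump appends a numbered "References" list to the rendered page, roughly like the lines below, so the URL is the second whitespace-separated field:

References

   1. http://www.cse.psu.edu/~groenvel/
   2. http://www.acard.com/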

Method 3

Using curl:

Step 1:

curl http://www.cse.psu.edu/~groenvel/urls.html 2>&1 | egrep -o  "(http|https):.*\">" | awk  'BEGIN {FS="\""};{print $1}' > /PATH_TO_YOUR/urls.txt
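
Note that if a line contains more than one link, the greedy egrep pattern above only captures the first URL on that line. A tighter sketch along the same lines, assuming GNU or BSD grep's -E and -o options, is:

curl -s http://www.cse.psu.edu/~groenvel/urls.html | grep -Eo 'href="https?://[^"]+"' | sed -e 's/^href="//' -e 's/"$//' > /PATH_TO_YOUR/urls.txt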

Method 4

Using wget:

wget -qO- http://www.cse.psu.edu/~groenvel/urls.html 2>&1 | egrep -o  "(http|https):.*\">" | awk  'BEGIN {FS="\""};{print $1}' > /PATH_TO_YOUR/urls.txt
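
Whichever method you use, it may help to deduplicate the list before feeding it to your DNS tests; sort -u does that (the output file name here is just an example):

sort -u /PATH_TO_YOUR/urls.txt > /PATH_TO_YOUR/unique_urls.txt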


You need wget, grep, and sed. I will try a solution and update my post later.

Update:

wget [the_url]

cat urls.html | egrep -i '<a href=".*">' | sed -e 's/.*<A HREF="\([^"]*\)">.*/\1/i'
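
The two steps can also be combined into one pipeline; a sketch using the same tools (wget -qO- as in Method 4, plus GNU sed's case-insensitive flag) would be:

wget -qO- [the_url] | egrep -io '<a href="[^"]*"' | sed -e 's/.*href="//i' -e 's/"$//'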
