
Extracting all links from webpage [closed]

It's difficult to tell what is being asked here. This question is ambiguous, vague, incomplete, overly broad, or rhetorical and cannot be reasonably answered in its current form. For help clarifying this question so that it can be reopened, visit the help center. Closed 11 years ago.

I want to write a function that takes a web page URL, downloads the page, and returns a list of the URLs in that page (using the urllib module). Any help would be appreciated.


Here you go (Python 2, using the third-party lxml library):

import sys
import urllib2
import lxml.html

# Take the target URL from the command line.
try:
    url = sys.argv[1]
except IndexError:
    print "Specify a url to scrape"
    sys.exit(1)

if not url.startswith("http://"):
    print "Please include the http:// at the beginning of the url"
    sys.exit(1)

# Download the page and parse it into an element tree.
html = urllib2.urlopen(url).read()
etree = lxml.html.fromstring(html)

# The XPath expression selects the href attribute of every <a> element.
for href in etree.xpath("//a/@href"):
    print href

C:\Programming>getlinks.py http://example.com
/
/domains/
/numbers/
/protocols/
/about/
/go/rfc2606
/about/
/about/presentations/
/about/performance/
/reports/
/domains/
/domains/root/
/domains/int/
/domains/arpa/
/domains/idn-tables/
/protocols/
/numbers/
/abuse/
http://www.icann.org/
mailto:iana@iana.org?subject=General%20website%20feedback
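
Since the question specifically asks for the urllib module, here is a minimal Python 3 sketch that sticks to the standard library: urllib.request for the download and html.parser for the link extraction. The get_links function and LinkParser class are hypothetical names, and the decoding step assumes a UTF-8 page. It also resolves relative references such as /domains/ against the page URL with urllib.parse.urljoin:

import sys
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    # Collect the href attribute of every <a> tag encountered.
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def get_links(url):
    # Download the page (assuming UTF-8) and return the URLs it links to.
    html = urlopen(url).read().decode("utf-8", errors="replace")
    parser = LinkParser()
    parser.feed(html)
    # Make relative links (e.g. "/domains/") absolute against the page URL.
    return [urljoin(url, href) for href in parser.links]

if __name__ == "__main__":
    for link in get_links(sys.argv[1]):
        print(link)

Unlike the lxml version above, this needs no third-party packages, at the cost of a slightly more verbose parser.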
