Extracting all links from webpage [closed]
I want to write a function that takes a web page URL, downloads that page, and returns a list of the URLs it contains (using the urllib module). Any help would be appreciated.
Here you go:
import sys
import urllib2
import lxml.html

# Read the target URL from the command line.
try:
    url = sys.argv[1]
except IndexError:
    print "Specify a url to scrape"
    sys.exit(1)

if not url.startswith("http://"):
    print "Please include the http:// at the beginning of the url"
    sys.exit(1)

# Download the page and print the href attribute of every <a> tag.
html = urllib2.urlopen(url).read()
etree = lxml.html.fromstring(html)
for href in etree.xpath("//a/@href"):
    print href
C:\Programming>getlinks.py http://example.com
/
/domains/
/numbers/
/protocols/
/about/
/go/rfc2606
/about/
/about/presentations/
/about/performance/
/reports/
/domains/
/domains/root/
/domains/int/
/domains/arpa/
/domains/idn-tables/
/protocols/
/numbers/
/abuse/
http://www.icann.org/
mailto:iana@iana.org?subject=General%20website%20feedback
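Since the question asks for a function that returns the links as a list using only the urllib module, here is a minimal sketch in the same Python 2 style, built on the standard-library urllib, urlparse, and HTMLParser modules. The LinkCollector class and get_links function are illustrative names I made up for this example, not part of any existing library:

import urllib
import urlparse
from HTMLParser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag encountered."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def get_links(url):
    """Download the page at url and return its links as a list of absolute URLs."""
    html = urllib.urlopen(url).read()
    parser = LinkCollector()
    parser.feed(html)
    # Resolve relative hrefs such as /about/ against the page URL.
    return [urlparse.urljoin(url, href) for href in parser.links]

print get_links("http://example.com")

The lxml approach above is generally more forgiving of malformed HTML; this version simply avoids the extra dependency and returns a list as requested.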