What is a simple way to extract the list of URLs on a webpage using Python? [closed]
I want to create a simple web crawler for fun. I need the crawler to get a list of all the links on one page. Does the Python standard library have any built-in functions that would make this easier? Thanks, any knowledge is appreciated.
This is actually very simple with BeautifulSoup.
from BeautifulSoup import BeautifulSoup
[element['href'] for element in BeautifulSoup(document_contents).findAll('a', href=True)]
# [u'http://example.com/', u'/example', ...]
One last thing: you can use urlparse.urljoin to make all the URLs absolute. If you need the link text, you can use something like element.contents[0].
And here's how you might tie it all together:
import urllib2
import urlparse
from BeautifulSoup import BeautifulSoup
def get_all_link_targets(url):
    return [urlparse.urljoin(url, tag['href']) for tag in
            BeautifulSoup(urllib2.urlopen(url)).findAll('a', href=True)]
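If you also want the link text mentioned above, a similar comprehension can pair each anchor's text with its absolutized href. This is just a sketch using the same BeautifulSoup 3 / urllib2 setup as above; get_links_with_text is an illustrative name, and it falls back to an empty string for anchors with no text:
import urllib2
import urlparse
from BeautifulSoup import BeautifulSoup

def get_links_with_text(url):
    # Returns (link text, absolute URL) pairs for every <a href=...> tag.
    soup = BeautifulSoup(urllib2.urlopen(url))
    return [(tag.contents[0] if tag.contents else '',
             urlparse.urljoin(url, tag['href']))
            for tag in soup.findAll('a', href=True)]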
There's an article on using HTMLParser to get URLs from <a>
tags on a webpage.
The code is this:
from HTMLParser import HTMLParser
from urllib2 import urlopen
class Spider(HTMLParser):
    def __init__(self, url):
        HTMLParser.__init__(self)
        req = urlopen(url)
        self.feed(req.read())

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; attrs[0][1] is the value of
        # the tag's first attribute, which is usually (but not always) href --
        # hence the stray style/class values in the output below.
        if tag == 'a' and attrs:
            print "Found link => %s" % attrs[0][1]

Spider('http://www.python.org')
If you ran that script, you'd get output like this:
rafe@linux-7o1q:~> python crawler.py
Found link => /
Found link => #left-hand-navigation
Found link => #content-body
Found link => /search
Found link => /about/
Found link => /news/
Found link => /doc/
Found link => /download/
Found link => /community/
Found link => /psf/
Found link => /dev/
Found link => /about/help/
Found link => http://pypi.python.org/pypi
Found link => /download/releases/2.7/
Found link => http://docs.python.org/
Found link => /ftp/python/2.7/python-2.7.msi
Found link => /ftp/python/2.7/Python-2.7.tar.bz2
Found link => /download/releases/3.1.2/
Found link => http://docs.python.org/3.1/
Found link => /ftp/python/3.1.2/python-3.1.2.msi
Found link => /ftp/python/3.1.2/Python-3.1.2.tar.bz2
Found link => /community/jobs/
Found link => /community/merchandise/
Found link => margin-top:1.5em
Found link => margin-top:1.5em
Found link => margin-top:1.5em
Found link => color:#D58228; margin-top:1.5em
Found link => /psf/donations/
Found link => http://wiki.python.org/moin/Languages
Found link => http://wiki.python.org/moin/Languages
Found link => http://www.google.com/calendar/ical/b6v58qvojllt0i6ql654r1vh00%40group.calendar.google.com/public/basic.ics
Found link => http://wiki.python.org/moin/Python2orPython3
Found link => http://pypi.python.org/pypi
Found link => /3kpoll
Found link => /about/success/usa/
Found link => reference
Found link => reference
Found link => reference
Found link => reference
Found link => reference
Found link => reference
Found link => /about/quotes
Found link => http://wiki.python.org/moin/WebProgramming
Found link => http://wiki.python.org/moin/CgiScripts
Found link => http://www.zope.org/
Found link => http://www.djangoproject.com/
Found link => http://www.turbogears.org/
Found link => http://wiki.python.org/moin/PythonXml
Found link => http://wiki.python.org/moin/DatabaseProgramming/
Found link => http://www.egenix.com/files/python/mxODBC.html
Found link => http://sourceforge.net/projects/mysql-python
Found link => http://wiki.python.org/moin/GuiProgramming
Found link => http://wiki.python.org/moin/WxPython
Found link => http://wiki.python.org/moin/TkInter
Found link => http://wiki.python.org/moin/PyGtk
Found link => http://wiki.python.org/moin/PyQt
Found link => http://wiki.python.org/moin/NumericAndScientific
Found link => http://www.pasteur.fr/recherche/unites/sis/formation/python/index.html
Found link => http://www.pentangle.net/python/handbook/
Found link => /community/sigs/current/edu-sig
Found link => http://www.openbookproject.net/pybiblio/
Found link => http://osl.iu.edu/~lums/swc/
Found link => /about/apps
Found link => http://docs.python.org/howto/sockets.html
Found link => http://twistedmatrix.com/trac/
Found link => /about/apps
Found link => http://buildbot.net/trac
Found link => http://www.edgewall.com/trac/
Found link => http://roundup.sourceforge.net/
Found link => http://wiki.python.org/moin/IntegratedDevelopmentEnvironments
Found link => /about/apps
Found link => http://www.pygame.org/news.html
Found link => http://www.alobbs.com/pykyra
Found link => http://www.vrplumber.com/py3d.py
Found link => /about/apps
Found link => reference external
Found link => reference external
Found link => reference external
Found link => reference external
Found link => reference external
Found link => reference external
Found link => reference external
Found link => reference external
Found link => reference external
Found link => reference external
Found link => reference external
Found link => reference external
Found link => reference external
Found link => /channews.rdf
Found link => /about/website
Found link => http://www.xs4all.com/
Found link => http://www.timparkin.co.uk/
Found link => /psf/
Found link => /about/legal
You can then use a regex to distinguish between absolute and relative URLs.
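For example, here's a minimal sketch of that idea (the pattern and the is_absolute() helper are just illustrative; anything starting with a scheme such as http:// is treated as absolute):
import re

# Matches a URL scheme followed by "://", e.g. "http://" or "https://".
ABSOLUTE_RE = re.compile(r'^[a-zA-Z][a-zA-Z0-9+.-]*://')

def is_absolute(url):
    return bool(ABSOLUTE_RE.match(url))

links = ['/about/', 'http://docs.python.org/', '#content-body']
absolute = [u for u in links if is_absolute(u)]      # ['http://docs.python.org/']
relative = [u for u in links if not is_absolute(u)]  # ['/about/', '#content-body']
Alternatively, checking whether urlparse.urlsplit(url).netloc is non-empty does the same job without a regex.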
A solution using libxml2:
import urllib
import libxml2

url = 'http://www.python.org/'  # example page to fetch

# Parse leniently: recover from broken markup and suppress errors/warnings.
parse_opts = libxml2.HTML_PARSE_RECOVER + \
             libxml2.HTML_PARSE_NOERROR + \
             libxml2.HTML_PARSE_NOWARNING

doc = libxml2.htmlReadDoc(urllib.urlopen(url).read(), '', None, parse_opts)
print [i.getContent() for i in doc.xpathNewContext().xpathEval("//a/@href")]
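As with the BeautifulSoup answer, the XPath query returns the raw href values, so urlparse.urljoin can make them absolute if you need that (a sketch continuing from the snippet above):
import urlparse

absolute_urls = [urlparse.urljoin(url, i.getContent())
                 for i in doc.xpathNewContext().xpathEval("//a/@href")]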