What is a simple way to extract the list of URLs on a webpage using Python? [closed]
I want to create a simple web crawler for fun. I need the crawler to get a list of all the links on one page. Does the Python standard library have any built-in functions that would make this easier? Thanks, any knowledge is appreciated.
This is actually very simple with BeautifulSoup.
from BeautifulSoup import BeautifulSoup
[element['href'] for element in BeautifulSoup(document_contents).findAll('a', href=True)]
# [u'http://example.com/', u'/example', ...]
One last thing: you can use urlparse.urljoin to make all the URLs absolute. If you need the link text, you can use something like element.contents[0].
And here's how you might tie it all together:
import urllib2
import urlparse
from BeautifulSoup import BeautifulSoup
def get_all_link_targets(url):
    return [urlparse.urljoin(url, tag['href']) for tag in
            BeautifulSoup(urllib2.urlopen(url)).findAll('a', href=True)]
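If you also want the link text mentioned above, a similar comprehension can pair each anchor's text with its absolutized href. This is just a sketch using the same BeautifulSoup 3 / urllib2 setup as above; get_links_with_text is an illustrative name, and it falls back to an empty string for anchors with no text:
import urllib2
import urlparse
from BeautifulSoup import BeautifulSoup

def get_links_with_text(url):
    # Returns (link text, absolute URL) pairs for every <a href=...> tag.
    soup = BeautifulSoup(urllib2.urlopen(url))
    return [(tag.contents[0] if tag.contents else '',
             urlparse.urljoin(url, tag['href']))
            for tag in soup.findAll('a', href=True)]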
There's an article on using HTMLParser to get URLs from <a>
tags on a webpage.
The code is this:
from HTMLParser import HTMLParser
from urllib2 import urlopen
class Spider(HTMLParser):
    def __init__(self, url):
        HTMLParser.__init__(self)
        req = urlopen(url)
        self.feed(req.read())

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; attrs[0][1] is the value of
        # the tag's first attribute, which is usually (but not always) href --
        # hence the stray style/class values in the output below.
        if tag == 'a' and attrs:
            print "Found link => %s" % attrs[0][1]

Spider('http://www.python.org')
If you ran that script, you'd get output like this:
rafe@linux-7o1q:~> python crawler.py
Found link => /
Found link => #left-hand-navigation
Found link => #content-body
Found link => /search
Found link => /about/
Found link => /news/
Found link => /doc/
Found link => /download/
Found link => /community/
Found link => /psf/
Found link => /dev/
Found link => /about/help/
Found link => http://pypi.python.org/pypi
Found link => /download/releases/2.7/
Found link => http://docs.python.org/
Found link => /ftp/python/2.7/python-2.7.msi
Found link => /ftp/python/2.7/Python-2.7.tar.bz2
Found link => /download/releases/3.1.2/
Found link => http://docs.python.org/3.1/
Found link => /ftp/python/3.1.2/python-3.1.2.msi
Found link => /ftp/python/3.1.2/Python-3.1.2.tar.bz2
Found link => /community/jobs/
Found link => /community/merchandise/
Found link => margin-top:1.5em
Found link => margin-top:1.5em
Found link => margin-top:1.5em
Found link => color:#D58228; margin-top:1.5em
Found link => /psf/donations/
Found link => http://wiki.python.org/moin/Languages
Found link => http://wiki.python.org/moin/Languages
Found link => http://www.google.com/calendar/ical/b6v58qvojllt0i6ql654r1vh00%40group.calendar.google.com/public/basic.ics
Found link => http://wiki.python.org/moin/Python2orPython3
Found link => http://pypi.python.org/pypi
Found link => /3kpoll
Found link => /about/success/usa/
Found link => reference
Found link => reference
Found link => reference
Found link => reference
Found link => reference
Found link => reference
Found link => /about/quotes
Found link => http://wiki.python.org/moin/WebProgramming
Found link => http://wiki.python.org/moin/CgiScripts
Found link => http://www.zope.org/
Found link => http://www.djangoproject.com/
Found link => http://www.turbogears.org/
Found link => http://wiki.python.org/moin/PythonXml
Found link => http://wiki.python.org/moin/DatabaseProgramming/
Found link => http://www.egenix.com/files/python/mxODBC.html
Found link => http://sourceforge.net/projects/mysql-python
Found link => http://wiki.python.org/moin/GuiProgramming
Found link => http://wiki.python.org/moin/WxPython
Found link => http://wiki.python.org/moin/TkInter
Found link => http://wiki.python.org/moin/PyGtk
Found link => http://wiki.python.org/moin/PyQt
Found link => http://wiki.python.org/moin/NumericAndScientific
Found link => http://www.pasteur.fr/recherche/unites/sis/formation/python/index.html
Found link => http://www.pentangle.net/python/handbook/
Found link => /community/sigs/current/edu-sig
Found link => http://www.openbookproject.net/pybiblio/
Found link => http://osl.iu.edu/~lums/swc/
Found link => /about/apps
Found link => http://docs.python.org/howto/sockets.html
Found link => http://twistedmatrix.com/trac/
Found link => /about/apps
Found link => http://buildbot.net/trac
Found link => http://www.edgewall.com/trac/
Found link => http://roundup.sourceforge.net/
Found link => http://wiki.python.org/moin/IntegratedDevelopmentEnvironments
Found link => /about/apps
Found link => http://www.pygame.org/news.html
Found link => http://www.alobbs.com/pykyra
Found link => http://www.vrplumber.com/py3d.py
Found link => /about/apps
Found link => reference external
Found link => reference external
Found link => reference external
Found link => reference external
Found link => reference external
Found link => reference external
Found link => reference external
Found link => reference external
Found link => reference external
Found link => reference external
Found link => reference external
Found link => reference external
Found link => reference external
Found link => /channews.rdf
Found link => /about/website
Found link => http://www.xs4all.com/
Found link => http://www.timparkin.co.uk/
Found link => /psf/
Found link => /about/legal
You can then use a regex to distinguish between absolute and relative URLs.
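For example, here's a minimal sketch of that idea (the pattern and the is_absolute() helper are just illustrative; anything starting with a scheme such as http:// is treated as absolute):
import re

# Matches a URL scheme followed by "://", e.g. "http://" or "https://".
ABSOLUTE_RE = re.compile(r'^[a-zA-Z][a-zA-Z0-9+.-]*://')

def is_absolute(url):
    return bool(ABSOLUTE_RE.match(url))

links = ['/about/', 'http://docs.python.org/', '#content-body']
absolute = [u for u in links if is_absolute(u)]      # ['http://docs.python.org/']
relative = [u for u in links if not is_absolute(u)]  # ['/about/', '#content-body']
Alternatively, checking whether urlparse.urlsplit(url).netloc is non-empty does the same job without a regex.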
A solution using libxml2:
import urllib
import libxml2

url = 'http://www.python.org/'  # example page to fetch

# Parse leniently: recover from broken markup and suppress errors/warnings.
parse_opts = libxml2.HTML_PARSE_RECOVER + \
             libxml2.HTML_PARSE_NOERROR + \
             libxml2.HTML_PARSE_NOWARNING

doc = libxml2.htmlReadDoc(urllib.urlopen(url).read(), '', None, parse_opts)
print [i.getContent() for i in doc.xpathNewContext().xpathEval("//a/@href")]
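As with the BeautifulSoup answer, the XPath query returns the raw href values, so urlparse.urljoin can make them absolute if you need that (a sketch continuing from the snippet above):
import urlparse

absolute_urls = [urlparse.urljoin(url, i.getContent())
                 for i in doc.xpathNewContext().xpathEval("//a/@href")]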