BeautifulSoup to scrape details from remote site for display on local site

2023-02-14 01:33 Q&A author:

I am new to Python and BeautifulSoup, and I am trying to scrape the race details from a website to display on my local club website.

Here is my code so far:

import urllib2
import sys
import os

sys.path.insert(0, os.path.abspath(os.path.dirname(__file__)))
from BeautifulSoup import BeautifulSoup

# Road
#cyclelab_url='http://www.cyclelab.com/OnLine%20Entries.aspx?type=Road%20Events'

# MTB
cyclelab_url='http://www.cyclelab.com/OnLine%20Entries.aspx?type=Mountain%20Biking%20Events'

response = urllib2.urlopen(cyclelab_url)
html = response.read()

soup = BeautifulSoup(html)
event_names = soup.findAll(attrs= {"class" : "SpanEventName"})
for event in event_names:
    txt = event.find(text=True)
    print txt

event_details = soup.findAll(attrs= {"class" : "TDText"})
for detail in event_details:
    lines=[]
    txt_details = detail.find(text=True)
    print txt_details

This prints all the event names and then all the event details. What I want to do is print each event name followed by the details for that event. It seems like it should be simple to do, but I am stumped.

If you look at the structure of the page, you'll see that each event name found in your first loop is enclosed by a table that holds all the other useful details as pairs of cells in its rows. So what I'd do is have just one loop: each time you find an event name, look for the enclosing table and find all the details under it. This seems to work OK:

soup = BeautifulSoup(html)
event_names = soup.findAll(attrs= {"class" : "SpanEventName"})
for event in event_names:
    txt = event.find(text=True)
    print "Event name: "+txt.strip()
    # Find each parent in turn until we find the table that encloses
    # the event details:
    parent = event.parent
    while parent and parent.name != "table":
        parent = parent.parent
    if not parent:
        raise Exception, "Failed to find a <table> enclosing the event"
    # Now parent is the table element, so look for every
    # row under that table, and then the cells under that:
    for row in parent.findAll('tr'):
        cells = row.findAll('td')
        # We only care about the rows where there is a multiple of two
        # cells, since these are the key / value pairs:
        if len(cells) % 2 != 0:
            continue
        for i in xrange(0,len(cells),2):
            key_text = cells[i].find(text=True)
            value_text = cells[i+1].find(text=True)
            if key_text and value_text:
                print "  Key:",key_text.strip()
                print "  Value:",value_text.strip()

The output looks like:

Event name: Columbia Grape Escape 2011
  Key: Category:
  Value: Mountain Biking Events
  Key: Event Date:
  Value: 4 March 2011 to 6 March 2011
  Key: Entries Close:
  Value: 31 January 2011 at 23:00
  Key: Venue:
  Value: Eden on the Bay, Blouberg
  Key: Province:
  Value: Western Cape
  Key: Distance:
  Value: 3 Day, 3 Stage Race (228km)
  Key: Starting Time:
  Value: -1:-1
  Key: Timed By:
  Value: RaceTec
Event name: Investpro MTB Race 2011
  Key: Category:
  Value: Mountain Biking Events
  Key: Event Date:
  Value: 5 March 2011
  Key: Entries Close:
  Value: 25 February 2011 at 23:00

... etc.
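
For readers on Python 3, where `urllib2` and the old `BeautifulSoup` module are not available, the same approach can be sketched with the `bs4` package (installed as `beautifulsoup4`). This is only a rough port under stated assumptions: `SAMPLE_HTML` is hand-written markup standing in for the real page, whose exact structure is assumed rather than verified, and `scrape_events` is a hypothetical helper name. It also uses `get_text(strip=True)`, which joins all text inside a cell, rather than `find(text=True)`, which returns only the first text node.

```python
# Sketch only: a Python 3 / bs4 port of the single-loop approach above.
from bs4 import BeautifulSoup

# Hand-written sample; the real page's markup is assumed to look roughly like this.
SAMPLE_HTML = """
<table>
  <tr><td colspan="2"><span class="SpanEventName"> Example Race </span></td></tr>
  <tr><td class="TDText">Venue:</td><td class="TDText"> Example Town </td></tr>
  <tr><td class="TDText">Distance:</td><td class="TDText">40km</td></tr>
</table>
<table>
  <tr><td colspan="2"><span class="SpanEventName">Another Race</span></td></tr>
  <tr><td class="TDText">Venue:</td><td class="TDText">Elsewhere</td></tr>
</table>
"""


def scrape_events(html):
    """Return a list of (event_name, [(key, value), ...]) tuples."""
    soup = BeautifulSoup(html, "html.parser")
    events = []
    for span in soup.find_all(class_="SpanEventName"):
        name = span.get_text(strip=True)
        # find_parent walks up the tree, replacing the manual while loop.
        table = span.find_parent("table")
        if table is None:
            raise ValueError("no <table> encloses event %r" % name)
        details = []
        for row in table.find_all("tr"):
            cells = row.find_all("td")
            # Skip rows with an odd number of cells (e.g. the title row);
            # the remaining rows are treated as key / value pairs.
            if len(cells) % 2 != 0:
                continue
            for i in range(0, len(cells), 2):
                key = cells[i].get_text(strip=True)
                value = cells[i + 1].get_text(strip=True)
                if key and value:
                    details.append((key, value))
        events.append((name, details))
    return events


for name, details in scrape_events(SAMPLE_HTML):
    print("Event name:", name)
    for key, value in details:
        print("  Key:", key)
        print("  Value:", value)
```

To run this against the live site on Python 3, the HTML would be fetched with `urllib.request.urlopen(cyclelab_url).read()` in place of `urllib2.urlopen`, and passed to `scrape_events`.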

Update: Mark Longair has the correct/better answer. See comments.

Code is executed from top to bottom, so your code first prints all the events and then all the details. You have to "weave" the two loops together: for every event, print all of its details, then move on to the next event. Try something like this:

[....]
event_names = soup.findAll(attrs= {"class" : "SpanEventName"})
event_details = soup.findAll(attrs= {"class" : "TDText"})
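for event in event_names:
    txt = event.find(text=True)
    print txt
    # Caveat: event_details holds every detail on the page, so this
    # inner loop prints all of them under each event.
    for detail in event_details:
        txt_details = detail.find(text=True)
        print txt_details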

Some further improvements: You can remove leading and trailing whitespace and newlines with .strip(). For example: txt_details = detail.find(text=True).strip().
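
One thing to watch with that suggestion: `find(text=True)` returns `None` when an element contains no text, and calling `.strip()` on `None` raises an `AttributeError`. A small guard avoids that. The sketch below uses a hypothetical helper name, `clean_text`, which is not part of BeautifulSoup.

```python
def clean_text(node_text):
    """Strip surrounding whitespace; return '' when no text node was found."""
    # node_text is whatever find(text=True) returned: a string or None.
    if node_text is None:
        return ""
    return node_text.strip()


print(repr(clean_text("  Venue:\n")))
print(repr(clean_text(None)))
```

In the loops above it would be used as `clean_text(detail.find(text=True))`.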
