
BeautifulSoup to scrape details from remote site for display on local site

New to Python and BeautifulSoup, I am trying to scrape the race details from a website to display on my local club website.

Here is my code so far:

import urllib2
import sys
import os

sys.path.insert(0, os.path.abspath(os.path.dirname(__file__)))
from BeautifulSoup import BeautifulSoup

# Road
#cyclelab_url='http://www.cyclelab.com/OnLine%20Entries.aspx?type=Road%20Events'

# MTB
cyclelab_url='http://www.cyclelab.com/OnLine%20Entries.aspx?type=Mountain%20Biking%20Events'

response = urllib2.urlopen(cyclelab_url)
html = response.read()

soup = BeautifulSoup(html)
event_names = soup.findAll(attrs= {"class" : "SpanEventName"})
for event in event_names:
    txt = event.find(text=True)
    print txt

event_details = soup.findAll(attrs= {"class" : "TDText"})
for detail in event_details:
    txt_details = detail.find(text=True)
    print txt_details

This prints the event names and then the event details. What I want instead is to print each event name and, below it, the details for that event. It seems like it should be simple, but I am stumped.


If you look at the structure of the page, you'll see that the event name that you find in the first loop is enclosed by a table that has all the other useful details as pairs of cells in rows of the table. So, what I'd do is to just have one loop, and each time you find an event name, look for the enclosing table and find all the events under that. This seems to work OK:

soup = BeautifulSoup(html)
event_names = soup.findAll(attrs= {"class" : "SpanEventName"})
for event in event_names:
    txt = event.find(text=True)
    print "Event name: "+txt.strip()
    # Find each parent in turn until we find the table that encloses
    # the event details:
    parent = event.parent
    while parent and parent.name != "table":
        parent = parent.parent
    if not parent:
        raise Exception("Failed to find a <table> enclosing the event")
    # Now parent is the table element, so look for every
    # row under that table, and then the cells under that:
    for row in parent.findAll('tr'):
        cells = row.findAll('td')
        # We only care about the rows where there is a multiple of two
        # cells, since these are the key / value pairs:
        if len(cells) % 2 != 0:
            continue
        for i in xrange(0,len(cells),2):
            key_text = cells[i].find(text=True)
            value_text = cells[i+1].find(text=True)
            if key_text and value_text:
                print "  Key:",key_text.strip()
                print "  Value:",value_text.strip()

The output looks like:

Event name: Columbia Grape Escape 2011
  Key: Category:
  Value: Mountain Biking Events
  Key: Event Date:
  Value: 4 March 2011 to 6 March 2011
  Key: Entries Close:
  Value: 31 January 2011 at 23:00
  Key: Venue:
  Value: Eden on the Bay, Blouberg
  Key: Province:
  Value: Western Cape
  Key: Distance:
  Value: 3 Day, 3 Stage Race (228km)
  Key: Starting Time:
  Value: -1:-1
  Key: Timed By:
  Value: RaceTec
Event name: Investpro MTB Race 2011
  Key: Category:
  Value: Mountain Biking Events
  Key: Event Date:
  Value: 5 March 2011
  Key: Entries Close:
  Value: 25 February 2011 at 23:00

... etc.
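Incidentally, BeautifulSoup can do the parent-walking for you: findParent("table") (find_parent in the newer bs4 package) climbs the tree until it hits the enclosing table, replacing the manual while loop. Here is a standalone sketch of the same pair-up-the-cells idea, written against bs4 with a made-up HTML fragment so it runs without hitting the live site (the fragment and the parse_events helper are mine, not from the real page):

```python
from bs4 import BeautifulSoup  # bs4; the question imports the older BeautifulSoup 3

# Hypothetical fragment mimicking the described layout: the event name span
# sits inside a table whose remaining rows are key/value cell pairs.
HTML = """
<table>
  <tr><td><span class="SpanEventName">Demo Race 2011</span></td></tr>
  <tr><td>Venue:</td><td>Blouberg</td></tr>
  <tr><td>Province:</td><td>Western Cape</td></tr>
</table>
"""

def parse_events(html):
    soup = BeautifulSoup(html, "html.parser")
    events = []
    for name in soup.find_all("span", class_="SpanEventName"):
        # find_parent replaces the manual walk up the .parent chain
        table = name.find_parent("table")
        details = {}
        for row in table.find_all("tr"):
            cells = row.find_all("td")
            if len(cells) % 2 != 0:  # only the key/value pair rows
                continue
            for i in range(0, len(cells), 2):
                key = cells[i].get_text(strip=True)
                value = cells[i + 1].get_text(strip=True)
                if key and value:
                    details[key] = value
        events.append((name.get_text(strip=True), details))
    return events

for name, details in parse_events(HTML):
    print(name, details)
```

In BeautifulSoup 3 the equivalent calls are findAll and findParent, with the same arguments.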


Update: Mark Longair has the correct/better answer. See comments.

Code gets executed from top to bottom. So, in your code, first all the events are printed and then all the details. You have to "weave" the code together, meaning for every event, print all of its details, then move on to the next event. Try something like this:

[....]
event_names = soup.findAll(attrs= {"class" : "SpanEventName"})
event_details = soup.findAll(attrs= {"class" : "TDText"})
for event in event_names:
    txt = event.find(text=True)
    print txt
    for detail in event_details:
        txt_details = detail.find(text=True)
        print txt_details

Some further improvements: you can remove the surrounding whitespace and newlines with .strip(). For example: txt_details = detail.find(text=True).strip().
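To see the effect, here is a tiny standalone check (written against the newer bs4 package and a made-up cell; BeautifulSoup 3's NavigableString supports .strip() the same way):

```python
from bs4 import BeautifulSoup

# A made-up cell with the kind of surrounding whitespace the page produces
soup = BeautifulSoup("<td class='TDText'>\n  Western Cape  \n</td>", "html.parser")
cell = soup.find(attrs={"class": "TDText"})

raw = cell.find(text=True)   # whitespace and newlines included
clean = raw.strip()          # 'Western Cape'
print(repr(raw), repr(clean))
```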

