BeautifulSoup to scrape details from remote site for display on local site
I am new to Python and BeautifulSoup, and I am trying to scrape race details from a website to display on my local club's website.
Here is my code so far:
import urllib2
import sys
import os
sys.path.insert(0, os.path.abspath(os.path.dirname(__file__)))
from BeautifulSoup import BeautifulSoup
# Road
#cyclelab_url='http://www.cyclelab.com/OnLine%20Entries.aspx?type=Road%20Events'
# MTB
cyclelab_url='http://www.cyclelab.com/OnLine%20Entries.aspx?type=Mountain%20Biking%20Events'
response = urllib2.urlopen(cyclelab_url)
html = response.read()
soup = BeautifulSoup(html)
event_names = soup.findAll(attrs= {"class" : "SpanEventName"})
for event in event_names:
    txt = event.find(text=True)
    print txt
event_details = soup.findAll(attrs= {"class" : "TDText"})
for detail in event_details:
    lines=[]
    txt_details = detail.find(text=True)
    print txt_details
This prints all the event names and then all the event details. What I want is to print each event name followed by the details for that event. It seems like it should be simple to do, but I am stumped.
If you look at the structure of the page, you'll see that each event name found in the first loop is enclosed by a table that holds all the other useful details as pairs of cells in its rows. So I'd use just one loop: each time you find an event name, walk up to the enclosing table and collect all the details under it. This seems to work OK:
soup = BeautifulSoup(html)
event_names = soup.findAll(attrs= {"class" : "SpanEventName"})
for event in event_names:
    txt = event.find(text=True)
    print "Event name: "+txt.strip()
    # Find each parent in turn until we find the table that encloses
    # the event details:
    parent = event.parent
    while parent and parent.name != "table":
        parent = parent.parent
    if not parent:
        raise Exception, "Failed to find a <table> enclosing the event"
    # Now parent is the table element, so look for every
    # row under that table, and then the cells under that:
    for row in parent.findAll('tr'):
        cells = row.findAll('td')
        # We only care about the rows where there is a multiple of two
        # cells, since these are the key / value pairs:
        if len(cells) % 2 != 0:
            continue
        for i in xrange(0,len(cells),2):
            key_text = cells[i].find(text=True)
            value_text = cells[i+1].find(text=True)
            if key_text and value_text:
                print " Key:",key_text.strip()
                print " Value:",value_text.strip()
The output looks like:
Event name: Columbia Grape Escape 2011
Key: Category:
Value: Mountain Biking Events
Key: Event Date:
Value: 4 March 2011 to 6 March 2011
Key: Entries Close:
Value: 31 January 2011 at 23:00
Key: Venue:
Value: Eden on the Bay, Blouberg
Key: Province:
Value: Western Cape
Key: Distance:
Value: 3 Day, 3 Stage Race (228km)
Key: Starting Time:
Value: -1:-1
Key: Timed By:
Value: RaceTec
Event name: Investpro MTB Race 2011
Key: Category:
Value: Mountain Biking Events
Key: Event Date:
Value: 5 March 2011
Key: Entries Close:
Value: 25 February 2011 at 23:00
... etc.
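If the details are going to be shown on the local site rather than just printed, it may help to separate the key/value pairing step from the printing. The following is only a minimal sketch, assuming the cell texts of one row have already been extracted into a plain list (entries may be None where a cell had no text). The helper name pair_cells and the sample row are hypothetical, and stripping the trailing colon from the key is an extra convenience that is not in the code above.

```python
def pair_cells(cell_texts):
    """Pair up alternating key/value cell texts from one table row.

    Returns a list of (key, value) tuples. Rows with an odd number of
    cells are skipped (empty list), as in the loop above.
    """
    if len(cell_texts) % 2 != 0:
        return []
    pairs = []
    for i in range(0, len(cell_texts), 2):
        key, value = cell_texts[i], cell_texts[i + 1]
        # Skip pairs where either cell had no text (None or "").
        if key and value:
            # Assumption: drop the trailing colon so "Venue:" becomes "Venue".
            pairs.append((key.strip().rstrip(":"), value.strip()))
    return pairs


# Hypothetical sample row, modelled on the output shown above.
row = [" Venue: ", "Eden on the Bay, Blouberg", "Province:", " Western Cape"]
details = dict(pair_cells(row))
print(details["Venue"])
```

Collecting the pairs first means each event can be stored as a dictionary and handed to whatever builds the local page, instead of being tied to print.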
Update: Mark Longair has the correct/better answer. See comments.
Code is executed from top to bottom, so your code first prints all the events and then all the details. You have to "weave" the two loops together: for every event, print all of its details, then move on to the next event. Try something like this:
[....]
event_names = soup.findAll(attrs= {"class" : "SpanEventName"})
event_details = soup.findAll(attrs= {"class" : "TDText"})
for event in event_names:
    txt = event.find(text=True)
    print txt
    for detail in event_details:
        txt_details = detail.find(text=True)
        print txt_details
A further improvement: you can remove the surrounding whitespace and newlines with .strip(). For example: txt_details = detail.find(text=True).strip()
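One thing to watch for with this tip: find(text=True) returns None when an element contains no text, and calling .strip() on None raises an AttributeError. A small guard avoids that. This is just a sketch, and clean_text is a hypothetical helper name rather than anything provided by BeautifulSoup.

```python
def clean_text(text):
    """Return text without surrounding whitespace, or "" if text is None."""
    if text is None:
        return ""
    return text.strip()


# Sample values standing in for the results of detail.find(text=True).
print(repr(clean_text("  Western Cape\n")))
print(repr(clean_text(None)))
```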