How to fix non-compliant HTML so Expat will parse it (htmltidy not working)
I'm trying to scrape information from http://www.nfl.com/scores (in particular, find out when a game is over so my computer can stop recording it). I can download the HTML easily enough, and it makes this claim about compliance with standards:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
But
An attempt to parse it with Expat produces the error
not well-formed (invalid token)
.The W3C's online validation service reports 399 Errors and 121 warnings.
I tried to run HTML tidy (just called
tidy
) on my Linux system with the-xml
option, but tidy reports 56 warnings and 117 errors and is unable to recover a good XML file. The errors look like this:line 409 column 122 - Warning: unescaped & or unknown entity "&role" ... line 409 column 172 - Warning: unescaped & or unknown entity "&tabSeq" ... line 1208 column 65 - Error: unexpected </td> in <br> line 1209 column 57 - Error: unexpected </tr> in <br> line 1210 col开发者_开发百科umn 49 - Error: unexpected </table> in <br>
But when I check the input, the "unknown entities" appear to be part of a properly quoted URL, so I don't know if a double quote is missing somewhere or what.
I know that there is something out there that can parse this stuff because both Firefox and w3m display something reasonable. What tool will fix the non-compliant HTML so that I can parse it with Expat?
They're using some kind of Javascript on the score boxes, so you're going to have to play more clever tricks (line breaks mine):
/* box of awesome */
// iscurrentweek ? true;
(new nfl.scores.Game('2009112905','54635',{state:'pre',container:'scorebox-2009112905',
wrapper:'sb-wrapper-2009112905',template:($('scorebox-2009112905').innerHTML),homeabbr:'NYJ',
awayabbr:'CAR'}));
However, to answer your question, BeautifulSoup parses it (seemingly) fine:
fp = urlopen("http://www.nfl.com/scores")
data = ""
while 1:
r = fp.read()
if not r:
break
data += r
fp.close()
soup = BeautifulSoup(data)
print soup.contents[2].contents[1].contents[1]
Outputs:
<title>NFL Scores: 2009 - Week 12</title>
Might be easier to scrape Yahoo's NFL scoreboard, in my opinion...in fact, off to try it.
EDIT: Used your question as an excuse to get around to learning BeautifulSoup. Alex Martelli has been singing its praise, so I figured it worth a try -- man, am I impressed.
Anyway, I was able to cook up a rudimentary score scraper from the Yahoo! scoreboard, like so:
def main():
soup = BeautifulSoup(YAHOO_SCOREBOARD)
on_first_team = True
scores = []
hold = None
# Iterate the tr that contains a team's box score
for item in soup(name="tr", attrs={"align": "center", "class": "ysptblclbg5"}):
# Easy
team = item.b.a.string
# Get the box scores since we're industrious
boxscore = []
for quarter in item(name="td", attrs={"class": "yspscores"}):
boxscore.append(int(quarter.string))
# Final score
sub = item(name="span", attrs={"class": "yspscores"})[0]
if sub.b:
# Winning score
final = int(sub.b.string)
else:
data = sub.string.replace(" ", "")
if ":" in data:
# Catch TV: XXX and 0:00pm ET
final = None
else:
try: final = int(data)
except: final = None
if on_first_team:
hold = { team : (boxscore, final) }
on_first_team = False
else:
hold[team] = (boxscore, final)
scores.append(hold)
on_first_team = True
for game in scores:
print "--- Game ---"
for team in game:
print team, game[team]
I would tweak this on Sunday to see how it operates, as it's really rough. Here's what it outputs as of right now:
--- Game ---
Green Bay ([0, 13, 14, 7], 34)
Detroit ([7, 0, 0, 5], 12)
--- Game ---
Oakland ([0, 0, 7, 0], 7)
Dallas ([3, 14, 0, 7], 24)
Look at that, I snagged box scores too... for a game that hasn't happened yet, we get:
--- Game ---
Washington ([], None)
Philadelphia ([], None)
Anyway, a peg for you to jump from. Good luck.
There's a Flash-based auto-updating scoreboard thing at the top of nfl.com. Some monitoring of its network traffic finds:
http://www.nfl.com/liveupdate/scorestrip/ss.xml
That will probably be a bit easier to parse than the HTML scoreboard.
Look into tagsoup. If you want to end up with a DOM tree or a SAX stream in Java, it's the ticket. If you just want to extract specific information, Beautiful Soup is a Beautiful Thing.
精彩评论