
How to fix non-compliant HTML so Expat will parse it (htmltidy not working)

I'm trying to scrape information from http://www.nfl.com/scores (in particular, find out when a game is over so my computer can stop recording it). I can download the HTML easily enough, and it makes this claim about compliance with standards:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">

But

  1. An attempt to parse it with Expat produces the error not well-formed (invalid token).

  2. The W3C's online validation service reports 399 Errors and 121 warnings.

  3. I tried to run HTML tidy (just called tidy) on my Linux system with the -xml option, but tidy reports 56 warnings and 117 errors and is unable to recover a good XML file. The errors look like this:

    line 409 column 122 - Warning: unescaped & or unknown entity "&role"
    ...
    line 409 column 172 - Warning: unescaped & or unknown entity "&tabSeq"
    ...
    line 1208 column 65 - Error: unexpected </td> in <br>
    line 1209 column 57 - Error: unexpected </tr> in <br>
    line 1210 column 49 - Error: unexpected </table> in <br>
    

    But when I check the input, the "unknown entities" appear to be part of a properly quoted URL, so I don't know if a double quote is missing somewhere or what.

I know that there is something out there that can parse this stuff because both Firefox and w3m display something reasonable. What tool will fix the non-compliant HTML so that I can parse it with Expat?


They're building the score boxes with JavaScript, so you're going to have to play more clever tricks to get at the game data (line breaks mine):

/* box of awesome */
// iscurrentweek ? true;
(new nfl.scores.Game('2009112905','54635',{state:'pre',container:'scorebox-2009112905',
wrapper:'sb-wrapper-2009112905',template:($('scorebox-2009112905').innerHTML),homeabbr:'NYJ',
awayabbr:'CAR'}));
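One such trick, sketched here in Python 3, is to skip HTML parsing for this part and pull the arguments of each nfl.scores.Game(...) call out of the page text with a regular expression. The function name parse_game_calls and the patterns are my own illustration, and the only state value visible in the snippet above is 'pre'; whatever value the site uses for a finished game would have to be confirmed by looking at the page during or after a game.

```python
import re

# Matches one call of the form nfl.scores.Game('<id>','<other>',{ ...options... })
GAME_CALL = re.compile(
    r"nfl\.scores\.Game\(\s*'(?P<game_id>\d+)'\s*,\s*'[^']*'\s*,\s*"
    r"\{(?P<options>.*?)\}\s*\)",
    re.DOTALL,
)
# Matches simple key:'value' pairs; non-string options such as template are skipped
OPTION = re.compile(r"(\w+):'([^']*)'")


def parse_game_calls(script_text):
    """Return one dict per Game(...) call found in script_text."""
    games = []
    for match in GAME_CALL.finditer(script_text):
        options = dict(OPTION.findall(match.group("options")))
        options["game_id"] = match.group("game_id")
        games.append(options)
    return games


SNIPPET = """(new nfl.scores.Game('2009112905','54635',{state:'pre',container:'scorebox-2009112905',
wrapper:'sb-wrapper-2009112905',template:($('scorebox-2009112905').innerHTML),homeabbr:'NYJ',
awayabbr:'CAR'}));"""

for game in parse_game_calls(SNIPPET):
    print(game["game_id"], game["awayabbr"], "at", game["homeabbr"], "state:", game["state"])
```

This is brittle by design: if the site changes the call's argument order or quoting, the pattern stops matching and returns an empty list rather than wrong data.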

However, to answer your question, BeautifulSoup parses it (seemingly) fine:

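# Python 2 with BeautifulSoup 3
from urllib2 import urlopen
from BeautifulSoup import BeautifulSoup

# Download the whole page; read() with no argument returns everything up to EOF
fp = urlopen("http://www.nfl.com/scores")
data = fp.read()
fp.close()

# BeautifulSoup tolerates the malformed markup that Expat rejects
soup = BeautifulSoup(data)
print soup.contents[2].contents[1].contents[1]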

Outputs:

<title>NFL Scores: 2009 - Week 12</title>
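To illustrate why a lenient parser gets through where Expat stops, here is a small Python 3 sketch using only the standard library. The SAMPLE markup is made up for the demonstration; it only imitates the two problems tidy reported (a bare & inside a quoted URL and an unclosed <br>). In XML, an & inside an attribute value has to be written as &amp;, which is why the "properly quoted" URLs still trip Expat.

```python
import xml.parsers.expat
from html.parser import HTMLParser

# Made-up markup imitating the reported problems: bare & in a URL, unclosed <br>
SAMPLE = (
    '<html><head><title>NFL Scores: 2009 - Week 12</title></head>'
    '<body><a href="/x?id=1&role=fan&tabSeq=2">link</a>'
    '<table><tr><td>text<br></td></tr></table></body></html>'
)


def expat_accepts(markup):
    """Return True if Expat parses markup as well-formed XML."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(markup, True)
        return True
    except xml.parsers.expat.ExpatError:
        return False


class TitleParser(HTMLParser):
    """Lenient HTML parser that collects the text inside <title>."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


title_parser = TitleParser()
title_parser.feed(SAMPLE)
print("Expat accepts sample:", expat_accepts(SAMPLE))
print("Lenient parser found title:", title_parser.title)
print("Expat accepts escaped URL:",
      expat_accepts('<a href="/x?id=1&amp;role=fan">link</a>'))
```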

Might be easier to scrape Yahoo's NFL scoreboard, in my opinion...in fact, off to try it.


EDIT: Used your question as an excuse to get around to learning BeautifulSoup. Alex Martelli has been singing its praises, so I figured it was worth a try -- man, am I impressed.

Anyway, I was able to cook up a rudimentary score scraper from the Yahoo! scoreboard, like so:

def main():
    # YAHOO_SCOREBOARD is assumed to hold the Yahoo! scoreboard page's HTML, downloaded beforehand
    soup = BeautifulSoup(YAHOO_SCOREBOARD)
    on_first_team = True
    scores = []
    hold = None

    # Iterate the tr that contains a team's box score
    for item in soup(name="tr", attrs={"align": "center", "class": "ysptblclbg5"}):
        # Easy
        team = item.b.a.string

        # Get the box scores since we're industrious
        boxscore = []
        for quarter in item(name="td", attrs={"class": "yspscores"}):
            boxscore.append(int(quarter.string))

        # Final score
        sub = item(name="span", attrs={"class": "yspscores"})[0]
        if sub.b:
            # Winning score
            final = int(sub.b.string)
        else:
            data = sub.string.replace("&nbsp;", "")
            if ":" in data:
                # Catch TV: XXX and 0:00pm ET
                final = None
            else:
                try:
                    final = int(data)
                except ValueError:
                    # Not a number (e.g. blank cell): no final score yet
                    final = None

        if on_first_team:
            hold = { team : (boxscore, final) }
            on_first_team = False
        else:
            hold[team] = (boxscore, final)
            scores.append(hold)
            on_first_team = True

    for game in scores:
        print "--- Game ---"
        for team in game:
            print team, game[team]

I would tweak this on Sunday to see how it operates, as it's really rough. Here's what it outputs as of right now:

--- Game ---
Green Bay ([0, 13, 14, 7], 34)
Detroit ([7, 0, 0, 5], 12)
--- Game ---
Oakland ([0, 0, 7, 0], 7)
Dallas ([3, 14, 0, 7], 24)

Look at that, I snagged box scores too... for a game that hasn't happened yet, we get:

--- Game ---
Washington ([], None)
Philadelphia ([], None)

Anyway, a peg for you to jump from. Good luck.


There's a Flash-based auto-updating scoreboard thing at the top of nfl.com. Some monitoring of its network traffic finds:

http://www.nfl.com/liveupdate/scorestrip/ss.xml

That will probably be a bit easier to parse than the HTML scoreboard.
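Since the original goal was to use Expat, here is a minimal Python 3 sketch of reading such a feed with it. I have not reproduced the real layout of ss.xml here: the element name game and the attributes id, home, away and state in SAMPLE_FEED are hypothetical placeholders, so inspect the actual feed and substitute whatever names it uses.

```python
import xml.parsers.expat


def collect_elements(xml_text, element_name):
    """Return the attribute dict of every element called element_name."""
    found = []

    def start_element(name, attrs):
        if name == element_name:
            found.append(dict(attrs))

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start_element
    parser.Parse(xml_text, True)
    return found


# Hypothetical feed: element and attribute names are placeholders, not the real schema
SAMPLE_FEED = """<scores>
  <game id="1" home="NYJ" away="CAR" state="pre"/>
  <game id="2" home="DET" away="GB" state="final"/>
</scores>"""

games = collect_elements(SAMPLE_FEED, "game")
finished = [g["id"] for g in games if g["state"] == "final"]
print("Finished games:", finished)
```

A recording script could poll the feed every minute or so and stop once the game it cares about shows up in the finished list.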


Look into TagSoup. If you want to end up with a DOM tree or a SAX stream in Java, it's the ticket. If you just want to extract specific information, Beautiful Soup is a Beautiful Thing.
