How to fix non-compliant HTML so Expat will parse it (htmltidy not working)

2022-12-13 05:01 Q&A author:

I'm trying to scrape information from http://www.nfl.com/scores (in particular, find out when a game is over so my computer can stop recording it). I can download the HTML easily enough, and it makes this claim about compliance with standards:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">

But:

- An attempt to parse it with Expat produces the error "not well-formed (invalid token)".
- The W3C's online validation service reports 399 errors and 121 warnings.
- I tried to run HTML Tidy (the command is just called tidy) on my Linux system with the -xml option, but tidy reports 56 warnings and 117 errors and is unable to recover a good XML file. The errors look like this:
```
line 409 column 122 - Warning: unescaped & or unknown entity "&role"
...
line 409 column 172 - Warning: unescaped & or unknown entity "&tabSeq"
...
line 1208 column 65 - Error: unexpected </td> in <br>
line 1209 column 57 - Error: unexpected </tr> in <br>
line 1210 column 49 - Error: unexpected </table> in <br>
```
But when I check the input, the "unknown entities" appear to be part of a properly quoted URL, so I don't know if a double quote is missing somewhere or what.

I know that there is something out there that can parse this stuff because both Firefox and w3m display something reasonable. What tool will fix the non-compliant HTML so that I can parse it with Expat?
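On the "properly quoted URL" point: quoting does not help an XML parser, because a bare & inside an attribute value still has to be written as &amp;. Here is a minimal sketch using Python's xml.parsers.expat bindings. The two URLs are made up for illustration and the helper name expat_error is my own.

```python
import xml.parsers.expat


def expat_error(document):
    """Return Expat's error message for document, or None if it parses."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(document, True)
    except xml.parsers.expat.ExpatError as exc:
        return str(exc)
    return None


# Hypothetical links: same URL, once with bare ampersands, once escaped
raw = '<a href="/page?x=1&role=admin&tabSeq=2">link</a>'
escaped = '<a href="/page?x=1&amp;role=admin&amp;tabSeq=2">link</a>'

print("raw:    ", expat_error(raw))
print("escaped:", expat_error(escaped))
```

The first document should be rejected and the second accepted, which is consistent with tidy's "unescaped & or unknown entity" warnings pointing at query strings.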

They're using some kind of JavaScript on the score boxes, so you're going to have to play more clever tricks (line breaks mine):

/* box of awesome */
// iscurrentweek ? true;
(new nfl.scores.Game('2009112905','54635',{state:'pre',container:'scorebox-2009112905',
wrapper:'sb-wrapper-2009112905',template:($('scorebox-2009112905').innerHTML),homeabbr:'NYJ',
awayabbr:'CAR'}));
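One possible "clever trick" is to skip HTML parsing for this part and pull the fields straight out of the script text with a regular expression. This is only a sketch: it assumes every game appears as a call shaped like the one quoted above, and I only know the state value 'pre' from this snippet, so what a finished game looks like is something you would have to check. The function name extract_games is my own.

```python
import re

# The call quoted above, joined back onto one line
script = ("(new nfl.scores.Game('2009112905','54635',{state:'pre',"
          "container:'scorebox-2009112905',wrapper:'sb-wrapper-2009112905',"
          "template:($('scorebox-2009112905').innerHTML),homeabbr:'NYJ',"
          "awayabbr:'CAR'}));")

# Matches one Game(...) call and captures the two ids and the options body
GAME_CALL = re.compile(r"new nfl\.scores\.Game\('(\d+)','(\d+)',\{(.*?)\}\)", re.S)
# Matches simple key:'value' pairs inside the options body
FIELD = re.compile(r"(\w+):'([^']*)'")


def extract_games(text):
    """Return one dict of string fields per Game(...) call found in text."""
    games = []
    for match in GAME_CALL.finditer(text):
        fields = dict(FIELD.findall(match.group(3)))
        fields["id"] = match.group(1)
        games.append(fields)
    return games


for game in extract_games(script):
    print(game["id"], game["awayabbr"], "at", game["homeabbr"], game["state"])
```

The template entry is skipped on purpose, since its value is a JavaScript expression rather than a quoted string.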

However, to answer your question, BeautifulSoup parses it (seemingly) fine:

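from urllib.request import urlopen
from bs4 import BeautifulSoup  # assumes the beautifulsoup4 package is installed

# Download the page; read() returns the whole response body
fp = urlopen("http://www.nfl.com/scores")
data = fp.read()
fp.close()

soup = BeautifulSoup(data, "html.parser")
# soup.title is the document's <title> tag
print(soup.title)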

Outputs:

<title>NFL Scores: 2009 - Week 12</title>
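If installing BeautifulSoup is not an option, Python's standard-library html.parser is similarly forgiving about tag soup. It is a different tool from the one used above, shown only as a hedged alternative. The markup below is made up to mimic the defects tidy complained about (bare & in a URL, a stray <br> before </td>), and TitleFinder is a name I picked for the sketch.

```python
from html.parser import HTMLParser


class TitleFinder(HTMLParser):
    """Collect the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        # Only the first <title> is recorded
        if tag == "title" and self.title is None:
            self.in_title = True
            self.title = ""

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


# Made-up tag soup, not the real nfl.com page
broken = ('<html><head><title>NFL Scores: 2009 - Week 12</title></head>'
          '<body><table><tr><td><a href="/x?a=1&role=y&tabSeq=2">box</a>'
          '<br></td></tr></table></body></html>')

finder = TitleFinder()
finder.feed(broken)
finder.close()
print(finder.title)
```

The parser does not raise on the malformed parts; it just reports the tags it sees, which is enough when you only want a few specific values.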

Might be easier to scrape Yahoo's NFL scoreboard, in my opinion...in fact, off to try it.

EDIT: Used your question as an excuse to get around to learning BeautifulSoup. Alex Martelli has been singing its praises, so I figured it worth a try -- man, am I impressed.

Anyway, I was able to cook up a rudimentary score scraper from the Yahoo! scoreboard, like so:

from bs4 import BeautifulSoup  # assumes the beautifulsoup4 package is installed

# YAHOO_SCOREBOARD is assumed to hold the Yahoo! scoreboard page's HTML as a string

def main():
    soup = BeautifulSoup(YAHOO_SCOREBOARD, "html.parser")
    on_first_team = True
    scores = []
    hold = None

    # Iterate over each tr that contains a team's box score
    for item in soup(name="tr", attrs={"align": "center", "class": "ysptblclbg5"}):
        # Easy
        team = item.b.a.string

        # Get the box scores since we're industrious
        boxscore = []
        for quarter in item(name="td", attrs={"class": "yspscores"}):
            boxscore.append(int(quarter.string))

        # Final score
        sub = item(name="span", attrs={"class": "yspscores"})[0]
        if sub.b:
            # Winning score
            final = int(sub.b.string)
        else:
            # bs4 decodes &nbsp; to U+00A0, so strip both forms
            data = sub.string.replace("&nbsp;", "").replace("\xa0", "")
            if ":" in data:
                # Catch TV: XXX and 0:00pm ET
                final = None
            else:
                try:
                    final = int(data)
                except ValueError:
                    final = None

        # Teams come in pairs: hold the first, emit the game on the second
        if on_first_team:
            hold = {team: (boxscore, final)}
            on_first_team = False
        else:
            hold[team] = (boxscore, final)
            scores.append(hold)
            on_first_team = True

    for game in scores:
        print("--- Game ---")
        for team in game:
            print(team, game[team])

I would tweak this on Sunday to see how it operates, as it's really rough. Here's what it outputs as of right now:

--- Game ---
Green Bay ([0, 13, 14, 7], 34)
Detroit ([7, 0, 0, 5], 12)
--- Game ---
Oakland ([0, 0, 7, 0], 7)
Dallas ([3, 14, 0, 7], 24)

Look at that, I snagged box scores too... for a game that hasn't happened yet, we get:

--- Game ---
Washington ([], None)
Philadelphia ([], None)

Anyway, a peg for you to jump from. Good luck.

There's a Flash-based auto-updating scoreboard thing at the top of nfl.com. Some monitoring of its network traffic finds:

http://www.nfl.com/liveupdate/scorestrip/ss.xml

That will probably be a bit easier to parse than the HTML scoreboard.
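Since that feed is XML, Expat can be pointed at it directly. I have not reproduced the real layout of ss.xml here: the sample below is a hypothetical stand-in, and the element name game, the attribute names, the second game id, and the "final" state value are all invented for illustration (only the id 2009112905, the team abbreviations NYJ/CAR, and the 'pre' state come from the script quoted earlier). The sketch just shows the shape of an Expat handler that collects finished games.

```python
import xml.parsers.expat

# Hypothetical feed layout: element and attribute names are assumptions
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<scores>
  <game id="2009112905" home="NYJ" away="CAR" state="pre"/>
  <game id="2009112600" home="DET" away="GB" state="final"/>
</scores>"""


def finished_games(xml_text, done_states=("final",)):
    """Return the ids of games whose state attribute marks them as over."""
    finished = []

    def start_element(name, attrs):
        # Expat passes attributes as a dict of strings
        if name == "game" and attrs.get("state") in done_states:
            finished.append(attrs["id"])

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start_element
    parser.Parse(xml_text, True)
    return finished


print(finished_games(SAMPLE))
```

To use it for real, download the feed, look at which element and attribute actually carry the game status, and adjust the names in start_element and done_states accordingly.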

Look into tagsoup. If you want to end up with a DOM tree or a SAX stream in Java, it's the ticket. If you just want to extract specific information, Beautiful Soup is a Beautiful Thing.
