Can't parse a table child with xpath
I'm parsing a site with some messy HTML. It has 130 subpages, and the only one that fails is the last one. The statement that fails is table = doc.xpath("descendant::table[1]")[0], marked in the code below. I get an empty list when I should be getting 3 elements (the parent and 2 children). All the pages have the same structure, so I don't have a clue how to solve this.
from lxml.html import parse

# get a list of the urls of the foods to parse
main_site = "http://www.whfoods.com/foodstoc.php"
doc = parse(main_site).getroot()
doc.make_links_absolute()
sites = doc.xpath('/html/body//div[@class="full3col"]/ul/li/a/@href')

for site in sites:
    doc = parse(site).getroot()
    # this is the statement that fails on the last site
    table = doc.xpath("descendant::table[1]")[0]
    # food info list
    table.xpath("//tr/td/table/tr/td/b/text()")
    # food nutrients list
    table.xpath("//tr/td/table[1]/tr/td/text()")
This is an HTML excerpt of the page that fails (the complete page was linked from the original post):
<html>
<head>
<body>
  <div id=mainpage">
    <div id="subcontent">
      (40+ <p> tags with things inside)
      <p>
        <table>
          <tbody>
            <tr>
              <td>
                <table>
                  <tbody>
                    <tr>
                      <td>
                        <b>Food's name<br>other things</b>
                      </td>
                    </tr>
                    <tr>
                      Heads of the table (not needed)
                    </tr>
                    <tr>
                      <td>nutrient name</td>
                      <td>dv</td>
                      <td>density</td>
                      <td>rating</td>
                    </tr>
                  </tbody>
                </table>
                <table> Not needed
                ...
All remaining closing tags
According to validator.w3.org when pointed at http://www.whfoods.com/genpage.php?tname=foodspice&dbid=97:
Line 253, column 147: non SGML character number 150
…ed mushrooms by Liquid Chromatography Mass Spectroscopy. The 230th ACS Natio…
The problem character is between "Chromatography" and "Mass". The page is declared to be encoded in ISO-8859-1, but as often happens in that case, it is lying:
>>> import unicodedata as ucd
>>> ucd.name(b'\x96'.decode('cp1252'))  # byte 150 == 0x96
'EN DASH'
Perhaps lxml is being picky about this also (Firefox doesn't care).
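If the mislabeled encoding really is the culprit, one thing to try is decoding the page bytes yourself before handing them to the parser. The following is only a sketch under that assumption: the helper name decode_like_a_browser and the sample bytes are made up for illustration, and it has not been tested against the actual page. It treats a declared ISO-8859-1 as Windows-1252, which is how browsers commonly handle that label.

```python
import unicodedata


def decode_like_a_browser(raw, declared="iso-8859-1"):
    """Decode page bytes, treating a declared ISO-8859-1 as Windows-1252.

    Hypothetical helper for illustration; not part of lxml.
    """
    label = declared.strip().lower()
    if label in ("iso-8859-1", "latin-1", "latin1"):
        label = "cp1252"
    # errors="replace" because cp1252 leaves a few bytes (e.g. 0x81) undefined
    return raw.decode(label, errors="replace")


# Made-up sample containing the problem byte 150 (0x96)
sample = b"Liquid Chromatography \x96 Mass Spectroscopy"
idx = sample.index(b"\x96")

as_declared = sample.decode("iso-8859-1")  # what the page claims to be
as_cp1252 = decode_like_a_browser(sample)  # what it most likely is

# Under ISO-8859-1 the byte becomes a C1 control character (category "Cc")
print(unicodedata.category(as_declared[idx]))
# Under cp1252 the same byte is a printable dash
print(unicodedata.name(as_cp1252[idx]))
```

The resulting string could then be passed to lxml.html.fromstring() instead of letting parse() trust the declared encoding. Whether that makes the table XPath work on the failing page is something you would have to check.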