Problem with XPath
Here's a link:
http://www.covers.com/pageLoader/pageLoader.aspx?page=/data/nba/results/2010-2011/boxscore819588.html
I'm using HTML Agility Pack and I would like to extract, say, the 188 from the 'Odds' column. My editor gives /html/body/form/div/div[2]/div/table/tr/td[2]/div/table/tr[3]/td[7] when asked for the path. I tried that path with various omissions of body or html, but none of them returns any results when passed to .DocumentNode.SelectNodes(). I also tried with // at the beginning (which, I assume, is the root of the document tree). What gives?
EDIT:
Code:
WebClient client = new WebClient();
string html = client.DownloadString(url);
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
foreach(HtmlNode node in doc.DocumentNode.SelectNodes("/some/xpath/expression"))
{
Console.WriteLine("[" + node.InnerText + "]");
}
When scraping sites, you can't safely rely on the exact XPath given by tools. In general those paths are too restrictive, and in fact they match nothing most of the time. The best way is to look at the HTML and pick something more resilient to changes.
Here is a piece of code that works with your example:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html); // html is the page source as a string, e.g. from WebClient.DownloadString
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//a[text()='MIA']/ancestor::tr/td[7]"))
{
Console.WriteLine(node.InnerText.Trim());
}
It outputs 188.
The way it works is:
- select an A element with inner text set to "MIA"
- find the TR element that contains this A element (an ancestor, not necessarily the direct parent)
- get to the seventh TD of this TR element
- and then we use InnerText property of that TD element
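The steps above can be tried without the live page. The following is a minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced; the HTML fragment is made up for illustration and is not the real page markup. It also guards against SelectNodes returning null, which is what happens when an expression matches nothing.

```csharp
using System;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is referenced

class AnchorXPathDemo
{
    static void Main()
    {
        // Hypothetical fragment shaped like one box-score row; not the real page markup.
        string html =
            "<table class='data'>" +
            "<tr><td><a href='#'>MIA</a></td><td>1</td><td>2</td><td>3</td>" +
            "<td>4</td><td>5</td><td>188</td></tr>" +
            "</table>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Anchor on the link text, climb to the enclosing row, take its seventh cell.
        HtmlNodeCollection cells =
            doc.DocumentNode.SelectNodes("//a[text()='MIA']/ancestor::tr/td[7]");

        // SelectNodes returns null (not an empty collection) when nothing matches.
        if (cells == null)
        {
            Console.WriteLine("No match");
            return;
        }

        foreach (HtmlNode cell in cells)
        {
            Console.WriteLine(cell.InnerText.Trim());
        }
    }
}
```

With this fragment the seventh cell of the MIA row holds 188, so that is the value the loop should print. Iterating directly over a null result, as in the question's code, would throw instead of silently printing nothing.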
Try this:
/html/body/form/div/div[2]/div/table/*/tr/td[2]/div/table/*/tr[3]/td[7]
The * matches the mandatory <tbody> element, which is part of the DOM representation of tables even if it is not written in the HTML.
Other than that, it's more robust to select by ID, CSS class name or some other unique property instead of by hierarchy and document structure:
//table[@class='data']//tr[3]/td[7]
By default, HtmlAgilityPack treats the form tag differently (because form tags can overlap), so you need to remove the form tag from the XPath, for example: /html/body//div/div[2]/div/table/tr/td[2]/div/table/tr[3]/td[7]
Another way is to force HtmlAgilityPack to treat the form tag like any other:
HtmlNode.ElementsFlags.Remove("form");
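A rough sketch of where that call might go, assuming the HtmlAgilityPack NuGet package is referenced and using a made-up fragment rather than the real page. The flag table is static, so the call has to run before the document is parsed. How form is handled by default may differ between HtmlAgilityPack versions, so no output is claimed here.

```csharp
using System;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is referenced

class FormFlagDemo
{
    static void Main()
    {
        // ElementsFlags is a static table shared by all documents,
        // so remove the entry before any HTML is loaded.
        HtmlNode.ElementsFlags.Remove("form");

        // Hypothetical fragment: the target value sits inside a form element.
        string html =
            "<html><body><form><div><span>188</span></div></form></body></html>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // With the special handling removed, form can appear as a normal step in the path.
        HtmlNode span = doc.DocumentNode.SelectSingleNode("/html/body/form/div/span");

        // SelectSingleNode returns null when the path matches nothing.
        Console.WriteLine(span != null ? span.InnerText : "No match");
    }
}
```

Commenting out the Remove call and rerunning is a quick way to see whether your version of the library needs it.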