Screen scraping with HtmlAgilityPack: help please
Last night when I asked about screen scraping I was given an excellent article link, which has got me to this point. I have a few questions, however. I will post my code as well as the HTML source below. I am trying to grab the data inside the data table and then send it to an SQL table. I have had success grabbing the text rows, from Description (Widget 3.5) through Last Modified By (Joe). However, because the first two <tr> rows hold their values in an image (img src=/......" alt="00721408") rather than as text, those numbers do not get grabbed. First, I am stuck as to how to alter the code so that all the data in the table is grabbed. Second, what do I need to do next to prepare the data to be sent to an SQL table? My code is as follows:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using HtmlAgilityPack;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            // Load the html document
            var webGet = new HtmlWeb();
            var doc = webGet.Load("http://localhost");
            // Get all tables in the document
            HtmlNodeCollection tables = doc.DocumentNode.SelectNodes("//table");
            // Iterate all rows in the first table
            HtmlNodeCollection rows = tables[0].SelectNodes(".//tr");
            for (int i = 0; i < rows.Count; ++i)
            {
                // Iterate all columns in this row
                HtmlNodeCollection cols = rows[i].SelectNodes(".//td");
                for (int j = 0; j < cols.Count; ++j)
                {
                    // Get the value of the column and print it
                    string value = cols[j].InnerText;
                    Console.WriteLine(value);
                }
            }
        }
    }
}
<table class="data">
<tr><td>Part-Num</td><td width="50"></td><td><img src="/partcode/number/072140" alt="072140"/></td></tr>
<tr><td>Manu-Number</td><td width="50"></td><td><img src="/partcode/manu/00721408" alt="00721408" /></td></tr>
<tr><td>Description</td><td></td><td>Widget 3.5</td></tr>
<tr><td>Manu-Country</td><td></td><td>United States</td></tr>
<tr><td>Last Modified</td><td></td><td>26 Jan 2011, 8:08 PM</td></tr>
<tr><td>Last Modified By</td><td></td><td>
Manu
</td></tr>
</table>
<p>
</body></html>
While fragile, something like this would work in your case. It simply also prints the value of the alt attribute of every image in each cell:
// Iterate all rows in the first table
HtmlNodeCollection rows = tables[0].SelectNodes(".//tr");
for (int i = 0; i < rows.Count; ++i)
{
    // Iterate all columns in this row
    HtmlNodeCollection cols = rows[i].SelectNodes(".//td");
    for (int j = 0; j < cols.Count; ++j)
    {
        var images = cols[j].SelectNodes("img");
        if (images != null)
        {
            foreach (var image in images)
            {
                if (image.Attributes["alt"] != null)
                    Console.WriteLine(image.Attributes["alt"].Value);
            }
        }
        // Get the value of the column and print it
        string value = cols[j].InnerText;
        Console.WriteLine(value);
    }
}
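Building on the loop above, here is a minimal sketch of collecting each row as a label/value pair instead of printing it, which may be a more convenient shape before writing to a database. The names TableReader and ReadPairs are hypothetical, and the sketch assumes that every row keeps its label in the first cell and its value in the last cell, as in the posted HTML.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

static class TableReader
{
    // Hypothetical helper: returns label -> value for each row of the first table.
    public static Dictionary<string, string> ReadPairs(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var pairs = new Dictionary<string, string>();

        // SelectSingleNode / SelectNodes return null when nothing matches.
        HtmlNode table = doc.DocumentNode.SelectSingleNode("//table");
        if (table == null) return pairs;
        HtmlNodeCollection rows = table.SelectNodes(".//tr");
        if (rows == null) return pairs;

        foreach (HtmlNode row in rows)
        {
            HtmlNodeCollection cols = row.SelectNodes("./td");
            if (cols == null || cols.Count < 2) continue;

            string label = cols[0].InnerText.Trim();
            HtmlNode last = cols[cols.Count - 1];
            string value = last.InnerText.Trim();

            // Fall back to the image's alt text when the cell has no text of its own.
            if (value.Length == 0)
            {
                HtmlNode img = last.SelectSingleNode(".//img");
                if (img != null) value = img.GetAttributeValue("alt", "");
            }

            pairs[label] = value;
        }
        return pairs;
    }
}
```

With the posted table, ReadPairs would be expected to map "Part-Num" to the image's alt value and "Description" to "Widget 3.5". From a dictionary like this, each pair could then be bound to a parameterized INSERT command rather than concatenated into SQL text.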
I'm a little confused as to what data you're trying to obtain; however, you could try:
SelectNodes("//td[text()='Description']/../child::*[3]")
whose inner text should be "Widget 3.5"
SelectNodes("//td[text()='Manu-Country']/../child::*[3]")
whose inner text should be "United States"
etc. etc.
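As a rough sketch of how those expressions might be used, the helper below looks up a value by its label and falls back to the image's alt text for rows such as Part-Num, whose value cell contains no text. LabelLookup and ValueFor are hypothetical names, and the sketch assumes the label contains no single quote, since it is spliced directly into the XPath string.

```csharp
using System;
using HtmlAgilityPack;

static class LabelLookup
{
    // Hypothetical helper: returns the value for a label, or null if the label is not found.
    public static string ValueFor(HtmlDocument doc, string label)
    {
        // Find the label cell, step up to its row, then take the row's third child element.
        string xpath = "//td[text()='" + label + "']/../child::*[3]";
        HtmlNode cell = doc.DocumentNode.SelectSingleNode(xpath);
        if (cell == null) return null;

        string text = cell.InnerText.Trim();
        if (text.Length > 0) return text;

        // No text in the cell: use the alt attribute of an image inside it, if any.
        HtmlNode img = cell.SelectSingleNode(".//img");
        return img == null ? "" : img.GetAttributeValue("alt", "");
    }
}
```

A call such as LabelLookup.ValueFor(doc, "Manu-Country") would then replace the repeated SelectNodes calls, with doc being the HtmlDocument already loaded in the question's code.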
Btw, just as a shameless plug, you should check out systemhtml.codeplex.com. It's yet another HTML parser.