Scraping HTML tables in .NET and handling colspans
I am trying to scrape HTML tables in my .NET application, but I have come across tables that make heavy use of colspan and rowspan attributes on cells, which is giving me a headache. Is there a library that can convert a table into an array of strings while taking colspan into account? For example, if a TD element has colspan=5, its value should be repeated for each of the 5 columns it spans. Given this table:
<table>
<tr>
<td>1</td>
<td>2</td>
<td>3</td>
<td>4</td>
<td>5</td>
</tr>
<tr>
<td colspan=4>1</td>
<td>2</td>
</tr></table>
the output would be the following arrays:
[1,2,3,4,5] [1,1,1,1,2]
You may be able to use ParseControl, which would make the whole thing fairly trivial, since you can then access the ColSpan property of each cell.
You could load the table into an XmlDocument and then loop through it. I'm not sure that's the best solution, but it works. LINQ to XML might be another option.
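A minimal sketch of the LINQ to XML idea, under some assumptions: the table markup is well-formed XML (so the question's `colspan=4` would have to be quoted as `colspan="4"`; for raw real-world HTML a tolerant parser such as Html Agility Pack is probably a better front end), there are no nested tables, and the names `TableGrid`, `Expand`, and `SpanOf` are hypothetical helpers made up for this illustration, not part of any library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class TableGrid
{
    // Hypothetical helper: returns one string[] per <tr>, repeating each
    // cell's text across every column and row that the cell spans.
    public static List<string[]> Expand(string xhtmlTable)
    {
        // XElement.Parse throws on markup that is not well-formed XML.
        XElement table = XElement.Parse(xhtmlTable);

        // Match on LocalName so an XHTML namespace does not hide the rows.
        List<XElement> trs = table.Descendants()
            .Where(e => e.Name.LocalName == "tr")
            .ToList();

        // One sparse row per <tr>, keyed by column index.
        var rows = trs.Select(_ => new SortedDictionary<int, string>()).ToList();

        for (int r = 0; r < trs.Count; r++)
        {
            int col = 0;
            var cells = trs[r].Elements()
                .Where(e => e.Name.LocalName == "td" || e.Name.LocalName == "th");

            foreach (XElement cell in cells)
            {
                // Skip columns already filled by a rowspan from an earlier row.
                while (rows[r].ContainsKey(col)) col++;

                int colSpan = SpanOf(cell, "colspan");
                int rowSpan = SpanOf(cell, "rowspan");
                string text = cell.Value.Trim();

                // Copy the text into every grid position the cell covers.
                for (int dr = 0; dr < rowSpan && r + dr < rows.Count; dr++)
                    for (int dc = 0; dc < colSpan; dc++)
                        rows[r + dr][col + dc] = text;

                col += colSpan;
            }
        }

        // Values come back in column order; unfilled columns in ragged
        // tables are simply skipped rather than padded.
        return rows.Select(row => row.Values.ToArray()).ToList();
    }

    // Reads a span attribute, treating a missing or invalid value as 1.
    private static int SpanOf(XElement cell, string name)
    {
        int span;
        return int.TryParse((string)cell.Attribute(name), out span) && span > 0
            ? span
            : 1;
    }
}
```

Calling `TableGrid.Expand` on the question's table (with the attribute quoted) should give two arrays, `1,2,3,4,5` and `1,1,1,1,2`. Filling a grid keyed by column index, rather than just repeating values while reading each row, is what lets a rowspan from an earlier row push later cells to the right.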