Screen Scraping HTML with C# [closed]
I have been given the task at work of screen scraping one of our legacy web apps to extract certain data from the code. The data is formatted and "should" be displayed exactly the same every time. I am just not sure how to go about doing this. It's a full HTML file with header and footer navigation, but in the middle of all this is the data I need.
I need to extract the Company Name value, Contact Name, Telephone, email address, etc.
Here is an example of what the code looks like:
...html above here
<br /><br />
<table cellpadding="0" cellspacing="12" border="0">
<tr>
<td valign="top" align="center">
<!-- Company Info -->
<table cellpadding="0" cellspacing="0" border="0">
<tr>
<td class="black">
<table cellspacing="1" cellpadding="0" border="0" width="370">
<tr>
<th>ABC INDUSTRIES</th>
</tr>
<tr>
<td class="search">
<table cellpadding="5" cellspacing="0" border="0" width="100%">
<tr>
<td>
<table cellpadding="1" cellspacing="0" border="0" width="100%">
<tr>
<td align="center" colspan="2"><hr></td>
</tr>
<tr>
<td align="right" nowrap><b><font color="FF0000">Contact Person <img src="/images/icon_contact.gif" align="absmiddle"> :</font></b></td>
<td align="left" width="100%"> Joe Smith</td>
</tr>
<tr>
<td align="right" nowrap><b><font color="FF0000">Phone Number <img src="/images/icon_phone.gif" align="absmiddle"> :</font></b></td>
<td align="left" width="100%"> 555-555-5555</td>
</tr>
<tr>
<td align="right" nowrap><b><font color="FF0000">E-mail Address <img src="/images/icon_email.gif" align="absmiddle"> :</font></b></td>
<td align="left" width="100%"> <a HREF="mailto:joe@joe.com">joe@joe.com</a></td>
</tr>
more...
There is more code on the screen in a different table structure that I also need to pull.
Are you just looking for suggestions on how to accomplish this? The HTML Agility Pack is probably going to be your best bet for DOM parsing in general. There may be a good bit of tinkering and trial and error to maintain your screen scrape (there usually is for that sort of thing), but that library is pretty good for parsing HTML.
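Technically, any XML parser (even native LINQ to XML) should do the trick as long as the markup is well-formed, but websites have a nasty habit of not being well-formed, so you may run into headaches here and there. The sample in the question, for example, contains an unclosed <hr> and a valueless nowrap attribute, both of which a strict XML parser will reject.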
In recent projects, I successfully used WebRequest and related classes to download the HTML from a URL, and then the SgmlReader parser to get access to the structured content.
If the page comments and table layout code are the same whenever called, I would pull the page into a string and use a series of .IndexOf and .Substring functions to parse out the data. Use the IndexOf function to find the starting and ending indexes of each field. Use these field indexes in the Substring function to grab the data.
It's not pretty but gets the job done.
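A minimal sketch of the IndexOf/Substring approach described above, assuming every value cell opens with the exact `<td align="left" width="100%">` tag shown in the question. The names `MarkerExtractor`, `TryBetween`, and `TryValueAfterLabel` are hypothetical helpers made up for illustration, not part of any library.

```csharp
using System;

public static class MarkerExtractor
{
    // Copies the trimmed text between the first startMarker found at or after
    // `from` and the next endMarker. Returns false if either marker is missing.
    public static bool TryBetween(string html, string startMarker, string endMarker,
                                  int from, out string value)
    {
        value = string.Empty;

        int start = html.IndexOf(startMarker, from, StringComparison.OrdinalIgnoreCase);
        if (start < 0) return false;
        start += startMarker.Length; // skip past the marker itself

        int end = html.IndexOf(endMarker, start, StringComparison.OrdinalIgnoreCase);
        if (end < 0) return false;

        value = html.Substring(start, end - start).Trim();
        return true;
    }

    // Locates a row label such as "Phone Number", then reads the value cell
    // that follows it. The cell tag is an assumption taken from the sample page.
    public static bool TryValueAfterLabel(string html, string label, out string value)
    {
        value = string.Empty;

        int labelPos = html.IndexOf(label, StringComparison.Ordinal);
        if (labelPos < 0) return false;

        return TryBetween(html, "<td align=\"left\" width=\"100%\">", "</td>",
                          labelPos, out value);
    }
}
```

With the page loaded into a string, `TryBetween(html, "<th>", "</th>", 0, out var company)` would pick up the company name, and `TryValueAfterLabel(html, "Phone Number", out var phone)` the phone number. The e-mail cell would come back with its `<a>` tag still attached, so it needs one more `TryBetween` pass or a different marker pair. Any change to the cell attributes breaks the match, which is the trade-off of this approach.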
HtmlDocument can be used to process HTML documents. See the following examples:
http://weblogs.asp.net/grantbarrington/archive/2009/10/15/screen-scraping-in-c.aspx
http://www.stupidiocy.com/development/web-scraping-using-c/
If you have the HTML stored in a string you can always use Regular Expressions with capture groups to parse the information you need.
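As a rough sketch of the capture-group idea, assuming each row follows the label/value layout in the question (a `<font>` label ending in an `<img>` and a colon, followed by a value `<td>`). The class `ContactRegexScraper` and its methods are hypothetical names, and the patterns are tuned to this sample only.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ContactRegexScraper
{
    // "label" captures the text before the icon; "value" captures the
    // contents of the next <td>. Singleline lets .*? span line breaks.
    private static readonly Regex RowPattern = new Regex(
        @"<font[^>]*>\s*(?<label>[^<]+?)\s*<img[^>]*>\s*:\s*</font>.*?<td[^>]*>\s*(?<value>.*?)\s*</td>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Strips any tags left inside a value, e.g. the <a> around the e-mail.
    private static readonly Regex TagPattern = new Regex(@"<[^>]+>");

    // The company name is assumed to be the first <th> on the page.
    private static readonly Regex CompanyPattern = new Regex(
        @"<th>\s*(?<name>[^<]+?)\s*</th>", RegexOptions.IgnoreCase);

    public static Dictionary<string, string> ExtractFields(string html)
    {
        var fields = new Dictionary<string, string>();
        foreach (Match m in RowPattern.Matches(html))
        {
            string label = m.Groups["label"].Value.Trim();
            string value = TagPattern.Replace(m.Groups["value"].Value, "").Trim();
            fields[label] = value;
        }
        return fields;
    }

    public static string ExtractCompanyName(string html)
    {
        Match m = CompanyPattern.Match(html);
        return m.Success ? m.Groups["name"].Value : string.Empty;
    }
}
```

`ExtractFields` returns a dictionary keyed by the label text as it appears on the page ("Contact Person", "Phone Number", "E-mail Address"). Regular expressions are brittle against markup changes, so this is only reasonable while the legacy page really does render the same way every time.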