Screen Scraping HTML with C# [closed]

2023-02-02 11:27 问答作者：

Closed. This question needs details or clarity. It is not currently accepting answers.

Want to improve this question? Add details and clarify the problem by editing this post.

Closed 8 years ago.

Improve this question

I have开发者_Go百科 been given the task at work of screen scraping one of our legacy web apps to extract certain data from the code. The data is formatted and "should" be displayed exactly the same every time. I am just not sure how to go about doing this. It's a full html file with header and footer navigations but in the middle of all this is the data I need.

I need to extract the Company Name value, Contact Name, Telephone, email address, etc.

Here is an example of what the code looks like:

...html above here

<br /><br />
<table cellpadding="0" cellspacing="12" border="0">
    <tr>
        <td valign="top" align="center">
            <!-- Company Info -->

            <table cellpadding="0" cellspacing="0" border="0">
                <tr>
                    <td class="black">
                        <table cellspacing="1" cellpadding="0" border="0" width="370">
                            <tr>
                                <th>ABC INDUSTRIES</th>
                            </tr>
                            <tr>
                                <td class="search">

                                    <table cellpadding="5" cellspacing="0" border="0" width="100%">
                                        <tr>
                                            <td>
                                                <table cellpadding="1" cellspacing="0" border="0" width="100%">
                                                   <tr>
                                                        <td align="center" colspan="2"><hr></td>
                                                    </tr>
                                                    <tr>
                                                        <td align="right" nowrap><b><font color="FF0000">Contact Person&nbsp;<img src="/images/icon_contact.gif" align="absmiddle">&nbsp;:</font></b></td>
                                                        <td align="left" width="100%">&nbsp;Joe Smith</td>
                                                    </tr>
                                                    <tr>
                                                        <td align="right" nowrap><b><font color="FF0000">Phone Number&nbsp;<img src="/images/icon_phone.gif" align="absmiddle">&nbsp;:</font></b></td>
                                                        <td align="left" width="100%">&nbsp;555-555-5555</td>
                                                    </tr>
                                                    <tr>
                                                        <td align="right" nowrap><b><font color="FF0000">E-mail Address&nbsp;<img src="/images/icon_email.gif" align="absmiddle">&nbsp;:</font></b></td>
                                                        <td align="left" width="100%">&nbsp;<a HREF="mailto:joe@joe.com">joe@joe.com</a></td>
                                                    </tr>
                                                    more...

There is more code on the screen in a different table structure that I also need to pull.

Are you just looking for suggestions on how to accomplish this? The HTML Agility Pack is probably going to be your best bet for DOM parsing in general. There may be a good bit of tinkering and trial and error to maintain your screen scrape (there usually is for that sort of thing), but that library is pretty good for parsing HTML.

Technically, any XML parsing (even native LINQ to XML) should do the trick, but websites have a nasty habit of not being well-formed so you may run into small headaches here and there.

In recent projects, I successfully used the WebRequestand related classed to download the HTML from an URL and then SgmlReader parser to actually get access to the structured content.

If the page comments and table layout code are the same whenever called, I would pull the page into a string and use a series of .IndexOf and .Substring functions to parse out the data. Use the IndexOf function to find the starting and ending indexes of each field. Use these field indexes in the Substring function to grab the data.

It's not pretty but gets the job done.

HtmlDocument can be used to process HTML documents. See following examples:

http://weblogs.asp.net/grantbarrington/archive/2009/10/15/screen-scraping-in-c.aspx

http://www.stupidiocy.com/development/web-scraping-using-c/

If you have the HTML stored in a string you can always use Regular Expressions with capture groups to parse the information you need.

继续阅读：screen-scraping

Screen Scraping HTML with C# [closed]

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？