Any thoughts on why I can't scrape a site?

Posted 2022-12-17 05:45

I am building a site that needs to scrape information from a partner site. My scraping code works well with other sites, but not with this one. It is a regular .html site. My guess is that the pages might be generated somehow with PHP (the site is built with PHP).

I am only guessing about the generated part, so I would appreciate your expert help. In case it matters, here is the code I use. HtmlDocument comes from HtmlAgilityPack, but I don't think that is related. The result is null on the site I am trying.

        string result;
        var objRequest = System.Net.HttpWebRequest.Create(strUrl);
        var objResponse = objRequest.GetResponse();

        using (var sr = new StreamReader(objResponse.GetResponseStream()))
        {
            result = sr.ReadToEnd();
            sr.Close();

            var doc = new HtmlDocument();
            doc.LoadHtml(result);                

            foreach (var c in doc.DocumentNode.SelectNodes("//a[@href]"))
            {
                litStatus.Text += c.Attributes["href"].Value + "<br />";
            }
        }

EDIT:

This is from the W3C validator. Might it have something to do with the problem?

Sorry, I am unable to validate this document because on line 422 it contained one or more bytes that I cannot interpret as utf-8 (in other words, the bytes found are not valid values in the specified Character Encoding). Please check both the content of the file and the character encoding indication.

The error was: utf8 "\xA9" does not map to Unicode

I would start by seeing what response I get from something simple like wget, or by using a tool like Fiddler (http://www.fiddler2.com/fiddler2/) to test the response and check any headers you are getting back.

Sometimes sites return different responses for different User-Agent strings and other request headers, so you may need to adjust your request headers and masquerade as a browser to get the data you are looking for. If you run Fiddler on the same machine that runs the script, you should be able to see exactly what differs between a request for the page from your browser and a request for the page from your script.
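As a rough illustration of adjusting the request headers, here is a minimal sketch that makes an HttpWebRequest look more like a browser request. The helper name `BrowserLikeRequest.Create` and the header values are assumptions chosen for the example; ideally you would copy the exact values your own browser sends, as shown in Fiddler.

```csharp
using System.Net;

public static class BrowserLikeRequest
{
    // Hypothetical helper: builds a request carrying browser-like headers.
    // The header values are example values only; copy the real ones from
    // a browser request captured in Fiddler.
    public static HttpWebRequest Create(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.UserAgent =
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 " +
            "(KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36";
        request.Accept = "text/html,application/xhtml+xml";
        request.Headers["Accept-Language"] = "en-US,en;q=0.8";
        return request;
    }
}
```

Calling `BrowserLikeRequest.Create(strUrl).GetResponse()` in place of the original request would then send these headers along.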

There may even be a simple 302 redirect or something like that going on that your code isn't following.

If you can access the page with a browser then you will definitely be able to access it by sending exactly the same request as your browser would send.

Edit: Fiddler is slightly trickier to use from your own code because it behaves as a proxy. It registers itself with regular browsers automatically, but you have to tell your code manually to go through a proxy on 127.0.0.1 port 8888 for Fiddler to see your requests.
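A minimal sketch of routing the request through Fiddler, assuming Fiddler is listening on 127.0.0.1 port 8888 as described above. The helper name `FiddlerProxy.CreateThroughFiddler` is made up for this example.

```csharp
using System.Net;

public static class FiddlerProxy
{
    // Hypothetical helper: sends the request through a local proxy so that
    // Fiddler can record it. Assumes Fiddler listens on 127.0.0.1:8888.
    public static HttpWebRequest CreateThroughFiddler(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Proxy = new WebProxy("127.0.0.1", 8888);
        return request;
    }
}
```

If Fiddler is not running, the request will fail to connect, so this is best kept as a debugging aid rather than left in production code.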

To troubleshoot, check the value of objResponse.StatusCode and objResponse.StatusDescription:

string result;
var objRequest = System.Net.HttpWebRequest.Create(strUrl);
var objResponse = (System.Net.HttpWebResponse) objRequest.GetResponse();

Console.WriteLine(objResponse.StatusCode);
Console.WriteLine(objResponse.StatusDescription);
...

The problem appears to be the character in the comment on line 421:

<!-- KalenderMx v1.4 © by shiba-design.de -->

That byte (0xA9) is not valid UTF-8 on its own, although it is the copyright sign in the character encoding the page declares, iso-8859-1:

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

You might try running the downloaded string through a filter that converts or removes the offending characters before passing it to HtmlAgilityPack's LoadHtml().
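Two possible ways to do that are sketched below, under the assumption that the page really is encoded as iso-8859-1 as its meta tag says. The class and method names are invented for the example. The first method decodes the response stream with the declared encoding instead of StreamReader's UTF-8 default; the second simply strips the U+FFFD replacement characters that a wrong decoding leaves behind.

```csharp
using System.IO;
using System.Text;

public static class CharsetDecoding
{
    // Reads the response body using iso-8859-1 rather than the
    // StreamReader default of UTF-8 (assumes the meta tag is accurate).
    public static string DecodeLatin1(Stream body)
    {
        using (var reader = new StreamReader(body, Encoding.GetEncoding("iso-8859-1")))
        {
            return reader.ReadToEnd();
        }
    }

    // Fallback filter: removes U+FFFD replacement characters from a string
    // that was already decoded with the wrong encoding.
    public static string StripReplacementChars(string html)
    {
        return html.Replace("\uFFFD", string.Empty);
    }
}
```

With the original code, this would mean passing `objResponse.GetResponseStream()` to `DecodeLatin1` and handing the returned string to `doc.LoadHtml()`.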
