Any thoughts on why I can't scrape a site?
I am building a site that needs to scrape information from a partner site. My scraping code works great with other sites, but not with this one. It is a regular .html site. My guess is that it might be generated somehow with PHP (the site is built with PHP).
I have no idea; I am just guessing about the generated part, and I would appreciate your help on this. In case it matters, here is the code I use. HtmlDocument comes from HtmlAgilityPack, but that has nothing to do with it. The result is null on the site I am trying.
string result;
var objRequest = System.Net.HttpWebRequest.Create(strUrl);
var objResponse = objRequest.GetResponse();
using (var sr = new StreamReader(objResponse.GetResponseStream()))
{
result = sr.ReadToEnd();
sr.Close();
var doc = new HtmlDocument();
doc.LoadHtml(result);
foreach (var c in doc.DocumentNode.SelectNodes("//a[@href]"))
{
litStatus.Text += c.Attributes["href"].Value + "<br />";
}
}
EDIT:
This is from the W3C validator; might it have something to do with this?
Sorry, I am unable to validate this document because on line 422 it contained one or more bytes that I cannot interpret as utf-8 (in other words, the bytes found are not valid values in the specified Character Encoding). Please check both the content of the file and the character encoding indication.
The error was: utf8 "\xA9" does not map to Unicode
I would start by seeing what response you get from something simple like wget, or by using a tool like Fiddler (http://www.fiddler2.com/fiddler2/) to test the response and check any headers you are getting back.
Sometimes sites return different responses for different user-agent strings and so on, so you may need to adjust your request headers and masquerade as a different browser to get the data you are looking for. If you are using Fiddler on the same machine that is running the script, you should be able to see exactly what differs between a request for the page from your browser and a request for the page from your script.
There may even be a simple 302 redirect or something like that going on that your code isn't following.
If you can access the page with a browser then you will definitely be able to access it by sending exactly the same request as your browser would send.
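As a rough sketch of what "sending the same request as your browser" could look like with HttpWebRequest: the helper below only builds the request, it does not send it. The class name, the method name, and the header values are placeholders I made up for illustration; the real values should be copied from whatever Fiddler shows your browser sending.

```csharp
using System;
using System.Net;

static class BrowserLikeRequest
{
    // Builds (but does not send) a request that looks more like a browser's.
    // The header values below are illustrative assumptions, not known-good values.
    public static HttpWebRequest Create(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)";
        request.Accept = "text/html,application/xhtml+xml";
        // Follow 301/302 redirects (this is already the default for HttpWebRequest).
        request.AllowAutoRedirect = true;
        // Keep cookies across redirects, in case the site sets one before redirecting.
        request.CookieContainer = new CookieContainer();
        return request;
    }
}
```

You would then call GetResponse() on the returned request exactly as in the question's code.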
Edit: Fiddler is slightly trickier to use from your own code because it behaves as a proxy. It registers itself with regular browsers automatically, but you would have to manually tell your code to go through a proxy on 127.0.0.1 port 8888 in order for Fiddler to see your requests.
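A minimal sketch of pointing the request at Fiddler, assuming Fiddler is listening on 127.0.0.1 port 8888 as described above. The class and method names are hypothetical, and the request is only built here, not sent.

```csharp
using System;
using System.Net;

static class FiddlerProxy
{
    // Builds a request that is routed through a local proxy on 127.0.0.1:8888,
    // so that Fiddler can capture the traffic generated by your own code.
    public static HttpWebRequest CreateViaFiddler(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Proxy = new WebProxy("127.0.0.1", 8888);
        return request;
    }
}
```

Remember to remove the proxy setting again once you are done debugging, since the request will fail when Fiddler is not running.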
To troubleshoot, check the value of objResponse.StatusCode and objResponse.StatusDescription:
string result;
var objRequest = System.Net.HttpWebRequest.Create(strUrl);
var objResponse = (System.Net.HttpWebResponse) objRequest.GetResponse();
Console.WriteLine(objResponse.StatusCode);
Console.WriteLine(objResponse.StatusDescription);
...
The problem appears to be the character in the comment on line 421:
<!-- KalenderMx v1.4 © by shiba-design.de -->
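That character is the byte 0xA9, which is the © sign in iso-8859-1 but is not valid UTF-8 on its own, which is what the validator is complaining about. The page declares iso-8859-1: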
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
You might try running the downloaded document string through a filter to convert or remove the offending characters before evaluating it with HtmlAgilityPack's LoadHtml().
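An alternative to filtering afterwards is to decode the response with the encoding the page declares, since a StreamReader created without an encoding argument assumes UTF-8. The sketch below is only an illustration under that assumption: the class and method names are made up, and it takes any Stream so it can be tried without a network call.

```csharp
using System;
using System.IO;
using System.Text;

static class Latin1Reader
{
    // Reads the whole stream as ISO-8859-1 instead of StreamReader's default UTF-8,
    // so a lone 0xA9 byte decodes to the copyright sign rather than a replacement character.
    public static string ReadAsLatin1(Stream stream)
    {
        var latin1 = Encoding.GetEncoding("iso-8859-1");
        using (var reader = new StreamReader(stream, latin1))
        {
            return reader.ReadToEnd();
        }
    }
}
```

In the question's code this would be used as `result = Latin1Reader.ReadAsLatin1(objResponse.GetResponseStream());` before calling `doc.LoadHtml(result)`.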