Masking your web scraping activity to look like normal browser surfing?
I'm using the Html Agility Pack and I keep getting this error on certain pages: "The remote server returned an error: (500) Internal Server Error."
Now I'm not sure what this is, as I can use Firefox to get to these pages without any problems.
I have a feeling the website itself is blocking my requests rather than sending a real response. Is there a way I can make my Html Agility Pack call look more like a request coming from Firefox?
I've already set a timer in there so it only sends a request to the website every 20 seconds.
Is there any other method I can use?
Set a User-Agent similar to that of a regular browser. The User-Agent is an HTTP header passed by the HTTP client (the browser) to identify itself to the server.
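For example, here's a minimal sketch of setting browser-like headers with HtmlWeb (assuming its UserAgent property and PreRequest hook; the UA string, referrer, and URL are just placeholders):

```csharp
using System.Net;
using HtmlAgilityPack;

var web = new HtmlWeb();

// Pretend to be Firefox (any current browser UA string will do).
web.UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0";

// PreRequest lets you tweak the underlying HttpWebRequest before it is sent.
web.PreRequest = request =>
{
    request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
    request.Headers["Accept-Language"] = "en-US,en;q=0.5";
    request.Referer = "http://www.example.com/";   // placeholder referrer
    return true;                                   // true = continue with the request
};

HtmlDocument doc = web.Load("http://www.example.com/some-page"); // placeholder URL
```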
There are a lot of ways servers can detect scraping, and it's really just an arms race between the scraper and the scrapee(?), depending on how badly one side wants to access the data and the other wants to protect it. Some of the things that help you go undetected are:
- Make sure all HTTP headers sent over are the same as a normal browser's, especially the User-Agent and the URL referrer (Referer header).
- Download all images and CSS scripts like a normal browser would, in the order a browser would.
- Make sure any cookies that are set are sent back with each subsequent request (see the sketch after this list).
- Make sure requests are throttled according to the site's robots.txt.
- Make sure you aren't following any no-follow links, because the server could be setting up a honeypot and stop serving requests from your IP.
- Get a bunch of proxy servers to vary your IP address.
- Make sure the site hasn't started sending you CAPTCHAs because they think you are a robot.
Again, the list could go on depending on how sophisticated the server setup is.
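As a rough illustration of the cookie, throttling, and proxy points above, here is one way to wire them together with HtmlWeb (assuming its UseCookies and PreRequest members; the proxy addresses, URLs, and delay are made-up values):

```csharp
using System;
using System.Net;
using System.Threading;
using HtmlAgilityPack;

var cookies = new CookieContainer();          // shared across all requests
var proxies = new[]                           // made-up proxy addresses
{
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080"
};
int proxyIndex = 0;

var web = new HtmlWeb();
web.UseCookies = true;
web.PreRequest = request =>
{
    request.CookieContainer = cookies;                   // send cookies back on every request
    request.Proxy = new WebProxy(proxies[proxyIndex]);   // rotate the outgoing IP
    proxyIndex = (proxyIndex + 1) % proxies.Length;
    return true;
};

var urls = new[] { "http://www.example.com/a", "http://www.example.com/b" }; // placeholder URLs
foreach (var url in urls)
{
    HtmlDocument doc = web.Load(url);
    // ... parse doc here ...
    Thread.Sleep(TimeSpan.FromSeconds(20));   // throttle, per the site's robots.txt / your own timer
}
```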