Masking your web scraping activity to look like normal browser surfing?
I'm using the Html Agility Pack and I keep getting this error on certain pages: "The remote server returned an error: (500) Internal Server Error."
Now I'm not sure what this is, as I can use Firefox to get to these pages without any problems.
I have a feeling the website itself is blocking my requests rather than sending a real response. Is there a way I can make my Html Agility Pack call look more like a request coming from Firefox?
I've already set a timer in there so it only sends a request to the website every 20 seconds.
Is there any other method I can use?
Set a User-Agent similar to that of a regular browser. The User-Agent is an HTTP header passed by the HTTP client (the browser) to identify itself to the server.
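For example, here's a minimal sketch of setting browser-like headers with HtmlWeb (assuming its UserAgent property and PreRequest hook; the UA string, referrer, and URL are just placeholders):

```csharp
using System.Net;
using HtmlAgilityPack;

var web = new HtmlWeb();

// Pretend to be Firefox (any current browser UA string will do).
web.UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0";

// PreRequest lets you tweak the underlying HttpWebRequest before it is sent.
web.PreRequest = request =>
{
    request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
    request.Headers["Accept-Language"] = "en-US,en;q=0.5";
    request.Referer = "http://www.example.com/";   // placeholder referrer
    return true;                                   // true = continue with the request
};

HtmlDocument doc = web.Load("http://www.example.com/some-page"); // placeholder URL
```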
There are a lot of ways servers can detect scraping, and it's really just an arms race between the scraper and the scrapee(?), depending on how badly one side wants to access the data and the other wants to protect it. Some of the things that help you go undetected are:
- Make sure all HTTP headers sent over are the same as a normal browser's, especially the User-Agent and the URL referrer (Referer header).
- Download all images and CSS scripts like a normal browser would, in the order a browser would.
- Make sure any cookies that are set are sent back with each subsequent request (see the sketch after this list).
- Make sure requests are throttled according to the site's robots.txt.
- Make sure you aren't following any no-follow links, because the server could be setting up a honeypot and stop serving requests from your IP.
- Get a bunch of proxy servers to vary your IP address.
- Make sure the site hasn't started sending you CAPTCHAs because they think you are a robot.
Again, the list could go on depending on how sophisticated the server setup is.
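As a rough illustration of the cookie, throttling, and proxy points above, here is one way to wire them together with HtmlWeb (assuming its UseCookies and PreRequest members; the proxy addresses, URLs, and delay are made-up values):

```csharp
using System;
using System.Net;
using System.Threading;
using HtmlAgilityPack;

var cookies = new CookieContainer();          // shared across all requests
var proxies = new[]                           // made-up proxy addresses
{
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080"
};
int proxyIndex = 0;

var web = new HtmlWeb();
web.UseCookies = true;
web.PreRequest = request =>
{
    request.CookieContainer = cookies;                   // send cookies back on every request
    request.Proxy = new WebProxy(proxies[proxyIndex]);   // rotate the outgoing IP
    proxyIndex = (proxyIndex + 1) % proxies.Length;
    return true;
};

var urls = new[] { "http://www.example.com/a", "http://www.example.com/b" }; // placeholder URLs
foreach (var url in urls)
{
    HtmlDocument doc = web.Load(url);
    // ... parse doc here ...
    Thread.Sleep(TimeSpan.FromSeconds(20));   // throttle, per the site's robots.txt / your own timer
}
```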