Meta refresh download / C# crawler

I am trying to create a crawler to download some content from a web site.

Assuming that the url to consume is something like

clickUrl ="http://www.example.com/idocs-nph/search/pdfViewerForm.html?args=5C7QrtC22wGYK2xFpSwMnXdtvSoClrL8xJKSjjboeVQpCCmqt4mgGEHlbmahCJFQEmRQwePEviF8EeCoaT0MAKztT3Sb63xk3VkL3PiCQ3RLoVYQqjKiogfu8Gq1RKKQmyoZK8o4WQM0kj-3nPY6gOqNXOY8VS4VhacAYKom_mBgul0xmRvgLA..";

In a web browser, the request returns an HTML page containing a refresh META tag:

<meta http-equiv="REFRESH" content="0;url=http://www.example.com/idocs-nph/search/pdfViewerForm.html?args=5C7QrtC22wGYK2xFpSwMnXdtvSoClrL8xJKSjjboeVQpCCmqt4mgGEHlbmahCJFQEmRQwePEviF8EeCoaT0MAKztT3Sb63xk3VkL3PiCQ3TmKpPQrAvPZQfu8Gq1RKKQmyoZK8o4WQMl05IxFu8XBzuJ49RIAPXJ8d-HneKenBQ-TKbP_e17qQ.."/>

The browser then asks for a file name to save the file.

In my crawler code, I open a WebRequest to clickUrl:

HttpWebRequest req = (HttpWebRequest)WebRequest.Create(clickUrl);

I detect the REFRESH URL and follow it with a new WebRequest, but the response is another HTML page that also contains a REFRESH META tag, not the actual file, so the process just repeats.

HttpWebRequest does not contain any cookies


It is very likely that the site is checking for cookies. Sites do this so that when you send someone a direct link to the download, the recipient is still sent through the site before he or she can download the file.
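If cookies are indeed the issue, one thing to try is sharing a single CookieContainer across all requests, since HttpWebRequest.CookieContainer is null by default and cookies set by the first response are then never sent back. The following is only a minimal sketch under that assumption; the class name CookieAwareFetcher and the method CreateRequest are hypothetical helpers, not part of the original code.

```csharp
using System;
using System.Net;

static class CookieAwareFetcher
{
    // One container shared by every request, so that cookies set by the
    // first response are sent back on the follow-up request.
    private static readonly CookieContainer Cookies = new CookieContainer();

    // Hypothetical helper: builds a request that participates in the shared cookie jar.
    public static HttpWebRequest CreateRequest(string url)
    {
        HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
        req.CookieContainer = Cookies; // null by default, in which case cookies are dropped
        return req;
    }
}
```

Both the request to clickUrl and the request to the REFRESH url would then be created through CreateRequest instead of calling WebRequest.Create directly.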

SourceForge does something interesting here which may help. If you go to download a file from SourceForge, you land on a page like the one you describe. However, if you open the exact same page using wget, you'll see that it actually serves the file. It detects that you're not a normal browser and sends you the file (the HTML isn't going to do any good with wget anyway; it's not going to look at the advertisements).

I suggest you try the following. When you find a page that has such a redirect, follow it. If you then get the same kind of page back, try again without a User-Agent header. Maybe that will actually give you the file.
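The steps above could be sketched roughly as follows. This is an illustration, not a tested solution for the site in question: the class MetaRefresh, the methods ExtractRefreshUrl and Download, the hop limit, and the "Mozilla/5.0" User-Agent string are all assumptions. The regular expression only handles the simple tag form shown in the question (http-equiv before content, url value not separately quoted).

```csharp
using System;
using System.IO;
using System.Net;
using System.Text.RegularExpressions;

static class MetaRefresh
{
    // Matches <meta http-equiv="refresh" content="N;url=..."> and captures the URL.
    private static readonly Regex RefreshPattern = new Regex(
        @"<meta[^>]*http-equiv\s*=\s*[""']?refresh[""']?[^>]*content\s*=\s*[""']\s*\d+\s*;\s*url\s*=\s*([^""'>]+)",
        RegexOptions.IgnoreCase);

    // Returns the refresh target, or null if the page has no meta refresh tag.
    public static string ExtractRefreshUrl(string html)
    {
        Match m = RefreshPattern.Match(html);
        return m.Success ? WebUtility.HtmlDecode(m.Groups[1].Value.Trim()) : null;
    }

    // Follows meta refresh pages until a non-HTML response (assumed to be the file) arrives.
    // Returns the file bytes, or null if no file was reached within maxHops.
    public static byte[] Download(string url, CookieContainer cookies, int maxHops = 5)
    {
        bool sendUserAgent = true;
        for (int hop = 0; hop < maxHops; hop++)
        {
            HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
            req.CookieContainer = cookies;
            if (sendUserAgent) req.UserAgent = "Mozilla/5.0"; // placeholder value

            using (HttpWebResponse resp = (HttpWebResponse)req.GetResponse())
            using (Stream body = resp.GetResponseStream())
            {
                string contentType = resp.ContentType ?? "";
                if (!contentType.StartsWith("text/html", StringComparison.OrdinalIgnoreCase))
                {
                    // Not HTML: treat the body as the downloaded file.
                    using (MemoryStream buffer = new MemoryStream())
                    {
                        body.CopyTo(buffer);
                        return buffer.ToArray();
                    }
                }

                string html = new StreamReader(body).ReadToEnd();
                string next = ExtractRefreshUrl(html);
                if (next == null) return null;      // ordinary HTML page, nothing to follow
                if (hop > 0) sendUserAgent = false; // a followed refresh led to another refresh page
                url = next;
            }
        }
        return null;
    }
}
```

A call such as MetaRefresh.Download(clickUrl, new CookieContainer()) would start from the original link; the hop limit keeps the loop from running forever if the site keeps returning refresh pages.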
