
Is there a way to get files from a webserver when directory listing is deactivated?

I'm trying to build a "crawler", or an "automatic downloader", for every file hosted on a webserver / webpage.

As I see it, there are two cases:

1) Directory listing is enabled. Then it's easy: read the listing and download every file it shows.

2) Directory listing is disabled. What then? The only idea I have is to brute-force filenames and check the server's response (e.g. 404 for a missing file, 403 for an existing directory, and actual data for a correctly guessed file).
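
For example, something like this is what I have in mind (a rough sketch; the wordlist and base URL are just placeholders):

```python
import requests

# placeholder wordlist and base URL -- only to illustrate the idea
candidates = ["index.html", "backup.zip", "data.csv"]
base = "http://example.com/files/"

for name in candidates:
    # HEAD avoids downloading the body while still returning the status code
    r = requests.head(base + name, timeout=5)
    if r.status_code == 200:
        print("found:", name)
    elif r.status_code == 403:
        print("exists but forbidden:", name)
    # 404 means the name does not exist, so move on to the next candidate
```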

Is my idea right? Is there a better way?


You can always parse the HTML and follow ('crawl') the links you find. This is how most crawlers are implemented.

Check these libraries out that could help you do it:

  1. .NET: Html Agility Pack

  2. Python: Beautiful Soup

  3. PHP: Simple HTML DOM
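
For example, with Python and Beautiful Soup, a minimal link-following crawler could look roughly like this (the starting URL, page limit, and use of requests are my assumptions, not anything prescribed):

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse

def crawl(start_url, max_pages=50):
    """Breadth-first crawl that collects pages linked on the same host."""
    seen, queue = set(), [start_url]
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        resp = requests.get(url, timeout=10)
        # only parse HTML responses
        if "text/html" not in resp.headers.get("Content-Type", ""):
            continue
        soup = BeautifulSoup(resp.text, "html.parser")
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            # stay on the starting host
            if urlparse(link).netloc == urlparse(start_url).netloc:
                queue.append(link)
    return seen

print(crawl("http://example.com/"))
```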

ALWAYS look for robots.txt in the site's root and make sure you respect the site's rules on which pages are allowed to be crawled.
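
In Python you can check this with the standard library's urllib.robotparser; here is a small sketch (the user agent string and URLs are placeholders):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://example.com/robots.txt")  # placeholder site
rp.read()

# ask whether our (hypothetical) user agent may fetch a given page
if rp.can_fetch("MyCrawler", "http://example.com/some/page.html"):
    print("allowed to crawl")
else:
    print("disallowed by robots.txt")
```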


You shouldn't index pages that the webmaster has asked you not to crawl.

That is exactly what robots.txt is for.

You should also check for a sitemap file in each folder (the format is described by the sitemaps.org protocol).

It is usually named sitemap.xml, or its location is sometimes given in robots.txt.
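
As an illustration, a sitemap can be read with Python's standard XML module; this sketch assumes the common sitemap.xml location and the sitemaps.org namespace:

```python
import requests
import xml.etree.ElementTree as ET

# placeholder URL; the real location is often given in robots.txt ("Sitemap: ...")
resp = requests.get("http://example.com/sitemap.xml", timeout=10)
root = ET.fromstring(resp.content)

# <url><loc>...</loc></url> entries use the sitemaps.org namespace
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
for loc in root.findall(".//sm:loc", ns):
    print(loc.text)
```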
