Web spider/crawler in C# Windows Forms

I have created a web crawler in Visual C#. The crawler indexes certain information from .nl sites by brute-forcing all of the possible .nl addresses, from http://aa.nl to (theoretically) http://zzzzzzzzzzzzzzzzzzzz.nl.

It works all right, except that it takes an incredibly long time just to go through the two-letter domains: aa, ab ... zz. I calculated how long it would take to go through all of the domains in this fashion and got about a thousand years.

I tried to speed this up with threading, but with 1300 threads running at the same time, WebClient just kept failing, making the resulting data file too inaccurate to be usable.

I do not have access to anything other than a 5 Mb/s internet connection, an E6300 Core 2 Duo and 2 GB of 533@667 MHz RAM on Win7.

Does anybody have an idea what to do to make this work? Any idea will do. Thank you.


The combinatorial explosion makes this impossible to do (unless you can wait several months at the very least). What I would try instead is to contact SIDN, the registry responsible for the .nl TLD, and ask them for the list.


IMO such an implementation of a web crawler is not appropriate:

  1. The number of pings you need to do for one crawl is ~10^29
  2. Say every ping takes 200 ms
  3. Time for processing: 100 ms

Total time estimate: 3*10^4 * 10^29 ms ~ 3*10^23 years. Please correct me if I am wrong.

If you want to take advantage of threading, you need a dedicated core for each thread. Each thread will also take at least 1 MB of your memory.

Threading will not help you here; at best you would theoretically reduce the time to ~3*10^20 years.

The exceptions you get are likely to be the result of thread synchronization issues.
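Since the estimate above invites correction, here is a minimal sketch for rechecking it. It assumes names use only the letters a-z (no digits or hyphens) and lengths from 2 to 20, and the names `SearchSpace`, `CountNames` and `YearsToProbe` are made up for illustration. Under those assumptions there are roughly 2*10^28 candidate names, and at 300 ms per name the sequential time comes out on the order of 10^20 years: lower than the figure above, but still hopeless.

```csharp
using System;
using System.Numerics;

public static class SearchSpace
{
    // Number of distinct names with length minLen..maxLen (inclusive)
    // over an alphabet of alphabetSize characters.
    public static BigInteger CountNames(int alphabetSize, int minLen, int maxLen)
    {
        BigInteger total = BigInteger.Zero;
        for (int len = minLen; len <= maxLen; len++)
        {
            total += BigInteger.Pow(alphabetSize, len);
        }
        return total;
    }

    // Wall-clock years needed to probe every name, assuming a fixed cost
    // per name and `parallel` probes that overlap perfectly (an idealisation).
    public static double YearsToProbe(BigInteger names, double msPerName, int parallel)
    {
        const double msPerYear = 365.25 * 24 * 60 * 60 * 1000;
        return (double)names * msPerName / parallel / msPerYear;
    }

    public static void Main()
    {
        BigInteger names = CountNames(26, 2, 20);
        Console.WriteLine("names: " + names);
        Console.WriteLine("years, 1 probe at a time: " + YearsToProbe(names, 300, 1));
        Console.WriteLine("years, 1300 in parallel:  " + YearsToProbe(names, 300, 1300));
    }
}
```

Even with perfect parallelism, dividing by a thousand or so only removes three orders of magnitude, which is why enumerating the name space is the wrong approach regardless of how the requests are issued.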


The HTTP support in .NET has a low default limit on concurrent connections, somewhere in the single digits I think (see ServicePointManager.DefaultConnectionLimit).

If you create more HTTP requests than that, many of them will be forced to wait for an available connection, and as a result they will time out long before they ever get one, causing valid URIs to appear invalid.
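One way to avoid those spurious timeouts is to cap the number of requests in flight instead of starting 1300 threads. The following is only a sketch under assumptions: it targets the .NET Framework WebClient stack (where ServicePointManager applies), and the names `ThrottledCrawler`, `ProbeAllAsync`, `RespondsAsync` and `CrawlAsync` are hypothetical, not part of any library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Threading;
using System.Threading.Tasks;

public static class ThrottledCrawler
{
    // Runs `probe` for every URL while allowing at most `maxConcurrent`
    // probes in flight at once. Returns the URLs whose probe returned true,
    // in the same order as the input.
    public static async Task<List<string>> ProbeAllAsync(
        IEnumerable<string> urls, Func<string, Task<bool>> probe, int maxConcurrent)
    {
        using (var gate = new SemaphoreSlim(maxConcurrent))
        {
            var tasks = urls.Select(async url =>
            {
                await gate.WaitAsync();          // wait for a free slot
                try { return new { url, ok = await probe(url) }; }
                finally { gate.Release(); }      // always free the slot
            }).ToList();

            var results = await Task.WhenAll(tasks);
            return results.Where(r => r.ok).Select(r => r.url).ToList();
        }
    }

    // Example probe: true if the page could be downloaded at all.
    public static async Task<bool> RespondsAsync(string url)
    {
        try
        {
            using (var client = new WebClient())
            {
                await client.DownloadStringTaskAsync(url);
                return true;
            }
        }
        catch (WebException)
        {
            return false; // DNS failure, timeout, HTTP error, ...
        }
    }

    // Example wiring: raise the connection limit to match the throttle.
    public static Task<List<string>> CrawlAsync(IEnumerable<string> urls, int maxConcurrent)
    {
        ServicePointManager.DefaultConnectionLimit = maxConcurrent;
        return ProbeAllAsync(urls, RespondsAsync, maxConcurrent);
    }
}
```

Because the probe is passed in as a delegate, the throttling can be exercised with a fake probe before pointing it at real sites. Note that this sketch creates one task per URL up front, so a very large candidate list should be fed to it in batches; the right value for `maxConcurrent` depends on the connection and is something to measure rather than guess.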
