Web spider/crawler in C# Windows.forms
I have created a web crawler in VC#. The crawler indexes certain information from .nl sites by brute-forcing all of the possible .nl addresses, starting with http://aa.nl to (theoretically) http://zzzzzzzzzzzzzzzzzzzz.nl.
It works all right except that it takes incredibly long time only to go through the two-letter domains - aa, ab ... zz. I calculated how long it would take me to go through all of the domains in this fashion and I got about a thousand years.
I tried to accelerate this by threading but with 1300 threads running at the same time, WebClient just kept failing, making the resultant data file too inaccurate to be usable.
I do not have access to anything else that a 5Mb/s internet connection, E6300 Core2duo and 2GB of 533@667mhz RAM on Win7.
Does anybody have an idea what to do to make this work? Any开发者_JAVA技巧 idea will do. Thank you
The combinatorial explosion makes this impossible to do (unless you can wait several months at the very least). What I would try instead is to contact SIDN, who is the authority for the .nl TLD and ask them for the list.
IMO such implementation of a web crawler is not appropriate
- The number of pings you need to do for one crawl is ~ 1029
- Say every ping takes 200ms
- Time for processing 100 ms
Total time estimate 3*104*1029 ms ~ 3*1023 years. Please correct me if I am wrong.
If you want to take advantage of threading you need to have a dedicated core per each thread. Each thread will at least take 1+ MB of your memory.
Threading will not help you here, you will be able to hypotheoretically reduce the time to ~ 3*1020 years
Exceptions that you get are likely to be the result of the thread synchronization issues.
The HTTP support in .Net has a maximum concurrent connections limit of around 8 by default I think (somewhere around that figure anyway)
If you create more HTTP requests many of them will be forced to wait for an available connection and as a result will time out long before they ever get one leading valid URIs to appear invalid.
精彩评论