Language recommendations for an efficient web crawler
I'm looking for a language for writing an efficient web crawler. Things I value:
- expressive language (don't make me jump through static typing hoops)
- useful libraries (a css selector based html parser would be nice)
- minimal memory footprint
- dependable language runtime & libraries
I tried node.js. I like node in theory. Javascript is very expressive. You can use jQuery to parse html. Node's async nature lets me crawl many urls in parallel without dealing with threads. V8 is nice and fast for parsing.
In practice, Node isn't working out for me. My process crashes constantly: bus errors, exceptions in the event manager, and so on.
I've done a fair bit of Ruby dev, so I wouldn't mind using Ruby 1.9's coroutines (fibers?) as long as I won't face similar issues with VM / library stability.
Additional suggestions?
Use Node.js, and fix whatever is crashing it. It's been running on my Ubuntu box without any problems for months.
For the library, I recommend using YUI3 instead of jQuery; it lets you build a web crawler/scraper in a couple of minutes. If you don't believe me, watch this talk from YUIConf 2010. It's 40 minutes, but it's all about code.
Dav Glass does a great job of showing how easy it is and how little code you need. Yes, there were some issues with different versions of jsdom in the talk, but it was given at the beginning of November, so much of that should have been fixed by now.
You can check out all the stuff from the talk at his GitHub page.
And here's his scraper that gets the current news headlines from Digg.
Seriously, it's more than worth the effort to get Node.js running on your system, since in the end you get all the awesomeness of YUI3 on the server side.
I'm pretty sure any language has libraries built for this. Are you sure Node.js isn't crashing because of a problem in your own code? Why not use Ruby if you're comfortable with it?
There's also BeautifulSoup (Python), which you might consider if your main hurdle is HTML parsing.
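As a rough sketch (not part of the original answer), CSS-selector-based parsing with BeautifulSoup looks like this; it assumes the `beautifulsoup4` package is installed and uses a placeholder URL:

```python
from urllib.request import urlopen
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# Fetch a page (example.com is just a placeholder)
html = urlopen("http://example.com/").read()

# Parse it and pull out elements with CSS selectors
soup = BeautifulSoup(html, "html.parser")
for link in soup.select("a[href]"):  # every anchor with an href attribute
    print(link.get_text(strip=True), link["href"])
```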
Go with the language most familiar to you or the language you want to learn most. You can write a web crawler in any language.
I've personally developed crawlers in Java, Ruby, and Perl. All of these languages met your requirements. (Yes, even the crawler in Java had a reasonable memory footprint.) Of these, Java was my favorite because it boasted the most mature HTTP and HTML libraries. If I find myself writing another, I want to try Python next.
The first algorithmic problem you'll face is the task of efficiently identifying the pages you've already visited. This index of URLs can grow very large and must support fast lookups and insertions. A common database index will work in early crawler prototypes but will quickly prove to be the bottleneck.
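To illustrate (this is my own sketch, not the answerer's code), the naive in-memory approach is just a set of normalized URLs; at real scale you'd swap the set for a Bloom filter or a disk-backed store:

```python
from urllib.parse import urldefrag, urlsplit, urlunsplit

visited = set()  # naive in-memory index of URLs we've already seen

def normalize(url):
    """Drop fragments and lower-case scheme/host so trivially different
    spellings of the same URL don't get crawled twice."""
    url, _fragment = urldefrag(url)
    parts = urlsplit(url)
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path or "/", parts.query, ""))

def should_visit(url):
    key = normalize(url)
    if key in visited:
        return False
    visited.add(key)
    return True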
Python and BeautifulSoup: easy to learn and very efficient.
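To make that concrete, here is a minimal single-threaded crawl loop tying fetching, parsing, and the visited set together. It's only a sketch: the seed URL and page limit are placeholders, and it assumes `beautifulsoup4` is installed.

```python
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def crawl(seed, max_pages=50):
    """Breadth-first crawl from `seed`, fetching at most `max_pages` pages."""
    frontier, seen, fetched = deque([seed]), {seed}, 0
    while frontier and fetched < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=10).read()
        except OSError:
            continue  # skip pages that fail to download
        fetched += 1
        soup = BeautifulSoup(html, "html.parser")
        print(url, "->", soup.title.string if soup.title else "(no title)")
        for a in soup.select("a[href]"):
            link = urljoin(url, a["href"])  # resolve relative links
            if link.startswith("http") and link not in seen:
                seen.add(link)
                frontier.append(link)

crawl("http://example.com/")  # placeholder seed URL
```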