Language recommendations for an efficient web crawler
I'm looking for a language for writing an efficient web crawler. Things I value:
- expressive language (don't make me jump through static typing hoops)
- useful libraries (a css selector based html parser would be nice)
- minimal memory footprint
- dependable language runtime & libraries
I tried node.js. I like node in theory. Javascript is very expressive. You can use jQuery to parse html. Node's async nature lets me crawl many urls in parallel without dealing with threads. V8 is nice and fast for parsing.
In practice, Node isn't working out for me. My process crashes constantly: bus errors, exceptions in the event manager, and so on.
I've done a fair bit of Ruby dev, so I wouldn't mind using Ruby 1.9's coroutines (fibers?) as long as I won't face similar issues with VM / library stability.
Additional suggestions?
Use Node.js, and fix whatever is crashing it. It's been running on my Ubuntu box without any problems for months.
For the library, I recommend using YUI3 instead of jQuery; it lets you build a web crawler/scraper in a couple of minutes. If you don't believe me, watch this talk from YUIConf 2010. It's 40 minutes, but it's all about code.
Dav Glass does a great job of showing how easy it is and how little code you need. Yes, there were some issues with different versions of jsdom in the talk, but it was given at the beginning of November, so much of that should have been fixed by now.
You can check out all the stuff from the talk at his GitHub page.
And here's his scraper that gets the current news headlines from Digg.
Seriously, it's more than worth the effort to get Node.js running on your system, since in the end you get all the awesomeness of YUI3 on the server side.
I'm pretty sure any language has libraries built for this. Are you sure Node.js isn't crashing because of a problem in your own code? Why not use Ruby if you're comfortable with it?
There's also BeautifulSoup (Python), which you might consider if your main hurdle is HTML parsing.
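As a rough sketch (not part of the original answer), CSS-selector-based parsing with BeautifulSoup looks like this; it assumes the `beautifulsoup4` package is installed and uses a placeholder URL:

```python
from urllib.request import urlopen
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# Fetch a page (example.com is just a placeholder)
html = urlopen("http://example.com/").read()

# Parse it and pull out elements with CSS selectors
soup = BeautifulSoup(html, "html.parser")
for link in soup.select("a[href]"):  # every anchor with an href attribute
    print(link.get_text(strip=True), link["href"])
```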
Go with the language most familiar to you or the language you want to learn most. You can write a web crawler in any language.
I've personally developed crawlers in Java, Ruby, and Perl. All of these languages met your requirements. (Yes, even the crawler in Java had a reasonable memory footprint.) Of these, Java was my favorite because it boasted the most mature HTTP and HTML libraries. If I find myself writing another, I want to try Python next.
The first algorithmic problem you'll face is the task of efficiently identifying the pages you've already visited. This index of URLs can grow very large and must support fast lookups and insertions. A common database index will work in early crawler prototypes but will quickly prove to be the bottleneck.
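To illustrate (this is my own sketch, not the answerer's code), the naive in-memory approach is just a set of normalized URLs; at real scale you'd swap the set for a Bloom filter or a disk-backed store:

```python
from urllib.parse import urldefrag, urlsplit, urlunsplit

visited = set()  # naive in-memory index of URLs we've already seen

def normalize(url):
    """Drop fragments and lower-case scheme/host so trivially different
    spellings of the same URL don't get crawled twice."""
    url, _fragment = urldefrag(url)
    parts = urlsplit(url)
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path or "/", parts.query, ""))

def should_visit(url):
    key = normalize(url)
    if key in visited:
        return False
    visited.add(key)
    return True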
Python and BeautifulSoup: easy to learn and very efficient.
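To make that concrete, here is a minimal single-threaded crawl loop tying fetching, parsing, and the visited set together. It's only a sketch: the seed URL and page limit are placeholders, and it assumes `beautifulsoup4` is installed.

```python
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def crawl(seed, max_pages=50):
    """Breadth-first crawl from `seed`, fetching at most `max_pages` pages."""
    frontier, seen, fetched = deque([seed]), {seed}, 0
    while frontier and fetched < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=10).read()
        except OSError:
            continue  # skip pages that fail to download
        fetched += 1
        soup = BeautifulSoup(html, "html.parser")
        print(url, "->", soup.title.string if soup.title else "(no title)")
        for a in soup.select("a[href]"):
            link = urljoin(url, a["href"])  # resolve relative links
            if link.startswith("http") and link not in seen:
                seen.add(link)
                frontier.append(link)

crawl("http://example.com/")  # placeholder seed URL
```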