Can a crawler be written entirely in JavaScript?
I was wondering: can a crawler be written entirely in JavaScript? That way, the crawler would be called only when a user needs the information, and everything would run on the individual user's computer.
If the crawler is written server-side, doesn't that also run the risk of the server's IP being blocked?
First off, before talking details, you must understand that crawling is extremely slow. Building any kind of meaningful web index takes minutes for a single site, and days at the very least for multiple sources (often weeks, months or years). Serving a search by crawling live is not viable at all.
As for details, nothing prevents you from writing a crawler in JavaScript. It cannot run as browser-embedded JavaScript, though, at least not without a server-side proxy, because of the same-origin policy.
It's possible to write a crawler in JavaScript, using, for example, Node.js. However, you probably won't be able to write one that runs in a user's browser. This is because:
- The browser security model (the same-origin policy) stops your JavaScript from reading pages on other domains unless those sites explicitly allow it, so in practice you can only index your own site.
- Each user will need to re-crawl your entire site each time, meaning lots of time (minutes, hours, or even days depending on the size of the site) spent crawling before the user's query can be answered, as well as lots of bandwidth usage because this is multiplied over your entire userbase. Not to mention the user's browser might not allow your JS enough storage for its index.
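For the Node.js route, here is a minimal sketch of what such a crawler could look like. It is not a production design: the names `extractLinks`, `crawl` and `defaultFetchPage` are made up for this example, links are found with a simple regex rather than a real HTML parser, and the default page loader assumes a Node.js version that ships a global `fetch` (18 or later).

```javascript
// Extract absolute http(s) links from an HTML string.
// A regex is enough for a sketch; a real crawler should use an HTML parser.
function extractLinks(html, baseUrl) {
  const links = new Set();
  const hrefPattern = /<a\s[^>]*?href\s*=\s*["']([^"']+)["']/gi;
  for (const match of html.matchAll(hrefPattern)) {
    try {
      const url = new URL(match[1], baseUrl); // resolves relative hrefs
      url.hash = ""; // treat page#a and page#b as the same page
      if (url.protocol === "http:" || url.protocol === "https:") {
        links.add(url.href);
      }
    } catch {
      // Ignore hrefs that cannot be parsed as URLs.
    }
  }
  return [...links];
}

// Default page loader (assumes a global fetch, i.e. Node.js 18+).
async function defaultFetchPage(url) {
  const response = await fetch(url);
  if (!response.ok) throw new Error(`HTTP ${response.status} for ${url}`);
  return response.text();
}

// Breadth-first crawl that stays on the start URL's origin.
// fetchPage is injectable so the crawler can be tried without network access.
async function crawl(startUrl, { maxPages = 50, fetchPage = defaultFetchPage } = {}) {
  const start = new URL(startUrl);
  const queue = [start.href];
  const seen = new Set(queue);
  const pages = new Map(); // url -> html

  while (queue.length > 0 && pages.size < maxPages) {
    const url = queue.shift();
    let html;
    try {
      html = await fetchPage(url);
    } catch {
      continue; // skip pages that fail to load
    }
    pages.set(url, html);
    for (const link of extractLinks(html, url)) {
      if (!seen.has(link) && new URL(link).origin === start.origin) {
        seen.add(link);
        queue.push(link);
      }
    }
  }
  return pages;
}
```

Calling `crawl("https://example.com/", { maxPages: 20 })` would resolve to a `Map` from URL to HTML for up to 20 pages on that origin. Even this toy version fetches pages one at a time, which gives a feel for why crawling on every query is too slow.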
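There are ways to deal with the cross-domain problem. Search for "Access-Control-Allow-Origin" and you'll see how. Keep in mind that this header has to be sent by the site being crawled, so it only helps with sites that you control or that have opted in.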
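To make the header concrete, here is a small sketch of the server side of it: a helper that decides which CORS headers to attach to a response. The function name `corsHeaders` and the allow-list are assumptions made for illustration, and this covers only simple GET requests, not preflighted ones.

```javascript
// Hypothetical helper: choose the CORS response headers for a request.
// requestOrigin is the value of the browser's Origin request header;
// allowedOrigins is a list such as ["https://crawler.example"] or ["*"].
function corsHeaders(requestOrigin, allowedOrigins) {
  if (allowedOrigins.includes("*")) {
    // Any page on any site may read the response.
    return { "Access-Control-Allow-Origin": "*" };
  }
  if (requestOrigin && allowedOrigins.includes(requestOrigin)) {
    // Echo the one allowed origin; Vary keeps caches from mixing responses.
    return { "Access-Control-Allow-Origin": requestOrigin, "Vary": "Origin" };
  }
  // No header: the browser will refuse to expose the response to the script.
  return {};
}
```

In a Node.js `http` handler the result could be merged into the response headers, for example `res.writeHead(200, { ...corsHeaders(req.headers.origin, allowed), "Content-Type": "text/html" })`. A browser-based crawler can read pages only from servers that answer like this.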
The easiest way to implement such a crawler is to write an add-on (Firefox) or an extension (Chrome) and inject your JavaScript code into each visited page. That way, you'll see exactly the same thing as the document author sees. You can simply read document.body.innerText, then post the content to your server for indexing.
I have such a crawler working myself, with several browsers crawling from different IP addresses.
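As a rough sketch of the injected script described above (not the actual code of the crawler mentioned), something like the following could run as a content script. The endpoint URL and the names `INDEX_ENDPOINT`, `buildPayload` and `reportPage` are placeholders invented for this example. Depending on the browser and extension manifest version, the POST may have to be routed through the extension's background script instead of being sent from the page directly.

```javascript
// Hypothetical indexing endpoint; replace with your own server.
const INDEX_ENDPOINT = "https://indexer.example/pages";

// Collect what the rendered page shows: its URL, title and visible text.
function buildPayload(doc, loc) {
  return { url: loc.href, title: doc.title, text: doc.body.innerText };
}

// POST the page content to the indexing server.
// `send` defaults to fetch and is injectable so it can be replaced in tests.
async function reportPage(doc, loc, send = fetch) {
  const response = await send(INDEX_ENDPOINT, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildPayload(doc, loc)),
  });
  return response.ok;
}

// Only run automatically when there is a real page to read.
if (typeof document !== "undefined") {
  reportPage(document, location).catch(console.error);
}
```

Because the script reads the page after the browser has rendered it, content generated by the page's own JavaScript is included in `innerText`.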