Running web-fetches from within a Hadoop cluster
A blog post, http://petewarden.typepad.com/searchbrowser/2011/05/using-hadoop-with-external-api-calls.html, suggests calling external systems (querying the Twitter API, or crawling web pages) from within a Hadoop cluster.
For the system I'm currently developing, there are both fast and slow (bulk) sub-systems. Data is fetched from Twitter's API, also for quick, individual retrievals. This can be hundreds of thousands (even millions) of external requests per day. The content of web pages is also retrieved for further processing, with at least the same scale of requests.
Aside from potential side effects on the external source (changing data so it's different on the next request), what would be the pluses or minuses of using Hadoop in such a way? Is it a valid and useful method for bulk and/or fast retrieval of data?
The plus: it's a super easy way to distribute the work that needs to be done.
The minus: because of the way Hadoop recovers from failures, you need to be very careful about managing what is and isn't run (which you can definitely do, it's just something to watch out for). If a reduce fails, for example, and the map output it needs is no longer available, then all of the map tasks that feed that partition must also be rerun. Obviously this would most likely be a no-reducer job, but the same concern applies to mappers: what happens if half of the calls run, then the task fails and is rescheduled? The calls that already succeeded would be made a second time.
You could use some sort of high-throughput system to keep track of which calls have actually been made, so that a rescheduled task can skip them. With that caveat handled, Hadoop can definitely be used appropriately for this.
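To make the "keep track of which calls were made" idea concrete, here is a minimal sketch of a mapper body of the kind you might run under Hadoop Streaming. The names `run_mapper`, `fetch`, and `done` are made up for illustration. The `done` set is an assumption standing in for a store shared across the cluster (a database or key-value store); a plain in-memory set only deduplicates within one process, and a real store would also need an atomic "claim" operation to guard against two attempts of the same task racing each other.

```python
import urllib.request


def fetch(url, timeout=10):
    """Fetch a URL and return its HTTP status code (standard library only)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status


def run_mapper(lines, done, fetch_fn=fetch):
    """Yield 'url<TAB>result' records, skipping URLs already recorded in `done`.

    `done` is a hypothetical stand-in for a shared external store; here it is
    any object supporting `in` and `.add()`, such as a set.
    """
    for line in lines:
        url = line.strip()
        if not url or url in done:
            # Blank line, or a call that an earlier attempt already completed.
            continue
        try:
            result = str(fetch_fn(url))
        except Exception as exc:
            # Report the failure but do not mark the URL as done,
            # so a rerun of the task will try it again.
            result = "ERROR:" + type(exc).__name__
        else:
            done.add(url)
        yield url + "\t" + result
```

Under Hadoop Streaming you would pass `sys.stdin` as `lines` and print each yielded record. The point of the design is that rerunning the same input split becomes cheap and safe: only the calls that did not succeed the first time are repeated.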