Hadoop: Processing large serialized objects
I am working on an application that processes (and merges) several large Java serialized objects (on the order of GBs) using the Hadoop framework. Hadoop distributes the blocks of a file across different hosts, but deserialization requires all of the blocks to be present on a single host, which will hurt performance drastically. How can I deal with this situation, where the different blocks cannot be processed individually, unlike text files?
There are two issues: one is that each file must (in the initial stage) be processed as a whole: the mapper that sees the first byte must handle all the rest of that file. The other is locality: for best efficiency, you'd like all the blocks for each such file to reside on the same host.
Processing files in whole:
One simple trick is to have the first-stage mapper process a list of filenames rather than their contents. If you want 50 map tasks to run, make 50 files, each containing that fraction of the filenames. This is easy and works with Java or Hadoop streaming.
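For example, a first-stage mapper along these lines (a minimal sketch using the newer org.apache.hadoop.mapreduce API; processWholeFile is a hypothetical placeholder for your deserialization/merge logic) reads one HDFS path per input record and handles that file from first byte to last:

```java
import java.io.IOException;
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class FileListMapper
        extends Mapper<LongWritable, Text, Text, NullWritable> {

    @Override
    protected void map(LongWritable offset, Text pathLine, Context context)
            throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        Path path = new Path(pathLine.toString().trim());
        FileSystem fs = path.getFileSystem(conf);

        // Open the whole file and hand it to the deserialization logic,
        // so one map task owns this file from start to finish.
        try (InputStream in = fs.open(path)) {
            processWholeFile(path, in, context);   // hypothetical helper
        }
    }

    private void processWholeFile(Path path, InputStream in, Context context)
            throws IOException, InterruptedException {
        // Placeholder: deserialize and merge here, then emit whatever
        // the next stage needs.
        context.write(new Text(path.getName()), NullWritable.get());
    }
}
```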
Alternatively, use a non-splittable input format such as NonSplitableTextInputFormat.
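In the newer API the same effect comes from overriding isSplitable; a minimal sketch, assuming you're otherwise happy with TextInputFormat's record reader:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class WholeFileTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;   // every file becomes exactly one split, hence one mapper
    }
}
```

You'd register it in the job driver with something like job.setInputFormatClass(WholeFileTextInputFormat.class).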
For more details, see "How do I process files, one per map?" and "How do I get each of my maps to work on one complete input-file?" on the hadoop wiki.
Locality:
This leaves a problem, however: the blocks you are reading from are distributed all across HDFS, normally a performance gain, here a real problem. I don't believe there's any way to chain certain blocks so they travel together in HDFS.
Is it possible to place the files in each node's local storage? This is actually the most performant and easiest way to solve this: have each machine start jobs to process all the files in, e.g., /data/1/**/*.data (being as clever as you care to be about efficiently using local partitions and the number of CPU cores).
If the files originate from a SAN or from, say, S3 anyway, try pulling from there directly: it's built to handle the swarm.
A note on using the first trick: if some of the files are much larger than others, put them alone in the earliest-named listing to avoid issues with speculative execution. You might turn off speculative execution for such jobs anyway, if the tasks are dependable and you don't want some batches processed multiple times.
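If you do want to turn it off, a small job-setup sketch (property names as in Hadoop 2.x; older releases use mapred.map.tasks.speculative.execution and its reduce counterpart instead):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class JobSetup {
    public static Job createJob() throws Exception {
        Configuration conf = new Configuration();
        // Disable speculative execution so no batch is processed twice.
        conf.setBoolean("mapreduce.map.speculative", false);
        conf.setBoolean("mapreduce.reduce.speculative", false);
        return Job.getInstance(conf, "merge-serialized-objects");
    }
}
```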
It sounds like your input file is one big serialized object. Is that the case? Could you make each item its own serialized value with a simple key?
For example, if you wanted to use Hadoop to parallelize the resizing of images, you could serialize each image individually and give it a simple index key. Your input file would then hold key-value pairs: the index key, with the serialized blob as the value.
I use this method when doing simulations in Hadoop. My serialized blob is all the data needed for the simulation and the key is simply an integer representing a simulation number. This allows me to use Hadoop (in particular Amazon Elastic Map Reduce) like a grid engine.
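As a rough illustration of that layout (a sketch, not my exact code): each item is Java-serialized into a byte array and appended to a SequenceFile keyed by an integer index, so Hadoop hands out whole records rather than raw file blocks:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;

public class BlobWriter {

    /** Writes each item as (index, serialized bytes) into a SequenceFile. */
    public static void writeBlobs(Configuration conf, Path out,
                                  List<? extends Serializable> items)
            throws IOException {
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(out),
                SequenceFile.Writer.keyClass(IntWritable.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            int index = 0;
            for (Serializable item : items) {
                writer.append(new IntWritable(index++),
                              new BytesWritable(serialize(item)));
            }
        }
    }

    private static byte[] serialize(Serializable item) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bytes)) {
            oos.writeObject(item);
        }
        return bytes.toByteArray();
    }
}
```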
I think the basic (unhelpful) answer is that you can't really do this, since this runs directly counter to the MapReduce paradigm. Units of input and output for mappers and reducers are records, which are relatively small. Hadoop operates in terms of these, not file blocks on disk.
Are you sure your process needs everything on one host? Anything that I'd describe as a merge can be implemented pretty cleanly as a MapReduce where there is no such requirement.
If you mean that you want to ensure certain keys (and their values) end up on the same reducer, you can use a Partitioner to define how keys are mapped onto reducer instances. Depending on your situation, this may be what you are really after.
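A minimal sketch of such a Partitioner, assuming a hypothetical text key of the form "objectId:part" where everything sharing an object id must meet on one reducer:

```java
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class ObjectIdPartitioner extends Partitioner<Text, BytesWritable> {
    @Override
    public int getPartition(Text key, BytesWritable value, int numPartitions) {
        // Route by the object-id portion of the key so all parts of one
        // object land on the same reducer.
        String objectId = key.toString().split(":", 2)[0];
        return (objectId.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```

You'd register it with job.setPartitionerClass(ObjectIdPartitioner.class).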
I'll also say it kind of sounds like you are trying to operate on HDFS files rather than write a Hadoop MapReduce. So maybe your question is really about how to hold open several SequenceFiles on HDFS, read their records, and merge them manually. That isn't a Hadoop question then, but it still doesn't require the blocks to be on one host.
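If that is what you're after, a plain-Java sketch (the paths and the merge step are placeholders) looks roughly like this:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;

public class ManualMerge {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        for (String arg : args) {                       // HDFS paths to merge
            Path path = new Path(arg);
            try (SequenceFile.Reader reader = new SequenceFile.Reader(conf,
                    SequenceFile.Reader.file(path))) {
                IntWritable key = new IntWritable();
                BytesWritable value = new BytesWritable();
                while (reader.next(key, value)) {
                    merge(key, value);                  // your merge logic here
                }
            }
        }
    }

    private static void merge(IntWritable key, BytesWritable value) {
        // Placeholder: combine records however the application requires.
    }
}
```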