Hadoop: Mapping binary files
Typically in a the input file is capable of being 开发者_Python百科partially read and processed by Mapper function (as in text files). Is there anything that can be done to handle binaries (say images, serialized objects) which would require all the blocks to be on same host, before the processing can start.
Stick your images into a SequenceFile; then you will be able to process them iteratively, using map-reduce.
To be a bit less cryptic: Hadoop does not natively know anything about text and not-text. It just has a class that knows how to open an input stream (hdfs handles sticthing together blocks on different nodes, to make them appear as one large files). On top of that, you have an Reader and an InputFormat that knows how to determine where in that stream records start, where they end, and how to find the beginning of the next record if you are dropped somewhere in the middle of the file. TextInputFormat is just one implementation, which treats newlines as record delimiter. There is also a special format called a SequenceFile that you can write arbitrary binary records into, and then get them back out. Use that.
精彩评论