XML Processing in Hadoop
I have 200+ XML files in HDFS. I use the XmlInputFormat (from Mahout) to stream the XML elements to the mapper. The mapper receives the XML content and processes it correctly, but only the first XML file is ever processed. When we process a large number of small text files, Hadoop passes the next file on to the mapper after the first one is finished. Let me know if this is not the default behaviour with XML files as well, and what should be done to iterate over the entire set of XML files. Thanks.
I had good luck using the standard StreamXmlRecordReader class and then looping over standard input (with Python and the Hadoop Streaming API).
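A minimal sketch of that approach, assuming the job is launched with Hadoop Streaming's built-in XML record reader (e.g. -inputreader "StreamXmlRecord,begin=<page>,end=</page>"); the <page> tag names, the mapper file name, and the emitted key/value are placeholders you would replace with your own:

#!/usr/bin/env python
# mapper.py -- a Hadoop Streaming mapper that loops over standard input
# and reassembles one XML record at a time between a begin and end tag.
import sys

def main():
    record_lines = []
    inside = False
    for line in sys.stdin:
        if "<page>" in line:        # hypothetical start tag of a record
            inside = True
        if inside:
            record_lines.append(line)
        if "</page>" in line:       # hypothetical end tag of a record
            record = "".join(record_lines)
            # ... parse `record` here, e.g. with xml.etree.ElementTree ...
            # Emit a tab-separated key/value pair per record (placeholder).
            print("page\t1")
            record_lines, inside = [], False

if __name__ == "__main__":
    main()

Since the input format splits by begin/end tags rather than by file, every XML file under the input path gets fed to the mappers, not just the first one; the begin and end strings passed to the record reader must match whatever element delimits your records.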
How big are the files, and are you running this on a single system or a multi-node cluster? What is the HDFS block size set to?