How to specify tab as the record separator for a Hadoop input text file?
The input file to my Hadoop M/R job is a text file in which the records are separated by the tab character '\t' instead of the newline '\n'. How can I instruct Hadoop to split on the tab character? By default it splits on newlines, so each line in the text file is taken as a record.
One way to do it is to use a custom input format class that uses a filter stream to convert all tabs in the original stream to newlines. But this does not look elegant.
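For reference, the filter-stream idea could look roughly like the sketch below. It uses only the JDK; the class name TabToNewlineInputStream and the convert helper are made up for illustration, and wiring the stream into a custom input format is not shown. It assumes the data is ASCII or UTF-8, where the tab byte never occurs inside a multi-byte character.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical filter stream that presents every tab byte as a newline byte. */
class TabToNewlineInputStream extends FilterInputStream {

    TabToNewlineInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        return b == '\t' ? '\n' : b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        // When n is -1 (end of stream) the loop body never runs.
        for (int i = off; i < off + n; i++) {
            if (buf[i] == '\t') {
                buf[i] = '\n';
            }
        }
        return n;
    }

    /** Demo helper (not part of any Hadoop API): runs a string through the filter. */
    static String convert(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        try (InputStream in = new TabToNewlineInputStream(new ByteArrayInputStream(bytes))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The replacement is one byte for one byte, so byte offsets in the stream are unchanged. The drawback is that any real newline inside a record would now also be treated as a record boundary.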
Another way would be to use java.util.Scanner with tab as the separator. But I cannot figure out how to use the java.util.Scanner class in the input format classes.
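Outside of Hadoop, the Scanner part on its own is short. The sketch below only shows tab-delimited reading from a plain InputStream; TabRecordScanner and readRecords are hypothetical names. Scanner buffers internally and does not report how many bytes it has consumed, which is probably why it is awkward to fit into an input format that has to track its position within a split.

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

/** Hypothetical helper: reads tab-separated records with java.util.Scanner. */
class TabRecordScanner {

    static List<String> readRecords(InputStream in) {
        List<String> records = new ArrayList<>();
        try (Scanner scanner = new Scanner(in, "UTF-8")) {
            // The delimiter is a regular expression; "\t" matches a single tab.
            scanner.useDelimiter("\t");
            while (scanner.hasNext()) {
                records.add(scanner.next());
            }
        }
        return records;
    }
}
```

Newlines are not delimiters here, so a record may span several lines.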
What is the best approach, and what are the alternatives?
The values '\r' and '\n' are hard-coded in the org.apache.hadoop.util.LineReader class, so you can't use TextInputFormat with tab-separated records. But it is not difficult to implement your own InputFormat with a special LineReader class. The simplest solution is to copy the TextInputFormat, LineRecordReader and LineReader classes into your own package and change the LineReader implementation.
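The change inside the copied reader comes down to comparing each byte against a different delimiter. The sketch below is not Hadoop's LineReader; it is a JDK-only illustration of that core loop with hypothetical names (DelimitedRecordReader, readRecord, readAll), assuming a single-byte delimiter and UTF-8 data. It counts consumed bytes because LineRecordReader relies on that count to track its position in the split. Later Hadoop releases also let TextInputFormat take a custom delimiter from the textinputformat.record.delimiter configuration property, so it is worth checking whether your version supports that before copying classes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a modified LineReader; the delimiter is configurable. */
class DelimitedRecordReader {
    private final InputStream in;
    private final int delimiter;
    private long bytesConsumed = 0;

    DelimitedRecordReader(InputStream in, char delimiter) {
        this.in = in;               // wrap in a BufferedInputStream for real files
        this.delimiter = delimiter; // assumed to be a single-byte (ASCII) character
    }

    /** Returns the next record without its delimiter, or null at end of stream. */
    String readRecord() throws IOException {
        ByteArrayOutputStream record = new ByteArrayOutputStream();
        boolean sawAnyByte = false;
        int b;
        while ((b = in.read()) != -1) {
            sawAnyByte = true;
            bytesConsumed++;        // the delimiter byte counts as consumed too
            if (b == delimiter) {
                break;
            }
            record.write(b);
        }
        if (!sawAnyByte) {
            return null;
        }
        return new String(record.toByteArray(), StandardCharsets.UTF_8);
    }

    long getBytesConsumed() {
        return bytesConsumed;
    }

    /** Demo helper: splits an in-memory string into tab-separated records. */
    static List<String> readAll(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        DelimitedRecordReader reader =
                new DelimitedRecordReader(new ByteArrayInputStream(bytes), '\t');
        List<String> records = new ArrayList<>();
        try {
            String record;
            while ((record = reader.readRecord()) != null) {
                records.add(record);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }
}
```

In a real record reader the loop would work on a byte buffer rather than single-byte reads, and the first record of every split except the first would be skipped, as LineRecordReader does for lines.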