How can I use multiple files as input?
I want to use multiple files (actually two) as input.
They contain data in the same format. In the end, I want to get the difference between the two input files.
For example, input file A contains:
A 1
B 2
C 3
and input file B contains:
A 1
C 3
D 4
In the end, I want to generate an output file like
B 2
(yes, this is the result of A - B).
How can I do this in Hadoop?
Sure, especially if you don't care about the order of the lines.
First, have your mapper emit (line, filename) pairs:
File A:
(0, "A 1")→("A 1", A)
(4, "B 2")→("B 2", A)
(8, "C 3")→("C 3", A)
File B:
(0, "A 1")→("A 1", B)
(4, "C 3")→("C 3", B)
(8, "D 4")→("D 4", B)
(This assumes you're using TextInputFormat as the InputFormat, so the incoming key is the byte offset of the line in the file. You can get the file's path with ((FileSplit) context.getInputSplit()).getPath() in the map function; calling getName() on that path gives just the filename.)
In the reduce phase, Hadoop collects the values (filenames) associated with each key (line) and passes them to your reducer. Your reducer should emit only the lines whose sole filename is A, and emit nothing for the others:
("A 1",{A,B})→nothing
("B 2",{A})→"B 2"
("C 3",{A,B})→nothing
("D 4",{B})→nothing
The result will be just the lines that appear only in file A.
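To make the map and reduce logic above concrete, here is a minimal sketch in plain Java that simulates both phases in memory, without any Hadoop classes. The class name LineDiffSketch and its methods are hypothetical, and the grouping map only stands in for Hadoop's shuffle. In a real job the same logic would presumably live in a Mapper and a Reducer subclass. Filenames are collected into a set, so a line repeated within file A is still emitted once.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical in-memory illustration; not a Hadoop job.
public class LineDiffSketch {

    // Map step: emit (line, filename) for every line of one file.
    // Grouping the pairs by line here stands in for Hadoop's shuffle phase.
    static void map(String fileName, List<String> lines,
                    Map<String, Set<String>> grouped) {
        for (String line : lines) {
            grouped.computeIfAbsent(line, k -> new TreeSet<>()).add(fileName);
        }
    }

    // Reduce step: keep a line only if the sole filename seen for it is keepFile.
    static List<String> reduce(Map<String, Set<String>> grouped, String keepFile) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : grouped.entrySet()) {
            Set<String> files = entry.getValue();
            if (files.size() == 1 && files.contains(keepFile)) {
                out.add(entry.getKey());
            }
        }
        return out;
    }

    // Runs both steps for two files labeled "A" and "B" and returns A - B.
    static List<String> diff(List<String> fileA, List<String> fileB) {
        Map<String, Set<String>> grouped = new TreeMap<>();
        map("A", fileA, grouped);
        map("B", fileB, grouped);
        return reduce(grouped, "A");
    }

    public static void main(String[] args) {
        List<String> a = Arrays.asList("A 1", "B 2", "C 3");
        List<String> b = Arrays.asList("A 1", "C 3", "D 4");
        System.out.println(diff(a, b)); // prints [B 2]
    }
}
```

Swapping the keepFile argument to "B" would give B - A instead ("D 4" for the sample data).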