How can I use multiple files as input?

I want to use multiple files (actually two) as input.

They contain data in the same format. In the end, I want to get the difference between the two input files.

For example, input file A contains:

A 1
B 2
C 3

and input file B contains:

A 1
C 3
D 4

In the end, I want to generate an output file like:

B 2

(Yes, this is the result of A - B.)

How can I do this in Hadoop?


Sure, especially if you don't care about the order of the lines.

First, have your mapper emit (line, filename) pairs:

File A:
(0, "A 1")→("A 1", A)
(4, "B 2")→("B 2", A)
(8, "C 3")→("C 3", A)
File B:
(0, "A 1")→("A 1", B)
(4, "C 3")→("C 3", B)
(8, "D 4")→("D 4", B)

(This assumes you're using TextInputFormat as the InputFormat, so the incoming key is the byte offset of the line in the file. You can get the file's path with ((FileSplit) context.getInputSplit()).getPath() in the map function; calling getName() on that path gives just the filename.)
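As a rough illustration of the map step, here is a plain-Java sketch that does not use any Hadoop classes. The class name TagLinesWithSource and its map method are hypothetical stand-ins; in a real job the same pairing would happen inside your Mapper, with the line as the output key and the filename as the output value.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Plain-Java stand-in for the map step (hypothetical, no Hadoop classes).
class TagLinesWithSource {
    // Emits one (line, fileName) pair per input line. The byte offset that
    // TextInputFormat supplies as the incoming key is ignored, since only
    // the line text and its source file matter for the difference.
    static List<Map.Entry<String, String>> map(String fileName, List<String> lines) {
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.add(new SimpleEntry<>(line, fileName));
        }
        return pairs;
    }
}
```

Calling TagLinesWithSource.map("A", ...) with the three lines of file A yields the pairs ("A 1", A), ("B 2", A), ("C 3", A), matching the listing above.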

In the reduce phase, Hadoop will collect the values (filenames) associated with each key (line) and pass them to your reducer. In your reducer, emit only the lines whose sole associated filename is A, and emit nothing for the others:

("A 1",{A,B})→nothing
("B 2",{A})→"B 2"
("C 3",{A,B})→nothing
("D 4",{B})→nothing

The result will be just the lines that appear only in file A.
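The reduce-side rule can be sketched in plain Java as well. This is only a simulation under stated assumptions: the class LinesOnlyInA and its methods are hypothetical, the grouping that Hadoop's shuffle performs is imitated with a sorted map, and the filenames are assumed to be literally "A" and "B". In a real job, the body of keep would sit inside your Reducer's reduce method.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Plain-Java simulation of the shuffle and reduce steps (hypothetical names).
class LinesOnlyInA {
    // Reducer rule: keep a line only if every filename seen for it is "A".
    static boolean keep(Iterable<String> fileNames) {
        boolean sawA = false;
        for (String name : fileNames) {
            if (!name.equals("A")) {
                return false; // the line also occurs in another file
            }
            sawA = true;
        }
        return sawA;
    }

    // Groups filenames by line (as the shuffle would), then applies keep().
    static List<String> difference(List<String> fileA, List<String> fileB) {
        Map<String, Set<String>> grouped = new TreeMap<>();
        for (String line : fileA) {
            grouped.computeIfAbsent(line, k -> new HashSet<>()).add("A");
        }
        for (String line : fileB) {
            grouped.computeIfAbsent(line, k -> new HashSet<>()).add("B");
        }
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : grouped.entrySet()) {
            if (keep(entry.getValue())) {
                result.add(entry.getKey());
            }
        }
        return result;
    }
}
```

With the two example files from the question, difference returns a list containing only "B 2". Because keep iterates over all values rather than checking a count, it also handles a line that occurs more than once in file A.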
