
Combining two sets of input on Hadoop

I have a rather simple Hadoop question, which I'll try to present with an example.

Say you have a list of strings and a large file, and you want each mapper to process a piece of the file together with one of the strings, in a grep-like program.

How are you supposed to do that? I am under the impression that the number of mappers is determined by the InputSplits produced. I could run subsequent jobs, one for each string, but it seems kinda... messy?

Edit: I am not actually trying to build a grep MapReduce version. I used it as an example of having two different inputs to a mapper. Let's just say that I have lists A and B and would like a mapper to work on one element from list A and one element from list B.

So given that the problem has no data dependency that would require chaining jobs, is my only option to somehow share all of list A with all mappers and then feed one element of list B to each mapper?

What I am trying to do is build some type of prefix look-up structure for my data. So I have a giant text and a set of strings. This process has a strong memory bottleneck, therefore I was after one chunk of text/one string per mapper.


Mappers should be able to work independently and without side effects. The parallelism can be that a mapper tries to match a line against all patterns. Each input is only processed once!

Otherwise you could multiply each input line by the number of patterns: process each line with a single pattern, and run the reducer afterwards. A ChainMapper is the solution of choice here. But remember: a line will appear twice if it matches two patterns. Is that what you want?
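A minimal sketch of that second scenario (not code from the answer): the mapper emits one copy of each line per pattern, and the reducer does the matching per pattern. The configuration key grep.patterns and the comma-separated encoding of the pattern list are assumptions made for the example.

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class MultiplyGrep {

        // Emits one copy of every input line per pattern, keyed by the pattern.
        public static class MultiplyMapper
                extends Mapper<LongWritable, Text, Text, Text> {
            private String[] patterns;

            @Override
            protected void setup(Context context) {
                // "grep.patterns" is a hypothetical comma-separated list set by the driver
                patterns = context.getConfiguration().get("grep.patterns", "").split(",");
            }

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String pattern : patterns) {
                    context.write(new Text(pattern), value);
                }
            }
        }

        // Receives all lines for one pattern and keeps the matching ones.
        // A line that matches two patterns is emitted once per pattern.
        public static class MatchReducer extends Reducer<Text, Text, Text, Text> {
            @Override
            protected void reduce(Text pattern, Iterable<Text> lines, Context context)
                    throws IOException, InterruptedException {
                for (Text line : lines) {
                    if (line.toString().contains(pattern.toString())) {
                        context.write(pattern, line);
                    }
                }
            }
        }
    }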

In my opinion you should prefer the first scenario: each mapper processes a line independently and checks it against all known patterns.

Hint: You can distribute the patterns to all mappers with the DistributedCache feature! ;-) The input should be split with a line-oriented input format such as TextInputFormat.
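Here is a sketch of that preferred scenario using the classic DistributedCache API. The class name and the pattern-file path shown in the comment are invented for the example; the driver is assumed to have registered the file beforehand.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class CachedPatternMapper extends Mapper<LongWritable, Text, Text, Text> {
        private final List<String> patterns = new ArrayList<String>();

        // Load the pattern file once per mapper. The driver is assumed to have
        // registered it with something like:
        //   DistributedCache.addCacheFile(new URI("/user/hduser/patterns.txt"),
        //                                 job.getConfiguration());
        @Override
        protected void setup(Context context) throws IOException {
            Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
            if (cached != null && cached.length > 0) {
                BufferedReader reader = new BufferedReader(new FileReader(cached[0].toString()));
                try {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        patterns.add(line);
                    }
                } finally {
                    reader.close();
                }
            }
        }

        // Each line is processed exactly once, against every known pattern.
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            for (String pattern : patterns) {
                if (line.contains(pattern)) {
                    context.write(new Text(pattern), new Text(line));
                }
            }
        }
    }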


A good friend had a great epiphany: what about chaining two mappers?

In the main method, run a job that fires up a mapper (no reducer). The input is the list of strings, and we can arrange things so that each mapper gets exactly one string.

In turn, the first mapper starts a new job, where the input is the text. It can communicate the string to that job by setting a variable in the new job's configuration.
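A sketch of what that chaining could look like, assuming the string travels in a made-up configuration key grep.pattern; the class names and the input/output paths are placeholders:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class ChainedGrep {

        // First job's mapper: gets one string per record and launches a second job
        // that scans the big text for exactly that string.
        public static class LauncherMapper
                extends Mapper<LongWritable, Text, Text, Text> {
            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                try {
                    Configuration conf = new Configuration();
                    conf.set("grep.pattern", value.toString()); // hand the string over

                    Job job = new Job(conf, "grep-" + value.toString());
                    job.setJarByClass(ChainedGrep.class);
                    job.setMapperClass(SinglePatternMapper.class);
                    job.setNumReduceTasks(0);
                    job.setOutputKeyClass(Text.class);
                    job.setOutputValueClass(Text.class);
                    FileInputFormat.addInputPath(job, new Path("/data/bigtext"));      // placeholder
                    FileOutputFormat.setOutputPath(job, new Path("/out/" + key.get())); // placeholder
                    job.waitForCompletion(true);
                } catch (ClassNotFoundException e) {
                    throw new IOException(e);
                }
            }
        }

        // Second job's mapper: reads the single string back out of the configuration.
        public static class SinglePatternMapper
                extends Mapper<LongWritable, Text, Text, Text> {
            private String pattern;

            @Override
            protected void setup(Context context) {
                pattern = context.getConfiguration().get("grep.pattern");
            }

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                if (value.toString().contains(pattern)) {
                    context.write(new Text(pattern), value);
                }
            }
        }
    }

Note that this spawns one full job per string, which matches the "one chunk of text/one string per mapper" requirement but pays the job-startup overhead once per string.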


Regarding your edit: in general, a mapper is not used to process two elements at once. It should process only one element at a time. The job should be designed so that there could be a mapper for each input record and it would still run correctly!

Of course it is fine for the mapper to need some supporting information to process its input. This information can be passed along via the job configuration (Configuration.set(), for example). A larger set of data should be passed via the distributed cache.

Did you have a look at one of these options? I'm not sure if I fully understood your problem, so please check for yourself whether that would work ;-)

BTW: an appreciative vote for my well-investigated previous answer would be nice ;-)
