
Combining two sets of input on Hadoop

I have a rather simple Hadoop question, which I'll try to present with an example.

Say you have a list of strings and a large file, and you want each mapper to process a piece of the file together with one of the strings, in a grep-like program.

How are you supposed to do that? I am under the impression that the number of mappers is determined by the InputSplits produced. I could run subsequent jobs, one for each string, but it seems kinda... messy?

Edit: I am not actually trying to build a grep MapReduce version. I used it as an example of having two different inputs to a mapper. Let's just say that I have lists A and B and would like a mapper to work on one element from list A and one element from list B.

So given that the problem has no data dependency that would require chaining jobs, is my only option to somehow share all of list A with all mappers and then feed one element of list B to each mapper?

What I am trying to do is build some type of prefix look-up structure for my data. So I have a giant text and a set of strings. This process has a strong memory bottleneck, therefore I was after one chunk of text/one string per mapper.


Mappers should be able to work independently and without side effects. The parallelism can be that a mapper tries to match a line against all patterns. Each input is only processed once!

Otherwise you could multiply each input line by the number of patterns: process each line with a single pattern, and run the reducer afterwards. A ChainMapper is the solution of choice here. But remember: a line will appear twice if it matches two patterns. Is that what you want?
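A minimal sketch of that second scenario (not code from the answer): the mapper emits one copy of each line per pattern, and the reducer does the matching per pattern. The configuration key grep.patterns and the comma-separated encoding of the pattern list are assumptions made for the example.

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class MultiplyGrep {

        // Emits one copy of every input line per pattern, keyed by the pattern.
        public static class MultiplyMapper
                extends Mapper<LongWritable, Text, Text, Text> {
            private String[] patterns;

            @Override
            protected void setup(Context context) {
                // "grep.patterns" is a hypothetical comma-separated list set by the driver
                patterns = context.getConfiguration().get("grep.patterns", "").split(",");
            }

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String pattern : patterns) {
                    context.write(new Text(pattern), value);
                }
            }
        }

        // Receives all lines for one pattern and keeps the matching ones.
        // A line that matches two patterns is emitted once per pattern.
        public static class MatchReducer extends Reducer<Text, Text, Text, Text> {
            @Override
            protected void reduce(Text pattern, Iterable<Text> lines, Context context)
                    throws IOException, InterruptedException {
                for (Text line : lines) {
                    if (line.toString().contains(pattern.toString())) {
                        context.write(pattern, line);
                    }
                }
            }
        }
    }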

In my opinion you should prefer the first scenario: each mapper processes a line independently and checks it against all known patterns.

Hint: You can distribute the patterns to all mappers with the DistributedCache feature! ;-) The input should be split with a line-oriented input format such as TextInputFormat.
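Here is a sketch of that preferred scenario using the classic DistributedCache API. The class name and the pattern-file path shown in the comment are invented for the example; the driver is assumed to have registered the file beforehand.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class CachedPatternMapper extends Mapper<LongWritable, Text, Text, Text> {
        private final List<String> patterns = new ArrayList<String>();

        // Load the pattern file once per mapper. The driver is assumed to have
        // registered it with something like:
        //   DistributedCache.addCacheFile(new URI("/user/hduser/patterns.txt"),
        //                                 job.getConfiguration());
        @Override
        protected void setup(Context context) throws IOException {
            Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
            if (cached != null && cached.length > 0) {
                BufferedReader reader = new BufferedReader(new FileReader(cached[0].toString()));
                try {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        patterns.add(line);
                    }
                } finally {
                    reader.close();
                }
            }
        }

        // Each line is processed exactly once, against every known pattern.
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            for (String pattern : patterns) {
                if (line.contains(pattern)) {
                    context.write(new Text(pattern), new Text(line));
                }
            }
        }
    }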


A good friend had a great epiphany: what about chaining two mappers?

In the main method, run a job that fires up a mapper (no reducer). The input is the list of strings, and we can arrange things so that each mapper gets exactly one string.

In turn, the first mapper starts a new job, where the input is the text. It can communicate the string to that job by setting a variable in the new job's configuration.
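A sketch of what that chaining could look like, assuming the string travels in a made-up configuration key grep.pattern; the class names and the input/output paths are placeholders:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class ChainedGrep {

        // First job's mapper: gets one string per record and launches a second job
        // that scans the big text for exactly that string.
        public static class LauncherMapper
                extends Mapper<LongWritable, Text, Text, Text> {
            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                try {
                    Configuration conf = new Configuration();
                    conf.set("grep.pattern", value.toString()); // hand the string over

                    Job job = new Job(conf, "grep-" + value.toString());
                    job.setJarByClass(ChainedGrep.class);
                    job.setMapperClass(SinglePatternMapper.class);
                    job.setNumReduceTasks(0);
                    job.setOutputKeyClass(Text.class);
                    job.setOutputValueClass(Text.class);
                    FileInputFormat.addInputPath(job, new Path("/data/bigtext"));      // placeholder
                    FileOutputFormat.setOutputPath(job, new Path("/out/" + key.get())); // placeholder
                    job.waitForCompletion(true);
                } catch (ClassNotFoundException e) {
                    throw new IOException(e);
                }
            }
        }

        // Second job's mapper: reads the single string back out of the configuration.
        public static class SinglePatternMapper
                extends Mapper<LongWritable, Text, Text, Text> {
            private String pattern;

            @Override
            protected void setup(Context context) {
                pattern = context.getConfiguration().get("grep.pattern");
            }

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                if (value.toString().contains(pattern)) {
                    context.write(new Text(pattern), value);
                }
            }
        }
    }

Note that this spawns one full job per string, which matches the "one chunk of text/one string per mapper" requirement but pays the job-startup overhead once per string.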


Regarding your edit: in general, a mapper is not used to process two elements at once. It should process only one element at a time. The job should be designed so that there could be a mapper for each input record and it would still run correctly!

Of course it is fine for the mapper to need some supporting information to process its input. This information can be passed along via the job configuration (Configuration.set(), for example). A larger set of data should be passed via the distributed cache.

Did you have a look at one of these options? I'm not sure if I fully understood your problem, so please check for yourself whether that would work ;-)

BTW: an appreciative vote for my well-investigated previous answer would be nice ;-)
