Hadoop/Pig regular expression matching

This is kind of an odd situation, but I'm looking for a way to filter using something like MATCHES but on a list of unknown patterns (of unknown length).

That is, if the given input is two files, one with numbers A:

xxxx

yyyy

zzzz

zzyy

...etc...

And the other with patterns B:

xx.*

yyy.*

...etc...

How can I filter the first input, by all of the patterns in the second?

If I knew all the patterns beforehand, I could write A = FILTER A BY (num MATCHES 'somepattern.*' OR num MATCHES 'someotherpattern'....);

The problem is that I don't know them beforehand, and since they're patterns and not simple strings, I cannot just use joins/groups (at least as far as I can tell). Maybe a strange nested FOREACH...thing? Any ideas at all?


The | character acts as an OR in a regular expression, so you can combine the individual patterns into a single pattern:

(xx.*|yyy.*|zzzz.*)

A value matches this combined pattern if it matches any one of the individual patterns.
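
As a quick illustration outside Pig, here is a minimal Python sketch that applies the combined pattern to the sample values from the question. It assumes MATCHES requires the whole value to match the pattern (the way Java's String.matches does), which is why it uses re.fullmatch; the variable names are made up for the example.

```python
import re

# Sample values from the question (file A).
numbers = ["xxxx", "yyyy", "zzzz", "zzyy"]

# Combined pattern: the alternation succeeds if any one branch matches.
combined = re.compile(r"(xx.*|yyy.*|zzzz.*)")

# fullmatch requires the entire string to match
# (assumption: this mirrors how MATCHES behaves).
kept = [n for n in numbers if combined.fullmatch(n)]

print(kept)  # ['xxxx', 'yyyy', 'zzzz']
```

"zzyy" is dropped because none of the three branches matches it in full.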

Edit: To create the combined regex pattern:
* Create a string starting with (
* Read in each line (assuming each line is a pattern) and append it to a string followed by a |
* When done reading lines, remove the last character (which will be an unneeded |)
* Append a )

This will create a regex pattern to check all the patterns in the input file. (Note: It's assumed the file contains valid patterns)
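
The steps above can be sketched in Python roughly as follows. This is only one way to do it: the function name build_combined_pattern is hypothetical, and it uses join instead of appending a | after each line and trimming the last one, which gives the same result without the cleanup step. Blank lines are skipped, and the patterns themselves are still assumed to be valid.

```python
def build_combined_pattern(lines):
    """Join one-pattern-per-line input into a single alternation.

    `lines` is any iterable of strings, e.g. an open file handle.
    """
    # Strip newlines/whitespace and ignore empty lines.
    patterns = [line.strip() for line in lines if line.strip()]
    if not patterns:
        # An empty "()" would match only the empty string, so fail loudly.
        raise ValueError("no patterns supplied")
    # "|" between patterns, wrapped in one pair of parentheses.
    return "(" + "|".join(patterns) + ")"


# Example with the patterns from the question (file B).
sample_lines = ["xx.*\n", "yyy.*\n", "\n"]
combined = build_combined_pattern(sample_lines)
print(combined)  # (xx.*|yyy.*)
```

The resulting string then has to reach the Pig script somehow. One option, assuming your Pig version supports parameter substitution, is to pass it on the command line with -param and reference it inside the MATCHES expression.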
