
Match lines where a pattern occurs N times in the same line

I have a file and I need to filter lines that have (or don't have) N occurrences of a pattern. For example, if my pattern is the letter o and I want to match lines where the letter o occurs exactly 4 times, the expression should match the first of the following example lines but not the others:

foo foo  
foo  
foo foo foo   

I thought I could do it with a regex in vim, sed, awk, or some other tool. I've googled and haven't found anyone who has done something similar. I'll probably have to write a script or something similar to parse each line. Has anyone done a similar thing?

Thanks


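You can use a regex with lookaheads, like the one below. It needs an engine that supports lookaheads (Perl, PCRE and the like), and the leading ^ matters: without it the engine can start matching partway through a line, after some of the occurrences have already been skipped.

^(?=(.*o){4})(?!(.*o){5,}).*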

Regexr - http://regexr.com?2toro

This should work with any pattern you want. For instance, to find lines with exactly four occurrences of foo, use:

^(?=(.*foo){4})(?!(.*foo){5,}).*

Regexr - http://regexr.com?2tosa
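
As a quick sanity check of the lookahead idea, here is a minimal Python sketch. Python is my choice here, not something the question asked for, and the helper name has_exactly is hypothetical. It builds the two lookaheads for an arbitrary literal pattern and tries them on the three example lines from the question.

```python
import re

def has_exactly(line, pattern, n):
    """Return True if the literal `pattern` occurs exactly n times in `line`.

    Hypothetical helper: it rebuilds the two-lookahead regex for any
    literal pattern, escaping it so metacharacters are taken literally.
    """
    p = re.escape(pattern)
    # (?=(?:.*p){n})    -> at least n occurrences
    # (?!(?:.*p){n+1})  -> but not n+1 or more
    regex = re.compile(r"(?=(?:.*%s){%d})(?!(?:.*%s){%d})" % (p, n, p, n + 1))
    # re.match only tries position 0, which plays the role of the ^ anchor.
    return regex.match(line) is not None

for line in ["foo foo", "foo", "foo foo foo"]:
    print(repr(line), has_exactly(line, "o", 4))
```

Only the first line should come back True. On long lines the nested .* can backtrack a lot, so plain counting may be the faster option.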


A Perl one-liner:

perl -ne 'print if(tr/o/o/ == 4)' foo_file


perl -lnwe '@c=$_=~/o/g;if(scalar(@c)==4){print $_}' file_to_parse


In awk...

awk '{ if (gsub(/o/, "o") == 4) print }' # lines that matched
awk '{ if (gsub(/o/, "o") != 4) print }' # lines that didn't

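If you're going to be doing this over and over with different patterns/match counts, and the pattern is a literal string with no regex metacharacters (gsub replaces each match with the pattern text itself, so a real regex would alter the printed line), you could also do something like...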

awk -v pattern=o -v matches=4 '{ if (gsub(pattern, pattern) == matches) print }'
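
If awk isn't at hand, the same parametrized filter can be sketched in Python. This is only an illustration under the same assumption that the pattern is a literal string; filter_by_count and its invert flag are names I made up.

```python
def filter_by_count(lines, pattern, matches, invert=False):
    """Yield lines in which the literal `pattern` occurs exactly `matches` times.

    With invert=True, yield the lines that do NOT have that count instead.
    str.count counts non-overlapping occurrences, much like gsub does.
    """
    if not pattern:
        raise ValueError("pattern must be a non-empty string")
    for line in lines:
        hit = line.count(pattern) == matches
        if hit != invert:
            yield line

sample = ["foo foo", "foo", "foo foo foo"]
print(list(filter_by_count(sample, "o", 4)))               # lines that matched
print(list(filter_by_count(sample, "o", 4, invert=True)))  # lines that didn't
```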


If you want to write code, you can build a DFA-based string matcher, or have a look at the shift-or string matching algorithm, which is easy to implement. You then feed each line to the matcher, using whatever data structure the algorithm needs, and count the matches. See http://en.wikipedia.org/wiki/Shift_Or_Algorithm for the shift-or string matching algorithm.
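
To give an idea of how little code shift-or takes, here is a rough Python sketch that counts occurrences in one line. The function name shift_or_count is hypothetical, and note that this variant counts overlapping occurrences separately, which may or may not be what you want.

```python
def shift_or_count(text, pattern):
    """Count occurrences of `pattern` in `text` with the shift-or (bitap) scheme.

    Overlapping occurrences are counted separately.
    """
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must be non-empty")
    all_ones = (1 << m) - 1
    # masks[c] has bit i cleared exactly when pattern[i] == c.
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, all_ones) & ~(1 << i)
    # In `state`, bit i == 0 means pattern[:i+1] ends at the current position.
    state = all_ones
    accept = 1 << (m - 1)
    count = 0
    for ch in text:
        state = ((state << 1) | masks.get(ch, all_ones)) & all_ones
        if (state & accept) == 0:
            count += 1  # the whole pattern ends here
    return count

lines = ["foo foo", "foo", "foo foo foo"]
print([line for line in lines if shift_or_count(line, "o") == 4])
```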


It's possible, but not easy.

For the single letter case, an expression such as ^[^o]*o[^o]*o[^o]*o[^o]*o[^o]*$ would work. It basically looks for "not o" (zero or more) followed by "o" four times, and allows extra "not o" characters at the end.
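
The single-letter expression can be generated for any character and count. The sketch below is in Python purely for illustration, and exact_char_count_regex is a made-up helper name. It writes the repeated part with a counted group instead of spelling it out four times.

```python
import re

def exact_char_count_regex(ch, n):
    """Build ^[^c]*(?:c[^c]*){n}$ for a single character c (hypothetical helper)."""
    if len(ch) != 1:
        raise ValueError("expected a single character")
    c = re.escape(ch)  # keeps characters such as ] or ^ literal
    return re.compile(r"^[^%s]*(?:%s[^%s]*){%d}$" % (c, c, c, n))

four_os = exact_char_count_regex("o", 4)
for line in ["foo foo", "foo", "foo foo foo"]:
    print(repr(line), bool(four_os.match(line)))
```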

But longer patterns are a bit of a problem. For example, in order not to find the word "foo", you have to allow "f" and "fo" but not "foo". So to find a line containing "foo" exactly twice, you have to allow a line like "ffofofoofoffoffoofoffofofo", which is not so easy to define.

To match "anything but 'foo'" you could start from an expression like ([^f]|f[^o]|fo[^o])* which allows "f" and "fo" and other things. Even that is not airtight: it still accepts "ffoo", because f[^o] can consume "ff". So you can see how this becomes annoying if the word is longer and you have to match it four times.
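
If your engine has negative lookahead, the "anything but foo" part gets much simpler. This swaps in a different technique from the character-class one described above, so treat it as a sketch. It is in Python, and exact_word_count_regex is a hypothetical name.

```python
import re

def exact_word_count_regex(word, n):
    """Build a regex matching lines that contain `word` exactly n times.

    Uses (?:(?!word).)* as the "gap": any run of characters in which
    `word` never starts. Occurrences are counted without overlap.
    """
    if not word:
        raise ValueError("word must be non-empty")
    w = re.escape(word)
    gap = r"(?:(?!%s).)*" % w
    return re.compile(r"^%s(?:%s%s){%d}$" % (gap, w, gap, n))

tricky = "ffofofoofoffoffoofoffofofo"  # contains "foo" exactly twice
print(bool(exact_word_count_regex("foo", 2).match(tricky)))
```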
