开发者

Regex unordered matches

This feels like it should be an extremely simple thing to do with regex but I can't quite seem to figure it out.

I would like to write a regex which checks to see if a list of certain words appear in a document, in any order, along with any of a set of other words in any order.

In boolean logic the check would be: If allOfTheseWords are in this text and atLeastOneOfTheseWords are in this text, return true.

Example

I'm searching for (john and barbara) with (happy or sad). Order does not matter.

"Happy birthday john from barbara" => VA开发者_运维知识库LID
"Happy birthday john"              => INVALID

I simply cannot figure out how to get the and part to match in an orderless way, any help would be appreciated!


You don't really want to use a regex for this unless the text is very small, which from your description I doubt.

A simple solution would be to dump all the words into a HashSet, at which point checking to see if a word is present becomes a very quick and easy operation.


If you want to do it with regex, I'd try positive lookahead:

// searching for (john and barbara) with (happy or sad)
"^(?=.*\bjohn\b)(?=.*\bbarbara\b).*\b(happy|sad)\b"

The performance should be comparable to doing a full text search for each of the words in the allOfTheseWords group separately.


If you really need a single regex, then it would be very large and very slow due to backtracking. For your particular example of (John AND Barbara) AND (Happy or Sad), it would start like this:

\bJohn\b.*?\bBarbara\n.*?\bHappy\b|\bJohn\b.*?\bBarbara\n.*?\bSad\b|......

You'd ultimately need to put all combinations in the regex. Something like:

JBH, JBS, JHB, JSB, HJB, SJB, BJH, BJS, BHJ, BSJ, HBJ, SBJ

Again backtracking would be prohibitive, as would the explosion in the number of cases. Stay away from regexes here.


With your example, this is a regex that may help you :

Regex

(?:happy|sad).*?john.*?barbara|
(?:happy|sad).*?barbara.*?john|
barbara.*?john.*?(?:happy|sad)|
john.*?barbara.*?(?:happy|sad)|
barbara.*?(?:happy|sad).*?john|
john.*?(?:happy|sad).*?barbara

Output

happy birthday john from barbara => Matched
Happy birthday john              => Not matched

As mentionned in other responses, a regex may not be well suited here.


It might be possible to do it with regexp, but it would be so complicated that it's better to use some different way (for example using a HashSet, as mentioned in the other answers).

One way to do it with regex would be to calculate all the permutations of the words which you are looking for, and then write a regex which mentions all those permutations. With 2 words there would be 2 permutations, as in (.*foo.*bar.*)|(.*bar.*foo.*) (plus word boundaries), with 3 words there would be 6 permutations, and quite soon the number of permutations would be larger than your input file.


If your data is relatively constant, and you are planning on searching a lot, using Apache Lucene will ensure better peformance.

Using information retrieval techniques, you will first index all your documents/sentences, and then search for your words, in your example you would want to search for "+(+john +barbara) +(sad happy)" [or "(john AND barbarar) AND (sad OR HAPPY)" ]

this approach will consume some time when indexing, however, searching will be much faster then any regex/hashset approach (since you don't need to iterate over all documents...)

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜