Regex to find all substrings and longest substring
I would usually do something like this using a string libray. But I wonder if it can be done using a regex.
I want to do the following: Given a search string:
Seattle is awesome
I want to find all substrings of it in a given sentence. So applying a regex to the following sentence
Seattle Seattle is awesome is awesome awesome is Seattle
Should give me
Seattle, Seattle is awesome, is awesome, awesome, is, Seattle
One constraint that might be helpful is that the sentence will always have only the words present in the search string and whitespaces in between.
Note If there's a match, it shou开发者_C百科ld be the longest possible string. So like in the above example the matches shouldn't be single words rather longest possible substrings. Order among words also needs to be maintained. That's why
awesome is Seattle
in the sentence above gives us
awesome, is and Seattle
I am not sure if something like this can be done with a regex though, since it is greedy. Would appreciate any insight into this! I'm familiar with both C# and Java and could use either of their regex libraries.
I don't think you can do this with a regex. Wikipedia has a good article on the longest common subsequence problem.
There is no good way to express such a pattern directly with a regex.
You'll need to list all allowed combinations:
Seattle is awesome|Seattle is|Seattle|is awesome|is|awesome
or more succinctly:
Seattle( is( awesome)?)?|is( awesome)?|awesome
You can programmatically transform your input string to this format.
Can you describe your problem a little further? It sounds a lot more like a search engine than simple string matching. I would highly recommend checking out Apache Lucene -- it has a bit of a learning curve, but it is a great little tool for smart searching. It handles lots of things that are gotchas when dealing with search. You can set up the scoring of hits to do pretty much exactly what you describe.
In Java, not tested. This returns an iterator on lists of strings. Each list is a matching subsequence. Just put spaces between the members of the list to print. If this is getting used a lot, the use of intern() may be bad.
static Iterator<List<String>> getSequences(String squery, String starget)
{
List<String> query = Arrays.asList(squery.split(" "));
for ( int i = 0; i < query.size(); i++)
query.set(i, query.get(i).intern());
List<String> target = Arrays.asList(starget.split(" "));;
for ( int i = 0; i < target.size(); i++)
target.set(i, target.get(i).intern());
// Because the strings are all intern'ed, this HashSet acts like we want -
// If two lists are the same sequence of words, they are equal.
// It's used to remove duplicates.
HashSet<List<String>> ret = new HashSet<List<String>>();
for ( int qBegin = 0; qBegin < query.size(); qBegin++ ) {
for ( int tBegin = 0; tBegin < target.size(); tBegin++ ) {
for ( int iCursor = 0;
iCursor < min(query.size()-qBegin, target.size()- tBegin);
iCursor++) {
if ( query.get(qBegin+iCursor)==target.get(tBegin+iCursor) )
ret.add(query.subList(qBegin, qBegin+iCursor+1));
else break;
}
}
}
return ret.iterator();
}
static int min(int a, int b) { return (a<b)? a:b; }
精彩评论