Split string into words
I am looking for the most efficient algorithm to form 开发者_开发问答all possible combinations of words from a string. For example:
Input String: forevercarrot
Output:
forever carrot
forever car rot
for ever carrot
for ever car rot
(All words should be from a dictionary).
I can think of a brute force approach. (find all possible substrings and match) but what would be better ways?
Use a prefix tree for your list of known words. Probably libs like myspell
already do so. Try using a ready-made one.
Once you found a match (e.g. 'car'), split your computation: one branch starts to look for a new word ('rot'), another continues to explore variants of current beginning ('carrot').
Effectively you maintain a queue of pairs (start_position, current_position)
of offsets into your string every time you split the computation. Several threads can pop from this queue in parallel and try to continue a word that starts from start_position
and is already known up to current_position
of the pair, but does not end there. When a word is found, it is reported and another pair is popped from the queue. When it's impossible, no result is generated. When a split occurs, a new pair is added to the end of the queue. Initially the queue contains a (0,0)
.
See this question which has even better answers. It's a standard dynamic programming problem:
How to split a string into words. Ex: "stringintowords" -> "String Into Words"?
A psuedocode implementation, exploiting the fact that every part of the string needs to be a word, we can't skip anything. We work forward from the start of the string until the first bit is a word, and then generate all possible combinations of the rest of the string. Once we've done that, we keep going along until we find any other possibilities for the first word, and so on.
allPossibleWords(string s, int startPosition) {
list ret
for i in startPosition..s'length
if isWord(s[startPosition, i])
ret += s[startPostion, i] * allPossibleWords(s, i)
return ret
}
The bugbear in this code is that you'll end up repeating calculations - in your example, you'll end up having to calculate allPossibleWords("carrot")
twice - once in ["forever", allPossibleWords["carrot"]]
and once in ["for", "ever", allPossibleWords["carrot"]]
. So memoizing this is something to consider.
Input String: forevercarrot
Output:
forever carrot forever car rot for ever carrot for ever car rot
program :
#include<iostream>
#include<string>
#include<vector>
#include<string.h>
void strsplit(std::string str)
{
int len=0,i,x,y,j,k;
len = str.size();
std::string s1,s2,s3,s4,s5,s6,s7;
char *c = new char[len+1]();
char *b = new char[len+1]();
char *d = new char[len+1]();
for(i =0 ;i< len-1;i++)
{
std::cout<<"\n";
for(j=0;j<=i;j++)
{
c[j] = str[j];
b[j] = str[j];
s3 += c[j];
y = j+1;
}
for( int h=i+1;h<len;h++){
s5 += str[h];
}
s6 = s3+" "+s5;
std::cout<<" "<<s6<<"\n";
s5 = "";
for(k = y;k<len-1;k++)
{
d[k] = str[k];
s1 += d[k];
s1 += " ";
for(int l = k+1;l<len;l++){
b[l] = str[l];
s2 += b[l];
}
s4 = s3+" "+s1+s2;
s7 = s4;
std::cout<<" "<<s4<<"\n";
s3 = "";s4 = "";
}
s1 = "";s3 = "";
}
}
int main(int argc, char* argv[])
{
std::string str;
if(argc < 2)
std::cout<<"Usage: "<<argv[0]<<" <InputString> "<<"\n";
else{
str = argv[1];
strsplit(str);
}
return 0;
}
精彩评论