Parse Phrases into different word pairings
I am trying to figure out what the best method would be for me to use to parse word phrases passed to me and build different groupings based on those phrases.
Example XML:
<root>
<keyword value=""My First Phrase""/>
<keyword value=""My First Phrase Again""/>
<keyword value=""My First Phrase Again and Again""/>
</root>
So I would extract these out of the xml:
My First Phrase
My First Phrase Again
My First Phrase Again and Again
I would then like to build these new phrases from the original:
My First Phrase
My First
First Phrase
My
First
Phrase
My First Phrase Again
My First Phrase
First Phrase Again
My First
First Phrase
Phrase Again
My
First
Phrase
Again
This would let me break down the phrases and build a sort of ranking out of those开发者_StackOverflow社区 words. I am have built some lists and iterated over them, but it isn't work the way I would expect.
So for the ranking I mean this:
My First Phrase Again Rank: 1 (Exact Match)
My First Phrase Rank: 2
First Phrase Again Rank: 2
My First Rank: 3
First Phrase Rank: 3
Phrase Again Rank: 3
My Rank: 4
First Rank: 4
Phrase Rank: 4
Again Rank: 4
Not sure what the best approach would be to parse this data.
Thanks,
S
It sounds like you're looking at developing a grammar. Your rankings look like they're the same as the depth of their tokens in a parse tree. Your terminal symbols would be any word, and your start symbols would be the sentences listed in your root
element.
For instance:
S -> X Y
X -> M F
Y -> P A
M -> "My"
F -> "First"
P -> "Phrase"
A -> "Again"
In this instance, the depth of "My First Phrase Again" would be 0 in the parse tree, the depth of "My First" and "Phrase Again" would be 1, and the depth of "My", "First", "Phrase", and "Again" would be 2.
I would start looking around for grammar parsers. There are a lot of these available since they're used in writing compilers. Alternatively you could try and write your own. Context-free grammars are fairly simple to implement; All you really need is a stack and a way to interpret and operate on your grammar rules. There is a lot of literature on this since it is a well-studied area of computer science.
You need a suffix array, but rather than separating by character, separate by the " " token. http://en.wikipedia.org/wiki/Suffix_array
There is a good description of this in Programming Pearls.
If i understand correctly your definition of 'rank', you can solve it with something like this:
public class PhraseRanking : IEnumerable<KeyValuePair<string, int>>
{
private readonly Dictionary<string, int> _ranking;
public PhraseRanking()
{
_ranking = new Dictionary<string, int>();
}
public PhraseRanking(string phrase)
: this()
{
var words = phrase.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
var sb = new StringBuilder(phrase.Length);
for(int i = words.Length; i > 0; --i)
{
int rank = words.Length - i + 1;
int lastFirstWordIndex = words.Length - i;
for(int j = 0; j <= lastFirstWordIndex; ++j)
{
sb.Clear();
int lastWordIndex = j + i - 1;
for(int k = j; k <= lastWordIndex; ++k)
{
sb.Append(words[k]);
if(k != lastWordIndex) sb.Append(' ');
}
_ranking[sb.ToString()] = rank;
}
}
}
public int this[string phrase]
{
get { return _ranking[phrase]; }
}
public int Count
{
get { return _ranking.Count; }
}
public IEnumerator<KeyValuePair<string, int>> GetEnumerator()
{
return _ranking.GetEnumerator();
}
System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()
{
return _ranking.GetEnumerator();
}
}
Usage:
var ranking = new PhraseRanking("My First Phrase Again");
var sb = new StringBuilder();
foreach(var rank in ranking)
{
sb.AppendLine(rank.Value.ToString() + ": " + rank.Key);
}
MessageBox.Show(sb.ToString());
Output:
1: My First Phrase Again
2: My First Phrase
2: First Phrase Again
3: My First
3: First Phrase
3: Phrase Again
4: My
4: First
4: Phrase
4: Again
精彩评论