开发者

parsing words in a continuous string

If a have a string with words and no spaces, how should I parse those words given that I have a dictionary/list that contains those words?

For example, if my string is "thisisastringwithwords" how could I use a dictionary to create an output "this is a string with words"?

I hear that using the data structure Tries could help but maybe if someone could help with the pse开发者_运维知识库udo code? For example, I was thinking that maybe you could index the dictionary into a trie structure, then follow each char down the trie; problem is, I'm unfamiliar with how to do this in (pseudo)code.


I'm assuming that you want an efficient solution, not the obvious one where you repeatedly check if your text starts with a dictionary word.

If the dictionary is small enough, I think you could try and modify the standard KMP algorithm. Basically, build a finite-state machine on your dictionary which consumes the text character by character and yields the constructed words.

EDIT: It appeared that I was reinventing tries.


I already did something similar. You cannot use a simple dictionary. The result will be messy. It depends if you only have to do this once or as whole program.

My solution was to:

  1. Connect to a database with working words from a dictionary list (for example online dictionary)
  2. Filter long and short words in dictionary and check if you want to trim stuff (for example don't use words with only one character like 'I')
  3. Start with short words and compare your bigString with the database dictionary.

Now you need to create a "table of possibility". Because a lot of words can fit into 100% but are wrong. As longer the word as more sure you are, that this word is the right one.

It is cpu intensive but it can work precise in the result. So lets say, you are using a small dictionary of 10,000 words and 3,000 of them are with a length of 8 characters, you need to compare your bigString at start with all 3,000 words and only if result was found, it is allowed to proceed to the next word. If you have 200 characters in your bigString you need about (2000chars / 8 average chars) = 250 full loops minimum with comparation.

For me, I also did a small verification of misspelled words into the comparation.

example of procedure (don't copy paste)

    Dim bigString As String = "helloworld.thisisastackoverflowtest!"

    Dim dictionary As New List(Of String) 'contains the original words. lets make it case insentitive
    dictionary.Add("Hello")
    dictionary.Add("World")
    dictionary.Add("this")
    dictionary.Add("is")
    dictionary.Add("a")
    dictionary.Add("stack")
    dictionary.Add("over")
    dictionary.Add("flow")
    dictionary.Add("stackoverflow")
    dictionary.Add("test")
    dictionary.Add("!")


    For Each word As String In dictionary
        If word.Length < 1 Then dictionary.Remove(word) 'remove short words (will not work with for each in real)
        word = word.ToLower 'make it case insentitive
    Next

    Dim ResultComparer As New Dictionary(Of String, Double) 'String is the dictionary word. Double is a value as percent for a own function to weight result

    Dim i As Integer = 0 'start at the beginning
    Dim Found As Boolean = False
    Do
        For Each word In dictionary
            If bigString.IndexOf(word, i) > 0 Then
                ResultComparer.Add(word, MyWeightOfWord) 'add the word if found, long words are better and will increase the weight value 
                Found = True
            End If
        Next
        If Found = True Then
            i += ResultComparer(BestWordWithBestWeight).Length
        Else
            i += 1
        End If
    Loop


I told you that it seems like an impossible task. But you can have a look at this related SO question - it may help you.


If you are sure you have all the words of the phrase in the dictionary, you can use that algo:

String phrase = "thisisastringwithwords";
String fullPhrase = "";
Set<String> myDictionary;
do {
    foreach(item in myDictionary){
        if(phrase.startsWith(item){
            fullPhrase += item + " ";
            phrase.remove(item);
            break;
        }
    }
} while(phrase.length != 0);

There are so many complications, like, some items starting equally, so the code will be changed to use some tree search, BST or so.


This is the exact problem one has when trying to programmatically parse languages like Chinese where there are no spaces between words. One method that works with those languages is to start by splitting text on punctuation. This gives you phrases. Next you iterate over the phrases and try to break them into words starting with the length of the longest word in your dictionary. Let's say that length is 13 characters. Take the first 13 characters from the phrase and see if it is in your dictionary. If so, take it as a correct word for now, move forward in the phrase and repeat. Otherwise, shorten your substring to 12 characters, then 11 characters, etc.

This works extremely well, but not perfectly because we've accidentally put in a bias towards words that come first. One way to remove this bias and double check your result is to repeat the process starting at the end of the phrase. If you get the same word breaks you can probably call it good. If not, you have an overlapping word segment. For example, when you parse your sample phrase starting at the end you might get (backwards for emphasis)

words with string a Isis th

At first, the word Isis (Egyptian Goddess) appears to be the correct word. When you find that "th" is not in your dictionary, however, you know there is a word segmentation problem nearby. Resolve this by going with the forward segmentation result "this is" for the non-aligned sequence "thisis" since both words are in the dictionary.

A less common variant of this problem is when adjacent words share a sequence which could go either way. If you had a sequence like "archand" (to make something up), should it be "arc hand" or "arch and"? The way to determine is to apply a grammar checker to the results. This should be done to the whole text anyway.


Ok, I will make a hand wavy attempt at this. The perfect(ish) data structure for your problem is (as you've said a trie) made up of the words in the dictionary. A trie is best visualised as a DFA, a nice state machine where you go from one state to the next on every new character. This is really easy to do in code, a Java(ish) style class for this would be :

Class State 
{
   String matchedWord;
   Map<char,State> mapChildren;
}

From hereon, building the trie is easy. Its like having a rooted tree structure with each node having multiple children. Each child is visited on one character transition. The use of a HashMap kind of structure trims down time to look up character to next State mappings. Alternately if all you have are 26 characters for the alphabet, a fixed size array of 26 would do the trick as well.

Now, assuming all of that made sense, you have a trie, your problem still isn't fully solved. This is where you start doing things like regular expressions engines do, walk down the trie, keep track of states which match to a whole word in the dictionary (thats what I had the matchedWord for in the State structure), use some backtracking logic to jump to a previous match state if the current trail hits a dead end. I know its general but given the trie structure, the rest is fairly straightforward.


If you have dictionary of words and need a quick implmentation this can be solved efficiently with dynamic programming in O(n^2) time, assuming the dictionary lookups are O(1). Below is some C# code, the substring extraction could and dictionary lookup could be improved.

public static String[] StringToWords(String str, HashSet<string> words)
{      
  //Index of char - length of last valid word
  int[] bps = new int[str.Length + 1];

  for (int i = 0; i < bps.Length; i++)      
    bps[i] = -1;

  for (int i = 0; i < str.Length; i++)
  {
    for (int j = i + 1; j <= str.Length ; j++)
    {
      if (bps[j] == -1)
      {
        //Destination cell doesn't have valid backpointer yet
        //Try with the current substring
        String s = str.Substring(i, j - i);
        if (words.Contains(s))
          bps[j] = i;
      }
    }        
  }      

  //Backtrack to recovery sequence and then reverse 
  List<String> seg = new List<string>();
  for (int bp = str.Length; bps[bp] != -1 ;bp = bps[bp])      
    seg.Add(str.Substring(bps[bp], bp - bps[bp]));      
  seg.Reverse();
  return seg.ToArray();
}

Building a hastset with the word list from /usr/share/dict/words and testing with

foreach (var s in StringSplitter.StringToWords("thisisastringwithwords", dict))
    Console.WriteLine(s);

I get the output "t hi sis a string with words". Because as others have pointed out this algorithm will return a valid segmentation (if one exists), however this may not be the segmentation you expect. The presence of short words is reducing the segmentation quality, you might be able to add heuristic to favour longer words if two valid sub-segmentation enter an element.

There are more sophisticated methods that finite state machines and language models that can generate multiple segmentations and apply probabilistic ranking.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜