开发者

Place dots where a word is misspelled

I'm creating a web app in PHP where people can try to translate words they need to learn for school.

For example, someone needs to translate the Dutch word 'weer' to 'weather' in English, but unfortunately he types 开发者_StackOverflow中文版'whether'. Because he almost typed the right word, I want to give him another try, with dots '.' on the places where he made a mistake:

Language A:   weer
Language B:   weather
Input:        whether
Output:       w..ther

Or, for example

Language A:   fout
Language B:   mistake
Input:        mitake
Output:       mi.take

Or:

Language A:   echt
Language B:   genuine
Input:        genuinely
Output:       genuinely (almost good, shorten the word a little bit)

But, if the input differs too much from the desired translation, I don't want to get output like ........

I heard about Levenshtein distance and I think I need an algorithm that is much like that one, but I don't know how to place dots on the right place instead of echoing how many operations are to be done.

So, how can I return the misspelled word with dots on the places where someone made a mistake?


First, take a look at the Levenshtein algorithm on wikipedia. Then go on and look at the examples and the resulting matrix d on the article page:

                *k*     *i*     *t*     *t*     *e*     *n*
        >0       1       2       3       4       5       6 
*s*      1      >1       2       3       4       5       6 
*i*      2       2      >1       2       3       4       5 
*t*      3       3       2      >1       2       3       4 
*t*      4       4       3       2      >1       2       3 
*i*      5       5       4       3       2      >2       3 
*n*      6       6       5       4       3       3      >2 
*g*      7       7       6       5       4       4      >3 

The distance is found on lower right corner of the matrix, d[m, n]. But from there it is now possible to follow backtrack the minimum steps to the upper left of of the matrix, d[1, 1]. You just go left, up-left, or up at each step whichever minimizes the path. In the example above, you'd find the path marked by the '>' signs:

  s i t t i n g      k i t t e n
0 1 1 1 1 2 2 3    0 1 1 1 1 2 2 3
  ^       ^   ^      ^       ^   ^
  changes in the distance, replace by dots

Now you can find on the minimum path at which locations d[i,j] the distance changes (marked by ^ in the example above), and for those letters you put in the first (or second) word a dot at position i (or j respectively).

Result:

  s i t t i n g      k i t t e n
  ^       ^   ^      ^       ^   ^
  . i t t . n .      . i t t . n . 


The terminology you are looking for is called "edit distance." Using something like Levenshtein distance will tell you the number of operations needed to transform one string into the other (insertions, deletion, substitutions, etc).

Here is a list of other "editing distance" algorithms.

Once you decide that a word is "close enough" (i.e. it doesn't exceed some threshold of edits needed), you can show where the edits need to occur (by showing dots).

So how do you know where to put the dots?

The interesting thing about the "Levenshtein distance" is it uses a M x N matrix with one word on each axis (see the sample matrix in Levenshtein article). Once you create the matrix, you can determine which letters require "additional edits" to be correct. That's where you put the dots. If the letter requires "no additional edits," you simply print the letter. Pretty cool.


I think you need to do a multi step process not just Levenshtein. First I would check if the input word is a form of target word. That would catch your 3rd example and also no worry about adding dots. You could also use this step to catch synonyms. The next step is to check the length difference of the two strings.

If the difference is 0 you can do a letter for letter comparison to place the dots. If you don't want to show all dots then you should keep a count of dots placed and once over the limit display some error message. (Sorry that was incorrect)

If the difference is shows the input is longer you need to check for a letter to be delete would fix the problem. Here you can use Levenshtein to see how close they are if they are too far away show your error message if it is in range you will need to do the steps of Levenshtein in reverse and mark the changes somehow. Not sure how you want to show that a letter needs deleted.

If the difference shows the input is shorter you can do use Levenshtein distance to see if the two words are close enough or show the error. Then do the steps in reverse again adding dots for insertions and dots for substitutions.

Actually the last two steps can be combined into one function that runs through the algorithm and remembers the insert delete or substitution and changes the output accordingly.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜