Real world typo statistics? [closed]
Where can I find some real world typo statistics?
I'm trying to match people's input text to internal objects, and people tend to make spelling mistakes.
There are 2 kinds of mistakes:typos
- "Helllo" instead of "Hello" / "Satudray" instead of "Saturday" etc.Spelling
- "Shikago" instead of "Chicago"
I use Damerau-Levenshtein distance for the typos and Double Metaphone for spelling (Python implementations here and here).
I want to focus on the Damerau-Levenshtein (or simply edit-distance
). The textbook implementations always use '1' for the weight of deletions, insertions substitutions and transpositions. While this is simple and allows for nice algorithms it doesn't match "reality" / "real-world probabilities".
Examples:
- I'm sure the likelihood of "Helllo" ("Hello") is greater than "Helzlo", yet they are both 1 edit distance away.
- "Gello" is closer than "Qello" to "Hello" on a QWERTY keyboard.
- Unicode transliterations: What is the "real" distance between "München" and "Munchen"?
What should the "real world" weights be for deletions, insertions, substitutions, and transpositions?
Even Norvig's very cool spell corrector uses non-weighted edit distance.
BTW- I'm sure the weights need to be functions and not simple floats (per the above examples)...
I can adjust the algorithm, but where can I "learn" these weights? I don't have access to Google-scale data...
Should I just guess them?
EDIT - trying to answer user questions:
- My current non-weighted algorithm fails often when faced with typos for the above reasons. "Return on Tursday": every "real person" can easily tell Thursday is more likely than Tuesday, yet they are both 1-edit-distance away! (Yes, I do log and measure my performance).
- I'm developing an NLP Travel Search engine, so my dictionary contains ~25K destinations (expected to grow to 100K), Time Expressions ~200 (expected 1K), People expressions ~100 (expected 300), Money Expressions ~100 (expected 500), "glue logic words" ("from", "beautiful", "apartment") ~2K (expected 10K) and so on...
- Usage of the edit distance is different for each of the above word-groups. I try to "auto-correct when obvious", e.g. 1 edit distance away from only 1 other word in the dictionary. I have many other hand-tuned rules, e.g. Double Metaphone fix which is not more than 2 edit distance away from a dictionary word with a length > 4... The list of rules continues to grow as I learn from real world input.
- "How many pairs of dictionary entries are within your threshold?": well, that depends on the "fancy weighting system" and on real world (future) input, doesn't it? Anyway, I have extensive unit tests so that every change I make to the system only makes it better (based on past inputs, of course). Most sub-6 letter words are within 1 edit distance from a word that is 1 edit distance away from another dictionary entry.
- Today when there are 2 dicti开发者_如何学JAVAonary entries at the same distance from the input I try to apply various statistics to better guess which the user meant (e.g. Paris, France is more likely to show up in my search than Pārīz, Iran).
- The cost of choosing a wrong word is returning semi-random (often ridiculous) results to the end-user and potentially losing a customer. The cost of not understanding is slightly less expensive: the user will be asked to rephrase.
- Is the cost of complexity worth it? Yes, I'm sure it is. You would not believe the amount of typos people throw at the system and expect it to understand, and I could sure use the boost in Precision and Recall.
Possible source for real world typo statistics would be in the Wikipedia's complete edit history.
http://download.wikimedia.org/
Also, you might be interested in the AWB's RegExTypoFix
http://en.wikipedia.org/wiki/Wikipedia:AWB/T
I would advise you to check the trigram alogrithm. In my opinion it works better for finding typos then edit distance algorithm. It should work faster as well and if you keep dictionary in postgres database you can make use of index.
You may find useful stackoverflow topic about google "Did you mean"
Probability Scoring for Spelling Correction by Church and Gale might help. In that paper, the authors model typos as a noisy channel between the author and the computer. The appendix has tables for typos seen in a corpus of Associated Press publications. There is a table for each of the following kinds of typos:
- deletion
- insertion
- substitution
- transposition
For example, examining the insertion table, we can see that l was incorrectly inserted after l 128 times (the highest number in that column). Using these tables, you can generate the probabilities you're looking for.
If the research is your interest I think continuing with that algorithm, trying to find decent weights would be fruitful.
I can't help you with typo stats, but I think you should also play with python's difflib. Specifically, the ratio() method of SequenceMatcher. It uses an algorithm which the docs http://docs.python.org/library/difflib.html claim is well suited to matches that 'look right', and may be useful to augment or test what you're doing.
For python programmers just looking for typos it is a good place to start. One of my coworkers has used both Levenshtein edit distance and SequenceMatcher's ratio() and got much better results from ratio().
Some questions for you, to help you determine whether you should be asking your "where do I find real-world weights" question:
Have you actually measured the effectiveness of the uniform weighting implementation? How?
How many different "internal objects" do you have -- i.e. what is the size of your dictionary?
How are you actually using the edit distance e.g. John/Joan, Marmaduke/Marmeduke, Featherstonehaugh/Featherstonhaugh: is that "all 1 error" or is it 25% / 11.1% / 5.9% difference? What threshold are you using?
How many pairs of dictionary entries are within your threshold (e.g. John vs Joan, Joan vs Juan, etc)? If you introduced a fancy weighting system, how many pairs of dictionary entries would migrate (a) from inside the threshold to outside (b) vice versa?
What do you do if both John and Juan are in your dictionary and the user types Joan?
What are the penalties/costs of (1) choosing the wrong dictionary word (not the one that the user meant) (2) failing to recognise the user's input?
Will introducing a complicated weighting system actually reduce the probabilities of the above two error types by sufficient margin to make the complication and slower speed worthwhile?
BTW, how do you know what keyboard the user was using?
Update:
"""My current non-weighted algorithm fails often when faced with typos for the above reasons. "Return on Tursday": every "real person" can easily tell Thursday is more likely than Tuesday, yet they are both 1-edit-distance away! (Yes, I do log and measure my performance)."""
Yes, Thursday -> Tursday by omitting an "h", but Tuesday -> Tursday by substituting "r" instead of "e". E and R are next to each other on qwERty and azERty keyboards. Every "real person" can easily guess that Thursday is more likely than Tuesday. Even if statistics as well as guesses point to Thursday being more likely than Tuesday (perhaps omitting h will cost 0.5 and e->r will cost 0.75), will the difference (perhaps 0.25) be significant enough to always pick Thursday? Can/will your system ask "Did you mean Tuesday?" or does/will it just plough ahead with Thursday?
精彩评论