Anomaly in text
Let me explain with an example. We have the following text:
"Comme Il Faut was founded in 1927. The tobacco company is most well known for its reputation of producing customized private label brands for its partners worldwide".
This is normal text. But the following text:
"CommeIlFautwasfounded in 1927. The tobacco companyi most wellknown foritsreputation of producing customizedprivatelabelbrands foritspartners worldwide"
This text contains anomalies: typos, words run together without spaces, and maybe something else.
How can I search for such anomalies?
What algorithms (statistical ones, for instance) exist for this? Ideally the result would be a percentage, for example "80% of the text is anomalous".
Thanks.
Build a trie of all the known words in the dictionary. Take each word that appears in your text and try to find it in the trie. If you don't find it, try to match a prefix of length k. If a prefix matches, apply the same procedure to the remaining characters. The procedure is recursive, so it can catch more than two concatenated words.
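The trie idea could be sketched roughly as follows. This is only a minimal illustration, not a tuned implementation: the names Trie, segment and anomaly_ratio are made up for this example, the word list is a tiny hypothetical stand-in for a real dictionary, and the "percentage" is simply the share of tokens that are either unsplittable or had to be split into several words.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_word = False


class Trie:
    def __init__(self, words):
        self.root = TrieNode()
        for word in words:
            node = self.root
            for ch in word.lower():
                node = node.children.setdefault(ch, TrieNode())
            node.is_word = True

    def prefixes(self, token, start):
        """Yield every end index e such that token[start:e] is a known word."""
        node = self.root
        for i in range(start, len(token)):
            node = node.children.get(token[i])
            if node is None:
                return
            if node.is_word:
                yield i + 1


def segment(trie, token):
    """Return the dictionary words that concatenate to token, or None."""
    token = token.lower()
    failed = set()  # start positions already known to be dead ends

    def solve(start):
        if start == len(token):
            return []
        if start in failed:
            return None
        # Try the longest matching prefix first, then recurse on the rest.
        for end in sorted(trie.prefixes(token, start), reverse=True):
            rest = solve(end)
            if rest is not None:
                return [token[start:end]] + rest
        failed.add(start)
        return None

    return solve(0)


def anomaly_ratio(trie, text):
    """Fraction of tokens that are unknown or are several words glued together."""
    tokens = [t.strip('.,;:!?"()') for t in text.split()]
    tokens = [t for t in tokens if t and not t.isdigit()]  # skip numbers like 1927
    if not tokens:
        return 0.0
    bad = 0
    for t in tokens:
        parts = segment(trie, t)
        if parts is None or len(parts) > 1:
            bad += 1
    return bad / len(tokens)


# Hypothetical miniature dictionary, for illustration only.
trie = Trie(["for", "its", "partners", "well", "known", "most", "the", "company"])
print(segment(trie, "foritspartners"))  # ['for', 'its', 'partners']
print(segment(trie, "companyi"))        # None: "i" is not in the word list
print(anomaly_ratio(trie, "most wellknown foritspartners"))  # 2 of 3 tokens flagged
```

Multiplying the ratio by 100 gives the percentage asked for in the question. With a real dictionary, very short entries (single letters, rare two-letter words) tend to make almost any string splittable, so some filtering of the word list would probably be needed.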
Another simple method is to use the edit distance algorithm. This algorithm calculates the minimum number of edit operations (insert, delete or replace) that have to be performed to transform one string into the other. With some additional logic you can easily get this algorithm to output the operations as well.
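For reference, here is a small sketch of the classic dynamic-programming edit distance (Levenshtein distance). The function names are mine, and the percentage helper is just one possible normalization (distance divided by the length of the longer string), not a standard measure.

```python
def edit_distance(a, b):
    """Minimum number of inserts, deletes and replaces turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # replace ca with cb, or match
        prev = curr
    return prev[-1]


def anomaly_percent(broken, correct):
    """Edit distance as a percentage of the longer string's length (an assumption)."""
    longest = max(len(broken), len(correct))
    if longest == 0:
        return 0.0
    return 100.0 * edit_distance(broken, correct) / longest


print(edit_distance("kitten", "sitting"))       # 3
print(edit_distance("companyi", "company is"))  # 2: one space and one "s" are missing
```

Only two rows of the table are kept at a time, so memory stays proportional to the length of the second string. To recover the actual operations you would have to keep the whole table and walk back through it.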
This, however, assumes you have both the correct and the broken string. If you only have the broken string, this gets a lot harder. In that case I would suggest you either try the trie approach mentioned before, or use an external library like ispell to handle this logic. You could have a look at the code for ispell or its variants to see how complicated such a task might get.
A couple of links that could be helpful:
http://www.codeproject.com/KB/cs/spellcheckdemo.aspx
http://www.codeproject.com/KB/recipes/spellcheckparser.aspx