Anomaly in text

2023-02-24 19:42 问答作者：

Let me explain with an example. We have the following text:

"Comme Il Faut was founded in 1927. The tobacco company is most well known for its reputation of producing customized private label brands for its partners worldwide".

This is normal text. But the following text:

"CommeIlF开发者_如何学JAVAautwasfounded in 1927. The tobacco companyi most wellknown foritsreputation of producing customizedprivatelabelbrands foritspartners worldwide"

This is text anomaly: typos, words without a space, maybe something else.

How to search for such anomalies?

What algorithms are there for this (statistical)?

It is desirable that the result was a percentage: for example, 80% of the anomalies.

Thanks.

Construct a Trie tree with all the known words in the dictionary. Take each word that apears in your text and try to find it in the Trie tree. If you don't find it then try to match prefix of length-k. If you find a match then you apply the same procedure to the rest k characters. It's recursive and it could catch more than two concatenated words

Another simple method is to use the edit distance algorithm. This algorithm calculates the minimum number of edit operations (insert, delete or replace) that have to be performed to transform the string into the other string. With some additional logic you can easily get this algorithm to output the operations as well.

This however assumes you have both the correct and the broken string. If you only have the broken string this get's a lot harder. In that case I would suggest you either try the trie approach mentioned before, or you use some external library like ispell to have it handle this logic. You could have a look at the code for ispell or it's variants to see how complicated such a task might get.

A couple of links that could be helpful:

http://www.codeproject.com/KB/cs/spellcheckdemo.aspx
http://www.codeproject.com/KB/recipes/spellcheckparser.aspx

继续阅读：algorithm text-processing

Anomaly in text

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？