Scoring a string based on how English-like it is

2023-03-24 11:27 问答作者：

I'm not sure how exactly to word this question, so here's an example:

string1 = "THEQUICKBROWNFOX" string2 = "KLJHQKJBKJBHJBJLSDFD"

I want a function that would score string1 higher than string2 and a million other gibberish strings. Note the lack of spaces, so this is a character-by-character function, not word-by-word.

In the 90s I wrote a trigram-scoring function in Delphi and populated it with trigrams from Huck Finn, and I'm considering porting the code to C or Python or kludging it into a stand-alone tool, but there开发者_StackOverflow must be more efficient ways by now. I'll be doing this millions of times, so speed is nice. I tried the Reverend.Thomas Beyse() python library and trained it with some all-caps-strings, but it seems to require spaces between words and thus returns a score of []. I found some Markov Chain libraries, but they also seemed to require spaces between words. Though from my understanding of them, I don't see why that should be the case...

Anyway, I do a lot of cryptanalysis, so in the future scoring functions that use spaces and punctuation would be helpful, but right now I need just ALLCAPITALLETTERS.

Thanks for the help!

I would start with a simple probability model for how likely each letter is, given the previous (possibly-null, at start-of-word) letter. You could build this based on a dictionary file. You could then expand this to use 2 or 3 previous letters as context to condition the probabilities if the initial model is not good enough. Then multiply all the probabilities to get a score for the word, and possibly take the Nth root (where N is the length of the string) if you want to normalize the results so you can compare words of different lengths.

I don't see why a Markov chain couldn't be modified to work. I would create a text file dictionary of sorts, and read that in to initially populate the data structure. You would just be using a chain of n letters to predict the next letter, rather than n words to predict the next word. Then, rather than randomly generating a letter, you would likely want to pull out the probability of the next letter. For instance if you had the current chain of "TH" and the next letter was "E", you would go to your map, and see the probability that an "E" would follow "TH". Personally I would simply add up all of these probabilities while looping through the string, but how to exactly create a score from the probability is up to you. You could normalize it for string length, to let you compare short and long strings.

Now that I think about it, this method would favor strings with longer words, since a dictionary would not include phrases. Then again, you could populate the dictionary with not only single words, but short phrases with the spaces removed as well. Then the scoring would not only score based on how english the seperate words are, but how english series of words are. It's not a perfect system, but it would provide consistent scoring.

I don't know how it works, but Mail::SpamAssassin::Plugin::TextCat analyzes email and guesses what language it is (with dozens of languages supported).

The Index of Coincidence might be of help here, see https://en.wikipedia.org/wiki/Index_of_coincidence.

For a start just compute the difference of the IC to the expected value of 1.73 (see Wikipedia above). For an advanced usage you might want to calculate the expected value yourself using some example language corpus.

I'm thinking that maybe you could apply some text-to-speech synthesis ideas here. In particular, if a speech synthesis program is able to produce a pronunciation for a word, then that can be considered "English."

The pre-processing step is called grapheme-to-phoneme conversion, and typically leads to probabilities of mapping strings to sounds.

Here's a paper that describes some approaches to this problem. (I don't claim this paper is authoritative, as it just was a highly ranked search result, and I don't really have expertise in this area.)

继续阅读：c cryptanalysis cryptography python

Scoring a string based on how English-like it is

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？