开发者

Scoring a string based on how English-like it is

I'm not sure how exactly to word this question, so here's an example:

string1 = "THEQUICKBROWNFOX" string2 = "KLJHQKJBKJBHJBJLSDFD"

I want a function that would score string1 higher than string2 and a million other gibberish strings. Note the lack of spaces, so this is a character-by-character function, not word-by-word.

In the 90s I wrote a trigram-scoring function in Delphi and populated it with trigrams from Huck Finn, and I'm considering porting the code to C or Python or kludging it into a stand-alone tool, but there开发者_StackOverflow must be more efficient ways by now. I'll be doing this millions of times, so speed is nice. I tried the Reverend.Thomas Beyse() python library and trained it with some all-caps-strings, but it seems to require spaces between words and thus returns a score of []. I found some Markov Chain libraries, but they also seemed to require spaces between words. Though from my understanding of them, I don't see why that should be the case...

Anyway, I do a lot of cryptanalysis, so in the future scoring functions that use spaces and punctuation would be helpful, but right now I need just ALLCAPITALLETTERS.

Thanks for the help!


I would start with a simple probability model for how likely each letter is, given the previous (possibly-null, at start-of-word) letter. You could build this based on a dictionary file. You could then expand this to use 2 or 3 previous letters as context to condition the probabilities if the initial model is not good enough. Then multiply all the probabilities to get a score for the word, and possibly take the Nth root (where N is the length of the string) if you want to normalize the results so you can compare words of different lengths.


I don't see why a Markov chain couldn't be modified to work. I would create a text file dictionary of sorts, and read that in to initially populate the data structure. You would just be using a chain of n letters to predict the next letter, rather than n words to predict the next word. Then, rather than randomly generating a letter, you would likely want to pull out the probability of the next letter. For instance if you had the current chain of "TH" and the next letter was "E", you would go to your map, and see the probability that an "E" would follow "TH". Personally I would simply add up all of these probabilities while looping through the string, but how to exactly create a score from the probability is up to you. You could normalize it for string length, to let you compare short and long strings.

Now that I think about it, this method would favor strings with longer words, since a dictionary would not include phrases. Then again, you could populate the dictionary with not only single words, but short phrases with the spaces removed as well. Then the scoring would not only score based on how english the seperate words are, but how english series of words are. It's not a perfect system, but it would provide consistent scoring.


I don't know how it works, but Mail::SpamAssassin::Plugin::TextCat analyzes email and guesses what language it is (with dozens of languages supported).


The Index of Coincidence might be of help here, see https://en.wikipedia.org/wiki/Index_of_coincidence.

For a start just compute the difference of the IC to the expected value of 1.73 (see Wikipedia above). For an advanced usage you might want to calculate the expected value yourself using some example language corpus.


I'm thinking that maybe you could apply some text-to-speech synthesis ideas here. In particular, if a speech synthesis program is able to produce a pronunciation for a word, then that can be considered "English."

The pre-processing step is called grapheme-to-phoneme conversion, and typically leads to probabilities of mapping strings to sounds.

Here's a paper that describes some approaches to this problem. (I don't claim this paper is authoritative, as it just was a highly ranked search result, and I don't really have expertise in this area.)

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜