Search for the most frequently occurring patterns in a non-language text file
I'm not completely sure this question belongs here, but I'm looking to find patterns in an ASCII file.
The file itself is composed of alphanumeric characters, and I just want to check for repeating patterns in it, ignoring separators and ignoring natural-language words or meaning. I only want the most frequently repeated character sequences.
I can't seem to find any existing program that does just that (they all seem to work with words, not arbitrary sets of characters). Do you know of any application that can do that?
If there is no such application, how would you recommend I approach coding one?
I'm not aware of any existing program that does this, so I can only recommend coding a solution. You will have to build a slightly modified trie with an occurrence counter on its leaves. Then the task becomes trivial: among all the leaves, find the one with the highest counter; the path from the root to that leaf is the substring (pattern) you are searching for.
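A minimal sketch of the trie idea in Python, under a few assumptions that are mine rather than part of the answer: every node (not only the leaves) carries a counter, pattern length is capped by a hypothetical `max_len` parameter to keep memory bounded, and overlapping occurrences are counted. The names `Node`, `build_trie` and `most_repeated` are made up for illustration.

```python
class Node:
    """Trie node: child links plus how many times this prefix was seen."""
    __slots__ = ("children", "count")

    def __init__(self):
        self.children = {}
        self.count = 0


def build_trie(text, max_len):
    """Insert every substring of length <= max_len, counting occurrences."""
    root = Node()
    for start in range(len(text)):
        node = root
        for ch in text[start:start + max_len]:
            node = node.children.setdefault(ch, Node())
            node.count += 1
    return root


def most_repeated(text, min_len=2, max_len=8, top=5):
    """Return up to `top` (count, substring) pairs that occur more than once."""
    root = build_trie(text, max_len)
    found = []
    stack = [(root, "")]
    while stack:
        node, prefix = stack.pop()
        for ch, child in node.children.items():
            # A longer string can never occur more often than its prefix,
            # so branches seen only once can be skipped entirely.
            if child.count > 1:
                candidate = prefix + ch
                if len(candidate) >= min_len:
                    found.append((child.count, candidate))
                stack.append((child, candidate))
    # Highest count first, then longer patterns, then alphabetical order.
    found.sort(key=lambda item: (-item[0], -len(item[1]), item[1]))
    return found[:top]


if __name__ == "__main__":
    print(most_repeated("abcabcabc", min_len=2, max_len=3, top=3))
```

To ignore separators, strip them from the text before calling `most_repeated`. For large files a suffix array or suffix automaton would likely use less memory than this plain trie.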
Also FYI: Longest common substring problem
(I know this question belongs on SO and my answer should be a comment, but I don't have enough reputation to leave comments.)
After some searching I finally found Textanz, which analyses the text and gives you a frequency count and a distribution pattern for the most frequently repeated substrings.