Software to identify patterns in text files
I work on some software that parses large text files and inserts data into a database. Every time we get a new client, we have to write new parsing code for their text files.
I'm looking for some software to help simplify analyzing the text files. It would be nice to have some software that could identify patterns in the file.
I'm also open to any general purpose parsing libraries (.NET) that may simplify the job. Or any other relevant software.
Thanks.
More Specific
Ideally, I would open a text file with some magic software that shows me the repeating patterns it has identified. Really I'm just looking for any tools that developers have used to help them parse files. If something has helped you do this, please tell me about it.
Well, likely not exactly what you are looking for, but clone detection might be the right kind of idea.
There are a variety of such detectors. Some work only on raw lines of text, and that might apply directly to you.
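As a rough illustration of the raw-line idea (this is not how any particular detector works, just a minimal sketch), you can slide a fixed-size window over the file and record every block of consecutive lines that shows up more than once. The function name `repeated_line_blocks`, the `window` parameter and the sample data are all made up for the example.

```python
from collections import defaultdict

def repeated_line_blocks(lines, window=3):
    """Return {block: [start indices]} for every run of `window`
    consecutive lines that occurs more than once (exact match only)."""
    seen = defaultdict(list)
    normalized = [line.strip() for line in lines]
    for start in range(len(normalized) - window + 1):
        block = tuple(normalized[start:start + window])
        seen[block].append(start)
    # Keep only the blocks that actually repeat.
    return {block: starts for block, starts in seen.items() if len(starts) > 1}

# Hypothetical sample input.
sample = [
    "BEGIN RECORD",
    "name: Alice",
    "END RECORD",
    "BEGIN RECORD",
    "name: Alice",
    "END RECORD",
]

for block, starts in repeated_line_blocks(sample, window=3).items():
    print(starts, block)
```

Note that this only catches exact repeats. If the second record said `name: Bob`, nothing would be reported, which is the limitation of working on raw lines.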
Some work only on the words ("tokens") that make up the text, for some definition of "token". You'd have to define what you mean by tokens to such tools.
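To make "define what you mean by tokens" concrete, here is a small sketch. The token definition (runs of letters, runs of digits, or any single other non-space character) is only an assumption chosen for the example; your files may need something quite different. Comparing token sequences instead of raw lines means differences in spacing no longer matter.

```python
import re
from collections import Counter

# Assumed token definition: a word, a number, or one punctuation character.
TOKEN = re.compile(r"[A-Za-z_]+|\d+|\S")

def tokenize(line):
    """Split a line into a tuple of tokens, discarding whitespace."""
    return tuple(TOKEN.findall(line))

# Hypothetical sample input: the first three lines differ only in spacing.
lines = ["id = 42 ;", "id=42;", "id   =  42;", "name = 7;"]

counts = Counter(tokenize(line) for line in lines)
for tokens, n in counts.most_common():
    print(n, tokens)
```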
But you seem to want something that discovers the structure of the text and then looks for repeating blocks with some parametric variation. I think this is really hard to do, unless you know sort of what that structure is in advance.
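A very crude approximation of "repeating blocks with parametric variation" is to mask the parts you expect to vary and count the resulting line shapes. The sketch below assumes that only numbers vary, which is exactly the kind of advance knowledge about the structure mentioned above; `line_shape` and the sample lines are hypothetical.

```python
import re
from collections import Counter

def line_shape(line):
    """Reduce a line to a template: numbers become <NUM>,
    runs of whitespace collapse to a single space."""
    shape = re.sub(r"\d+", "<NUM>", line.strip())
    return re.sub(r"\s+", " ", shape)

# Hypothetical sample input.
sample = [
    "ORDER 1001 qty 5",
    "ORDER 1002   qty 12",
    "NOTE shipped late",
    "ORDER 1003 qty 1",
]

shapes = Counter(line_shape(line) for line in sample)
for shape, n in shapes.most_common():
    print(n, shape)
```

The most frequent shapes are candidates for record layouts. This will not discover nested or multi-line structure on its own; for that you still need to describe the structure yourself.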
Our CloneDR does this for programming language source code, where the "known structure" is that of the programming language itself, as described specifically by the BNF grammar rules.
You probably don't want to run Java-biased duplicate detection on semi-structured text. But if you do know something about the structure of the documents, you could write that down as a grammar, and our CloneDR tool would then pick it up.