Trying to figure out a basic pattern finding algorithm
Long story short, I have some data that I need to find patterns in. The data is something like this开发者_如何学Go (each character represents an immutable chunk): dabababacdacdacdab
I want to be able to break this data into the following kinds of chunks:
d (repeated 1x) ab (repeated 3x) a (1x) cda (3x) b (1x)
I'm familiar with basic run length encoding, but I don't really know how to do it when the length of the "thing" may be variable (in my example, cda is 3 chunks of data, but d is one chunk). Am I making any sense? Your help would be appreciated.
The main difficulty here is the "greediness" of the algorithm and the ambiguity it can produce. Given your example, what is the rule to tell the program not to represent your string as:
d X 1 ababab X 1 <<<<<----- here's the problem cda X 1 b X 1
Or worse yet: d X 1 abababcdab X 1
You see what I mean. So you you need to establish a set of rules that the algo will follow, and then the code will kinda write itself. T gain insight on this, you might try looking at some of the regular expression parsing code in grep.c, although this may be more advanced than what you need.
As a start, consider an algorithm that: 1. Limits how far ahead it will scan (i.e. set a maximum substring length) 2. Favors longer substrings, subject to #1
For example, take "aaaaaaaaaaaaaaaa" (which is 16 a's). It could be: a X 16 aa X 8 aaaa X 4 aaaaaaaa X 2 aaaaaaaaaaaaaaaa X 1
If you set a maximum match length of 4, and favor the longest match, then the answer is unambiguous: aaaa X 4. Now you can write an algorithm.
The main difficulty here is the "greediness" of the algorithm and the ambiguity it can produce. Given your example, what is the rule to tell the program not to represent your string as:
d X 1 ababab X 1 cda X 1 b X 1
Or worse yet: d X 1 abababcdab X 1
You see what I mean. So you you need to establish a set of rules that the algo will follow, and then the code will kinda write itself. To gain insight on this, you might try looking at some of the regular expression parsing code in grep.c, although this may be more advanced than what you need.
As a start, consider an algorithm that:
- Limits how far ahead it will scan (i.e. set a maximum substring length)
- Favors longer substrings, subject to #1
For example, take "aaaaaaaaaaaaaaaa" (which is 16 a's).
It could be any of these:
a X 16
aa X 8
aaaa X 4
aaaaaaaa X 2
aaaaaaaaaaaaaaaa X 1
If you set a maximum match length of 4, and favor the longest match, then the answer is unambiguous: aaaa X 4. Now you can write an algorithm.
BTW the comment on LZW (Lempel-Ziv-Welch) algowithm is spot-on; that's what that algo does, although it dynamically builds a dictionary as it scans, and tries to express later items in terms of earlier ones. For example:
aaaaaaaaaaaaaaaa (16 of them again)
= bbbbbbbb (where b = aa)
= cccc (where c = bb)
= dd (where d = cc)
= e (where e = dd)
This is not exactly how LZW works but you get the idea.
精彩评论