Trying to figure out a basic pattern finding algorithm

2023-03-22 00:43 问答作者：

Long story short, I have some data that I need to find patterns in. The data is something like this开发者_如何学Go (each character represents an immutable chunk): dabababacdacdacdab

I want to be able to break this data into the following kinds of chunks:

d (repeated 1x) ab (repeated 3x) a (1x) cda (3x) b (1x)

I'm familiar with basic run length encoding, but I don't really know how to do it when the length of the "thing" may be variable (in my example, cda is 3 chunks of data, but d is one chunk). Am I making any sense? Your help would be appreciated.

The main difficulty here is the "greediness" of the algorithm and the ambiguity it can produce. Given your example, what is the rule to tell the program not to represent your string as:

d X 1 ababab X 1 <<<<<----- here's the problem cda X 1 b X 1

Or worse yet: d X 1 abababcdab X 1

You see what I mean. So you you need to establish a set of rules that the algo will follow, and then the code will kinda write itself. T gain insight on this, you might try looking at some of the regular expression parsing code in grep.c, although this may be more advanced than what you need.

As a start, consider an algorithm that: 1. Limits how far ahead it will scan (i.e. set a maximum substring length) 2. Favors longer substrings, subject to #1

For example, take "aaaaaaaaaaaaaaaa" (which is 16 a's). It could be: a X 16 aa X 8 aaaa X 4 aaaaaaaa X 2 aaaaaaaaaaaaaaaa X 1

If you set a maximum match length of 4, and favor the longest match, then the answer is unambiguous: aaaa X 4. Now you can write an algorithm.

The main difficulty here is the "greediness" of the algorithm and the ambiguity it can produce. Given your example, what is the rule to tell the program not to represent your string as:

d X 1 ababab X 1 cda X 1 b X 1

Or worse yet: d X 1 abababcdab X 1

You see what I mean. So you you need to establish a set of rules that the algo will follow, and then the code will kinda write itself. To gain insight on this, you might try looking at some of the regular expression parsing code in grep.c, although this may be more advanced than what you need.

As a start, consider an algorithm that:

Limits how far ahead it will scan (i.e. set a maximum substring length)
Favors longer substrings, subject to #1

For example, take "aaaaaaaaaaaaaaaa" (which is 16 a's).
It could be any of these:

a X 16

aa X 8

aaaa X 4

aaaaaaaa X 2

aaaaaaaaaaaaaaaa X 1

If you set a maximum match length of 4, and favor the longest match, then the answer is unambiguous: aaaa X 4. Now you can write an algorithm.

BTW the comment on LZW (Lempel-Ziv-Welch) algowithm is spot-on; that's what that algo does, although it dynamically builds a dictionary as it scans, and tries to express later items in terms of earlier ones. For example:

aaaaaaaaaaaaaaaa (16 of them again)

= bbbbbbbb (where b = aa)

= cccc (where c = bb)

= dd (where d = cc)

= e (where e = dd)

This is not exactly how LZW works but you get the idea.

继续阅读：compression design-patterns

Trying to figure out a basic pattern finding algorithm

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

王昌瑞《潜梦追凶》剧组庆生新锐演员未来可期？

Is it allowed to ask users to enter credit card details for own payment method?

Escaping "<" in Perl-generated XML

imessage会显示已读吗？

微信重新建群怎么建？