Check if string is repetition of an unknown substring
I'm trying to write a regex or Ruby method which will find the longest repeated pattern in a string开发者_StackOverflow中文版. For example:
"abcabc" => "abc"
"cccc" => "c"
"abcd" => "abcd"
What is the best way to implement this? I naïvely tried /^(.*)*$/
but that won't work as it just matches the full string.
I can’t believe all these stuckinthebookheads have been telling you it can’t be done with a pattern. They do not know what they are talking about, as I am about to demonstrate. Believe me, if I can solve Diophantine equations of order one using patterns — AND I CAN! :) — then I can certainly do this simple little bit. In fact, this is very very easy to do with a pattern, provided that you will be content with the leftmost longest such match. For example, just use:
/(.+)\1+/
If that matches, then the string contains a repeated substring.
Integer Factorization with Patterns
That’s the same strategy that you use pattern matching to factor composite integers.
First, create a string that is the unary representation of the integer. That would be 1
for 1, 11
for 2, 111
for 3, 1111
for 4, etc. Given such a representation, the pattern to find the largest factor is:
/^(11+)\1+$/
where the first subgroup is the largest factor in unary, wherefore the length of the first group is that largest factor as a regular number. (However, that largest factor may not be prime.)
Similarly,
/^(11+?)\1+$/
is the same except that it now finds the smallest factor, which is of course guaranteed to be prime. I don’t know how to emulate Perl’s x
repetition operator in Ruby, so here is a quick demo of this idea using Perl:
$ perl -le 'for $n (@ARGV) { printf "%d is composite and its largest factor is %d.\n", $n, length($1) if ("1" x $n) =~ /^(11+)\1+$/ } ' 5 9 15 24 60 243 891
9 is composite and its largest factor is 3.
15 is composite and its largest factor is 5.
24 is composite and its largest factor is 12.
60 is composite and its largest factor is 30.
243 is composite and its largest factor is 81.
891 is composite and its largest factor is 297.
$ perl -le 'for $n (@ARGV) { printf "%d is composite and its smallest factor is %d.\n", $n, length($1) if ("1" x $n) =~ /^(11+?)\1+$/ } ' 5 9 15 24 60 243 891
9 is composite and its smallest factor is 3.
15 is composite and its smallest factor is 3.
24 is composite and its smallest factor is 2.
60 is composite and its smallest factor is 2.
243 is composite and its smallest factor is 3.
891 is composite and its smallest factor is 3.
A good pattern to find such things in the dictionary is
/(\w+)\1+/i
so that you do the backreference case insensitively.
Trolling the Dictionary
This is a quick way to find such things in the dictionary list:
$ perl -MEnglish -nle 'print "$PREMATCH<$MATCH>$POSTMATCH" while /(\w+)(\1+)/gi' /usr/share/dict/words
That finds things like:
b<oo>kkeeper
boo<kk>eeper
bookk<ee>per
when fed bookkeeper
. Sorted by substring length, the longest dictionary matches are:
12 ambi<lateralatera>lly
12 <gastrogastro>tomy
12 <killeekillee>
12 <quadriquadri>c
12 <tangantangan>
10 <Bellabella>
10 hydr<ophobophob>ia
10 <kerrikerri>
10 <kinnikinni>ck
10 m<ethylethyl>acetic
10 <micromicro>farad
10 <micromicro>n
10 <philophilo>sophos
10 <quinaquina>
10 <strumstrum>
10 <supersuper>abundance
10 <supersuper>abundant
10 <supersuper>abundantly
10 <supersuper>b
10 <supersuper>ior
10 <Wallawalla>
8 <acetacet>ic
8 ali<bangbang>
8 <anapanap>a
8 <antianti>body
Sneaky Lookaheads => Sneakaheads
However, it is the leftmost longest such occurring substring. You have to be a lot sneakier to figure out all such substrings even in the fact of overlaps. For example:
2 a<dd>ititious
4 add<itit>ious
4 addi<titi>ous
6 <alfalf>a
6 a<lfalfa>
2 a<ll>ele
4 al<lele>
12 ambi<lateralatera>lly
12 ambil<ateralateral>ly
2 ambilateralatera<ll>y
6 <assass>inatress
2 a<ss>assinatress
2 assa<ss>inatress
2 assassinatre<ss>
2 Ca<rr>aran
4 Car<rara>n
The trick for those is to load up your groups inside a lookahead, turning it into a sneakahead. For example:
/(?=(\w+)(\1+))/i
would be enough to load up the first two groups with the entire match. However, you probably need to keep the prematch and postmatch parts around too, perhaps like this:
/(?=(.*?)(\w+)(\2+)(.*))/i
now you can do a progressive match to sneakahead and find all such matches, even overlaps! The list I gave above was generated using this:
$ perl -nle 'print length($2 . $3), " $`.$1<$2$3>$4" while /(?=(.*?)(\w+)(\2+)(.*))/gi' /usr/share/dict/words | perl -pe 's/\.//g' | uniq
I am pretty sure that the same approach should translate into Ruby without any trouble, since it’s really a property of the matching engine not of Perl per se.
Bonus Solution
Still wondering about those Diophantine equations, eh? :) Run this in Perl:
# solve for 12x + 15y + 16z = 281, maximizing x
if (($X, $Y, $Z) =
(('o' x 281) =~ /^(o*)\1{11}(o*)\2{14}(o*)\3{15}$/))
{
($x, $y, $z) = (length($X), length($Y), length($Z));
print "One solution is: x=$x; y=$y; z=$z.\n";
} else {
print "No solution.\n";
}
and wonder of wonders, it prints out
One solution is: x=17; y=3; z=2.
As with factoring composite numbers, you can change how you weight these using minimal matching quantifiers. Because the first o*
was greedy, x was allowed to grow as
large as it could. Changing one or more *
quantifiers to *?
,
+
, or +?
can produce different solutions:
('o' x 281) =~ /^(o+)\1{11}(o+)\2{14}(o+)\3{15}$/
# One solution is: x=17; y=3; z=2
('o' x 281) =~ /^(o*?)\1{11}(o*)\2{14}(o*)\3{15}$/
# One solution is: x=0; y=7; z=11.
('o' x 281) =~ /^(o+?)\1{11}(o*)\2{14}(o*)\3{15}$/
# One solution is: x=1; y=3; z=14.
Isn’t that simply incredible? But it’s true. Run the code yourself and you’ll see.
And yes, if anyone is having déjà lu, you’re right, you indeed have read all this before — because I already wrote it up in The Perl Cookbook, several long aeons ago. It all still holds true.
NB: Credit for this technique must go to (M. Douglas) Doug McIlroy of Bell Labs for first demonstrating this wonder.
I knew it couldn't be that complicated, so I thought it over and found a solution:
def unrepeat(str)
n = str.size
newstr = str
n.times do |i|
newstr = newstr[-1] + newstr[0..-2]
if newstr == str
return i + 1
end
end
end
This will return the length of the repeated pattern. It finds this by generating rotations of the string.
精彩评论