Find commonly occurring sequences in a string
How can I find commonly occurring sequenc开发者_高级运维es in a string in Java?
The string is a long sequence of digits and I want to find the most commonly occurring sequences of digits.
I guess that depends on how long the sequences are that you are looking for.
What I would do is use a Guava Multiset
, iterate over the sequence, write all subsequences to the Multiset and sort that by occurrence. Here's a sample implementation:
public static String getMostFrequentSequence(final String input, final int patternLength) {
final Multiset<String> multiset = HashMultiset.create();
final int length = patternLength < 0 ? input.length() : Math.min(patternLength, input.length());
for (int l = 2; l < length; l++) {
for (int o = 0; o < input.length() - l; o++) {
multiset.add(input.substring(o, o + l));
}
}
return Ordering.from(new Comparator<Entry<String>>() {
public int compare(final Entry<String> o1, final Entry<String> o2) {
return
ComparisonChain.start()
.compare(o1.getCount(), o2.getCount())
.compare(o1.getElement(), o2.getElement())
.result();
}
}).max(multiset.entrySet()).getElement();
}
And about performance: This test method takes about a second on my machine for unlimited length and about 25 milliseconds when I limit the pattern length to 12 chars
public static void main(final String[] args) throws Exception {
final StringBuilder sb = new StringBuilder();
final Random random = new Random();
for (int i = 0; i < 1000; i++) {
sb.append(random.nextInt(10));
}
final long t1 = System.currentTimeMillis();
final String input = sb.toString();
System.out.println(input);
System.out.println(getMostFrequentSequence(input, -1));
System.out.println(System.currentTimeMillis() - t1);
final long t2 = System.currentTimeMillis();
System.out.println(getMostFrequentSequence(input, 12));
System.out.println(System.currentTimeMillis() - t2);
}
For a given length of numbers, you could place then all in an ArrayList, sort them and count the number of duplicates (they will be next to each other) You will always have less than 1000 entries.
精彩评论