Suffix Array Implementation in Java
I'm looking to write an efficient n-order Markov chain method to generate random text strings given a set of example text. I currently have a Java implementation that u开发者_JAVA百科ses several layers of Maps, but it's clunky. A suffix array would be perfect for my needs, but I'm not clear if that can be efficiently implemented in Java.
In C I might do something like:
char exampleText[MAX];
char *suffixArray[MAX];
...
while(n<MAX && suffixArray[n++] = &exampleText[n]);
sort(suffixArray);
This gets gnarly in Java since I'd have to take substrings of exampleText
, or turn suffixArray
into an array of indices, or something else.
Any suggestions for a good approach to this in Java?
String
will [typically] do that for you. (Typical implementations share backing arrays when created with substring
, although that is subject to change at any time.)
For anyone interested in more efficient ways of constructing the suffix array in Java, I once used a library called jsuffixarrays. The code is here. It offers a range of construction algorithms to choose from and I found it to work very well. E.g. to use the SKEW algorithm you do this:
import org.jsuffixarrays.Algorithm;
import org.jsuffixarrays.ISuffixArrayBuilder;
import org.jsuffixarrays.SuffixArrays;
import org.jsuffixarrays.SuffixData;
String exampleText = "....";
ISuffixArrayBuilder suxBuilder = Algorithm.SKEW.getDecoratedInstance();
SuffixData sux = SuffixArrays.createWithLCP(text,sux_builder);
/* Then, to access the suffix array: */
sux.getSuffixArray();
/* And, to access the LCP array: */
sux.getLCP();
You can build without the LCP array if don't need that.
You can make some variants form array of suffixes:
First:
public static String[] suffixes(String s)
{
int N = s.length();
String[] suffixes = new String[N];
for (int i = 0; i < N; i++)
suffixes[i] = s.substring(i, N);
return suffixes;
}
Second:
public static String[] suffixes(String s)
{
int N = s.length();
StringBuilder sb = new StringBuilder(s);
String[] suffixes = new String[N];
for (int i = 0; i < N; i++)
suffixes[i] = sb.substring(i, N);
return suffixes;
}
精彩评论