Java: Finding matches between Strings
Given 2 strings, I want to find the first match of at least four characters.
This is the code I currently have to do it. It works correctly, but I'm thinking there may be a better way to do this. Are there any screaming inefficiencies or bad practices in what I'm doing? Are there common libraries, like Apache Commons, that I should be taking advantage of but I'm not?
Don't worry about the Gene
class - it just contains the String in question. Also - GeneMatch()
signifies no match exists, whereas the GeneMatch
constructor with arguments signifies a match has been found.
Constants.MIN_MATCH
== 4, in this case.
public static GeneMatch findMatch(Gene g0, Gene g1) {
String g0DNA = g0.getDNA();
String g1DNA = g1.getDNA();
if (g0DNA.equals("") || g1DNA.equals("")) { //there won't be a match if one is empty
return new GeneMatch();
}
int g0Left = -1;
int g0Right = -1;
int g1Left = -1;
int g1Right = -1;
String window;
for (int inx = 0; inx <= g0DNA.length() - Constants.MIN_MATCH; inx++) {
window = g0DNA.substring(inx, inx + Constants.MIN_MATCH);
if (g1DNA.indexOf(window) != -1) {
g0Left = inx;
g0Right = inx + Constants.MIN_MATCH;
g1Left = g1DNA.indexOf(window);
g1Right = g1Left + Constants.MIN_MATCH;
/* grow the match to the right
* while the two right indices are less than the lengths of their respective strings, and the
* characters at the indices match, increment each index
*/
while (g0Right < g0DNA.length() && g1Right < g1DNA.length() && g0DNA.charAt(g0Right) == g1DNA.charAt(g1Right)) {
g0Right++;
g1Right++;
}
break; //we've already found a match, no need to continue sliding the window
}
}
//now that the indices are found, convert to Genes
if (g0Left == -1 || g0Right == -1 || g1Left == -1 || g1Right == -1) { //no m开发者_开发知识库atch found
return new GeneMatch();
}
Gene gL0 = new Gene(g0DNA.substring(0, g0Left));
Gene gL1 = new Gene(g1DNA.substring(0, g1Left));
Gene g0match = new Gene(g0DNA.substring(g0Left, g0Right));
Gene g1match = new Gene(g1DNA.substring(g1Left, g1Right));
Gene gR0 = new Gene(g0DNA.substring(g0Right));
Gene gR1 = new Gene(g1DNA.substring(g1Right));
//sanity check
assert g0DNA.equals(gL0.getDNA() + g0match.getDNA() + gR0.getDNA()) : "g0 didn't add up";
assert g1DNA.equals(gL1.getDNA() + g1match.getDNA() + gR1.getDNA()) : "g1 didn't add up";
return new GeneMatch(gL0, gR0, g0match, g1match, gL1, gR1);
}
Current approach
- Double g1DNA.indexOf(window) call - first call result can be stored and reused later;
- Unnecessary string objects construction during window = g0DNA.substring(inx, inx + Constants.MIN_MATCH);
- Unnecessary gL0, gL1, gR0, gR1 construction in case assertion is off;
- if (g0DNA.equals("") || g1DNA.equals("")) check can be improved in order to check that the strings has at least four symbols each;
- It always better to call equals() on constant, i.e. use "".equals(arg). It allows to avoid possible NPE if arg is null. It doesn't have much impact here, just a good coding policy to apply;
- There is String.isEmpty() method that can be used to replace "".equals(arg);
- No null check is performed for the DNA strings;
Improvements
- It's better to loop the shortest string, i.e. you should check dna1 and dna2 length and perform outer loop against the one with shorter length. That allows to minimize iterations number;
- You can avoid creating new string objects and operate in terms of characters. Moreover, you can modify the algorithm in order to work with any java.lang.CharSequence implementation;
- You can remember unmatched sequences, i.e. keep set of char sequences that were checked and proved to be unmatched in order to minimize the time of outer loop iteration. For example you iterate over the string that contains many 'b' chars. You check that the second string doesn't contain that char during first 'b' processing. You can remember that and stop subsequent 'b' processings eagerly;
- When you use String.indexOf() the search is performed from start of the string. That may be problem if the string to be search on is rather long. It may be worth to create a characters index for it. I.e. before finding the match you can iterate all target string characters and build mappings like 'character' -> 'set of indexes of their occurrence within the string'. That allows to perform the loop body check much faster in case of long strings;
General consideration There is no 'the one best algorithm' because 'the best' selection depends on input data profile and algorithm usage policy. I.e. if the algorithm is executed rarely and its performance impact is insignificant there is no point in spending a lot of time to its optimization and much better to write a simple code that is easy to maintain. If input strings are rather short there is no point in building characters index etc. In general just try to avoid preliminary optimization whenever possible and carefully consider all input data during choosing resulting algorithm if you really have a bottleneck there.
Looks quite okay to me. Just two minor things:
reuse the result of
g1DNA.indexOf(window)
instead of calling it twice (g1Left = g1DNA.indexOf(window);
)you don't have to check all 4 vars for being == -1 as you all set them at once anyway.
Looks good to me. One might go ahead and micro optimize in terms of assignments, but this is the job of the JIT compiler. If you feel the algorithm is too slow, try to profile it.
精彩评论