Quickest way to return list of Strings by using wildcard from collection in Java
I have set of 100000 String. And for example I want to get all strings starti开发者_JAVA技巧ng with "JO" from that set. What would be the best solution for that?
I was thinking Aho-Corasick but the implementation I have does not support wild cards.
If you want all the strings starting with a sequence you can add all the String into a NavigableSet like TreeSet and get the subSet(text, text+'\uFFFF')
will give you all the entries starting with text
This lookup is O(log n)
If you want all the Strings with end with a sequence, you can do a similar thing, except you have to reverse the String. In this case a TreeMap from reversed String to forward String would be a better structure.
If you want "x*z" you can do a search with the first set and take a union with the values of the Map.
if you want contains "x", you can use a Navigable<String, Set<String>> where the key is each String starting from the first, second, third char etc The value is a Set as you can get duplicates. You can do a search like the starts with structure.
Here's a custom matcher class that does the matching without regular expressions (it only uses regex in the constructor, to put it more precisely) and supports wildcard matching:
public class WildCardMatcher {
private Iterable<String> patternParts;
private boolean openStart;
private boolean openEnd;
public WildCardMatcher(final String pattern) {
final List<String> tmpList = new ArrayList<String>(
Arrays.asList(pattern.split("\\*")));
while (tmpList.remove("")) { /* remove empty Strings */ }
// these last two lines can be made a lot simpler using a Guava Joiner
if (tmpList.isEmpty())
throw new IllegalArgumentException("Invalid pattern");
patternParts = tmpList;
openStart = pattern.startsWith("*");
openEnd = pattern.endsWith("*");
}
public boolean matches(final String item) {
int index = -1;
int nextIndex = -1;
final Iterator<String> it = patternParts.iterator();
if (it.hasNext()) {
String part = it.next();
index = item.indexOf(part);
if (index < 0 || (index > 0 && !openStart))
return false;
nextIndex = index + part.length();
while (it.hasNext()) {
part = it.next();
index = item.indexOf(part, nextIndex);
if (index < 0)
return false;
nextIndex = index + part.length();
}
if (nextIndex < item.length())
return openEnd;
}
return true;
}
}
Here's some test code:
public static void main(final String[] args) throws Exception {
testMatch("foo*bar", "foobar", "foo123bar", "foo*bar", "foobarandsomethingelse");
testMatch("*.*", "somefile.doc", "somefile", ".doc", "somefile.");
testMatch("pe*", "peter", "antipeter");
}
private static void testMatch(final String pattern, final String... words) {
final WildCardMatcher matcher = new WildCardMatcher(pattern);
for (final String word : words) {
System.out.println("Pattern " + pattern + " matches word '"
+ word + "': " + matcher.matches(word));
}
}
Output:
Pattern foo*bar matches word 'foobar': true
Pattern foo*bar matches word 'foo123bar': true
Pattern foo*bar matches word 'foo*bar': true
Pattern foo*bar matches word 'foobarandsomethingelse': false
Pattern *.* matches word 'somefile.doc': true
Pattern *.* matches word 'somefile': false
Pattern *.* matches word '.doc': true
Pattern *.* matches word 'somefile.': true
Pattern pe* matches word 'peter': true
Pattern pe* matches word 'antipeter': false
While this is far from being production-ready, it should be fast enough and it supports multiple wild cards (including in the first and last place). But of course if your wildcards are only at the end, use Peter's answer (+1).
精彩评论