开发者

Fastest way to check list of crawler IPs via contains in Java

I got list of crawlers from following website: http://www.karavadra.net/blog/2010/list-of-crawlers-bots-and-their-ip-addresses/#respond

If you know better list of IPs that is regularly update please let me know.

Now I created object:

 private static final HashSet<String> list = new HashSet<String>(){{
        add("66.249.71.248");
        add("66.249.66.38");
        add("66.249.65.142"); // 331 more entires
 }}; 

And I'm checking the list via this method:

public static boolean isCrawler(String ip){
  return list.contains(ip);  
}

Please advise on how to improve this to be faster and more elegant solution. I use spring so beans are an option as well.

I would like to integrate update service on the list but I didn't find开发者_运维知识库 downloadable IP list that would be useful and parsing websites via Jsoup is never ideal solution.


If I understood you correctly I would like to make your contains() check faster.

Although I believe that contains() of HashSet works fine anyway I think that in you case you can improve it a little bit.

You are storing IP addresses as strings. IP address is actually number. Convert IP to number and put the result into set. This hopefully will work faster.

Here is how to convert IP to number:

public static Long ipToInt(String addr) {
        String[] addrArray = addr.split("\\.");

        long num = 0;
        for (int i=0;i<addrArray.length;i++) {
            int power = 3-i;

            num += ((Integer.parseInt(addrArray[i])%256 * Math.pow(256,power)));
        }
        return num;
    }

I took this code from http://teneo.wordpress.com/2008/12/23/java-ip-address-to-integer-and-back/


I think that you shouldn't use hashes here - 334 entries means that a binary search on a sorted list would take log2(334)=8,3837 steps, the hash function will propably take longer.

Use an ArrayList and initially sort it using Collections.sort(List list). When you want to check an IP, use Collections.binarySearch(List list, Object key) and check whether the return value is >=0 (which means that the IP is in the list).


You can use a bloom filter before looking up in the hashset.This can fasten things up.The bloom filter has a problem of false +ve's.So for all true the bloom filter returns you will have to look up in the hashset once again to make sure but all false you can be sure.Also you could replace your hashset with a radix tree/patricia trie for more compact storage.

Implementations:

  • Patricia-trie
  • Bloom filter
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜