开发者

Huge performance difference between Vector and HashSet

I have a program which fetches records from database (using Hibernate) and fills them in a Vector. There was an issue regarding the performance of the operation and I did a test with the Vector replaced by a HashSet. With 300000 records, the speed gain is immense - 45 mins to 2 mins!

So my question is, what is causing this huge difference? Is it just the point that all methods in Vector are synchronized or the point that internally Vector uses an array whereas HashSet does not? Or something else?

Th开发者_开发百科e code is running in a single thread.

EDIT: The code is only inserting the values in the Vector (and in the other case, HashSet).


If it's trying to use the Vector as a set, and checking for the existence of a record before adding it, then filling the vector becomes an O(n^2) operation, compared with O(n) for HashSet. It would also become an O(n^2) operation if you insert each element at the start of the vector instead of at the end.

If you're just using collection.add(item) then I wouldn't expect to see that sort of difference - synchronization isn't that slow.

If you can try to test it with different numbers of records, you could see how each version grows as n increases - that would make it easier to work out what's going on.

EDIT: If you're just using Vector.add then it sounds like something else could be going on - e.g. your database was behaving differently between your different test runs. Here's a little test application:

import java.util.*;

public class Test {
  public static void main(String[] args) {
    long start = System.currentTimeMillis();
    Vector<String> vector = new Vector<String>();
    for (int i = 0; i < 300000; i++) {
      vector.add("dummy value");
    }
    long end = System.currentTimeMillis();
    System.out.println("Time taken: " + (end - start) + "ms");
  }
}

Output:

Time taken: 38ms

Now obviously this isn't going to be very accurate - System.currentTimeMillis isn't the best way of getting accurate timing - but it's clearly not taking 45 minutes. In other words, you should look elsewhere for the problem, if you really are just calling Vector.add(item).

Now, changing the code above to use

vector.add(0, "dummy value"); // Insert item at the beginning

makes an enormous difference - it takes 42 seconds instead of 38ms. That's clearly a lot worse - but it's still a long way from being 45 minutes - and I doubt that my desktop is 60 times as fast as yours.


If you are inserting them at the middle or beginning instead of at the end, then the Vector needs to move them all along. Every insert. The hashmap, on the other hand, doesn't really care or have to do anything.


Vector is outdated and should not be used anymore. Profile with ArrayList or LinkedList (depends on how you use the list) and you will see the difference (sync vs unsync). Why are you using Vector in a single threaded application at all?


Vector is synchronized by default; HashSet is not. That's my guess. Obtaining a monitor for access takes time.

I don't know if there are reads in your test, but Vector and HashSet are both O(1) if get() is used to access Vector entries.


Under normal circumstances, it is totally implausible that inserting 300,000 records into a Vector will take 43 minutes longer than inserting the same records into a HashSet.

However, I think there is a possible explanation of what might be going on.

First, the records coming out of the database must have a very high proportion of duplicates. Or at least, they must be duplicates according to the semantics of the equals/hashcode methods of your record class.

Next, I think you must be pushing very close to filling up the heap.

So the reason that the HashSet solution is so much faster is that it is most of the records are being replaced by the set.add operation. By contrast the Vector solution is keeping all of the records, and the JVM is spending most of its time trying to squeeze that last 0.05% of memory by running the GC over, and over and over.

One way to test this theory is to run the Vector version of the application with a much bigger heap.


Irrespective, the best way to investigate this kind of problem is to run the application using a profiler, and see where all the CPU time is going.


import java.util.*;

public class Test {
  public static void main(String[] args) {
    long start = System.currentTimeMillis();
    Vector<String> vector = new Vector<String>();
    for (int i = 0; i < 300000; i++) {
       if(vector.contains(i)) {
         vector.add("dummy value");
       }
     }
    long end = System.currentTimeMillis();
    System.out.println("Time taken: " + (end - start) + "ms");
  }
}

If you check for duplicate element before insert the element in the vector, it will take more time depend upon the size of vector. best way is to use the HashSet for high performance, because Hashset will not allow duplicate and no need to check for duplicate element before inserting.


According to Dr Heinz Kabutz, he said this in one of his newsletters.

The old Vector class implements serialization in a naive way. They simply do the default serialization, which writes the entire Object[] as-is into the stream. Thus if we insert a bunch of elements into the List, then clear it, the difference between Vector and ArrayList is enormous.

import java.util.*;
import java.io.*;

public class VectorWritingSize {
  public static void main(String[] args) throws IOException {
    test(new LinkedList<String>());
    test(new ArrayList<String>());
    test(new Vector<String>());
  }

  public static void test(List<String> list) throws IOException {
    insertJunk(list);
    for (int i = 0; i < 10; i++) {
      list.add("hello world");
    }
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    ObjectOutputStream out = new ObjectOutputStream(baos);
    out.writeObject(list);
    out.close();
    System.out.println(list.getClass().getSimpleName() +
        " used " + baos.toByteArray().length + " bytes");
  }

  private static void insertJunk(List<String> list) {
    for(int i = 0; i<1000 * 1000; i++) {
      list.add("junk");
    }
    list.clear();
  }
}

When we run this code, we get the following output:

LinkedList used 107 bytes
ArrayList used 117 bytes
Vector used 1310926 bytes

Vector can use a staggering amount of bytes when being serialized. The lesson here? Don't ever use Vector as Lists in objects that are Serializable. The potential for disaster is too great.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜