Constructing frequency table of Objects in ArrayList<Object[]>
I try to implement Incognito k-anonymization algorithm in Java. A part of this algorithm is the frequency set construction for a given table. The columns of the table vary each time so I decided to represent the table as an ArrayList of Object[] where Object[] size is the number of the columns. In this Object I store each row's values for each columns.
I try to construct the frequency table using the following method:
ArrayList<Object[]> table = new ArrayList<Object[]>();
....// table filling//.....
ArrayList<Object[]> frequencySet = new ArrayList<Object[]>();
for(int i=0;i<table.size();i++)
{
Integer count = 1;
int j = 0;
for(j=i+1;j<table.size();j++)
{
if(Arrays.equals(table.get(i), table.get(j)))
{
//System.out.println(i+" equals to "+j);
count++;
table.remove(j);
j = j-1;
}
}
int size = arguments.size()+1;
Object[] anObject = new Object[size];
开发者_JAVA百科 System.arraycopy(table.get(i), 0, anObject, 0, arguments.size());
anObject[size-1] = count;
frequencySet.add(anObject);
}
The problem is that the algorithm is very slow, and I figured out that the most of the time is consumed in this method. (For 100.000 data it needs 13min to run- I don't know if this is normal). Is there a faster way to construct frequency table?
Never use remove
on an ArrayList
, it's O(size()). Also, your count variable gets wrapped and unwrapped each time you increment it. Make its type int
and wrap it into Integer
only at the very end.
Without knowing anything about the type of Objects you store, I assume the methods equals
and hashCode
are redefined for them. Then the best thing that comes into mind is wrap the array of Object's into a class Row (it's a good thing to do anyway), redefine equals and hashCode for Row (using Arrays.equals and Arrays.hashCode) and count the occurrences of each Row in one pass using a
HashMap<Row, Integer> count;
for (Row row : table) {
if (count.containsKey(row)) {
count.put(row, count.get(row) + 1);
} else {
count.put(row, 1);
}
}
Sort them, and then count reoccurances with a loop after that. That will bring it down to O(n log n)
or use a hashtable to do your counting instead. That should be a linear time calculation.
精彩评论