Java count of items in an array (similar to a SQL aggregate function)
I am connecting to a sockets API that is very inflexible. It will return rows such as:
NAME, CITY, STATE, JOB, MONTH
But will have duplicates because it does not do any aggregation. I need to count the duplicate rows (which would be very easy in SQL, but not, as far as I know, in Java).
Example source data:
NAME, CITY, STATE, JOB, MONTH
John Doe, Denver, CO, INSTALLATION, 090301
John Doe, Denver, CO, INSTALLATION, 090301
John Doe, Denver, CO, INSTALLATION, 090301
Jane Doe, Phoenix, AZ, SUPPORT, 090301
Intended:
NAME, CITY, STATE, JOB, MONTH, COUNT
John Doe, Denver, CO, INSTALLATION, 090301, 3
Jane Doe, Phoenix, AZ, SUPPORT, 090301, 1
I can easily do this for approximately 100,000 returned rows, but I am dealing with about 60 million per month. Any ideas?
Edit: Unfortunately, the rows are not returned sorted, nor is there an option in the API to sort them. I get this giant mess of data that needs to be aggregated. Right now I use an ArrayList and call indexOf(newRow) to check whether the item already exists, but it gets slower as the number of rows grows.
Edit: For clarification, this would only need to be run once a month, at the end of the month. Thank you for all of the responses
You could use a HashMap to store each row as a key mapped to itself, so that the stored instance can be looked up and its count incremented. (This assumes your Row objects have proper hashCode() and equals() methods implemented; a plain HashSet will not work here, because a Set cannot hand back the element it already contains.)
Something like this perhaps:
Map<Row, Row> previousRows = new HashMap<Row, Row>();
List<Row> rowsInOrder = new LinkedList<Row>();
Then in use (assuming further that you have an incrementCount() method on the Row class):
Row newRow = getNextRow();
Row existing = previousRows.get(newRow);
if (existing == null) {
    existing = newRow;
    previousRows.put(newRow, newRow);
    rowsInOrder.add(newRow);
}
existing.incrementCount();
If you don't care about the order in which the rows came in, you can get rid of the List and just iterate over the map's keys.
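This approach assumes a Row class whose equals() and hashCode() cover the key fields. A minimal sketch (the field names come from the sample data; incrementCount()/getCount() are the assumed counting hooks, not part of any real API):

```java
import java.util.Objects;

public class Row {
    private final String name, city, state, job, month;
    private int count = 0;

    public Row(String name, String city, String state, String job, String month) {
        this.name = name; this.city = city; this.state = state;
        this.job = job; this.month = month;
    }

    public void incrementCount() { count++; }
    public int getCount() { return count; }

    // equals()/hashCode() compare only the key fields, never the count,
    // so duplicate rows collapse onto one entry in a HashSet or HashMap.
    @Override public boolean equals(Object o) {
        if (!(o instanceof Row)) return false;
        Row r = (Row) o;
        return name.equals(r.name) && city.equals(r.city)
            && state.equals(r.state) && job.equals(r.job)
            && month.equals(r.month);
    }

    @Override public int hashCode() {
        return Objects.hash(name, city, state, job, month);
    }
}
```

Two Row instances built from identical fields will then compare equal and hash to the same bucket, which is exactly what the HashMap lookup above relies on.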
Do you have the flexibility, or is this an important enough task, to invest in something like Hadoop? With data of that size, you want to start thinking about the problem in a map-reduce mindset.
Are you able to fit all the data in memory at once? If you are putting it in an ArrayList, it sounds like you can.
If that is the case, you can use a Multiset, such as the one in Google Collections (now Guava).
Then you can insert all your rows into the multiset as follows:
Multiset<Row> rowMultiset = HashMultiset.create();
for (Row row: rows) {
rowMultiset.add(row);
}
And you can iterate through, with a count, using something like:
for (Multiset.Entry<Row> entry : rowMultiset.entrySet()) {
System.out.println("row: "+entry.getElement()+", count: "+entry.getCount());
}
If you don't want to use an external library, you can do something similar using a HashMap mapping rows to integers.
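A plain-HashMap version of the same idea might look like this sketch; it uses the raw input line as the key to stay self-contained, though a Row class with proper equals()/hashCode() would work the same way:

```java
import java.util.HashMap;
import java.util.Map;

public class RowCounter {
    // Count duplicate rows with a plain HashMap mapping each distinct
    // row to the number of times it has been seen.
    public static Map<String, Integer> count(Iterable<String> lines) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String line : lines) {
            Integer seen = counts.get(line);
            counts.put(line, seen == null ? 1 : seen + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        java.util.List<String> rows = java.util.Arrays.asList(
                "John Doe, Denver, CO, INSTALLATION, 090301",
                "John Doe, Denver, CO, INSTALLATION, 090301",
                "John Doe, Denver, CO, INSTALLATION, 090301",
                "Jane Doe, Phoenix, AZ, SUPPORT, 090301");
        for (Map.Entry<String, Integer> e : count(rows).entrySet()) {
            System.out.println(e.getKey() + ", " + e.getValue());
        }
    }
}
```

Each lookup and insert is amortized O(1), unlike ArrayList.indexOf, which scans the whole list on every row.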
If it is NOT the case that all your rows fit into memory, I think the simplest approach is just to insert the data into a database and do a query. Databases are designed and optimized for large datasets which don't fit into memory.
Are the rows always returned sorted? That is, are the rows to be grouped always returned one after another? If the answer is yes:
1) Initialize a counter.
2) Keep track of the previous row that you just read and compare it to the current row. If it's the same, increment your counter. If it's different, record your row with the current counter value and reset the counter.
3) When you reach the last record, make sure to record the row with the current count.
This strategy will allow you to read in the large data sets in a stream and keep your program's memory footprint low while producing the more compact aggregate data you're after.
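The three steps above can be sketched as follows, assuming a line-oriented source (the CSV line format and the trailing-count output format are taken from the question's example):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class StreamingCounter {
    // Streaming aggregation over pre-sorted input: only the previous
    // line and a counter are held in memory, so tens of millions of
    // rows need no large in-memory structure.
    public static List<String> countSorted(BufferedReader in) throws IOException {
        List<String> result = new ArrayList<String>();
        String previous = null;
        int count = 0;                          // step 1: initialize a counter
        String line;
        while ((line = in.readLine()) != null) {
            if (line.equals(previous)) {
                count++;                        // step 2: same as previous row
            } else {
                if (previous != null) {         // row changed: record the group
                    result.add(previous + ", " + count);
                }
                previous = line;
                count = 1;
            }
        }
        if (previous != null) {                 // step 3: flush the final group
            result.add(previous + ", " + count);
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        String sorted = "Jane Doe, Phoenix, AZ, SUPPORT, 090301\n"
                + "John Doe, Denver, CO, INSTALLATION, 090301\n"
                + "John Doe, Denver, CO, INSTALLATION, 090301\n"
                + "John Doe, Denver, CO, INSTALLATION, 090301\n";
        for (String row : countSorted(new BufferedReader(new StringReader(sorted)))) {
            System.out.println(row);
        }
    }
}
```

Since the API does not return sorted rows, this only applies after an external sort (for example, the merge-sort-to-file approach mentioned below in this thread).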
I can think of four ways to do this:
1) If you have enough memory to hold representations of 60 million rows (less duplicates), use a HashMap<Row, Integer> to represent the counts.
2) Store the rows in an RDB, and then use SQL to aggregate and count.
3) Write the rows to a big file, use a classical merge sort on it, then count the rows in a single pass.
4) Use something like Hadoop to spread the rows across multiple machines.
The fact that you are expecting to be accumulating counts over the period of a month or more suggests that you need to consider the possibility that your application will need to be restarted. That suggests that an RDB or file-based solution is required.