开发者

Counting occurrences of an item

I am trying to figure out an optimal solution (in Java) to the following problem:

In a first pass over some data, I count the number of occurrences of an item. Basically, I create a HashMap from item ID to integer and increment the integer every time I see an occurrence of the item. So开发者_如何学Python basically, I have a Map<Long,Integer> from itemID to count.

Now, what I need from this map is the top n item ids sorted by the count.

Apparently HashMap is not the optimal data structure here. Any ideas?

This is for some data mining stuff I am doing at work, so not a hw problem...


Actually, a HashMap is a reasonable solution here because you have to accumulate the totals. There's no way you can shortcut that, and no simple way to find the top N items until you know the counts for all items.

After you have the HashMap, there are several ways to do things. If the data is relatively small, create an array of itemId and count pairs, and sort by count in descending order. Then select the top N items.

If you have lots of items (in the hundreds of thousands), it's probably faster to use a min heap after you get the counts, the idea being that you put the first N items into the min heap and then only insert an item if its count is larger than the smallest item in the min heap.

You could keep things in order by count while you're going through adding things up, but every time you increment a counter you'll have to remove the thing from the collection and re-insert it. You're better off accumulating things in a HashMap where it's easy to look things up by ID, and then post-process to apply ordering by count.


I would sort the results after counting.

Map<Item,Integer> map = new HashMap<Item, Integer>();

... (fill the map, counting the occurences)

List<Map.Entry<Item, Integer>> list = new ArrayList<Map.Entry<Item, Integer>>(map.size());
list.addAll(map.entrySet());
Collections.sort(list, new Comparator<Map.Entry<Item, Integer>>() {
    public int compare(Map.Entry<Item, Integer>> left, Map.Entry<Item, Integer>> right) {
       // "-" to invert the order
       return - left.getValue().compareTo(right.getValue());
    }
});

Now list is a list with the items sorted (descending) by count, and .subList(0, n) will give you the first n ones.

If your n is much smaller than the total number of items, than this is not optimal - there is an better (but more complicated) algorithm to take only the best some of an unordered list, I think.


An obvious answer would be use a SortedMap for this. Make sure the comparable properties of the newly created map make the top item number one and you can just fetch the first element from it.


I think if you want to be able to get out the ids, count and still maintain the Map structure you will need create a Class to encapsulate you data.

public class DataPair implements Comparable<DataPair> {
    private long id;
    private Integer count;

    //Getters and setters

    public void increaseCount() {
        count++;
    }

    public int compareTo(DataPair dp) {
         return this.count.compareTo(dp.count);
    }

}

Then have a Map just like you've been using where:

Map<long, DataPair> m = new HashMap<long, DataPair>()

Then when you need to sort by count you can just get out the values and sort them while maintaining the ability to get the current count by id.

List<DataPair> list = new ArrayListM<DataPair>(m.values());
Collections.sort(list);

Then you will have your sorted counts and will still be able to get ids.


You can have a sorted map [sorted on the values] as follows:

Create a class Profile that will hold your data and the count [for temporary purposes].

Your Profile Class will look like this :

class Profile
{
    public String data;
    public Integer value;

    public int getValue()
    {
        return value;
    }
}

The method to sort according to values will be as follows :

public Map<String, Integer> sortMapByValues(final Map<String, Integer> passedMap)
    {
        List<Profile> tuples = new LinkedList<Profile>();

        Iterator<String> it = passedMap.keySet().iterator();

        while (it.hasNext())
        {
            String key = it.next();
            Integer val = passedMap.get(key);

            tuples.add(new Profile(key, val));
        }

        Collections.sort(tuples, new ProfileComparator());

        Map<String, Integer> sortedMap = new LinkedHashMap<String, Integer>();

        for (int i = 0; i < tuples.size(); i++)
        {
            sortedMap.put(tuples.get(i).getKey(), tuples.get(i).getValue());
        }

        return sortedMap;
    }

Now all you need is a Comparator implementation.

Your ProfileComparator class will look like this:

public final class ProfileComparator implements Comparator<Profile>
{
    public int compare(final Profile n1, final Profile n2)
    {
        if (n1.getValue() > n2.getValue())
        {
            return -1;
        }

        if (n2.getValue() > n1.getValue())
        {
            return 1;
        }

        return 0;
    }
}


Maybe TreeMap is more optional solution.

http://docs.oracle.com/javase/6/docs/api/java/util/TreeMap.html

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜