Counting occurrences of an item

2023-02-12 12:39 问答作者：

I am trying to figure out an optimal solution (in Java) to the following problem:

In a first pass over some data, I count the number of occurrences of an item. Basically, I create a HashMap from item ID to integer and increment the integer every time I see an occurrence of the item. So开发者_如何学Python basically, I have a Map<Long,Integer> from itemID to count.

Now, what I need from this map is the top n item ids sorted by the count.

Apparently HashMap is not the optimal data structure here. Any ideas?

This is for some data mining stuff I am doing at work, so not a hw problem...

Actually, a HashMap is a reasonable solution here because you have to accumulate the totals. There's no way you can shortcut that, and no simple way to find the top N items until you know the counts for all items.

After you have the HashMap, there are several ways to do things. If the data is relatively small, create an array of itemId and count pairs, and sort by count in descending order. Then select the top N items.

If you have lots of items (in the hundreds of thousands), it's probably faster to use a min heap after you get the counts, the idea being that you put the first N items into the min heap and then only insert an item if its count is larger than the smallest item in the min heap.

You could keep things in order by count while you're going through adding things up, but every time you increment a counter you'll have to remove the thing from the collection and re-insert it. You're better off accumulating things in a HashMap where it's easy to look things up by ID, and then post-process to apply ordering by count.

I would sort the results after counting.

Map<Item,Integer> map = new HashMap<Item, Integer>();

... (fill the map, counting the occurences)

List<Map.Entry<Item, Integer>> list = new ArrayList<Map.Entry<Item, Integer>>(map.size());
list.addAll(map.entrySet());
Collections.sort(list, new Comparator<Map.Entry<Item, Integer>>() {
    public int compare(Map.Entry<Item, Integer>> left, Map.Entry<Item, Integer>> right) {
       // "-" to invert the order
       return - left.getValue().compareTo(right.getValue());
    }
});

Now list is a list with the items sorted (descending) by count, and .subList(0, n) will give you the first n ones.

If your n is much smaller than the total number of items, than this is not optimal - there is an better (but more complicated) algorithm to take only the best some of an unordered list, I think.

An obvious answer would be use a SortedMap for this. Make sure the comparable properties of the newly created map make the top item number one and you can just fetch the first element from it.

I think if you want to be able to get out the ids, count and still maintain the Map structure you will need create a Class to encapsulate you data.

public class DataPair implements Comparable<DataPair> {
    private long id;
    private Integer count;

    //Getters and setters

    public void increaseCount() {
        count++;
    }

    public int compareTo(DataPair dp) {
         return this.count.compareTo(dp.count);
    }

}

Then have a Map just like you've been using where:

Map<long, DataPair> m = new HashMap<long, DataPair>()

Then when you need to sort by count you can just get out the values and sort them while maintaining the ability to get the current count by id.

List<DataPair> list = new ArrayListM<DataPair>(m.values());
Collections.sort(list);

Then you will have your sorted counts and will still be able to get ids.

You can have a sorted map [sorted on the values] as follows:

Create a class Profile that will hold your data and the count [for temporary purposes].

Your Profile Class will look like this :

class Profile
{
    public String data;
    public Integer value;

    public int getValue()
    {
        return value;
    }
}

The method to sort according to values will be as follows :

public Map<String, Integer> sortMapByValues(final Map<String, Integer> passedMap)
    {
        List<Profile> tuples = new LinkedList<Profile>();

        Iterator<String> it = passedMap.keySet().iterator();

        while (it.hasNext())
        {
            String key = it.next();
            Integer val = passedMap.get(key);

            tuples.add(new Profile(key, val));
        }

        Collections.sort(tuples, new ProfileComparator());

        Map<String, Integer> sortedMap = new LinkedHashMap<String, Integer>();

        for (int i = 0; i < tuples.size(); i++)
        {
            sortedMap.put(tuples.get(i).getKey(), tuples.get(i).getValue());
        }

        return sortedMap;
    }

Now all you need is a Comparator implementation.

Your ProfileComparator class will look like this:

public final class ProfileComparator implements Comparator<Profile>
{
    public int compare(final Profile n1, final Profile n2)
    {
        if (n1.getValue() > n2.getValue())
        {
            return -1;
        }

        if (n2.getValue() > n1.getValue())
        {
            return 1;
        }

        return 0;
    }
}

Maybe TreeMap is more optional solution.

http://docs.oracle.com/javase/6/docs/api/java/util/TreeMap.html

继续阅读：algorithm data-mining

Counting occurrences of an item

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

王昌瑞《潜梦追凶》剧组庆生新锐演员未来可期？

Is it allowed to ask users to enter credit card details for own payment method?

Escaping "<" in Perl-generated XML

imessage会显示已读吗？

微信重新建群怎么建？