How to get the most represented object from an array
I have an array with some objects, and there are several objects that are alike. E.g: fruit = [apple, orange, apple, banana, ba开发者_运维问答nana, orange, apple, apple]
What is the most efficient way to get the most represented object from this array? In this case it would be "apple" but how would you go out and calculate that in an efficient way?
Don't reinvent the wheel. In Python 2.7+ you can use the Counter class:
import collections
fruit=['apple', 'orange', 'apple', 'banana', 'banana', 'orange', 'apple', 'apple']
c=collections.Counter(fruit)
print(c.most_common(1))
# [('apple', 4)]
If you are using an older version of Python, then you can download Counter
here.
While it's good to know how to implement something like this yourself, it's also a good idea to get used to using Counter, since it is (or going to be) part of the standard library.
If the objects are hashable then you can use a dict to store the counts:
results = {}
for item in somelist:
if item not in results:
results[item] = 1
else
results[item] += 1
print max(results.iteritems(), key=operator.itemgetter(1))
Keep a dictionary of how often each object appears.
Walk through the list once, building this table. As you go, keep track of which object has appeared the most often so far.
This code is untested.
from collections import defaultdict
def mode(objects):
h = defaultdict(int)
max_f = 0
max_obj = None
for o in objects:
f = h[o] = h[o] + 1
if f > max_f:
max_f = f
max_obj = o
return max_obj
If the objects are not hashable, you can instead hash some unique feature of them, such as id(o)
.
You want an efficient method. Clearly it's possible in O(n) time, so any method that requires sorting the list is out as that would be O(n log(n)). It's not possible to do it faster than O(n) because even if you check the first n/2-1 elements, and they are all "apple", you don't know that the rest of the elements won't be bananas.
So given that we're looking for O(n), you must iterate over the list and keep a count of how many items of each type you've seen.
A defaultdict would be a simple way to implement this in practice.
>>> from collections import defaultdict
>>> d = defaultdict(int)
>>> for i in ['apple', 'banana', 'apple']:
... d[i] += 1
...
>>> d
defaultdict(<type 'int'>, {'apple': 2, 'banana': 1})
The best time you can hope to achieve here is O(n) - you'll always need to walk the entire array at least once. The easiest way will certainly be to build a histogram. If your dictionary structure (map of some kind) offers O(1) insert and retrieve then this is as easy as (groovy-ish pseudocode):
def histogram = new HashMap()
def maxObj = null
def maxObjCount = 0
objectList.each {
if(histogram.contains(it)) histogram.put(it, histogram.get(it)+1)
else histogram.put(it, 1)
if(histogram.get(it) > maxObjCount) {
maxObj = it
maxObjCount = histogram.get(it)
}
}
def count_reps(item, agg):
k = hash(item)
try:
agg[k] += 1
except KeyError:
agg[k] = 1
return agg
item_dict = reduce(your_array, {})
item_dict will contain the counts, then you can rate the popularity of each object.
Heres a different approach which essentially sorts the list and then processes it in a sorted order.
fruits = ['apple', 'orange', 'apple', 'banana', 'banana', 'orange', 'apple', 'apple']
max_fruit_count = 0
max_fruit = ''
current_fruit_count = 0
current_fruit = ''
for fruit in sorted(fruits) :
if fruit != current_fruit :
if current_fruit != max_fruit :
if current_fruit_count > max_fruit_count :
max_fruit = current_fruit
max_fruit_count = current_fruit_count
current_fruit = fruit
current_fruit_count = 1
else :
current_fruit_count += 1
if current_fruit_count > max_fruit_count :
max_fruit = current_fruit
max_fruit_count = current_fruit_count
print max_fruit, max_fruit_count
This is not O(n), but O(n^2), so it not may fit your bill as "most efficient way", but it's compact and avoids for
loops, which are rather slow in Python. It will be faster than the O(n) option up to 11 unique items.
def most_common(items):
s = set(items)
return max([(items.count(i), i) for i in s])[1]
As ~unutbu says: use collections.Counter Failing that, time your code. Here is my (likely inefficient) approach:
python -m timeit -s "fruit = ['apple']*4 + ['banana'] + ['orange']*2" \
"kL = set(fruit); L = [fruit.count(f) for f in kL]; D = dict(zip(kL,L)); \
sorted(D,key = lambda k: D[k],reverse=True)"
100000 loops, best of 3: 10.1 usec per loop
精彩评论