开发者

Sorting a list pairs by frequency of the pair elements

I'm completely new to Python and while trying various random bits and pieces I've struck upon a problem that I believe I've "solved", but the code doesn't feel right - I strongly suspect there is going to be a better way to get the desired result.

FYI - I'm using whatever the latest version of Python 3 is, on Windows.

Problem definition

Briefly, what I'm doing is sorting a list of pairs, such that the pairs containing the elements that appears in the fewest pairs are sorted to the front.

The pairs are in the form [i,j] with 0 <= i <= j < n, where n is a known maximum value for the elements. There are no duplicate pairs within the list.

The count of an element i is a simple count of the number of pairs (not pair elements) in the forms [i,j],[j,i] and [i,i] where j is any value that results in a valid pair.

In the sorted result, a pair [i,j] should appear before a pair [k,l] if count(i) < count(k) or count(i) == count(k) and count(j) < count(l) (If count(j) == count(l) the two can be in either order - I'm not bothered about the sort being stable, would be a bonus though).

In the sorted result, a pair [i,j] should appear before a pair [k,l] if

min(count(i),count(j)) &l开发者_开发问答t; min(count(k),count(l)) or

min(count(i),count(j)) == min(count(k),count(l)) and max(count(i),count(j)) < max(count(k),count(l)).

In otherwords, if the pair is [0,1] and 1 has a count of one, but 0 has a count of four hundred, the pair should still be at (or at least very near) the front of the list - they need sorting by the least frequent element in the pair.

Here's a contrived example I've built:

input   [[0,0],[1,2],[1,4],[2,2],[2,3],[3,3],[3,4]]

Here's the individual element counts and the source pairs they come from:

0: 1   [0,0]
1: 2   [1,2],[1,4]
2: 3   [1,2],[2,2],[2,3]
3: 3   [2,3],[3,3],[3,4]
4: 2   [1,4],[3,4]

And here's the result, along with the pair scores:

output: [[0,0],[1,4],[1,2],[3,4],[2,2],[2,3],[3,3]]
scores:   1     1-2   1-3   2-3   3     3     3

Here, 0 has a count of one (it appears in one pair, albeit twice) so comes first. 1 has a count of two, so appears second - with [1,4] before [1,2] because 4 has a count of two and 2 has a count of three, et cetera.

My current solution

As said, I believe this implimentation works accurately, but it just feels that there must be a better way to go about doing this. Anyway, here's what I've got so far:

#my implementation uncommented to reduce post size, see history for comments
def sortPairList( data , n ):
    count = []
    for i in range(0,n):
        count.append( 0 )

    #count up the data
    for p in data:
        count[p[0]] += 1
        if p[1] != p[0]:
            count[p[1]] += 1

    maxcount = 0
    for i in range(0,n):
        if count[i] > maxcount:
            maxcount = count[i]

    def elementFrequency(p):
        if count[ p[0] ] < count[ p[1] ]:
            return count[ p[0] ] + float(count[ p[1] ]) / (maxcount+1)
        else:
            return count[ p[1] ] + float(count[ p[0] ]) / (maxcount+1)

    data.sort( key=elementFrequency )

Any suggestions on a more "Python" way of doing this?

Or anything that's wrong with my current attempt?

New Test Case (see answer's comments)

input:    [[0,0],[0,3],[0,5],[0,7],[1,1],[1,2],[1,8],[2,4],[2,5],[3,4],[3,5],[3,9],[4,4],[4,7],[4,8],[6,8],[7,7],[7,9],[8,9]]
expected: [[6,8],[1,1],[1,2],[2,5],[0,5],[1,8],[3,5],[3,9],[7,9],[8,9],[2,4],[0,0],[0,3],[0,7],[7,7],[3,4],[4,7],[4,8],[4,4]]


I would probably use a Counter (needs Python ≥2.7 or ≥3.1) for tallying.

from collections import Counter
from itertools import chain
def sortPairList2(data):
    tally = Counter(chain(*map(set, data)))
    data.sort(key=lambda x: sorted(tally[i] for i in x))

Note that:

  1. You can create an anonymous function with lambda. For example,

    >>> c = 4
    >>> a = lambda p: p - c
    >>> a(7)
    3
    
  2. The sort key need not be a number. Anything comparable can be used as the return value of the key function. In my code, a list is used for ordering.

  3. There are many simpler idioms in Python for your original code.

    • The count can be initialized using count = [0] * n instead of that loop.
    • The maxcount can be obtained with the max function. maxcount = max(count)
  4. List comprehension is used a lot in Python. If your target is to transform an iterable into another iterable, prefer comprehension over loops.


>>> n = 4
>>> freqs = {i: sum(i in j for j in inp) for i in range(n+1)}
>>> def key(x):
    a, b = x
    return min(freqs[a], freqs[b]), max(freqs[a], freqs[b])

>>> sorted(inp, key=key)

P.S. Please note that input is a bad name for a variable as it shadows built-in.


While KennyTM solution works I tried to do it myself.

My solution precomputes frequencies and stores it in dictionary where str(n) is key. I had some trouble with changing compare function known from Python2 to key used with Python3, but I found recipe at ActiveState code

item_cnt = {}

def icount(n):
    return item_cnt[str(n)]

def add_item(n):
    sn = str(n)
    try:
        item_cnt[sn] += 1
    except KeyError:
        item_cnt[sn] = 1

# sort callback
def cmp_items(ij, kl):
    i, j = ij
    k, l = kl
    if icount(i) < icount(k) or icount(i) == icount(k) and icount(j) < icount(l):
        return -1
    return 1

input = [[0,0],[1,2],[1,4],[2,2],[2,3],[3,3],[3,4]]
# count all items
for (i, j) in input:
    add_item(i)
    add_item(j)

# works with Python 2.x
#input.sort(cmp_items)
# works with Python2.6 and Python 3.x
# to convert compare function to key look at:
# http://code.activestate.com/recipes/576653-convert-a-cmp-function-to-a-key-function/
input.sort(key=cmp_to_key(cmp_items))
print(input)


Similar to KennyTM's solution, but for Python 2.5 or greater:

import collections

def sort_by_occurence(sequences):
    tally = collections.defaultdict(int)
    for sequence in sequences:
        for item in sequence:
            tally[item] += 1
    sequences.sort(key=lambda x:map(tally.get, x))


pair_list = [[0,0],[1,2],[1,4],[2,2],[2,3],[3,3],[3,4]]
sort_by_occurence(pair_list)
print pair_list
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜