开发者

Using Groupby to Group a Column in an Access Table in Python

I have been playing with the groupby function from the itertools module for a while now (like days)

for k, g in groupby(data, keyfunc):

I am really having trouble understanding the syntax. I've tried a variety of different things, but I really don't know what to put in for "data" and "keyfunc" and get it to return the groups of data I want in a print statement.

What am I doing is looping through rows in an access table.

I set a variable to a cursor search (which is sorted) in the access table and pull out the column I want.

for row in cursor:
    print row.ROAD_TYPE

This returns:

TRUNK ROAD
TRUNK ROAD
TRUNK ROAD
TRUNK ROAD
COLLECTOR HIGHWAY
COLLECTOR HIGHWAY
ACCESS ROAD
ACCESS ROAD
ACCESS ROAD
ACCESS ROAD
ACCESS ROAD
ACCESS ROAD
ACCESS ROAD
MAJOR ROAD
MAJOR HIGHWAY

I want to group these values together and have it return the string value for me so it prints something like this:

TRUNK ROAD
TRUNK ROAD
TRUNK ROAD
TRUNK ROAD

COLLECTOR HIGHWAY
COLLECTOR HIGHWAY

ACCESS ROAD
ACCESS ROAD
ACCESS ROAD
ACCESS ROAD
ACCESS ROAD
ACCESS ROAD
ACCESS ROAD

MAJOR ROAD

MAJOR HIGHWAY

Finally, I want to group a second column based on these new groups so that I get something like this:

TRUNK ROAD  M1
TRUNK ROAD  M1

TRUNK ROAD  M2

TRUNK ROAD  M3


COLLECTOR HIGHWAY  M1

COLLECTOR HIGHWAY  M2


ACCESS ROAD  M1
ACCESS ROAD  M1

ACCESS ROAD  M3
ACCESS ROAD  M3

ACCESS ROAD  M7
ACCESS ROAD  M7

ACCESS ROAD  M8


MAJOR ROAD  M8


MAJOR HIGHWAY  M8

I know this is probably way less difficult than I've made it out for myself, I feel there's a simple answer, but I'm completely stumped and I can't seem to find an example on the internet that explains the groupby syntax in 开发者_运维百科a way that I understand. Please feel f


import itertools as it
for key, group in it.groupby(cursor, lambda row: row.ROAD_TYPE):
    for sec_col,pairs in it.groupby(group, lambda row: row.SECOND_COLUMN):
        for row in pairs:
            print('{t}  {s}'.format(t=row.ROAD_TYPE,s=row.SECOND_COLUMN))
        print
    print

Here are two examples to help grok groupby:

[list(g) for k, g in it.groupby('AAAABBBCCD')]
# [['A', 'A', 'A', 'A'], ['B', 'B', 'B'], ['C', 'C'], ['D']]

Above, all items which are the same, are grouped together.

Now we add a key function, keyfunc:

keyfunc=lambda x: x//3
data=range(13)
[list(g) for k,g in it.groupby(data,keyfunc)]
# [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11], [12]]
[k for k,g in it.groupby(data,keyfunc)]
# [0, 1, 2, 3, 4]

Now instead of grouping by the items in data, we group according to keyfunc(x) for each x in data.


The idea behind itertools.groupby is to solve the basic problem of: "I want to iterate through a sequence, analyze each thing in it, and depending on the analysis process each thing a different way". groupby separates the sequence into groups, but preserving the order of the original sequence.

from itertools import groupby
i = groupby(xrange(12), lambda v: v / 3)
results = [(x[0], list(x[1])) for x in i]
print str(results)

The above prints: [(0, [0, 1, 2]), (1, [3, 4, 5]), (2, [6, 7, 8]), (3, [9, 10, 11])]

You will notice that in order to get the results list I had to do some extra processing. What groupby actually gives you is meant for iteration:

i = groupby(xrange(12), lambda v: v / 3)
print str(i.next())

The above prints: (0, <itertools._grouper object at 0x020BF3D0>)

So the idea is that in your program you will say key, valueiter = i.next(), test the key, and then pass valueiter to the correct processing function/method.

So what's the value here? Well, you don't have to continue to test the key within your processing functions to tell when to stop. groupby will stop each "sub-iterator" for you automatically.

What groupby does not do, unlike the GROUP BY clause in SQL, is ensure that everything with the same key value from the original iterable is processed at once. Key values can repeat out of groupby. This is justified because the objective is to preserve the order of the original sequence, and not to accumulate everything in memory. For example:

i = groupby(xrange(6), lambda v: v % 3)  # note: modulo division
results = [(x[0], list(x[1])) for x in i]
print str(results)

The above prints: [(0, [0]), (1, [1]), (2, [2]), (0, [3]), (1, [4]), (2, [5])]. The key values repeat, and each sub-iterator produces only a single datum. This is the worst-case scenario, in terms of performance, for groupby, and it means that you have to follow a certain model in working with this tool.

So something like this:

i = groupby(xrange(12), lambda v: v / 3)
results = dict([(x[0], list(x[1])) for x in i])  # beware of dict() here!
print str(results)

is only correct if you know a priori that your key values will never repeat.


Your two groupby examples are done the exact same way, except for your choice of key

from itertools import groupby
from operator import attrgetter

for key, rows in groupby(cursor, key=attrgetter('ROAD_TYPE')):
    for row in rows:
        print row.ROAD_TYPE
    print

for key, rows in groupby(cursor, key=attrgetter('ROAD_TYPE', 'OTHER_COLUMN')):
    for row in rows:
        print row.ROAD_TYPE, row.OTHER_COLUMN
    print

In both cases key will be the result of attrgetter(), but you won't actually need it, as you will be iterating the rows grouped by that key. Of course, this all works correctly, if cursor is sorted by the same key that you group by.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜