Using Groupby to Group a Column in an Access Table in Python
I have been playing with the groupby function from the itertools module for a while now (like days)
for k, g in groupby(data, keyfunc):
I am really having trouble understanding the syntax. I've tried a variety of different things, but I really don't know what to put in for "data" and "keyfunc" and get it to return the groups of data I want in a print statement.
What am I doing is looping through rows in an access table.
I set a variable to a cursor search (which is sorted) in the access table and pull out the column I want.
for row in cursor:
print row.ROAD_TYPE
This returns:
TRUNK ROAD
TRUNK ROAD
TRUNK ROAD
TRUNK ROAD
COLLECTOR HIGHWAY
COLLECTOR HIGHWAY
ACCESS ROAD
ACCESS ROAD
ACCESS ROAD
ACCESS ROAD
ACCESS ROAD
ACCESS ROAD
ACCESS ROAD
MAJOR ROAD
MAJOR HIGHWAY
I want to group these values together and have it return the string value for me so it prints something like this:
TRUNK ROAD
TRUNK ROAD
TRUNK ROAD
TRUNK ROAD
COLLECTOR HIGHWAY
COLLECTOR HIGHWAY
ACCESS ROAD
ACCESS ROAD
ACCESS ROAD
ACCESS ROAD
ACCESS ROAD
ACCESS ROAD
ACCESS ROAD
MAJOR ROAD
MAJOR HIGHWAY
Finally, I want to group a second column based on these new groups so that I get something like this:
TRUNK ROAD M1
TRUNK ROAD M1
TRUNK ROAD M2
TRUNK ROAD M3
COLLECTOR HIGHWAY M1
COLLECTOR HIGHWAY M2
ACCESS ROAD M1
ACCESS ROAD M1
ACCESS ROAD M3
ACCESS ROAD M3
ACCESS ROAD M7
ACCESS ROAD M7
ACCESS ROAD M8
MAJOR ROAD M8
MAJOR HIGHWAY M8
I know this is probably way less difficult than I've made it out for myself, I feel there's a simple answer, but I'm completely stumped and I can't seem to find an example on the internet that explains the groupby syntax in 开发者_运维百科a way that I understand. Please feel f
import itertools as it
for key, group in it.groupby(cursor, lambda row: row.ROAD_TYPE):
for sec_col,pairs in it.groupby(group, lambda row: row.SECOND_COLUMN):
for row in pairs:
print('{t} {s}'.format(t=row.ROAD_TYPE,s=row.SECOND_COLUMN))
print
print
Here are two examples to help grok groupby:
[list(g) for k, g in it.groupby('AAAABBBCCD')]
# [['A', 'A', 'A', 'A'], ['B', 'B', 'B'], ['C', 'C'], ['D']]
Above, all items which are the same, are grouped together.
Now we add a key function, keyfunc
:
keyfunc=lambda x: x//3
data=range(13)
[list(g) for k,g in it.groupby(data,keyfunc)]
# [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11], [12]]
[k for k,g in it.groupby(data,keyfunc)]
# [0, 1, 2, 3, 4]
Now instead of grouping by the items in data, we group according to keyfunc(x)
for each x
in data
.
The idea behind itertools.groupby
is to solve the basic problem of: "I want to iterate through a sequence, analyze each thing in it, and depending on the analysis process each thing a different way". groupby
separates the sequence into groups, but preserving the order of the original sequence.
from itertools import groupby
i = groupby(xrange(12), lambda v: v / 3)
results = [(x[0], list(x[1])) for x in i]
print str(results)
The above prints: [(0, [0, 1, 2]), (1, [3, 4, 5]), (2, [6, 7, 8]), (3, [9, 10, 11])]
You will notice that in order to get the results
list I had to do some extra processing. What groupby
actually gives you is meant for iteration:
i = groupby(xrange(12), lambda v: v / 3)
print str(i.next())
The above prints: (0, <itertools._grouper object at 0x020BF3D0>)
So the idea is that in your program you will say key, valueiter = i.next()
, test the key
, and then pass valueiter
to the correct processing function/method.
So what's the value here? Well, you don't have to continue to test the key within your processing functions to tell when to stop. groupby
will stop each "sub-iterator" for you automatically.
What groupby
does not do, unlike the GROUP BY
clause in SQL, is ensure that everything with the same key value from the original iterable is processed at once. Key values can repeat out of groupby
. This is justified because the objective is to preserve the order of the original sequence, and not to accumulate everything in memory. For example:
i = groupby(xrange(6), lambda v: v % 3) # note: modulo division
results = [(x[0], list(x[1])) for x in i]
print str(results)
The above prints: [(0, [0]), (1, [1]), (2, [2]), (0, [3]), (1, [4]), (2, [5])]
. The key values repeat, and each sub-iterator produces only a single datum. This is the worst-case scenario, in terms of performance, for groupby
, and it means that you have to follow a certain model in working with this tool.
So something like this:
i = groupby(xrange(12), lambda v: v / 3)
results = dict([(x[0], list(x[1])) for x in i]) # beware of dict() here!
print str(results)
is only correct if you know a priori that your key values will never repeat.
Your two groupby
examples are done the exact same way, except for your choice of key
from itertools import groupby
from operator import attrgetter
for key, rows in groupby(cursor, key=attrgetter('ROAD_TYPE')):
for row in rows:
print row.ROAD_TYPE
print
for key, rows in groupby(cursor, key=attrgetter('ROAD_TYPE', 'OTHER_COLUMN')):
for row in rows:
print row.ROAD_TYPE, row.OTHER_COLUMN
print
In both cases key
will be the result of attrgetter(), but you won't actually need it, as you will be iterating the rows grouped by that key. Of course, this all works correctly, if cursor
is sorted by the same key that you group by.
精彩评论