How to count the number of items per user, grouped by item
How can I output a result like this:
user    I   R   H
=================
atl001  2   1   0
cms017  1   2   1
lhc003  0   1   2
from a list like this:
atl001 I
atl001 I
cms017 H
atl001 R
lhc003 H
cms017 R
cms017 I
lhc003 H
lhc003 R
cms017 R
i.e. I want to calculate the number of I, H and R per user. Just a note that I can't use groupby from itertools in this particular case. Thanks in advance for your help. Cheers!!
data='''atl001 I
atl001 I
cms017 H
atl001 R
lhc003 H
cms017 R
cms017 I
lhc003 H
lhc003 R
cms017 R'''
stats={}
for i in data.split('\n'):
    user, irh = i.split()
    u = stats.setdefault(user, {})
    u[irh] = u.setdefault(irh, 0) + 1
print 'user I R H'
for user in sorted(stats):
    stat = stats[user]
    print user, stat.get('I', 0), stat.get('R', 0), stat.get('H', 0)
data = 112*'cms017 R\n'
data = data + '''atl001 I
cms017 R
atl001 I
cms017 H
atl001 R
lhcabc003 H
cms017 R
lhcabc003 H
lhcabc003 R
cms017 R
cms017 R
cms017 R'''
print data,'\n'
stats = {}
d = {'I':0,'R':1,'H':2}
L = 0
for line in data.splitlines():
    user,irh = line.split()
    stats.setdefault(user,[0,0,0])
    stats[user][d[irh]] += 1
    L = max(L, len(user))
LL = len(str(max(max(stats[user])
                 for user in stats )))
cale = ' %%%ds %%%ds %%%ds' % (LL,LL,LL)
ch = 'user'.ljust(L) + cale % ('I','R','H')
print '%s\n%s' % (ch, len(ch)*'=')
print '\n'.join(user.ljust(L) + cale % tuple(stats[user])
                for user in sorted(stats.keys()))
result
user        I   R   H
=====================
atl001      2   1   0
cms017      0 117   1
lhcabc003   0   1   2
Also:
data = 14*'cms017 R\n'
data = data + '''atl001 I
cms017 R
atl001 I
cms017 H
atl001 R
lhcabc003 H
cms017 R
lhcabc003 H
lhcabc003 R
cms017 R
cms017 R
cms017 R'''
print data,'\n'
Y = {}
L = 0
for line in data.splitlines():
    user,irh = line.split()
    L = max(L, len(user))
    if (user,irh) not in Y:
        Y.update({(user,'I'):0,(user,'R'):0,(user,'H'):0})
    Y[(user,irh)] += 1
LL = len(str(max(x for x in Y.itervalues())))
cale = '%%-%ds %%%ds %%%ds %%%ds' % (L,LL,LL,LL)
ch = cale % ('user','I','R','H')
print '%s\n%s' % (ch, len(ch)*'=')
li = sorted(Y.keys())
print '\n'.join(cale % (a[0],Y[b],Y[c],Y[a])
                for a,b,c in (li[x:x+3] for x in xrange(0,len(li),3)))
result
user       I  R  H
==================
atl001     2  1  0
cms017     0 19  1
lhcabc003  0  1  2
PS:
The user names are all left-justified to a width of L characters, where L is the length of the longest name.
To avoid the complexity of Sebastian's code, the I, R and H columns in my code are all right-justified to the same width LL, which is the number of digits in the largest count found in any of these columns.
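The width logic described in the PS can be sketched on its own. This is only an illustration written for Python 3 (the answers in this thread are Python 2); the function name format_table and the hard-coded stats dictionary are assumptions made for the example, not part of the original code:

```python
# Hypothetical counts per user in I, R, H order; in the code above they
# come from parsing the input lines.
stats = {'atl001': [2, 1, 0], 'cms017': [0, 117, 1], 'lhcabc003': [0, 1, 2]}

def format_table(stats, flags=('I', 'R', 'H')):
    """Return the table as a list of text lines (illustrative helper)."""
    # L in the answer: width of the longest user name
    name_width = max(len(user) for user in stats)
    # LL in the answer: number of digits of the largest count
    count_width = len(str(max(max(row) for row in stats.values())))
    header = 'user'.ljust(name_width) + ''.join(
        ' ' + flag.rjust(count_width) for flag in flags)
    lines = [header, '=' * len(header)]
    for user in sorted(stats):
        lines.append(user.ljust(name_width) + ''.join(
            ' ' + str(n).rjust(count_width) for n in stats[user]))
    return lines

print('\n'.join(format_table(stats)))
```

Using one width for all three count columns keeps the format string simple; the cost is a little wasted space when only one column holds large numbers.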
Well, using groupby for this problem makes no sense anyway. For starters, your data isn't sorted (groupby doesn't sort the groups for you), and the lines are very simple.
Just keep count as you process each line. I am assuming you don't know what flags you'll get:
from sets import Set as set # python2.3 compatibility
counts = {} # counts stored in user -> dict(flag=counter) nested dicts
flags = set()
for line in inputfile:
    user, flag = line.strip().split()
    usercounts = counts.setdefault(user, {})
    usercounts[flag] = usercounts.setdefault(flag, 0) + 1
    flags.add(flag)
Printing the info after that is a question of iterating over your counts structure. I am assuming usernames are always 6 characters long:
flags = list(flags)
flags.sort()
users = counts.keys()
users.sort()
print "user %s" % (' '.join(flags))
print "=" * (6 + 3 * len(flags))
for user in users:
    line = [user]
    for flag in flags:
        # str() is needed because ' '.join() only accepts strings
        line.append(str(counts[user].get(flag, 0)))
    print ' '.join(line)
All code above is untested, but should roughly work.
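To illustrate the remark about groupby not sorting the groups for you, here is a small sketch in Python 3 (an assumption; the code above targets Python 2.3). The four sample lines are a shortened, hypothetical version of the question's input:

```python
from itertools import groupby

lines = ['atl001 I', 'cms017 H', 'atl001 R', 'cms017 R']
pairs = [line.split() for line in lines]

# Unsorted input: every change of user starts a new group,
# so the same user shows up more than once.
unsorted_groups = [user for user, _ in groupby(pairs, key=lambda p: p[0])]

# Sorted input: exactly one group per user.
sorted_groups = [user for user, _ in groupby(sorted(pairs), key=lambda p: p[0])]

print(unsorted_groups)  # ['atl001', 'cms017', 'atl001', 'cms017']
print(sorted_groups)    # ['atl001', 'cms017']
```

So groupby would need an extra sort first, whereas counting into a dictionary works in a single pass over the lines in any order.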
Here's a variant that uses nested dicts to count job statuses and computes max field widths before printing:
#!/usr/bin/env python
import fileinput
from sets import Set as set # python2.3
# parse job statuses
counter = {}
for line in fileinput.input():
    user, jobstatus = line.split()
    d = counter.setdefault(user, {})
    d[jobstatus] = d.setdefault(jobstatus, 0) + 1
# print job statuses
# . find field widths
status_names = set([name for st in counter.itervalues() for name in st])
maxstatuslens = [max([len(str(i)) for st in counter.itervalues()
                      for n, i in st.iteritems()
                      if name == n])
                 for name in status_names]
maxuserlen = max(map(len, counter))
row_format = (("%%-%ds " % maxuserlen) +
              " ".join(["%%%ds" % n for n in maxstatuslens]))
# . print header
header = row_format % (("user",) + tuple(status_names))
print header
print '='*len(header)
# . print rows
for user, statuses in counter.iteritems():
    print row_format % (
        (user,) + tuple([statuses.get(name, 0) for name in status_names]))
Example
$ python print-statuses.py <input.txt
user   I H R
============
lhc003 0 2 1
cms017 1 1 2
atl001 2 0 1
Here's a variant that uses a flat dictionary with a tuple (user, status_name) as the key:
#!/usr/bin/env python
import fileinput
from sets import Set as set # python 2.3
# parse job statuses
counter = {}
maxstatuslens = {}
maxuserlen = 0
for line in fileinput.input():
    key = user, status_name = tuple(line.split())
    i = counter[key] = counter.setdefault(key, 0) + 1
    maxstatuslens[status_name] = max(maxstatuslens.setdefault(status_name, 0),
                                     len(str(i)))
    maxuserlen = max(maxuserlen, len(user))
# print job statuses
row_format = (("%%-%ds " % maxuserlen) +
              " ".join(["%%%ds" % n for n in maxstatuslens.itervalues()]))
# . print header
header = row_format % (("user",) + tuple(maxstatuslens))
print header
print '='*len(header)
# . print rows
for user in set([k[0] for k in counter]):
    print row_format % ((user,) +
        tuple([counter.get((user, status), 0) for status in maxstatuslens]))
The usage and output are the same.
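For readers on a newer Python, the flat-dictionary idea might look roughly like this with collections.Counter. This is a sketch assuming Python 3, not the code above; count_pairs is a made-up helper name and the sample text is the list from the question:

```python
from collections import Counter

def count_pairs(lines):
    """Count (user, status) pairs; count_pairs is an illustrative name."""
    counter = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 2:      # skip blank or malformed lines
            continue
        counter[tuple(parts)] += 1
    return counter

sample = """atl001 I
atl001 I
cms017 H
atl001 R
lhc003 H
cms017 R
cms017 I
lhc003 H
lhc003 R
cms017 R"""

counter = count_pairs(sample.splitlines())
users = sorted({user for user, _ in counter})
statuses = sorted({status for _, status in counter})

print('user', *statuses)
for user in users:
    # A Counter returns 0 for missing keys, so no .get() is needed.
    print(user, *[counter[(user, status)] for status in statuses])
```

The flat key avoids the inner dictionaries, but the list of users and statuses has to be rebuilt from the keys before printing.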
As a hint:
Use a nested dictionary structure for counting the occurrences:
user -> character -> occurrences of the character for that user
Writing the parser code, incrementing the counters and printing the result is up to you... a good exercise.
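If you want to check your own attempt against something, one possible shape of that nested structure is sketched below. It assumes Python 3 and uses defaultdict and Counter from the standard library; count_flags and the sample list are names invented for this example:

```python
from collections import Counter, defaultdict

def count_flags(lines):
    """Build user -> Counter(flag -> occurrences); illustrative helper."""
    counts = defaultdict(Counter)
    for line in lines:
        if not line.strip():     # ignore empty lines
            continue
        user, flag = line.split()
        counts[user][flag] += 1
    return counts

sample = ['atl001 I', 'atl001 I', 'cms017 H', 'atl001 R', 'lhc003 H',
          'cms017 R', 'cms017 I', 'lhc003 H', 'lhc003 R', 'cms017 R']

counts = count_flags(sample)
# One row per user in I, R, H order; missing flags count as 0.
rows = [(user, counts[user]['I'], counts[user]['R'], counts[user]['H'])
        for user in sorted(counts)]
for row in rows:
    print(*row)
```

The defaultdict creates the inner Counter the first time a user is seen, which replaces the setdefault calls used in the older answers.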