python csv: getting subset
here is a snapshot of my csv:
alex 123f 1
harry fwef 2
alex sef 3
alex gsdf 4
alex wf35 6
harry sdfsdf 3
i would like to ge开发者_JAVA百科t the subset of this data where the occurrence of anything in the first column (harry, alex) is at least 4. so i want the resulting data set to be:
alex 123f 1
alex sef 3
alex gsdf 4
alex wf35 6
Clearly, you cannot decide which rows are interesting until you've seen all rows (since the very last row might be the one turning some count from three to four and thereby making some previously seen rows interesting, for example;-). So, unless your CSV file is horribly huge, suck it all into memory, first, as a list...:
import csv
with open('thefile.csv', 'rb') as f:
data = list(csv.reader(f))
then, do the counting -- Python 2.7 has a better way, but assuming you're still on 2.6 like most of us...:
import collections
counter = collections.defaultdict(int)
for row in data:
counter[row[0]] += 1
and finally do the selection loop...:
for row in data:
if counter[row[0]] >= 4:
print row
Of course, this prints each interesting row as a roughly-hewed list (with square brackets and quotes around the items), but it will be easy to format it in any way you might prefer.
if Python is not a must
$ gawk '{b[$1]++;c[++d,$1]=$0}END{for(i in b){if(b[i]>=4){for(j=1;j<=d;j++){print c[j,i]}}}}' file
And yes, 70MB file is fine.
精彩评论