Need more efficient way to parse out csv file in Python

2023-03-16 05:36 Q&A author:

Here's a sample csv file

id, serial_no
2, 500
2, 501
2, 502
3, 600
3, 601

This is the output I'm looking for (a list of serial_no values within a list of ids):

[2, [500,501,502]]
[3, [600, 601]]

I have implemented my own solution, but it's too much code and I'm sure there are better solutions out there. I'm still learning Python and don't know all the tricks yet.

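import csv

file = 'test.csv'

data = csv.reader(open(file))
fields = next(data)  # skip the header row

# copy the two columns of every row into zipped_data
for row in data:
    each_row = []
    each_row.append(row[0])
    each_row.append(row[1])
    zipped_data.append(each_row)

# collect the unique ids, in order of first appearance
for rec in zipped_data:
    if rec[0] not in ids:
        ids.append(rec[0])

# for each id, gather its serial numbers and print the pair
for id in ids:
    for rec in zipped_data:
        if rec[0] == id:
            ser_no.append(rec[1])
    tmp.append(id)
    tmp.append(ser_no)
    print(tmp)
    tmp = []
    ser_no = []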

Note: I've omitted the variable initialization to keep the code short (zipped_data, ids, ser_no, and tmp all start out as empty lists).

Printing tmp gives me the output I mentioned above. I know there's a better, more Pythonic way to do this. It's just too messy! Any suggestions would be great!

import csv
from collections import defaultdict

records = defaultdict(list)

file = 'test.csv'

data = csv.reader(open(file))
fields = next(data)  # skip the header row

# group the serial numbers under their id
for row in data:
    records[row[0]].append(row[1])

# sort by id, since the dict itself is not ordered by key
results = sorted(records.items(), key=lambda x: x[0])
print(results)

If the list of serial_nos needs to be unique, just replace defaultdict(list) with defaultdict(set) and records[row[0]].append(row[1]) with records[row[0]].add(row[1]).
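A minimal Python 3 sketch of the set variant just described. The data is read from an in-memory string (sample, a stand-in for test.csv with one duplicated row added), and skipinitialspace=True is an extra assumption so that the values come out without the space that follows each comma:

```python
import csv
import io
from collections import defaultdict

# Stand-in for test.csv (assumption: same layout as the sample file,
# plus one duplicated row to show the effect of using a set).
sample = """id, serial_no
2, 500
2, 501
2, 501
3, 600
"""

records = defaultdict(set)

# skipinitialspace=True drops the space that follows each comma.
reader = csv.reader(io.StringIO(sample), skipinitialspace=True)
next(reader)  # skip the header row

for row_id, serial_no in reader:
    records[row_id].add(serial_no)

# Sets are unordered, so sort both the ids and the serial numbers
# to get a deterministic result.
results = [[row_id, sorted(records[row_id])] for row_id in sorted(records)]
print(results)
```

The ids and serial numbers stay strings here, so they sort lexicographically; convert them with int() first if numeric order matters.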

Instead of a list, I'd make it a collections.defaultdict(list), and then just call the append() method on the value.

import collections

result = collections.defaultdict(list)
for row in data:  # data is the csv.reader from the question
    result[row[0]].append(row[1])

Here's a version I wrote, though it looks like there are plenty of answers for this one already.

You might like using csv.DictReader, which gives you easy access to each column by field name (taken from the header, i.e. the first line).

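#!/usr/bin/env python3
import csv

# newline='' is what the csv module expects for files opened in text mode
myFile = open('sample.csv', newline='')
# first row will be used for field names (by default);
# skipinitialspace=True drops the space after each comma, so the second
# field is named 'serial_no' rather than ' serial_no'
csvFile = csv.DictReader(myFile, skipinitialspace=True)

myData = {}

for myRow in csvFile:
    myId = myRow['id']
    if myId not in myData:
        myData[myId] = []
    myData[myId].append(myRow['serial_no'])

for myId in sorted(myData):
    print('%s %s' % (myId, myData[myId]))

myFile.close()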

Some observations:

0) In Python 2, file is a built-in (a synonym for open), so it's a poor choice of name for a variable. Python 3 removed that built-in, but the name is still misleading: the variable actually holds a file name, so...

1) The file can be closed as soon as we're done reading from it. The easiest way to accomplish that is with a with block.

2) The first loop appears to go over all the rows, grab the first two elements from each, and make a list with those results. However, your rows already all contain only two elements, so this has no net effect. The CSV reader is already an iterator over rows, and the simple way to create a list from an iterator is to pass it to the list constructor.

3) You proceed to make a list of unique ID values, by manually checking. A list of unique things is better known as a set, and the Python set automatically ensures uniqueness.

4) You have the name zipped_data for your data. This is telling: applying zip to the list of rows would produce a list of columns - and the IDs are simply the first column, transformed into a set.

5) We can use a list comprehension to build the list of serial numbers for a given ID. Don't tell Python how to make a list; tell it what you want in it.

6) Printing the results as we get them is messy and inflexible; it's better to build the entire result first. Then we have code that creates the data, and we can do something with it other than just printing it and forgetting it.
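Observations 3 and 4 can be tried out on their own. A small Python 3 sketch, using a hard-coded rows list (an assumption, standing in for the parsed CSV rows):

```python
# Hypothetical rows, as csv.reader would return them for the sample file
# (values are strings; the leading space is left out here for clarity).
rows = [['2', '500'], ['2', '501'], ['2', '502'], ['3', '600'], ['3', '601']]

# zip(*rows) transposes the list of rows into a sequence of columns.
id_column, serial_column = zip(*rows)

# A set keeps only the unique ids; sorted() gives them a stable order.
unique_ids = sorted(set(id_column))
print(unique_ids)
```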

Applying these ideas, we get:

import csv

filename = 'test.csv'

with open(filename, newline='') as in_file:
    data = csv.reader(in_file)
    next(data) # ignore the field labels
    rows = list(data) # read the rest of the rows from the iterator

print([
    # We want a list of all serial numbers from rows with a matching ID...
    [serial_no for row_id, serial_no in rows if row_id == id]
    # for each of the IDs that there is to match, which come from making
    # a set from the first column of the data.
    for id in set(list(zip(*rows))[0])
])
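Note that the comprehension above yields only the lists of serial numbers. If each ID should appear next to its serial numbers, as in the output requested in the question, one possible Python 3 variant looks like this. It is a sketch: rows is hard-coded here in place of the file contents, the name wanted is made up for illustration, and the IDs are sorted because a set has no defined order.

```python
# Hypothetical rows standing in for the data read from test.csv.
rows = [['2', '500'], ['2', '501'], ['2', '502'], ['3', '600'], ['3', '601']]

# The unique ids come from the first column; sort them for a stable order.
ids = sorted({row_id for row_id, _ in rows})

# Pair each id with the serial numbers of the rows that carry it.
result = [
    [wanted, [serial_no for row_id, serial_no in rows if row_id == wanted]]
    for wanted in ids
]

for entry in result:
    print(entry)
```

This rescans rows once per ID, which is fine for small files; the defaultdict approach suggested earlier in the thread needs only a single pass.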

We can probably do even better than this by using the groupby function from the itertools module.

Here's an example using itertools.groupby. This only works if the rows are already grouped by id.

from csv import DictReader
from itertools import groupby
from operator import itemgetter

filename = 'test.csv'

# the context manager ensures that infile is closed when the with block exits
with open(filename, newline='') as infile:

    # group by id - this requires that the rows are already grouped by id
    groups = groupby(DictReader(infile), key=itemgetter('id'))

    # loop through the groups printing a list for each one
    for i, j in groups:
        print([i, [row[' serial_no'] for row in j]])

Note the space in front of ' serial_no'. This is because of the space after the comma in the input file.
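If the rows are not already grouped by id, one option is to sort them by id before calling groupby. A minimal Python 3 sketch, assuming an in-memory sample (with deliberately shuffled rows) in place of test.csv, and using skipinitialspace=True so that the header is read as 'serial_no' without the leading space:

```python
import csv
import io
from itertools import groupby
from operator import itemgetter

# Stand-in for test.csv with the rows deliberately out of order.
sample = """id, serial_no
3, 600
2, 500
3, 601
2, 501
2, 502
"""

reader = csv.DictReader(io.StringIO(sample), skipinitialspace=True)

# groupby only merges adjacent rows, so sort by id first. sorted() is
# stable, so serial numbers keep their file order within each id.
rows = sorted(reader, key=itemgetter('id'))

result = [
    [row_id, [row['serial_no'] for row in group]]
    for row_id, group in groupby(rows, key=itemgetter('id'))
]
print(result)
```

Sorting reads the whole file into memory, so this gives up the streaming behavior of the plain groupby version. The ids are also compared as strings, so '10' sorts before '2'; use key=lambda row: int(row['id']) for both calls if numeric order matters.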
