In Python, how do I parse a file into lists based on a specific value?
I have a large tab-delimited text file; for example, call it john_file.txt:
1 john1 23 54 54
2 john2 34 45 66
3 john3 35 43 54
4 john2 34 54 78
5 john1 12 34 65
6 john3 34 55 66
What's a quick way to parse this file into 3 lists based on the name (john1, john2, or john3)?
fh=open('john_file.txt','r').readlines()
john1_list=[]
for i in fh:
    if i.split('\t')[1] == "john1":
        john1_list.append(i)
Thanks in advance
from collections import defaultdict
d = defaultdict(list)
with open('john_file.txt') as f:
    for line in f:
        fields = line.split('\t')
        d[fields[1]].append(line)
The individual lists are then in d['john1'], d['john2'], etc.
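For reference, here is a minimal, self-contained sketch of the same grouping idea in Python 3. It swaps the manual split for the standard csv module with a tab delimiter and stores parsed rows rather than raw lines. SAMPLE and group_by_name are hypothetical names used only for illustration, and the in-memory string stands in for john_file.txt.

```python
import csv
import io
from collections import defaultdict

# Hypothetical in-memory stand-in for the tab-delimited john_file.txt.
SAMPLE = (
    "1\tjohn1\t23\t54\t54\n"
    "2\tjohn2\t34\t45\t66\n"
    "3\tjohn3\t35\t43\t54\n"
    "4\tjohn2\t34\t54\t78\n"
    "5\tjohn1\t12\t34\t65\n"
    "6\tjohn3\t34\t55\t66\n"
)


def group_by_name(handle):
    """Group parsed rows by the value in the second column."""
    groups = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) > 1:  # skip blank or too-short lines
            groups[row[1]].append(row)
    return groups


groups = group_by_name(io.StringIO(SAMPLE))
john1_list = groups["john1"]  # rows whose second column is 'john1'
```

With a real file, the same function should work on an open file object, for example group_by_name(open('john_file.txt', newline='')).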
>>> from collections import defaultdict
>>> a = defaultdict(list)
>>> for line in '''1 john1 23 54 54
... 2 john2 34 45 66
... 3 john3 35 43 54
... 4 john2 34 54 78
... 5 john1 12 34 65
... 6 john3 34 55 66
... '''.split('\n'):
...     data = filter(None, line.split())
...     if data:
...         a[data[1]].append(data)
...
>>> data
[]
>>> a
defaultdict(<type 'list'>, {'john1': [['1', 'john1', '23', '54', '54'], ['5', 'john1', '12', '34', '65']], 'john2': [['2', 'john2', '34', '45', '66'], ['4', 'john2', '34', '54', '78']], 'john3': [['3', 'john3', '35', '43', '54'], ['6', 'john3', '34', '55', '66']]})
You could do something like:
fh=open('john_file.txt','r').readlines()
john_lists={}
for i in fh:
    j=i.split('\t')[1]
    if j not in john_lists:
        john_lists[j]=[]
    john_lists[j].append(i)
This has the advantage of not depending on knowing in advance the possible values in the second column.
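As a rough illustration of that point, the sketch below builds the groups with a plain dict and dict.setdefault, then lists whatever names turned up in the second column. The lines list is a hypothetical stand-in for the result of readlines() on the file.

```python
# Hypothetical stand-in for open('john_file.txt').readlines()
lines = [
    "1\tjohn1\t23\t54\t54\n",
    "2\tjohn2\t34\t45\t66\n",
    "3\tjohn3\t35\t43\t54\n",
    "4\tjohn2\t34\t54\t78\n",
    "5\tjohn1\t12\t34\t65\n",
    "6\tjohn3\t34\t55\t66\n",
]

john_lists = {}
for line in lines:
    name = line.split("\t")[1]
    # setdefault creates the empty list the first time a name is seen
    john_lists.setdefault(name, []).append(line)

# The names are discovered from the data, not hard-coded
names = sorted(john_lists)
```

Here names holds every distinct value found in the second column, so a new name in the file needs no code change.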
As others point out, you can also use a defaultdict to do this:
from collections import defaultdict
fh=open('john_file.txt','r').readlines()
john_lists=defaultdict(list)
for i in fh:
    j=i.split('\t')[1]
    john_lists[j].append(i)
littletable makes this kind of simple slicing and dicing easy, making a list of objects accessible/queryable/pivotable by attribute, like a mini-in-memory database, but with even less overhead than SQLite.
from collections import namedtuple
from littletable import Table
data = """\
1 john1 23 54 54
2 john2 34 45 66
3 john3 35 43 54
4 john2 34 54 78
5 john1 12 34 65
6 john3 34 55 66"""
Record = namedtuple("Record", "id name length width height")
def makeRecord(s):
    s = s.strip().split()
    # convert all but name to ints, and build a Record instance
    return Record(*(ss if i == 1 else int(ss) for i,ss in enumerate(s)))
# create a table and load it up
# (if this were CSV data, would be even simpler)
t = Table("data")
t.create_index("id", unique=True)
t.create_index("name")
t.insert_many(map(makeRecord, data.splitlines()))
# get a record by unique key
# (unique indexes return just the single record)
print t.id[4]
print
# get all records matching an indexed value
# (non-unique index retrievals return a new Table)
for d in t.name['john1']:
    print d
print
# dump summary pivot tables
t.pivot('name').dump_counts()
print
t.create_index('length')
t.pivot('name length').dump_counts()
Prints:
Record(id=4, name='john2', length=34, width=54, height=78)
Record(id=1, name='john1', length=23, width=54, height=54)
Record(id=5, name='john1', length=12, width=34, height=65)
Pivot: name
john1 2
john2 2
john3 2
Pivot: name,length
         12  23  34  35  Total
john1     1   1   0   0      2
john2     0   0   2   0      2
john3     0   0   1   1      2
Total     1   1   3   1      6