Python: How to loop through blocks of lines
How to go through blocks of lines separated by an empty line? The file looks like the following:
ID: 1
Name: X
FamilyN: Y
Age: 20
ID: 2
Name: H
FamilyN: F
Age: 23
ID: 3
Name: S
FamilyN: Y
Age: 13
开发者_JS百科
ID: 4
Name: M
FamilyN: Z
Age: 25
I want to loop through the blocks and grab the fields Name, Family name and Age in a list of 3 columns:
Y X 20
F H 23
Y S 13
Z M 25
Here's another way, using itertools.groupby.
The function groupy
iterates through lines of the file and calls isa_group_separator(line)
for each line
. isa_group_separator
returns either True or False (called the key
), and itertools.groupby
then groups all the consecutive lines that yielded the same True or False result.
This is a very convenient way to collect lines into groups.
import itertools
def isa_group_separator(line):
return line=='\n'
with open('data_file') as f:
for key,group in itertools.groupby(f,isa_group_separator):
# print(key,list(group)) # uncomment to see what itertools.groupby does.
if not key: # however, this will make the rest of the code not work
data={} # as it exhausts the `group` iterator
for item in group:
field,value=item.split(':')
value=value.strip()
data[field]=value
print('{FamilyN} {Name} {Age}'.format(**data))
# Y X 20
# F H 23
# Y S 13
# Z M 25
Use a generator.
def blocks( iterable ):
accumulator= []
for line in iterable:
if start_pattern( line ):
if accumulator:
yield accumulator
accumulator= []
# elif other significant patterns
else:
accumulator.append( line )
if accumulator:
yield accumulator
import re
result = re.findall(
r"""(?mx) # multiline, verbose regex
^ID:.*\s* # Match ID: and anything else on that line
Name:\s*(.*)\s* # Match name, capture all characters on this line
FamilyN:\s*(.*)\s* # etc. for family name
Age:\s*(.*)$ # and age""",
subject)
Result will then be
[('X', 'Y', '20'), ('H', 'F', '23'), ('S', 'Y', '13'), ('M', 'Z', '25')]
which can be trivially changed into whatever string representation you want.
If your file is too large to read into memory all at once, you can still use a regular expressions based solution by using a memory mapped file, with the mmap module:
import sys
import re
import os
import mmap
block_expr = re.compile('ID:.*?\nAge: \d+', re.DOTALL)
filepath = sys.argv[1]
fp = open(filepath)
contents = mmap.mmap(fp.fileno(), os.stat(filepath).st_size, access=mmap.ACCESS_READ)
for block_match in block_expr.finditer(contents):
print block_match.group()
The mmap trick will provide a "pretend string" to make regular expressions work on the file without having to read it all into one large string. And the find_iter()
method of the regular expression object will yield matches without creating an entire list of all matches at once (which findall()
does).
I do think this solution is overkill for this use case however (still: it's a nice trick to know...)
import itertools
# Assuming input in file input.txt
data = open('input.txt').readlines()
records = (lines for valid, lines in itertools.groupby(data, lambda l : l != '\n') if valid)
output = [tuple(field.split(':')[1].strip() for field in itertools.islice(record, 1, None)) for record in records]
# You can change output to generator by
output = (tuple(field.split(':')[1].strip() for field in itertools.islice(record, 1, None)) for record in records)
# output = [('X', 'Y', '20'), ('H', 'F', '23'), ('S', 'Y', '13'), ('M', 'Z', '25')]
#You can iterate and change the order of elements in the way you want
# [(elem[1], elem[0], elem[2]) for elem in output] as required in your output
If file is not huge you can read whole file with:
content = f.open(filename).read()
then you can split content
to blocks using:
blocks = content.split('\n\n')
Now you can create function to parse block of text. I would use split('\n')
to get lines from block and split(':')
to get key and value, eventually with str.strip()
or some help of regular expressions.
Without checking if block has required data code can look like:
f = open('data.txt', 'r')
content = f.read()
f.close()
for block in content.split('\n\n'):
person = {}
for l in block.split('\n'):
k, v = l.split(': ')
person[k] = v
print('%s %s %s' % (person['FamilyN'], person['Name'], person['Age']))
This answer isn't necessarily better than what's already been posted, but as an illustration of how I approach problems like this it might be useful, especially if you're not used to working with Python's interactive interpreter.
I've started out knowing two things about this problem. First, I'm going to use itertools.groupby
to group the input into lists of data lines, one list for each individual data record. Second, I want to represent those records as dictionaries so that I can easily format the output.
One other thing that this shows is how using generators makes breaking a problem like this down into small parts easy.
>>> # first let's create some useful test data and put it into something
>>> # we can easily iterate over:
>>> data = """ID: 1
Name: X
FamilyN: Y
Age: 20
ID: 2
Name: H
FamilyN: F
Age: 23
ID: 3
Name: S
FamilyN: Y
Age: 13"""
>>> data = data.split("\n")
>>> # now we need a key function for itertools.groupby.
>>> # the key we'll be grouping by is, essentially, whether or not
>>> # the line is empty.
>>> # this will make groupby return groups whose key is True if we
>>> care about them.
>>> def is_data(line):
return True if line.strip() else False
>>> # make sure this really works
>>> "\n".join([line for line in data if is_data(line)])
'ID: 1\nName: X\nFamilyN: Y\nAge: 20\nID: 2\nName: H\nFamilyN: F\nAge: 23\nID: 3\nName: S\nFamilyN: Y\nAge: 13\nID: 4\nName: M\nFamilyN: Z\nAge: 25'
>>> # does groupby return what we expect?
>>> import itertools
>>> [list(value) for (key, value) in itertools.groupby(data, is_data) if key]
[['ID: 1', 'Name: X', 'FamilyN: Y', 'Age: 20'], ['ID: 2', 'Name: H', 'FamilyN: F', 'Age: 23'], ['ID: 3', 'Name: S', 'FamilyN: Y', 'Age: 13'], ['ID: 4', 'Name: M', 'FamilyN: Z', 'Age: 25']]
>>> # what we really want is for each item in the group to be a tuple
>>> # that's a key/value pair, so that we can easily create a dictionary
>>> # from each item.
>>> def make_key_value_pair(item):
items = item.split(":")
return (items[0].strip(), items[1].strip())
>>> make_key_value_pair("a: b")
('a', 'b')
>>> # let's test this:
>>> dict(make_key_value_pair(item) for item in ["a:1", "b:2", "c:3"])
{'a': '1', 'c': '3', 'b': '2'}
>>> # we could conceivably do all this in one line of code, but this
>>> # will be much more readable as a function:
>>> def get_data_as_dicts(data):
for (key, value) in itertools.groupby(data, is_data):
if key:
yield dict(make_key_value_pair(item) for item in value)
>>> list(get_data_as_dicts(data))
[{'FamilyN': 'Y', 'Age': '20', 'ID': '1', 'Name': 'X'}, {'FamilyN': 'F', 'Age': '23', 'ID': '2', 'Name': 'H'}, {'FamilyN': 'Y', 'Age': '13', 'ID': '3', 'Name': 'S'}, {'FamilyN': 'Z', 'Age': '25', 'ID': '4', 'Name': 'M'}]
>>> # now for an old trick: using a list of column names to drive the output.
>>> columns = ["Name", "FamilyN", "Age"]
>>> print "\n".join(" ".join(d[c] for c in columns) for d in get_data_as_dicts(data))
X Y 20
H F 23
S Y 13
M Z 25
>>> # okay, let's package this all into one function that takes a filename
>>> def get_formatted_data(filename):
with open(filename, "r") as f:
columns = ["Name", "FamilyN", "Age"]
for d in get_data_as_dicts(f):
yield " ".join(d[c] for c in columns)
>>> print "\n".join(get_formatted_data("c:\\temp\\test_data.txt"))
X Y 20
H F 23
S Y 13
M Z 25
Use a dict, namedtuple, or custom class to store each attribute as you come across it, then append the object to a list when you reach a blank line or EOF.
simple solution:
result = []
for record in content.split('\n\n'):
try:
id, name, familyn, age = map(lambda rec: rec.split(' ', 1)[1], record.split('\n'))
except ValueError:
pass
except IndexError:
pass
else:
result.append((familyn, name, age))
Along with the half-dozen other solutions I already see here, I'm a bit surprised that no one has been so simple-minded (that is, generator-, regex-, map-, and read-free) as to propose, for example,
fp = open(fn)
def get_one_value():
line = fp.readline()
if not line:
return None
parts = line.split(':')
if 2 != len(parts):
return ''
return parts[1].strip()
# The result is supposed to be a list.
result = []
while 1:
# We don't care about the ID.
if get_one_value() is None:
break
name = get_one_value()
familyn = get_one_value()
age = get_one_value()
result.append((name, familyn, age))
# We don't care about the block separator.
if get_one_value() is None:
break
for item in result:
print item
Re-format to taste.
精彩评论