Text file parsing question in Python
I am new to Python and I am trying to delete lines in a text file if I find the word "Lett." in the line. Here is a sample of the text file I am trying to parse:
<A>Lamb</A> <W>Let. Moxon</W>
<A>Lamb</A> <W>Danger Confound. Mor. w. Personal Deformity</W>
<A>Lamb</A> <W>Gentle Giantess</W>
<A>Lamb</A> <W>Lett., to Wordsw.</W>
<A>Lamb</A> <W>Lett., to Procter</W>
<A>Lamb</A> <W>Let. to Old Gentleman</W>
<A>Lamb</A> <W>Elia Ser.</W>
<A>Lamb</A> <W>Let. to T. Manning</W>
I know how to open the file but I am just uncertain of how to find the matching text and then how to delete that line. Any help would be greatly appreciated.
with open("myfile.txt", "r") as f:
    for line in f:
        if "Lett." not in line:
            print(line, end="")
Or, if you want to write the result back to the same file:
f = open("myfile.txt", "r")
lines = f.readlines()
f.close()
f = open("myfile.txt", "w")
for line in lines:
    if "Lett." not in line:
        f.write(line)
f.close()
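One caveat with rewriting the same file: if the script dies between opening the file for writing and finishing the loop, the original contents are gone. A minimal sketch of a safer variant is below. It writes the filtered lines to a temporary file in the same directory and then swaps it in with os.replace. The function name remove_matching_lines and its needle parameter are my own, not something from the question.

```python
import os
import tempfile


def remove_matching_lines(path, needle="Lett."):
    """Rewrite `path` without the lines containing `needle`.

    Returns the number of lines that were dropped.
    (Hypothetical helper, named here only for illustration.)
    """
    dropped = 0
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory so that
    # os.replace() does not have to cross filesystems.
    with open(path) as src, tempfile.NamedTemporaryFile(
            "w", dir=directory, delete=False) as tmp:
        for line in src:
            if needle in line:
                dropped += 1
            else:
                tmp.write(line)
    # Both files are closed here; swap the filtered copy into place.
    os.replace(tmp.name, path)
    return dropped
```

Calling remove_matching_lines("myfile.txt") would then filter the file in one step. Note that the replacement file gets the temporary file's permissions, so restore them yourself if that matters for your data.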
# Open input text
text = open('in.txt', 'r')
# Open a file to output results
out = open('out.txt', 'w')
# Go through file line by line
for line in text:
    if 'Lett.' not in line:  ### This is the crucial line.
        # add line to file if 'Lett.' is not in the line
        out.write(line)
# Close both files; closing the output saves the changes
text.close()
out.close()
I have a general stream-editor framework for this kind of stuff. I load the file into memory, apply changes to the in-memory list of lines, and write out the file if changes were made.
I have boilerplate that looks like this:
from sed_util import delete_range, insert_range, append_range, replace_range

def sed(filename):
    modified = False
    # Load file into memory
    with open(filename) as f:
        lines = [line.rstrip() for line in f]
    # magic here...
    if modified:
        with open(filename, "w") as f:
            for line in lines:
                f.write(line + "\n")
And in the "# magic here" section, I have either:
modifications to individual lines, like:
lines[i] = change_line(lines[i])
calls to my sed utilities for inserting, appending, and replacing lines, like:
lines = delete_range(lines, some_range)
The latter uses primitives like these:
def delete_range(lines, r):
    """
    >>> a = list(range(10))
    >>> b = delete_range(a, (1, 3))
    >>> b
    [0, 4, 5, 6, 7, 8, 9]
    """
    start, end = r
    assert start <= end
    return [line for i, line in enumerate(lines) if not (start <= i <= end)]

def insert_range(lines, line_no, new_lines):
    """
    >>> a = list(range(10))
    >>> b = list(range(11, 13))
    >>> c = insert_range(a, 3, b)
    >>> c
    [0, 1, 2, 11, 12, 3, 4, 5, 6, 7, 8, 9]
    >>> c = insert_range(a, 0, b)
    >>> c
    [11, 12, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
    >>> c = insert_range(a, 9, b)
    >>> c
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 11, 12, 9]
    """
    assert 0 <= line_no < len(lines)
    return lines[0:line_no] + new_lines + lines[line_no:]

def append_range(lines, line_no, new_lines):
    """
    >>> a = list(range(10))
    >>> b = list(range(11, 13))
    >>> c = append_range(a, 3, b)
    >>> c
    [0, 1, 2, 3, 11, 12, 4, 5, 6, 7, 8, 9]
    >>> c = append_range(a, 0, b)
    >>> c
    [0, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9]
    >>> c = append_range(a, 9, b)
    >>> c
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12]
    """
    assert 0 <= line_no < len(lines)
    return lines[0:line_no+1] + new_lines + lines[line_no+1:]

def replace_range(lines, line_nos, new_lines):
    """
    >>> a = list(range(10))
    >>> b = list(range(11, 13))
    >>> c = replace_range(a, (0, 2), b)
    >>> c
    [11, 12, 2, 3, 4, 5, 6, 7, 8, 9]
    >>> c = replace_range(a, (8, 10), b)
    >>> c
    [0, 1, 2, 3, 4, 5, 6, 7, 11, 12]
    >>> c = replace_range(a, (0, 10), b)
    >>> c
    [11, 12]
    >>> c = replace_range(a, (0, 10), [])
    >>> c
    []
    >>> c = replace_range(a, (0, 9), [])
    >>> c
    [9]
    """
    start, end = line_nos
    return lines[:start] + new_lines + lines[end:]

def find_line(lines, regex):
    for i, line in enumerate(lines):
        if regex.match(line):
            return i

if __name__ == '__main__':
    import doctest
    doctest.testmod()
The tests work on lists of integers, for clarity, but the transformations work on lists of strings, too.
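To illustrate with strings, here is a small sketch that applies the same delete_range logic to a few lines from the question's sample. The function body is repeated so the snippet runs on its own, and the index pair (1, 2) is simply picked by hand for this example.

```python
def delete_range(lines, r):
    # Same logic as the delete_range primitive above:
    # drop the items at indices start..end, both inclusive.
    start, end = r
    assert start <= end
    return [line for i, line in enumerate(lines) if not (start <= i <= end)]


lines = [
    "<A>Lamb</A> <W>Gentle Giantess</W>",
    "<A>Lamb</A> <W>Lett., to Wordsw.</W>",
    "<A>Lamb</A> <W>Lett., to Procter</W>",
    "<A>Lamb</A> <W>Let. to Old Gentleman</W>",
]

# Indices 1 and 2 hold the two "Lett." lines in this hand-built list.
result = delete_range(lines, (1, 2))
```

After this, result holds only the "Gentle Giantess" and "Let. to Old Gentleman" lines.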
Generally, I scan the list of lines to identify changes I want to apply, usually with regular expressions, and then I apply the changes on matching data. Today, for example, I ended up making about 2000 line changes across 150 files.
This works better than sed when you need to apply multiline patterns or additional logic to identify whether a change is applicable.
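For the question at hand, the scan-then-apply pattern might look roughly like the sketch below. The names LETT and drop_lett_lines are made up for this example, and the pattern is my assumption about what should count as a match. It uses search rather than match because "Lett." appears mid-line in the sample data.

```python
import re

# Assumed pattern for the question's "Lett." marker (the dot is escaped).
LETT = re.compile(r"Lett\.")


def drop_lett_lines(lines):
    """Return (new_lines, modified) with every line matching LETT removed."""
    # Pass 1: scan the lines and record which indices should go.
    doomed = {i for i, line in enumerate(lines) if LETT.search(line)}
    # Pass 2: apply the change by keeping everything else.
    kept = [line for i, line in enumerate(lines) if i not in doomed]
    return kept, bool(doomed)


sample = [
    "<A>Lamb</A> <W>Let. Moxon</W>",
    "<A>Lamb</A> <W>Lett., to Wordsw.</W>",
    "<A>Lamb</A> <W>Lett., to Procter</W>",
    "<A>Lamb</A> <W>Elia Ser.</W>",
]
kept, modified = drop_lett_lines(sample)
```

The modified flag plays the same role as in the sed boilerplate: the file only needs to be written back when at least one line was actually removed.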
def lines_without_lett(fname):
    return [l for l in open(fname) if 'Lett.' not in l]
result = ''
with open('in.txt') as src:
    for line in src:
        # The match is case-sensitive, so look for 'Lett.' exactly
        if 'Lett.' not in line:
            result += line
with open('out.txt', 'w') as f:
    f.write(result)