python - Read file from and to specific lines of text
I'm not talking about specific line numbers, because I'm reading multiple files that share the same format but vary in length.
Say I have this text file:
Something here...
... ... ...
Start #I want this block of text
a b c d e f g
h i j k l m n
End #until this line of the file
something here...
... ... ...
I hope you know what I mean. I was thinking of iterating through the file, searching with a regular expression to find the line numbers of "Start" and "End", and then using linecache to read from the Start line to the End line. But how do I get the line numbers? What function can I use?
If you simply want the block of text between Start and End, you can do something simple like:
with open('test.txt') as input_data:
    # Skips text before the beginning of the interesting block:
    for line in input_data:
        if line.strip() == 'Start':  # Or whatever test is needed
            break
    # Reads text until the end of the block:
    for line in input_data:  # This keeps reading the file
        if line.strip() == 'End':
            break
        print(line, end='')  # Line is extracted (or block_of_lines.append(line), etc.)
In fact, you do not need to manipulate line numbers in order to read the data between the Start and End markers.
The logic ("read until…") is repeated in both blocks, but it is quite clear and efficient (other methods typically involve checking some state [before block/within block/end of block reached], which incurs a time penalty).
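The same two-loop idea can be wrapped in a small generator so it can be reused across files. This is only a sketch: the name `lines_between` and its `start`/`end` parameters are made up for illustration, and it assumes each marker sits on a line of its own once surrounding whitespace is stripped. An `io.StringIO` object stands in for a real file so the example runs by itself.

```python
import io


def lines_between(lines, start="Start", end="End"):
    """Yield the lines found strictly between the start and end marker lines.

    `lines` is any iterable of strings, for example an open file object.
    (Hypothetical helper for illustration, not part of any library.)
    """
    iterator = iter(lines)  # One shared iterator, so the second loop resumes where the first stopped
    # Skip everything up to and including the start marker.
    for line in iterator:
        if line.strip() == start:
            break
    # Yield lines until the end marker (or the end of the input) is reached.
    for line in iterator:
        if line.strip() == end:
            break
        yield line.rstrip("\n")


# Stand-in for a real file.
sample = io.StringIO("junk\nStart\na b c d e f g\nh i j k l m n\nEnd\nmore junk\n")
block = list(lines_between(sample))
print(block)  # ['a b c d e f g', 'h i j k l m n']
```

With a real file, `with open('test.txt') as f: block = list(lines_between(f))` should behave the same way. If the start marker never appears, the generator simply yields nothing.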
Here's something that will work:
block = ""
found = False
with open("test.txt") as data_file:
    for line in data_file:
        if found:
            block += line  # The "End" line is kept as well
            if line.strip() == "End":
                break
        elif line.strip() == "Start":
            found = True
            block = line  # Begin the block with the "Start" line, newline included
You can use a regex pretty easily. You can make it more robust as needed; below is a simple example that matches from the first occurrence of START to the nearest END, even across lines.
>>> import re
>>> START = "some"
>>> END = "Hello"
>>> test = "this is some\nsample text\nthat has the\nwords Hello World\n"
>>> m = re.compile(r'%s.*?%s' % (START,END), re.S)
>>> m.search(test).group(0)
'some\nsample text\nthat has the\nwords Hello'
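One way to make the regex approach more robust, sketched under a few assumptions: the markers are escaped with `re.escape` so that special characters in them are taken literally, and they must fill a whole line (apart from spaces or tabs), so that words like "Startup" or "Ending" are not mistaken for markers. The function name `extract_block` is hypothetical.

```python
import re


def extract_block(text, start="Start", end="End"):
    """Return the text between a start marker line and the next end marker line, or None."""
    pattern = re.compile(
        # ^ and $ anchor at line boundaries (MULTILINE); .*? may span lines (DOTALL).
        r"^[ \t]*%s[ \t]*\n(.*?)^[ \t]*%s[ \t]*$" % (re.escape(start), re.escape(end)),
        re.MULTILINE | re.DOTALL,
    )
    match = pattern.search(text)
    return match.group(1) if match else None


text = "intro\nStart\na b c d e f g\nh i j k l m n\nEnd\noutro\n"
print(repr(extract_block(text)))  # 'a b c d e f g\nh i j k l m n\n'
```

Note that this reads the whole file into memory first (for example with `open(...).read()`), which is fine for small files but less suitable than the line-by-line loops for very large ones.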
This should be a start for you:
started = False
collected_lines = []
with open(path, "r") as fp:  # path: the name of the file to read
    for i, line in enumerate(fp):
        if line.rstrip() == "Start":
            started = True
            print("started at line", i)  # counts from zero!
            continue
        if started and line.rstrip() == "End":
            print("end at line", i)
            break
        if started:
            # process line (only lines after "Start" are collected)
            collected_lines.append(line.rstrip())
enumerate takes any iterable and yields (index, item) pairs, counting from zero by default.
E.g.
print(list(enumerate("a b c".split())))
prints
[(0, 'a'), (1, 'b'), (2, 'c')]
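Since the question asked specifically about line numbers and linecache, here is a rough sketch of that route. `enumerate(fp, start=1)` gives 1-based numbers, which is also what `linecache.getline` expects. The helper name `find_marker_lines` and the temporary demo file are assumptions made for the example.

```python
import linecache
import os
import tempfile


def find_marker_lines(path, start="Start", end="End"):
    """Return (start_line, end_line) as 1-based line numbers, or None if a marker is missing."""
    start_no = None
    with open(path) as fp:
        for number, line in enumerate(fp, start=1):
            if start_no is None:
                if line.strip() == start:
                    start_no = number
            elif line.strip() == end:
                return start_no, number
    return None


# Write a small throwaway file so the example runs on its own.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("junk\nStart\na b c d e f g\nh i j k l m n\nEnd\nmore junk\n")
    demo_path = tmp.name

bounds = find_marker_lines(demo_path)
print(bounds)  # (2, 5)

# Read the lines strictly between the two markers by number.
first, last = bounds
block = [linecache.getline(demo_path, n).rstrip("\n") for n in range(first + 1, last)]
print(block)  # ['a b c d e f g', 'h i j k l m n']

linecache.clearcache()
os.remove(demo_path)
```

This works, but it reads the file twice (once to find the numbers, once through linecache), so collecting the lines directly in the loop is usually simpler.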
UPDATE:
The poster asked for a regex to match lines like "===" and "======":
import re
print(re.match("^=+$", "===") is not None)     # True
print(re.match("^=+$", "======") is not None)  # True
print(re.match("^=+$", "=") is not None)       # True
print(re.match("^=+$", "=abc") is not None)    # False
print(re.match("^=+$", "abc=") is not None)    # False
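To tie this back to extracting a block, here is a sketch that uses such a regex as the marker test. It assumes (this is a guess, the thread does not say so) that a line made only of "=" characters both opens and closes the block of interest. The names `is_separator` and `between_separators` are hypothetical.

```python
import re

SEPARATOR = re.compile(r"=+")  # One or more "=" characters; used with fullmatch below


def is_separator(line):
    """True if the line consists only of '=' characters (ignoring surrounding whitespace)."""
    return SEPARATOR.fullmatch(line.strip()) is not None


def between_separators(lines):
    """Collect the lines between the first and the second separator line."""
    collected = []
    inside = False
    for line in lines:
        if is_separator(line):
            if inside:
                break  # Second separator: the block is complete
            inside = True  # First separator: the block starts on the next line
            continue
        if inside:
            collected.append(line.rstrip("\n"))
    return collected


sample = ["header\n", "===\n", "a b c\n", "d e f\n", "======\n", "footer\n"]
print(between_separators(sample))  # ['a b c', 'd e f']
```

Passing an open file object instead of the `sample` list should work the same way, since both are iterables of lines.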