Using python, how to read a file starting at the seventh line ?
I have a text file structure as:
date
downland
user
date data1 date2
201102 foo bar 200 50
201101 foo bar 300 35
So first six lines of file are not needed. filename:dnw.txt
f = open('dwn.txt', 'rb')
开发者_Python百科How do I "split" this file starting at line 7 to EOF?
with open('dwn.txt') as f:
for i in xrange(6):
f, next()
for line in f:
process(line)
Update: use next(f)
for python 3.x.
Itertools answer!
from itertools import islice
with open('foo') as f:
for line in islice(f, 6, None):
print line
Python 3:
with open("file.txt","r") as f:
for i in range(6):
f.readline()
for line in f:
# process lines 7-end
with open('test.txt', 'r') as fo:
for i in xrange(6):
fo.next()
for line in fo:
print "%s" % line.strip()
In fact, to answer precisely at the question as it was written
How do I "split" this file starting at line 7 to EOF?
you can do
:
in case the file is not big:
with open('dwn.txt','rb+') as f:
for i in xrange(6):
print f.readline()
content = f.read()
f.seek(0,0)
f.write(content)
f.truncate()
in case the file is very big
with open('dwn.txt','rb+') as ahead, open('dwn.txt','rb+') as back:
for i in xrange(6):
print ahead.readline()
x = 100000
chunk = ahead.read(x)
while chunk:
print repr(chunk)
back.write(chunk)
chunk = ahead.read(x)
back.truncate()
The truncate() function is essential to put the EOF you asked for. Without executing truncate() , the tail of the file, corresponding to the offset of 6 lines, would remain.
.
The file must be opened in binary mode to prevent any problem to happen.
When Python reads '\r\n' , it transforms them in '\n' (that's the Universal Newline Support, enabled by default) , that is to say there are only '\n' in the chains chunk even if there were '\r\n' in the file.
If the file is from Macintosh origin , it contains only CR = '\r' newlines before the treatment but they will be changed to '\n' or '\r\n' (according to the platform) during the rewriting on a non-Macintosh machine.
If it is a file from Linux origin, it contains only LF = '\n' newlines which, on a Windows OS, will be changed to '\r\n' (I don't know for a Linux file processed on a Macintosh ). The reason is that the OS Windows writes '\r\n' whatever it is ordered to write , '\n' or '\r' or '\r\n'. Consequently, there would be more characters rewritten than having been read, and then the offset between the file's pointers ahead and back would diminish and cause a messy rewriting.
In HTML sources , there are also various newlines.
That's why it's always preferable to open files in binary mode when they are so processed.
Alternative version
You can direct use the command read()
if you know the character position pos
of the separating (header part from the part of interest) linebreak, e.g. an \n
, in the text at which you want to break your input text:
with open('input.txt', 'r') as txt_in:
txt_in.seek(pos)
second_half = txt_in.read()
If you are interested in both halfs, you could also investigate the following method:
with open('input.txt', 'r') as txt_in:
all_contents = txt_in.read()
first_half = all_contents[:pos]
second_half = all_contents[pos:]
You can read the entire file into an array/list and then just start at the index appropriate to the line you wish to start reading at.
f = open('dwn.txt', 'rb')
fileAsList = f.readlines()
fileAsList[0] #first line
fileAsList[1] #second line
#!/usr/bin/python
with open('dnw.txt', 'r') as f:
lines_7_through_end = f.readlines()[6:]
print "Lines 7+:"
i = 7;
for line in lines_7_through_end:
print " Line %s: %s" % (i, line)
i+=1
Prints:
Lines 7+:
Line 7: 201102 foo bar 200 50 Line 8: 201101 foo bar 300 35
Edit:
To rebuild dwn.txt
without the first six lines, do this after the above code:
with open('dnw.txt', 'w') as f:
for line in lines_7_through_end:
f.write(line)
I have created a script used to cut an Apache access.log file several times a day. It's not original topic of question, but I think it can be useful, if you have store the file cursor position after the 6 first lines reading.
So I needed the set a position cursor on last line parsed during last execution.
To this end, I used file.seek()
and file.seek()
methods which allows the storage of the cursor in file.
My code :
ENCODING = "utf8"
CURRENT_FILE_DIR = os.path.dirname(os.path.abspath(__file__))
# This file is used to store the last cursor position
cursor_position = os.path.join(CURRENT_FILE_DIR, "access_cursor_position.log")
# Log file with new lines
log_file_to_cut = os.path.join(CURRENT_FILE_DIR, "access.log")
cut_file = os.path.join(CURRENT_FILE_DIR, "cut_access", "cut.log")
# Set in from_line
from_position = 0
try:
with open(cursor_position, "r", encoding=ENCODING) as f:
from_position = int(f.read())
except Exception as e:
pass
# We read log_file_to_cut to put new lines in cut_file
with open(log_file_to_cut, "r", encoding=ENCODING) as f:
with open(cut_file, "w", encoding=ENCODING) as fw:
# We set cursor to the last position used (during last run of script)
f.seek(from_position)
for line in f:
fw.write("%s" % (line))
# We save the last position of cursor for next usage
with open(cursor_position, "w", encoding=ENCODING) as fw:
fw.write(str(f.tell()))
Just do f.readline() six times. Ignore the returned value.
Solutions with readlines() are not satisfactory in my opinion because readlines() reads the entire file. The user will have to read again the lines (in file or in the produced list) to process what he wants, while it could have been done without having read the intersting lines already a first time. Moreover if the file is big, the memory is weighed by the file's content while a for line in file
instruction would have been lighter.
Doing repetition of readline() can be done like that
nb = 6
exec( nb * 'f.readline()\n')
It's short piece of code and nb is programmatically adjustable
精彩评论