Efficiently prepending text to a very large text file in Python

I have to prepend some arbitrary text to an existing, but very large (2 - 10 GB range) text file. With the file being so large, I'm trying to avoid reading the entire file into memory. But am I being too conservative with line-by-line iteration? Would moving to a readlines(sizehint) approach give me much of a performance advantage over my current approach?

The delete-and-move at the end is less than ideal but, as far as I know, there's no way to do this sort of manipulation on linear data in place. But I'm not so well versed in Python -- maybe there's something unique to Python I can exploit to do this better?

import os
import shutil

def prependToFile(f, text):
    f_temp = generateTempFileName(f)
    inFile = open(f, 'r')
    outFile = open(f_temp, 'w')
    # Write the new header, then copy the original file line by line.
    outFile.write('# START\n')
    outFile.write('%s\n' % str(text))
    outFile.write('# END\n\n')
    for line in inFile:
        outFile.write(line)
    inFile.close()
    outFile.close()
    # Swap the rebuilt copy into place.
    os.remove(f)
    shutil.move(f_temp, f)
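
For reference, the readlines(sizehint) variant I'm asking about would look roughly like this (just a sketch; the 1 MB sizehint is an arbitrary guess on my part):

def prependToFileReadlines(f, text, sizehint=1024 * 1024):
    f_temp = generateTempFileName(f)
    inFile = open(f, 'r')
    outFile = open(f_temp, 'w')
    outFile.write('# START\n')
    outFile.write('%s\n' % str(text))
    outFile.write('# END\n\n')
    while True:
        # Read roughly sizehint bytes' worth of complete lines at a time.
        lines = inFile.readlines(sizehint)
        if not lines:
            break
        outFile.writelines(lines)
    inFile.close()
    outFile.close()
    os.remove(f)
    shutil.move(f_temp, f)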


If this is on Windows NTFS, you can supposedly insert into the middle of a file. (Or so I'm told; I'm not a Windows developer.)

If this is on a POSIX (Linux or Unix) system, you should use "cat" as someone else said. cat is wickedly efficient, using every trick in the book to get optimal performance (i.e., it avoids copying buffers, etc.).

However, if you must do it in Python, the code you presented could be improved by using shutil.copyfileobj() (which takes two file handles and copies between them in large blocks) and tempfile.NamedTemporaryFile (which creates a uniquely named temporary file for the swap):

import os
import shutil
import tempfile

def prependToFile(f, text):
    # NamedTemporaryFile defaults to binary mode; open it in text mode.
    outFile = tempfile.NamedTemporaryFile(mode='w', dir='.', delete=False)
    outFile.write('# START\n')
    outFile.write('%s\n' % str(text))
    outFile.write('# END\n\n')
    with open(f, 'r') as inFile:
        shutil.copyfileobj(inFile, outFile)  # copies in large blocks
    outFile.close()  # flush and close before replacing the original
    os.remove(f)
    shutil.move(outFile.name, f)

I think the os.remove(f) isn't needed, as shutil.move() will overwrite f. However, you should test that on your platform. Also, the delete=False may not be needed, but it's safe to leave it in.


What you want to do is read the file in large blocks (anywhere from 64 KB to several MB) and write those blocks out. In other words, instead of individual lines, use huge blocks. That way you do the fewest I/O operations possible, and hopefully your process is I/O-bound rather than CPU-bound.
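
A minimal sketch of that block-copy idea, assuming a 1 MB block size and a '.tmp' naming scheme (both arbitrary choices, not from the original code):

import shutil

def prependToFileBlocks(f, text, blocksize=1024 * 1024):
    f_temp = f + '.tmp'
    with open(f, 'rb') as inFile, open(f_temp, 'wb') as outFile:
        # Write the header, then copy the original in large binary blocks.
        outFile.write(b'# START\n')
        outFile.write(('%s\n' % text).encode())
        outFile.write(b'# END\n\n')
        while True:
            block = inFile.read(blocksize)
            if not block:
                break
            outFile.write(block)
    shutil.move(f_temp, f)  # overwrites the original on POSIX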


You can use tools better suited to the job: os.system("cat file1 file2 > file3").
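
If you'd rather not go through the shell, the same idea with subprocess (a sketch with hypothetical file names):

import subprocess

# Let Python open the destination so no shell redirection is needed.
with open('file3', 'wb') as out:
    subprocess.run(['cat', 'file1', 'file2'], stdout=out, check=True)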


To be honest, I would recommend you just write this in C instead if you're worried about execution time. Doing system calls from Python can be quite slow, and since you'll have to do a lot of them whether you take the line-by-line or the raw-block approach, that will really drag things down.
