Most efficient way to concatenate and rearrange files

I am reading from several files; each file is divided into two pieces: a header section of a few thousand lines followed by a body of a few thousand lines. My problem is that I need to concatenate these files into one file where all the headers are at the top, followed by all the bodies.

Currently I am using two loops: one to pull out all the headers and write them, and a second to write the body of each file (I also keep a tmp_count variable to limit the number of lines loaded into memory before dumping them to the file).

This is pretty slow - about 6 minutes for a 13 GB file. Can anyone tell me how to optimize this, or whether there is a faster way to do this in Python?

Thanks!

Here is my code:

def cat_files_sam(final_file_name,work_directory_master,file_count):

    final_file = open(final_file_name,"w")

    if len(file_count) > 1:
        file_count=sort_output_files(file_count)

    # only for @ headers
    for bowtie_file in file_count:
        #print bowtie_file
        tmp_list = []
        tmp_count = 0
        for line in open(os.path.join(work_directory_master,bowtie_file)):
            if line.startswith("@"):
                if tmp_count == 1000000:
                    final_file.writelines(tmp_list)
                    tmp_list = []
                    tmp_count = 0
                tmp_list.append(line)
                tmp_count += 1
            else:
                final_file.writelines(tmp_list)
                break

    for bowtie_file in file_count:
        #print bowtie_file
        tmp_list = []
        tmp_count = 0
        for line in open(os.path.join(work_directory_master,bowtie_file)):
            if line.startswith("@"):
                continue
            if tmp_count == 1000000:
                final_file.writelines(tmp_list)
                tmp_list = []
                tmp_count = 0
            tmp_list.append(line)
            tmp_count += 1
        final_file.writelines(tmp_list)

    final_file.close()


How fast would you expect it to be to move 13 GB of data around? This problem is I/O bound, not a problem with Python. To make it faster, do less I/O, which means that you are either (a) stuck with the speed you've got, or (b) should retool later elements of your toolchain to handle the files in place rather than requiring one giant 13 GB file.
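If you want to sanity-check that, time a plain byte-for-byte copy of the same inputs. This is only a rough sketch: it reuses file_count and work_directory_master from the question, and baseline.out is just a hypothetical throwaway output name.

import os
import shutil
import time

start = time.time()
# concatenate everything with no reordering, purely to measure raw disk throughput
with open("baseline.out", "wb") as out:
    for bowtie_file in file_count:
        with open(os.path.join(work_directory_master, bowtie_file), "rb") as src:
            shutil.copyfileobj(src, out, 1024 * 1024)   # copy in 1 MB chunks
print("raw copy took %.1f seconds" % (time.time() - start))

If the raw copy already takes several minutes, the header/body reshuffling is not where the time is going.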


You can save the time the second pass spends skipping the headers, as long as you have a reasonable amount of spare disk space: in addition to the final file, also open a temporary file temp_file (in 'w+' mode), and do:

import os
import shutil

hdr_list = []
bod_list = []
dispatch = {True: (hdr_list, final_file),
            False: (bod_list, temp_file)}

for bowtie_file in file_count:
    with open(os.path.join(work_directory_master,bowtie_file)) as f:
        for line in f:
            # header lines are buffered for the final file, body lines for the temp file
            L, fou = dispatch[line[0]=='@']
            L.append(line)
            if len(L) == 1000000:
                fou.writelines(L)
                del L[:]

# write final parts, if any
for L, fou in dispatch.values():
    if L: fou.writelines(L)

temp_file.seek(0)
shutil.copyfileobj(temp_file, final_file)

This should enhance your program's performance. Fine-tuning that now-hard-coded 1000000, or even doing away with the lists entirely and writing each line directly to the appropriate file (final or temporary), are other options you should benchmark (if you have unbounded amounts of memory, I expect they won't matter much -- however, intuitions about performance are often misleading, so it's best to try and measure!-).
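For what it's worth, the line-at-a-time variant mentioned above would look roughly like this (a sketch under the same assumptions: final_file, temp_file, file_count and work_directory_master already exist, and Python's own file buffering does the batching):

import os
import shutil

for bowtie_file in file_count:
    with open(os.path.join(work_directory_master, bowtie_file)) as f:
        for line in f:
            # no explicit batching: rely on the file objects' internal buffers
            (final_file if line[0] == '@' else temp_file).write(line)

temp_file.seek(0)
shutil.copyfileobj(temp_file, final_file)

Whether this beats the explicitly batched version depends on buffer sizes and the filesystem, which is exactly why it's worth measuring both.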


There are two gross inefficiencies in the code you meant to write:

  1. You are building up huge lists of header lines in the first major for block instead of just writing them out as you go.
  2. You are skipping the headers of the files again, line by line, in the second major for block, even though you already determined where the headers end in (1). See file.seek and file.tell (a sketch follows this list).
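A rough sketch of what (2) could look like, reusing the names from the question (your sort_output_files step is omitted for brevity; binary mode is used so that tell/seek offsets are exact, and it assumes every header line starts with "@" and headers always precede the body):

import os
import shutil

def cat_files_sam(final_file_name, work_directory_master, file_count):
    final_file = open(final_file_name, "wb")
    body_offsets = []   # (path, byte offset where the body starts) per input file

    # pass 1: copy all headers, remembering where each body begins
    for bowtie_file in file_count:
        path = os.path.join(work_directory_master, bowtie_file)
        f = open(path, "rb")
        while True:
            offset = f.tell()              # position before reading the next line
            line = f.readline()
            if not line.startswith(b"@"):  # first body line (or EOF): headers done
                break
            final_file.write(line)
        body_offsets.append((path, offset))
        f.close()

    # pass 2: seek straight to each body and bulk-copy the rest
    for path, offset in body_offsets:
        f = open(path, "rb")
        f.seek(offset)
        shutil.copyfileobj(f, final_file)
        f.close()

    final_file.close()

This avoids re-reading and re-testing every header line in the second pass; the bodies are copied in large sequential chunks instead.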