How to avoid using readlines()?

2023-03-31 00:40 问答作者：

I need to deal with super large txt input files, and I usually use 开发者_StackOverflow.readlines() to first read the whole file, and turn it into a list.

I know it's really memory-cost and can be quite slow, but I also need to make use of LIST characteristics to manipulate the specific lines, like below:

#!/usr/bin/python

import os,sys
import glob
import commands
import gzip

path= '/home/xxx/scratch/'
fastqfiles1=glob.glob(path+'*_1.recal.fastq.gz')

for fastqfile1 in fastqfiles1:
    filename = os.path.basename(fastqfile1)
    job_id = filename.split('_')[0]
    fastqfile2 = os.path.join(path+job_id+'_2.recal.fastq.gz') 

    newfastq1 = os.path.join(path+job_id+'_1.fastq.gz') 
    newfastq2 = os.path.join(path+job_id+'_2.fastq.gz') 

    l1= gzip.open(fastqfile1,'r').readlines()
    l2= gzip.open(fastqfile2,'r').readlines()
    f1=[]
    f2=[]
    for i in range(0,len(l1)):
        if i % 4 == 3:
           b1=[ord(x) for x in l1[i]]
           ave1=sum(b1)/float(len(l1[i]))
           b2=[ord(x) for x in str(l2[i])]
           ave2=sum(b2)/float(len(l2[i]))
           if (ave1 >= 20 and ave2>= 20):
              f1.append(l1[i-3])
              f1.append(l1[i-2])
              f1.append(l1[i-1])
              f1.append(l1[i])
              f2.append(l2[i-3])
              f2.append(l2[i-2])
              f2.append(l2[i-1])
              f2.append(l2[i])
    output1=gzip.open(newfastq1,'w')
    output1.writelines(f1)
    output1.close()
    output2=gzip.open(newfastq2,'w')
    output2.writelines(f2)
    output2.close()

In general, I'm trying to read every 4th line of the whole text, but if the 4th line meets the desired condition, I'll append these 4 lines into the text. So can I avoid readlines() to achieve this? thx

EDIT: Hi, actually I myself found a better way:

import commands
 l1=commands.getoutput('zcat ' + fastqfile1).splitlines(True)
 l2=commands.getoutput('zcat ' + fastqfile2).splitlines(True)

I think 'zcat' is super fast.... It took around 15min to readlines, while only 1 minute to just zcat...

If you can refactor your code to read through the file linearly, then you can just say for line in file to iterate through each line of the file without reading it all into memory at once. But, since your file access looks more complicated, you could use a generator to replace readlines(). One way to do this would be to use itertools.izip or itertools.izip_longest:

def four_at_a_time(iterable):
    """Returns an iterator that returns a 4-tuple of objects at a time from the
       given iterable"""
    args = [iter(iterable) * 4]
    return itertools.izip(*args)
...
l1 = four_at_a_time(gzip.open(fastqfile1, 'r'))
l2 = four_at_a_time(gzip.open(fastqfile2, 'r'))
for i, x in enumerate(itertools.izip(l1, l2))
    # x is now a 2-tuple of 4-tuples of lines (one 4-tuple of lines from the first file,
    # and one 4-tuple of lines from the second file).  Process accordingly.

A simple way would be to,

(pseudocode, may contain errors, for illustrative purposes only)

    a=gzip.open()
    b=gzip.open()

    last_four_a_lines=[]
    last_four_b_lines=[]

    idx=0

    new_a=[]
    new_b=[]

    while True:
      la=a.readline()
      lb=b.readline()
      if (not la) or (not lb):
        break

      if idx % 4==3:
        a_calc=sum([ something ])/len(la)
        b_calc=sum([ something ])/len(lb)
        if a_calc and b_calc:
          for line in last_four_a_lines:
          new_a.append(line)
          for line in last_four_b_lines:
          new_b.append(line)

      last_four_a_lines.append(la)
      del(last_four_a_lines[0])
      last_four_b_lines.append(lb)
      del(last_four_b_lines[0])
      idx+=1
a.close()
b.close()

You could use enumerate to iterate over the lines in the file, which would return a count and a line each iteration:

with open(file_name) as f:
    for i, line in enumerate(f):
        if i % 4 == 3:
            print i, line

Here is how to print all lines containing foo and the previous 3 lines:

f = open(...)
prevlines = []
for line in f:
  prevlines.append(line)
  del prevlines[:-4]
  if 'foo' in line:
    print prevlines

If you are reading 2 files at a time (with an equal number of lines), do it like this:

f1 = open(...)
f2 = open(...)
prevlines1 = []
for line1 in f1:
  prevlines1.append(line1)
  del prevlines1[:-4]
  line2 = f2.readline()
  prevlines2.append(line2)
  del prevlines2[:-4]
  if 'foo' in line1 and 'bar' in line2:
    print prevlines1, prevlines2

Tricky, because you actually have two files you are processing simultaneously.

You can use the fileinput module to efficiently parse a file one line at a time. It can be used to parse a list of files too, and you can use the fileinput.nextfile() method within the block to alternate through several files in parallel, consuming one line from each file at a time.

The fileinput.lineno() method will even give you the current line number in the current file. You could use temporary lists in the loop body to keep track of your 4-line blocks.

Totally untested ad-hoc code, possibly based on a misunderstanding of what your code does, follows:

f1 = []
f2 = []
for line in fileinput(filename1, filename2):
    if fileinput.filename() = filename1:
        f1.append(line)
    else:
        f2.append(line)
        if fileinput.lineno() % 4 == 3:
            doMyProcesing()
            f1 = []; f2 = []
    fileinput.nextfile()

I think that improving the obtention of l1 and l2 isn't sufficient: you must improve your code globally

I propose:

#!/usr/bin/python

import os
import sys
import gzip

path= '/home/xxx/scratch/'

def gen(gfa,gfb):
    try:
        a = (gfa.readline(),gfa.readline(),gfa.readline(),gfa.readline())
        b = (gfb.readline(),gfb.readline(),gfb.readline(),gfb.readline())
        if sum(imap(ord,a[3]))/float(len(a[3])) >= 20 \
           and sum(imap(ord,b[3]))/float(len(b[3])) >= 20:
            yield (a,b)
    except:
        break

for fastqfile1 in glob.glob(path + '*_1.recal.fastq.gz') :
    pji = path + os.path.basename(fastqfile1).split('_')[0] # pji = path + job_id

    gf1= gzip.open(fastqfile1,'r')
    gf2= gzip.open(os.path.join(pji + '_2.recal.fastq.gz'),'r')

    output1=gzip.open(os.path.join(pji + '_1.fastq.gz'),'w')
    output2=gzip.open(os.path.join(pji + '_2.fastq.gz'),'w')

    for lines1,lines2 in gen(gf1,gf2):
        output1.writelines(lines1)
        output2.writelines(lines2)

    output1.close()
    output2.close()

It should diminish the execution's time by 30 %. Pure guess.

PS:

code

if sum(imap(ord,a[3]))/float(len(a[3])) >= 20 \
   and sum(imap(ord,b[3]))/float(len(b[3])) >= 20:

is executed more rapidly rather than

ave1 = sum(imap(ord,a[3]))/float(len(a[3])) 
ave2 = sum(imap(ord,b[3]))/float(len(b[3]))
if ave1 >= 20 and ave2 >=20:

because if ave1 isn't greater than 20, the object ave2 isn't evaluated.

继续阅读：list python readlines

How to avoid using readlines()?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？