开发者

About python loop

I have two path

path1 = "/home/x/nearline"
path2 = "/home/x/sge_jobs_output"

In path1, I have a bunch of fastq files:

ERR001268_1.recal.fastq.gz
ERR001268_2.recal.fastq.gz
ERR001269_1.recal.fastq.gz
ERR001269_2.recal.fastq.gz
.............

In path2, I have many .txt corresponding to the fastq files in path1:

ERR001268_1.txt
ERR001268_2.txt
ERR001269_1.txt
ERR001269_2.txt
.............

NOw I've made script to calculate fastq_seq_num from fastq files in path1, see below:

for file in os.listdir(path1):
  if re.match('.*\.recal.fastq.gz', file):
    fullpath1 = os.path.join(path1, file)
#To calculate the sequence number in fastq.gz files  
    result = commands.getoutput('zcat ' + fullpath1 + ' |wc -l')
    fastq_seq_num = int(result)/4.0
  print file,fastq_seq_num 

And also calculate num_seq_processed_sai from .txt files in path2, see below:

for file in os.listdir(path2):
  if re.match('.*\.txt', file):
      fullpath2 = os.path.join(path2, file)
#To calculate how many sequences have been processed in .sai file
      linelist = open (fullpath2,'r').readlines
      lastline = linelist[len(linelist)-1]
      num_seq_processed_sai = lastline.split(']')[1].split()[0]
  print file,num_seq_processed_sai

OK, now my problem is : I want to create a loop, in which I calculate fastq_seq_num for the FIRST fastq file in path1; then calcula开发者_StackOverflow社区te num_seq_processed for the FIRST txt file in path2; then compare this two numbers; then end the loop. Then the second loop starts...How can I desgin some loop to achieve this? thanks!!!


Iterate the fastq files, check for the corresponding .txt file and if the pair exists, run your processing and compare the outputs

import commands
import glob
from os import path

dir1 = '/home/x/nearline'
dir2 = '/home/x/sge_jobs_output'

for filepath in glob.glob(path.join(dir1, '*.recal.fastq.gz')):
    filename = path.basename(filepath)
    job_id = filename.split('.', 1)[0]

    ## Look for corresponding .txt file
    txt_filepath = path.join(dir2, '%s.txt' % job_id)
    ## Fail early if corresponding .txt file is missing
    if not path.isfile(txt_filepath):
        print('Missing %s for %s' % (txt_filepath, filepath))
        continue

    ## Both exist, process each
    ## This is from your code snippet
    result = commands.getoutput('zcat ' + fullpath1 + ' |wc -l')
    fastq_seq_num = int(result)/4.0

    linelist = open(txt_filepath).readlines()
    lastline = linelist[len(linelist)-1]
    num_seq_processed_sai = lastline.split(']')[1].split()[0]

    if fastq_seq_num == num_seq_processed_sai:
        print "Sequence numbers match (%d : %d)" % (fastq_seq_num, num_seq_processed_sai)
    else:
        print "Sequence numbers do not match (%d : %d)" % (fastq_seq_num, num_seq_processed_sai)

I suggest using glob.glob() to list your files (as I've done in this snippet).

I also suggest replacing commands with subprocess, it's newer and I think commands is deprecated.

Also, reading all lines of a .txt file to get at the last line is inefficient and may cause memory issues if the files are large. Consider calling tail to get the last line (*nix only I think).


Are the number of files in both directories guaranteed to be the same? If so, you can use the zip function to accomplish this:

for fastqFile, txtFile in zip(glob.glob(path1+'/*.recal.fastq.gz'), glob.glob(path2+'/*.txt')):
    result = commands.getoutput('zcat ' + fastqFile + ' |wc -l')
    fastq_seq_num = int(result)/4.0

    lastline = linelist[-1]
    num_seq_processed_sai = lastline.split(']')[1].split()[0]


    print fastqFile, fastq_seq_num 
    print txtFile, num_seq_processed_sai

EDIT: As side notes, using glob.glob() is almost always preferable to manually filtering the output of os.listdir() and the word file is a built-in type in Python, which you should never use as a variable name. In addition, to get to the last item of a list, you only need to access it with listName[-1]. Accessing it with listName[len(listName)-1] is unpythonic.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜