
Split large .gz files with prefixes

Each of my fastq files contains about 20 million reads (roughly 20 million lines). I need to split these big fastq files into chunks of 1 million reads (1 million lines) each, to make further analysis easier. A fastq file is just plain text, like a .txt file.

My plan is to simply count lines and write them out in batches of 1 million. But the input file is gzip-compressed (fastq.gz); do I need to unzip it first?

How can I do this with Python?

I tried the following command:

zless XXX.fastq.gz | split -l 4000000 prefix

(decompress first, then split the file)

However, it doesn't seem to work with a prefix; I tried "-prefix" and it still doesn't work. Also, with the split command the output looks like:

prefix-aa, prefix-ab...

If my prefix is XXX.fastq.gz, then the output will be something like XXX.fastq.gzab, which breaks the .fastq.gz naming.

So what I need is XXX_aa.fastq.gz, XXX_ab.fastq.gz (i.e. a suffix). How can I do that?


As posted here

zcat XXX.fastq.gz | split -l 1000000 --additional-suffix=".fastq" --filter='gzip > $FILE.gz' - "XXX_"


...I need to unzip it first.

No, you don't, at least not by hand. Python's gzip module lets you open the compressed file directly; you can then read out a certain number of bytes and write them to a separate compressed file. See the examples at the bottom of the linked documentation for how to both read and write compressed files.

import gzip

infile = 'XXX.fastq.gz'
num_slices = 20                  # however many slices you expect
slicesize = 64 * 1024 * 1024     # uncompressed bytes per slice

with gzip.open(infile, 'rb') as inp:
    for i in range(num_slices):
        with gzip.open(f'slice_{i:02d}.gz', 'wb') as outp:
            outp.write(inp.read(slicesize))
    else:  # only if you're not sure that you got the whole thing
        with gzip.open('slice_rest.gz', 'wb') as outp:
            outp.write(inp.read())  # whatever is left over

Note that gzip-compressed files are not random-accessible, so you will need to perform the operation in one go unless you want to decompress the source file to disk first.
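
If you want each chunk to end on a read boundary rather than at an arbitrary byte offset, the same one-pass idea can be written line by line. This is only a sketch mirroring the zcat | split pipeline above; the chunk size (4 fastq lines per read, 1 million reads) and the XXX_00.fastq.gz output naming are my assumptions, not part of the original answer:

import gzip
import itertools

chunk_lines = 4_000_000  # assumed: 1 million reads * 4 fastq lines per read

with gzip.open('XXX.fastq.gz', 'rt') as inp:
    for chunk_index in itertools.count():
        first = inp.readline()
        if not first:
            break  # input exhausted
        with gzip.open(f'XXX_{chunk_index:02d}.fastq.gz', 'wt') as outp:
            outp.write(first)
            # copy the rest of this chunk without holding it all in memory
            outp.writelines(itertools.islice(inp, chunk_lines - 1))

Because everything streams through gzip readers and writers, the uncompressed data never has to touch the disk.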


You can read a gzipped file just like an uncompressed file:

>>> import gzip
>>> for line in gzip.open('myfile.txt.gz', 'rt'):
...   process(line)

The process() function would handle the specific line-counting and conditional processing logic that you mentioned.
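
Spelled out with an explicit counter, that logic might look like the following sketch (the 4-lines-per-read chunk size and the XXX_* output names are assumptions, as above):

import gzip

lines_per_chunk = 4_000_000  # assumed: 1 million reads * 4 lines per fastq record

chunk_index = 0
outp = gzip.open(f'XXX_{chunk_index:02d}.fastq.gz', 'wt')
with gzip.open('XXX.fastq.gz', 'rt') as inp:
    for line_number, line in enumerate(inp):
        # start a new compressed chunk every lines_per_chunk lines
        if line_number and line_number % lines_per_chunk == 0:
            outp.close()
            chunk_index += 1
            outp = gzip.open(f'XXX_{chunk_index:02d}.fastq.gz', 'wt')
        outp.write(line)
outp.close()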
