Split large .gz files with prefixes
Each of my fastq files contains about 20 million reads (or 20 million lines). Now I need to split the big fastq files into chunks of only 1 million reads (or 1 million lines) each, for ease of further analysis. A fastq file is just plain text, like a .txt file.
My thought is to count the lines and write out a new chunk after every 1 million lines. But the input file is gzip-compressed (fastq.gz); do I need to unzip it first?
How can I do this with Python?
I tried the following command:
zless XXX.fastq.gz |split -l 4000000 prefix
(decompress first, then split the file)
However, it doesn't seem to work with a prefix. I also tried "-prefix", and it still doesn't work. Also, with the split command the output looks like:
prefix-aa, prefix-ab...
If my prefix is XXX.fastq.gz, then the output will be XXX.fastq.gzab, which will destroy the .fastq.gz format.
So what I need is XXX_aa.fastq.gz, XXX_ab.fastq.gz (i.e. the aa/ab part placed before the extension). How can I do that?
As posted here
zcat XXX.fastq.gz | split -l 1000000 --additional-suffix=".fastq" --filter='gzip > $FILE.gz' - "XXX_"
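In that command, split names each chunk XXX_aa, XXX_ab, and so on, --additional-suffix appends ".fastq", and the --filter command gzips each chunk into $FILE.gz, so the results are XXX_aa.fastq.gz, XXX_ab.fastq.gz, etc. Since the question asks about Python, here is a minimal sketch of the same naming scheme. The helper name split_style_names is hypothetical, not part of any library, and it assumes the default two-letter suffixes.

```python
from itertools import islice, product
from string import ascii_lowercase


def split_style_names(prefix, extension, width=2):
    """Yield names like prefix + 'aa' + extension, prefix + 'ab' + extension, ...

    Hypothetical helper that imitates split's default two-letter suffixes.
    """
    for letters in product(ascii_lowercase, repeat=width):
        yield prefix + "".join(letters) + extension


# Show the first three names for the prefix used in the command above.
names = list(islice(split_style_names("XXX_", ".fastq.gz"), 3))
print(names)  # ['XXX_aa.fastq.gz', 'XXX_ab.fastq.gz', 'XXX_ac.fastq.gz']
```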
...I need to unzip it first.
No you don't, at least not by hand. Python's gzip module will allow you to open the compressed file, at which point you read out a certain number of (uncompressed) bytes and write them out to a separate compressed file. See the examples at the bottom of the linked documentation to see how to both read and write compressed files.
import gzip

# infile, outprefix and slicesize (uncompressed bytes per slice) are placeholders
with gzip.open(infile, 'rb') as inp:
    index = 0
    while True:
        chunk = inp.read(slicesize)
        if not chunk:  # end of input reached
            break
        with gzip.open('%s%03d.gz' % (outprefix, index), 'wb') as outp:
            outp.write(chunk)
        index += 1
Note that gzip-compressed files are not random-accessible so you will need to perform the operation in one go unless you want to decompress the source file to disk first.
You can read a gzipped file just like an uncompressed file:
>>> import gzip
>>> for line in gzip.open('myfile.txt.gz', 'r'):
...     process(line)
The process() function would handle the specific line-counting and conditional processing logic that you mentioned.
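One way the line counting could look, as a rough sketch rather than a definitive implementation: stream the compressed input line by line and start a new gzipped output file every lines_per_chunk lines. The function name split_gz_by_lines, the numeric "_000" naming, and the default of 4,000,000 lines are assumptions for illustration. The default assumes the usual four-line FASTQ records, so that 1 million reads is 4 million lines and no record is cut in half.

```python
import gzip


def split_gz_by_lines(infile, prefix, lines_per_chunk=4_000_000):
    """Split a gzipped text file into gzipped chunks of lines_per_chunk lines.

    Hypothetical helper: chunks are named <prefix>_000.fastq.gz,
    <prefix>_001.fastq.gz, ... and the list of names is returned.
    """
    written = []
    outp = None
    try:
        with gzip.open(infile, "rb") as inp:
            for lineno, line in enumerate(inp):
                # Every lines_per_chunk lines, close the current chunk and open the next.
                if lineno % lines_per_chunk == 0:
                    if outp is not None:
                        outp.close()
                    name = "%s_%03d.fastq.gz" % (prefix, len(written))
                    outp = gzip.open(name, "wb")
                    written.append(name)
                outp.write(line)
    finally:
        # Make sure the last chunk is flushed and closed, even on error.
        if outp is not None:
            outp.close()
    return written
```

Calling split_gz_by_lines('XXX.fastq.gz', 'XXX') would then write XXX_000.fastq.gz, XXX_001.fastq.gz, and so on. Only one line is held in memory at a time, and nothing is decompressed to disk.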