Download, extract and read a gzip file in Python
I'd like to download, extract and iterate over a text file in Python without having to create temporary files.
Basically, this pipe, but in Python:
curl ftp://ftp.theseed.org/genomes/SEED/SEED.fasta.gz | gunzip | processing step
Here's my code:
def main():
    import urllib
    import gzip
    # Download SEED database
    print 'Downloading SEED Database'
    handle = urllib.urlopen('ftp://ftp.theseed.org/genomes/SEED/SEED.fasta.gz')
    with open('SEED.fasta.gz', 'wb') as out:
        while True:
            data = handle.read(1024)
            if len(data) == 0: break
            out.write(data)
    # Extract SEED database
    handle = gzip.open('SEED.fasta.gz')
    with open('SEED.fasta', 'w') as out:
        for line in handle:
            out.write(line)
    # Filter SEED database
    pass
I don't want to use subprocess.Popen() or anything like it, because I want this script to be platform-independent.
The problem is that the Gzip library only accepts filenames as arguments and not handles. The reason for "piping" is that the download step only uses up ~5% CPU and it would be faster to run the extraction and processing at the same time.
EDIT: This won't work because
"Because of the way gzip compression works, GzipFile needs to save its position and move forwards and backwards through the compressed file. This doesn't work when the "file" is a stream of bytes coming from a remote server; all you can do with it is retrieve bytes one at a time, not move back and forth through the data stream." - Dive Into Python
Which is why I get the error
AttributeError: addinfourl instance has no attribute 'tell'
So how does curl url | gunzip | whatever work?
Just use gzip.GzipFile(fileobj=handle) and you'll be on your way -- in other words, it's not really true that "the Gzip library only accepts filenames as arguments and not handles"; you just have to use the fileobj= named argument.
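To make this concrete, here is a minimal Python 3 sketch of the same idea. The in-memory `io.BytesIO` stream is a stand-in for the network response; any file-like object with a `read()` method works as `fileobj=`:

```python
import gzip
import io

# A gzip-compressed payload standing in for the bytes coming off the network.
payload = gzip.compress(b">seq1\nACGT\n>seq2\nTTGA\n")
handle = io.BytesIO(payload)

# Wrap the stream directly (no temporary file) and iterate over
# decompressed lines as they are produced.
with gzip.GzipFile(fileobj=handle) as uncompressed:
    lines = [line.decode().rstrip("\n") for line in uncompressed]

print(lines)  # the four FASTA lines, in order
```

This is exactly the shape the question asked for: download stream in, decompressed lines out, nothing written to disk in between.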
I found this question while searching for methods to download and unzip a gzip file from a URL, but I didn't manage to make the accepted answer work in Python 2.7. Here's what worked for me (adapted from here):
import urllib2
import gzip
import StringIO

def download(url):
    # Download SEED database
    out_file_path = url.split("/")[-1][:-3]
    print('Downloading SEED Database from: {}'.format(url))
    response = urllib2.urlopen(url)
    compressed_file = StringIO.StringIO(response.read())
    decompressed_file = gzip.GzipFile(fileobj=compressed_file)
    # Extract SEED database
    with open(out_file_path, 'w') as outfile:
        outfile.write(decompressed_file.read())
    # Filter SEED database
    # ...
    return

if __name__ == "__main__":
    download("ftp://ftp.ebi.ac.uk/pub/databases/Rfam/12.0/fasta_files/RF00001.fa.gz")
I changed the target URL since the original one was dead: I just looked for a gzip file served from an FTP server, as in the original question.
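For anyone porting this answer to Python 3: `urllib2` becomes `urllib.request`, and `StringIO.StringIO` becomes `io.BytesIO`, since gzip data is bytes rather than text. A minimal sketch of the same buffer-then-decompress approach, with an in-memory gzip payload standing in for `response.read()`:

```python
import gzip
import io

# Stand-in for response.read(); in the real script this would be
# urllib.request.urlopen(url).read().
raw = gzip.compress(b"sample SEED records\n")

compressed_file = io.BytesIO(raw)          # was StringIO.StringIO(...)
decompressed_file = gzip.GzipFile(fileobj=compressed_file)
content = decompressed_file.read()

print(content.decode())
```

Note that this approach reads the entire compressed file into memory before decompressing, so it is best suited to files that comfortably fit in RAM.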
A Python 3 solution which does not require a for loop and writes the bytes object directly as a binary stream:
import gzip
import urllib.request

def download_file(url):
    out_file = '/path/to/file'
    # Download archive
    try:
        # Read the file inside the .gz archive located at url
        with urllib.request.urlopen(url) as response:
            with gzip.GzipFile(fileobj=response) as uncompressed:
                file_content = uncompressed.read()
        # Write to file in binary mode 'wb'
        with open(out_file, 'wb') as f:
            f.write(file_content)
        return 0
    except Exception as e:
        print(e)
        return 1
Call the function with retval = download_file(url) to capture the return code.
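The function above still holds the whole decompressed file in memory (`uncompressed.read()`). For large archives, `shutil.copyfileobj` can stream the decompressed bytes to the output in fixed-size chunks instead. A sketch, with `io.BytesIO` objects standing in for the HTTP response and the output file:

```python
import gzip
import io
import shutil

# Stand-in for urllib.request.urlopen(url); any readable binary stream works.
response = io.BytesIO(gzip.compress(b"x" * 100000))

# Stand-in for open(out_file, 'wb').
out = io.BytesIO()

with gzip.GzipFile(fileobj=response) as uncompressed:
    # Copies in fixed-size chunks, so memory use stays bounded
    # regardless of the decompressed size.
    shutil.copyfileobj(uncompressed, out)

print(len(out.getvalue()))  # 100000
```

In the real script you would replace the two `BytesIO` stand-ins with the `urlopen` response and an opened output file.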
For Python 3.8, here is my code, written on 08/05/2020:
import re
from urllib import request
import gzip
import shutil

url1 = "https://www.destinationlighting.com/feed/sitemap_items1.xml.gz"
file_name1 = re.split(pattern='/', string=url1)[-1]
r1 = request.urlretrieve(url=url1, filename=file_name1)
txt1 = re.split(pattern=r'\.', string=file_name1)[0] + ".txt"
with gzip.open(file_name1, 'rb') as f_in:
    with open(txt1, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
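If the payload is small and you want to skip the intermediate .gz file on disk entirely, `gzip.decompress` (available since Python 3.2) turns the compressed bytes into plain bytes in one call. A sketch, with a local byte string standing in for the bytes fetched from the URL:

```python
import gzip

# Stand-in for the compressed bytes fetched from the URL,
# e.g. urllib.request.urlopen(url1).read().
compressed = gzip.compress(b"<urlset>...</urlset>\n")

# One call, no temporary .gz file on disk.
data = gzip.decompress(compressed)

print(data.decode())
```

Unlike the `copyfileobj` approach above, this holds both the compressed and decompressed bytes in memory at once, so it trades memory for simplicity.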