
S3 Backup Memory Usage in Python

I currently use WebFaction for my hosting with the basic package that gives us 80MB of RAM. This is more than adequate for our needs at the moment, apart from our backups. We do our own backups to S3 once a day.

The backup process is this: dump the database, tar.gz all the files into one backup named with the correct date of the backup, upload to S3 using the python library provided by Amazon.
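For illustration, the dump-and-archive step can be sketched like this (the directory layout and file names below are invented; the point is that `tarfile` writes the archive straight to disk, so this step does not need to hold the whole backup in memory):

```python
import datetime
import os
import tarfile
import tempfile

# Invented stand-ins for the real site directory and database dump.
workdir = tempfile.mkdtemp()
site_dir = os.path.join(workdir, 'site')
os.makedirs(site_dir)
with open(os.path.join(site_dir, 'dump.sql'), 'w') as f:
    f.write('-- stand-in for the real database dump\n')

# Name the archive with the date of the backup.
date = datetime.date.today().isoformat()
filename = os.path.join(workdir, 'backup-%s.tar.gz' % date)

# tarfile streams the archive to disk file-by-file.
with tarfile.open(filename, 'w:gz') as tar:
    tar.add(site_dir, arcname='site')
```

The resulting `filename` is what gets uploaded to S3.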

Unfortunately, it appears (although I don't know this for certain) that either my code for reading the file or the S3 code is loading the entire file into memory. Since the file is approximately 320MB (for today's backup), the process uses about 320MB just for the backup. This causes WebFaction to kill all our processes, meaning the backup doesn't happen and our site goes down.

So this is the question: is there any way to avoid loading the whole file into memory, or are there other Python S3 libraries that are much better with RAM usage? Ideally it needs to use about 60MB at the most! If this can't be done, how can I split the file and upload separate parts?

Thanks for your help.

This is the section of code (in my backup script) that caused the processes to be quit:

filedata = open(filename, 'rb').read()   # reads the entire file into memory
content_type = mimetypes.guess_type(filename)[0]
if not content_type:
    content_type = 'text/plain'
print 'Uploading to S3...'
response = connection.put(BUCKET_NAME, 'daily/%s' % filename, S3.S3Object(filedata), {'x-amz-acl': 'public-read', 'Content-Type': content_type})


It's a little late but I had to solve the same problem so here's my answer.

Short answer: in Python 2.6+, yes! This is because httplib supports file-like objects as of v2.6. So all you need is...

fileobj = open(filename, 'rb')
content_type = mimetypes.guess_type(filename)[0]
if not content_type:
    content_type = 'text/plain'
print 'Uploading to S3...'
response = connection.put(BUCKET_NAME, 'daily/%s' % filename, S3.S3Object(fileobj), {'x-amz-acl': 'public-read', 'Content-Type': content_type})

Long answer...

The S3.py library uses Python's httplib to make its connection.put() HTTP requests. You can see in the source that it just passes the data argument straight through to the httplib connection.

From S3.py...

    def _make_request(self, method, bucket='', key='', query_args={}, headers={}, data='', metadata={}):

        ...

        if (is_secure):
            connection = httplib.HTTPSConnection(host)
        else:
            connection = httplib.HTTPConnection(host)

        final_headers = merge_meta(headers, metadata);
        # add auth header
        self._add_aws_auth_header(final_headers, method, bucket, key, query_args)

        connection.request(method, path, data, final_headers) # <-- IMPORTANT PART
        resp = connection.getresponse()
        if resp.status < 300 or resp.status >= 400:
            return resp
        # handle redirect
        location = resp.getheader('location')
        if not location:
            return resp
        ...

If we take a look at the Python httplib documentation, we can see that...

HTTPConnection.request(method, url[, body[, headers]])

This will send a request to the server using the HTTP request method method and the selector url. If the body argument is present, it should be a string of data to send after the headers are finished. Alternatively, it may be an open file object, in which case the contents of the file is sent; this file object should support fileno() and read() methods. The header Content-Length is automatically set to the correct value. The headers argument should be a mapping of extra HTTP headers to send with the request.

Changed in version 2.6: body can be a file object.
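To see the file-object behaviour in isolation, here is a small self-contained sketch against a throwaway local echo server (an assumption for demo purposes, not part of S3.py). Note it uses Python 3, where httplib was renamed http.client; one difference from the Python 2 docs quoted above is that Python 3 switches to chunked transfer encoding for file bodies unless you supply Content-Length yourself, so the sketch sets it explicitly:

```python
import http.client  # Python 3 name for httplib
import os
import tempfile
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Throwaway local server that replies with the number of body bytes it received.
class EchoSize(BaseHTTPRequestHandler):
    def do_PUT(self):
        n = int(self.headers['Content-Length'])
        self.rfile.read(n)
        self.send_response(200)
        self.end_headers()
        self.wfile.write(str(n).encode())

    def log_message(self, *args):
        pass  # keep the demo quiet

server = HTTPServer(('127.0.0.1', 0), EchoSize)
threading.Thread(target=server.serve_forever, daemon=True).start()

# A real file object (no .read()!) standing in for the backup archive.
body = tempfile.TemporaryFile()
body.write(b'x' * 100000)
body.seek(0)

# httplib/http.client streams the file in small blocks; memory use stays
# flat no matter how big the file is.
size = os.fstat(body.fileno()).st_size
conn = http.client.HTTPConnection('127.0.0.1', server.server_port)
conn.request('PUT', '/daily/backup.tar.gz', body,
             {'Content-Length': str(size)})
resp = conn.getresponse()
received = resp.read().decode()
server.shutdown()
```

The server confirms it received all 100,000 bytes even though the client never held more than one small block of the file in memory.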


Don't read the whole file into your filedata variable. You could use a loop that reads ~60 MB at a time and submits each piece to Amazon separately.

backup = open(filename, 'rb')
while True:
    part_of_file = backup.read(60000000) # not exactly 60 MB....
    if not part_of_file:
        break  # end of file reached
    # upload part_of_file to Amazon as its own object here, e.g.
    # connection.put(BUCKET_NAME, 'daily/%s.partN' % filename, ...)
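Since a plain S3 PUT in this library can't append to an existing object, each chunk would have to become its own key (for example, a hypothetical `daily/backup.tar.gz.part0`, `.part1`, ...) that you concatenate again after downloading. The splitting side can be sketched as a small generator; the toy sizes below are just to make the behaviour easy to check:

```python
import io

def iter_chunks(fileobj, chunk_size):
    """Yield successive pieces of fileobj, holding only one
    piece in memory at a time."""
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        yield chunk

# Toy demonstration: a 10-byte "file" split into 4-byte chunks.
# A real run would pass the backup file object and a ~60 MB chunk size,
# uploading each chunk under its own S3 key.
demo = io.BytesIO(b'0123456789')
parts = list(iter_chunks(demo, 4))
# parts is [b'0123', b'4567', b'89']
```

Joining the parts back together in order reproduces the original file exactly.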
