Django Python Garbage Collection woes
After 2 days of debugging, I nailed down my time-hog: the Python garbage collector.
My application holds a lot of objects in memory. And it works well. The GC does the usual rounds (I have not played with the default thresholds of (700, 10, 10)). Once in a while, in the middle of an important transaction, the 2nd generation sweep kicks in and reviews my ~1.5M generation 2 objects. This takes 2 seconds! The nominal transaction takes less than 0.1 seconds. My question is: what should I do?
I can turn off generation 2 sweeps (by setting a very high threshold - is this the right way?) and the GC is obedient. When should I turn them on? We implemented a web service using Django, and each user request takes about 0.1 seconds. Optimally, I would run these GC gen 2 cycles between user API requests. But how do I do that? My view ends with return HttpResponse(), AFTER which I would like to run a gen 2 GC sweep.
How do I do that? Does this approach even make sense?
Can I mark the objects that NEVER need to be garbage collected so the GC will not test them every 2nd gen cycle?
How can I configure the GC to run full sweeps when the Django server is relatively idle?
Python 2.6.6 on multiple platforms (Windows / Linux).
We did something like this for gunicorn. Depending on which WSGI server you use, you need to find the right hooks that run AFTER the response, not before. Django has a request_finished signal, but that signal still fires before the response is sent.
For gunicorn, you need to define two functions in the config file, like so:
import gc

def pre_request(worker, req):
    # disable gc until end of request
    gc.disable()

def post_request(worker, req, environ, resp):
    # enable gc after a request
    gc.enable()
The post_request hook here runs after the HTTP response has been delivered, so it is a very good time for garbage collection.
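Note that gc.enable() only allows automatic collections again; it does not run one by itself. A possible variant of the hooks above, sketched here as an assumption rather than what the answer's author deployed, also forces a full collection once the response is out:

```python
import gc

def pre_request(worker, req):
    # Suspend automatic collections while the request is being handled.
    gc.disable()

def post_request(worker, req, environ, resp):
    # The response has been delivered: allow automatic collections again
    # and run a full (all generations) collection right away.
    gc.enable()
    gc.collect()
```

The worker still cannot accept a new request while gc.collect() runs, so this moves the pause out of the response time rather than removing it.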
I believe one option would be to completely disable garbage collection and then manually collect at the end of a request as suggested here: How does the Garbage Collection mechanism work?
I imagine that you could disable the GC in your settings.py file.
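A minimal sketch of that idea, outside Django: automatic collection is switched off and a manual gc.collect() reclaims a reference cycle afterwards. The Node class and handle_request function are hypothetical stand-ins for real request-handling code.

```python
import gc

class Node(object):
    """Hypothetical object that references itself, forming a cycle."""
    def __init__(self):
        self.me = self

def handle_request():
    # Stand-in for a view: leaves one piece of cyclic garbage behind.
    Node()
    return "ok"

gc.disable()   # no automatic collections from here on
gc.collect()   # start from a clean slate so the count below is meaningful

result = handle_request()

# Manual full collection at the "end of the request".
# gc.collect() returns the number of unreachable objects it found.
unreachable = gc.collect()
```

Objects that are not part of a cycle are still freed immediately by reference counting, so disabling the collector only postpones the cleanup of cycles.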
If you want to run garbage collection on every request, I would suggest writing some middleware that does it in the process_response method:
import gc

class GCMiddleware(object):
    def process_response(self, request, response):
        gc.collect()
        return response
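Whether a full collection on every request is affordable depends on how long it takes on your heap. One way to measure it, sketched here with gc.callbacks (available from Python 3.3, so not on the asker's 2.6), is to record the duration of each collection; the names gc_pauses and record_gc_pause are made up for this example.

```python
import gc
import time

gc_pauses = []   # list of (generation, seconds) tuples
_started = {}    # start time of the collection currently in progress

def record_gc_pause(phase, info):
    # The collector calls this with phase "start" before a collection
    # and "stop" after it; info["generation"] is the generation collected.
    if phase == "start":
        _started["t"] = time.perf_counter()
    elif phase == "stop" and "t" in _started:
        elapsed = time.perf_counter() - _started.pop("t")
        gc_pauses.append((info["generation"], elapsed))

gc.callbacks.append(record_gc_pause)
gc.collect()     # a full collection, as the middleware would trigger
gc.callbacks.remove(record_gc_pause)
```

In a real service the callback would stay registered and the pauses would be logged, which shows how often full sweeps happen and how long they take.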
An alternative might be to disable GC altogether, and configure mod_wsgi (or whatever you're using) to kill and restart processes more frequently.
My view ends with return HttpResponse(), AFTER which I would like to run a gen 2 GC sweep.
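import gc
from django.http import HttpResponse

def my_view(request):  # hypothetical view
    gc.disable()  # turn off GC
    # do stuff
    resp = HttpResponse()
    gc.enable()  # turn on GC
    return resp
I'm not sure, but instead of calling gc.enable() right before the return, you might be able to spawn a thread that turns the GC back on after 0.1 sec.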
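The delayed re-enable could be sketched with threading.Timer, as below. This is only an illustration of the idea: handle_request is a hypothetical stand-in for the view, the string stands in for HttpResponse(), and 0.1 s is the figure mentioned above, not a tuned value.

```python
import gc
import threading

def handle_request():
    """Hypothetical stand-in for a Django view body."""
    gc.disable()                   # no automatic collections while we work
    response = "response body"     # placeholder for HttpResponse()
    # Turn the collector back on shortly after the view has returned.
    timer = threading.Timer(0.1, gc.enable)
    timer.daemon = True
    timer.start()
    return response, timer

response, timer = handle_request()
enabled_right_after_return = gc.isenabled()
timer.join()                       # only needed for this demo
enabled_after_timer = gc.isenabled()
```

gc.enable() only permits automatic collections again and does not force one. On standard CPython a collection also holds the GIL while it runs, so a sweep that happens to start during the next request would still delay it.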
To make sure that GC doesn't happen until after the request is processed, if the thread spawning doesn't work, you would need to modify Django itself or use some sort of Django hook, as dcurtis suggested.
If you're dealing with performance-critical code, you might also want to consider using a manual memory management language like C/C++ for that part, and using Python simply to invoke/query it.
Building on the approach from @milkypostman, you can use gevent. You want one call to garbage collection per request, but the problem with the @milkypostman suggestion is that the call to gc.collect() will still block the returning of the response. Gevent lets us return immediately and have the GC run proceed after the view has returned.
First, in your wsgi file, be sure to monkey patch everything with gevent and disable garbage collection. You can call gc.disable(), but some libraries have context managers that turn it back on after disabling it (messagepack for instance), so the 0 threshold is more sticky.
import gc
from gevent import monkey
# Disable garbage collection runs
gc.set_threshold(0)
# Apply gevent monkey magic
monkey.patch_all()
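To illustrate why the zero threshold is the stickier choice, here is a small sketch that simulates a library toggling the collector and then counts automatic collections through gc.callbacks (Python 3.3+). The names automatic_runs and note_collection are made up for the example.

```python
import gc

original_threshold = gc.get_threshold()
automatic_runs = []   # generations of any collections that start

def note_collection(phase, info):
    if phase == "start":
        automatic_runs.append(info["generation"])

gc.set_threshold(0)   # threshold0 == 0 turns automatic collection off

# Simulate a library that disables the GC and re-enables it afterwards.
gc.disable()
gc.enable()

gc.callbacks.append(note_collection)
# Allocate many container objects; with the default thresholds this
# would trigger automatic generation 0 collections.
junk = [[] for _ in range(100000)]
gc.callbacks.remove(note_collection)

enabled_flag = gc.isenabled()       # the library left the GC "enabled"
threshold0 = gc.get_threshold()[0]  # but the threshold is still zero

gc.set_threshold(*original_threshold)  # restore the previous settings
```

The collector reports itself as enabled, yet no automatic run is recorded, because the zero threshold is independent of the enable/disable flag. Explicit gc.collect() calls still work as usual.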
Then create some middleware for Django like this:
from gc import collect
import gevent

class BaseMiddleware:
    def __init__(self, get_response):
        self.get_response = get_response

class GcCollectMiddleware(BaseMiddleware):
    """Middleware which performs a non-blocking gc.collect()"""
    def __call__(self, request):
        response = self.get_response(request)
        gevent.spawn(collect)
        return response
You'll see the main difference here vs the previously suggested approach is that gc.collect() is wrapped in gevent.spawn, which will not block returning the HttpResponse, and your users will get a snappier response!