
Django/Apache freezing with mod_wsgi

I have a Django application running on two load-balanced mod_wsgi/Apache servers sitting behind Nginx (which serves static files and does the reverse proxying/load balancing).

Every few days, my site becomes completely unresponsive. My guess is that a bunch of clients are requesting URLs that are blocking.

Here is my config:

WSGIDaemonProcess web1 user=web1 group=web1 processes=8 threads=15 maximum-requests=500 python-path=/home/web1/django_env/lib/python2.6/site-packages display-name=%{GROUP}
WSGIProcessGroup web1
WSGIScriptAlias / /home/web1/django/wsgi/wsgi_handler.py

I've tried experimenting with a single thread and more processes, and with more threads and a single process. Pretty much everything I try sooner or later results in timed-out page loads.

Any suggestions for what I might try? I'm willing to try other deployment options if that will fix the problem.

Also, is there a better way to monitor mod_wsgi than the Apache status module? I've been hitting:

 curl http://localhost:8080/server-status?auto

And watching the number of busy workers as an indicator for whether I'm about to get into trouble (I assume the more busy workers I have, the more blocking operations are currently taking place).
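
For illustration, a minimal polling sketch along those lines (BusyWorkers comes from the ?auto output; the threshold and interval here are arbitrary and would need tuning):

import time
import urllib2

STATUS_URL = 'http://localhost:8080/server-status?auto'
THRESHOLD = 100   # arbitrary alert level; tune for your processes x threads

def busy_workers():
    # ?auto output is plain "Key: value" lines, e.g. "BusyWorkers: 3"
    for line in urllib2.urlopen(STATUS_URL, timeout=5).read().splitlines():
        if line.startswith('BusyWorkers:'):
            return int(line.split(':', 1)[1])
    return None

while True:
    busy = busy_workers()
    print 'BusyWorkers: %s' % busy
    if busy is not None and busy >= THRESHOLD:
        print 'WARNING: workers are piling up; requests may be blocking'
    time.sleep(30)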

NOTE: Some of these requests are to a REST web service that I host for the application. Would it make sense to rate limit that URL location through Nginx somehow?


Use:

http://code.google.com/p/modwsgi/wiki/DebuggingTechniques#Extracting_Python_Stack_Traces

to embed functionality that you can trigger when you suspect there are stuck requests, so you can find out what they are doing. Likely the requests are accumulating over time rather than all happening at once, so you could dump stacks periodically rather than wait for total failure.
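
A rough sketch of the idea (not the exact code from that page): a daemon thread, started from your WSGI script file, which periodically dumps every thread's stack to the Apache error log. The 300-second interval is just an example value.

import sys
import threading
import time
import traceback

def _dump_stacks(interval=300):
    while True:
        time.sleep(interval)
        # sys._current_frames() maps thread id -> that thread's current frame
        for thread_id, frame in sys._current_frames().items():
            print >> sys.stderr, '# Thread %d' % thread_id
            traceback.print_stack(frame, file=sys.stderr)
        sys.stderr.flush()

_dumper = threading.Thread(target=_dump_stacks)
_dumper.setDaemon(True)   # don't prevent the daemon process from exiting
_dumper.start()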

As a fail-safe, you can add the option:

inactivity-timeout=600

to the WSGIDaemonProcess directive.

What this will do is restart the daemon mode process if it is inactive for 10 minutes.
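
For example, combined with the directive from your question:

WSGIDaemonProcess web1 user=web1 group=web1 processes=8 threads=15 maximum-requests=500 inactivity-timeout=600 python-path=/home/web1/django_env/lib/python2.6/site-packages display-name=%{GROUP}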

Unfortunately, at the moment this restart is triggered in two scenarios.

The first is when there have been no requests at all for 10 minutes; in that case the process is restarted.

The second, and the one you actually want to kick in, is when all request threads are blocked and none of them has read any input from wsgi.input or yielded any response content for 10 minutes; in that case the process is again restarted automatically.

This at least means your processes should recover automatically and you will not be called out of bed. Because you are running so many processes, chances are they will not all get stuck at the same time, so a restart shouldn't be noticed by new requests, as the other processes will still handle them.

What you should work out is how low you can make that timeout. You don't want it so low that processes restart merely because there have been no requests at all, as that unloads the application, and if lazy loading is being used the next request will incur the slow reload.

What I should really do is add a new option, blocked-timeout, which specifically checks for all requests being blocked for the defined period, separating it from restarts due to no requests at all. That would be more flexible, as restarting due to no requests brings its own issues with having to load the application again.

Unfortunately one can't easily implement a request-timeout that applies to a single request, because the hosting configuration could be multithreaded. Injecting a Python exception into a request will not necessarily unblock the thread, and ultimately you would have to kill the process anyway and interrupt the other concurrent requests. Thus a blocked-timeout is probably better.

Another interesting thing might be for me to add something to mod_wsgi that reports such forced restarts due to blocked processes to the New Relic agent. That would be really cool, as you would then get visibility of them in the monitoring tool. :-)


We had a similar problem at my work. The best we could ever figure out was race/deadlock issues in the app causing mod_wsgi to get stuck. Usually killing one or more mod_wsgi processes would un-stick it for a while.

The best solution was to move to all processes, no threads. We confirmed with our dev teams that some of the Python libraries they were pulling in were likely not thread-safe.

Try:

WSGIDaemonProcess web1 user=web1 group=web1 processes=16 threads=1 maximum-requests=500 python-path=/home/web1/django_env/lib/python2.6/site-packages display-name=%{GROUP}

Downside is, processes suck up more memory than threads do. Consequently we usually end up with fewer overall workers (hence 16x1 instead of 8x15). And since mod_wsgi provides virtually nothing for reporting on how busy the workers are, you're SOL apart from just blindly tuning how many you have.

Upside is, this problem never happens anymore and apps are completely reliable again.

Like with PHP, don't use a threaded implementation unless you're sure it's safe... that means the core (usually ok), the framework, your own code, and anything else you import. :)


If I've understood your problem properly, you may try the following options:

  • move URL fetching out of the request/response cycle (using e.g. celery);
  • increase thread count (they can handle such blocks better than processes because they consume less memory);
  • decrease the timeout for urllib2.urlopen (see the sketch below this list);
  • try gevent or eventlet (they will magically solve your problem but can introduce other subtle issues)
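
For the urllib2 timeout point, a minimal sketch, assuming the blocking calls go through urllib2 somewhere in your views (the timeout keyword is available from Python 2.6 onwards):

import socket
import urllib2

def fetch(url, timeout=5):
    # fail fast instead of letting a slow upstream hold a mod_wsgi worker
    try:
        return urllib2.urlopen(url, timeout=timeout).read()
    except (urllib2.URLError, socket.timeout):
        return None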

I don't think this is a deployment issue; it's more of a code issue, and no Apache configuration will solve it.

