NGINX + PHP5-FPM segfaults under high load
I have been dealing with this problem all day and it is driving me insane. All Google results and searches here lead to dead ends. I hope someone can work with me to provide a solution for myself and future victims. Here we go.
I am running a very popular website with over 3M page views a day. On average that is 34 page views per second, but more realistically, during peak hours, it gets to over 300 page views per second. Think of these as requests.
I am running an Ubuntu 10.04 64-bit server with 2 E5620 CPUs, 12GB RAM, and a Micron P300 6Gb/s SSD. During peak hours the CPU and memory load is moderate (20-30% CPU, and about half of the memory is used).
The software that powers this site is: NGINX, MySQL, PHP5-FPM, PHP-APC, and Memcached. OK, now finally the meat of the post: here are my error logs. There are a bunch of these errors logged.
/var/log/php5-fpm
Jul 20 14:49:47.289895 [NOTICE] fpm is running, pid 29373
Jul 20 14:49:47.337092 [NOTICE] ready to handle connections
Jul 20 14:51:23.957504 [ERROR] [pool www] unable to retrieve process activity of one or more child(ren). Will try again later.
Jul 20 14:51:41.846439 [WARNING] [pool www] child 29534 exited with code 1 after 114.518174 seconds from start
Jul 20 14:51:41.846797 [NOTICE] [pool www] child 29597 started
Jul 20 14:51:41.896653 [WARNING] [pool www] child 29408 exited on signal 11 SIGSEGV after 114.596706 seconds from start
Jul 20 14:51:41.897178 [NOTICE] [pool www] child 29598 started
Jul 20 14:51:41.903286 [WARNING] [pool www] child 29398 exited with code 1 after 114.605761 seconds from start
Jul 20 14:51:41.903719 [NOTICE] [pool www] child 29600 started
Jul 20 14:51:41.907816 [WARNING] [pool www] child 29437 exited with code 1 after 114.601417 seconds from start
Jul 20 14:51:41.908253 [NOTICE] [pool www] child 29601 started
Jul 20 14:51:41.916002 [WARNING] [pool www] child 29513 exited with code 1 after 114.592514 seconds from start
Jul 20 14:51:41.916501 [NOTICE] [pool www] child 29602 started
Jul 20 14:51:41.916558 [WARNING] [pool www] child 29494 exited on signal 11 SIGSEGV after 114.597355 seconds from start
Jul 20 14:51:41.916873 [NOTICE] [pool www] child 29603 started
Jul 20 14:51:41.921389 [WARNING] [pool www] child 29502 exited with code 1 after 114.600405 seconds from start
/var/log/nginx/error.log
2011/07/20 15:48:42 [error] 29583#0: *569743 readv() failed (104: Connection reset by peer) while reading upstream, client: 77.223.197.193, server: domain.com, request: "GET /favicon.ico HTTP/1.1", upstream: "fastcgi://127.0.0.1:9000", host: "www.domain.com"
2011/07/20 15:48:42 [error] 29578#0: *571695 readv() failed (104: Connection reset by peer) while reading upstream, client: 150.70.64.196, server: domain.com, request: "GET /page HTTP/1.0", upstream: "fastcgi://127.0.0.1:9000", host: "www.domain.com"
2011/07/20 15:48:42 [error] 29581#0: *571050 readv() failed (104: Connection reset by peer) while reading upstream, client: 110.136.157.66, server: domain.com, request: "GET /page HTTP/1.1", upstream: "fastcgi://127.0.0.1:9000", host: "www.domain.com"
2011/07/20 15:48:42 [error] 29581#0: *564892 readv() failed (104: Connection reset by peer) while reading upstream, client: 110.136.161.214, server: domain.com, request: "GET /page HTTP/1.1", upstream: "fastcgi://127.0.0.1:9000", host: "www.domain.com"
2011/07/20 15:48:42 [error] 29585#0: *456171 readv() failed (104: Connection reset by peer) while reading upstream, client: 93.223.33.135, server: domain.com, request: "GET /favicon.ico HTTP/1.1", upstream: "fastcgi://127.0.0.1:9000", host: "www.domain.com"
2011/07/20 15:48:42 [error] 29585#0: *471192 readv() failed (104: Connection reset by peer) while reading upstream, client: 74.90.33.142, server: domain.com, request: "GET /page HTTP/1.1", upstream: "fastcgi://127.0.0.1:9000", host: "www.domain.com"
2011/07/20 15:48:42 [error] 29580#0: *570132 readv() failed (104: Connection reset by peer) while reading upstream, client: 180.246.182.191, server: domain.com, request: "GET /page HTTP/1.1", upstream: "fastcgi://127.0.0.1:9000", host: "www.domain.com"
Finally, I want to point out that I did try disabling PHP-APC to see if it was a bug in the opcode cache, but the segfaults persisted. I also have PHP5-SUHOSIN installed and I disabled it too, but the errors keep happening.
This issue just happened to me.
PHP5-FPM was segfaulting in most of its children. In my case, we had 0 bytes available on the hard disk. A quick purge of old logs stopped the segfaults.
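Since a full disk was the cause here, it may be worth ruling that out first. Below is a minimal Python sketch that checks free space; the function name, the 100 MB threshold, and the list of paths are my own assumptions for illustration, not anything specific to PHP-FPM.

```python
import os
import shutil


def disk_is_nearly_full(path="/", min_free_bytes=100 * 1024 * 1024):
    """Return True when the filesystem holding `path` has less free
    space than `min_free_bytes` (threshold is an arbitrary assumption)."""
    usage = shutil.disk_usage(path)
    return usage.free < min_free_bytes


if __name__ == "__main__":
    # Paths are examples only; check whichever mounts hold your logs,
    # sessions, and temp files.
    for mount in ("/", "/var/log", "/tmp"):
        if not os.path.exists(mount):
            continue
        free_mb = shutil.disk_usage(mount).free / (1024 * 1024)
        status = "LOW" if disk_is_nearly_full(mount) else "ok"
        print(f"{mount}: {free_mb:.0f} MB free ({status})")
```

Running `df -h` by hand gives the same information; the script is only useful if you want to wire the check into monitoring.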
2011/07/20 15:48:42 [error] 29583#0: *569743 readv() failed (104: Connection reset by peer) while reading upstream, client: 77.223.197.193, server: domain.com, request: "GET /favicon.ico HTTP/1.1", upstream: "fastcgi://127.0.0.1:9000", host: "www.domain.com"
That's probably just a problem with your upstream configuration, or a reset coming from the upstream server, a router, or the client. Or nginx dropped the request. Running a site at three times the load you describe, I have never seen that message. The requested resource should not even need to go to a PHP-FPM process; it's a favicon.
As for the PHP-FPM messages, the children all seem to stop after about 114 seconds. Is that a limit set in your php.ini file? Segfaults in PHP often occur under high memory use: your PHP scripts could be leaking memory and eventually hit the memory limit. Having each PHP-FPM process serve fewer requests before it is recycled (the max requests setting, pm.max_requests in current pool configs) helps in dealing with memory leaks.
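To see whether the children really die at a common lifetime, and how many exit on a signal versus an exit code, you can tally the log. This is a rough Python sketch that assumes the log lines look like the ones in the question; the function name and regular expression are mine, and other PHP-FPM versions may word these messages differently.

```python
import re
from collections import Counter

# Matches lines such as:
#   ... child 29534 exited with code 1 after 114.518174 seconds from start
#   ... child 29408 exited on signal 11 SIGSEGV after 114.596706 seconds from start
EXIT_RE = re.compile(
    r"child (?P<pid>\d+) exited "
    r"(?:with code (?P<code>\d+)|on signal (?P<signal>\d+).*?)"
    r" after (?P<seconds>[\d.]+) seconds from start"
)


def summarize_exits(lines):
    """Count child exits by reason and collect each child's lifetime in seconds."""
    reasons = Counter()
    lifetimes = []
    for line in lines:
        match = EXIT_RE.search(line)
        if match is None:
            continue  # not a child-exit line
        if match.group("signal"):
            reasons["signal " + match.group("signal")] += 1
        else:
            reasons["code " + match.group("code")] += 1
        lifetimes.append(float(match.group("seconds")))
    return reasons, lifetimes


if __name__ == "__main__":
    import sys

    # Usage: python summarize_exits.py < /path/to/php-fpm.log
    reasons, lifetimes = summarize_exits(sys.stdin)
    for reason, count in reasons.most_common():
        print(f"{reason}: {count}")
    if lifetimes:
        print(f"lifetime min/max: {min(lifetimes):.1f}s / {max(lifetimes):.1f}s")
```

If nearly every lifetime clusters around the same value, that points to something killing the whole pool at once (a timeout or a shared resource) rather than to individual scripts leaking memory at different rates.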
See my answer here that's related to your question (about nginx + magento and high load)
NGINX-FPM configuration settings for magento
It's not a direct answer per se, but it may help you configure nginx + PHP-FPM to eliminate the faults.
You are probably using Suhosin. Disable suhosin.ini under /etc/php5/fpm/conf.d and restart the php5-fpm service.
Check the suhosin version and try to install another one.