
Top reasons why an app server crashes

What are the most likely causes for application server failure?

For example: "out of disk space" is more likely than "2 of the drives in a RAID 4 setup die simultaneously".

My particular environment is Java, so Java-specific answers are welcome, but not required.

EDIT: just to clarify, I'm looking for downtime-related crashes (out of memory is a good example), not just one-time issues (like a temporary network glitch).


If you are trying to keep an application server up, start monitoring it. Nagios, Big Sister, and other network monitoring tools can be very useful.

Watch memory availability and usage, disk availability and usage, CPU availability and usage, etc.
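For a Java server, some of these numbers can be read from inside the JVM itself. The following is only a minimal sketch, not a replacement for a real monitoring tool: the class name `ResourceSnapshot` and its method names are made up for illustration, and it relies only on standard JDK calls (`Runtime`, `File.getUsableSpace`, `OperatingSystemMXBean`).

```java
import java.io.File;
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// Hypothetical helper (not from the original answers): reads the three
// resources mentioned above from inside the running JVM.
final class ResourceSnapshot {

    // Fraction of the maximum heap that is currently in use.
    static double usedHeapFraction() {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory();
        return (double) used / rt.maxMemory();
    }

    // Fraction of the partition containing 'path' that is in use.
    // Returns NaN if the partition size cannot be determined.
    static double usedDiskFraction(File path) {
        long total = path.getTotalSpace();
        if (total == 0L) {
            return Double.NaN;
        }
        return 1.0 - (double) path.getUsableSpace() / total;
    }

    // One-minute system load average divided by the number of cores.
    // Returns NaN where the JVM cannot report a load average
    // (getSystemLoadAverage() returns a negative value in that case).
    static double loadPerCore() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        double load = os.getSystemLoadAverage();
        return load < 0 ? Double.NaN : load / os.getAvailableProcessors();
    }

    // One-line summary suitable for a periodic log entry.
    static String describe(File path) {
        return String.format("heap=%.2f disk=%.2f loadPerCore=%.2f",
                usedHeapFraction(), usedDiskFraction(path), loadPerCore());
    }
}
```

Calling `ResourceSnapshot.describe(new File("."))` from a scheduled task and logging the result gives a rough trend line. An external monitor is still needed, because a JVM that has already died cannot report on itself.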

The most common reason why a server goes down is rarely the same reason twice. Someone "fixes" the last-most-common-reason, and a new-most-common-reason is born.


Edwin is right - you need monitoring to understand what the problem is. Or better - understand what the problem is AND prevent it from causing downtime.

You should not only track resource consumption but also demand. The difference between the two shows you if you have sized your server correctly.

There are a ton of open source tools like Nagios, collectd, etc. that can give you server-specific data, but that's only monitoring, not prevention. Librato Silverline (disclosure: I work there) allows you to monitor individual processes and then throttle the resources they use by placing them in application containers for which you define resource policies. If your server has 8 cores or fewer, you can use it for free.


"Out of Memory" exception due to memory leaks.


All sorts of things can cause a server to crash, ranging from busted hardware (e.g. disk failures) to faulty code: a memory leak resulting in an out-of-memory error, a network failure that was rethrown as a runtime exception and never caught, or, in servers that aren't written in Java, a segfault.


At first, it is usually because of memory leaks, disk space problems, or endless loops eating up the CPU.

Once you monitor those issues and set up correct logging and warning mechanisms, they turn meta on you, and exploding error handling becomes a possible reason for a full lockup. An error (or more likely, two in an unhappy combination) occurs, but when the handler tries to write to the log files or send a warning (by mail or something), it hits another error. It tries to handle that one by writing to the log file or sending a warning, and so on. This continues until one of the resources gives out: it may lead to skyrocketing server load, memory problems, a full disk, or saturated network traffic, which means a remote user can't even get in to correct the problem.
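One common way to keep an error handler from feeding on its own failures is a re-entrancy guard. The following is a minimal sketch under stated assumptions: `GuardedErrorHandler` and its `reporter` callback are hypothetical names, and the "report" step stands in for whatever logging or mail call the real handler makes.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

// Hypothetical wrapper (not from the original answer): reports an error once
// and refuses to report errors raised while a report is already in progress
// on the same thread.
final class GuardedErrorHandler {
    private final Consumer<Throwable> reporter;
    private final ThreadLocal<Boolean> handling = ThreadLocal.withInitial(() -> false);
    private final AtomicInteger dropped = new AtomicInteger();

    GuardedErrorHandler(Consumer<Throwable> reporter) {
        this.reporter = reporter;
    }

    void handle(Throwable error) {
        if (handling.get()) {
            // We are already inside the reporter on this thread: count the
            // nested error and stop, instead of reporting recursively.
            dropped.incrementAndGet();
            return;
        }
        handling.set(true);
        try {
            reporter.accept(error);
        } catch (RuntimeException reportFailure) {
            // Last resort: one line on stderr, with no further reporting attempt.
            System.err.println("error reporting failed: " + reportFailure);
        } finally {
            handling.set(false);
        }
    }

    // Number of nested errors that were suppressed by the guard.
    int droppedCount() {
        return dropped.get();
    }
}
```

The guard only breaks the recursion on a single thread. It does not limit how often independent errors are reported, so a rate limit on warnings (for example, at most one mail per minute) would still be a sensible addition.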
