
What can cause a large discrepancy between minor GC time and total pause time?

We have a latency-sensitive application, and are experiencing some GC-related pauses we don't fully understand. We occasionally have a minor GC that results in application pause times that are much longer than the reported GC time itself. Here is an example log snippet:

485377.257: [GC 485378.857: [ParNew: 105845K->621K(118016K), 0.0028070 secs] 136492K->31374K(1035520K), 0.0028720 secs] [Times: user=0.01 sys=0.00, real=1.61 secs]

Total time for which application threads were stopped: 1.6032830 seconds

The total pause time here is orders of magnitude longer than the reported GC time. These are isolated and occasional events: the immediately preceding and succeeding minor GC events do not show this large discrepancy.

The process is running on a dedicated machine with lots of free memory, 8 cores, running Red Hat Enterprise Linux ES Release 4 Update 8 with kernel 2.6.9-89.0.1EL-smp. We have observed this with (32-bit) JVM versions 1.6.0_13 and 1.6.0_18.

We are running with these flags:

-server -ea -Xms512m -Xmx512m -XX:+UseConcMarkSweepGC -XX:NewSize=128m -XX:MaxNewSize=128m -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCApplicationStoppedTime -XX:-TraceClassUnloading

Can anybody offer some explanation as to what might be going on here, and/or some avenues for further investigation?


You're positive you're not swapping? Typically seeing:

Times: user=0.01 sys=0.00, real=1.61 secs

(from your trace)

suggests that something has happened in the process that doesn't take CPU but does take wall-clock time... and that's usually swap or other I/O. A bit of iostat might help shed light.

Are you using a lot of native memory outside the Java heap (possibly via DirectByteBuffer, NIO, etc.)? That may be eating into your "lots of free memory" (much to your surprise). 'top' or vmstat might also show this.
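As a quick, hypothetical illustration of the native-memory point (the class name, buffer count, and sizes here are all invented): each direct buffer below is allocated outside the Java heap, so -Xmx512m does not cap it, and the process's resident size can grow far beyond what the GC log suggests:

    import java.nio.ByteBuffer;

    // Hypothetical sketch: direct buffers live in native memory, not the
    // Java heap, so they don't appear in the GC log's heap numbers but do
    // consume the machine's "free memory" (visible as RSS growth in top/vmstat).
    public class NativeMemoryDemo {
        public static void main(String[] args) throws Exception {
            ByteBuffer[] buffers = new ByteBuffer[64];
            for (int i = 0; i < buffers.length; i++) {
                buffers[i] = ByteBuffer.allocateDirect(16 * 1024 * 1024); // 16 MB each, off-heap
            }
            System.out.println("Holding ~1 GB of native memory outside -Xmx");
            Thread.sleep(60000); // keep the process alive so top/vmstat can be inspected
        }
    }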


"Time-to-safepoint" is a wide cause for this sort of thing. Unfortunately, GC logs only the time from when it started doing work (after ALL application thread have been paused at a safepoint) to when if finished (after which the threads will be released from their safepoints). -XX:+PrintGCApplicationStoppedTime (much more correctly) reports the time from telling the first thread to go to a safepoint to the time the last thread was released to run again.

It is unfortunately common to see one thread take a long time to come to a safepoint; when that happens, all the other nice and polite threads that went to a safepoint and paused there when told will be waiting until the straggler comes in. Examples of such things are long runtime operations: e.g., cloning an object array is done with no internal safepoint opportunities in most JVMs (imagine cloning a 1GB array and happening to need a GC pause in the middle). Optimized counted loops in your own code can also end up running for a very long time with no internal safepoints.
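To make the straggler scenario concrete, here is a minimal, hypothetical Java sketch (class name and array size invented for illustration). On HotSpot JVMs of this vintage, the JIT may compile an int-counted loop like this with no safepoint poll inside it, so a pause requested mid-loop waits for the whole loop to finish:

    // Hypothetical illustration: a long-running counted loop with no
    // internal safepoint poll. A GC that begins while this loop is running
    // must wait for it to complete; that wait shows up in the
    // PrintGCApplicationStoppedTime total but not in the GC time itself.
    public class SafepointStraggler {
        public static void main(String[] args) {
            long[] data = new long[100 * 1000 * 1000]; // large array, size made up
            long sum = 0;
            for (int i = 0; i < data.length; i++) { // int-counted loop: poll may be elided
                sum += data[i];
            }
            System.out.println(sum);
        }
    }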

[Zing has a built-in time-to-safepoint profiler, partly to track and beat down this sort of thing].


You say there's "lots of free memory" but your heap size is capped at 512MB. You might be running out of memory more often/earlier than you think.

