Interpret differences in prstat vs. 'prstat -m' on Solaris
I've been using prstat and prstat -m a lot to investigate performance issues lately, and I think I've basically understood the differences of sampling vs. microstate accounting in Solaris 10. So I don't expect both to always show the exactly same number.
Today I came across an occasion where the 2 showed so vastly different outputs, that I have problems interpreting them and making sense of the output. The machine is a heavily loaded 8-CPU Solaris 10, with several large WebSphere processes and an Oracle database. The system practically came to a halt today for about 15 minutes (load averages of >700). I had difficulties to get any prstat information, but was able to get some outputs from "prtstat 1 1" and "prtstat -m 1 1", issued shortly one after another.
The top lines of the outputs:
prstat 1 1:
PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP 8379 was 3208M 2773M cpu5 60 0 5:29:13 19% java/145 7123 was 3159M 2756M run 59 0 5:26:45 7.7% java/109 5855 app1 1132M 26M cpu2 60 0 0:01:01 7.7% java/18 16503 monitor 494M 286M run 59 19 1:01:08 7.1% java/106 7112 oracle 15G 15G run 59 0 0:00:10 4.5% oracle/1 7124 oracle 15G 15G cpu3 60 0 0:00:10 4.5% oracle/1 7087 app1 15G 15G run 58 0 0:00:09 4.0% oracle/1 7155 oracle 96M 6336K cpu1 60 0 0:00:07 3.6% oracle/1 ... Total: 495 processes, 4581 lwps, load averages: 74.79, 35.35, 23.8
prstat -m 1 1 (some seconds later)
PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/NLWP 7087 app1 0.1 56 0.0 0.2 0.4 0.0 13 30 96 2 33 0 oracle/1 7153 oracle 0.1 53 0.0 3.2 1.1 0.0 1.0 42 82 0 14 0 oracle/1 7124 oracle 0.1 47 0.0 0.2 0.2 0.0 0.0 52 77 2 16 0 oracle/1 7112 oracle 0.1 47 0.0 0.4 0.1 0.0 0.0 52 79 1 16 0 oracle/1 7259 oracle 0.1 45 9.4 0.0 0.3 0.0 0.1 45 71 2 32 0 oracle/1 7155 oracle 0.0 42 11 0.0 0.5 0.0 0.1 46 90 1 9 0 oracle/1 7261 oracle 0.0 37 9.5 0.0 0.3 0.0 0.0 53 61 1 17 0 oracle/1 7284 oracle 0.0 32 5.9 0.0 0.2 0.0 0.1 62 53 1 21 0 oracle/1 ... Total: 497 processes, 4576 lwps, load averages: 88.86, 39.93, 25.51
I have a very hard time interpreting the output. prstat seems to tell me that a fair amount of Java processing is g开发者_如何学Coing on, together with some Oracle stuff, just as I would expect in normal situation. prtstat -m shows a machine totally dominated by Oracle, consuming huge amounts of system time, and the overall CPU being heavily overloaded (large numbers in LAT).
I'm inclined to believe the output of prstat -m, because that matches much mor closely to what the system felt like during this time. Totally sluggish, almost no more user request processing going on from WebSphere, etc. But why would prstat show so heavily differing numbers?
Any explanation of this would be welcome!
CU, Joe
There's a known problem with prstat -m
on Solaris in the way cpu usage figures are calculated - the value you see has been averaged over all threads (LWPs) in a process, and hence is far far too low for heavily multithreaded processes - such as your Java app servers, which can have hundreds of threads each (see your NLWP). Less than a dozen of them are probably CPU hogs hence the CPU usage by java looks "low". You'd need to call it with prstat -Lm
to get the per-LWP (thread) breakdown to see that effect. Reference:
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6780169
Without further performance monitoring data it's hard to give non-speculative explanations of what you've seen there. I assume lock contention within java. One particular workload that can cause this is heavily multithreaded memory mapped I/O, they'll all pile up on the process address space lock. But it could be a purely java userside lock of course. A plockstat
on one of the java processes, and/or simple dtrace profiling would be helpful.
精彩评论